From: Simon Baatz <gmbnomis@gmail.com>
By default, the Linux TCP implementation does not shrink the
advertised window (RFC 7323 calls this "window retraction") with the
following exceptions:
- When an incoming segment cannot be added due to the receive buffer
running out of memory. Since commit 8c670bdfa58e ("tcp: correct
handling of extreme memory squeeze") a zero window will be
advertised in this case. It turns out that reaching the required
"memory pressure" is very easy when window scaling is in use. In the
simplest case, sending a sufficient number of segments smaller than
the scale factor to a receiver that does not read data is enough.
Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
happens much earlier than before, leading to regressions (the test
suite of the Valkey project does not pass because of a TCP
connection that is no longer bi-directional).
- Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
allowing the tcp window to shrink") addressed the "eating memory"
problem by introducing a sysctl knob that allows shrinking the
window before running out of memory.
However, RFC 7323 does not only state that shrinking the window is
necessary in some cases, it also formulates requirements for TCP
implementations when doing so (Section 2.4).
This commit addresses the receiver-side requirements: After retracting
the window, the peer may have a snd_nxt that lies within a previously
advertised window but is now beyond the retracted window. This means
that all incoming segments (including pure ACKs) will be rejected
until the application happens to read enough data to let the peer's
snd_nxt be in window again (which may be never).
To comply with RFC 7323, the receiver MUST honor any segment that
would have been in window for any ACK sent by the receiver and, when
window scaling is in effect, SHOULD track the maximum window sequence
number it has advertised. This patch tracks that maximum window
sequence number throughout the connection and uses it in
tcp_sequence() when deciding whether a segment is acceptable.
Acceptability of data is not changed.
Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
Documentation/networking/net_cachelines/tcp_sock.rst | 1 +
include/linux/tcp.h | 1 +
include/net/tcp.h | 14 ++++++++++++++
net/ipv4/tcp_fastopen.c | 1 +
net/ipv4/tcp_input.c | 6 ++++--
net/ipv4/tcp_minisocks.c | 1 +
net/ipv4/tcp_output.c | 12 ++++++++++++
.../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt | 2 +-
8 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
u32 rate_delivered read_mostly tcp_rate_gen
u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
+u32 rcv_mwnd_seq read_write tcp_select_window
u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
u32 notsent_lowat read_mostly tcp_stream_memory_free
u32 pushed_seq read_write tcp_mark_push,forced_push
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -271,6 +271,7 @@ struct tcp_sock {
u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
u32 mdev_us; /* medium deviation */
u32 rtt_seq; /* sequence number to update rttvar */
+ u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */
u64 tcp_wstamp_ns; /* departure time for next sent data packet */
u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
return (u32) win;
}
+/* Compute the maximum receive window we ever advertised.
+ * Rcv_nxt can be after the window if our peer push more data
+ * than the offered window.
+ */
+static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
+{
+ s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
+
+ if (win < 0)
+ win = 0;
+ return (u32) win;
+}
+
+
/* Choose a new window, without checks for shrinking, and without
* scaling applied to the result. The caller does these things
* if necessary. This is a "raw" window selection.
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index b30090cff3cf7d925dc46694860abd3ca5516d70..f034ef6e3e7b54bf73c77fd2bf1d3090c75dbfc6 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -377,6 +377,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
tcp_rsk(req)->rcv_nxt = tp->rcv_nxt;
tp->rcv_wup = tp->rcv_nxt;
+ tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
/* tcp_conn_request() is sending the SYNACK,
* and queues the child into listener accept queue.
*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e7b41abb82aad33d8cab4fcfa989cc4771149b41..af9dd51256b01fd31d9e390d69dcb1d1700daf1b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4865,8 +4865,8 @@ static enum skb_drop_reason tcp_sequence(const struct sock *sk,
if (before(end_seq, tp->rcv_wup))
return SKB_DROP_REASON_TCP_OLD_SEQUENCE;
- if (after(end_seq, tp->rcv_nxt + tcp_receive_window(tp))) {
- if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
+ if (after(end_seq, tp->rcv_nxt + tcp_max_receive_window(tp))) {
+ if (after(seq, tp->rcv_nxt + tcp_max_receive_window(tp)))
return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
/* Only accept this packet if receive queue is empty. */
@@ -6959,6 +6959,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
*/
WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
/* RFC1323: The window in SYN & SYN/ACK segments is
* never scaled.
@@ -7071,6 +7072,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
/* RFC1323: The window in SYN & SYN/ACK segments is
* never scaled.
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index ec128865f5c029c971eb00cb9ee058b742efafd1..df95d8b6dce5c746e5e34545aa75a96080cc752d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -604,6 +604,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
newtp->window_clamp = req->rsk_window_clamp;
newtp->rcv_ssthresh = req->rsk_rcv_wnd;
newtp->rcv_wnd = req->rsk_rcv_wnd;
+ newtp->rcv_mwnd_seq = newtp->rcv_wup + req->rsk_rcv_wnd;
newtp->rx_opt.wscale_ok = ireq->wscale_ok;
if (newtp->rx_opt.wscale_ok) {
newtp->rx_opt.snd_wscale = ireq->snd_wscale;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118d02fc396753d56f210f9d3007c7f..50774443f6ae0ca83f360c7fc3239184a1523e1b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -274,6 +274,15 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
}
EXPORT_IPV6_MOD(tcp_select_initial_window);
+/* Check if we need to update the maximum window sequence number */
+static inline void tcp_update_max_wnd_seq(struct tcp_sock *tp)
+{
+ u32 wre = tp->rcv_wup + tp->rcv_wnd;
+
+ if (after(wre, tp->rcv_mwnd_seq))
+ tp->rcv_mwnd_seq = wre;
+}
+
/* Chose a new window to advertise, update state in tcp_sock for the
* socket, and return result with RFC1323 scaling applied. The return
* value can be stuffed directly into th->window for an outgoing
@@ -293,6 +302,7 @@ static u16 tcp_select_window(struct sock *sk)
tp->pred_flags = 0;
tp->rcv_wnd = 0;
tp->rcv_wup = tp->rcv_nxt;
+ tcp_update_max_wnd_seq(tp);
return 0;
}
@@ -316,6 +326,7 @@ static u16 tcp_select_window(struct sock *sk)
tp->rcv_wnd = new_win;
tp->rcv_wup = tp->rcv_nxt;
+ tcp_update_max_wnd_seq(tp);
/* Make sure we do not exceed the maximum possible
* scaled window.
@@ -4169,6 +4180,7 @@ static void tcp_connect_init(struct sock *sk)
else
tp->rcv_tstamp = tcp_jiffies32;
tp->rcv_wup = tp->rcv_nxt;
+ tp->rcv_mwnd_seq = tp->rcv_nxt + tp->rcv_wnd;
WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
inet_csk(sk)->icsk_rto = tcp_timeout_init(sk);
diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
index 3848b419e68c3fc895ad736d06373fc32f3691c1..1a86ee5093696deb316c532ca8f7de2bbf5cd8ea 100644
--- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
@@ -36,7 +36,7 @@
+0 read(4, ..., 100000) = 4000
-// If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
+// If queue is empty, accept a packet even if its end_seq is above rcv_mwnd_seq
+0 < P. 4001:54001(50000) ack 1 win 257
+0 > . 1:1(0) ack 54001 win 0
--
2.52.0
Hi Simon,
It all makes sense to me at a quick look, I have just some nits and one
more substantial worry, below:
On Fri, 20 Feb 2026 00:55:14 +0100
Simon Baatz via B4 Relay <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> From: Simon Baatz <gmbnomis@gmail.com>
>
> By default, the Linux TCP implementation does not shrink the
> advertised window (RFC 7323 calls this "window retraction") with the
> following exceptions:
>
> - When an incoming segment cannot be added due to the receive buffer
> running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> handling of extreme memory squeeze") a zero window will be
> advertised in this case. It turns out that reaching the required
> "memory pressure" is very easy when window scaling is in use. In the
> simplest case, sending a sufficient number of segments smaller than
> the scale factor to a receiver that does not read data is enough.
>
> Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> happens much earlier than before, leading to regressions (the test
> suite of the Valkey project does not pass because of a TCP
> connection that is no longer bi-directional).
Ouch. By the way, that same commit helped us unveil an issue (at least
in the sense of RFC 9293, 3.8.6) we fixed in passt:
https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
> - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> allowing the tcp window to shrink") addressed the "eating memory"
> problem by introducing a sysctl knob that allows shrinking the
> window before running out of memory.
>
> However, RFC 7323 does not only state that shrinking the window is
> necessary in some cases, it also formulates requirements for TCP
> implementations when doing so (Section 2.4).
>
> This commit addresses the receiver-side requirements: After retracting
> the window, the peer may have a snd_nxt that lies within a previously
> advertised window but is now beyond the retracted window. This means
> that all incoming segments (including pure ACKs) will be rejected
> until the application happens to read enough data to let the peer's
> snd_nxt be in window again (which may be never).
>
> To comply with RFC 7323, the receiver MUST honor any segment that
> would have been in window for any ACK sent by the receiver and, when
> window scaling is in effect, SHOULD track the maximum window sequence
> number it has advertised. This patch tracks that maximum window
> sequence number throughout the connection and uses it in
> tcp_sequence() when deciding whether a segment is acceptable.
> Acceptability of data is not changed.
>
> Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
> ---
> Documentation/networking/net_cachelines/tcp_sock.rst | 1 +
> include/linux/tcp.h | 1 +
> include/net/tcp.h | 14 ++++++++++++++
> net/ipv4/tcp_fastopen.c | 1 +
> net/ipv4/tcp_input.c | 6 ++++--
> net/ipv4/tcp_minisocks.c | 1 +
> net/ipv4/tcp_output.c | 12 ++++++++++++
> .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt | 2 +-
> 8 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
> u32 rate_delivered read_mostly tcp_rate_gen
> u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
> u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
> +u32 rcv_mwnd_seq read_write tcp_select_window
> u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
> u32 notsent_lowat read_mostly tcp_stream_memory_free
> u32 pushed_seq read_write tcp_mark_push,forced_push
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -271,6 +271,7 @@ struct tcp_sock {
> u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
> u32 mdev_us; /* medium deviation */
> u32 rtt_seq; /* sequence number to update rttvar */
> + u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */
Nit: tab between ; and /* for consistency (I would personally prefer
the comment style as you see on 'highest_sack' but I don't think it's
enforced anymore).
Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
here because the relevant paragraph there covers a very specific case of
window retraction, caused by quantisation error from window scaling,
which is not the most common case here. I couldn't quickly find a better
reference though.
More importantly: do we need to restore this on a connection that's
being dumped and recreated using TCP_REPAIR, or will things still work
(even though sub-optimally) if we lose this value?
Other window values that *need* to be dumped and restored are currently
available via TCP_REPAIR_WINDOW socket option, and they are listed in
do_tcp_getsockopt(), net/ipv4/tcp.c:
opt.snd_wl1 = tp->snd_wl1;
opt.snd_wnd = tp->snd_wnd;
opt.max_window = tp->max_window;
opt.rcv_wnd = tp->rcv_wnd;
opt.rcv_wup = tp->rcv_wup;
CRIU uses it to checkpoint and restore established connections, and
passt uses it to migrate them to a different host:
https://criu.org/TCP_connection
https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063
If it's strictly needed to preserve functionality, we would need to add
it to struct tcp_repair_window, notify CRIU maintainers (or send them a
patch), and add this in passt as well (I can take care of it).
Strictly speaking, this could then be considered a breaking change
for userspace, but I don't see how to avoid it, so I'd just make sure
it doesn't impact users, as TCP_REPAIR has just a couple of (known!)
projects relying on it.
An alternative would be to have a special, initial value representing
the fact that this value was lost, but it looks really annoying to not
be able to use a u32 for it.
Disregard all this if the correct value is not strictly needed for
functionality, of course. I haven't tested things (not yet, at least).
> u64 tcp_wstamp_ns; /* departure time for next sent data packet */
> u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
> struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
> return (u32) win;
> }
>
> +/* Compute the maximum receive window we ever advertised.
> + * Rcv_nxt can be after the window if our peer push more data
s/push/pushes/
s/Rcv_nxt/rcv_nxt/ (useful for grepping)
> + * than the offered window.
> + */
> +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> +{
> + s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> +
> + if (win < 0)
> + win = 0;
I must be missing something but... if the sequence is about to wrap,
we'll return 0 here. Is that intended?
Doing the subtraction unsigned would have looked more natural to me,
but I didn't really think it through.
> + return (u32) win;
Kernel coding style doesn't usually include a space between cast and
identifier.
> +}
> +
> +
> /* Choose a new window, without checks for shrinking, and without
> * scaling applied to the result. The caller does these things
> * if necessary. This is a "raw" window selection.
> diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
> index b30090cff3cf7d925dc46694860abd3ca5516d70..f034ef6e3e7b54bf73c77fd2bf1d3090c75dbfc6 100644
> --- a/net/ipv4/tcp_fastopen.c
> +++ b/net/ipv4/tcp_fastopen.c
> @@ -377,6 +377,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
>
> tcp_rsk(req)->rcv_nxt = tp->rcv_nxt;
> tp->rcv_wup = tp->rcv_nxt;
> + tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
> /* tcp_conn_request() is sending the SYNACK,
> * and queues the child into listener accept queue.
> */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e7b41abb82aad33d8cab4fcfa989cc4771149b41..af9dd51256b01fd31d9e390d69dcb1d1700daf1b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4865,8 +4865,8 @@ static enum skb_drop_reason tcp_sequence(const struct sock *sk,
> if (before(end_seq, tp->rcv_wup))
> return SKB_DROP_REASON_TCP_OLD_SEQUENCE;
>
> - if (after(end_seq, tp->rcv_nxt + tcp_receive_window(tp))) {
> - if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> + if (after(end_seq, tp->rcv_nxt + tcp_max_receive_window(tp))) {
> + if (after(seq, tp->rcv_nxt + tcp_max_receive_window(tp)))
> return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
>
> /* Only accept this packet if receive queue is empty. */
> @@ -6959,6 +6959,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
> */
> WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
> tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
> + tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
>
> /* RFC1323: The window in SYN & SYN/ACK segments is
> * never scaled.
> @@ -7071,6 +7072,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
> WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
> WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
> tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
> + tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
>
> /* RFC1323: The window in SYN & SYN/ACK segments is
> * never scaled.
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index ec128865f5c029c971eb00cb9ee058b742efafd1..df95d8b6dce5c746e5e34545aa75a96080cc752d 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -604,6 +604,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> newtp->window_clamp = req->rsk_window_clamp;
> newtp->rcv_ssthresh = req->rsk_rcv_wnd;
> newtp->rcv_wnd = req->rsk_rcv_wnd;
> + newtp->rcv_mwnd_seq = newtp->rcv_wup + req->rsk_rcv_wnd;
> newtp->rx_opt.wscale_ok = ireq->wscale_ok;
> if (newtp->rx_opt.wscale_ok) {
> newtp->rx_opt.snd_wscale = ireq->snd_wscale;
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118d02fc396753d56f210f9d3007c7f..50774443f6ae0ca83f360c7fc3239184a1523e1b 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -274,6 +274,15 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> }
> EXPORT_IPV6_MOD(tcp_select_initial_window);
>
> +/* Check if we need to update the maximum window sequence number */
> +static inline void tcp_update_max_wnd_seq(struct tcp_sock *tp)
> +{
> + u32 wre = tp->rcv_wup + tp->rcv_wnd;
> +
> + if (after(wre, tp->rcv_mwnd_seq))
> + tp->rcv_mwnd_seq = wre;
> +}
> +
> /* Chose a new window to advertise, update state in tcp_sock for the
> * socket, and return result with RFC1323 scaling applied. The return
> * value can be stuffed directly into th->window for an outgoing
> @@ -293,6 +302,7 @@ static u16 tcp_select_window(struct sock *sk)
> tp->pred_flags = 0;
> tp->rcv_wnd = 0;
> tp->rcv_wup = tp->rcv_nxt;
> + tcp_update_max_wnd_seq(tp);
> return 0;
> }
>
> @@ -316,6 +326,7 @@ static u16 tcp_select_window(struct sock *sk)
>
> tp->rcv_wnd = new_win;
> tp->rcv_wup = tp->rcv_nxt;
> + tcp_update_max_wnd_seq(tp);
>
> /* Make sure we do not exceed the maximum possible
> * scaled window.
> @@ -4169,6 +4180,7 @@ static void tcp_connect_init(struct sock *sk)
> else
> tp->rcv_tstamp = tcp_jiffies32;
> tp->rcv_wup = tp->rcv_nxt;
> + tp->rcv_mwnd_seq = tp->rcv_nxt + tp->rcv_wnd;
> WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
>
> inet_csk(sk)->icsk_rto = tcp_timeout_init(sk);
> diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> index 3848b419e68c3fc895ad736d06373fc32f3691c1..1a86ee5093696deb316c532ca8f7de2bbf5cd8ea 100644
> --- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> +++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> @@ -36,7 +36,7 @@
>
> +0 read(4, ..., 100000) = 4000
>
> -// If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
> +// If queue is empty, accept a packet even if its end_seq is above rcv_mwnd_seq
> +0 < P. 4001:54001(50000) ack 1 win 257
> +0 > . 1:1(0) ack 54001 win 0
>
>
--
Stefano
Hi Stefano,
On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
> Hi Simon,
>
> It all makes sense to me at a quick look, I have just some nits and one
> more substantial worry, below:
>
> On Fri, 20 Feb 2026 00:55:14 +0100
> Simon Baatz via B4 Relay <devnull+gmbnomis.gmail.com@kernel.org> wrote:
>
> > From: Simon Baatz <gmbnomis@gmail.com>
> >
> > By default, the Linux TCP implementation does not shrink the
> > advertised window (RFC 7323 calls this "window retraction") with the
> > following exceptions:
> >
> > - When an incoming segment cannot be added due to the receive buffer
> > running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> > handling of extreme memory squeeze") a zero window will be
> > advertised in this case. It turns out that reaching the required
> > "memory pressure" is very easy when window scaling is in use. In the
> > simplest case, sending a sufficient number of segments smaller than
> > the scale factor to a receiver that does not read data is enough.
> >
> > Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> > happens much earlier than before, leading to regressions (the test
> > suite of the Valkey project does not pass because of a TCP
> > connection that is no longer bi-directional).
>
> Ouch. By the way, that same commit helped us unveil an issue (at least
> in the sense of RFC 9293, 3.8.6) we fixed in passt:
>
> https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
This looks concerning: It seems as if just filling the advertised
window triggered the out of memory condition(?). Am I right in
assuming that this happened with the original 1d2fbaad7cd8, not the
relaxed version of tcp_can_ingest() from f017c1f768b?
>
> > - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> > allowing the tcp window to shrink") addressed the "eating memory"
> > problem by introducing a sysctl knob that allows shrinking the
> > window before running out of memory.
> >
> > However, RFC 7323 does not only state that shrinking the window is
> > necessary in some cases, it also formulates requirements for TCP
> > implementations when doing so (Section 2.4).
> >
> > This commit addresses the receiver-side requirements: After retracting
> > the window, the peer may have a snd_nxt that lies within a previously
> > advertised window but is now beyond the retracted window. This means
> > that all incoming segments (including pure ACKs) will be rejected
> > until the application happens to read enough data to let the peer's
> > snd_nxt be in window again (which may be never).
> >
> > To comply with RFC 7323, the receiver MUST honor any segment that
> > would have been in window for any ACK sent by the receiver and, when
> > window scaling is in effect, SHOULD track the maximum window sequence
> > number it has advertised. This patch tracks that maximum window
> > sequence number throughout the connection and uses it in
> > tcp_sequence() when deciding whether a segment is acceptable.
> > Acceptability of data is not changed.
> >
> > Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> > Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> > Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
> > ---
> > Documentation/networking/net_cachelines/tcp_sock.rst | 1 +
> > include/linux/tcp.h | 1 +
> > include/net/tcp.h | 14 ++++++++++++++
> > net/ipv4/tcp_fastopen.c | 1 +
> > net/ipv4/tcp_input.c | 6 ++++--
> > net/ipv4/tcp_minisocks.c | 1 +
> > net/ipv4/tcp_output.c | 12 ++++++++++++
> > .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt | 2 +-
> > 8 files changed, 35 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> > index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> > --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> > +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> > @@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
> > u32 rate_delivered read_mostly tcp_rate_gen
> > u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
> > u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
> > +u32 rcv_mwnd_seq read_write tcp_select_window
> > u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
> > u32 notsent_lowat read_mostly tcp_stream_memory_free
> > u32 pushed_seq read_write tcp_mark_push,forced_push
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -271,6 +271,7 @@ struct tcp_sock {
> > u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
> > u32 mdev_us; /* medium deviation */
> > u32 rtt_seq; /* sequence number to update rttvar */
> > + u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */
>
> Nit: tab between ; and /* for consistency (I would personally prefer
> the comment style as you see on 'highest_sack' but I don't think it's
> enforced anymore).
Thanks, I missed that.
> Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
> here because the relevant paragraph there covers a very specific case of
> window retraction, caused by quantisation error from window scaling,
> which is not the most common case here. I couldn't quickly find a better
> reference though.
I agree, but there is a part that, I think, is more generally
applicable:
2.4. Addressing Window Retraction
[ specific window retraction case introduction removed ]
... Implementations MUST ensure that they handle a shrinking
window, as specified in Section 4.2.2.16 of [RFC1122].
For the receiver, this implies that:
1) The receiver MUST honor, as in window, any segment that would
have been in window for any <ACK> sent by the receiver.
2) When window scaling is in effect, the receiver SHOULD track the
actual maximum window sequence number (which is likely to be
greater than the window announced by the most recent <ACK>, if
more than one segment has arrived since the application consumed
any data in the receive buffer).
There is no "When window scaling is in effect," on the first
requirement. And it "happens" to be implementable by the second
requirement (with or without window scaling).
I think an improvement could be to refer to the receiver requirements
specifically here.
> More importantly: do we need to restore this on a connection that's
> being dumped and recreated using TCP_REPAIR, or will things still work
> (even though sub-optimally) if we lose this value?
>
> Other window values that *need* to be dumped and restored are currently
> available via TCP_REPAIR_WINDOW socket option, and they are listed in
> do_tcp_getsockopt(), net/ipv4/tcp.c:
>
> opt.snd_wl1 = tp->snd_wl1;
> opt.snd_wnd = tp->snd_wnd;
> opt.max_window = tp->max_window;
> opt.rcv_wnd = tp->rcv_wnd;
> opt.rcv_wup = tp->rcv_wup;
>
> CRIU uses it to checkpoint and restore established connections, and
> passt uses it to migrate them to a different host:
>
> https://criu.org/TCP_connection
>
> https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063
>
> If it's strictly needed to preserve functionality, we would need to add
> it to struct tcp_repair_window, notify CRIU maintainers (or send them a
> patch), and add this in passt as well (I can take care of it).
Thanks for the pointer, I missed that tp->rcv_wnd update. Could the
following happen when checkpointing/restoring?
1. A client app opens a connection and writes (blocking) a specific amount
of data before doing any reads. (Not very clever, but this is
supposed to work; this is what caused the problem in the Valkey
tests.)
2. The traffic pattern causes an out-of-memory condition for the
receive buffer; we see the RWIN 0 segments that do not ack the
last data segment(s).
3. TCP connection is checkpointed and restored (on the client side) without
restoring rcv_mwnd_seq.
4. If the receive buffer is still full at the new location, the
acceptable sequence numbers in the receive window will not change
(restored client is still blocked on write) and we no longer have
the larger max receive window -> the client's kernel will reject
all incoming packets and the connection is stuck.
If this scenario is possible, I'd argue that restoring rcv_mwnd_seq
is necessary.
> Strictly speaking, this could then be considered a breaking change
> for userspace, but I don't see how to avoid it, so I'd just make sure
> it doesn't impact users, as TCP_REPAIR has just a couple of (known!)
> projects relying on it.
>
> An alternative would be to have a special, initial value representing
> the fact that this value was lost, but it looks really annoying to not
> be able to use a u32 for it.
Do we need a dedicated value indicating that rcv_mwnd_seq is not
present, or is it enough to choose an initial rcv_mwnd_seq based on
the size of the struct passed? Both seem doable to me:
Missing: Initialize rcv_mwnd_seq = rcv_wup + rcv_wnd (possibly
leading to the problem described above, of course)
Default value 0: Store how much we retracted the window, i.e.
rcv_mwnd_seq - (rcv_wup + rcv_wnd). 0 means the window was not
retracted and could double as the "we don't know" value.
For the time being, I will just initialize rcv_mwnd_seq to rcv_wup +
rcv_wnd in tcp_repair_set_window() to keep status quo. Of course,
I am happy to discuss enhancements.
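As a sketch of the "default value 0" encoding above (hypothetical struct and field names, purely illustrative, not a proposed uAPI change):

```c
#include <stdint.h>

/* Hypothetical encoding for a TCP_REPAIR_WINDOW extension: instead of
 * dumping rcv_mwnd_seq directly, dump how far the window was retracted.
 * 0 then doubles as "window was never retracted / value unknown", and
 * restoring with 0 degrades to the rcv_wup + rcv_wnd fallback.
 */
struct repair_win {		/* illustrative subset, not the real struct */
	uint32_t rcv_wup;
	uint32_t rcv_wnd;
	uint32_t rcv_mwnd_off;	/* rcv_mwnd_seq - (rcv_wup + rcv_wnd) */
};

static uint32_t dump_mwnd_off(uint32_t rcv_mwnd_seq, uint32_t rcv_wup,
			      uint32_t rcv_wnd)
{
	/* unsigned arithmetic, well-defined across sequence wraparound */
	return rcv_mwnd_seq - (rcv_wup + rcv_wnd);
}

static uint32_t restore_mwnd_seq(const struct repair_win *w)
{
	/* off == 0 means "not retracted" or "unknown": same result */
	return w->rcv_wup + w->rcv_wnd + w->rcv_mwnd_off;
}
```

A dump/restore round trip is then lossless when the offset is carried over, and an old dumper that leaves the field zeroed simply restores the status-quo value.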
> Disregard all this if the correct value is not strictly needed for
> functionality, of course. I haven't tested things (not yet, at least).
>
> > u64 tcp_wstamp_ns; /* departure time for next sent data packet */
> > u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
> > struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
> > return (u32) win;
> > }
> >
> > +/* Compute the maximum receive window we ever advertised.
> > + * Rcv_nxt can be after the window if our peer push more data
>
> s/push/pushes/
>
> s/Rcv_nxt/rcv_nxt/ (useful for grepping)
tcp_max_receive_window() is an adapted copy of
tcp_receive_window() above. But it makes sense to improve it.
>
> > + * than the offered window.
> > + */
> > +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> > +{
> > + s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> > +
> > + if (win < 0)
> > + win = 0;
>
> I must be missing something but... if the sequence is about to wrap,
> we'll return 0 here. Is that intended?
>
> Doing the subtraction unsigned would have looked more natural to me,
> but I didn't really think it through.
The subtraction is unsigned and the outcome is interpreted as
signed. And as mentioned, it is copied with pride ;-)
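To make the wraparound behaviour concrete, the same idiom as a standalone sketch (plain C outside the kernel; it mirrors the arithmetic of tcp_receive_window()/tcp_max_receive_window(), but is not the kernel code itself):

```c
#include <stdint.h>

/* The subtraction happens on u32 operands (well-defined modulo 2^32);
 * the result is then *interpreted* as signed. A window edge that lies
 * behind rcv_nxt yields a negative value and is clamped to 0, while a
 * sequence space about to wrap still produces the correct positive
 * distance instead of 0.
 */
static uint32_t max_receive_window(uint32_t rcv_mwnd_seq, uint32_t rcv_nxt)
{
	int32_t win = (int32_t)(rcv_mwnd_seq - rcv_nxt);

	if (win < 0)
		win = 0;
	return (uint32_t)win;
}
```

Near the wrap point, e.g. rcv_nxt = 0xFFFFFF00 and rcv_mwnd_seq = 100, the u32 difference is 356, which is still positive when reinterpreted as s32, so the window is not spuriously reported as 0.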
> > + return (u32) win;
>
> Kernel coding style doesn't usually include a space between cast and
> identifier.
Yes, same reason as above and I will change it.
--
Simon Baatz <gmbnomis@gmail.com>
On Tue, 24 Feb 2026 19:07:45 +0100
Simon Baatz <gmbnomis@gmail.com> wrote:
> Hi Stefano,
>
> On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
> > Hi Simon,
> >
> > It all makes sense to me at a quick look, I have just some nits and one
> > more substantial worry, below:
> >
> > On Fri, 20 Feb 2026 00:55:14 +0100
> > Simon Baatz via B4 Relay <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> >
> > > From: Simon Baatz <gmbnomis@gmail.com>
> > >
> > > By default, the Linux TCP implementation does not shrink the
> > > advertised window (RFC 7323 calls this "window retraction") with the
> > > following exceptions:
> > >
> > > - When an incoming segment cannot be added due to the receive buffer
> > > running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> > > handling of extreme memory squeeze") a zero window will be
> > > advertised in this case. It turns out that reaching the required
> > > "memory pressure" is very easy when window scaling is in use. In the
> > > simplest case, sending a sufficient number of segments smaller than
> > > the scale factor to a receiver that does not read data is enough.
> > >
> > > Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> > > happens much earlier than before, leading to regressions (the test
> > > suite of the Valkey project does not pass because of a TCP
> > > connection that is no longer bi-directional).
> >
> > Ouch. By the way, that same commit helped us unveil an issue (at least
> > in the sense of RFC 9293, 3.8.6) we fixed in passt:
> >
> > https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
>
> This looks concerning: It seems as if just filling the advertised
> window triggered the out of memory condition(?).
Right, even if it's not so much a general "out of memory" condition:
it's just that the socket might simply refuse to queue more data at
that point (we run out of window space, rather than memory).
Together with commit e2142825c120 ("net: tcp: send zero-window ACK when
no memory"), we will even get zero-window updates in that case. Jon
raised the issue here:
https://lore.kernel.org/r/20240406182107.261472-3-jmaloy@redhat.com/
but it was not really fixed. Anyway:
> Am I right in
> assuming that this happened with the original 1d2fbaad7cd8, not the
> relaxed version of tcp_can_ingest() from f017c1f768b?
...you're right. I wasn't even aware of f017c1f768b, thanks for
pointing that out. That seems to make things saner, and I don't expect
further issues at this point.
Speaking of which, passt struggled talking to applications entirely
written in the 21st century. That's socat (started in 2001, I think),
as used in Podman tests, and its only SO_RCVBUF-related fault is
that it uses the default 208 KiB value (from rmem_default) as a
starting value by... not doing anything.
Applications can set SO_RCVBUF and SO_SNDBUF to bigger values
(depending on rmem_max and wmem_max), but if they do, automatic tuning
of TCP buffer sizes (which allows exceeding rmem_max and wmem_max!) is
disabled. We used to do that in passt itself, and I eventually dropped
it here:
https://passt.top/passt/commit/?id=71249ef3f9bcf1dbb2d6c13cdbc41ba88c794f06
because we might really need automatic tuning and the resulting big
buffers for high latency, high throughput connections.
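The trade-off can be sketched with a plain socket (hypothetical helper name; the value reported back depends on net.core.rmem_max):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Minimal sketch of the SO_RCVBUF trade-off described above: setting
 * SO_RCVBUF pins the receive buffer size (the requested value is
 * doubled by the kernel and clamped to net.core.rmem_max) and sets
 * SOCK_RCVBUF_LOCK internally, which disables the automatic receive
 * buffer tuning that may otherwise grow the buffer beyond rmem_max.
 */
static int effective_rcvbuf(int request)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);
	int val = request;
	socklen_t len = sizeof(val);

	if (s < 0)
		return -1;
	setsockopt(s, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));
	getsockopt(s, SOL_SOCKET, SO_RCVBUF, &val, &len);
	close(s);
	return val;	/* doubled and clamped, not the raw request */
}
```

Not touching SO_RCVBUF at all, as passt does now, leaves auto-tuning enabled and lets high-latency, high-throughput connections get the big buffers they need.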
> > > - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> > > allowing the tcp window to shrink") addressed the "eating memory"
> > > problem by introducing a sysctl knob that allows shrinking the
> > > window before running out of memory.
> > >
> > > However, RFC 7323 does not only state that shrinking the window is
> > > necessary in some cases, it also formulates requirements for TCP
> > > implementations when doing so (Section 2.4).
> > >
> > > This commit addresses the receiver-side requirements: After retracting
> > > the window, the peer may have a snd_nxt that lies within a previously
> > > advertised window but is now beyond the retracted window. This means
> > > that all incoming segments (including pure ACKs) will be rejected
> > > until the application happens to read enough data to let the peer's
> > > snd_nxt be in window again (which may be never).
> > >
> > > To comply with RFC 7323, the receiver MUST honor any segment that
> > > would have been in window for any ACK sent by the receiver and, when
> > > window scaling is in effect, SHOULD track the maximum window sequence
> > > number it has advertised. This patch tracks that maximum window
> > > sequence number throughout the connection and uses it in
> > > tcp_sequence() when deciding whether a segment is acceptable.
> > > Acceptability of data is not changed.
> > >
> > > Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> > > Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> > > Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
> > > ---
> > > Documentation/networking/net_cachelines/tcp_sock.rst | 1 +
> > > include/linux/tcp.h | 1 +
> > > include/net/tcp.h | 14 ++++++++++++++
> > > net/ipv4/tcp_fastopen.c | 1 +
> > > net/ipv4/tcp_input.c | 6 ++++--
> > > net/ipv4/tcp_minisocks.c | 1 +
> > > net/ipv4/tcp_output.c | 12 ++++++++++++
> > > .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt | 2 +-
> > > 8 files changed, 35 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> > > --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> > > +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > @@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
> > > u32 rate_delivered read_mostly tcp_rate_gen
> > > u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
> > > u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
> > > +u32 rcv_mwnd_seq read_write tcp_select_window
> > > u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
> > > u32 notsent_lowat read_mostly tcp_stream_memory_free
> > > u32 pushed_seq read_write tcp_mark_push,forced_push
> > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > > index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> > > --- a/include/linux/tcp.h
> > > +++ b/include/linux/tcp.h
> > > @@ -271,6 +271,7 @@ struct tcp_sock {
> > > u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
> > > u32 mdev_us; /* medium deviation */
> > > u32 rtt_seq; /* sequence number to update rttvar */
> > > + u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */
> >
> > Nit: tab between ; and /* for consistency (I would personally prefer
> > the comment style as you see on 'highest_sack' but I don't think it's
> > enforced anymore).
>
> Thanks, I missed that.
>
> > Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
> > here because the relevant paragraph there covers a very specific case of
> > window retraction, caused by quantisation error from window scaling,
> > which is not the most common case here. I couldn't quickly find a better
> > reference though.
>
> I agree, but there is a part that, I think, is more generally
> applicable:
>
> 2.4. Addressing Window Retraction
>
> [ specific window retraction case introduction removed ]
> ... Implementations MUST ensure that they handle a shrinking
> window, as specified in Section 4.2.2.16 of [RFC1122].
>
> For the receiver, this implies that:
>
> 1) The receiver MUST honor, as in window, any segment that would
> have been in window for any <ACK> sent by the receiver.
>
> 2) When window scaling is in effect, the receiver SHOULD track the
> actual maximum window sequence number (which is likely to be
> greater than the window announced by the most recent <ACK>, if
> more than one segment has arrived since the application consumed
> any data in the receive buffer).
>
> There is no "When window scaling is in effect," on the first
> requirement. And it "happens" to be implementable by the second
> requirement (with or without window scaling).
Right, I saw that, but the first requirement doesn't mention the
"actual maximum sequence number" which this new field represents.
> I think an improvement could be to refer to the receiver requirements
> specifically here.
Ah, yes, that sounds like a good idea.
> > More importantly: do we need to restore this on a connection that's
> > being dumped and recreated using TCP_REPAIR, or will things still work
> > (even though sub-optimally) if we lose this value?
> >
> > Other window values that *need* to be dumped and restored are currently
> > available via TCP_REPAIR_WINDOW socket option, and they are listed in
> > do_tcp_getsockopt(), net/ipv4/tcp.c:
> >
> > opt.snd_wl1 = tp->snd_wl1;
> > opt.snd_wnd = tp->snd_wnd;
> > opt.max_window = tp->max_window;
> > opt.rcv_wnd = tp->rcv_wnd;
> > opt.rcv_wup = tp->rcv_wup;
> >
> > CRIU uses it to checkpoint and restore established connections, and
> > passt uses it to migrate them to a different host:
> >
> > https://criu.org/TCP_connection
> >
> > https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063
> >
> > If it's strictly needed to preserve functionality, we would need to add
> > it to struct tcp_repair_window, notify CRIU maintainers (or send them a
> > patch), and add this in passt as well (I can take care of it).
>
> Thanks for the pointer, I missed that tp->rcv_wnd update. Could the
> following happen when checkpointing/restoring?
>
> 1. A client app opens a connection and writes (blocking) a specific amount
> of data before doing any reads. (Not very clever, but this is
> supposed to work; this is what caused the problem in the Valkey
> tests.)
> 2. The traffic pattern causes an out-of-memory condition for the
> receive buffer; we see the RWIN 0 segments that do not ack the
> last data segment(s).
> 3. TCP connection is checkpointed and restored (on the client side) without
> restoring rcv_mwnd_seq.
> 4. If the receive buffer is still full at the new location, the
> acceptable sequence numbers in the receive window will not change
> (restored client is still blocked on write) and we no longer have
> the larger max receive window -> the client's kernel will reject
> all incoming packets and the connection is stuck.
>
> If this scenario is possible, I'd argue that rcv_mwnd_seq is
> necessary.
It really sounds like a corner case, especially 1. in combination with
2., but the outcome would be pretty bad, and I think it's possible.
Typically, once the connection is restored (with TCP_REPAIR_OFF, not
with TCP_REPAIR_OFF_NO_WP), the kernel sends out an empty segment as a
window probe / keepalive, but as far as I understand that wouldn't be
enough to fix the situation. And even if it did, we still have the
TCP_REPAIR_OFF_NO_WP case, even though I'm not aware of any usage.
> > Strictly speaking, in case, this could be considered a breaking change
> > for userspace, but I don't see how to avoid it, so I'd just make sure
> > it doesn't impact users as TCP_REPAIR has just a couple of (known!)
> > projects relying on it.
> >
> > An alternative would be to have a special, initial value representing
> > the fact that this value was lost, but it looks really annoying to not
> > be able to use a u32 for it.
>
> Do we need a dedicated value indicating that rcv_mwnd_seq is not
> present, or is it enough to choose an initial rcv_mwnd_seq based on
> the size of the struct passed? Both seem doable to me:
>
> Missing: Initialize rcv_mwnd_seq = rcv_wup + rcv_wnd (possibly
> leading to the problem described above, of course)
Well, if we might run into the problem described above, we need to
dump / restore rcv_mwnd_seq in any case, and then we wouldn't have an
issue at all.
Except for a compatibility issue, but what you describe looks like a
reasonable fallback.
> Default value 0: Store how much we retracted the window, i.e.
> rcv_mwnd_seq - (rcv_wup + rcv_wnd). 0 means the window was not
> retracted and could double as the "we don't know" value.
>
> For the time being, I will just initialize rcv_mwnd_seq to rcv_wup +
> rcv_wnd in tcp_repair_set_window() to keep status quo. Of course,
> I am happy to discuss enhancements.
That makes sense to me at a glance, but I should still review / test it
as a whole.
> > Disregard all this if the correct value is not strictly needed for
> > functionality, of course. I haven't tested things (not yet, at least).
> >
> > > u64 tcp_wstamp_ns; /* departure time for next sent data packet */
> > > u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
> > > struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
> > > return (u32) win;
> > > }
> > >
> > > +/* Compute the maximum receive window we ever advertised.
> > > + * Rcv_nxt can be after the window if our peer push more data
> >
> > s/push/pushes/
> >
> > s/Rcv_nxt/rcv_nxt/ (useful for grepping)
>
> tcp_max_receive_window() is an adapted copy of
> tcp_receive_window() above. But it makes sense to improve it.
Ah, sorry, I didn't notice.
> > > + * than the offered window.
> > > + */
> > > +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> > > +{
> > > + s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> > > +
> > > + if (win < 0)
> > > + win = 0;
> >
> > I must be missing something but... if the sequence is about to wrap,
> > we'll return 0 here. Is that intended?
> >
> > Doing the subtraction unsigned would have looked more natural to me,
> > but I didn't really think it through.
>
> The subtraction is unsigned and the outcome is interpreted as
> signed. And as mentioned, it is copied with pride ;-)
Oh, wow, I mean, "of course"! How could anybody ever miss that! Pride,
you say. :) ...but sure, if it's taken from there, it makes sense to
keep it like that I guess.
> > > + return (u32) win;
> >
> > Kernel coding style doesn't usually include a space between cast and
> > identifier.
>
> Yes, same reason as above and I will change it.
--
Stefano
On Wed, Feb 25, 2026 at 10:33:34PM +0100, Stefano Brivio wrote:
> On Tue, 24 Feb 2026 19:07:45 +0100
> Simon Baatz <gmbnomis@gmail.com> wrote:
>
> > Hi Stefano,
> >
> > On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
> > > Hi Simon,
> > >
> > > It all makes sense to me at a quick look, I have just some nits and one
> > > more substantial worry, below:
> > >
> > > On Fri, 20 Feb 2026 00:55:14 +0100
> > > Simon Baatz via B4 Relay <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> > >
> > > > From: Simon Baatz <gmbnomis@gmail.com>
> > > >
> > > > By default, the Linux TCP implementation does not shrink the
> > > > advertised window (RFC 7323 calls this "window retraction") with the
> > > > following exceptions:
> > > >
> > > > - When an incoming segment cannot be added due to the receive buffer
> > > > running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> > > > handling of extreme memory squeeze") a zero window will be
> > > > advertised in this case. It turns out that reaching the required
> > > > "memory pressure" is very easy when window scaling is in use. In the
> > > > simplest case, sending a sufficient number of segments smaller than
> > > > the scale factor to a receiver that does not read data is enough.
> > > >
> > > > Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> > > > happens much earlier than before, leading to regressions (the test
> > > > suite of the Valkey project does not pass because of a TCP
> > > > connection that is no longer bi-directional).
> > >
> > > Ouch. By the way, that same commit helped us unveil an issue (at least
> > > in the sense of RFC 9293, 3.8.6) we fixed in passt:
> > >
> > > https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
> >
> > This looks concerning: It seems as if just filling the advertised
> > window triggered the out of memory condition(?).
>
> Right, even if it's not so much a general "out of memory" condition:
> it's just that the socket might simply refuse to queue more data at
> that point (we run out of window space, rather than memory).
>
> Together with commit e2142825c120 ("net: tcp: send zero-window ACK when
> no memory"), we will even get zero-window updates in that case. Jon
> raised the issue here:
>
> https://lore.kernel.org/r/20240406182107.261472-3-jmaloy@redhat.com/
>
> but it was not really fixed. Anyway:
Didn't that result in 8c670bdfa58e ("tcp: correct handling of extreme
memory squeeze")?
> [...]
--
Simon Baatz <gmbnomis@gmail.com>
On Thu, 26 Feb 2026 02:10:25 +0100
Simon Baatz <gmbnomis@gmail.com> wrote:
> On Wed, Feb 25, 2026 at 10:33:34PM +0100, Stefano Brivio wrote:
> > On Tue, 24 Feb 2026 19:07:45 +0100
> > Simon Baatz <gmbnomis@gmail.com> wrote:
> >
> > > Hi Stefano,
> > >
> > > On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
> > > > Hi Simon,
> > > >
> > > > It all makes sense to me at a quick look, I have just some nits and one
> > > > more substantial worry, below:
> > > >
> > > > On Fri, 20 Feb 2026 00:55:14 +0100
> > > > Simon Baatz via B4 Relay <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> > > >
> > > > > From: Simon Baatz <gmbnomis@gmail.com>
> > > > >
> > > > > By default, the Linux TCP implementation does not shrink the
> > > > > advertised window (RFC 7323 calls this "window retraction") with the
> > > > > following exceptions:
> > > > >
> > > > > - When an incoming segment cannot be added due to the receive buffer
> > > > > running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> > > > > handling of extreme memory squeeze") a zero window will be
> > > > > advertised in this case. It turns out that reaching the required
> > > > > "memory pressure" is very easy when window scaling is in use. In the
> > > > > simplest case, sending a sufficient number of segments smaller than
> > > > > the scale factor to a receiver that does not read data is enough.
> > > > >
> > > > > Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> > > > > happens much earlier than before, leading to regressions (the test
> > > > > suite of the Valkey project does not pass because of a TCP
> > > > > connection that is no longer bi-directional).
> > > >
> > > > Ouch. By the way, that same commit helped us unveil an issue (at least
> > > > in the sense of RFC 9293, 3.8.6) we fixed in passt:
> > > >
> > > > https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
> > >
> > > This looks concerning: It seems as if just filling the advertised
> > > window triggered the out of memory condition(?).
> >
> > Right, even if it's not so much a general "out of memory" condition:
> > it's just that the socket might simply refuse to queue more data at
> > that point (we run out of window space, rather than memory).
> >
> > Together with commit e2142825c120 ("net: tcp: send zero-window ACK when
> > no memory"), we will even get zero-window updates in that case. Jon
> > raised the issue here:
> >
> > https://lore.kernel.org/r/20240406182107.261472-3-jmaloy@redhat.com/
> >
> > but it was not really fixed. Anyway:
>
> Didn't that result in 8c670bdfa58e ("tcp: correct handling of extreme
> memory squeeze")?
Yes, but with that (the v3 of it) we still send zero-window updates
more frequently (because of the 'return 0' instead of 'goto out') and
together with e2142825c120 I was seeing in the captures one zero-window
update almost every time the sender filled up the window completely.
Perhaps it was even desired, I'm not sure, I can't say it's entirely
wrong (that's why I didn't propose a further patch), and strictly
speaking the issue was on passt side (we didn't send window probes in
that case, and we didn't retransmit FINs).
I guess with f017c1f768b things should be sane again. I didn't check.
--
Stefano