From: Wesley Atwell
To: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, edumazet@google.com, ncardwell@google.com, dsahern@kernel.org, matttbe@kernel.org, martineau@kernel.org, netdev@vger.kernel.org, mptcp@lists.linux.dev
Cc: kuniyu@google.com, horms@kernel.org, geliang@kernel.org, corbet@lwn.net, skhan@linuxfoundation.org, rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com, 0x7f454c46@gmail.com, linux-doc@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, atwellwea@gmail.com
Subject: [PATCH net 3/7] tcp: honor advertised receive window in memory admission and clamping
Date: Wed, 11 Mar 2026 01:55:56 -0600
Message-Id: <20260311075600.948413-4-atwellwea@gmail.com>
In-Reply-To: <20260311075600.948413-1-atwellwea@gmail.com>
References: <20260311075600.948413-1-atwellwea@gmail.com>

tp->rcv_wnd is an advertised promise to the sender, but receive-memory
accounting was still reconstructing that promise through mutable live
state. Switch the receive-side decisions over to the advertise-time
snapshot.
Use it when deciding whether a packet can be admitted, when deciding
how far to clamp future window growth, and when handling the
scaled-window quantization slack in __tcp_select_window(). If a
snapshot is not available, keep the legacy fallback behavior.

This keeps sender-visible rwnd and the local hard rmem budget in the
same unit system instead of letting ratio drift create accounting
mismatches.

Signed-off-by: Wesley Atwell
---
 include/net/tcp.h     |  1 +
 net/ipv4/tcp_input.c  | 86 ++++++++++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_output.c | 14 ++++++-
 3 files changed, 93 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 187e6d660f62..88ddf7ee826e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -384,6 +384,7 @@ int tcp_ioctl(struct sock *sk, int cmd, int *karg);
 enum skb_drop_reason tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
 void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
 void tcp_rcvbuf_grow(struct sock *sk, u32 newval);
+bool tcp_try_grow_rcvbuf(struct sock *sk, int needed);
 void tcp_rcv_space_adjust(struct sock *sk);
 int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp);
 void tcp_twsk_destructor(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cba89733d121..f76011fc1b7a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -774,8 +774,37 @@ static void tcp_init_buffer_space(struct sock *sk)
 				    (u32)TCP_INIT_CWND * tp->advmss);
 }
 
+/* Try to grow sk_rcvbuf so the hard receive-memory limit covers @needed
+ * bytes beyond the memory already charged in sk_rmem_alloc.
+ */
+bool tcp_try_grow_rcvbuf(struct sock *sk, int needed)
+{
+	struct net *net = sock_net(sk);
+	int target;
+	int rmem2;
+
+	needed = max(needed, 0);
+	target = tcp_rmem_used(sk) + needed;
+
+	if (target <= READ_ONCE(sk->sk_rcvbuf))
+		return true;
+
+	rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
+	if (READ_ONCE(sk->sk_rcvbuf) >= rmem2 ||
+	    (sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
+	    tcp_under_memory_pressure(sk) ||
+	    sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
+		return false;
+
+	WRITE_ONCE(sk->sk_rcvbuf,
+		   min_t(int, rmem2,
+			 max_t(int, READ_ONCE(sk->sk_rcvbuf), target)));
+
+	return target <= READ_ONCE(sk->sk_rcvbuf);
+}
+
 /* 4. Recalculate window clamp after socket hit its memory bounds. */
-static void tcp_clamp_window(struct sock *sk)
+static void tcp_clamp_window_legacy(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -785,14 +814,42 @@ static void tcp_clamp_window(struct sock *sk)
 	icsk->icsk_ack.quick = 0;
 	rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
 
-	if (sk->sk_rcvbuf < rmem2 &&
+	if (READ_ONCE(sk->sk_rcvbuf) < rmem2 &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
 	    !tcp_under_memory_pressure(sk) &&
 	    sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) {
 		WRITE_ONCE(sk->sk_rcvbuf,
 			   min(atomic_read(&sk->sk_rmem_alloc), rmem2));
 	}
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+	if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
+		tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
+}
+
+static void tcp_clamp_window(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 cur_rwnd = tcp_receive_window(tp);
+	int need;
+
+	if (!tcp_space_from_rcv_wnd(tp, cur_rwnd, &need)) {
+		tcp_clamp_window_legacy(sk);
+		return;
+	}
+
+	inet_csk(sk)->icsk_ack.quick = 0;
+	need = max_t(int, need, 0);
+
+	/* Keep the hard receive-memory cap large enough to honor the
+	 * remaining receive window we already exposed to the sender.
+	 * Use the scaling_ratio snapshot taken when tp->rcv_wnd was
+	 * advertised, not the mutable live ratio which may drift later in
+	 * the flow.
+	 */
+	tcp_try_grow_rcvbuf(sk, need);
+
+	/* If the remaining advertised rwnd no longer fits the hard budget,
+	 * slow future window growth until the accounting converges again.
+	 */
+	if (need > tcp_rmem_avail(sk))
 		tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
 }
 
@@ -5374,11 +5431,28 @@ static void tcp_ofo_queue(struct sock *sk)
 static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb);
 static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb);
 
+/* Sequence checks run against the sender-visible receive window before this
+ * point. Convert the incoming payload back to the hard receive-memory budget
+ * using the scaling_ratio that was in force when tp->rcv_wnd was advertised,
+ * so admission keeps honoring the same exposed window even if the live ratio
+ * changes later in the flow. Legacy TCP_REPAIR restores do not have that
+ * advertise-time basis, so they fall back to the pre-series admission rule
+ * until a fresh local advertisement refreshes the pair.
+ *
+ * Do not subtract sk_backlog.len here. tcp_space() already reserves backlog
+ * bytes when selecting future advertised windows, and sk_backlog.len stays
+ * inflated until __release_sock() finishes draining backlog. Subtracting it
+ * again here would double count already-queued backlog packets as they move
+ * into sk_rmem_alloc.
+ */
 static bool tcp_can_ingest(const struct sock *sk, const struct sk_buff *skb)
 {
-	unsigned int rmem = atomic_read(&sk->sk_rmem_alloc);
+	int need;
+
+	if (!tcp_space_from_rcv_wnd(tcp_sk(sk), skb->len, &need))
+		return atomic_read(&sk->sk_rmem_alloc) <= READ_ONCE(sk->sk_rcvbuf);
 
-	return rmem <= sk->sk_rcvbuf;
+	return need <= tcp_rmem_avail(sk);
 }
 
 static int tcp_try_rmem_schedule(struct sock *sk, const struct sk_buff *skb,
@@ -6014,7 +6088,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	/* Do nothing if our queues are empty. */
-	if (!atomic_read(&sk->sk_rmem_alloc))
+	if (!tcp_rmem_used(sk))
 		return -1;
 
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c1b94d67d8fe..5e69fc31a4da 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3377,13 +3377,23 @@ u32 __tcp_select_window(struct sock *sk)
 	 * scaled window will not line up with the MSS boundary anyway.
 	 */
 	if (tp->rx_opt.rcv_wscale) {
+		int rcv_wscale = 1 << tp->rx_opt.rcv_wscale;
+
 		window = free_space;
 
 		/* Advertise enough space so that it won't get scaled away.
-		 * Import case: prevent zero window announcement if
+		 * Important case: prevent zero-window announcement if
 		 * 1<<rcv_wscale > mss.
 		 */
-		window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+		window = ALIGN(window, rcv_wscale);
+
+		/* Back any scale-quantization slack before we expose it.
+		 * Otherwise tcp_can_ingest() can reject data which is still
+		 * within the sender-visible window.
+		 */
+		if (window > free_space &&
+		    !tcp_try_grow_rcvbuf(sk, tcp_space_from_win(sk, window)))
+			window = round_down(free_space, rcv_wscale);
 	} else {
 		window = tp->rcv_wnd;
 		/* Get the largest window that is a nice multiple of mss.
-- 
2.34.1