Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
address a memory leak scenario related to TCP sockets.
Issue:
When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
current implementation does not clear the socket's receive queue. This
causes SKBs in the queue to remain allocated until the socket is
explicitly closed by the application. As a consequence:
1. The page pool pages held by these SKBs are not released.
2. The associated page pool cannot be freed.
RFC 9293 Section 3.10.7.4 specifies that when a RST is received in
CLOSE_WAIT state, "all segment queues should be flushed." However, the
current implementation does not flush the receive queue.
Solution:
Add a per-namespace sysctl (net.ipv4.tcp_purge_receive_queue) that,
when enabled, causes the kernel to purge the receive queue when a RST
packet is received in CLOSE_WAIT state. This allows immediate release
of SKBs and their associated memory resources.
The feature is disabled by default to maintain backward compatibility
with existing behavior.
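For reference, the knob would be toggled like any other per-namespace
sysctl (a usage sketch, not part of the patch itself):

```shell
# Enable purging in the current network namespace (runtime toggle).
sysctl -w net.ipv4.tcp_purge_receive_queue=1

# Or persist it across reboots via a sysctl.d drop-in
# (file name is illustrative):
echo "net.ipv4.tcp_purge_receive_queue = 1" > /etc/sysctl.d/90-tcp-purge.conf
```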
Signed-off-by: Leon Hwang <leon.huangfu@shopee.com>
---
Documentation/networking/ip-sysctl.rst | 18 ++++++++++++++++++
.../net_cachelines/netns_ipv4_sysctl.rst | 1 +
include/net/netns/ipv4.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
net/ipv4/tcp_input.c | 16 ++++++++++++++++
5 files changed, 45 insertions(+)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index d1eeb5323af0..71a529462baa 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1441,6 +1441,24 @@ tcp_rto_max_ms - INTEGER
Default: 120,000
+tcp_purge_receive_queue - BOOLEAN
+ When a socket in the TCP_CLOSE_WAIT state receives a RST packet, the
+ default behavior is to not clear its receive queue. As a result,
+ any SKBs in the queue are not freed until the socket is closed.
+ Consequently, the pages held by these SKBs are not released, which
+ can also prevent the associated page pool from being freed.
+
+ If enabled, the receive queue is purged upon receiving the RST,
+ allowing the SKBs and their associated memory to be released
+ promptly.
+
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
UDP variables
=============
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index beaf1880a19b..f2c42e7d84a9 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -123,6 +123,7 @@ unsigned_long sysctl_tcp_comp_sack_delay_ns
unsigned_long sysctl_tcp_comp_sack_slack_ns __tcp_ack_snd_check
int sysctl_max_syn_backlog
int sysctl_tcp_fastopen
+u8 sysctl_tcp_purge_receive_queue
struct_tcp_congestion_ops tcp_congestion_control init_cc
struct_tcp_fastopen_context tcp_fastopen_ctx
unsigned_int sysctl_tcp_fastopen_blackhole_timeout
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e971c7bf164..ab973f30f502 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -220,6 +220,7 @@ struct netns_ipv4 {
u8 sysctl_tcp_nometrics_save;
u8 sysctl_tcp_no_ssthresh_metrics_save;
u8 sysctl_tcp_workaround_signed_windows;
+ u8 sysctl_tcp_purge_receive_queue;
int sysctl_tcp_challenge_ack_limit;
u8 sysctl_tcp_min_tso_segs;
u8 sysctl_tcp_reflect_tos;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 643763bc2142..da30970bb5d5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1641,6 +1641,15 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = SYSCTL_ONE_THOUSAND,
.extra2 = &tcp_rto_max_max,
},
+ {
+ .procname = "tcp_purge_receive_queue",
+ .data = &init_net.ipv4.sysctl_tcp_purge_receive_queue,
+ .maxlen = sizeof(u8),
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
};
static __net_init int ipv4_sysctl_init_net(struct net *net)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6c3f1d031444..43f32fb5831d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4895,6 +4895,7 @@ EXPORT_IPV6_MOD(tcp_done_with_error);
/* When we get a reset we do this. */
void tcp_reset(struct sock *sk, struct sk_buff *skb)
{
+ const struct net *net = sock_net(sk);
int err;
trace_tcp_receive_reset(sk);
@@ -4911,6 +4912,21 @@ void tcp_reset(struct sock *sk, struct sk_buff *skb)
err = ECONNREFUSED;
break;
case TCP_CLOSE_WAIT:
+ /* RFC9293 3.10.7.4. Other States
+ * Second, check the RST bit:
+ * CLOSE-WAIT STATE
+ *
+ * If the RST bit is set, then any outstanding RECEIVEs and
+ * SEND should receive "reset" responses. All segment queues
+ * should be flushed. Users should also receive an unsolicited
+ * general "connection reset" signal. Enter the CLOSED state,
+ * delete the TCB, and return.
+ *
+ * If net.ipv4.tcp_purge_receive_queue is enabled,
+ * sk_receive_queue will be flushed too.
+ */
+ if (unlikely(net->ipv4.sysctl_tcp_purge_receive_queue))
+ skb_queue_purge(&sk->sk_receive_queue);
err = EPIPE;
break;
case TCP_CLOSE:
--
2.52.0
On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
> Issue:
> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> current implementation does not clear the socket's receive queue. This
> causes SKBs in the queue to remain allocated until the socket is
> explicitly closed by the application. As a consequence:
>
> 1. The page pool pages held by these SKBs are not released.

On what kernel version and driver are you observing this?
On 26/2/26 09:43, Jakub Kicinski wrote:
> On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
>> Issue:
>> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
>> current implementation does not clear the socket's receive queue. This
>> causes SKBs in the queue to remain allocated until the socket is
>> explicitly closed by the application. As a consequence:
>>
>> 1. The page pool pages held by these SKBs are not released.
>
> On what kernel version and driver are you observing this?
# uname -r
6.19.0-061900-generic
# ethtool -i eth0
driver: mlx5_core
version: 6.19.0-061900-generic
firmware-version: 26.43.2566 (MT_0000000531)
In addition, the Python scripts below reproduce the issue: SKBs remain
in the receive queue.
Thanks,
Leon
---
server.py:
import socket
import time

HOST, PORT = "127.0.0.1", 9999

s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024)
s.bind((HOST, PORT))
s.listen(1)

conn, addr = s.accept()
print("accepted", addr)

time.sleep(1)
print("Read 1st:", conn.recv(1))

try:
    conn.send(b"A")
    print("sent 1 byte to client")
except Exception as e:
    print("send failed:", e)

time.sleep(1)
conn.settimeout(0.2)
try:
    b = conn.recv(1)
    print("recv(1) after RST:", b, "len=", len(b))
except Exception as e:
    print("recv(1) after RST raised:", repr(e))

print("Connection remains open..")
try:
    print("Press Ctrl+C to stop...")
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    print("\nProgram interrupted by user. Exiting.")

conn.close()
s.close()
client.py:
import socket
import time
HOST, PORT = "127.0.0.1", 9999
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect((HOST, PORT))
payload = b"x" * (4 * 1024) # 4KiB
c.sendall(payload)
time.sleep(0.1)
c.close()
time.sleep(3)
On Mon, 2 Mar 2026 17:55:59 +0800 Leon Hwang wrote:
> [...]
> # uname -r
> 6.19.0-061900-generic
>
> # ethtool -i eth0
> driver: mlx5_core
> version: 6.19.0-061900-generic
> firmware-version: 26.43.2566 (MT_0000000531)

Okay... this kernel + driver should just patiently wait for the page
pool to go away.

What is the actual, end user problem that you're trying to solve?
A few kB of data waiting to be freed is not a huge problem..
On 3/3/26 08:22, Jakub Kicinski wrote:
> [...]
> Okay... this kernel + driver should just patiently wait for the page
> pool to go away.
>
> What is the actual, end user problem that you're trying to solve?
> A few kB of data waiting to be freed is not a huge problem..

Yes, it is not a huge problem.

The actual end-user issue was discussed in
"page_pool: Add page_pool_release_stalled tracepoint" [1].

I think it would be useful to provide a way for SREs to purge the
receive queue when CLOSE_WAIT TCP sockets receive RST packets. If the
NIC, e.g., Mellanox, flaps, the underlying page pool and pages can be
released at the same time.

Links:
[1] https://lore.kernel.org/netdev/b676baa0-2044-4a74-900d-f471620f2896@linux.dev/

Thanks,
Leon
On Tue, Mar 3, 2026 at 3:12 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> [...]
> I think it would be useful to provide a way for SREs to purge the
> receive queue when CLOSE_WAIT TCP sockets receive RST packets. If the
> NIC, e.g., Mellanox, flaps, the underlying page pool and pages can be
> released at the same time.

Perhaps SRE could use this in an emergency?

ss -t -a state close-wait -K
On 3/3/26 11:55, Eric Dumazet wrote:
> [...]
> Perhaps SRE could use this in an emergency?
>
> ss -t -a state close-wait -K

This ss command is acceptable in an emergency.

A sysctl option would be better for persistent SRE operations.

Thanks,
Leon
On 3/3/26 14:26, Leon Hwang wrote:
> [...]
> This ss command is acceptable in an emergency.

However, once a CLOSE_WAIT TCP socket receives an RST packet, it
transitions to the CLOSE state. A socket in the CLOSE state cannot be
killed using the ss approach.

The SKBs remain in the receive queue of the CLOSE socket until it is
closed by the user-space application.

Thanks,
Leon
On Tue, Mar 3, 2026 at 8:55 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> [...]
> However, once a CLOSE_WAIT TCP socket receives an RST packet, it
> transitions to the CLOSE state. A socket in the CLOSE state cannot be
> killed using the ss approach.
>
> The SKBs remain in the receive queue of the CLOSE socket until it is
> closed by the user-space application.

Why user-space application does not drain the receive queue ?

Is there a missing EPOLLIN or something ?
On 3/3/26 16:17, Eric Dumazet wrote:
> [...]
> Why user-space application does not drain the receive queue ?
>
> Is there a missing EPOLLIN or something ?

The user-space application uses a TCP connection pool. It establishes
several TCP connections at startup and keeps them in the pool.

However, the application does not always drain their receive queues.
Instead, it selects one connection from the pool using a hash algorithm
for communication with the TCP server. When it attempts to write data
through a socket in the CLOSE state, it receives -EPIPE and then closes
it. As a result, TCP connections whose underlying socket state is CLOSE
may retain an SKB in their receive queues if they are not selected for
communication.

I proposed a solution to address this issue: close the TCP connection if
the underlying sk_err is non-zero.

Thanks,
Leon
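The proposed application-side fix could be sketched as follows
(`prune_errored` is a hypothetical helper; the real connection pool and
its hashing are application-specific):

```python
import socket


def prune_errored(pool):
    """Return the healthy subset of pool, closing sockets with a pending error.

    getsockopt(SOL_SOCKET, SO_ERROR) reads and clears the socket's sk_err;
    a non-zero value (e.g. ECONNRESET after a RST) means the connection is
    dead, and closing it releases any SKBs still sitting in its receive
    queue.
    """
    healthy = []
    for conn in pool:
        if conn.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) != 0:
            conn.close()
        else:
            healthy.append(conn)
    return healthy
```

Running this periodically, or before selecting a connection from the
pool, would release errored sockets without waiting for a write to fail
with -EPIPE.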
On Tue, Mar 3, 2026 at 9:54 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> [...]
> I proposed a solution to address this issue: close the TCP connection if
> the underlying sk_err is non-zero.

Okay, makes sense to fix the root cause.

Applications can be fixed in a matter of hours, while kernels can stick
to hosts for years.
On Wed, Feb 25, 2026 at 8:46 AM Leon Hwang <leon.huangfu@shopee.com> wrote:
>
> Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
> address a memory leak scenario related to TCP sockets.
We use the term "memory leak" for a persistent loss of memory (until reboot)
Lets not abuse it and confuse various AI/human agents which will
declare emergency situations
caused by an inexistent fatal error.
>
> Issue:
> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> current implementation does not clear the socket's receive queue. This
> causes SKBs in the queue to remain allocated until the socket is
> explicitly closed by the application. As a consequence:
>
> 1. The page pool pages held by these SKBs are not released.
This situation also applies for normal TCP_ESTABLISHED sockets, when
applications
do not drain the receive queue.
As long the application has not called close(), kernel should not
assume the application
will _not_ read the data that was received.
> 2. The associated page pool cannot be freed.
>
> RFC 9293 Section 3.10.7.4 specifies that when a RST is received in
> CLOSE_WAIT state, "all segment queues should be flushed." However, the
> current implementation does not flush the receive queue.
Some buggy stacks send RST anyway after FIN. I think that forcingly
purging good data
received before the RST would add many surprises.
>
> Solution:
> Add a per-namespace sysctl (net.ipv4.tcp_purge_receive_queue) that,
> when enabled, causes the kernel to purge the receive queue when a RST
> packet is received in CLOSE_WAIT state. This allows immediate release
> of SKBs and their associated memory resources.
>
> The feature is disabled by default to maintain backward compatibility
> with existing behavior.
>
> Signed-off-by: Leon Hwang <leon.huangfu@shopee.com>
> ---
> Documentation/networking/ip-sysctl.rst | 18 ++++++++++++++++++
> .../net_cachelines/netns_ipv4_sysctl.rst | 1 +
> include/net/netns/ipv4.h | 1 +
> net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
> net/ipv4/tcp_input.c | 16 ++++++++++++++++
> 5 files changed, 45 insertions(+)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index d1eeb5323af0..71a529462baa 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1441,6 +1441,24 @@ tcp_rto_max_ms - INTEGER
>
> Default: 120,000
>
> +tcp_purge_receive_queue - BOOLEAN
> + When a socket in the TCP_CLOSE_WAIT state receives a RST packet, the
> + default behavior is to not clear its receive queue. As a result,
> + any SKBs in the queue are not freed until the socket is closed.
> + Consequently, the pages held by these SKBs are not released, which
> + can also prevent the associated page pool from being freed.
> +
> + If enabled, the receive queue is purged upon receiving the RST,
> + allowing the SKBs and their associated memory to be released
> + promptly.
> +
> + Possible values:
> +
> + - 0 (disabled)
> + - 1 (enabled)
> +
> + Default: 0 (disabled)
> +
> UDP variables
> =============
>
> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> index beaf1880a19b..f2c42e7d84a9 100644
> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> @@ -123,6 +123,7 @@ unsigned_long sysctl_tcp_comp_sack_delay_ns
> unsigned_long sysctl_tcp_comp_sack_slack_ns __tcp_ack_snd_check
> int sysctl_max_syn_backlog
> int sysctl_tcp_fastopen
> +u8 sysctl_tcp_purge_receive_queue
> struct_tcp_congestion_ops tcp_congestion_control init_cc
> struct_tcp_fastopen_context tcp_fastopen_ctx
> unsigned_int sysctl_tcp_fastopen_blackhole_timeout
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 8e971c7bf164..ab973f30f502 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -220,6 +220,7 @@ struct netns_ipv4 {
> u8 sysctl_tcp_nometrics_save;
> u8 sysctl_tcp_no_ssthresh_metrics_save;
> u8 sysctl_tcp_workaround_signed_windows;
> + u8 sysctl_tcp_purge_receive_queue;
> int sysctl_tcp_challenge_ack_limit;
> u8 sysctl_tcp_min_tso_segs;
> u8 sysctl_tcp_reflect_tos;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 643763bc2142..da30970bb5d5 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -1641,6 +1641,15 @@ static struct ctl_table ipv4_net_table[] = {
> .extra1 = SYSCTL_ONE_THOUSAND,
> .extra2 = &tcp_rto_max_max,
> },
> + {
> + .procname = "tcp_purge_receive_queue",
> + .data = &init_net.ipv4.sysctl_tcp_purge_receive_queue,
> + .maxlen = sizeof(u8),
> + .mode = 0644,
> + .proc_handler = proc_dou8vec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> };
>
> static __net_init int ipv4_sysctl_init_net(struct net *net)
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 6c3f1d031444..43f32fb5831d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4895,6 +4895,7 @@ EXPORT_IPV6_MOD(tcp_done_with_error);
> /* When we get a reset we do this. */
> void tcp_reset(struct sock *sk, struct sk_buff *skb)
> {
> + const struct net *net = sock_net(sk);
> int err;
>
> trace_tcp_receive_reset(sk);
> @@ -4911,6 +4912,21 @@ void tcp_reset(struct sock *sk, struct sk_buff *skb)
> err = ECONNREFUSED;
> break;
> case TCP_CLOSE_WAIT:
> + /* RFC9293 3.10.7.4. Other States
> + * Second, check the RST bit:
> + * CLOSE-WAIT STATE
> + *
> + * If the RST bit is set, then any outstanding RECEIVEs and
> + * SEND should receive "reset" responses. All segment queues
> + * should be flushed. Users should also receive an unsolicited
> + * general "connection reset" signal. Enter the CLOSED state,
> + * delete the TCB, and return.
> + *
> + * If net.ipv4.tcp_purge_receive_queue is enabled,
> + * sk_receive_queue will be flushed too.
> + */
> + if (unlikely(net->ipv4.sysctl_tcp_purge_receive_queue))
> + skb_queue_purge(&sk->sk_receive_queue);
> err = EPIPE;
> break;
> case TCP_CLOSE:
> --
> 2.52.0
>
Please prepare a packetdrill test.
On 25/2/26 16:31, Eric Dumazet wrote:
> On Wed, Feb 25, 2026 at 8:46 AM Leon Hwang <leon.huangfu@shopee.com> wrote:
>>
>> Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
>> address a memory leak scenario related to TCP sockets.
>
> We use the term "memory leak" for a persistent loss of memory (until reboot)

Thanks for the clarification.

> Lets not abuse it and confuse various AI/human agents which will
> declare emergency situations
> caused by an inexistent fatal error.

I'll reword it in the next revision.

> [...]
> This situation also applies for normal TCP_ESTABLISHED sockets, when
> applications
> do not drain the receive queue.
>
> As long the application has not called close(), kernel should not
> assume the application
> will _not_ read the data that was received.

Understood.

This patch provides an option to drain the receive queue in the
CLOSE_WAIT + RST case, instead of purging it unconditionally upon
receiving a RST packet.

> Some buggy stacks send RST anyway after FIN. I think that forcingly
> purging good data
> received before the RST would add many surprises.

Understood.

There is a tcp_write_queue_purge(sk) call in tcp_done_with_error(),
which means sk_write_queue is always purged when a RST packet is
received. I assume the reason for purging sk_write_queue is that any
pending transmissions become meaningless once a RST is received.

Would it be better to defer skb_queue_purge(&sk->sk_receive_queue)
until after tcp_done_with_error()?

[...]

> Please prepare a packetdrill test.

Ack. I'll add a packetdrill test in the next revision.

Thanks,
Leon
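Ahead of the next revision, the requested packetdrill test might look
roughly like the sketch below. This is untested: the sysctl name matches
this patch, but the sequence numbers, window values, and the final
read() outcome would need validation against a kernel carrying the
patch.

```
`sysctl -q net.ipv4.tcp_purge_receive_queue=1`

 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <...>
+.1 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4

// 1000 bytes arrive but the application never reads them.
+0 < P. 1:1001(1000) ack 1 win 257
+0 > . 1:1(0) ack 1001

// FIN moves the socket to CLOSE_WAIT.
+.1 < F. 1001:1001(0) ack 1 win 257
+0 > . 1:1(0) ack 1002

// RST while in CLOSE_WAIT: with the sysctl enabled, the receive
// queue should be purged, so read() reports the reset instead of
// returning the stale 1000 bytes.
+.1 < R. 1002:1002(0) win 257
+0 read(4, ..., 2000) = -1 ECONNRESET (Connection reset by peer)
```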