net/ipv4/tcp.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-)
We have some problem closing zero-window fin-wait-1 tcp sockets in our
environment. This patch come from the investigation.
Previously tcp_abort only sends out reset and calls tcp_done when the
socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
purging the write queue, but not close the socket and left it to the
timer.
While purging the write queue, tp->packets_out and sk->sk_write_queue
is cleared along the way. However tcp_retransmit_timer have early
return based on !tp->packets_out and tcp_probe_timer have early
return based on !sk->sk_write_queue.
This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
and socket not being killed by the timers, converting a zero-windowed
orphan into a forever orphan.
This patch removes the SOCK_DEAD check in tcp_abort, making it send
reset to peer and close the socket accordingly. Preventing the
timer-less orphan from happening.
According to Lorenzo's email in the v1 thread, the check was there to
prevent force-closing the same socket twice. That situation is handled
by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
already closed.
The -ENOENT code comes from the associate patch Lorenzo made for
iproute2-ss; link attached below.
Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Xueming Feng <kuro@kuroa.me>
---
net/ipv4/tcp.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03a342c9162..831a18dc7aa6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
/* Don't race with userspace socket closes such as tcp_close. */
lock_sock(sk);
+ /* Avoid closing the same socket twice. */
+ if (sk->sk_state == TCP_CLOSE) {
+ if (!has_current_bpf_ctx())
+ release_sock(sk);
+ return -ENOENT;
+ }
+
if (sk->sk_state == TCP_LISTEN) {
tcp_set_state(sk, TCP_CLOSE);
inet_csk_listen_stop(sk);
@@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
local_bh_disable();
bh_lock_sock(sk);
- if (!sock_flag(sk, SOCK_DEAD)) {
- if (tcp_need_reset(sk->sk_state))
- tcp_send_active_reset(sk, GFP_ATOMIC,
- SK_RST_REASON_NOT_SPECIFIED);
- tcp_done_with_error(sk, err);
- }
+ if (tcp_need_reset(sk->sk_state))
+ tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
+ tcp_done_with_error(sk, err);
bh_unlock_sock(sk);
local_bh_enable();
- tcp_write_queue_purge(sk);
if (!has_current_bpf_ctx())
release_sock(sk);
return 0;
--
On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote: > The -ENOENT code comes from the associate patch Lorenzo made for > iproute2-ss; link attached below. ENOENT does seem reasonable. It's the same thing that would happen if userspace passed in a nonexistent cookie (we have a test for that). I'd guess this could happen if userspace was trying to destroy a socket but it lost the race against the process owning a socket closing it? > bh_unlock_sock(sk); > local_bh_enable(); > - tcp_write_queue_purge(sk); Is this not necessary in any other cases? What if there is retransmitted data, shouldn't that be cleared? Other than that, I have run the Android tests on this patch and they all passed other than the test that checks that closing FIN_WAIT1 sockets can't be closed. That's expected to fail because it checking that the kernel does not support what you are trying to support. I uploaded a CL to fix that: https://r.android.com/3217682 . Tested-By: Lorenzo Colitti <lorenzo@google.com>
On Mon, Aug 14, 2024 at 7:34 AM Lorenzo Colitti <lorenzo@google.com> wrote: > On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote: > > The -ENOENT code comes from the associate patch Lorenzo made for > > iproute2-ss; link attached below. > > ENOENT does seem reasonable. It's the same thing that would happen if > userspace passed in a nonexistent cookie (we have a test for that). In the latest TCP RFC 9293, section 3.10.5 on the ABORT CALL, it mentions that an "error: connection does not exist" to be returned for a CLOSED STATE. I noticed this while verifying whether a reset in the FIN-WAIT STATE is legal, which it is. > I'd guess this could happen if userspace was trying to destroy a > socket but it lost the race against the process owning a socket > closing it? Yes, that’s exactly the scenario I'm addressing. I tested this locally by calling tcp_diag twice with the same socket pointer. > > > bh_unlock_sock(sk); > > local_bh_enable(); > > - tcp_write_queue_purge(sk); > > Is this not necessary in any other cases? What if there is > retransmitted data, shouldn't that be cleared? The tcp_write_queue_purge() function is indeed invoked within tcp_done_with_error(). In this patch, the tcp_done_with_error is elevated to the same logical level where tcp_write_queue_purge would typically be called. The difference is that the purge happens just before tcp_done. So the queue should still be cleared in other scenarios as well.
On Wed, Aug 14, 2024 at 3:35 PM Lorenzo Colitti <lorenzo@google.com> wrote: > > On Mon, Aug 12, 2024 at 7:53 PM Xueming Feng <kuro@kuroa.me> wrote: > > The -ENOENT code comes from the associate patch Lorenzo made for > > iproute2-ss; link attached below. > > ENOENT does seem reasonable. It's the same thing that would happen if > userspace passed in a nonexistent cookie (we have a test for that). > I'd guess this could happen if userspace was trying to destroy a > socket but it lost the race against the process owning a socket > closing it? > > > bh_unlock_sock(sk); > > local_bh_enable(); > > - tcp_write_queue_purge(sk); > > Is this not necessary in any other cases? What if there is > retransmitted data, shouldn't that be cleared? This line is duplicated, please see tcp_done_with_error()->tcp_write_queue_purge(). Thanks, Jason
On Mon, Aug 12, 2024 at 6:53 PM Xueming Feng <kuro@kuroa.me> wrote:
>
> We have some problem closing zero-window fin-wait-1 tcp sockets in our
> environment. This patch come from the investigation.
>
> Previously tcp_abort only sends out reset and calls tcp_done when the
> socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
> purging the write queue, but not close the socket and left it to the
> timer.
>
> While purging the write queue, tp->packets_out and sk->sk_write_queue
> is cleared along the way. However tcp_retransmit_timer have early
> return based on !tp->packets_out and tcp_probe_timer have early
> return based on !sk->sk_write_queue.
>
> This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
> and socket not being killed by the timers, converting a zero-windowed
> orphan into a forever orphan.
>
> This patch removes the SOCK_DEAD check in tcp_abort, making it send
> reset to peer and close the socket accordingly. Preventing the
> timer-less orphan from happening.
>
> According to Lorenzo's email in the v1 thread, the check was there to
> prevent force-closing the same socket twice. That situation is handled
> by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
> already closed.
>
> The -ENOENT code comes from the associate patch Lorenzo made for
> iproute2-ss; link attached below.
>
> Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
> Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
> Signed-off-by: Xueming Feng <kuro@kuroa.me>
You seem to have forgotten to CC Jakub and Paolo which are also
networking maintainers.
> ---
> net/ipv4/tcp.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03a342c9162..831a18dc7aa6 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
> /* Don't race with userspace socket closes such as tcp_close. */
> lock_sock(sk);
>
> + /* Avoid closing the same socket twice. */
> + if (sk->sk_state == TCP_CLOSE) {
> + if (!has_current_bpf_ctx())
> + release_sock(sk);
> + return -ENOENT;
> + }
> +
> if (sk->sk_state == TCP_LISTEN) {
> tcp_set_state(sk, TCP_CLOSE);
> inet_csk_listen_stop(sk);
> @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
> local_bh_disable();
> bh_lock_sock(sk);
>
> - if (!sock_flag(sk, SOCK_DEAD)) {
> - if (tcp_need_reset(sk->sk_state))
> - tcp_send_active_reset(sk, GFP_ATOMIC,
> - SK_RST_REASON_NOT_SPECIFIED);
> - tcp_done_with_error(sk, err);
> - }
> + if (tcp_need_reset(sk->sk_state))
> + tcp_send_active_reset(sk, GFP_ATOMIC,
> + SK_RST_REASON_NOT_SPECIFIED);
Please use SK_RST_REASON_TCP_STATE here. I should have pointed out this earlier.
Please feel free to add:
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
in your next submission.
Thanks,
Jason
> + tcp_done_with_error(sk, err);
>
> bh_unlock_sock(sk);
> local_bh_enable();
> - tcp_write_queue_purge(sk);
> if (!has_current_bpf_ctx())
> release_sock(sk);
> return 0;
> --
On Mon, Aug 12, 2024 at 10:00 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Mon, Aug 12, 2024 at 6:53 PM Xueming Feng <kuro@kuroa.me> wrote:
> >
> > We have some problem closing zero-window fin-wait-1 tcp sockets in our
> > environment. This patch come from the investigation.
> >
> > Previously tcp_abort only sends out reset and calls tcp_done when the
> > socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
> > purging the write queue, but not close the socket and left it to the
> > timer.
> >
> > While purging the write queue, tp->packets_out and sk->sk_write_queue
> > is cleared along the way. However tcp_retransmit_timer have early
> > return based on !tp->packets_out and tcp_probe_timer have early
> > return based on !sk->sk_write_queue.
> >
> > This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
> > and socket not being killed by the timers, converting a zero-windowed
> > orphan into a forever orphan.
> >
> > This patch removes the SOCK_DEAD check in tcp_abort, making it send
> > reset to peer and close the socket accordingly. Preventing the
> > timer-less orphan from happening.
> >
> > According to Lorenzo's email in the v1 thread, the check was there to
> > prevent force-closing the same socket twice. That situation is handled
> > by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
> > already closed.
> >
> > The -ENOENT code comes from the associate patch Lorenzo made for
> > iproute2-ss; link attached below.
> >
> > Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
> > Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
> > Signed-off-by: Xueming Feng <kuro@kuroa.me>
>
> You seem to have forgotten to CC Jakub and Paolo which are also
> networking maintainers.
>
> > ---
> > net/ipv4/tcp.c | 18 +++++++++++-------
> > 1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index e03a342c9162..831a18dc7aa6 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
> > /* Don't race with userspace socket closes such as tcp_close. */
> > lock_sock(sk);
> >
> > + /* Avoid closing the same socket twice. */
> > + if (sk->sk_state == TCP_CLOSE) {
> > + if (!has_current_bpf_ctx())
> > + release_sock(sk);
> > + return -ENOENT;
> > + }
> > +
> > if (sk->sk_state == TCP_LISTEN) {
> > tcp_set_state(sk, TCP_CLOSE);
> > inet_csk_listen_stop(sk);
> > @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
> > local_bh_disable();
> > bh_lock_sock(sk);
> >
> > - if (!sock_flag(sk, SOCK_DEAD)) {
> > - if (tcp_need_reset(sk->sk_state))
> > - tcp_send_active_reset(sk, GFP_ATOMIC,
> > - SK_RST_REASON_NOT_SPECIFIED);
> > - tcp_done_with_error(sk, err);
> > - }
> > + if (tcp_need_reset(sk->sk_state))
> > + tcp_send_active_reset(sk, GFP_ATOMIC,
> > + SK_RST_REASON_NOT_SPECIFIED);
>
> Please use SK_RST_REASON_TCP_STATE here. I should have pointed out this earlier.
Sorry, my fault. It's the net tree which doesn't have the commit
edefba66d92 yet, so you don't need to modify this.
As I said, the patch looks good to me. Let Eric make the decision
finally. Thanks!
Thanks,
Jason
© 2016 - 2026 Red Hat, Inc.