include/net/dropreason-core.h | 3 + include/net/tcp.h | 3 +- net/ipv4/tcp.c | 9 + net/ipv4/tcp_input.c | 15 +- tools/testing/selftests/Makefile | 1 + .../selftests/net/tcp_repair/.gitignore | 3 + .../testing/selftests/net/tcp_repair/Makefile | 23 ++ .../testing/selftests/net/tcp_repair/client.c | 273 ++++++++++++++++++ .../testing/selftests/net/tcp_repair/inner.sh | 32 ++ .../testing/selftests/net/tcp_repair/outer.sh | 44 +++ tools/testing/selftests/net/tcp_repair/run.sh | 12 + .../testing/selftests/net/tcp_repair/server.c | 155 ++++++++++ tools/testing/selftests/net/tcp_repair/talk.h | 26 ++ 13 files changed, 596 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/net/tcp_repair/.gitignore create mode 100644 tools/testing/selftests/net/tcp_repair/Makefile create mode 100644 tools/testing/selftests/net/tcp_repair/client.c create mode 100755 tools/testing/selftests/net/tcp_repair/inner.sh create mode 100755 tools/testing/selftests/net/tcp_repair/outer.sh create mode 100755 tools/testing/selftests/net/tcp_repair/run.sh create mode 100644 tools/testing/selftests/net/tcp_repair/server.c create mode 100644 tools/testing/selftests/net/tcp_repair/talk.h
If we receive data on a socket that's in repair mode, the sequence and
contents of the receive queue we dump depend on the timing. We need to
freeze the input queue, otherwise the connection parameters restored
later might not match the actual state of the connection.
Patch 1/2 has the full details and the fix, patch 2/2 introduces
selftests to illustrate the problem and verify the solution.
v2: Don't touch the fast path in 1/2 (concern raised by Kuniyuki Iwashima
and Eric Dumazet). Further details in the message for 1/2 itself.
Stefano Brivio (2):
tcp: Don't accept data when socket is in repair mode
selftests: Add data path tests for TCP_REPAIR mode
include/net/dropreason-core.h | 3 +
include/net/tcp.h | 3 +-
net/ipv4/tcp.c | 9 +
net/ipv4/tcp_input.c | 15 +-
tools/testing/selftests/Makefile | 1 +
.../selftests/net/tcp_repair/.gitignore | 3 +
.../testing/selftests/net/tcp_repair/Makefile | 23 ++
.../testing/selftests/net/tcp_repair/client.c | 273 ++++++++++++++++++
.../testing/selftests/net/tcp_repair/inner.sh | 32 ++
.../testing/selftests/net/tcp_repair/outer.sh | 44 +++
tools/testing/selftests/net/tcp_repair/run.sh | 12 +
.../testing/selftests/net/tcp_repair/server.c | 155 ++++++++++
tools/testing/selftests/net/tcp_repair/talk.h | 26 ++
13 files changed, 596 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/net/tcp_repair/.gitignore
create mode 100644 tools/testing/selftests/net/tcp_repair/Makefile
create mode 100644 tools/testing/selftests/net/tcp_repair/client.c
create mode 100755 tools/testing/selftests/net/tcp_repair/inner.sh
create mode 100755 tools/testing/selftests/net/tcp_repair/outer.sh
create mode 100755 tools/testing/selftests/net/tcp_repair/run.sh
create mode 100644 tools/testing/selftests/net/tcp_repair/server.c
create mode 100644 tools/testing/selftests/net/tcp_repair/talk.h
--
2.43.0
On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote: > Stefano Brivio (2): > tcp: Don't accept data when socket is in repair mode Not sure Eric is on board with this patch in the first place. Sound like it's not the intended use case for REPAIR so IMO it's up to TCP maintainers whether we want to support this. And it's definitely not a Fix. > selftests: Add data path tests for TCP_REPAIR mode Please don't add a new target, fold it under net. Targets are a PITA to deal with in kselftests.
On Tue, 19 May 2026 19:03:52 -0700 Jakub Kicinski <kuba@kernel.org> wrote: > On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote: > > Stefano Brivio (2): > > tcp: Don't accept data when socket is in repair mode > > Not sure Eric is on board with this patch in the first place. > Sound like it's not the intended use case for REPAIR so IMO > it's up to TCP maintainers whether we want to support this. > And it's definitely not a Fix. Jakub, thanks for looking into this. As I was pointing out on the v1 thread, I think it's a race condition regardless of the usage, because after you switch a socket to repair mode you can do separate operations on the socket and the outcome will be inconsistent between them. It depends on external conditions and looks quite fragile. For example, you could dump a given length of the receive queue, just to read it a moment later, but now the length is wrong. Note that now, on the v1 thread, also Andrei, one of the authors of TCP_REPAIR and the matching feature in CRIU, agreed that my "fix" makes things safer: https://lore.kernel.org/all/CAEWA0a4d-PpWpVexYGP5SLRuzj8hs1W1_Ww6qA4BBrkzSs4umQ@mail.gmail.com/ But I won't certainly insist on handling it as a fix and I'm now taking care of the new feedback coming from Eric of course. > > selftests: Add data path tests for TCP_REPAIR mode > > Please don't add a new target, fold it under net. > Targets are a PITA to deal with in kselftests. Sorry, I had no idea, I'll change that. -- Stefano
On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote:
> > Stefano Brivio (2):
> > tcp: Don't accept data when socket is in repair mode
>
> Not sure Eric is on board with this patch in the first place.
> Sound like it's not the intended use case for REPAIR so IMO
> it's up to TCP maintainers whether we want to support this.
> And it's definitely not a Fix.
I am not on board. Only net-next material for sure.
While v2 is slightly better, we have the fundamental problem of adding
stuff to TCP receive path for features that almost nobody use.
(It took more than 13 years to finally complain about the race)
This consumes Gigawats of power on our planet (and tons of CO2), given
the trillions of packets processed every second.
Please at least add a static key as we did in:
commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Oct 25 09:48:24 2021 -0700
ipv4: guard IP_MINTTL with a static key
RFC 5082 IP_MINTTL option is rarely used on hosts.
But in my original feedback I pointed that we TCP_REPAIR should
probably only insert a TCP socket into TPC established hash table
only when it is ready to process incoming packets.
>
> > selftests: Add data path tests for TCP_REPAIR mode
>
> Please don't add a new target, fold it under net.
> Targets are a PITA to deal with in kselftests.
On 5/20/26 06:39, Eric Dumazet wrote: > On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote: >> >> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote: >>> Stefano Brivio (2): >>> tcp: Don't accept data when socket is in repair mode >> >> Not sure Eric is on board with this patch in the first place. >> Sound like it's not the intended use case for REPAIR so IMO >> it's up to TCP maintainers whether we want to support this. >> And it's definitely not a Fix. > > I am not on board. Only net-next material for sure. > > While v2 is slightly better, we have the fundamental problem of adding > stuff to TCP receive path for features that almost nobody use. > (It took more than 13 years to finally complain about the race) > > This consumes Gigawats of power on our planet (and tons of CO2), given > the trillions of packets processed every second. > > Please at least add a static key as we did in: > > commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785 > Author: Eric Dumazet <edumazet@google.com> > Date: Mon Oct 25 09:48:24 2021 -0700 > > ipv4: guard IP_MINTTL with a static key > > RFC 5082 IP_MINTTL option is rarely used on hosts. > > > But in my original feedback I pointed that we TCP_REPAIR should > probably only insert a TCP socket into TPC established hash table > only when it is ready to process incoming packets. The problem is on the source side of the migration and I think removing the socket from the ehash table would trigger RSTs, closing the connection on the remote side. Thanks, Laurent
On Wed, May 20, 2026 at 12:24 AM Laurent Vivier <lvivier@redhat.com> wrote: > > On 5/20/26 06:39, Eric Dumazet wrote: > > On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote: > >> > >> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote: > >>> Stefano Brivio (2): > >>> tcp: Don't accept data when socket is in repair mode > >> > >> Not sure Eric is on board with this patch in the first place. > >> Sound like it's not the intended use case for REPAIR so IMO > >> it's up to TCP maintainers whether we want to support this. > >> And it's definitely not a Fix. > > > > I am not on board. Only net-next material for sure. > > > > While v2 is slightly better, we have the fundamental problem of adding > > stuff to TCP receive path for features that almost nobody use. > > (It took more than 13 years to finally complain about the race) > > > > This consumes Gigawats of power on our planet (and tons of CO2), given > > the trillions of packets processed every second. > > > > Please at least add a static key as we did in: > > > > commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785 > > Author: Eric Dumazet <edumazet@google.com> > > Date: Mon Oct 25 09:48:24 2021 -0700 > > > > ipv4: guard IP_MINTTL with a static key > > > > RFC 5082 IP_MINTTL option is rarely used on hosts. > > > > > > But in my original feedback I pointed that we TCP_REPAIR should > > probably only insert a TCP socket into TPC established hash table > > only when it is ready to process incoming packets. > > The problem is on the source side of the migration and I think removing the socket from > the ehash table would trigger RSTs, closing the connection on the remote side. RST is triggered anyway if TCP_REPAIR did not occur yet. This is why a netfilter solution is generally used to make sure no incoming packet is received until the whole state has been repaired.
On Wed, 20 May 2026 00:27:38 -0700 Eric Dumazet <edumazet@google.com> wrote: > On Wed, May 20, 2026 at 12:24 AM Laurent Vivier <lvivier@redhat.com> wrote: > > > > On 5/20/26 06:39, Eric Dumazet wrote: > > > On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote: > > >> > > >> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote: > > >>> Stefano Brivio (2): > > >>> tcp: Don't accept data when socket is in repair mode > > >> > > >> Not sure Eric is on board with this patch in the first place. > > >> Sound like it's not the intended use case for REPAIR so IMO > > >> it's up to TCP maintainers whether we want to support this. > > >> And it's definitely not a Fix. > > > > > > I am not on board. Only net-next material for sure. I wasn't quite sure, I'll re-target it for net-next. > > > While v2 is slightly better, we have the fundamental problem of adding > > > stuff to TCP receive path for features that almost nobody use. > > > (It took more than 13 years to finally complain about the race) > > > > > > This consumes Gigawats of power on our planet (and tons of CO2), given > > > the trillions of packets processed every second. Absolutely not what I want, I didn't consider that. > > > Please at least add a static key as we did in: > > > > > > commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785 > > > Author: Eric Dumazet <edumazet@google.com> > > > Date: Mon Oct 25 09:48:24 2021 -0700 > > > > > > ipv4: guard IP_MINTTL with a static key > > > > > > RFC 5082 IP_MINTTL option is rarely used on hosts. I will, indeed, thanks for providing the reference. > > > But in my original feedback I pointed that we TCP_REPAIR should > > > probably only insert a TCP socket into TPC established hash table > > > only when it is ready to process incoming packets. I looked into your original feedback, and replied, because I couldn't understand how adding sockets to the hash table (they are already there) would help fixing this. Now I understand the confusion: > > The problem is on the source side of the migration and I think removing the socket from > > the ehash table would trigger RSTs, closing the connection on the remote side. > > RST is triggered anyway if TCP_REPAIR did not occur yet. It's different than the problem Kuniyuki pointed to (with the link to that proposed hack from January): there, you have a situation where the socket doesn't exist yet because it hasn't been restored yet. That's livepatching, on the same host / node, a different (new) usage. But here (regardless of whether it's CRIU or passt) we're talking about *migration* (CRIU: container, passt: VM) to a different host. In this case, migration will be considered done / ready only after the sockets on the *target node* are up and running. No traffic will be routed to the target node before that point. As long as we don't prematurely close the sockets on the source node, there's no issue with RSTs anywhere (and we can also roll things back on a failed migration). That's something we fixed in userspace without bothering the kernel at all. > This is why a netfilter solution is generally used to make sure no > incoming packet is received until the whole state has been repaired. See comment from Andrei here: the _main_ reason why CRIU (only by default, though) sets up netfilter rules seems to be that it makes things simpler *while dumping connections* in that specific case. A separate namespace makes it straightforward. That's not the case for passt: it doesn't need to touch netfilter at all, and I think it's a very useful thing in terms of complexity, security, and keeping the downtime to a minimum (other than, as Andrei mentioned, making things generally safer). -- Stefano
© 2016 - 2026 Red Hat, Inc.