[v2] Fix race condition between TCP_REPAIR dump and data receive

[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive

Posted by Stefano Brivio 1 month ago

If we receive data on a socket that's in repair mode, the sequence and
contents of the receive queue we dump depend on the timing. We need to
freeze the input queue, otherwise the connection parameters restored
later might not match the actual state of the connection.

Patch 1/2 has the full details and the fix, patch 2/2 introduces
selftests to illustrate the problem and verify the solution.

v2: Don't touch the fast path in 1/2 (concern raised by Kuniyuki Iwashima
    and Eric Dumazet). Further details in the message for 1/2 itself.

Stefano Brivio (2):
  tcp: Don't accept data when socket is in repair mode
  selftests: Add data path tests for TCP_REPAIR mode

 include/net/dropreason-core.h                 |   3 +
 include/net/tcp.h                             |   3 +-
 net/ipv4/tcp.c                                |   9 +
 net/ipv4/tcp_input.c                          |  15 +-
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/net/tcp_repair/.gitignore       |   3 +
 .../testing/selftests/net/tcp_repair/Makefile |  23 ++
 .../testing/selftests/net/tcp_repair/client.c | 273 ++++++++++++++++++
 .../testing/selftests/net/tcp_repair/inner.sh |  32 ++
 .../testing/selftests/net/tcp_repair/outer.sh |  44 +++
 tools/testing/selftests/net/tcp_repair/run.sh |  12 +
 .../testing/selftests/net/tcp_repair/server.c | 155 ++++++++++
 tools/testing/selftests/net/tcp_repair/talk.h |  26 ++
 13 files changed, 596 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/net/tcp_repair/.gitignore
 create mode 100644 tools/testing/selftests/net/tcp_repair/Makefile
 create mode 100644 tools/testing/selftests/net/tcp_repair/client.c
 create mode 100755 tools/testing/selftests/net/tcp_repair/inner.sh
 create mode 100755 tools/testing/selftests/net/tcp_repair/outer.sh
 create mode 100755 tools/testing/selftests/net/tcp_repair/run.sh
 create mode 100644 tools/testing/selftests/net/tcp_repair/server.c
 create mode 100644 tools/testing/selftests/net/tcp_repair/talk.h

-- 
2.43.0

Re: [PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive

Posted by Jakub Kicinski 4 weeks, 1 day ago

On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote:
> Stefano Brivio (2):
>   tcp: Don't accept data when socket is in repair mode

Not sure Eric is on board with this patch in the first place.
Sound like it's not the intended use case for REPAIR so IMO
it's up to TCP maintainers whether we want to support this.
And it's definitely not a Fix.

>   selftests: Add data path tests for TCP_REPAIR mode

Please don't add a new target, fold it under net.
Targets are a PITA to deal with in kselftests.

Re: [PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive

Posted by Stefano Brivio 4 weeks, 1 day ago

On Tue, 19 May 2026 19:03:52 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote:
> > Stefano Brivio (2):
> >   tcp: Don't accept data when socket is in repair mode  
> 
> Not sure Eric is on board with this patch in the first place.
> Sound like it's not the intended use case for REPAIR so IMO
> it's up to TCP maintainers whether we want to support this.
> And it's definitely not a Fix.

Jakub, thanks for looking into this.

As I was pointing out on the v1 thread, I think it's a race
condition regardless of the usage, because after you switch a socket to
repair mode you can do separate operations on the socket and the
outcome will be inconsistent between them. It depends on external
conditions and looks quite fragile.

For example, you could dump a given length of the receive queue, just
to read it a moment later, but now the length is wrong.

Note that now, on the v1 thread, also Andrei, one of the authors of
TCP_REPAIR and the matching feature in CRIU, agreed that my "fix" makes
things safer:

  https://lore.kernel.org/all/CAEWA0a4d-PpWpVexYGP5SLRuzj8hs1W1_Ww6qA4BBrkzSs4umQ@mail.gmail.com/

But I won't certainly insist on handling it as a fix and I'm now taking
care of the new feedback coming from Eric of course.

> >   selftests: Add data path tests for TCP_REPAIR mode  
> 
> Please don't add a new target, fold it under net.
> Targets are a PITA to deal with in kselftests.

Sorry, I had no idea, I'll change that.

-- 
Stefano

Re: [PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive

Posted by Eric Dumazet 4 weeks, 1 day ago

On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote:
> > Stefano Brivio (2):
> >   tcp: Don't accept data when socket is in repair mode
>
> Not sure Eric is on board with this patch in the first place.
> Sound like it's not the intended use case for REPAIR so IMO
> it's up to TCP maintainers whether we want to support this.
> And it's definitely not a Fix.

I am not on board. Only net-next material for sure.

While v2 is slightly better, we have the fundamental problem of adding
stuff to TCP receive path for features that almost nobody use.
(It took more than 13 years to finally complain about the race)

This consumes Gigawats of power on our planet (and tons of CO2), given
the trillions of packets processed every second.

Please at least add a static key as we did in:

commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Oct 25 09:48:24 2021 -0700

    ipv4: guard IP_MINTTL with a static key

    RFC 5082 IP_MINTTL option is rarely used on hosts.

But in my original feedback I pointed that we TCP_REPAIR should
probably only insert a TCP socket into TPC established hash table
only when it is ready to process incoming packets.

>
> >   selftests: Add data path tests for TCP_REPAIR mode
>
> Please don't add a new target, fold it under net.
> Targets are a PITA to deal with in kselftests.

Re: [PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive

Posted by Laurent Vivier 4 weeks, 1 day ago

On 5/20/26 06:39, Eric Dumazet wrote:
> On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote:
>>> Stefano Brivio (2):
>>>    tcp: Don't accept data when socket is in repair mode
>>
>> Not sure Eric is on board with this patch in the first place.
>> Sound like it's not the intended use case for REPAIR so IMO
>> it's up to TCP maintainers whether we want to support this.
>> And it's definitely not a Fix.
> 
> I am not on board. Only net-next material for sure.
> 
> While v2 is slightly better, we have the fundamental problem of adding
> stuff to TCP receive path for features that almost nobody use.
> (It took more than 13 years to finally complain about the race)
> 
> This consumes Gigawats of power on our planet (and tons of CO2), given
> the trillions of packets processed every second.
> 
> Please at least add a static key as we did in:
> 
> commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Mon Oct 25 09:48:24 2021 -0700
> 
>      ipv4: guard IP_MINTTL with a static key
> 
>      RFC 5082 IP_MINTTL option is rarely used on hosts.
> 
> 
> But in my original feedback I pointed that we TCP_REPAIR should
> probably only insert a TCP socket into TPC established hash table
> only when it is ready to process incoming packets.

The problem is on the source side of the migration and I think removing the socket from 
the ehash table would trigger RSTs, closing the connection on the remote side.

Thanks,
Laurent

Re: [PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive

Posted by Eric Dumazet 4 weeks, 1 day ago

On Wed, May 20, 2026 at 12:24 AM Laurent Vivier <lvivier@redhat.com> wrote:
>
> On 5/20/26 06:39, Eric Dumazet wrote:
> > On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote:
> >>> Stefano Brivio (2):
> >>>    tcp: Don't accept data when socket is in repair mode
> >>
> >> Not sure Eric is on board with this patch in the first place.
> >> Sound like it's not the intended use case for REPAIR so IMO
> >> it's up to TCP maintainers whether we want to support this.
> >> And it's definitely not a Fix.
> >
> > I am not on board. Only net-next material for sure.
> >
> > While v2 is slightly better, we have the fundamental problem of adding
> > stuff to TCP receive path for features that almost nobody use.
> > (It took more than 13 years to finally complain about the race)
> >
> > This consumes Gigawats of power on our planet (and tons of CO2), given
> > the trillions of packets processed every second.
> >
> > Please at least add a static key as we did in:
> >
> > commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Mon Oct 25 09:48:24 2021 -0700
> >
> >      ipv4: guard IP_MINTTL with a static key
> >
> >      RFC 5082 IP_MINTTL option is rarely used on hosts.
> >
> >
> > But in my original feedback I pointed that we TCP_REPAIR should
> > probably only insert a TCP socket into TPC established hash table
> > only when it is ready to process incoming packets.
>
> The problem is on the source side of the migration and I think removing the socket from
> the ehash table would trigger RSTs, closing the connection on the remote side.

RST is triggered anyway if TCP_REPAIR did not occur yet.

This is why a netfilter solution is generally used to make sure no
incoming packet is received until the whole state has been repaired.

Re: [PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive

Posted by Stefano Brivio 4 weeks, 1 day ago

On Wed, 20 May 2026 00:27:38 -0700
Eric Dumazet <edumazet@google.com> wrote:

> On Wed, May 20, 2026 at 12:24 AM Laurent Vivier <lvivier@redhat.com> wrote:
> >
> > On 5/20/26 06:39, Eric Dumazet wrote:  
> > > On Tue, May 19, 2026 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:  
> > >>
> > >> On Mon, 18 May 2026 20:34:22 +0200 Stefano Brivio wrote:  
> > >>> Stefano Brivio (2):
> > >>>    tcp: Don't accept data when socket is in repair mode  
> > >>
> > >> Not sure Eric is on board with this patch in the first place.
> > >> Sound like it's not the intended use case for REPAIR so IMO
> > >> it's up to TCP maintainers whether we want to support this.
> > >> And it's definitely not a Fix.  
> > >
> > > I am not on board. Only net-next material for sure.

I wasn't quite sure, I'll re-target it for net-next.

> > > While v2 is slightly better, we have the fundamental problem of adding
> > > stuff to TCP receive path for features that almost nobody use.
> > > (It took more than 13 years to finally complain about the race)
> > >
> > > This consumes Gigawats of power on our planet (and tons of CO2), given
> > > the trillions of packets processed every second.

Absolutely not what I want, I didn't consider that.

> > > Please at least add a static key as we did in:
> > >
> > > commit 020e71a3cf7f50c0f2c54cf2444067b76fe6d785
> > > Author: Eric Dumazet <edumazet@google.com>
> > > Date:   Mon Oct 25 09:48:24 2021 -0700
> > >
> > >      ipv4: guard IP_MINTTL with a static key
> > >
> > >      RFC 5082 IP_MINTTL option is rarely used on hosts.

I will, indeed, thanks for providing the reference.

> > > But in my original feedback I pointed that we TCP_REPAIR should
> > > probably only insert a TCP socket into TPC established hash table
> > > only when it is ready to process incoming packets.

I looked into your original feedback, and replied, because I couldn't
understand how adding sockets to the hash table (they are already
there) would help fixing this.

Now I understand the confusion:

> > The problem is on the source side of the migration and I think removing the socket from
> > the ehash table would trigger RSTs, closing the connection on the remote side.  
> 
> RST is triggered anyway if TCP_REPAIR did not occur yet.

It's different than the problem Kuniyuki pointed to (with the link to
that proposed hack from January): there, you have a situation where the
socket doesn't exist yet because it hasn't been restored yet.

That's livepatching, on the same host / node, a different (new) usage.

But here (regardless of whether it's CRIU or passt) we're talking about
*migration* (CRIU: container, passt: VM) to a different host.

In this case, migration will be considered done / ready only after the
sockets on the *target node* are up and running. No traffic will be
routed to the target node before that point.

As long as we don't prematurely close the sockets on the source node,
there's no issue with RSTs anywhere (and we can also roll things back
on a failed migration). That's something we fixed in userspace without
bothering the kernel at all.

> This is why a netfilter solution is generally used to make sure no
> incoming packet is received until the whole state has been repaired.

See comment from Andrei here: the _main_ reason why CRIU (only by
default, though) sets up netfilter rules seems to be that it makes
things simpler *while dumping connections* in that specific case. A
separate namespace makes it straightforward.

That's not the case for passt: it doesn't need to touch netfilter at
all, and I think it's a very useful thing in terms of complexity,
security, and keeping the downtime to a minimum (other than, as Andrei
mentioned, making things generally safer).

-- 
Stefano