Hi,

This is v5 of the series where the tw_timer is un-pinned to get rid of
interference in isolated-CPU setups.

The first patch is a new one stemming from Jakub's bug report. It's there
mainly to make the reviewing a bit easier, but as it changes behaviour it
should be squashed with the second one.

Revisions
=========

v4 -> v5
++++++++

o Rebased against latest Linus' tree
o Converted tw_timer into a delayed work following Jakub's bug report on v4
  http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org

v3 -> v4
++++++++

o Rebased against latest Linus' tree
o Added ehash lock usage to serialize scheduling vs descheduling of the
  tw_timer (Paolo)

v2 -> v3
++++++++

o Dropped bh_disable patch
o Rebased against latest Linus' tree

RFCv1 -> v2
++++++++

o Added comment in inet_twsk_deschedule_put() to highlight the race
o Added bh_disable patch

Valentin Schneider (2):
  SQUASH: tcp/dccp: Convert timewait timer into a delayed_work
  tcp/dccp: Un-pin tw_timer

 include/net/inet_timewait_sock.h              |  8 +-
 net/dccp/minisocks.c                          |  9 +--
 net/ipv4/inet_diag.c                          |  2 +-
 net/ipv4/inet_timewait_sock.c                 | 73 +++++++++++++------
 net/ipv4/tcp_ipv4.c                           |  2 +-
 net/ipv4/tcp_minisocks.c                      |  9 +--
 net/ipv6/tcp_ipv6.c                           |  2 +-
 .../selftests/bpf/progs/bpf_iter_tcp4.c       |  2 +-
 .../selftests/bpf/progs/bpf_iter_tcp6.c       |  2 +-
 9 files changed, 69 insertions(+), 40 deletions(-)

--
2.43.0
On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> Hi,
>
> This is v5 of the series where the tw_timer is un-pinned to get rid of
> interferences in isolated CPUs setups.
>
> The first patch is a new one stemming from Jakub's bug reported. It's there
> mainly to make the reviewing a bit easier, but as it changes behaviour it should
> be squashed with the second one.
>
> Revisions
> =========
>
> v4 -> v5
> ++++++++
>
> o Rebased against latest Linus' tree
> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org

What was the issue again ?

Please explain precisely why it was fundamentally tied to the use of
timers (and this was not possible to fix the issue without
adding work queues and more dependencies to TCP stack)
On 15/04/24 14:35, Eric Dumazet wrote:
> On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> Hi,
>>
>> This is v5 of the series where the tw_timer is un-pinned to get rid of
>> interferences in isolated CPUs setups.
>>
>> The first patch is a new one stemming from Jakub's bug reported. It's there
>> mainly to make the reviewing a bit easier, but as it changes behaviour it should
>> be squashed with the second one.
>>
>> Revisions
>> =========
>>
>> v4 -> v5
>> ++++++++
>>
>> o Rebased against latest Linus' tree
>> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>> http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
>
> What was the issue again ?
>
> Please explain precisely why it was fundamentally tied to the use of
> timers (and this was not possible to fix the issue without
> adding work queues and more dependencies to TCP stack)
In v4 I added the use of the ehash lock to serialize arming the timewait
timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).
Unfortunately, holding a lock both in a timer callback and in the context
in which it is destroyed is invalid. AIUI the issue is as follows:
    CPUx                                  CPUy
                                          spin_lock(foo);
    <timer fires>
    call_timer_fn()
      spin_lock(foo) // blocks
                                          timer_shutdown_sync()
                                            __timer_delete_sync()
                                              __try_to_del_timer_sync() // looped as long as timer is running
    <deadlock>
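
As a minimal self-contained sketch of that pattern (hypothetical foo_* names,
nothing to do with the actual v4 code):

  static DEFINE_SPINLOCK(foo_lock);
  static struct timer_list foo_timer;
  /* timer_setup(&foo_timer, foo_timer_fn, 0) done at init time */

  static void foo_timer_fn(struct timer_list *t)
  {
          spin_lock(&foo_lock);   /* blocks while foo_teardown() holds foo_lock */
          /* ... */
          spin_unlock(&foo_lock);
  }

  static void foo_teardown(void)
  {
          spin_lock(&foo_lock);
          /* Spins until foo_timer_fn() returns, which it never will: deadlock */
          timer_shutdown_sync(&foo_timer);
          spin_unlock(&foo_lock);
  }
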
In our case, we had in v4:
                                          inet_twsk_deschedule_put()
                                            spin_lock(ehash_lock);
    tw_timer_handler()
      inet_twsk_kill()
        spin_lock(ehash_lock);
                                            __inet_twsk_kill();
                                              timer_shutdown_sync(&tw->tw_timer);
The fix here is to move the timer deletion to a non-timer
context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
a work item, I converted it to a delayed_work.
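
For illustration, this is roughly the shape it takes (simplified sketch; names
like tw_expiry_work / tw_expiry_workfn are mine for this example and don't
necessarily match the patch, which also has to deal with refcounting and
serialization details):

  struct inet_timewait_sock {
          ...
          struct delayed_work     tw_expiry_work; /* was: struct timer_list tw_timer */
  };

  static void tw_expiry_workfn(struct work_struct *work)
  {
          struct inet_timewait_sock *tw =
                  container_of(to_delayed_work(work),
                               struct inet_timewait_sock, tw_expiry_work);

          /* The expiry path now runs in process context rather than in a
           * timer callback. */
          inet_twsk_kill(tw);
  }

  /* arming (was mod_timer(&tw->tw_timer, jiffies + timeo)): */
  queue_delayed_work(system_wq, &tw->tw_expiry_work, timeo);

  /* descheduling (was timer_shutdown_sync(&tw->tw_timer)): */
  cancel_delayed_work_sync(&tw->tw_expiry_work);
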
On Mon, Apr 15, 2024 at 4:33 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 15/04/24 14:35, Eric Dumazet wrote:
> > On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> This is v5 of the series where the tw_timer is un-pinned to get rid of
> >> interferences in isolated CPUs setups.
> >>
> >> The first patch is a new one stemming from Jakub's bug reported. It's there
> >> mainly to make the reviewing a bit easier, but as it changes behaviour it should
> >> be squashed with the second one.
> >>
> >> Revisions
> >> =========
> >>
> >> v4 -> v5
> >> ++++++++
> >>
> >> o Rebased against latest Linus' tree
> >> o Converted tw_timer into a delayed work following Jakub's bug report on v4
> >>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
> >
> > What was the issue again ?
> >
> > Please explain precisely why it was fundamentally tied to the use of
> > timers (and this was not possible to fix the issue without
> > adding work queues and more dependencies to TCP stack)
>
> In v4 I added the use of the ehash lock to serialize arming the timewait
> timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).
>
> Unfortunately, holding a lock both in a timer callback and in the context
> in which it is destroyed is invalid. AIUI the issue is as follows:
>
>     CPUx                                  CPUy
>                                           spin_lock(foo);
>     <timer fires>
>     call_timer_fn()
>       spin_lock(foo) // blocks
>                                           timer_shutdown_sync()
>                                             __timer_delete_sync()
>                                               __try_to_del_timer_sync() // looped as long as timer is running
>     <deadlock>
>
> In our case, we had in v4:
>
>                                           inet_twsk_deschedule_put()
>                                             spin_lock(ehash_lock);
>     tw_timer_handler()
>       inet_twsk_kill()
>         spin_lock(ehash_lock);
>                                             __inet_twsk_kill();
>                                               timer_shutdown_sync(&tw->tw_timer);
>
> The fix here is to move the timer deletion to a non-timer
> context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
> a work item, I converted it to a delayed_work.

I do not like this delayed work approach.

Adding more dependencies to the TCP stack is not very nice from a
maintainer point of view.

Why couldn't you call timer_shutdown_sync() before grabbing the lock ?
Apologies for the delayed reply, I was away for most of last week;
On 16/04/24 17:01, Eric Dumazet wrote:
> On Mon, Apr 15, 2024 at 4:33 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> On 15/04/24 14:35, Eric Dumazet wrote:
>> > On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> v4 -> v5
>> >> ++++++++
>> >>
>> >> o Rebased against latest Linus' tree
>> >> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>> >> http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
>> >
>> > What was the issue again ?
>> >
>> > Please explain precisely why it was fundamentally tied to the use of
>> > timers (and this was not possible to fix the issue without
>> > adding work queues and more dependencies to TCP stack)
>>
>> In v4 I added the use of the ehash lock to serialize arming the timewait
>> timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).
>>
>> Unfortunately, holding a lock both in a timer callback and in the context
>> in which it is destroyed is invalid. AIUI the issue is as follows:
>>
>>     CPUx                                  CPUy
>>                                           spin_lock(foo);
>>     <timer fires>
>>     call_timer_fn()
>>       spin_lock(foo) // blocks
>>                                           timer_shutdown_sync()
>>                                             __timer_delete_sync()
>>                                               __try_to_del_timer_sync() // looped as long as timer is running
>>     <deadlock>
>>
>> In our case, we had in v4:
>>
>>                                           inet_twsk_deschedule_put()
>>                                             spin_lock(ehash_lock);
>>     tw_timer_handler()
>>       inet_twsk_kill()
>>         spin_lock(ehash_lock);
>>                                             __inet_twsk_kill();
>>                                               timer_shutdown_sync(&tw->tw_timer);
>>
>> The fix here is to move the timer deletion to a non-timer
>> context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
>> a work item, I converted it to a delayed_work.
>
> I do not like this delayed work approach.
>
> Adding more dependencies to the TCP stack is not very nice from a
> maintainer point of view.
>
> Why couldn't you call timer_shutdown_sync() before grabbing the lock ?
We need the timer_shutdown_sync() and mod_timer() of tw->tw_timer to be
serialized in some way. If they aren't, we have the following race:
    tcp_time_wait()
      inet_twsk_hashdance()
                                          inet_twsk_deschedule_put()
                                            // Returns 0 because not pending, but prevents future arming
                                            timer_shutdown_sync()
      inet_twsk_schedule()
        // Returns 0 as if timer had been successfully armed
        mod_timer()
This means inet_twsk_deschedule_put() doesn't end up calling
inet_twsk_kill() (because the timer wasn't pending when it got shut down),
but inet_twsk_schedule() doesn't arm it either despite the hashdance()
having updated the refcounts.
If we leave the deschedule as a del_timer_sync(), the timer ends up armed
in inet_twsk_schedule(), but that means waiting for the timer to fire to
clean up the resources despite having called inet_twsk_deschedule_put().
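
To spell out the two teardown flavours side by side (illustrative snippet;
tw is the timewait socket from the race above, before inet_twsk_schedule()
has managed to arm the timer):

  /* deschedule with shutdown semantics: */
  timer_shutdown_sync(&tw->tw_timer);  /* returns false (not pending), and a later
                                        * mod_timer() becomes a silent no-op, so
                                        * nothing ever cleans the timewait sock up */

  /* deschedule with plain deletion semantics: */
  del_timer_sync(&tw->tw_timer);       /* also returns false, but a later
                                        * mod_timer() does arm the timer, so cleanup
                                        * only happens once the timer finally fires */
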
Hi,

On 22/04/24 16:31, Valentin Schneider wrote:
> Apologies for the delayed reply, I was away for most of last week;
>
> On 16/04/24 17:01, Eric Dumazet wrote:
>> On Mon, Apr 15, 2024 at 4:33 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> On 15/04/24 14:35, Eric Dumazet wrote:
>>> > On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>> >> v4 -> v5
>>> >> ++++++++
>>> >>
>>> >> o Rebased against latest Linus' tree
>>> >> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>>> >>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
>>> >
>>> > What was the issue again ?
>>> >
>>> > Please explain precisely why it was fundamentally tied to the use of
>>> > timers (and this was not possible to fix the issue without
>>> > adding work queues and more dependencies to TCP stack)
>>>
>>> In v4 I added the use of the ehash lock to serialize arming the timewait
>>> timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).
>>>
>>> Unfortunately, holding a lock both in a timer callback and in the context
>>> in which it is destroyed is invalid. AIUI the issue is as follows:
>>>
>>>     CPUx                                  CPUy
>>>                                           spin_lock(foo);
>>>     <timer fires>
>>>     call_timer_fn()
>>>       spin_lock(foo) // blocks
>>>                                           timer_shutdown_sync()
>>>                                             __timer_delete_sync()
>>>                                               __try_to_del_timer_sync() // looped as long as timer is running
>>>     <deadlock>
>>>
>>> In our case, we had in v4:
>>>
>>>                                           inet_twsk_deschedule_put()
>>>                                             spin_lock(ehash_lock);
>>>     tw_timer_handler()
>>>       inet_twsk_kill()
>>>         spin_lock(ehash_lock);
>>>                                             __inet_twsk_kill();
>>>                                               timer_shutdown_sync(&tw->tw_timer);
>>>
>>> The fix here is to move the timer deletion to a non-timer
>>> context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
>>> a work item, I converted it to a delayed_work.

Does this explanation make sense? This is the reasoning that drove me to
involve workqueues. I'm open to suggestions on alternative approaches.