[PATCH v2 0/5] timer_migration: Fix two possible races and an improvement

Anna-Maria Behnsen posted 5 patches 1 year, 5 months ago
kernel/time/timer_migration.c | 137 ++++++++++++++++++++++++------------------
kernel/time/timer_migration.h |  12 +++-
2 files changed, 90 insertions(+), 59 deletions(-)
[PATCH v2 0/5] timer_migration: Fix two possible races and an improvement
Posted by Anna-Maria Behnsen 1 year, 5 months ago
Borislav reported a warning in timer migration deactive path

  https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local

Sadly it doesn't reproduce directly. But with the change of timing (by
adding a trace prinkt before the warning), it is possible to trigger the
warning reliable at least in my test setup. The problem here is a racy
check agains group->parent pointer. This is also used in other places in
the code and fixing this racy usage is adressed by the first patch.

There was another race reported by Frederic in setup path:

  https://lore.kernel.org/r/ZnWOswTMML6ShzYO@localhost.localdomain

It is addressed patch 2-4. Patch 2 is an already existing patch of v1
(improve tracing) and makes the fix easier. Patch 3 is also a preparation
patch for the final fix and Patch 4 is then the real fix. (I labelled all
those three patches with Fixes tag to be easier selectable.)

While working with the code, I saw that the update of per cpu group wakeup
value could be improved. This improvement is adressed by the last patch.

Patches are available here:

  https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc

---
Changes in v2:

- Address another possible race in setup code (reported by Frederic) and
  recycle therefore one improvement patch
- Change order and move the already existing improvement patch to the end
  of the queue
- Existing patches didn't change
- Link to v1: https://lore.kernel.org/r/20240621-tmigr-fixes-v1-0-8c8a2d8e8d77@linutronix.de

Thanks,

        Anna-Maria

---
Anna-Maria Behnsen (5):
      timer_migration: Do not rely always on group->parent
      timer_migration: Improve tracing
      timer_migration: Split out state update of tmigr_active_up()
      timer_migration: Fix possible race in tmigr_active_up() in setup path
      timer_migration: Spare write when nothing changed

 kernel/time/timer_migration.c | 137 ++++++++++++++++++++++++------------------
 kernel/time/timer_migration.h |  12 +++-
 2 files changed, 90 insertions(+), 59 deletions(-)
Re: [PATCH v2 0/5] timer_migration: Fix two possible races and an improvement
Posted by Frederic Weisbecker 1 year, 5 months ago
On Mon, Jun 24, 2024 at 04:53:52PM +0200, Anna-Maria Behnsen wrote:
> Borislav reported a warning in timer migration deactive path
> 
>   https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local
> 
> Sadly it doesn't reproduce directly. But with the change of timing (by
> adding a trace prinkt before the warning), it is possible to trigger the
> warning reliable at least in my test setup. The problem here is a racy
> check agains group->parent pointer. This is also used in other places in
> the code and fixing this racy usage is adressed by the first patch.
> 
> There was another race reported by Frederic in setup path:
> 
>   https://lore.kernel.org/r/ZnWOswTMML6ShzYO@localhost.localdomain
> 
> It is addressed patch 2-4. Patch 2 is an already existing patch of v1
> (improve tracing) and makes the fix easier. Patch 3 is also a preparation
> patch for the final fix and Patch 4 is then the real fix. (I labelled all
> those three patches with Fixes tag to be easier selectable.)
> 
> While working with the code, I saw that the update of per cpu group wakeup
> value could be improved. This improvement is adressed by the last patch.

Another possible issue with the CPU up code is the fact that child->parent and
child->childmask updates and reads are unordered.

So it's possible that a CPU going (in-)active sees the new parent but doesn't
observe yet the update to child->childmask. This may result to calling
tmigr_(in-)active_up() with a zero data->childmask with all sorts of
consequences.

Or am I missing something that prevents from that?

Thanks.