[PATCH] genirq: Fix IRQ threads affinity VS cpuset isolated partitions
Posted by Frederic Weisbecker 1 month, 1 week ago
When a cpuset isolated partition is created, updated or destroyed,
the IRQ threads are blindly affined to all the non-isolated CPUs,
without taking into account the IRQ thread's initial affinity, which
is simply ignored.

For example in a system with 8 CPUs, if an IRQ and its kthread are
initially affined to CPU 5, creating an isolated partition with only
CPU 2 inside will eventually end up affining the IRQ kthread to all
CPUs but CPU 2 (that is, CPUs 0,1,3-7), losing the kthread's
preference for CPU 5.

Besides the blind re-affining, this doesn't take care of the actual
low-level interrupt, which isn't migrated. As of today the only way
to isolate non-managed interrupts, along with their kthreads, is to
overwrite their affinity separately, for example through /proc/irq/

To avoid having to do that manually, future development should focus
on updating the IRQs' affinity whenever cpuset isolated partitions
are updated.

In the meantime, cpuset shouldn't fiddle with IRQ threads directly.
To prevent that, set the PF_NO_SETAFFINITY flag on them.
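
The flag is set as a side effect of kthread_bind_mask() in the hunk
below, which both applies the cpumask and marks the task before its
first wakeup. For reference, a condensed sketch of the scheduler-side
guard this arms (simplified from the set_cpus_allowed_ptr() path in
kernel/sched/core.c; locking and error handling elided):

	/*
	 * Any affinity change request targeting a PF_NO_SETAFFINITY
	 * task bails out, which is what keeps cpuset away from these
	 * IRQ threads.
	 */
	if (p->flags & PF_NO_SETAFFINITY)
		return -EINVAL;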

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/irq/manage.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 400856abf672..5ca000c9f4a7 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -176,7 +176,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
 }
 
 /**
- * irq_set_thread_affinity - Notify irq threads to adjust affinity
+ * irq_thread_update_affinity - Notify irq threads to adjust affinity
  * @desc:	irq descriptor which has affinity changed
  *
  * Just set IRQTF_AFFINITY and delegate the affinity setting to the
@@ -184,7 +184,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
  * we hold desc->lock and this code can be called from hard interrupt
  * context.
  */
-static void irq_set_thread_affinity(struct irq_desc *desc)
+static void irq_thread_update_affinity(struct irq_desc *desc)
 {
 	struct irqaction *action;
 
@@ -283,7 +283,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		fallthrough;
 	case IRQ_SET_MASK_OK_NOCOPY:
 		irq_validate_effective_affinity(data);
-		irq_set_thread_affinity(desc);
+		irq_thread_update_affinity(desc);
 		ret = 0;
 	}
 
@@ -1035,8 +1035,23 @@ static void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *a
 		set_cpus_allowed_ptr(current, mask);
 	free_cpumask_var(mask);
 }
+
+static inline void irq_thread_set_affinity(struct task_struct *t,
+					   struct irq_desc *desc)
+{
+	const struct cpumask *mask;
+
+	if (cpumask_available(desc->irq_common_data.affinity))
+		mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
+	else
+		mask = cpu_possible_mask;
+
+	kthread_bind_mask(t, mask);
+}
 #else
 static inline void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
+static inline void irq_thread_set_affinity(struct task_struct *t,
+					   struct irq_desc *desc) { }
 #endif
 
 static int irq_wait_for_interrupt(struct irq_desc *desc,
@@ -1221,6 +1236,7 @@ static void wake_up_and_wait_for_irq_thread_ready(struct irq_desc *desc,
 	if (!action || !action->thread)
 		return;
 
+	irq_thread_set_affinity(action->thread, desc);
 	wake_up_process(action->thread);
 	wait_event(desc->wait_for_threads,
 		   test_bit(IRQTF_READY, &action->thread_flags));
@@ -1405,16 +1421,7 @@ setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 	 * references an already freed task_struct.
 	 */
 	new->thread = get_task_struct(t);
-	/*
-	 * Tell the thread to set its affinity. This is
-	 * important for shared interrupt handlers as we do
-	 * not invoke setup_affinity() for the secondary
-	 * handlers as everything is already set up. Even for
-	 * interrupts marked with IRQF_NO_BALANCE this is
-	 * correct as we want the thread to move to the cpu(s)
-	 * on which the requesting code placed the interrupt.
-	 */
-	set_bit(IRQTF_AFFINITY, &new->thread_flags);
+
 	return 0;
 }
 
-- 
2.51.0
Re: [PATCH] genirq: Fix IRQ threads affinity VS cpuset isolated partitions
Posted by Waiman Long 1 month, 1 week ago
On 11/5/25 8:17 AM, Frederic Weisbecker wrote:
> [...]
> @@ -1035,8 +1035,23 @@ static void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *a
>   		set_cpus_allowed_ptr(current, mask);
>   	free_cpumask_var(mask);
>   }
> +
> +static inline void irq_thread_set_affinity(struct task_struct *t,
> +					   struct irq_desc *desc)
> +{
> +	const struct cpumask *mask;
> +
> +	if (cpumask_available(desc->irq_common_data.affinity))
> +		mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +	else
> +		mask = cpu_possible_mask;
> +
> +	kthread_bind_mask(t, mask);
> +}

This function seems to mirror what is done in
irq_thread_check_affinity() when the affinity cpumask is available.
But if affinity isn't defined, it will make this irq kthread immune
to changes in the set of isolated CPUs. Should we use the
IRQD_AFFINITY_SET flag to check whether an affinity has been set, and
set PF_NO_SETAFFINITY only in that case?
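
Something like the following, as an illustrative sketch only
(irqd_affinity_was_set() is the existing accessor for the
IRQD_AFFINITY_SET flag; whether to bind at all in the else case is
exactly the open question):

	static inline void irq_thread_set_affinity(struct task_struct *t,
						   struct irq_desc *desc)
	{
		/*
		 * Pin the thread (which also sets PF_NO_SETAFFINITY)
		 * only when an affinity was explicitly set...
		 */
		if (irqd_affinity_was_set(&desc->irq_data)) {
			const struct cpumask *mask;

			mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
			kthread_bind_mask(t, mask);
		}
		/*
		 * ...otherwise leave the thread movable, so it keeps
		 * following isolated partition updates.
		 */
	}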

Cheers,
Longman

Re: [PATCH] genirq: Fix IRQ threads affinity VS cpuset isolated partitions
Posted by Frederic Weisbecker 1 month ago
On Mon, Nov 10, 2025 at 04:28:49PM -0500, Waiman Long wrote:
> [...]
> 
> This function seems to mirror what is done in irq_thread_check_affinity()
> when the affinity cpumask is available. But if affinity isn't defined, it
> will make this irq kthread immune to changes in the set of isolated CPUs.
> Should we use the IRQD_AFFINITY_SET flag to check whether an affinity has
> been set, and set PF_NO_SETAFFINITY only in that case?

Oh, and if IRQD_AFFINITY_SET hasn't been set, then cpuset shouldn't
touch the kthread affinity directly either; it should instead
reaffine the whole IRQ, so that both the vector and its kthreads are
moved away, and not just the kthreads.
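
For example (a rough sketch only; the IRQ iteration and the choice of
the housekeeping mask are illustrative, not a proposal of the actual
interface):

	unsigned int irq;

	/*
	 * Move the whole IRQ (vector + kthread) off the newly
	 * isolated CPUs instead of only the kthread.
	 */
	for_each_active_irq(irq) {
		struct irq_desc *desc = irq_to_desc(irq);

		if (desc && !irqd_affinity_was_set(&desc->irq_data))
			irq_set_affinity(irq, housekeeping_cpumask(HK_TYPE_DOMAIN));
	}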

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH] genirq: Fix IRQ threads affinity VS cpuset isolated partitions
Posted by Frederic Weisbecker 1 month ago
On Mon, Nov 10, 2025 at 04:28:49PM -0500, Waiman Long wrote:
> On 11/5/25 8:17 AM, Frederic Weisbecker wrote:
> > [...]
> > +static inline void irq_thread_set_affinity(struct task_struct *t,
> > +					   struct irq_desc *desc)
> > +{
> > +	const struct cpumask *mask;
> > +
> > +	if (cpumask_available(desc->irq_common_data.affinity))
> > +		mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
> > +	else
> > +		mask = cpu_possible_mask;
> > +
> > +	kthread_bind_mask(t, mask);
> > +}
> 
> This function seems to mirror what is done in irq_thread_check_affinity()
> when the affinity cpumask is available. But if affinity isn't defined, it
> will make this irq kthread immune to changes in the set of isolated CPUs.
> Should we use the IRQD_AFFINITY_SET flag to check whether an affinity has
> been set, and set PF_NO_SETAFFINITY only in that case?

So IIUC, the cpumask_available() failure can't really happen, because an
allocation failure would make irq_alloc_descs() fail:

__irq_alloc_descs() -> alloc_descs() -> alloc_desc() -> init_desc() -> alloc_mask()

The error doesn't seem to be handled as well in early_irq_init(), but the
desc is freed anyway if that happens.

So this is just a sanity check at best.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH] genirq: Fix IRQ threads affinity VS cpuset isolated partitions
Posted by Thomas Gleixner 1 month ago
On Wed, Nov 12 2025 at 13:56, Frederic Weisbecker wrote:
> On Mon, Nov 10, 2025 at 04:28:49PM -0500, Waiman Long wrote:
>> This function seems to mirror what is done in irq_thread_check_affinity()
>> when the affinity cpumask is available. But if affinity isn't defined, it
>> will make this irq kthread immune to changes in the set of isolated CPUs.
>> Should we use the IRQD_AFFINITY_SET flag to check whether an affinity has
>> been set, and set PF_NO_SETAFFINITY only in that case?
>
> So IIUC, the cpumask_available() failure can't really happen because an allocation
> failure would make irq_alloc_descs() fail.

That's indeed a historical leftover.

> __irq_alloc_descs() -> alloc_descs() -> alloc_desc() -> init_desc() -> alloc_mask()
>
> The error doesn't seem to be handled as well in early_irq_init(), but the
> desc is freed anyway if that happens.

Right, the insert should only happen when desc != NULL. OTOH if it fails
at that stage the kernel won't get far anyway and definitely not to the
point where these cpumasks are checked :)

> So this is just a sanity check at best.

I think we can just remove it. It does not make sense at all.
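
With the check gone, the helper would reduce to something like:

	static inline void irq_thread_set_affinity(struct task_struct *t,
						   struct irq_desc *desc)
	{
		const struct cpumask *mask;

		mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
		kthread_bind_mask(t, mask);
	}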

Thanks,

        tglx