[PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask

Costa Shulyupin posted 7 patches 1 year, 7 months ago
Posted by Costa Shulyupin 1 year, 7 months ago
irq_affinity_adjust() is prototyped from irq_affinity_online_cpu()
and irq_restore_affinity_of_irq().

Core test snippets, without the surrounding test infrastructure:

1. Create a managed IRQ on a specific CPU:

#include <linux/interrupt.h>
#include <linux/irq.h>
#include <linux/module.h>

/* CPU the test IRQ is pinned to; the value is arbitrary, pick a CPU
 * that will be isolated in step 2. */
static unsigned int test_cpu = 2;

static irqreturn_t test_irq_cb(int irq, void *dev_id)
{
	return IRQ_HANDLED;
}

static int test_set_affinity(struct irq_data *data,
			     const struct cpumask *m, bool f)
{
	/* Test chip: simply record the requested affinity as effective. */
	irq_data_update_effective_affinity(data, m);
	return 0;
}

static int make_test_irq(void)
{
	struct irq_affinity_desc a = { .mask = *cpumask_of(test_cpu),
				       .is_managed = true };
	static struct irq_chip test_chip = { .irq_set_affinity = test_set_affinity };
	int test_irq = __irq_alloc_descs(-1, 1, 1, 0, THIS_MODULE, &a);

	if (test_irq < 0)
		return test_irq;

	irq_set_chip(test_irq, &test_chip);
	irq_set_status_flags(test_irq, IRQ_MOVE_PCNTXT);
	request_irq(test_irq, test_irq_cb, 0, "test_affinity", NULL);

	return test_irq;
}
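
For completeness, a minimal module init hook to load the snippet with
(hypothetical glue, not part of the patch):

static int test_irq = -1;

static int __init test_affinity_init(void)
{
	test_irq = make_test_irq();
	if (test_irq < 0)
		return test_irq;
	pr_info("test_affinity: irq %d pinned to cpu %u\n", test_irq, test_cpu);
	return 0;
}
module_init(test_affinity_init);

MODULE_LICENSE("GPL");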

2. Isolate the specified CPU (one way to drive this step is sketched
after step 3).

3. Verify that test_irq is no longer affine to test_cpu:

cat /proc/irq/$test_irq/smp_affinity_list
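
A possible way to drive steps 2 and 3 end to end, assuming cgroup v2
cpusets mounted at /sys/fs/cgroup, test_cpu=2, and a hypothetical
cgroup named "isol" (an isolated cpuset partition is what triggers the
housekeeping update in this series):

test_irq=<value returned by make_test_irq()>
mkdir /sys/fs/cgroup/isol
echo 2 > /sys/fs/cgroup/isol/cpuset.cpus
echo isolated > /sys/fs/cgroup/isol/cpuset.cpus.partition
cat /proc/irq/$test_irq/smp_affinity_list   # CPU 2 must not appear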

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 kernel/cgroup/cpuset.c   |  3 ++-
 kernel/sched/isolation.c | 44 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9d01e8e0a3ed9..2e59a2983eb2e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -225,7 +225,8 @@ static struct list_head remote_children;
 /*
  * The set of housekeeping flags to be updated for CPU isolation
  */
-#define	HOUSEKEEPING_FLAGS	(BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU))
+#define	HOUSEKEEPING_FLAGS	(BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU) \
+		| BIT(HK_TYPE_MANAGED_IRQ))
 
 /*
  * Partition root states:
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 85a17d39d8bb0..b0503ed362fce 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -135,6 +135,43 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
 	}
 }
 
+static int irq_affinity_adjust(cpumask_var_t disable_mask)
+{
+	unsigned int irq;
+	cpumask_var_t mask;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	irq_lock_sparse();
+	for_each_active_irq(irq) {
+		struct irq_desc *desc = irq_to_desc(irq);
+
+		raw_spin_lock_irq(&desc->lock);
+		struct irq_data *data = irq_desc_get_irq_data(desc);
+
+		if (irqd_affinity_is_managed(data) && cpumask_weight_and(disable_mask,
+			irq_data_get_affinity_mask(data))) {
+
+			cpumask_and(mask, cpu_online_mask, irq_default_affinity);
+			cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+			irq_set_affinity_locked(data, mask, true);
+			WARN_ON(cpumask_weight_and(irq_data_get_effective_affinity_mask(data),
+						disable_mask));
+			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
+						cpu_online_mask));
+			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
+						housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)));
+		}
+		raw_spin_unlock_irq(&desc->lock);
+	}
+	irq_unlock_sparse();
+
+	free_cpumask_var(mask);
+
+	return 0;
+}
+
 /*
  * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
  * change.
@@ -144,6 +181,8 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
  */
 static int housekeeping_update(enum hk_type type, cpumask_var_t update)
 {
+	int err = 0;
+
 	struct {
 		struct cpumask changed;
 		struct cpumask enable;
@@ -171,11 +210,14 @@ static int housekeeping_update(enum hk_type type, cpumask_var_t update)
 		lockup_detector_reconfigure();
 #endif
 		break;
+	case HK_TYPE_MANAGED_IRQ:
+		err = irq_affinity_adjust(&masks->disable);
+		break;
 	default:
 	}
 	kfree(masks);
 
-	return 0;
+	return err;
 }
 
 static int __init housekeeping_setup(char *str, unsigned long flags)
-- 
2.45.0
Re: [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
Posted by Thomas Gleixner 1 year, 7 months ago
On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> irq_affinity_adjust() is prototyped from irq_affinity_online_cpu()
> and irq_restore_affinity_of_irq().

I'm used to this prototyped phrase by now. It still does not justify
exposing me to this POC hackery.

My previous comments about change logs still apply.

> +static int irq_affinity_adjust(cpumask_var_t disable_mask)
> +{
> +	unsigned int irq;
> +	cpumask_var_t mask;
> +
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	irq_lock_sparse();
> +	for_each_active_irq(irq) {
> +		struct irq_desc *desc = irq_to_desc(irq);
> +
> +		raw_spin_lock_irq(&desc->lock);

That's simply broken. This is not CPU hotplug on an outgoing CPU. Why
are you assuming that your isolation change code can rely on the
implicit guarantees of CPU hot(un)plug?

Also there is a reason why interrupt related code is in kernel/irq/* and
not in some random other location. Even if C allows you to fiddle with
everything that does not mean that hiding random hacks in other places
is correct in any way.

> +		struct irq_data *data = irq_desc_get_irq_data(desc);
> +
> +		if (irqd_affinity_is_managed(data) && cpumask_weight_and(disable_mask,
> +			irq_data_get_affinity_mask(data))) {

Interrupt target isolation is only relevant for managed interrupts, and
non-managed interrupts clearly are going to migrate themselves away
magically, right?

> +
> +			cpumask_and(mask, cpu_online_mask, irq_default_affinity);
> +			cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));

There are clearly a lot of comments explaining what this is doing and
why it is correct as there is a guarantee that these masks overlap by
definition.

> +			irq_set_affinity_locked(data, mask, true);

Plus the extensive explanation of why using 'force=true' is even remotely
correct here.

I concede that the documentation of that function and its arguments is
close to non-existent, but if you follow the call chain of that function
there are enough hints down the road, no?

> +			WARN_ON(cpumask_weight_and(irq_data_get_effective_affinity_mask(data),
> +						disable_mask));
> +			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
> +						cpu_online_mask));
> +			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
> +						housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)));

These warnings are required and useful inside the spinlock-held,
interrupt-disabled section because of what?

 - Because the resulting stack trace provides a well known call chain

 - Because the resulting warnings do not tell anything about the
   affected interrupt number

 - Because the resulting warnings do not tell anything about the CPU
   masks which cause the problem

 - Because the aggregate information of the above is utterly useless

Impressive...
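
For contrast, a diagnostic that would actually carry the needed context
could look like this (a sketch only, emitted after dropping the lock):

pr_warn("irq %u: effective affinity %*pbl still intersects disable mask %*pbl\n",
	irq, cpumask_pr_args(irq_data_get_effective_affinity_mask(data)),
	cpumask_pr_args(disable_mask));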

Thanks,

       tglx
Re: [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
Posted by Thomas Gleixner 1 year, 7 months ago
Costa!

On Sat, May 18 2024 at 03:17, Thomas Gleixner wrote:
> Impressive...

Now let's take a step back because none of this makes any sense at all
conceptually.

Reconfiguring the housekeeping CPUs on a live system is expensive and a
slow-path operation no matter what.

So why inflict all of this nonsense on the kernel instead of
cleverly (ab)using CPU hotplug for it in user space:

	for_each_cpu(cpu, new_house_keeping_mask) {
		if (cpu_ishk(cpu))
			continue;
		cpu_offline(cpu);
		set_cpu_in_hkmask(cpu);
		cpu_online(cpu);
	}

	for_each_cpu(cpu, new_isolated_mask) {
		if (!cpu_ishk(cpu))
			continue;
		cpu_offline(cpu);
		clear_cpu_in_hkmask(cpu);
		cpu_online(cpu);
	}

Or something like that. You get the idea, right?

IOW, the only kernel change which is required to achieve your goal is to
ensure that changing the housekeeping/isolated property of a CPU at
runtime is only possible when the CPU is "offline".

Then all of the magic things you try to solve just work out of the box
because the existing and well exercised hotplug code takes care of it
already, no?
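
A sketch of that single guard, reusing the masks structure from
housekeeping_update() in this patch (exact placement hypothetical):

	/* Refuse to flip the housekeeping property of online CPUs;
	 * user space must offline them first so that the hotplug
	 * callbacks migrate timers, IRQs and queues as usual. */
	if (cpumask_intersects(&masks->changed, cpu_online_mask))
		return -EBUSY;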

I might be missing something obvious as always, so feel free to educate
me on it. 

Thanks,

        tglx
Re: [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
Posted by Costa Shulyupin 1 year, 6 months ago
Hello Thomas,

Thank you so much for the very valuable feedback!
I've tested your proposal to use hotplug for isolation.
It works well for timers, but managed IRQs are not ready for this scenario.

The current implementation was introduced by:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=11ea68f553e244851d15793a7fa33a97c46d8271

How do you suggest we proceed?

> So why inflict all of this nonsense on the kernel instead of
> cleverly (ab)using CPU hotplug for it in user space:
...


> Or something like that. You get the idea, right?
>
> IOW, the only kernel change which is required to achieve your goal is to
> ensure that changing the housekeeping/isolated property of a CPU at
> runtime is only possible when the CPU is "offline".
Re: [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
Posted by Thomas Gleixner 1 year, 6 months ago
On Mon, May 27 2024 at 15:21, Costa Shulyupin wrote:
> Thank you so much for the very valuable feedback!
> I've tested your proposal to use hotplug for isolation.
> It works well for timers, but managed IRQs are not ready for this scenario.
>
> How do you suggest we proceed?

That's non-trivial because it's not only the interrupt affinity.

Drivers also have their queues associated accordingly, and if you just
change the interrupt affinity under the hood, all the other
associations don't get updated and don't work.

That needs to be solved at some other level IMO.

Thanks,

        tglx
Re: [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
Posted by Costa Shulyupin 1 year, 3 months ago
> On Sat, 18 May 2024 at 04:25, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>     Reconfiguring the housekeeping CPUs on a live system is expensive and a
>     slow-path operation no matter what.
>
>     So why inflict all of this nonsense on the kernel instead of
>     cleverly (ab)using CPU hotplug for it in user space.

On Mon, 27 May 2024 at 18:31, Thomas Gleixner <tglx@linutronix.de> wrote:
> That's non-trivial because it's not only the interrupt affinity.
>
> Drivers also have their queues associated accordingly, and if you just
> change the interrupt affinity under the hood, all the other
> associations don't get updated and don't work.
>
> That needs to be solved at some other level IMO.

Hello Thomas,

Our telco clients run DPDK in OpenShift/Kubernetes containers.
DPDK requires isolated CPUs to run real-time processes.
Kubernetes manages the allocation of resources for containers.
Unfortunately, Kubernetes doesn't support dynamic CPU offlining/onlining:
https://github.com/kubernetes/kubernetes/issues/67500
and is not planning to support it.
Solving the issue at the application level appears to be even
less straightforward than addressing it at the kernel level.

What would you recommend?

Thanks,
Costa