net/mlx5: Update mlx5_irq.mask when IRQ affinity changes

[PATCH] net/mlx5: Update mlx5_irq.mask when IRQ affinity changes

Posted by Yi Li 4 weeks, 1 day ago

mlx5_irq.mask is used for:
1) Setting IRQ affinity_hint in mlx5_irq_alloc()
2) Determining mlx5e_channel.cpu in mlx5e_open_channel(), which in turn
   decides the NUMA node for queue allocations.

When a user modifies IRQ affinity, mlx5_irq.mask remains unchanged.
Consequently, even if mlx5e_open_channel() is invoked again, queues are
still allocated on the original NUMA node instead of the newly
preferred one.

Fix this by registering a notifier with irq_set_affinity_notifier() that
updates mlx5_irq.mask whenever /proc/irq/N/smp_affinity is modified, so
that subsequent queue allocations reflect the updated affinity.

Signed-off-by: Yi Li <liyi@hygon.cn>
---
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c  | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index e051b9a939ee..501496159aa2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -35,6 +35,7 @@ struct mlx5_irq {
 	int refcount;
 	struct msi_map map;
 	u32 pool_index;
+	struct irq_affinity_notify af_notify;
 };
 
 struct mlx5_irq_table {
@@ -158,6 +159,9 @@ static void mlx5_system_free_irq(struct mlx5_irq *irq)
 	struct cpu_rmap *rmap;
 #endif
 
+	if (irq->af_notify.notify)
+		irq_set_affinity_notifier(irq->map.virq, NULL);
+
 	/* free_irq requires that affinity_hint and rmap will be cleared before
 	 * calling it. To satisfy this requirement, we call
 	 * irq_cpu_rmap_remove() to remove the notifier
@@ -252,6 +256,16 @@ static void irq_set_name(struct mlx5_irq_pool *pool, char *name, int vecidx)
 	snprintf(name, MLX5_MAX_IRQ_NAME, "mlx5_comp%d", vecidx);
 }
 
+static void mlx5_irq_affinity_changed(struct irq_affinity_notify *notify,
+				      const cpumask_t *mask)
+{
+	struct mlx5_irq *irq = container_of(notify, struct mlx5_irq, af_notify);
+
+	cpumask_copy(irq->mask, mask);
+}
+
+static void mlx5_irq_affinity_notifier_release(struct kref *ref) {}
+
 struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool *pool, int i,
 				struct irq_affinity_desc *af_desc,
 				struct cpu_rmap **rmap)
@@ -307,6 +321,10 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool *pool, int i,
 	if (af_desc) {
 		cpumask_copy(irq->mask, &af_desc->mask);
 		irq_set_affinity_and_hint(irq->map.virq, irq->mask);
+
+		irq->af_notify.notify  = mlx5_irq_affinity_changed;
+		irq->af_notify.release = mlx5_irq_affinity_notifier_release;
+		irq_set_affinity_notifier(irq->map.virq, &irq->af_notify);
 	}
 	irq->pool = pool;
 	irq->refcount = 1;
-- 
2.53.0
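
The callback in the patch above recovers its enclosing mlx5_irq from the embedded notifier with container_of(). As a rough illustration of that pattern only, here is a minimal userspace sketch. It is not kernel code: struct affinity_notify, struct demo_irq, and all demo_* functions are hypothetical stand-ins, the CPU mask is reduced to a 64-bit bitmap, and the kref/release handling of the real struct irq_affinity_notify is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified notifier. The kernel's struct
 * irq_affinity_notify also carries an irq number, a kref, a work item
 * and a release callback, all omitted here.
 */
struct affinity_notify {
	void (*notify)(struct affinity_notify *n, uint64_t mask);
};

/* Hypothetical stand-in for struct mlx5_irq; the mask is a plain bitmap. */
struct demo_irq {
	uint64_t mask;
	struct affinity_notify af_notify;
};

/* Callback: map the embedded notifier back to its owner, then copy the mask. */
static void demo_affinity_changed(struct affinity_notify *n, uint64_t mask)
{
	struct demo_irq *irq = container_of(n, struct demo_irq, af_notify);

	irq->mask = mask;
}

/* Set the initial mask and hook up the callback. */
static void demo_irq_init(struct demo_irq *irq, uint64_t mask)
{
	assert(irq);
	irq->mask = mask;
	irq->af_notify.notify = demo_affinity_changed;
}

/* Models the IRQ core invoking the registered notifier after the
 * affinity was changed. This is an assumption made for illustration;
 * the kernel defers the call to a workqueue.
 */
static void demo_core_set_affinity(struct affinity_notify *n, uint64_t mask)
{
	if (n && n->notify)
		n->notify(n, mask);
}
```

In this sketch, calling demo_irq_init(&irq, 0x0f) followed by demo_core_set_affinity(&irq.af_notify, 0xf0) leaves irq.mask equal to 0xf0, which is the behaviour the patch aims for with mlx5_irq.mask.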

Re: [PATCH] net/mlx5: Update mlx5_irq.mask when IRQ affinity changes

Posted by Tariq Toukan 3 weeks, 5 days ago

On 14/05/2026 10:42, Yi Li wrote:
> mlx5_irq.mask is used for:
> 1) Setting IRQ affinity_hint in mlx5_irq_alloc()
> 2) Determining mlx5e_channel.cpu in mlx5e_open_channel(), which in turn
>     decides the NUMA node for queue allocations.
> 
> When a user modifies IRQ affinity, mlx5_irq.mask remains unchanged.
> Consequently even if mlx5e_open_channel() is invoked again, queues are
> still allocated on the original NUMA node instead of the newly
> preferred one.
> 
> Fix this by registering an irq_set_affinity_notifier to update
> mlx5_irq.mask when /proc/irq/N/smp_affinity is modified.
> Therefore subsequent queue allocations reflect the updated affinity.
> 
> Signed-off-by: Yi Li <liyi@hygon.cn>
> ---

Hi,

Thanks for the patch. Looking at the proposal, I want to discuss two 
distinct aspects:

NAPI Execution Location: We already track effective affinity closely 
through irq_get_effective_affinity_mask(); NAPI processing is 
dynamically moved accordingly, including a forced NAPI cycle break if 
needed.

Memory Allocation Location: This is the core focus of your patch.

I have serious comments on the proposed implementation, but first I want 
to discuss the idea.

We investigated a similar approach a few years ago but ultimately 
decided against upstreaming it due to stability concerns:

High Volatility: The "current affinity" value can be extremely dynamic, 
potentially shifting multiple times per second depending on system load 
and tuning.

Performance Risk: Sampling a highly volatile "current" value to allocate 
relatively permanent resources (like channel queues) risks severe 
worst-case performance regressions if the affinity shifts immediately 
after allocation.

Because of this, we have historically relied on numa-distance logic for 
channel allocations to ensure a predictable baseline.

Do you have any benchmark data or specific use cases showing a clear net 
benefit over the existing numa-distance logic?

Best regards,
Tariq

Re: [PATCH] net/mlx5: Update mlx5_irq.mask when IRQ affinity changes

Posted by Yi Li 3 weeks, 2 days ago

Hi Tariq,

Thanks for the feedback.

On 5/17/2026 4:58 PM, Tariq Toukan wrote:

> 
> Hi,
> 
> Thanks for the patch. Looking at the proposal, I want to discuss two distinct aspects:
> 
> NAPI Execution Location: We already track effective affinity closely through irq_get_effective_affinity_mask(); NAPI processing is dynamically moved accordingly, including a forced NAPI cycle break if needed.
> 

Right. I also see the driver flushes the page_pool alloc cache via page_pool_nid_changed() in mlx5e_post_rx_wqes(),
so RX data buffers follow the NAPI CPU as well.

> Memory Allocation Location: This is the core focus of your patch.
> 

If I understand correctly, the current design keeps the page_pool cache
near the NAPI CPU, and the channel queues near the NIC based on numa-distance.
Please correct me if that's wrong.

> I have serious comments on the proposed implementation, but first I want to discuss the idea.
> 
> We investigated a similar approach a few years ago but ultimately decided against upstreaming it due to stability concerns:
> 
> High Volatility: The "current affinity" value can be extremely dynamic, potentially shifting multiple times per second depending on system load and tuning.
> 
> Performance Risk: Sampling a highly volatile "current" value to allocate relatively permanent resources (like channel queues) risks severe worst-case performance regressions if the affinity shifts immediately after allocation.
> 

I agree with the concern. However, page_pool already checks for a node ID change on the hot path.
Would sampling the affinity only in mlx5e_open_channel() avoid most of the risk?

> Because of this, we have historically relied on numa-distance logic for channel allocations to ensure a predictable baseline.
> 
> Do you have any benchmark data or specific use cases showing a clear net benefit over the existing numa-distance logic?
> 

Honestly, my Nginx test showed no measurable throughput change.

It's more of a functional issue I ran into. My setup:
  - 6 NUMA nodes, 32 SMT cores each
  - ConnectX-6 on node0
  - node1 and node2 equidistant from node0
  - 63 combined queues
  - default spread: 32 IRQs on node0, 16 on node1, 16 on node2

I moved the 16 IRQs from node2 to node1 via smp_affinity. The IRQs
followed, but the queues stayed on node2, and I observed a lot of
cross-node traffic to/from node2. Nginx wasn't affected -- the
bottleneck is elsewhere in my workload.

Do you think it would be OK to add an option so users can change the queue location?
I'm not sure how common this need is in production, so I'd appreciate your thoughts.

Also, I agree irq_get_effective_affinity_mask() is better than the IRQ notifier:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b6c12460b54a..073239082144 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2769,6 +2769,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 {
        struct net_device *netdev = priv->netdev;
        struct mlx5e_channel_param *cparam;
+       const struct cpumask *eff_mask;
        struct mlx5_core_dev *mdev;
        struct mlx5e_xsk_param xsk;
        bool async_icosq_needed;
@@ -2786,6 +2787,10 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
        if (err)
                return err;

+       eff_mask = irq_get_effective_affinity_mask(irq);
+       if (eff_mask)
+               cpu = cpumask_first(eff_mask);
+
        err = mlx5e_channel_stats_alloc(priv, ix, cpu);
        if (err)
                return err;
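
The selection logic in the diff above (take the first CPU of the effective mask, otherwise keep the default) can be sketched in userspace as follows. This is only an illustration under stated assumptions: the demo_* names, the 64-bit bitmap, and the 32-CPUs-per-node layout are hypothetical and do not describe the setup in this thread. The sketch also falls back to the default for an empty mask, which the diff does not check; in the kernel, cpumask_first() returns a value >= nr_cpu_ids in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_CPUS		64	/* hypothetical: mask fits in one u64 */
#define DEMO_CPUS_PER_NODE	32	/* hypothetical NUMA layout */

/* Lowest set bit of the mask, or DEMO_NR_CPUS if the mask is empty. */
static unsigned int demo_mask_first(uint64_t mask)
{
	unsigned int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return DEMO_NR_CPUS;
}

/* Prefer the first CPU of the effective mask; keep the default when the
 * mask is missing or empty.
 */
static unsigned int demo_pick_cpu(const uint64_t *eff_mask,
				  unsigned int default_cpu)
{
	unsigned int cpu;

	assert(default_cpu < DEMO_NR_CPUS);
	if (!eff_mask)
		return default_cpu;

	cpu = demo_mask_first(*eff_mask);
	return cpu < DEMO_NR_CPUS ? cpu : default_cpu;
}

/* Hypothetical CPU-to-node mapping: CPUs are numbered node by node. */
static unsigned int demo_cpu_to_node(unsigned int cpu)
{
	return cpu / DEMO_CPUS_PER_NODE;
}
```

With an effective mask containing only CPU 40, demo_pick_cpu() returns 40 and demo_cpu_to_node() maps it to node 1 under the assumed layout; with a NULL or empty mask, the default CPU is kept.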

> Best regards,
> Tariq
> 
> 

Thanks,
-Yi