[v2] x86/irq: do not release irq until all cleanup is done

[PATCH v2] x86/irq: do not release irq until all cleanup is done

Posted by Roger Pau Monne 3 years, 2 months ago

Current code in _clear_irq_vector() will mark the irq as unused before
doing the cleanup required when move_in_progress is true.

This can lead to races in create_irq() if the function picks an irq
desc that's been marked as unused but has move_in_progress set, as the
call to assign_irq_vector() in that function can then fail with
-EAGAIN.

Prevent that by only marking irq descs as unused when all the cleanup
has been done.  While there also use write_atomic() when setting
IRQ_UNUSED in _clear_irq_vector() and add a barrier in order to
prevent the setting of IRQ_UNUSED getting reordered by the compiler.

The check for move_in_progress cannot be removed from
_assign_irq_vector(), as other users (io_apic_set_pci_routing() and
ioapic_guest_write()) can still pass active irq descs to
assign_irq_vector().

Note the trace point is not moved and is now set before the irq is
marked as unused.  This is done so that the CPU mask provided in the
trace point is the one belonging to the current vector, not the old
one.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Leave the trace point at the same place (before freeing old
   vector).
 - Add a (write) barrier before the IRQ_UNUSED write.
---
I've only observed this race when using vPCI with a PVH dom0, so I
assume other users of _{clear,assign}_irq_vector() are likely to
already be mutually exclusive by means of other external locks.

The path that triggered this race is using
allocate_and_map_msi_pirq(), which will sadly drop the returned error
code from create_irq() and just return EINVAL, that makes figuring out
the issue more complicated, but fixing that error path should be
done in a separate commit.
---
 xen/arch/x86/irq.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index cd0c8a30a8..20150b1c7f 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -220,27 +220,28 @@ static void _clear_irq_vector(struct irq_desc *desc)
         clear_bit(vector, desc->arch.used_vectors);
     }
 
-    desc->arch.used = IRQ_UNUSED;
-
     trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
 
-    if ( likely(!desc->arch.move_in_progress) )
-        return;
+    if ( unlikely(desc->arch.move_in_progress) )
+    {
+        /* If we were in motion, also clear desc->arch.old_vector */
+        old_vector = desc->arch.old_vector;
+        cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
 
-    /* If we were in motion, also clear desc->arch.old_vector */
-    old_vector = desc->arch.old_vector;
-    cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
+        for_each_cpu(cpu, tmp_mask)
+        {
+            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
+            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        }
 
-    for_each_cpu(cpu, tmp_mask)
-    {
-        ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
-        TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
-        per_cpu(vector_irq, cpu)[old_vector] = ~irq;
-    }
+        release_old_vec(desc);
 
-    release_old_vec(desc);
+        desc->arch.move_in_progress = 0;
+    }
 
-    desc->arch.move_in_progress = 0;
+    smp_wmb();
+    write_atomic(&desc->arch.used, IRQ_UNUSED);
 }
 
 void __init clear_irq_vector(int irq)
-- 
2.37.3

Re: [PATCH v2] x86/irq: do not release irq until all cleanup is done

Posted by Jan Beulich 3 years, 2 months ago

On 16.11.2022 13:21, Roger Pau Monne wrote:
> Current code in _clear_irq_vector() will mark the irq as unused before
> doing the cleanup required when move_in_progress is true.
> 
> This can lead to races in create_irq() if the function picks an irq
> desc that's been marked as unused but has move_in_progress set, as the
> call to assign_irq_vector() in that function can then fail with
> -EAGAIN.
> 
> Prevent that by only marking irq descs as unused when all the cleanup
> has been done.  While there also use write_atomic() when setting
> IRQ_UNUSED in _clear_irq_vector() and add a barrier in order to
> prevent the setting of IRQ_UNUSED getting reordered by the compiler.
> 
> The check for move_in_progress cannot be removed from
> _assign_irq_vector(), as other users (io_apic_set_pci_routing() and
> ioapic_guest_write()) can still pass active irq descs to
> assign_irq_vector().
> 
> Note the trace point is not moved and is now set before the irq is
> marked as unused.  This is done so that the CPU mask provided in the
> trace point is the one belonging to the current vector, not the old
> one.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>