[PATCH] x86/irq: do not release irq until all cleanup is done

Roger Pau Monne posted 1 patch 3 weeks, 2 days ago
Test gitlab-ci passed
Patches applied successfully
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20221115123500.97114-1-roger.pau@citrix.com
There is a newer version of this series
[PATCH] x86/irq: do not release irq until all cleanup is done
Posted by Roger Pau Monne 3 weeks, 2 days ago
Current code in _clear_irq_vector() will mark the irq as unused before
doing the cleanup required when move_in_progress is true.

This can lead to races in create_irq() if the function picks an irq
desc that's been marked as unused but has move_in_progress set, as the
call to assign_irq_vector() in that function can then fail with
-EAGAIN.

Prevent that by only marking irq descs as unused once all the cleanup
has been done.  While there, also use write_atomic() when setting
IRQ_UNUSED in _clear_irq_vector().

The check for move_in_progress cannot be removed from
_assign_irq_vector(), as other users (io_apic_set_pci_routing() and
ioapic_guest_write()) can still pass active irq descs to
assign_irq_vector().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I've only observed this race when using vPCI with a PVH dom0, so I
assume other users of _{clear,assign}_irq_vector() are likely already
serialized by other external locks.

The path that triggered this race is using
allocate_and_map_msi_pirq(), which will sadly drop the returned error
code from create_irq() and just return EINVAL.  That makes figuring
out the issue more complicated, but fixing that error path should be
done in a separate commit.
---
 xen/arch/x86/irq.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index cd0c8a30a8..15a78a44da 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -220,27 +220,27 @@ static void _clear_irq_vector(struct irq_desc *desc)
         clear_bit(vector, desc->arch.used_vectors);
     }
 
-    desc->arch.used = IRQ_UNUSED;
-
-    trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
+    if ( unlikely(desc->arch.move_in_progress) )
+    {
+        /* If we were in motion, also clear desc->arch.old_vector */
+        old_vector = desc->arch.old_vector;
+        cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
 
-    if ( likely(!desc->arch.move_in_progress) )
-        return;
+        for_each_cpu(cpu, tmp_mask)
+        {
+            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
+            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        }
 
-    /* If we were in motion, also clear desc->arch.old_vector */
-    old_vector = desc->arch.old_vector;
-    cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
+        release_old_vec(desc);
 
-    for_each_cpu(cpu, tmp_mask)
-    {
-        ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
-        TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
-        per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        desc->arch.move_in_progress = 0;
     }
 
-    release_old_vec(desc);
+    write_atomic(&desc->arch.used, IRQ_UNUSED);
 
-    desc->arch.move_in_progress = 0;
+    trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
 }
 
 void __init clear_irq_vector(int irq)
-- 
2.37.3


Re: [PATCH] x86/irq: do not release irq until all cleanup is done
Posted by Jan Beulich 3 weeks, 2 days ago
On 15.11.2022 13:35, Roger Pau Monne wrote:
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -220,27 +220,27 @@ static void _clear_irq_vector(struct irq_desc *desc)
>          clear_bit(vector, desc->arch.used_vectors);
>      }
>  
> -    desc->arch.used = IRQ_UNUSED;
> -
> -    trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);

The use of tmp_mask here was ...

> +    if ( unlikely(desc->arch.move_in_progress) )
> +    {
> +        /* If we were in motion, also clear desc->arch.old_vector */
> +        old_vector = desc->arch.old_vector;
> +        cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);

... before the variable was updated here. Now you move it past this
update (to the very end of the function). I wonder whether, to keep
things simple in this change, it would be tolerable to leave the
trace_irq_mask() invocation where it was, despite cleanup not having
completed yet at that point. (The alternative would look to be to
act directly on desc->arch.old_cpu_mask in the code you re-indent,
leaving tmp_mask alone. I think that ought to be okay since nothing
else should access that mask anymore. We could even avoid altering
the field, by going from cpumask_and() to a cpu_online() check in
the for_each_cpu() body.)

>  
> -    if ( likely(!desc->arch.move_in_progress) )
> -        return;
> +        for_each_cpu(cpu, tmp_mask)
> +        {
> +            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> +            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> +            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> +        }
>  
> -    /* If we were in motion, also clear desc->arch.old_vector */
> -    old_vector = desc->arch.old_vector;
> -    cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
> +        release_old_vec(desc);
>  
> -    for_each_cpu(cpu, tmp_mask)
> -    {
> -        ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> -        TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> -        per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> +        desc->arch.move_in_progress = 0;
>      }
>  
> -    release_old_vec(desc);
> +    write_atomic(&desc->arch.used, IRQ_UNUSED);

Now that there is an ordering requirement between these last two writes,
I think you want to add smp_wmb() after the first one (still inside the
inner scope). write_atomic() is only a volatile asm() (which the compiler
may move under certain conditions) and an access through a volatile
pointer (which isn't ordered with non-volatile memory accesses), but it
is specifically not a memory barrier.

Jan

> -    desc->arch.move_in_progress = 0;
> +    trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
>  }
>  
>  void __init clear_irq_vector(int irq)
Re: [PATCH] x86/irq: do not release irq until all cleanup is done
Posted by Roger Pau Monné 3 weeks, 2 days ago
On Tue, Nov 15, 2022 at 05:29:47PM +0100, Jan Beulich wrote:
> On 15.11.2022 13:35, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -220,27 +220,27 @@ static void _clear_irq_vector(struct irq_desc *desc)
> >          clear_bit(vector, desc->arch.used_vectors);
> >      }
> >  
> > -    desc->arch.used = IRQ_UNUSED;
> > -
> > -    trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
> 
> The use of tmp_mask here was ...
> 
> > +    if ( unlikely(desc->arch.move_in_progress) )
> > +    {
> > +        /* If we were in motion, also clear desc->arch.old_vector */
> > +        old_vector = desc->arch.old_vector;
> > +        cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
> 
> .. before the variable was updated here. Now you move it past this
> update (to the very end of the function). I wonder whether, to keep
> things simple in this change, it would be tolerable to leave the
> trace_irq_mask() invocation where it was, despite cleanup not having
> completed yet at that point. (The alternative would look to be to
> act directly on desc->arch.old_cpu_mask in the code you re-indent,
> leaving tmp_mask alone. I think that ought to be okay since nothing
> else should access that mask anymore. We could even avoid altering
> the field, by going from cpumask_and() to a cpu_online() check in
> the for_each_cpu() body.)

Hm, I was thinking we should print the old vector mask (since that's
the one still in use, because the migration hasn't happened yet), but
then we would also need to print the old vector instead of the new
one; maybe that's too much change given the current behavior.

I think it's fine to emit the trace here, before IRQ_UNUSED gets set
(in fact it might be better, as the trace should then be more
coherent).

> >  
> > -    if ( likely(!desc->arch.move_in_progress) )
> > -        return;
> > +        for_each_cpu(cpu, tmp_mask)
> > +        {
> > +            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> > +            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> > +            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> > +        }
> >  
> > -    /* If we were in motion, also clear desc->arch.old_vector */
> > -    old_vector = desc->arch.old_vector;
> > -    cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
> > +        release_old_vec(desc);
> >  
> > -    for_each_cpu(cpu, tmp_mask)
> > -    {
> > -        ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> > -        TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> > -        per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> > +        desc->arch.move_in_progress = 0;
> >      }
> >  
> > -    release_old_vec(desc);
> > +    write_atomic(&desc->arch.used, IRQ_UNUSED);
> 
> Now that there is an ordering requirement between these last two writes,
> I think you want to add smp_wmb() after the first one (still inside the
> inner scope). write_atomic() is only a volatile asm() (which the compiler
> may move under certain conditions) and an access through a volatile
> pointer (which isn't ordered with non-volatile memory accesses), but it
> is specifically not a memory barrier.

Right, sorry - I always get confused and assume that a volatile asm
can't be reordered.  I was about to add the write barrier but didn't
do it because of the volatile in the asm.

I would like to request a backport for this, but I think it's now too
late for the patch to be accepted to 4.17.

Thanks, Roger.
Re: [PATCH] x86/irq: do not release irq until all cleanup is done
Posted by Jan Beulich 3 weeks, 1 day ago
On 15.11.2022 18:04, Roger Pau Monné wrote:
> I would like to request a backport for this, but I think it's now too
> late for the patch to be accepted to 4.17.

Yes, I certainly was intending to queue this up once it went in. I agree
it's not critical enough to still slip into 4.17.0, though.

Jan