Current code in _clear_irq_vector() will mark the irq as unused before
doing the cleanup required when move_in_progress is true.
This can lead to races in create_irq() if the function picks an irq
desc that's been marked as unused but has move_in_progress set, as the
call to assign_irq_vector() in that function can then fail with
-EAGAIN.
Prevent that by only marking irq descs as unused once all the cleanup
has been done. While there, also use write_atomic() when setting
IRQ_UNUSED in _clear_irq_vector().
The check for move_in_progress cannot be removed from
_assign_irq_vector(), as other users (io_apic_set_pci_routing() and
ioapic_guest_write()) can still pass active irq descs to
assign_irq_vector().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I've only observed this race when using vPCI with a PVH dom0, so I
assume other users of _{clear,assign}_irq_vector() are likely to
already be mutually exclusive by means of other external locks.
The path that triggered this race uses allocate_and_map_msi_pirq(),
which sadly drops the error code returned by create_irq() and just
returns EINVAL; that makes figuring out the issue more complicated,
but fixing that error path should be done in a separate commit.
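
For illustration, the interleaving looks roughly like this (a
simplified sketch, not the actual Xen code: locking and the exact way
create_irq() reserves a descriptor are glossed over):

    /* CPU A: old _clear_irq_vector() tail */
    desc->arch.used = IRQ_UNUSED;     /* desc already looks free here... */
    /* ...but move_in_progress is still set and the old vector has not
     * been released yet. */

    /* CPU B: create_irq() scanning for a free descriptor */
    for ( irq = nr_irqs_gsi; irq < nr_irqs; irq++ )
        if ( irq_to_desc(irq)->arch.used == IRQ_UNUSED )
            break;                    /* picks the desc CPU A is still cleaning */

    ret = assign_irq_vector(irq, NULL);
    /* _assign_irq_vector() still sees move_in_progress set and fails
     * with -EAGAIN, so create_irq() fails spuriously. */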
---
xen/arch/x86/irq.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index cd0c8a30a8..15a78a44da 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -220,27 +220,27 @@ static void _clear_irq_vector(struct irq_desc *desc)
clear_bit(vector, desc->arch.used_vectors);
}
- desc->arch.used = IRQ_UNUSED;
-
- trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
+ if ( unlikely(desc->arch.move_in_progress) )
+ {
+ /* If we were in motion, also clear desc->arch.old_vector */
+ old_vector = desc->arch.old_vector;
+ cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
- if ( likely(!desc->arch.move_in_progress) )
- return;
+ for_each_cpu(cpu, tmp_mask)
+ {
+ ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
+ TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+ per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+ }
- /* If we were in motion, also clear desc->arch.old_vector */
- old_vector = desc->arch.old_vector;
- cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
+ release_old_vec(desc);
- for_each_cpu(cpu, tmp_mask)
- {
- ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
- TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
- per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+ desc->arch.move_in_progress = 0;
}
- release_old_vec(desc);
+ write_atomic(&desc->arch.used, IRQ_UNUSED);
- desc->arch.move_in_progress = 0;
+ trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
}
void __init clear_irq_vector(int irq)
--
2.37.3
On 15.11.2022 13:35, Roger Pau Monne wrote:
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -220,27 +220,27 @@ static void _clear_irq_vector(struct irq_desc *desc)
> clear_bit(vector, desc->arch.used_vectors);
> }
>
> - desc->arch.used = IRQ_UNUSED;
> -
> - trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);

The use of tmp_mask here was ...

> + if ( unlikely(desc->arch.move_in_progress) )
> + {
> + /* If we were in motion, also clear desc->arch.old_vector */
> + old_vector = desc->arch.old_vector;
> + cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);

.. before the variable was updated here. Now you move it past this
update (to the very end of the function). I wonder whether, to keep
things simple in this change, it would be tolerable to leave the
trace_irq_mask() invocation where it was, despite cleanup not having
completed yet at that point. (The alternative would look to be to act
directly on desc->arch.old_cpu_mask in the code you re-indent, leaving
tmp_mask alone. I think that ought to be okay since nothing else should
access that mask anymore. We could even avoid altering the field, by
going from cpumask_and() to a cpu_online() check in the for_each_cpu()
body.)

>
> - if ( likely(!desc->arch.move_in_progress) )
> - return;
> + for_each_cpu(cpu, tmp_mask)
> + {
> + ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> + TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> + per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> + }
>
> - /* If we were in motion, also clear desc->arch.old_vector */
> - old_vector = desc->arch.old_vector;
> - cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
> + release_old_vec(desc);
>
> - for_each_cpu(cpu, tmp_mask)
> - {
> - ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> - TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> - per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> + desc->arch.move_in_progress = 0;
> }
>
> - release_old_vec(desc);
> + write_atomic(&desc->arch.used, IRQ_UNUSED);

Now that there is an ordering requirement between these last two writes,
I think you want to add smp_wmb() after the first one (still inside the
inner scope). write_atomic() is only a volatile asm() (which the
compiler may move under certain conditions) and an access through a
volatile pointer (which isn't ordered with non-volatile memory
accesses), but it is specifically not a memory barrier.

Jan

> - desc->arch.move_in_progress = 0;
> + trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
> }
>
> void __init clear_irq_vector(int irq)
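
To make the ordering point concrete, the placement Jan is asking for
would be along these lines (an illustrative sketch only; the smp_wmb()
pairs with whatever read side observes IRQ_UNUSED, e.g. the descriptor
scan in create_irq()):

        release_old_vec(desc);

        desc->arch.move_in_progress = 0;
        smp_wmb();  /* cleanup stores must be visible before the desc
                     * can be observed as free */
    }

    write_atomic(&desc->arch.used, IRQ_UNUSED);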
On Tue, Nov 15, 2022 at 05:29:47PM +0100, Jan Beulich wrote:
> On 15.11.2022 13:35, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -220,27 +220,27 @@ static void _clear_irq_vector(struct irq_desc *desc)
> > clear_bit(vector, desc->arch.used_vectors);
> > }
> >
> > - desc->arch.used = IRQ_UNUSED;
> > -
> > - trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);
>
> The use of tmp_mask here was ...
>
> > + if ( unlikely(desc->arch.move_in_progress) )
> > + {
> > + /* If we were in motion, also clear desc->arch.old_vector */
> > + old_vector = desc->arch.old_vector;
> > + cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
>
> .. before the variable was updated here. Now you move it past this
> update (to the very end of the function). I wonder whether, to keep
> things simple in this change, it would be tolerable to leave the
> trace_irq_mask() invocation where it was, despite cleanup not having
> completed yet at that point. (The alternative would look to be to
> act directly on desc->arch.old_cpu_mask in the code you re-indent,
> leaving tmp_mask alone. I think that ought to be okay since nothing
> else should access that mask anymore. We could even avoid altering
> the field, by going from cpumask_and() to a cpu_online() check in
> the for_each_cpu() body.)

Hm, I was thinking we should print the old vector mask (since that's
the one still in use because migration hasn't happened yet) but then we
would also need to print the old vector instead of the new one, but
maybe that's too much change given the current behavior.

I think it's fine to set the trace here, before IRQ_UNUSED gets set (in
fact it might be better, as then the trace should be more coherent).

> >
> > - if ( likely(!desc->arch.move_in_progress) )
> > - return;
> > + for_each_cpu(cpu, tmp_mask)
> > + {
> > + ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> > + TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> > + per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> > + }
> >
> > - /* If we were in motion, also clear desc->arch.old_vector */
> > - old_vector = desc->arch.old_vector;
> > - cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);
> > + release_old_vec(desc);
> >
> > - for_each_cpu(cpu, tmp_mask)
> > - {
> > - ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> > - TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> > - per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> > + desc->arch.move_in_progress = 0;
> > }
> >
> > - release_old_vec(desc);
> > + write_atomic(&desc->arch.used, IRQ_UNUSED);
>
> Now that there is an ordering requirement between these last two writes,
> I think you want to add smp_wmb() after the first one (still inside the
> inner scope). write_atomic() is only a volatile asm() (which the compiler
> may move under certain conditions) and an access through a volatile
> pointer (which isn't ordered with non-volatile memory accesses), but it
> is specifically not a memory barrier.

Right, sorry - I always get confused and assume that a volatile asm
can't be reordered. I was about to add the write barrier but didn't do
it because of the volatile in the asm.

I would like to request a backport for this, but I think it's now too
late for the patch to be accepted to 4.17.

Thanks, Roger.
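
Putting both review points together (trace kept before IRQ_UNUSED gets
set, barrier added after the cleanup), the tail of _clear_irq_vector()
could end up looking something like the sketch below - illustrative
only, not necessarily the exact follow-up version that was committed:

    /* Trace first, while tmp_mask still holds the new-vector mask. */
    trace_irq_mask(TRC_HW_IRQ_CLEAR_VECTOR, irq, vector, tmp_mask);

    if ( unlikely(desc->arch.move_in_progress) )
    {
        /* If we were in motion, also clear desc->arch.old_vector */
        old_vector = desc->arch.old_vector;
        cpumask_and(tmp_mask, desc->arch.old_cpu_mask, &cpu_online_map);

        for_each_cpu(cpu, tmp_mask)
        {
            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
        }

        release_old_vec(desc);

        desc->arch.move_in_progress = 0;
        smp_wmb();  /* cleanup must be visible before IRQ_UNUSED */
    }

    /* Only now does the descriptor become eligible for reuse. */
    write_atomic(&desc->arch.used, IRQ_UNUSED);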
On 15.11.2022 18:04, Roger Pau Monné wrote:
> I would like to request a backport for this, but I think it's now too
> late for the patch to be accepted to 4.17.

Yes, I certainly was intending to queue this up once it went in. I
agree it's not critical enough to still slip into 4.17.0, though.

Jan