Hi all,
After many PVH Dom0 suspend/resume cycles we are seeing the following
Xen crash (it is random and doesn't reproduce reliably):
(XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
(XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
(XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
(XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
(XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
(XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
(XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
(XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
(XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
(XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
(XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
(XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
(XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
(XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
(XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
(XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
(XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
(XEN) [555.043067]
(XEN) [555.469861]
(XEN) [555.471855] ****************************************
(XEN) [555.477315] Panic on CPU 9:
(XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
(XEN) [555.489882] ****************************************
Looking at the code in question, the ASSERT looks wrong to me.

Specifically, if you look at send_cleanup_vector and
irq_move_cleanup_interrupt, it is entirely possible to have old_vector
still valid and move_in_progress still set, but only some of the
per_cpu(vector_irq, me)[vector] entries cleared. It seems to me that
this could happen especially when an MSI has a large old_cpu_mask.

While each CPU clears its own per_cpu(vector_irq, me)[vector] entry one
by one, there is a window in which not all entries have been cleared
yet and old_vector is still set.
If at this point we enter _clear_irq_vector, we are going to hit the
ASSERT above.
My suggestion was to turn the ASSERT into an if. Any better ideas?
Cheers,
Stefano
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 20150b1c7f..c82c6b350a 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -230,9 +230,11 @@ static void _clear_irq_vector(struct irq_desc *desc)
 
     for_each_cpu(cpu, tmp_mask)
     {
-        ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
-        TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
-        per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
+        {
+            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        }
     }
 
     release_old_vec(desc);
On Thu, May 18, 2023 at 04:44:53PM -0700, Stefano Stabellini wrote:
> Hi all,
>
> After many PVH Dom0 suspend/resume cycles we are seeing the following
> Xen crash (it is random and doesn't reproduce reliably):
>
> [...]
>
> (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
>
> Looking at the code in question, the ASSERT looks wrong to me.
>
> Specifically, if you see send_cleanup_vector and
> irq_move_cleanup_interrupt, it is entirely possible to have old_vector
> still valid and also move_in_progress still set, but only some of the
> per_cpu(vector_irq, me)[vector] cleared. It seems to me that this could
> happen especially when an MSI has a large old_cpu_mask.

I guess the only way to get into such a situation would be if you
happen to execute _clear_irq_vector() with a cpu_online_map smaller
than the one in old_cpu_mask, at which point you will leave old_vector
fields not updated.

Maybe somehow you get into this situation when doing suspend/resume?

Could you try to add a:

ASSERT(cpumask_equal(tmp_mask, desc->arch.old_cpu_mask));

before the `for_each_cpu(cpu, tmp_mask)` loop?

Thanks, Roger.
On Tue, May 23, 2023 at 03:54:36PM +0200, Roger Pau Monné wrote:
> On Thu, May 18, 2023 at 04:44:53PM -0700, Stefano Stabellini wrote:
> > [...]
>
> Could you try to add a:
>
> ASSERT(cpumask_equal(tmp_mask, desc->arch.old_cpu_mask));
>
> before the `for_each_cpu(cpu, tmp_mask)` loop?

I see that the old_cpu_mask is cleared in release_old_vec(), so that
suggestion is not very useful.

Does the crash happen at specific points, for example just after
resume or before suspend?

Roger.
On Tue, 23 May 2023, Roger Pau Monné wrote:
> On Tue, May 23, 2023 at 03:54:36PM +0200, Roger Pau Monné wrote:
> > [...]
> >
> > Could you try to add a:
> >
> > ASSERT(cpumask_equal(tmp_mask, desc->arch.old_cpu_mask));
> >
> > before the `for_each_cpu(cpu, tmp_mask)` loop?
>
> I see that the old_cpu_mask is cleared in release_old_vec(), so that
> suggestion is not very useful.
>
> Does the crash happen at specific points, for example just after
> resume or before suspend?

Hi Roger, Jan,

Unfortunately we are all completely blind on this issue. It is not
reproducible; it only happened once. We only have our wits to solve
this problem.

However, looking at the codebase I think there are a few possible
races. Here is my analysis and suggested fix.

---
xen/irq: fix races between send_cleanup_vector and _clear_irq_vector

It is possible that send_cleanup_vector and _clear_irq_vector are
running at the same time on different CPUs. In that case we have a race
as both _clear_irq_vector and irq_move_cleanup_interrupt are trying to
clear old_vector.

This patch fixes 3 races:

1) As irq_move_cleanup_interrupt is running on multiple CPUs at the
same time, and also _clear_irq_vector is running, it is possible that
only some per_cpu(vector_irq, cpu)[old_vector] are valid but not all.
So, turn the ASSERT in _clear_irq_vector into an if.

2) It is possible that _clear_irq_vector is running at the same time as
release_old_vec, called from irq_move_cleanup_interrupt. At the moment,
it is possible for _clear_irq_vector to read a valid old_cpu_mask but
an invalid old_vector (because it is being set to invalid by
release_old_vec). To avoid this problem, in release_old_vec move
clearing old_cpu_mask before setting old_vector to invalid. This way,
we know that in _clear_irq_vector if old_vector is invalid then
old_cpu_mask is also zero and we don't enter the loop.

3) It is possible that release_old_vec is running twice at the same
time for the same old_vector. Change the code in release_old_vec to
make it OK to call it twice: remove both ASSERTs. With those gone, it
should be possible now to call release_old_vec twice in a row for the
same old_vector.

Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
---
 xen/arch/x86/irq.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 20150b1c7f..d9c139cf1b 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -112,16 +112,11 @@ static void release_old_vec(struct irq_desc *desc)
 {
     unsigned int vector = desc->arch.old_vector;
 
-    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
     cpumask_clear(desc->arch.old_cpu_mask);
+    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
 
-    if ( !valid_irq_vector(vector) )
-        ASSERT_UNREACHABLE();
-    else if ( desc->arch.used_vectors )
-    {
-        ASSERT(test_bit(vector, desc->arch.used_vectors));
+    if ( desc->arch.used_vectors )
         clear_bit(vector, desc->arch.used_vectors);
-    }
 }
 
 static void _trace_irq_mask(uint32_t event, int irq, int vector,
@@ -230,9 +225,11 @@ static void _clear_irq_vector(struct irq_desc *desc)
 
     for_each_cpu(cpu, tmp_mask)
     {
-        ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
-        TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
-        per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
+        {
+            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        }
     }
 
     release_old_vec(desc);
-- 
2.25.1
On 25.05.2023 01:51, Stefano Stabellini wrote:
> xen/irq: fix races between send_cleanup_vector and _clear_irq_vector

This title is, I'm afraid, already misleading. No such race can occur
afaict, as both callers of _clear_irq_vector() acquire the IRQ
descriptor lock first, and irq_complete_move() (the sole caller of
send_cleanup_vector()) is only ever invoked as or by an ->ack() hook,
which in turn is only invoked with, again, the descriptor lock held.

> It is possible that send_cleanup_vector and _clear_irq_vector are
> running at the same time on different CPUs. In that case we have a race
> as both _clear_irq_vector and irq_move_cleanup_interrupt are trying to
> clear old_vector.
>
> This patch fixes 3 races:
>
> 1) As irq_move_cleanup_interrupt is running on multiple CPUs at the
> same time, and also _clear_irq_vector is running, it is possible that
> only some per_cpu(vector_irq, cpu)[old_vector] are valid but not all.
> So, turn the ASSERT in _clear_irq_vector into an if.

Note again the locking which is in effect.

> 2) It is possible that _clear_irq_vector is running at the same time as
> release_old_vec, called from irq_move_cleanup_interrupt. [...]

All invocations of release_old_vec() are similarly inside suitably
locked regions.

> 3) It is possible that release_old_vec is running twice at the same
> time for the same old_vector. [...]

Same here.

Any such issues would surface more frequently and without any suspend /
resume involved. What is still missing is that connection, and only
then will it (or really: may it) become clear what needs adjusting. If
you've seen the issue exactly once, then I'm afraid there's not much we
can do unless someone can come up with a plausible explanation of
something being broken on any of the involved code paths. More
information will need to be gathered out of the next occurrence of
this, whenever that's going to be. One of the things we will want to
know, as mentioned before, is the value that
per_cpu(vector_irq, cpu)[old_vector] has when the assertion triggers.
Iirc Roger did suggest another piece of data you'd want to log.

> [quoted diff elided]

As said before - replacing ASSERT() by a respective if() cannot really
be done without discussing the "else" case in the description. Except
of course in trivial/obvious cases, but I think we agree here we don't
have such a case.

Jan
On Thu, 25 May 2023, Jan Beulich wrote:
> On 25.05.2023 01:51, Stefano Stabellini wrote:
> > xen/irq: fix races between send_cleanup_vector and _clear_irq_vector
>
> This title is, I'm afraid, already misleading. No such race can occur
> afaict, as both callers of _clear_irq_vector() acquire the IRQ
> descriptor lock first, and irq_complete_move() (the sole caller of
> send_cleanup_vector()) is only ever invoked as or by an ->ack() hook,
> which in turn is only invoked with, again, the descriptor lock held.

Yes, I see that you are right about the locking, and thank you for
taking the time to look into it.

One last question: could it be that a second interrupt arrives while
->ack() is being handled? Is do_IRQ() running with interrupts disabled?

> [...]
>
> Any such issues would surface more frequently and without any suspend /
> resume involved. [...] More information will need to be gathered out
> of the next occurrence of this, whenever that's going to be.

Understood, thanks for the explanation.
On 25.05.2023 21:54, Stefano Stabellini wrote:
> [...]
>
> One last question: could it be that a second interrupt arrives while
> ->ack() is being handled? Is do_IRQ() running with interrupts disabled?

It is, at least as far as the invocation of ->ack() is concerned. Else
the locking scheme would be broken. You may note that we enable
interrupts around the ->handler() invocation.

Jan
On Fri, 26 May 2023, Jan Beulich wrote:
> [...]
>
> It is, at least as far as the invocation of ->ack() is concerned. Else
> the locking scheme would be broken. You may note that we enable
> interrupts around the ->handler() invocation.

OK. FYI, we were able to reproduce a problem after 250+ Dom0
suspend/resume cycles with my patch applied. So unfortunately there is
no extra information, as my patch removes the ASSERTs. However, I can
tell you that the symptom is the below. I am not sure if it tells you
anything, but FYI. So clearly my patch makes the problem harder to
reproduce but doesn't fix it.

May 23 22:47:31 amd-saravana-crater kernel: [17881.744986] INFO: task kworker/u8:1:45 blocked for more than 120 seconds.
May 23 22:47:31 amd-saravana-crater kernel: [17881.745048]       Not tainted 6.1.0-rtc-s3 #1
May 23 22:47:31 amd-saravana-crater kernel: [17881.745089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 22:47:31 amd-saravana-crater kernel: [17881.745144] task:kworker/u8:1 state:D stack:0 pid:45 ppid:2 flags:0x00004000
May 23 22:47:31 amd-saravana-crater kernel: [17881.745154] Workqueue: writeback wb_workfn (flush-259:0)
May 23 22:47:31 amd-saravana-crater kernel: [17881.745170] Call Trace:
May 23 22:47:31 amd-saravana-crater kernel: [17881.745174]  <TASK>
May 23 22:47:31 amd-saravana-crater kernel: [17881.745182]  __schedule+0x2d5/0x920
May 23 22:47:31 amd-saravana-crater kernel: [17881.745192]  ? preempt_count_add+0x7c/0xc0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745202]  schedule+0x63/0xd0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745208]  __bio_queue_enter+0xeb/0x230
May 23 22:47:31 amd-saravana-crater kernel: [17881.745217]  ? prepare_to_wait_event+0x130/0x130
May 23 22:47:31 amd-saravana-crater kernel: [17881.745226]  blk_mq_submit_bio+0x358/0x570
May 23 22:47:31 amd-saravana-crater kernel: [17881.745237]  __submit_bio+0xfa/0x170
May 23 22:47:31 amd-saravana-crater kernel: [17881.745243]  submit_bio_noacct_nocheck+0x229/0x2b0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745250]  ? ktime_get+0x47/0xb0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745256]  submit_bio_noacct+0x1e4/0x5a0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745261]  ? submit_bio_noacct+0x1e4/0x5a0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745268]  submit_bio+0x47/0x80
May 23 22:47:31 amd-saravana-crater kernel: [17881.745273]  ext4_io_submit+0x24/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745282]  ext4_writepages+0x57f/0xdd0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745288]  ? _raw_read_lock_bh+0x20/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745296]  ? update_sd_lb_stats.constprop.148+0x11e/0x960
May 23 22:47:31 amd-saravana-crater kernel: [17881.745308]  do_writepages+0xbf/0x1a0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745314]  ? __enqueue_entity+0x6c/0x80
May 23 22:47:31 amd-saravana-crater kernel: [17881.745321]  ? enqueue_entity+0x1a9/0x370
May 23 22:47:31 amd-saravana-crater kernel: [17881.745327]  __writeback_single_inode+0x44/0x360
May 23 22:47:31 amd-saravana-crater kernel: [17881.745332]  ? _raw_spin_unlock+0x19/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745339]  writeback_sb_inodes+0x203/0x4e0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745350]  __writeback_inodes_wb+0x66/0xd0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745358]  wb_writeback+0x23d/0x2d0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745366]  wb_workfn+0x20b/0x490
May 23 22:47:31 amd-saravana-crater kernel: [17881.745372]  ? _raw_spin_unlock+0x19/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745381]  process_one_work+0x227/0x440
May 23 22:47:31 amd-saravana-crater kernel: [17881.745389]  worker_thread+0x31/0x3e0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745395]  ? process_one_work+0x440/0x440
May 23 22:47:31 amd-saravana-crater kernel: [17881.745400]  kthread+0xfe/0x130
May 23 22:47:31 amd-saravana-crater kernel: [17881.745406]  ? kthread_complete_and_exit+0x20/0x20
May 23 22:47:31 amd-saravana-crater kernel: [17881.745413]  ret_from_fork+0x22/0x30
May 23 22:47:31 amd-saravana-crater kernel: [17881.745425]  </TASK>
On 19.05.2023 01:44, Stefano Stabellini wrote:
> Hi all,
>
> After many PVH Dom0 suspend/resume cycles we are seeing the following
> Xen crash (it is random and doesn't reproduce reliably):
>
> [...]
>
> (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
>
> Looking at the code in question, the ASSERT looks wrong to me.
>
> Specifically, if you see send_cleanup_vector and
> irq_move_cleanup_interrupt, it is entirely possible to have old_vector
> still valid and also move_in_progress still set, but only some of the
> per_cpu(vector_irq, me)[vector] cleared. It seems to me that this could
> happen especially when an MSI has a large old_cpu_mask.
>
> [...]
>
> My suggestion was to turn the ASSERT into an if. Any better ideas?

While I'm fully willing to believe there is something that needs
fixing, we need to understand what's going wrong before applying any
change. Even more so when the suggestion is the simplest possible one,
converting an assertion to an if(). There's no explanation at all of
what happens in the "else" case: are we (silently) leaking vectors
then? Any other undue side effects?

What I'm particularly missing is any connection to suspend/resume.
There my primary suspect would be an issue with/in fixup_irqs().

What might be relevant in the investigation is the value found in the
array slot. In particular, if it was already ~irq, it may hint at where
the issue is coming from and/or suggest that the assertion may merely
need a little bit of relaxing. Allowing for that value in an array slot
would in particular not raise any questions towards "what if not" (as
mentioned above), because that's already the value we store there
anyway.

Jan