When creating a domU, but the creation fails, there is a corner case that may
lead to a crash in the null scheduler when running a debug build of Xen.
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379
(XEN) ****************************************
The events leading to the crash are:
* null_unit_insert() was invoked with the unit offline. Since the unit was
offline, unit_assign() was not called, and null_unit_insert() returned.
* Later during domain creation, the unit was onlined
* Eventually, domain creation failed due to bad configuration
* null_unit_remove() was invoked with the unit still online. Since the unit was
online, it called unit_deassign() and triggered an ASSERT.
To fix this, only call unit_deassign() when npc->unit is non-NULL in
null_unit_remove.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
RFC->v1
* Follow Juergen's suggested fix
Link to RFC [1]
[1] https://lists.xenproject.org/archives/html/xen-devel/2023-04/msg01387.html
---
xen/common/sched/null.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/xen/common/sched/null.c b/xen/common/sched/null.c
index 65a0a6c5312d..2091337fcd06 100644
--- a/xen/common/sched/null.c
+++ b/xen/common/sched/null.c
@@ -522,6 +522,8 @@ static void cf_check null_unit_remove(
{
struct null_private *prv = null_priv(ops);
struct null_unit *nvc = null_unit(unit);
+ struct null_pcpu *npc;
+ unsigned int cpu;
spinlock_t *lock;
ASSERT(!is_idle_unit(unit));
@@ -531,8 +533,6 @@ static void cf_check null_unit_remove(
/* If offline, the unit shouldn't be assigned, nor in the waitqueue */
if ( unlikely(!is_unit_online(unit)) )
{
- struct null_pcpu *npc;
-
npc = unit->res->sched_priv;
ASSERT(npc->unit != unit);
ASSERT(list_empty(&nvc->waitq_elem));
@@ -549,7 +549,10 @@ static void cf_check null_unit_remove(
goto out;
}
- unit_deassign(prv, unit);
+ cpu = sched_unit_master(unit);
+ npc = get_sched_res(cpu)->sched_priv;
+ if ( npc->unit )
+ unit_deassign(prv, unit);
out:
unit_schedule_unlock_irq(lock, unit);
--
2.40.1
On 01.05.23 22:30, Stewart Hildebrand wrote: > When creating a domU, but the creation fails, there is a corner case that may > lead to a crash in the null scheduler when running a debug build of Xen. > > (XEN) **************************************** > (XEN) Panic on CPU 0: > (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 > (XEN) **************************************** > > The events leading to the crash are: > > * null_unit_insert() was invoked with the unit offline. Since the unit was > offline, unit_assign() was not called, and null_unit_insert() returned. > * Later during domain creation, the unit was onlined > * Eventually, domain creation failed due to bad configuration > * null_unit_remove() was invoked with the unit still online. Since the unit was > online, it called unit_deassign() and triggered an ASSERT. > > To fix this, only call unit_deassign() when npc->unit is non-NULL in > null_unit_remove. > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Juergen
On 5/5/23 01:59, Juergen Gross wrote: > On 01.05.23 22:30, Stewart Hildebrand wrote: >> When creating a domU, but the creation fails, there is a corner case that may >> lead to a crash in the null scheduler when running a debug build of Xen. >> >> (XEN) **************************************** >> (XEN) Panic on CPU 0: >> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 >> (XEN) **************************************** >> >> The events leading to the crash are: >> >> * null_unit_insert() was invoked with the unit offline. Since the unit was >> offline, unit_assign() was not called, and null_unit_insert() returned. >> * Later during domain creation, the unit was onlined >> * Eventually, domain creation failed due to bad configuration >> * null_unit_remove() was invoked with the unit still online. Since the unit was >> online, it called unit_deassign() and triggered an ASSERT. >> >> To fix this, only call unit_deassign() when npc->unit is non-NULL in >> null_unit_remove. >> >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > > Reviewed-by: Juergen Gross <jgross@suse.com> Thanks for the review. Does this still need a maintainer ack?
On Thu, 2023-05-18 at 17:27 -0400, Stewart Hildebrand wrote: > On 5/5/23 01:59, Juergen Gross wrote: > > > > > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > > > > Reviewed-by: Juergen Gross <jgross@suse.com> > > Thanks for the review. Does this still need a maintainer ack? > Acked-by: Dario Faggioli <dfaggioli@suse.com> Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere)
On 18.05.2023 23:27, Stewart Hildebrand wrote: > On 5/5/23 01:59, Juergen Gross wrote: >> On 01.05.23 22:30, Stewart Hildebrand wrote: >>> When creating a domU, but the creation fails, there is a corner case that may >>> lead to a crash in the null scheduler when running a debug build of Xen. >>> >>> (XEN) **************************************** >>> (XEN) Panic on CPU 0: >>> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 >>> (XEN) **************************************** >>> >>> The events leading to the crash are: >>> >>> * null_unit_insert() was invoked with the unit offline. Since the unit was >>> offline, unit_assign() was not called, and null_unit_insert() returned. >>> * Later during domain creation, the unit was onlined >>> * Eventually, domain creation failed due to bad configuration >>> * null_unit_remove() was invoked with the unit still online. Since the unit was >>> online, it called unit_deassign() and triggered an ASSERT. >>> >>> To fix this, only call unit_deassign() when npc->unit is non-NULL in >>> null_unit_remove. >>> >>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> >> >> Reviewed-by: Juergen Gross <jgross@suse.com> > > Thanks for the review. Does this still need a maintainer ack? In principle yes. I might be willing to time out at some point, but not before at least one ping was sent (and some more time has passed afterwards). Jan
© 2016 - 2024 Red Hat, Inc.