With the POWER9 processor comes a new interrupt controller called
XIVE. It is composed of three sub-engines:
- Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
in the main controller for the IPIs and in the PSI host
bridge. They are configured to feed the IVRE with events.
- Interrupt Virtualization Routing Engine (IVRE). Its job is to
match an event source with a Notification Virtualization Target
(NVT), a priority and an Event Queue (EQ) to determine if a
Virtual Processor can handle the event.
- Interrupt Virtualization Presentation Engine (IVPE). It maintains
the interrupt state of each hardware thread and presents the
notification as an external exception.
Each of the engines uses a set of internal tables to redirect
exceptions from event sources to CPU threads. The first table we
introduce is the Interrupt Virtualization Entry (IVE) table, part of
the virtualization engine in charge of routing events. It associates
event sources (IRQ numbers) with event queues, which may or may not
forward the event notification to the presentation controller.
The XIVE model is designed to make use of the full range of the IRQ
number space and does not use an offset like the XICS mode does.
Hence, the IVE table is directly indexed by the IRQ number.
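In code terms, the lookup boils down to a direct array access, as the
spapr_xive_get_ive() helper below shows:

    /* No offset: the LISN (IRQ number) indexes the IVT directly. */
    XiveIVE *ive = lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;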
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
Changes since v1 :
- used g_new0 instead of g_malloc0
- removed VMSTATE_STRUCT_VARRAY_UINT32_ALLOC
- introduced a device reset handler. The object needs to be parented
to sysbus when created.
- renamed spapr_xive_irq_set to spapr_xive_irq_enable
- renamed spapr_xive_irq_unset to spapr_xive_irq_disable
- moved the PPC_BIT macros under target/ppc/cpu.h
- shrunk file copyright header
default-configs/ppc64-softmmu.mak | 1 +
hw/intc/Makefile.objs | 1 +
hw/intc/spapr_xive.c | 156 ++++++++++++++++++++++++++++++++++++++
hw/intc/xive-internal.h | 41 ++++++++++
include/hw/ppc/spapr_xive.h | 35 +++++++++
5 files changed, 234 insertions(+)
create mode 100644 hw/intc/spapr_xive.c
create mode 100644 hw/intc/xive-internal.h
create mode 100644 include/hw/ppc/spapr_xive.h
diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index d1b3a6dd50f8..4a7f6a0696de 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -56,6 +56,7 @@ CONFIG_SM501=y
CONFIG_XICS=$(CONFIG_PSERIES)
CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
+CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
# For PReP
CONFIG_SERIAL_ISA=y
CONFIG_MC146818RTC=y
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index ae358569a155..49e13e7aeeee 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -35,6 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
obj-$(CONFIG_XICS) += xics.o
obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
obj-$(CONFIG_XICS_KVM) += xics_kvm.o
+obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
obj-$(CONFIG_POWERNV) += xics_pnv.o
obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
obj-$(CONFIG_S390_FLIC) += s390_flic.o
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
new file mode 100644
index 000000000000..e6e8841add17
--- /dev/null
+++ b/hw/intc/spapr_xive.c
@@ -0,0 +1,156 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/dma.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/spapr_xive.h"
+
+#include "xive-internal.h"
+
+/*
+ * Main XIVE object
+ */
+
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
+{
+ int i;
+
+ for (i = 0; i < xive->nr_irqs; i++) {
+ XiveIVE *ive = &xive->ivt[i];
+
+ if (!(ive->w & IVE_VALID)) {
+ continue;
+ }
+
+ monitor_printf(mon, " %4x %s %08x %08x\n", i,
+ ive->w & IVE_MASKED ? "M" : " ",
+ (int) GETFIELD(IVE_EQ_INDEX, ive->w),
+ (int) GETFIELD(IVE_EQ_DATA, ive->w));
+ }
+}
+
+static void spapr_xive_reset(DeviceState *dev)
+{
+ sPAPRXive *xive = SPAPR_XIVE(dev);
+ int i;
+
+ /* Mask all valid IVEs in the IRQ number space. */
+ for (i = 0; i < xive->nr_irqs; i++) {
+ XiveIVE *ive = &xive->ivt[i];
+ if (ive->w & IVE_VALID) {
+ ive->w |= IVE_MASKED;
+ }
+ }
+}
+
+static void spapr_xive_realize(DeviceState *dev, Error **errp)
+{
+ sPAPRXive *xive = SPAPR_XIVE(dev);
+
+ if (!xive->nr_irqs) {
+ error_setg(errp, "Number of interrupts needs to be greater than 0");
+ return;
+ }
+
+ /* Allocate the IVT (Interrupt Virtualization Table) */
+ xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
+}
+
+static const VMStateDescription vmstate_spapr_xive_ive = {
+ .name = TYPE_SPAPR_XIVE "/ive",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (VMStateField []) {
+ VMSTATE_UINT64(w, XiveIVE),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
+static bool vmstate_spapr_xive_needed(void *opaque)
+{
+ /* TODO check machine XIVE support */
+ return true;
+}
+
+static const VMStateDescription vmstate_spapr_xive = {
+ .name = TYPE_SPAPR_XIVE,
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .needed = vmstate_spapr_xive_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
+ VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
+ vmstate_spapr_xive_ive, XiveIVE),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
+static Property spapr_xive_properties[] = {
+ DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
+ DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_xive_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+
+ dc->realize = spapr_xive_realize;
+ dc->reset = spapr_xive_reset;
+ dc->props = spapr_xive_properties;
+ dc->desc = "sPAPR XIVE interrupt controller";
+ dc->vmsd = &vmstate_spapr_xive;
+}
+
+static const TypeInfo spapr_xive_info = {
+ .name = TYPE_SPAPR_XIVE,
+ .parent = TYPE_SYS_BUS_DEVICE,
+ .instance_size = sizeof(sPAPRXive),
+ .class_init = spapr_xive_class_init,
+};
+
+static void spapr_xive_register_types(void)
+{
+ type_register_static(&spapr_xive_info);
+}
+
+type_init(spapr_xive_register_types)
+
+XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn)
+{
+ return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
+}
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
+{
+ XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
+
+ if (!ive) {
+ return false;
+ }
+
+ ive->w |= IVE_VALID;
+ return true;
+}
+
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
+{
+ XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
+
+ if (!ive) {
+ return false;
+ }
+
+ ive->w &= ~IVE_VALID;
+ return true;
+}
diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
new file mode 100644
index 000000000000..132b71a6daf0
--- /dev/null
+++ b/hw/intc/xive-internal.h
@@ -0,0 +1,41 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2016-2017, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef _INTC_XIVE_INTERNAL_H
+#define _INTC_XIVE_INTERNAL_H
+
+/* Utilities to manipulate these (originally from OPAL) */
+#define MASK_TO_LSH(m) (__builtin_ffsl(m) - 1)
+#define GETFIELD(m, v) (((v) & (m)) >> MASK_TO_LSH(m))
+#define SETFIELD(m, v, val) \
+ (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
+
+/* IVE/EAS
+ *
+ * One per interrupt source. Targets that interrupt to a given EQ
+ * and provides the corresponding logical interrupt number (EQ data)
+ *
+ * We also map this structure to the escalation descriptor inside
+ * an EQ, though in that case the valid and masked bits are not used.
+ */
+typedef struct XiveIVE {
+ /* Use a single 64-bit definition to make it easier to
+ * perform atomic updates
+ */
+ uint64_t w;
+#define IVE_VALID PPC_BIT(0)
+#define IVE_EQ_BLOCK PPC_BITMASK(4, 7) /* Destination EQ block# */
+#define IVE_EQ_INDEX PPC_BITMASK(8, 31) /* Destination EQ index */
+#define IVE_MASKED PPC_BIT(32) /* Masked */
+#define IVE_EQ_DATA PPC_BITMASK(33, 63) /* Data written to the EQ */
+} XiveIVE;
+
+XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
+
+#endif /* _INTC_XIVE_INTERNAL_H */
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
new file mode 100644
index 000000000000..5b1f78e06a1e
--- /dev/null
+++ b/include/hw/ppc/spapr_xive.h
@@ -0,0 +1,35 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_SPAPR_XIVE_H
+#define PPC_SPAPR_XIVE_H
+
+#include <hw/sysbus.h>
+
+typedef struct sPAPRXive sPAPRXive;
+typedef struct XiveIVE XiveIVE;
+
+#define TYPE_SPAPR_XIVE "spapr-xive"
+#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
+
+struct sPAPRXive {
+ SysBusDevice parent;
+
+ /* Properties */
+ uint32_t nr_irqs;
+
+ /* XIVE internal tables */
+ XiveIVE *ivt;
+};
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
+
+#endif /* PPC_SPAPR_XIVE_H */
--
2.13.6
> +static const VMStateDescription vmstate_spapr_xive = {
> + .name = TYPE_SPAPR_XIVE,
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .needed = vmstate_spapr_xive_needed,
> + .fields = (VMStateField[]) {
> + VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> + VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
> + vmstate_spapr_xive_ive, XiveIVE),
I got it wrong again. This should be:
VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
vmstate_spapr_xive_ive, XiveIVE),
for migration to work.
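Spelled out, the corrected .fields array would read:

    .fields = (VMStateField[]) {
        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
                                             vmstate_spapr_xive_ive, XiveIVE),
        VMSTATE_END_OF_LIST()
    },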
Cheers,
C.
On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> With the POWER9 processor comes a new interrupt controller called
> XIVE. It is composed of three sub-engines :
>
> - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
> in the main controller for the IPIS and in the PSI host
> bridge. They are configured to feed the IVRE with events.
>
> - Interrupt Virtualization Routing Engine (IVRE). Their job is to
> match an event source with a Notification Virtualization Target
> (NVT), a priority and an Event Queue (EQ) to determine if a
> Virtual Processor can handle the event.
>
> - Interrupt Virtualization Presentation Engine (IVPE). It maintains
> the interrupt state of each hardware thread and present the
> notification as an external exception.
>
> Each of the engines uses a set of internal tables to redirect
> exceptions from event sources to CPU threads. The first table we
> introduce is the Interrupt Virtualization Entry (IVE) table, part of
> the virtualization engine in charge of routing events. It associates
> event sources (IRQ numbers) to event queues which will forward, or
> not, the event notification to the presentation controller.
>
> The XIVE model is designed to make use of the full range of the IRQ
> number space and does not use an offset like the XICS mode does.
> Hence, the IVE table is directly indexed by the IRQ number.
>
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
As you've suggested yourself, I think we might need to more
explicitly model the different components of the XIVE system. As part
of that, I think you need to be clearer in this base skeleton about
exactly what component your XIVE object represents.
If the answer is "the overall thing" I suspect that's not what you
want - I had one of those for XICS, which proved to be a mistake
(eventually replaced by the XICSFabric interface).
Changing the model later isn't impossible, but doing so without
breaking migration can be a real pain, so I think it's worth a
reasonable effort to try and get it right initially.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 12/20/2017 06:09 AM, David Gibson wrote:
>
> As you've suggested in yourself, I think we might need to more
> explicitly model the different components of the XIVE system. As part
> of that, I think you need to be clearer in this base skeleton about
> exactly what component your XIVE object represents.
ok. The base skeleton is the IVRE, the central engine handling
the routing.
> If the answer is "the overall thing"
Yes, it is more or less that currently.
The sPAPRXive object models the source engine and the routing
engine in one object.
I have merged these for simplicity and because the interrupt
controller has an internal source for the interrupts of the "IPI"
type, which are used for the CPU IPIs but also for other generic
interrupts, like the OpenCAPI ones. The XIVE sPAPR interface is
also much simpler than the baremetal one, as all the tables are
maintained in the hypervisor, so this choice made some sense.
But since then, I have started the PowerNV model and I am duplicating
a lot of code to handle the triggering and the MMIOs in the
different sources. So I am not convinced anymore. Nevertheless,
the overall routing logic is the same even if some of the tables
are not located in QEMU anymore, but in the machine memory.
The sPAPRXiveNVT models some of the CPU presenter engine. It
holds the virtual CPU interrupt states when not dispatched on
a real HW thread. The real world is more complex. There are "CAM"
lines in the HW threads which are compared to find a matching
candidate. But I don't think we need to do anything more complex
than today unless we want to support KVM under TCG ...
> I suspect that's not what you
> want - I had one of those for XICs which proved to be a mistake
> (eventually replaced by the XICSFabric interface).
The XICSFabric would be the main Xive object. The interface
between the sources and the routing engine is hidden in sPAPR, so
we can use a simple function call:
spapr_xive_irq(pnv->xive, irq);
we could get rid of the qirqs but they are required for XICS.
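Something along these lines, as a sketch only, reusing the IVE bits
from the patch (the EQ and presenter sides are not modeled yet):

    /* Sketch only: source event notification entry point. */
    void spapr_xive_irq(sPAPRXive *xive, uint32_t lisn)
    {
        XiveIVE *ive = spapr_xive_get_ive(xive, lisn);

        if (!ive || !(ive->w & IVE_VALID) || (ive->w & IVE_MASKED)) {
            return;
        }

        /* Route to the EQ selected by IVE_EQ_BLOCK/IVE_EQ_INDEX, push
         * IVE_EQ_DATA in it, then notify the presentation engine. */
    }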
PowerNV uses MMIOs to notify an event and it makes the modeling
somewhat easier. Each controller model has a notify port address
register on which an interrupt number is written to forward an
event to the routing engine. So it is a simple store.
I don't know why there is a different notify port address per
source, maybe for extra filtering at the routing engine level.
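On the QEMU side, the notify port could be caught by a plain
MemoryRegionOps write handler feeding the same entry point, something
like this (register layout and names invented here):

    static void xive_notify_write(void *opaque, hwaddr addr,
                                  uint64_t val, unsigned size)
    {
        sPAPRXive *xive = opaque;

        /* The value stored at the notify port is the IRQ number to route. */
        spapr_xive_irq(xive, (uint32_t)val);
    }

    static const MemoryRegionOps xive_notify_ops = {
        .write = xive_notify_write,
        .endianness = DEVICE_BIG_ENDIAN,
    };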
> Changing the model later isn't impossible, but doing so without
> breaking migration can be a real pain, so I think it's worth a
> reasonable effort to try and get it right initially.
I completely agree.
This is why I have started the PnvXive model to challenge the
current PAPR design. I have hacked a bunch of patches for XIVE,
LPC, PSI, OCC and basic PPC support which boot a PowerNV P9 up to
petitboot. It would look better with a source object, but the
location of the PQ bits is a bit problematic. It highly depends
on the controller. The main controller uses tables in the hypervisor
memory. The PSIHB controller has its own bits. I suppose it is
the same for PHB4. I need to take a closer look at how we could
have a common source object.
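Something like this could be a starting point for a common source
object (very much a sketch, names invented, PQ handling simplified):

    typedef struct XiveSource {
        DeviceState  parent;

        uint32_t     nr_irqs;
        uint8_t      *status;    /* PQ bits of each source */
        MemoryRegion esb_mmio;   /* ESB pages exposing the PQ bits */
    } XiveSource;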
The most important part is KVM support and how we expose the
MMIO region. We need to make progress on that topic.
Thanks,
C.
On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> On 12/20/2017 06:09 AM, David Gibson wrote:
> > As you've suggested in yourself, I think we might need to more
> > explicitly model the different components of the XIVE system. As part
> > of that, I think you need to be clearer in this base skeleton about
> > exactly what component your XIVE object represents.
Sorry it's been so long since I looked at these.
> ok. The base skeleton is the IVRE, the central engine handling
> the routing.
>
> > If the answer is "the overall thing"
>
> Yes, it is more or less that currently.
>
> The sPAPRXive object models the source engine and the routing
> engine in one object.
Yeah, I suspect we don't want that. Although it might seem simpler in
the spapr case, at least at first glance, I think it will cause us
problems later. At the very least, it's likely to make it harder to
share code between the spapr and powernv case. I think it will also
make for more confusion about exactly what things belong where.
> I have merged these for simplicity and because the interrupt
> controller has an internal source for the interrupts of the "IPI"
> type, which are used for the CPU IPIs but also for other generic
> interrupts, like the OpenCAPI ones. The XIVE sPAPR interface is
> also much simpler than the baremetal one, all the tables are
> maintained in the hypervisor, so this choice made some sense.
>
> But since, I have started the PowerNV model and I am duplicating
> a lot of code to handle the triggering and the MMIOs in the
> different sources. So I am not convinced anymore. Nevertheless,
> the overall routing logic is the same even if some the tables
> are not located in QEMU anymore, but in the machine memory.
>
> The sPAPRXiveNVT models some of the CPU presenter engine. It
> holds the virtual CPU interrupt states when not dispatched on
> a real HW thread. Real world is more complex. There are "CAM"
> lines in the HW threads which are compared to find a matching
> candidate. But I don't think we need to anything more complex
> than today unless we want to support KVM under TCG ...
>
> > I suspect that's not what you
> > want - I had one of those for XICs which proved to be a mistake
> > (eventually replaced by the XICSFabric interface).
>
> The XICSFabric would be the main Xive object. The interface
> between the sources and the routing engine is hidden in sPAPR,
> we can use a simple function call :
>
> spapr_xive_irq(pnv->xive, irq);
>
> we could get rid of the qirqs but they are required for XICS.
I don't quite follow, but this doesn't sound right.
> PowerNV uses MMIOs to notify an event and it makes the modeling
> somewhat easier. Each controller model has a notify port address
> register on which a interrupt number is written to forward an
> event to the routing engine. So it is a simple store.
>
> I don't know why there is a different notify port address per
> source, may be for extra filtering at the routing engine level.
>
> > Changing the model later isn't impossible, but doing so without
> > breaking migration can be a real pain, so I think it's worth a
> > reasonable effort to try and get it right initially.
>
> I completely agree.
>
> This is why I have started the PnvXive model to challenge the
> current PAPR design. I have hacked a bunch of patches for XIVE,
> LPC, PSI, OCC and basic PPC support which boot a PowerNV P9 up to
> petitboot. It would look better with a source object, but the
> location of the PQ bits is a bit problematic. It highly depends
> on the controller. The main controller uses tables in the hypervisor
> memory. The PSIHB controller has its own bits. I suppose it is
> the same for PHB4. I need to take a closer look at how we could
> have a common source object.
Ok, sounds like a good idea.
>
> The most important part is KVM support and how we expose the
> MMIO region. We need to make progress on that topic.
>
> Thanks,
>
> C.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 04/12/2018 07:07 AM, David Gibson wrote:
> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>> As you've suggested in yourself, I think we might need to more
>>> explicitly model the different components of the XIVE system. As part
>>> of that, I think you need to be clearer in this base skeleton about
>>> exactly what component your XIVE object represents.
>
> Sorry it's been so long since I looked at these.
That's fine. I have been working on a XIVE device model for the PowerNV
machine and KVM support for the pseries. I have a better understanding
of the overall picture.
The patchset has not changed much so we can still discuss on this
basis without me flooding the mailing list.
>> ok. The base skeleton is the IVRE, the central engine handling
>> the routing.
>>
>>> If the answer is "the overall thing"
>>
>> Yes, it is more or less that currently.
>>
>> The sPAPRXive object models the source engine and the routing
>> engine in one object.
>
> Yeah, I suspect we don't want that. Although it might seem simpler in
> the spapr case, at least at first glance, I think it will cause us
> problems later. At the very least, it's likely to make it harder to
> share code between the spapr and powernv case. I think it will also
> make for more confusion about exactly what things belong where.
I tend to agree.
We need to clarify (a bit) what is in the XIVE interrupt controller
silicon, and how XIVE works. The XIVE device models for spapr and
powernv should be very close as the differences are small. KVM support
should be built on the spapr model.
There are 3 different sub-engines in the XIVE interrupt controller
device:
* IVSE (XiveSource model)
interrupt sources, which expose their PQ bits through ESB MMIO pages
(there are different levels of support depending on HW revision)
The XIVE interrupt controller has a set of internal sources for
IPIs and CAPI-like interrupts.
* IVRE (No real model)
in the middle, doing the routing of source event notification to
(cpu) targets. It relies on internal tables which are stored in
the hypervisor/QEMU/KVM for the spapr machine and in the VM RAM
for the powernv machine.
Configuration updates of the XIVE tables are done through hcalls
on spapr and with MMIOs on the IC regs on powernv. On the latter,
the changes are flushed back to the VM RAM.
* IVPE (XiveNVT)
set of registers for interrupt management at the CPU level. Exposed
in a specific MMIO region called the TIMA.
The XIVE tables are :
* IVT
associates an interrupt source number with an event queue. The data
to be pushed in the queue is also stored there.
* EQDT:
describes the queues in the OS RAM, also contains a set of flags,
a virtual target, etc.
* VPDT:
describes the virtual targets, which can have different natures:
an lpar, a cpu. This is for powernv; spapr does not have this
concept.
So, the idea behind the sPAPRXive object is to model a XIVE interrupt
controller device. It contains today:
- an internal source block for all interrupts: IPIs and virtual
device interrupts. In the IRQ number space, the IPIs are below
4096 and the device interrupts above, which keeps compatibility
with XICS. This is important to be able to change interrupt mode.
PowerNV has different source blocks, like for P8.
- a routing engine, which is limited to the IVT. This is a shortcut
and it might be better to introduce a specific object. Anyhow, this
is a state to capture.
In the current version I am working on, the XiveFabric interface is
more complex:

    typedef struct XiveFabricClass {
        InterfaceClass parent;

        XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
        XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
        XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
    } XiveFabricClass;
It helps in making the routing algorithm independent of the model.
I hope to make powernv converge and use it.
- a set of MMIOs for the TIMA. They model the presenter engine.
current_cpu is used to retrieve the NVT object, which holds the
registers for interrupt management.
The EQs are stored under the NVT. This saves us an unnecessary EQDT
table. But we could add one under the XIVE device model.
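With such an interface, the common routing path can be written without
knowing where the tables live, roughly like this (sketch; the
*_GET_CLASS macro just follows the usual QOM convention, nothing here
is final):

    static void xive_fabric_route(XiveFabric *xf, uint32_t lisn)
    {
        XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
        XiveIVE *ive = xfc->get_ive(xf, lisn);
        XiveEQ *eq;

        if (!ive || !(ive->w & IVE_VALID) || (ive->w & IVE_MASKED)) {
            return;
        }

        eq = xfc->get_eq(xf, GETFIELD(IVE_EQ_INDEX, ive->w));
        if (!eq) {
            return;
        }

        /* Push IVE_EQ_DATA in the EQ and notify the NVT owning it,
         * looked up with xfc->get_nvt(). */
    }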
>> I have merged these for simplicity and because the interrupt
>> controller has an internal source for the interrupts of the "IPI"
>> type, which are used for the CPU IPIs but also for other generic
>> interrupts, like the OpenCAPI ones. The XIVE sPAPR interface is
>> also much simpler than the baremetal one, all the tables are
>> maintained in the hypervisor, so this choice made some sense.
>>
>> But since, I have started the PowerNV model and I am duplicating
>> a lot of code to handle the triggering and the MMIOs in the
>> different sources. So I am not convinced anymore. Nevertheless,
>> the overall routing logic is the same even if some the tables
>> are not located in QEMU anymore, but in the machine memory.
>>
>> The sPAPRXiveNVT models some of the CPU presenter engine. It
>> holds the virtual CPU interrupt states when not dispatched on
>> a real HW thread. Real world is more complex. There are "CAM"
>> lines in the HW threads which are compared to find a matching
>> candidate. But I don't think we need to anything more complex
>> than today unless we want to support KVM under TCG ...
>>
>>> I suspect that's not what you
>>> want - I had one of those for XICs which proved to be a mistake
>>> (eventually replaced by the XICSFabric interface).
>>
>> The XICSFabric would be the main Xive object. The interface
>> between the sources and the routing engine is hidden in sPAPR,
>> we can use a simple function call :
>>
>> spapr_xive_irq(pnv->xive, irq);
>>
>> we could get rid of the qirqs but they are required for XICS.
>
> I don't quite follow, but this doesn't sound right.
I don't remember what I had in mind at that time. Let's forget it.
>> PowerNV uses MMIOs to notify an event and it makes the modeling
>> somewhat easier. Each controller model has a notify port address
>> register on which a interrupt number is written to forward an
>> event to the routing engine. So it is a simple store.
>>
>> I don't know why there is a different notify port address per
>> source, may be for extra filtering at the routing engine level.
>>
>>> Changing the model later isn't impossible, but doing so without
>>> breaking migration can be a real pain, so I think it's worth a
>>> reasonable effort to try and get it right initially.
>>
>> I completely agree.
>>
>> This is why I have started the PnvXive model to challenge the
>> current PAPR design. I have hacked a bunch of patches for XIVE,
>> LPC, PSI, OCC and basic PPC support which boot a PowerNV P9 up to
>> petitboot. It would look better with a source object, but the
>> location of the PQ bits is a bit problematic. It highly depends
>> on the controller. The main controller uses tables in the hypervisor
>> memory. The PSIHB controller has its own bits. I suppose it is
>> the same for PHB4. I need to take a closer look at how we could
>> have a common source object.
>
> Ok, sounds like a good idea.
I have made progress and the spapr and powernv models are nearly
reconciled. The only difference in the routing algorithm is the
privilege level at which the NVT notification is done; powernv works
at the HV level.
>> The most important part is KVM support and how we expose the
>> MMIO region. We need to make progress on that topic.
For KVM, a set of *_kvm objects handles the differences with the
emulated mode. ram_device memory regions are needed for the ESB
MMIO pages and the TIMA. That's mostly it.
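As a sketch, mapping one of these pages could look like the following;
the 'esb_mmio' MemoryRegion field and the mmap'ed pointer are
assumptions for the example, not part of this patch :

static void spapr_xive_kvm_map_esb(sPAPRXive *xive, void *esb_mmap,
                                   uint64_t esb_size)
{
    /* 'esb_mmap' would come from mmap()ing the KVM XIVE device fd */
    memory_region_init_ram_device_ptr(&xive->esb_mmio, OBJECT(xive),
                                      "xive.esb-kvm", esb_size, esb_mmap);
    sysbus_init_mmio(SYS_BUS_DEVICE(xive), &xive->esb_mmio);
}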
C.
On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> On 04/12/2018 07:07 AM, David Gibson wrote:
> > On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> >>>> With the POWER9 processor comes a new interrupt controller called
> >>>> XIVE. It is composed of three sub-engines :
> >>>>
> >>>> - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
> >>>> in the main controller for the IPIS and in the PSI host
> >>>> bridge. They are configured to feed the IVRE with events.
> >>>>
> >>>> - Interrupt Virtualization Routing Engine (IVRE). Their job is to
> >>>> match an event source with a Notification Virtualization Target
> >>>> (NVT), a priority and an Event Queue (EQ) to determine if a
> >>>> Virtual Processor can handle the event.
> >>>>
> >>>> - Interrupt Virtualization Presentation Engine (IVPE). It maintains
> >>>> the interrupt state of each hardware thread and present the
> >>>> notification as an external exception.
> >>>>
> >>>> Each of the engines uses a set of internal tables to redirect
> >>>> exceptions from event sources to CPU threads. The first table we
> >>>> introduce is the Interrupt Virtualization Entry (IVE) table, part of
> >>>> the virtualization engine in charge of routing events. It associates
> >>>> event sources (IRQ numbers) to event queues which will forward, or
> >>>> not, the event notification to the presentation controller.
> >>>>
> >>>> The XIVE model is designed to make use of the full range of the IRQ
> >>>> number space and does not use an offset like the XICS mode does.
> >>>> Hence, the IVE table is directly indexed by the IRQ number.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>
> >>> As you've suggested in yourself, I think we might need to more
> >>> explicitly model the different components of the XIVE system. As part
> >>> of that, I think you need to be clearer in this base skeleton about
> >>> exactly what component your XIVE object represents.
> >
> > Sorry it's been so long since I looked at these.
>
> That's fine. I have been working on a XIVE device model for the PowerNV
> machine and KVM support for the pseries. I have a better understanding
> of the overall picture.
>
> The patchset has not changed much so we can still discuss on this
> basis without me flooding the mailing list.
>
> >> ok. The base skeleton is the IVRE, the central engine handling
> >> the routing.
> >>
> >>> If the answer is "the overall thing"
> >>
> >> Yes, it is more or less that currently.
> >>
> >> The sPAPRXive object models the source engine and the routing
> >> engine in one object.
> >
> > Yeah, I suspect we don't want that. Although it might seem simpler in
> > the spapr case, at least at first glance, I think it will cause us
> > problems later. At the very least, it's likely to make it harder to
> > share code between the spapr and powernv case. I think it will also
> > make for more confusion about exactly what things belong where.
>
> I tend to agree.
>
> We need to clarify (a bit) what is in the XIVE interrupt controller
> silicon, and how XIVE works. The XIVE device models for spapr and
> powernv should be very close as the differences are small. KVM support
> should be built on the spapr model.
>
> There are 3 different sub-engines in the XIVE interrupt controller
> device :
>
> * IVSE (XiveSource model)
>
> interrupt sources, which expose their PQ bits through ESB MMIO pages
> (there are different levels of support depending on HW revision)
>
> The XIVE interrupt controller has a set of internal sources for
> IPIs and CAPI like interrupts.
Ok. IIUC in hardware there's one of these in each PHB, plus maybe one
or two others. Is that right?
>
> * IVRE (No real model)
>
> in the middle, doing the routing of source event notification to
> (cpu) targets. It relies on internal tables which are stored in
> the hypervisor/QEMU/KVM for the spapr machine and in the VM RAM
> for the powernv machine.
What does VM RAM mean in the powernv context?
> Configuration updates of the XIVE tables are done through hcalls
> on spapr and with MMIOs on the IC regs on powernv. On the latter,
> the changes are flushed backed in the VM RAM.
>
> * IVPE (XiveNVT)
>
> set of registers for interrupt management at the CPU level. Exposed
> in a specific MMIO region called the TIMA.
Ok.
> The XIVE tables are :
>
> * IVT
>
> associate an interrupt source number with an event queue. the data
> to be pushed in the queue is stored there also.
Ok, so there would be one of these tables for each IVRE, with one
entry for each source managed by that IVSE, yes?
Do the XIVE IPIs have entries here, or do they bypass this?
> * EQDT:
>
> describes the queues in the OS RAM, also contains a set of flags,
> a virtual target, etc.
So on real hardware this would be global, yes? And it would be
consulted by the IVRE?
For guests, we'd expect one table per-guest? How would those be
integrated with the host table?
> * VPDT:
>
> describe the virtual targets, which can have different natures,
> a lpar, a cpu. This is for powernv, spapr does not have this
> concept.
Ok. On hardware that would also be global and consulted by the IVRE,
yes?
Under PAPR, I'm guessing the concept is missing because it essentially
has a fixed contents: an entry for each vcpu and maybe one for the
lpar as a whole?
> So, the idea behind the sPAPRXive object is to model a XIVE interrupt
> controller device. It contains today :
Yeah, what a "XIVE interrupt controller device" is, is not really clear to
me. If it's something that is necessarily global, I think you'll be
better off making it a machine-interface rather than a distinct
object.
>
> - an internal source block for all interrupts : IPIs and virtual
> device interrupts. In the IRQ number space, the IPIs are below
> 4096 and the device interrupts above, which keeps compatibility
> with XICS. This is important to be able to change interrupt mode.
>
> PowerNV has different source blocks, like for P8.
>
> - a routing engine, which is limited to the IVT. This is a shortcut
> and it might be better to introduce a specific object. Anyhow, this
> is a state to capture.
Ok. It sounds like this is roughly the equivalent of the XICSFabric,
and likewise would probably be better handled by an interface on the
machine rather than a distinct object. But I'm not clear enough to be
certain of that yet.
> In the current version I am working on, the XiveFabric interface is
> more complex :
>
> typedef struct XiveFabricClass {
> InterfaceClass parent;
> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
This does an IVT lookup, I take it?
> XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
This one a VPDT lookup, yes?
> XiveEQ *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
And this one an EQDT lookup?
> } XiveFabricClass;
>
> It helps in making the routing algorithm independent of the model.
> I hope to make powernv converge and use it.
>
> - a set of MMIOs for the TIMA. They model the presenter engine.
> current_cpu is used to retrieve the NVT object, which holds the
> registers for interrupt management.
Right. Now the TIMA is local to a target/server not an EQ, right?
I guess we need at least one of these per-vcpu. Do we also need an
lpar-global, or other special ones?
> The EQs are stored under the NVT. This saves us an unnecessary EQDT
> table. But we could add one under the XIVE device model.
I'm not sure of the distinction you're drawing between the NVT and the
XIVE device model.
[snip]
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 04/16/2018 06:26 AM, David Gibson wrote:
> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
>>>>>> With the POWER9 processor comes a new interrupt controller called
>>>>>> XIVE. It is composed of three sub-engines :
>>>>>>
>>>>>> - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
>>>>>> in the main controller for the IPIS and in the PSI host
>>>>>> bridge. They are configured to feed the IVRE with events.
>>>>>>
>>>>>> - Interrupt Virtualization Routing Engine (IVRE). Their job is to
>>>>>> match an event source with a Notification Virtualization Target
>>>>>> (NVT), a priority and an Event Queue (EQ) to determine if a
>>>>>> Virtual Processor can handle the event.
>>>>>>
>>>>>> - Interrupt Virtualization Presentation Engine (IVPE). It maintains
>>>>>> the interrupt state of each hardware thread and present the
>>>>>> notification as an external exception.
>>>>>>
>>>>>> Each of the engines uses a set of internal tables to redirect
>>>>>> exceptions from event sources to CPU threads. The first table we
>>>>>> introduce is the Interrupt Virtualization Entry (IVE) table, part of
>>>>>> the virtualization engine in charge of routing events. It associates
>>>>>> event sources (IRQ numbers) to event queues which will forward, or
>>>>>> not, the event notification to the presentation controller.
>>>>>>
>>>>>> The XIVE model is designed to make use of the full range of the IRQ
>>>>>> number space and does not use an offset like the XICS mode does.
>>>>>> Hence, the IVE table is directly indexed by the IRQ number.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>
>>>>> As you've suggested in yourself, I think we might need to more
>>>>> explicitly model the different components of the XIVE system. As part
>>>>> of that, I think you need to be clearer in this base skeleton about
>>>>> exactly what component your XIVE object represents.
>>>
>>> Sorry it's been so long since I looked at these.
>>
>> That's fine. I have been working on a XIVE device model for the PowerNV
>> machine and KVM support for the pseries. I have a better understanding
>> of the overall picture.
>>
>> The patchset has not changed much so we can still discuss on this
>> basis without me flooding the mailing list.
>>
>>>> ok. The base skeleton is the IVRE, the central engine handling
>>>> the routing.
>>>>
>>>>> If the answer is "the overall thing"
>>>>
>>>> Yes, it is more or less that currently.
>>>>
>>>> The sPAPRXive object models the source engine and the routing
>>>> engine in one object.
>>>
>>> Yeah, I suspect we don't want that. Although it might seem simpler in
>>> the spapr case, at least at first glance, I think it will cause us
>>> problems later. At the very least, it's likely to make it harder to
>>> share code between the spapr and powernv case. I think it will also
>>> make for more confusion about exactly what things belong where.
>>
>> I tend to agree.
>>
>> We need to clarify (a bit) what is in the XIVE interrupt controller
>> silicon, and how XIVE works. The XIVE device models for spapr and
>> powernv should be very close as the differences are small. KVM support
>> should be built on the spapr model.
>>
>> There are 3 different sub-engines in the XIVE interrupt controller
>> device :
>>
>> * IVSE (XiveSource model)
>>
>> interrupt sources, which expose their PQ bits through ESB MMIO pages
>> (there are different levels of support depending on HW revision)
>>
>> The XIVE interrupt controller has a set of internal sources for
>> IPIs and CAPI like interrupts.
>
> Ok. IIUC in hardware there's one of these in each PHB,
yes
> plus maybe one or two others. Is that right?
yes. PSI for instance on PowerNV. I have this device as a first
xive source on PowerNV
>>
>> * IVRE (No real model)
>>
>> in the middle, doing the routing of source event notification to
>> (cpu) targets. It relies on internal tables which are stored in
>> the hypervisor/QEMU/KVM for the spapr machine and in the VM RAM
>> for the powernv machine.
>
> What does VM RAM mean in the powernv context?
The PowerNV is indeed not a VM. So I meant the RAM of the QEMU PowerNV
machine. skiboot does the allocation and the HW setup using a set of
IC registers exposed as MMIOs.
>> Configuration updates of the XIVE tables are done through hcalls
>> on spapr and with MMIOs on the IC regs on powernv. On the latter,
>> the changes are flushed backed in the VM RAM.
>>
>> * IVPE (XiveNVT)
>>
>> set of registers for interrupt management at the CPU level. Exposed
>> in a specific MMIO region called the TIMA.
>
> Ok.
>
>> The XIVE tables are :
>>
>> * IVT
>>
>> associate an interrupt source number with an event queue. the data
>> to be pushed in the queue is stored there also.
>
> Ok, so there would be one of these tables for each IVRE,
yes. one for each XIVE interrupt controller. That is one per processor
or socket.
> with one entry for each source managed by that IVSE, yes?
yes. The table is simply indexed by the interrupt number in the
global IRQ number space of the machine.
> Do the XIVE IPIs have entries here, or do they bypass this?
no, they do not bypass it. The IPIs also have entries in this table.
>> * EQDT:
>>
>> describes the queues in the OS RAM, also contains a set of flags,
>> a virtual target, etc.
>
> So on real hardware this would be global, yes? And it would be
> consulted by the IVRE?
yes. Exactly. The XIVE routing routine :
https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
gives a good overview of the usage of the tables.
> For guests, we'd expect one table per-guest?
yes but only in emulation mode.
> How would those be integrated with the host table?
Under KVM, this is handled by the host table (setup done in skiboot)
and we are only interested in the state of the EQs for migration.
This state is set with the H_INT_SET_QUEUE_CONFIG hcall, followed
by an OPAL call and then a HW update. It defines the EQ page in which
to push event notifications for the server/priority pair.
>> * VPDT:
>>
>> describe the virtual targets, which can have different natures,
>> a lpar, a cpu. This is for powernv, spapr does not have this
>> concept.
>
> Ok On hardware that would also be global and consulted by the IVRE,
> yes?
yes.
> Under PAPR, I'm guessing the concept is missing because it essentially
> has a fixed contents: an entry for each vcpu
yes.
> and maybe one for the lpar as a whole?
That would be more a host concept. But, yes, it exists in XIVE.
>> So, the idea behind the sPAPRXive object is to model a XIVE interrupt
>> controller device. It contains today :
>
> Yeah, what a "XIVE interrupt controller device" is not really clear to
> me. If it's something that is necessarily global, I think you'll be
> better off making it a machine-interface rather than a distinct
> object.
hmm, OK. We do need a XiveSource object (like in the XICS) and an IVE
table. reshuffling is not a big problem. But then, we also have the
associated KVM device which is very much like the QEMU emulated device.
>> - an internal source block for all interrupts : IPIs and virtual
>> device interrupts. In the IRQ number space, the IPIs are below
>> 4096 and the device interrupts above, which keeps compatibility
>> with XICS. This is important to be able to change interrupt mode.
>>
>> PowerNV has different source blocks, like for P8.
>>
>> - a routing engine, which is limited to the IVT. This is a shortcut
>> and it might be better to introduce a specific object. Anyhow, this
>> is a state to capture.
>
> Ok. It sounds like this is roughly the equivalent of the XICSFabric,
> and likewise would probably be better handled by an interface on the
> machine
yes indeed. it is the case in v3. PowerNV isn't quite in sync with
this concept but it is getting close.
> rather than a distinct object. But I'm not clear enough to be
> certain of that yet.
but we need to put the IVT somewhere.
>> In the current version I am working on, the XiveFabric interface is
>> more complex :
>>
>> typedef struct XiveFabricClass {
>> InterfaceClass parent;
>> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>
> This does an IVT lookup, I take it?
yes. It is an interface for the underlying storage, which is different
in sPAPR and PowerNV. The goal is to make the routing generic.
>> XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>
> This one a VPDT lookup, yes?
yes.
>> XiveEQ *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>
> And this one an EQDT lookup?
yes.
>> } XiveFabricClass;
>>
>> It helps in making the routing algorithm independent of the model.
>> I hope to make powernv converge and use it.
>>
>> - a set of MMIOs for the TIMA. They model the presenter engine.
>> current_cpu is used to retrieve the NVT object, which holds the
>> registers for interrupt management.
>
> Right. Now the TIMA is local to a target/server not an EQ, right?
The TIMA is the MMIO giving access to the registers which are per CPU.
The EQs are for routing. They are under the CPU object because it is
convenient.
> I guess we need at least one of these per-vcpu.
yes.
> Do we also need an lpar-global, or other special ones?
That would be for the host. AFAICT KVM does not use such special
VPs.
>> The EQs are stored under the NVT. This saves us an unnecessary EQDT
>> table. But we could add one under the XIVE device model.
>
> I'm not sure of the distinction you're drawing between the NVT and the
> XIVE device mode.
we could add a new table under the XIVE interrupt device model
sPAPRXive to store the EQs and index them like skiboot does.
But it seems unnecessary to me as we can use the object below
'cpu->intc', which is the XiveNVT object.
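A sketch of what I mean; the number of priorities and the field names
are illustrative only :

#define XIVE_PRIORITY_MAX  7

typedef struct XiveNVT {
    /* ... per-CPU interrupt management registers ... */
    XiveEQ eqt[XIVE_PRIORITY_MAX + 1];   /* one EQ per priority */
} XiveNVT;

The EQ for a (server, priority) pair is then simply the eqt[priority]
entry of the XiveNVT below 'cpu->intc', without any separate EQDT.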
C.
On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
> On 04/16/2018 06:26 AM, David Gibson wrote:
> > On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> >> On 04/12/2018 07:07 AM, David Gibson wrote:
> >>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >>>> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater
> wrote:
[snip]
> >> The XIVE tables are :
> >>
> >> * IVT
> >>
> >> associate an interrupt source number with an event queue. the data
> >> to be pushed in the queue is stored there also.
> >
> > Ok, so there would be one of these tables for each IVRE,
>
> yes. one for each XIVE interrupt controller. That is one per processor
> or socket.
Ah.. so there can be more than one in a multi-socket system.
> > with one entry for each source managed by that IVSE, yes?
>
> yes. The table is simply indexed by the interrupt number in the
> global IRQ number space of the machine.
How does that work on a multi-chip machine? Does each chip just have
a table for a slice of the global irq number space?
> > Do the XIVE IPIs have entries here, or do they bypass this?
>
> no. The IPIs have entries also in this table.
>
> >> * EQDT:
> >>
> >> describes the queues in the OS RAM, also contains a set of flags,
> >> a virtual target, etc.
> >
> > So on real hardware this would be global, yes? And it would be
> > consulted by the IVRE?
>
> yes. Exactly. The XIVE routing routine :
>
> https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>
> gives a good overview of the usage of the tables.
>
> > For guests, we'd expect one table per-guest?
>
> yes but only in emulation mode.
I'm not sure what you mean by this.
> > How would those be integrated with the host table?
>
> Under KVM, this is handled by the host table (setup done in skiboot)
> and we are only interested in the state of the EQs for migration.
This doesn't make sense to me; the guest is able to alter the IVT
entries, so that configuration must be migrated somehow.
> This state is set with the H_INT_SET_QUEUE_CONFIG hcall,
"This state" here meaning IVT entries?
> followed
> by an OPAL call and then a HW update. It defines the EQ page in which
> to push event notification for the couple server/priority.
>
> >> * VPDT:
> >>
> >> describe the virtual targets, which can have different natures,
> >> a lpar, a cpu. This is for powernv, spapr does not have this
> >> concept.
> >
> > Ok On hardware that would also be global and consulted by the IVRE,
> > yes?
>
> yes.
Except.. is it actually global, or is there one per-chip/socket?
[snip]
> >> In the current version I am working on, the XiveFabric interface is
> >> more complex :
> >>
> >> typedef struct XiveFabricClass {
> >> InterfaceClass parent;
> >> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> >
> > This does an IVT lookup, I take it?
>
> yes. It is an interface for the underlying storage, which is different
> in sPAPR and PowerNV. The goal is to make the routing generic.
Right. So, yes, we definitely want a method *somewhere* to do an IVT
lookup. I'm not entirely sure where it belongs yet.
>
> >> XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
> >
> > This one a VPDT lookup, yes?
>
> yes.
>
> >> XiveEQ *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
> >
> > And this one an EQDT lookup?
>
> yes.
>
> >> } XiveFabricClass;
> >>
> >> It helps in making the routing algorithm independent of the model.
> >> I hope to make powernv converge and use it.
> >>
> >> - a set of MMIOs for the TIMA. They model the presenter engine.
> >> current_cpu is used to retrieve the NVT object, which holds the
> >> registers for interrupt management.
> >
> > Right. Now the TIMA is local to a target/server not an EQ, right?
>
> The TIMA is the MMIO giving access to the registers which are per CPU.
> The EQ are for routing. They are under the CPU object because it is
> convenient.
>
> > I guess we need at least one of these per-vcpu.
>
> yes.
>
> > Do we also need an lpar-global, or other special ones?
>
> That would be for the host. AFAICT KVM does not use such special
> VPs.
Um.. "does not use".. don't we get to decide that?
> >> The EQs are stored under the NVT. This saves us an unnecessary EQDT
> >> table. But we could add one under the XIVE device model.
> >
> > I'm not sure of the distinction you're drawing between the NVT and the
> > XIVE device mode.
>
> we could add a new table under the XIVE interrupt device model
> sPAPRXive to store the EQs and indexed them like skiboot does.
> But it seems unnecessary to me as we can use the object below
> 'cpu->intc', which is the XiveNVT object.
So, basically assuming a fixed set of EQs (one per priority?) per CPU
for a PAPR guest? That makes sense (assuming PAPR doesn't provide
guest interfaces to ask for something else).
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 04/26/2018 07:36 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
>> On 04/16/2018 06:26 AM, David Gibson wrote:
>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater
>> wrote:
> [snip]
>>>> The XIVE tables are :
>>>>
>>>> * IVT
>>>>
>>>> associate an interrupt source number with an event queue. the data
>>>> to be pushed in the queue is stored there also.
>>>
>>> Ok, so there would be one of these tables for each IVRE,
>>
>> yes. one for each XIVE interrupt controller. That is one per processor
>> or socket.
>
> Ah.. so there can be more than one in a multi-socket system.
> >>> with one entry for each source managed by that IVSE, yes?
>>
>> yes. The table is simply indexed by the interrupt number in the
>> global IRQ number space of the machine.
>
> How does that work on a multi-chip machine? Does each chip just have
> a table for a slice of the global irq number space?
yes. IRQ allocation is done relative to the chip, each chip having
a range depending on its block id. XIVE has a concept of block,
which is used in skiboot in a one-to-one relationship with the chip.
>>> Do the XIVE IPIs have entries here, or do they bypass this?
>>
>> no. The IPIs have entries also in this table.
>>
>>>> * EQDT:
>>>>
>>>> describes the queues in the OS RAM, also contains a set of flags,
>>>> a virtual target, etc.
>>>
>>> So on real hardware this would be global, yes? And it would be
>>> consulted by the IVRE?
>>
>> yes. Exactly. The XIVE routing routine :
>>
>> https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>>
>> gives a good overview of the usage of the tables.
>>
>>> For guests, we'd expect one table per-guest?
>>
>> yes but only in emulation mode.
>
> I'm not sure what you mean by this.
I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall
table allocated in OPAL for the system.
>>> How would those be integrated with the host table?
>>
>> Under KVM, this is handled by the host table (setup done in skiboot)
>> and we are only interested in the state of the EQs for migration.
>
> This doesn't make sense to me; the guest is able to alter the IVT
> entries, so that configuration must be migrated somehow.
yes. The IVE needs to be migrated. We use get/set KVM ioctls to save
and restore the value which is cached in the KVM irq state struct
(server, prio, eq data). no OPAL calls are needed though.
>> This state is set with the H_INT_SET_QUEUE_CONFIG hcall,
>
> "This state" here meaning IVT entries?
no. The H_INT_SET_QUEUE_CONFIG hcall sets the event queue OS page for a
server/priority pair. That is where the event queue data is pushed.
H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
and the EQ data to be pushed in case of an event.
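For illustration, folding that targeting information back into the IVE
of this patch could be done with the SETFIELD helper; how the
server/priority pair maps to an EQ index is left behind an invented
helper :

static void spapr_xive_set_source_config(sPAPRXive *xive, uint32_t lisn,
                                         uint32_t server, uint8_t prio,
                                         uint32_t eq_data)
{
    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);

    if (!ive) {
        return;
    }

    /* spapr_xive_target_to_eq_idx() is an invented name mapping the
     * server/priority pair to an EQ index */
    ive->w = SETFIELD(IVE_EQ_INDEX, ive->w,
                      spapr_xive_target_to_eq_idx(server, prio));
    ive->w = SETFIELD(IVE_EQ_DATA, ive->w, eq_data);
    ive->w &= ~IVE_MASKED;
}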
>> followed
>> by an OPAL call and then a HW update. It defines the EQ page in which
>> to push event notification for the couple server/priority.
>>
>>>> * VPDT:
>>>>
>>>> describe the virtual targets, which can have different natures,
>>>> a lpar, a cpu. This is for powernv, spapr does not have this
>>>> concept.
>>>
>>> Ok On hardware that would also be global and consulted by the IVRE,
>>> yes?
>>
>> yes.
>
> Except.. is it actually global, or is there one per-chip/socket?
There is a global VP allocator splitting the ids depending on the
block/chip but, to be honest, I have not dug into the details.
> [snip]
>>>> In the current version I am working on, the XiveFabric interface is
>>>> more complex :
>>>>
>>>> typedef struct XiveFabricClass {
>>>> InterfaceClass parent;
>>>> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>
>>> This does an IVT lookup, I take it?
>>
>> yes. It is an interface for the underlying storage, which is different
>> in sPAPR and PowerNV. The goal is to make the routing generic.
>
> Right. So, yes, we definitely want a method *somehwere* to do an IVT
> lookup. I'm not entirely sure where it belongs yet.
Me neither. I have stuffed the XiveFabric with all the abstraction
needed for the moment.
I am starting to think that there should be an interface to forward
events and another one to route them, the router being a special case
of the forwarder (the last one in the chain). The "simple" devices,
like PSI, should only be forwarders for the sources they own, but the
interrupt controllers should be forwarders (they have sources) and
also routers.
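To make the idea a bit more concrete, a tentative sketch of the two
interfaces; the names are not final and the QOM plumbing is omitted :

/* anything owning sources (PSI, PHBs, the main IC) only forwards */
typedef struct XiveForwarderClass {
    InterfaceClass parent;
    void (*notify)(XiveForwarder *xf, uint32_t lisn);
} XiveForwarderClass;

/* the interrupt controllers also route, using the table lookups */
typedef struct XiveRouterClass {
    XiveForwarderClass parent;
    XiveIVE *(*get_ive)(XiveRouter *xr, uint32_t lisn);
    XiveEQ  *(*get_eq)(XiveRouter *xr, uint32_t eq_idx);
    XiveNVT *(*get_nvt)(XiveRouter *xr, uint32_t server);
} XiveRouterClass;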
>>>> XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>>
>>> This one a VPDT lookup, yes?
>>
>> yes.
>>
>>>> XiveEQ *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>>>
>>> And this one an EQDT lookup?
>>
>> yes.
>>
>>>> } XiveFabricClass;
>>>>
>>>> It helps in making the routing algorithm independent of the model.
>>>> I hope to make powernv converge and use it.
>>>>
>>>> - a set of MMIOs for the TIMA. They model the presenter engine.
>>>> current_cpu is used to retrieve the NVT object, which holds the
>>>> registers for interrupt management.
>>>
>>> Right. Now the TIMA is local to a target/server not an EQ, right?
>>
>> The TIMA is the MMIO giving access to the registers which are per CPU.
>> The EQ are for routing. They are under the CPU object because it is
>> convenient.
>>
>>> I guess we need at least one of these per-vcpu.
>>
>> yes.
>>
>>> Do we also need an lpar-global, or other special ones?
>>
>> That would be for the host. AFAICT KVM does not use such special
>> VPs.
>
> Um.. "does not use".. don't we get to decide that?
Well, that part in the specs is still a little obscure for me and
I am not sure it will fit very well in the Linux/KVM model. It should
be hidden from the guest anyway and can come in later.
>>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT
>>>> table. But we could add one under the XIVE device model.
>>>
>>> I'm not sure of the distinction you're drawing between the NVT and the
>>> XIVE device mode.
>>
>> we could add a new table under the XIVE interrupt device model
>> sPAPRXive to store the EQs and indexed them like skiboot does.
>> But it seems unnecessary to me as we can use the object below
>> 'cpu->intc', which is the XiveNVT object.
>
> So, basically assuming a fixed set of EQs (one per priority?)
yes. It's easier to capture the state and dump information from
the monitor.
> per CPU for a PAPR guest?
yes, that's how it works.
> That makes sense (assuming PAPR doesn't provide guest interfaces to
> ask for something else).
Yes. All hcalls take prio/server parameters and the reserved prio range
for the platform is in the device tree. 0xFF is a special case to reset
targeting.
Thanks,
C.
On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
> On 04/26/2018 07:36 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
> >> On 04/16/2018 06:26 AM, David Gibson wrote:
> >>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> >>>> On 04/12/2018 07:07 AM, David Gibson wrote:
> >>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater
> >> wrote:
> > [snip]
> >>>> The XIVE tables are :
> >>>>
> >>>> * IVT
> >>>>
> >>>> associate an interrupt source number with an event queue. the data
> >>>> to be pushed in the queue is stored there also.
> >>>
> >>> Ok, so there would be one of these tables for each IVRE,
> >>
> >> yes. one for each XIVE interrupt controller. That is one per processor
> >> or socket.
> >
> > Ah.. so there can be more than one in a multi-socket system.
> > >>> with one entry for each source managed by that IVSE, yes?
> >>
> >> yes. The table is simply indexed by the interrupt number in the
> >> global IRQ number space of the machine.
> >
> > How does that work on a multi-chip machine? Does each chip just have
> > a table for a slice of the global irq number space?
>
> yes. IRQ Allocation is done relative to the chip, each chip having
> a range depending on its block id. XIVE has a concept of block,
> which is used in skiboot in a one-to-one relationship with the chip.
Ok. I'm assuming this block id forms the high(ish) bits of the global
irq number, yes?
> >>> Do the XIVE IPIs have entries here, or do they bypass this?
> >>
> >> no. The IPIs have entries also in this table.
> >>
> >>>> * EQDT:
> >>>>
> >>>> describes the queues in the OS RAM, also contains a set of flags,
> >>>> a virtual target, etc.
> >>>
> >>> So on real hardware this would be global, yes? And it would be
> >>> consulted by the IVRE?
> >>
> >> yes. Exactly. The XIVE routing routine :
> >>
> >> https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
> >>
> >> gives a good overview of the usage of the tables.
> >>
> >>> For guests, we'd expect one table per-guest?
> >>
> >> yes but only in emulation mode.
> >
> > I'm not sure what you mean by this.
>
> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall
> table allocated in OPAL for the system.
Right.. I'm thinking of this from the point of view of the guest
and/or qemu, rather than from the implementation. Even if the actual
storage of the entries is distributed across the host's global table,
we still logically have a table per guest, right?
> >>> How would those be integrated with the host table?
> >>
> >> Under KVM, this is handled by the host table (setup done in skiboot)
> >> and we are only interested in the state of the EQs for migration.
> >
> > This doesn't make sense to me; the guest is able to alter the IVT
> > entries, so that configuration must be migrated somehow.
>
> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save
> and restore the value which is cached in the KVM irq state struct
> (server, prio, eq data). no OPAL calls are needed though.
Right. Again, at this stage I don't particularly care what the
backend details are - whether the host calls OPAL or whatever. I'm
more concerned with the logical model.
> >> This state is set with the H_INT_SET_QUEUE_CONFIG hcall,
> >
> > "This state" here meaning IVT entries?
>
> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a
> server/priority couple. That is where the event queue data is
> pushed.
Ah. Doesn't that mean the guest *does* effectively have an EQD table,
updated by this call? We'd need to migrate that data as well, and
it's not part of the IVT, right?
> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
> and the eq data to be pushed in case of an event.
Ok - that's the IVT entries, yes?
>
> >> followed
> >> by an OPAL call and then a HW update. It defines the EQ page in which
> >> to push event notification for the couple server/priority.
> >>
> >>>> * VPDT:
> >>>>
> >>>> describe the virtual targets, which can have different natures,
> >>>> a lpar, a cpu. This is for powernv, spapr does not have this
> >>>> concept.
> >>>
> >>> Ok On hardware that would also be global and consulted by the IVRE,
> >>> yes?
> >>
> >> yes.
> >
> > Except.. is it actually global, or is there one per-chip/socket?
>
> There is a global VP allocator splitting the ids depending on the
> block/chip, but, to be honest, I have not dug in the details
>
> > [snip]
> >>>> In the current version I am working on, the XiveFabric interface is
> >>>> more complex :
> >>>>
> >>>> typedef struct XiveFabricClass {
> >>>> InterfaceClass parent;
> >>>> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> >>>
> >>> This does an IVT lookup, I take it?
> >>
> >> yes. It is an interface for the underlying storage, which is different
> >> in sPAPR and PowerNV. The goal is to make the routing generic.
> >
> > Right. So, yes, we definitely want a method *somehwere* to do an IVT
> > lookup. I'm not entirely sure where it belongs yet.
>
> Me either. I have stuffed the XiveFabric with all the abstraction
> needed for the moment.
>
> I am starting to think that there should be an interface to forward
> events and another one to route them. The router being a special case
> of the forwarder, the last one. The "simple" devices, like PSI, should
> only be forwarders for the sources they own but the interrupt controllers
> should be forwarders (they have sources) and also routers.
I'm not really clear what you mean by "forward" here.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 05/03/2018 04:29 AM, David Gibson wrote:
> On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
>> On 04/26/2018 07:36 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
>>>> On 04/16/2018 06:26 AM, David Gibson wrote:
>>>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>>>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater
>>>> wrote:
>>> [snip]
>>>>>> The XIVE tables are :
>>>>>>
>>>>>> * IVT
>>>>>>
>>>>>> associate an interrupt source number with an event queue. the data
>>>>>> to be pushed in the queue is stored there also.
>>>>>
>>>>> Ok, so there would be one of these tables for each IVRE,
>>>>
>>>> yes. one for each XIVE interrupt controller. That is one per processor
>>>> or socket.
>>>
>>> Ah.. so there can be more than one in a multi-socket system.
>>> >>> with one entry for each source managed by that IVSE, yes?
>>>>
>>>> yes. The table is simply indexed by the interrupt number in the
>>>> global IRQ number space of the machine.
>>>
>>> How does that work on a multi-chip machine? Does each chip just have
>>> a table for a slice of the global irq number space?
>>
>> yes. IRQ Allocation is done relative to the chip, each chip having
>> a range depending on its block id. XIVE has a concept of block,
>> which is used in skiboot in a one-to-one relationship with the chip.
>
> Ok. I'm assuming this block id forms the high(ish) bits of the global
> irq number, yes?
yes. The top 8 bits are reserved, the next 4 bits are for the
block id (16 blocks for 16 sockets/chips), and the 20 lower bits
are for the ISN.
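Restated as macros (names invented, just to show the split) :

#define XIVE_IRQ_BLOCK_SHIFT    20
#define XIVE_IRQ_BLOCK_MASK     0xf        /* 4 bits : 16 blocks/chips */
#define XIVE_IRQ_ISN_MASK       0xfffff    /* 20 bits : source number */

/* bits 31..24 of the IRQ number are reserved */
#define XIVE_IRQ_BLOCK(irq)     (((irq) >> XIVE_IRQ_BLOCK_SHIFT) & XIVE_IRQ_BLOCK_MASK)
#define XIVE_IRQ_ISN(irq)       ((irq) & XIVE_IRQ_ISN_MASK)
#define XIVE_IRQ(blk, isn)      (((blk) << XIVE_IRQ_BLOCK_SHIFT) | (isn))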
>>>>> Do the XIVE IPIs have entries here, or do they bypass this?
>>>>
>>>> no. The IPIs have entries also in this table.
>>>>
>>>>>> * EQDT:
>>>>>>
>>>>>> describes the queues in the OS RAM, also contains a set of flags,
>>>>>> a virtual target, etc.
>>>>>
>>>>> So on real hardware this would be global, yes? And it would be
>>>>> consulted by the IVRE?
>>>>
>>>> yes. Exactly. The XIVE routing routine :
>>>>
>>>> https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>>>>
>>>> gives a good overview of the usage of the tables.
>>>>
>>>>> For guests, we'd expect one table per-guest?
>>>>
>>>> yes but only in emulation mode.
>>>
>>> I'm not sure what you mean by this.
>>
>> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall
>> table allocated in OPAL for the system.
>
> Right.. I'm thinking of this from the point of view of the guest
> and/or qemu, rather than from the implementation. Even if the actual
> storage of the entries is distributed across the host's global table,
> we still logically have a table per guest, right?
Yes. (the XiveSource object would be the table-per-guest and its
counterpart in KVM: the source block)
>>>>> How would those be integrated with the host table?
>>>>
>>>> Under KVM, this is handled by the host table (setup done in skiboot)
>>>> and we are only interested in the state of the EQs for migration.
>>>
>>> This doesn't make sense to me; the guest is able to alter the IVT
>>> entries, so that configuration must be migrated somehow.
>>
>> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save
>> and restore the value which is cached in the KVM irq state struct
>> (server, prio, eq data). no OPAL calls are needed though.
>
> Right. Again, at this stage I don't particularly care what the
> backend details are - whether the host calls OPAL or whatever. I'm
> more concerned with the logical model.
ok.
>
>>>> This state is set with the H_INT_SET_QUEUE_CONFIG hcall,
>>>
>>> "This state" here meaning IVT entries?
>>
>> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a
>> server/priority couple. That is where the event queue data is
>> pushed.
>
> Ah. Doesn't that mean the guest *does* effectively have an EQD table,
Well, yes, it is there under the hood. But the guest does not know
anything about the XIVE controller internal structures (IVE, EQD, VPD)
and tables. Only OPAL does, in fact.
> updated by this call?
it is indeed the purpose of H_INT_SET_QUEUE_CONFIG
> We'd need to migrate that data as well,
yes we do and some fields require OPAL support.
> and it's not part of the IVT, right?
yes. The IVT only contains the EQ index, the server/priority tuple used
for routing.
>> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
>> and the eq data to be pushed in case of an event.
>
> Ok - that's the IVT entries, yes?
yes.
>>>> followed
>>>> by an OPAL call and then a HW update. It defines the EQ page in which
>>>> to push event notification for the couple server/priority.
>>>>
>>>>>> * VPDT:
>>>>>>
>>>>>> describe the virtual targets, which can have different natures,
>>>>>> a lpar, a cpu. This is for powernv, spapr does not have this
>>>>>> concept.
>>>>>
>>>>> Ok On hardware that would also be global and consulted by the IVRE,
>>>>> yes?
>>>>
>>>> yes.
>>>
>>> Except.. is it actually global, or is there one per-chip/socket?
>>
>> There is a global VP allocator splitting the ids depending on the
>> block/chip, but, to be honest, I have not dug in the details
>>
>>> [snip]
>>>>>> In the current version I am working on, the XiveFabric interface is
>>>>>> more complex :
>>>>>>
>>>>>> typedef struct XiveFabricClass {
>>>>>> InterfaceClass parent;
>>>>>> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>>>
>>>>> This does an IVT lookup, I take it?
>>>>
>>>> yes. It is an interface for the underlying storage, which is different
>>>> in sPAPR and PowerNV. The goal is to make the routing generic.
>>>
>>> Right. So, yes, we definitely want a method *somehwere* to do an IVT
>>> lookup. I'm not entirely sure where it belongs yet.
>>
>> Me either. I have stuffed the XiveFabric with all the abstraction
>> needed for the moment.
>>
>> I am starting to think that there should be an interface to forward
>> events and another one to route them. The router being a special case
>> of the forwarder, the last one. The "simple" devices, like PSI, should
>> only be forwarders for the sources they own but the interrupt controllers
>> should be forwarders (they have sources) and also routers.
>
> I'm not really clear what you mean by "forward" here.
When an interrupt source is triggered, a notification event can
be generated and forwarded to the XIVE router if the transition
algorithm (depending on the PQ bits) lets it through. A forward is
a simple store of the IRQ number at a specific MMIO address defined
by the main IC.
For QEMU sPAPR, it's a function call but for QEMU powernv, it's a
store.
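Roughly, the trigger-side transition I am describing is the following;
the PQ encoding follows the XIVE documentation, the helper itself is
just a sketch :

#define XIVE_ESB_RESET     0x0    /* PQ = 00 : ready to take an event */
#define XIVE_ESB_OFF       0x1    /* PQ = 01 : source is off/masked */
#define XIVE_ESB_PENDING   0x2    /* PQ = 10 : event already forwarded */
#define XIVE_ESB_QUEUED    0x3    /* PQ = 11 : extra events coalesced */

/* returns true when the event should be forwarded to the router */
static bool xive_esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case XIVE_ESB_RESET:
        *pq = XIVE_ESB_PENDING;
        return true;
    case XIVE_ESB_PENDING:
    case XIVE_ESB_QUEUED:
        *pq = XIVE_ESB_QUEUED;
        return false;
    case XIVE_ESB_OFF:
    default:
        return false;
    }
}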
C.
On Thu, May 03, 2018 at 10:43:47AM +0200, Cédric Le Goater wrote:
> On 05/03/2018 04:29 AM, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
> >> On 04/26/2018 07:36 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
> >>>> On 04/16/2018 06:26 AM, David Gibson wrote:
> >>>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> >>>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
> >>>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >>>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater
> >>>> wrote:
> >>> [snip]
> >>>>>> The XIVE tables are :
> >>>>>>
> >>>>>> * IVT
> >>>>>>
> >>>>>> associate an interrupt source number with an event queue. the data
> >>>>>> to be pushed in the queue is stored there also.
> >>>>>
> >>>>> Ok, so there would be one of these tables for each IVRE,
> >>>>
> >>>> yes. one for each XIVE interrupt controller. That is one per processor
> >>>> or socket.
> >>>
> >>> Ah.. so there can be more than one in a multi-socket system.
> >>> >>> with one entry for each source managed by that IVSE, yes?
> >>>>
> >>>> yes. The table is simply indexed by the interrupt number in the
> >>>> global IRQ number space of the machine.
> >>>
> >>> How does that work on a multi-chip machine? Does each chip just have
> >>> a table for a slice of the global irq number space?
> >>
> >> yes. IRQ Allocation is done relative to the chip, each chip having
> >> a range depending on its block id. XIVE has a concept of block,
> >> which is used in skiboot in a one-to-one relationship with the chip.
> >
> > Ok. I'm assuming this block id forms the high(ish) bits of the global
> > irq number, yes?
>
> yes. the 8 top bits are reserved, the next 4 bits are for the
> block id, 16 blocks for 16 socket/chips, and the 20 lower bits
> are for the ISN.
Ok.
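
For illustration, the layout described above could be carved up as follows; the macro names are invented, only the 8/4/20 bit split comes from the discussion:

    #include <stdint.h>

    #define XIVE_IRQ_BLOCK_SHIFT   20
    #define XIVE_IRQ_BLOCK_MASK    0xf        /* 4 bits: up to 16 blocks/chips */
    #define XIVE_IRQ_INDEX_MASK    0xfffff    /* 20 bits of per-block ISN      */

    /* build a global IRQ number from a block id and a per-block index;
     * the top 8 bits stay reserved (zero) */
    #define XIVE_IRQ(blk, idx) \
        ((((uint32_t)(blk) & XIVE_IRQ_BLOCK_MASK) << XIVE_IRQ_BLOCK_SHIFT) | \
         ((uint32_t)(idx) & XIVE_IRQ_INDEX_MASK))

    static inline uint32_t xive_irq_block(uint32_t irq)
    {
        return (irq >> XIVE_IRQ_BLOCK_SHIFT) & XIVE_IRQ_BLOCK_MASK;
    }

    static inline uint32_t xive_irq_index(uint32_t irq)
    {
        return irq & XIVE_IRQ_INDEX_MASK;
    }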
> >>>>> Do the XIVE IPIs have entries here, or do they bypass this?
> >>>>
> >>>> no. The IPIs have entries also in this table.
> >>>>
> >>>>>> * EQDT:
> >>>>>>
> >>>>>> describes the queues in the OS RAM, also contains a set of flags,
> >>>>>> a virtual target, etc.
> >>>>>
> >>>>> So on real hardware this would be global, yes? And it would be
> >>>>> consulted by the IVRE?
> >>>>
> >>>> yes. Exactly. The XIVE routing routine :
> >>>>
> >>>> https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
> >>>>
> >>>> gives a good overview of the usage of the tables.
> >>>>
> >>>>> For guests, we'd expect one table per-guest?
> >>>>
> >>>> yes but only in emulation mode.
> >>>
> >>> I'm not sure what you mean by this.
> >>
> >> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall
> >> table allocated in OPAL for the system.
> >
> > Right.. I'm thinking of this from the point of view of the guest
> > and/or qemu, rather than from the implementation. Even if the actual
> > storage of the entries is distributed across the host's global table,
> > we still logically have a table per guest, right?
>
> Yes. (the XiveSource object would be the table-per-guest and its
> counterpart in KVM: the source block)
Uh.. I'm talking about the IVT (or a slice of it) here, so this would
be a XiveRouter, not a XiveSource owning it.
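
As a point of reference, a condensed sketch of how a router could chain the three lookups of the XiveFabricClass interface quoted earlier (get_ive, get_eq, get_nvt). The structure layouts and helper names below are placeholders, not the actual QEMU code under discussion:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct XiveFabric XiveFabric;

    typedef struct XiveIVE {
        bool     valid;
        bool     masked;
        uint32_t eq_idx;    /* which event queue the source targets */
        uint32_t eq_data;   /* data to push into that queue */
    } XiveIVE;

    typedef struct XiveEQ {
        bool     enabled;
        uint32_t server;    /* the target selected at EQ configuration */
        uint8_t  priority;
    } XiveEQ;

    /* assumed lookup helpers mirroring the XiveFabricClass methods */
    XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
    XiveEQ  *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx);
    void     xive_eq_push(XiveEQ *eq, uint32_t eq_data);
    void     xive_presenter_notify(XiveFabric *xf, uint32_t server,
                                   uint8_t priority);

    static void xive_fabric_route(XiveFabric *xf, uint32_t lisn)
    {
        XiveIVE *ive = xive_fabric_get_ive(xf, lisn);   /* IVT lookup */
        XiveEQ *eq;

        if (!ive || !ive->valid || ive->masked) {
            return;
        }

        eq = xive_fabric_get_eq(xf, ive->eq_idx);       /* EQDT lookup */
        if (!eq || !eq->enabled) {
            return;
        }

        xive_eq_push(eq, ive->eq_data);                 /* event data to OS RAM */
        xive_presenter_notify(xf, eq->server, eq->priority);  /* NVT/VPDT side */
    }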
>
> >>>>> How would those be integrated with the host table?
> >>>>
> >>>> Under KVM, this is handled by the host table (setup done in skiboot)
> >>>> and we are only interested in the state of the EQs for migration.
> >>>
> >>> This doesn't make sense to me; the guest is able to alter the IVT
> >>> entries, so that configuration must be migrated somehow.
> >>
> >> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save
> >> and restore the value which is cached in the KVM irq state struct
> >> (server, prio, eq data). no OPAL calls are needed though.
> >
> > Right. Again, at this stage I don't particularly care what the
> > backend details are - whether the host calls OPAL or whatever. I'm
> > more concerned with the logical model.
>
> ok.
>
> >
> >>>> This state is set with the H_INT_SET_QUEUE_CONFIG hcall,
> >>>
> >>> "This state" here meaning IVT entries?
> >>
> >> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a
> >> server/priority couple. That is where the event queue data is
> >> pushed.
> >
> > Ah. Doesn't that mean the guest *does* effectively have an EQD table,
>
> well, yes, it's behing the hood. but the guest does not know anything
> about the Xive controller internal structures, IVE, EQD, VPD and tables.
> Only OPAL does in fact.
Right, it's under the hood. But then so is the IVT (and the TCE
tables and the HPT for that matter). So we're probably going to have
a (*get_eqd) method somewhere that looks up in guest RAM or in an
external table depending.
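
A tiny sketch of that split: one get_eq-style entry point backed either by a table kept in the model (the sPAPR case) or by a descriptor read from RAM (the PowerNV case). All names and layouts here are illustrative:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct XiveEQ { uint64_t w[4]; } XiveEQ;   /* opaque EQ descriptor */

    typedef struct XiveEqBackend {
        bool      in_ram;       /* PowerNV-style: the EQDT lives in RAM     */
        uint64_t  eqdt_base;    /* base address of the in-RAM table         */
        XiveEQ   *eqdt;         /* sPAPR-style: EQs kept in the QEMU model  */
        size_t    nr_eqs;
        /* assumed DMA-read callback for the in-RAM case */
        int     (*read_ram)(uint64_t addr, void *buf, size_t len);
    } XiveEqBackend;

    static int xive_backend_get_eq(XiveEqBackend *xb, uint32_t eq_idx, XiveEQ *eq)
    {
        if (eq_idx >= xb->nr_eqs) {
            return -1;
        }
        if (xb->in_ram) {
            /* fetch the descriptor from the table in RAM */
            return xb->read_ram(xb->eqdt_base + (uint64_t)eq_idx * sizeof(*eq),
                                eq, sizeof(*eq));
        }
        /* the descriptor is stored directly under the model (e.g. the NVT) */
        memcpy(eq, &xb->eqdt[eq_idx], sizeof(*eq));
        return 0;
    }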
> > updated by this call?
>
> it is indeed the purpose of H_INT_SET_QUEUE_CONFIG
>
> > We'd need to migrate that data as well,
>
> yes we do and some fields require OPAL support.
>
> > and it's not part of the IVT, right?
>
> yes. The IVT only contains the EQ index, the server/priority tuple used
> for routing.
>
> >> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
> >> and the eq data to be pushed in case of an event.
> >
> > Ok - that's the IVT entries, yes?
>
> yes.
>
>
> >>>> followed
> >>>> by an OPAL call and then a HW update. It defines the EQ page in which
> >>>> to push event notification for the couple server/priority.
> >>>>
> >>>>>> * VPDT:
> >>>>>>
> >>>>>> describe the virtual targets, which can have different natures,
> >>>>>> a lpar, a cpu. This is for powernv, spapr does not have this
> >>>>>> concept.
> >>>>>
> >>>>> Ok On hardware that would also be global and consulted by the IVRE,
> >>>>> yes?
> >>>>
> >>>> yes.
> >>>
> >>> Except.. is it actually global, or is there one per-chip/socket?
> >>
> >> There is a global VP allocator splitting the ids depending on the
> >> block/chip, but, to be honest, I have not dug in the details
> >>
> >>> [snip]
> >>>>>> In the current version I am working on, the XiveFabric interface is
> >>>>>> more complex :
> >>>>>>
> >>>>>> typedef struct XiveFabricClass {
> >>>>>> InterfaceClass parent;
> >>>>>> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> >>>>>
> >>>>> This does an IVT lookup, I take it?
> >>>>
> >>>> yes. It is an interface for the underlying storage, which is different
> >>>> in sPAPR and PowerNV. The goal is to make the routing generic.
> >>>
> >>> Right. So, yes, we definitely want a method *somehwere* to do an IVT
> >>> lookup. I'm not entirely sure where it belongs yet.
> >>
> >> Me either. I have stuffed the XiveFabric with all the abstraction
> >> needed for the moment.
> >>
> >> I am starting to think that there should be an interface to forward
> >> events and another one to route them. The router being a special case
> >> of the forwarder, the last one. The "simple" devices, like PSI, should
> >> only be forwarders for the sources they own but the interrupt controllers
> >> should be forwarders (they have sources) and also routers.
> >
> > I'm not really clear what you mean by "forward" here.
>
> When a interrupt source is triggered, a notification event can
> be generated and forwarded to the XIVE router if the transition
> algo (depending on the PQ bit) lets it through. A forward is
> a simple load of the IRQ number at a specific MMIO address defined
> by the main IC.
>
> For QEMU sPAPR, it's a funtion call but for QEMU powernv, it's a
> load.
>
> C.
>
>
> >>>>>> XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
> >>>>>
> >>>>> This one a VPDT lookup, yes?
> >>>>
> >>>> yes.
> >>>>
> >>>>>> XiveEQ *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
> >>>>>
> >>>>> And this one an EQDT lookup?
> >>>>
> >>>> yes.
> >>>>
> >>>>>> } XiveFabricClass;
> >>>>>>
> >>>>>> It helps in making the routing algorithm independent of the model.
> >>>>>> I hope to make powernv converge and use it.
> >>>>>>
> >>>>>> - a set of MMIOs for the TIMA. They model the presenter engine.
> >>>>>> current_cpu is used to retrieve the NVT object, which holds the
> >>>>>> registers for interrupt management.
> >>>>>
> >>>>> Right. Now the TIMA is local to a target/server not an EQ, right?
> >>>>
> >>>> The TIMA is the MMIO giving access to the registers which are per CPU.
> >>>> The EQ are for routing. They are under the CPU object because it is
> >>>> convenient.
> >>>>
> >>>>> I guess we need at least one of these per-vcpu.
> >>>>
> >>>> yes.
> >>>>
> >>>>> Do we also need an lpar-global, or other special ones?
> >>>>
> >>>> That would be for the host. AFAICT KVM does not use such special
> >>>> VPs.
> >>>
> >>> Um.. "does not use".. don't we get to decide that?
> >>
> >> Well, that part in the specs is still a little obscure for me and
> >> I am not sure it will fit very well in the Linux/KVM model. It should
> >> be hidden to the guest anyway and can come in later.
> >>
> >>>>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT
> >>>>>> table. But we could add one under the XIVE device model.
> >>>>>
> >>>>> I'm not sure of the distinction you're drawing between the NVT and the
> >>>>> XIVE device mode.
> >>>>
> >>>> we could add a new table under the XIVE interrupt device model
> >>>> sPAPRXive to store the EQs and indexed them like skiboot does.
> >>>> But it seems unnecessary to me as we can use the object below
> >>>> 'cpu->intc', which is the XiveNVT object.
> >>>
> >>> So, basically assuming a fixed set of EQs (one per priority?)
> >>
> >> yes. It's easier to capture the state and dump information from
> >> the monitor.
> >>
> >>> per CPU for a PAPR guest?
> >>
> >> yes, that's own it works.
> >>
> >>> That makes sense (assuming PAPR doesn't provide guest interfaces to
> >>> ask for something else).
> >>
> >> Yes. All hcalls take prio/server parameters and the reserved prio range
> >> for the platform is in the device tree. 0xFF is a special case to reset
> >> targeting.
> >>
> >> Thanks,
> >>
> >> C.
> >>
> >
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 05/04/2018 08:35 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 10:43:47AM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 04:29 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 07:36 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
>>>>>> On 04/16/2018 06:26 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>>>>>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>>>>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>>>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater
>>>>>> wrote:
>>>>> [snip]
>>>>>>>> The XIVE tables are :
>>>>>>>>
>>>>>>>> * IVT
>>>>>>>>
>>>>>>>> associate an interrupt source number with an event queue. the data
>>>>>>>> to be pushed in the queue is stored there also.
>>>>>>>
>>>>>>> Ok, so there would be one of these tables for each IVRE,
>>>>>>
>>>>>> yes. one for each XIVE interrupt controller. That is one per processor
>>>>>> or socket.
>>>>>
>>>>> Ah.. so there can be more than one in a multi-socket system.
>>>>> >>> with one entry for each source managed by that IVSE, yes?
>>>>>>
>>>>>> yes. The table is simply indexed by the interrupt number in the
>>>>>> global IRQ number space of the machine.
>>>>>
>>>>> How does that work on a multi-chip machine? Does each chip just have
>>>>> a table for a slice of the global irq number space?
>>>>
>>>> yes. IRQ Allocation is done relative to the chip, each chip having
>>>> a range depending on its block id. XIVE has a concept of block,
>>>> which is used in skiboot in a one-to-one relationship with the chip.
>>>
>>> Ok. I'm assuming this block id forms the high(ish) bits of the global
>>> irq number, yes?
>>
>> yes. the 8 top bits are reserved, the next 4 bits are for the
>> block id, 16 blocks for 16 socket/chips, and the 20 lower bits
>> are for the ISN.
>
> Ok.
>
>>>>>>> Do the XIVE IPIs have entries here, or do they bypass this?
>>>>>>
>>>>>> no. The IPIs have entries also in this table.
>>>>>>
>>>>>>>> * EQDT:
>>>>>>>>
>>>>>>>> describes the queues in the OS RAM, also contains a set of flags,
>>>>>>>> a virtual target, etc.
>>>>>>>
>>>>>>> So on real hardware this would be global, yes? And it would be
>>>>>>> consulted by the IVRE?
>>>>>>
>>>>>> yes. Exactly. The XIVE routing routine :
>>>>>>
>>>>>> https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>>>>>>
>>>>>> gives a good overview of the usage of the tables.
>>>>>>
>>>>>>> For guests, we'd expect one table per-guest?
>>>>>>
>>>>>> yes but only in emulation mode.
>>>>>
>>>>> I'm not sure what you mean by this.
>>>>
>>>> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall
>>>> table allocated in OPAL for the system.
>>>
>>> Right.. I'm thinking of this from the point of view of the guest
>>> and/or qemu, rather than from the implementation. Even if the actual
>>> storage of the entries is distributed across the host's global table,
>>> we still logically have a table per guest, right?
>>
>> Yes. (the XiveSource object would be the table-per-guest and its
>> counterpart in KVM: the source block)
>
> Uh.. I'm talking about the IVT (or a slice of it) here, so this would
> be a XiveRouter, not a XiveSource owning it.
yes. Sorry. sPAPR has a unique XiveSource and a corresponding IVT.
>>>>>>> How would those be integrated with the host table?
>>>>>>
>>>>>> Under KVM, this is handled by the host table (setup done in skiboot)
>>>>>> and we are only interested in the state of the EQs for migration.
>>>>>
>>>>> This doesn't make sense to me; the guest is able to alter the IVT
>>>>> entries, so that configuration must be migrated somehow.
>>>>
>>>> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save
>>>> and restore the value which is cached in the KVM irq state struct
>>>> (server, prio, eq data). no OPAL calls are needed though.
>>>
>>> Right. Again, at this stage I don't particularly care what the
>>> backend details are - whether the host calls OPAL or whatever. I'm
>>> more concerned with the logical model.
>>
>> ok.
>>
>>>
>>>>>> This state is set with the H_INT_SET_QUEUE_CONFIG hcall,
>>>>>
>>>>> "This state" here meaning IVT entries?
>>>>
>>>> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a
>>>> server/priority couple. That is where the event queue data is
>>>> pushed.
>>>
>>> Ah. Doesn't that mean the guest *does* effectively have an EQD table,
>>
>> well, yes, it's behing the hood. but the guest does not know anything
>> about the Xive controller internal structures, IVE, EQD, VPD and tables.
>> Only OPAL does in fact.
>
> Right, it's under the hood. But then so is the IVT (and the TCE
> tables and the HPT for that matter). So we're probably going to have
> a (*get_eqd) method somewhere that looks up in guest RAM or in an
> external table depending.
yes. definitely.
C.
>>> updated by this call?
>>
>> it is indeed the purpose of H_INT_SET_QUEUE_CONFIG
>>
>>> We'd need to migrate that data as well,
>>
>> yes we do and some fields require OPAL support.
>>
>>> and it's not part of the IVT, right?
>>
>> yes. The IVT only contains the EQ index, the server/priority tuple used
>> for routing.
>>
>>>> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
>>>> and the eq data to be pushed in case of an event.
>>>
>>> Ok - that's the IVT entries, yes?
>>
>> yes.
>>
>>
>>>>>> followed
>>>>>> by an OPAL call and then a HW update. It defines the EQ page in which
>>>>>> to push event notification for the couple server/priority.
>>>>>>
>>>>>>>> * VPDT:
>>>>>>>>
>>>>>>>> describe the virtual targets, which can have different natures,
>>>>>>>> a lpar, a cpu. This is for powernv, spapr does not have this
>>>>>>>> concept.
>>>>>>>
>>>>>>> Ok On hardware that would also be global and consulted by the IVRE,
>>>>>>> yes?
>>>>>>
>>>>>> yes.
>>>>>
>>>>> Except.. is it actually global, or is there one per-chip/socket?
>>>>
>>>> There is a global VP allocator splitting the ids depending on the
>>>> block/chip, but, to be honest, I have not dug in the details
>>>>
>>>>> [snip]
>>>>>>>> In the current version I am working on, the XiveFabric interface is
>>>>>>>> more complex :
>>>>>>>>
>>>>>>>> typedef struct XiveFabricClass {
>>>>>>>> InterfaceClass parent;
>>>>>>>> XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>>>>>
>>>>>>> This does an IVT lookup, I take it?
>>>>>>
>>>>>> yes. It is an interface for the underlying storage, which is different
>>>>>> in sPAPR and PowerNV. The goal is to make the routing generic.
>>>>>
>>>>> Right. So, yes, we definitely want a method *somehwere* to do an IVT
>>>>> lookup. I'm not entirely sure where it belongs yet.
>>>>
>>>> Me either. I have stuffed the XiveFabric with all the abstraction
>>>> needed for the moment.
>>>>
>>>> I am starting to think that there should be an interface to forward
>>>> events and another one to route them. The router being a special case
>>>> of the forwarder, the last one. The "simple" devices, like PSI, should
>>>> only be forwarders for the sources they own but the interrupt controllers
>>>> should be forwarders (they have sources) and also routers.
>>>
>>> I'm not really clear what you mean by "forward" here.
>>
>> When a interrupt source is triggered, a notification event can
>> be generated and forwarded to the XIVE router if the transition
>> algo (depending on the PQ bit) lets it through. A forward is
>> a simple load of the IRQ number at a specific MMIO address defined
>> by the main IC.
>>
>> For QEMU sPAPR, it's a funtion call but for QEMU powernv, it's a
>> load.
>>
>> C.
>>
>>
>>>>>>>> XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>>>>>>
>>>>>>> This one a VPDT lookup, yes?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>> XiveEQ *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>>>>>>>
>>>>>>> And this one an EQDT lookup?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>> } XiveFabricClass;
>>>>>>>>
>>>>>>>> It helps in making the routing algorithm independent of the model.
>>>>>>>> I hope to make powernv converge and use it.
>>>>>>>>
>>>>>>>> - a set of MMIOs for the TIMA. They model the presenter engine.
>>>>>>>> current_cpu is used to retrieve the NVT object, which holds the
>>>>>>>> registers for interrupt management.
>>>>>>>
>>>>>>> Right. Now the TIMA is local to a target/server not an EQ, right?
>>>>>>
>>>>>> The TIMA is the MMIO giving access to the registers which are per CPU.
>>>>>> The EQ are for routing. They are under the CPU object because it is
>>>>>> convenient.
>>>>>>
>>>>>>> I guess we need at least one of these per-vcpu.
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>> Do we also need an lpar-global, or other special ones?
>>>>>>
>>>>>> That would be for the host. AFAICT KVM does not use such special
>>>>>> VPs.
>>>>>
>>>>> Um.. "does not use".. don't we get to decide that?
>>>>
>>>> Well, that part in the specs is still a little obscure for me and
>>>> I am not sure it will fit very well in the Linux/KVM model. It should
>>>> be hidden to the guest anyway and can come in later.
>>>>
>>>>>>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT
>>>>>>>> table. But we could add one under the XIVE device model.
>>>>>>>
>>>>>>> I'm not sure of the distinction you're drawing between the NVT and the
>>>>>>> XIVE device mode.
>>>>>>
>>>>>> we could add a new table under the XIVE interrupt device model
>>>>>> sPAPRXive to store the EQs and indexed them like skiboot does.
>>>>>> But it seems unnecessary to me as we can use the object below
>>>>>> 'cpu->intc', which is the XiveNVT object.
>>>>>
>>>>> So, basically assuming a fixed set of EQs (one per priority?)
>>>>
>>>> yes. It's easier to capture the state and dump information from
>>>> the monitor.
>>>>
>>>>> per CPU for a PAPR guest?
>>>>
>>>> yes, that's own it works.
>>>>
>>>>> That makes sense (assuming PAPR doesn't provide guest interfaces to
>>>>> ask for something else).
>>>>
>>>> Yes. All hcalls take prio/server parameters and the reserved prio range
>>>> for the platform is in the device tree. 0xFF is a special case to reset
>>>> targeting.
>>>>
>>>> Thanks,
>>>>
>>>> C.
>>>>
>>>
>>
>
On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
>
> As you've suggested in yourself, I think we might need to more explicitly model the different components of the XIVE system. As part of that, I think you need to be clearer in this base skeleton about exactly what component your XIVE object represents.
>
> If the answer is "the overall thing" I suspect that's not what you want - I had one of those for XICs which proved to be a mistake (eventually replaced by the XICSFabric interface).
>
> Changing the model later isn't impossible, but doing so without breaking migration can be a real pain, so I think it's worth a reasonable effort to try and get it right initially.

Note: we do need to speed things up a bit, as having exploitation mode in KVM will significantly help with IPI performance among other things.

I'm about ready to do the KVM bits. The one thing we need to discuss and figure a good design for is how we map all those interrupt control pages into qemu.

Each interrupt (either PCIe pass-through or the "generic XIVE IPIs" which are used for guest IPIs and for vio/virtio/emulated interrupts) comes with a "control page" (ESB page) which needs to be mapped into the guest, and the generic IPIs also come with a trigger page which needs to be mapped into the guest for guest IPIs or OpenCAPI interrupts, or just qemu for emulated devices.

Now that can be thousands of these critters. I certainly don't want to create thousands of VMAs in qemu and even less thousands of memory regions in KVM.

So we need some kind of mechanism by wich a single large VMA gets mmap'ed into qemu (or maybe a couple of these, but not too many) and the interrupt pages can be assigned to slots in there and demand faulted.

For the generic interrupts, this can probably be covered by KVM, adding some arch ioctls for allocating IPIs and mmap'ing that region etc...

For pass-through, it's trickier, we don't want to mmap each irqfd individually for the above reason, so we want to "link" them to KVM. We don't want to allow qemu to take control of any arbitrary interrupt in the system though, so it has to related to the ownership of the irqfd coming from vfio.

OpenCAPI I suspect will be its own can of worms...

Also, have we decided how the process of switching between XICS and XIVE will work vs. CAS ? And how that will interact with KVM ? I was thinking the kernel would implement a different KVM device type, ie the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be KVM_DEV_TYPE_XIVE.

Cheers,
Ben.
On 12/21/2017 01:12 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
>>
>> As you've suggested in yourself, I think we might need to more
>> explicitly model the different components of the XIVE system. As part
>> of that, I think you need to be clearer in this base skeleton about
>> exactly what component your XIVE object represents.
>>
>> If the answer is "the overall thing" I suspect that's not what you
>> want - I had one of those for XICs which proved to be a mistake
>> (eventually replaced by the XICSFabric interface).
>>
>> Changing the model later isn't impossible, but doing so without
>> breaking migration can be a real pain, so I think it's worth a
>> reasonable effort to try and get it right initially.
>
> Note: we do need to speed things up a bit, as having exploitation mode
> in KVM will significantly help with IPI performance among other things.
>
> I'm about ready to do the KVM bits. The one thing we need to discuss
> and figure a good design for is how we map all those interrupt control
> pages into qemu.
>
> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
> which are used for guest IPIs and for vio/virtio/emulated interrupts)
> comes with a "control page" (ESB page) which needs to be mapped into
> the guest, and the generic IPIs also come with a trigger page which
> needs to be mapped into the guest for guest IPIs or OpenCAPI
> interrupts, or just qemu for emulated devices.
what about the OS TIMA page ? Do we trap the accesses in QEMU and
forward them to KVM ? Or do we use a similar mechanism ?
> Now that can be thousands of these critters. I certainly don't want to
> create thousands of VMAs in qemu and even less thousands of memory
> regions in KVM.
we can provision one mapping per kvmppc_xive_src_block maybe ?
> So we need some kind of mechanism by wich a single large VMA gets
> mmap'ed into qemu (or maybe a couple of these, but not too many) and
> the interrupt pages can be assigned to slots in there and demand
> faulted.
Frederic has started to put in place a similar mechanism for OpenCAPI.
> For the generic interrupts, this can probably be covered by KVM, adding
> some arch ioctls for allocating IPIs and mmap'ing that region etc...
The KVM device has an ioctl handler :
struct kvm_device_ops {
long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
unsigned long arg);
};
So a KVM device for the XIVE interrupt controller can implement a couple
of extra calls for its needs, like getting the VMA addresses, etc.
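
For example, such a device-specific call could be hooked up as below; struct kvm_device_ops and its .ioctl member exist in the kernel, but the command number and the rest of the sketch are made up for illustration:

    #include <linux/kvm_host.h>

    /* hypothetical device-specific command, not a real KVM ioctl */
    #define KVM_DEV_XIVE_GET_ESB_FD   _IO(KVMIO, 0xf0)

    static long kvm_xive_ioctl(struct kvm_device *dev, unsigned int ioctl,
                               unsigned long arg)
    {
        switch (ioctl) {
        case KVM_DEV_XIVE_GET_ESB_FD:
            /* hand QEMU something it can mmap for the ESB pages */
            return -ENOSYS;      /* placeholder */
        default:
            return -ENOTTY;
        }
    }

    static struct kvm_device_ops kvm_xive_ops = {
        .name  = "kvm-xive",
        .ioctl = kvm_xive_ioctl,
    };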
> For pass-through, it's trickier, we don't want to mmap each irqfd
> individually for the above reason, so we want to "link" them to KVM. We
> don't want to allow qemu to take control of any arbitrary interrupt in
> the system though, so it has to related to the ownership of the irqfd
> coming from vfio.
>
> OpenCAPI I suspect will be its own can of worms...
>
> Also, have we decided how the process of switching between XICS and
> XIVE will work vs. CAS ?
That's how it is described in the architecture. The current choice is
to create both XICS and XIVE objects and choose at CAS which one to
use. It relies today on the capability of the pseries machine to
allocate IRQ numbers for both interrupt controller backends. These
patches have been merged in QEMU.
A change of interrupt mode results in a reset. The device tree is
populated accordingly and the ICPs are switched for the model in
use.
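
A rough QEMU-side sketch of that CAS-time decision. spapr_ovec_test() and OV5_XIVE_EXPLOIT are the existing option-vector helpers mentioned later in the thread; the xive_exploitation flag and the function itself are invented for illustration:

    #include "qemu/osdep.h"
    #include "hw/ppc/spapr.h"
    #include "hw/ppc/spapr_ovec.h"
    #include "sysemu/sysemu.h"

    /* Called from the CAS path once the guest's option vectors are known. */
    static void spapr_cas_select_irq_mode(sPAPRMachineState *spapr)
    {
        bool xive = spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT);

        if (xive == spapr->xive_exploitation) {   /* hypothetical flag */
            return;                               /* nothing to switch */
        }
        spapr->xive_exploitation = xive;

        /*
         * A change of interrupt mode results in a machine reset: the
         * device tree is regenerated and the per-CPU presenters (ICPs
         * or NVTs) are swapped for the newly negotiated backend.
         */
        qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
    }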
> And how that will interact with KVM ?
I expect we will do the same, which is to create two KVM devices to
be able to handle both interrupt controller backends depending on the
mode negotiated by the guest.
> I was
> thinking the kernel would implement a different KVM device type, ie
> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> KVM_DEV_TYPE_XIVE.
yes. it makes sense. The new device will have a lot in common with the
KVM_DEV_TYPE_XICS using kvm_xive_ops.
C.
On 12/21/2017 10:16 AM, Cédric Le Goater wrote:
> On 12/21/2017 01:12 AM, Benjamin Herrenschmidt wrote:
>> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
>>>
>>> As you've suggested in yourself, I think we might need to more
>>> explicitly model the different components of the XIVE system. As part
>>> of that, I think you need to be clearer in this base skeleton about
>>> exactly what component your XIVE object represents.
>>>
>>> If the answer is "the overall thing" I suspect that's not what you
>>> want - I had one of those for XICs which proved to be a mistake
>>> (eventually replaced by the XICSFabric interface).
>>>
>>> Changing the model later isn't impossible, but doing so without
>>> breaking migration can be a real pain, so I think it's worth a
>>> reasonable effort to try and get it right initially.
>>
>> Note: we do need to speed things up a bit, as having exploitation mode
>> in KVM will significantly help with IPI performance among other things.
>>
>> I'm about ready to do the KVM bits. The one thing we need to discuss
>> and figure a good design for is how we map all those interrupt control
>> pages into qemu.
>>
>> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
>> which are used for guest IPIs and for vio/virtio/emulated interrupts)
>> comes with a "control page" (ESB page) which needs to be mapped into
>> the guest, and the generic IPIs also come with a trigger page which
>> needs to be mapped into the guest for guest IPIs or OpenCAPI
>> interrupts, or just qemu for emulated devices.
>
> what about the OS TIMA page ? Do we trap the accesses in QEMU and
> forward them to KVM ? or do we use a similar mechanism.
>
>> Now that can be thousands of these critters. I certainly don't want to
>> create thousands of VMAs in qemu and even less thousands of memory
>> regions in KVM.
>
> we can provision one mapping per kvmppc_xive_src_block maybe ?
>
>> So we need some kind of mechanism by wich a single large VMA gets
>> mmap'ed into qemu (or maybe a couple of these, but not too many) and
>> the interrupt pages can be assigned to slots in there and demand
>> faulted.
>
> Frederic has started to put in place a similar mecanism for OpenCAPI.
>
>> For the generic interrupts, this can probably be covered by KVM, adding
>> some arch ioctls for allocating IPIs and mmap'ing that region etc...
>
> The KVM device has a ioctl handler :
>
> struct kvm_device_ops {
>
> long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> unsigned long arg);
> };
>
> So a KVM device for the XIVE interrupt controller can implement a couple
> of extra calls for its need, like getting the VMA addresses, etc
or use set/get_attr.
I wonder if it would be possible to add an 'mmap' op to kvm_device_fops
for the KVM_DEV_TYPE_XIVE device.
C.
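
A sketch of what that could look like, assuming the KVM device did grow an mmap hook: the device installs a fault handler so that the thousands of ESB pages are demand-faulted into one large VMA instead of being mapped one by one. Everything below is illustrative, not an existing kernel interface:

    #include <linux/kvm_host.h>
    #include <linux/mm.h>

    /* Demand-fault one ESB page into the single large VMA; vmf->pgoff
     * selects the interrupt source. The actual lookup and pfn insert
     * are left as a stub. */
    static int kvm_xive_esb_fault(struct vm_fault *vmf)
    {
        /* look up the source for vmf->pgoff, then insert its pfn */
        return VM_FAULT_SIGBUS;   /* placeholder */
    }

    static const struct vm_operations_struct kvm_xive_esb_vmops = {
        .fault = kvm_xive_esb_fault,
    };

    /* hypothetical mmap hook on the KVM device file descriptor */
    static int kvm_xive_mmap(struct kvm_device *dev, struct vm_area_struct *vma)
    {
        vma->vm_ops = &kvm_xive_esb_vmops;
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        return 0;
    }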
On Thu, 2017-12-21 at 10:16 +0100, Cédric Le Goater wrote:
> On 12/21/2017 01:12 AM, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
> > >
> > > As you've suggested in yourself, I think we might need to more
> > > explicitly model the different components of the XIVE system. As part
> > > of that, I think you need to be clearer in this base skeleton about
> > > exactly what component your XIVE object represents.
> > >
> > > If the answer is "the overall thing" I suspect that's not what you
> > > want - I had one of those for XICs which proved to be a mistake
> > > (eventually replaced by the XICSFabric interface).
> > >
> > > Changing the model later isn't impossible, but doing so without
> > > breaking migration can be a real pain, so I think it's worth a
> > > reasonable effort to try and get it right initially.
> >
> > Note: we do need to speed things up a bit, as having exploitation mode
> > in KVM will significantly help with IPI performance among other things.
> >
> > I'm about ready to do the KVM bits. The one thing we need to discuss
> > and figure a good design for is how we map all those interrupt control
> > pages into qemu.
> >
> > Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
> > which are used for guest IPIs and for vio/virtio/emulated interrupts)
> > comes with a "control page" (ESB page) which needs to be mapped into
> > the guest, and the generic IPIs also come with a trigger page which
> > needs to be mapped into the guest for guest IPIs or OpenCAPI
> > interrupts, or just qemu for emulated devices.
>
> what about the OS TIMA page ? Do we trap the accesses in QEMU and
> forward them to KVM ? or do we use a similar mechanism.
No, no, we'll have an mmap facility for it in kvm but it worries me
less as there's only one of these and there's little damage qemu can do
having access to it :)
>
> > Now that can be thousands of these critters. I certainly don't want to
> > create thousands of VMAs in qemu and even less thousands of memory
> > regions in KVM.
>
> we can provision one mapping per kvmppc_xive_src_block maybe ?
Maybe. Last I looked KVM walk of memory regions was linear though. Mind
you it's not a huge deal if the guest RAM is always in the first
entries.
> > So we need some kind of mechanism by wich a single large VMA gets
> > mmap'ed into qemu (or maybe a couple of these, but not too many) and
> > the interrupt pages can be assigned to slots in there and demand
> > faulted.
>
> Frederic has started to put in place a similar mecanism for OpenCAPI.
I know, though he made it rather OpenCAPI specific which is going to be
"interesting" when it comes to virtualizing OpenCAPI...
> > For the generic interrupts, this can probably be covered by KVM, adding
> > some arch ioctls for allocating IPIs and mmap'ing that region etc...
>
> The KVM device has a ioctl handler :
>
> struct kvm_device_ops {
>
> long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> unsigned long arg);
> };
>
> So a KVM device for the XIVE interrupt controller can implement a couple
> of extra calls for its need, like getting the VMA addresses, etc
>
> > For pass-through, it's trickier, we don't want to mmap each irqfd
> > individually for the above reason, so we want to "link" them to KVM. We
> > don't want to allow qemu to take control of any arbitrary interrupt in
> > the system though, so it has to related to the ownership of the irqfd
> > coming from vfio.
> >
> > OpenCAPI I suspect will be its own can of worms...
> >
> > Also, have we decided how the process of switching between XICS and
> > XIVE will work vs. CAS ?
>
> That's how it is described in the architecture. The current choice is
> to create both XICS and XIVE objects and choose at CAS which one to
> use. It relies today on the capability of the pseries machine to
> allocate IRQ numbers for both interrupt controller backends. These
> patches have been merged in QEMU.
>
> A change of interrupt mode results in a reset. The device tree is
> populated accordingly and the ICPs are switched for the model in
> use.
For KVM we need to only instanciate one of them though.
> > And how that will interact with KVM ?
>
> I expect we will do the same, which is to create two KVM devices to
> be able to handle both interrupt controller backends depending on the
> mode negotiated by the guest.
That will be an ungodly mess, I'd rather we only instanciate the right
one.
> > I was
> > thinking the kernel would implement a different KVM device type, ie
> > the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> > KVM_DEV_TYPE_XIVE.
>
> yes. it makes sense. The new device will have a lot in common with the
> KVM_DEV_TYPE_XICS using kvm_xive_ops.
Ben.
>>> Also, have we decided how the process of switching between XICS and XIVE will work vs. CAS ?
>>
>> That's how it is described in the architecture. The current choice is to create both XICS and XIVE objects and choose at CAS which one to use. It relies today on the capability of the pseries machine to allocate IRQ numbers for both interrupt controller backends. These patches have been merged in QEMU.
>>
>> A change of interrupt mode results in a reset. The device tree is populated accordingly and the ICPs are switched for the model in use.
>
> For KVM we need to only instanciate one of them though.

Hmm,

How would we handle a guest rebooting on a kernel without XIVE support ?

Are you suggesting to create the XICS or XIVE device in the CAS negotiation process ? So, the machine would not have any interrupt controller before CAS. That seems really late to me. grub uses the console for instance.

I think it should prepare for both options, start in XIVE legacy mode, which is XICS, then possibly switch to XIVE exploitation mode.

>>> And how that will interact with KVM ?
>>
>> I expect we will do the same, which is to create two KVM devices to be able to handle both interrupt controller backends depending on the mode negotiated by the guest.
>
> That will be an ungodly mess, I'd rather we only instanciate the right one.

It's rather transparent currently in the emulated version. There are two sets of objects in QEMU, switching is done in CAS. KVM support should not change anything in that area.

I expect the 'xive-kvm' object to get/set states for migration, just like for XICS and to setup the ESB+TIMA memory regions, which is new.

C.

>>> I was thinking the kernel would implement a different KVM device type, ie the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be KVM_DEV_TYPE_XIVE.
>>
>> yes. it makes sense. The new device will have a lot in common with the KVM_DEV_TYPE_XICS using kvm_xive_ops.
>
> Ben.
>
On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
> > > > Also, have we decided how the process of switching between XICS and XIVE will work vs. CAS ?
> > >
> > > That's how it is described in the architecture. The current choice is to create both XICS and XIVE objects and choose at CAS which one to use. It relies today on the capability of the pseries machine to allocate IRQ numbers for both interrupt controller backends. These patches have been merged in QEMU.
> > >
> > > A change of interrupt mode results in a reset. The device tree is populated accordingly and the ICPs are switched for the model in use.
> >
> > For KVM we need to only instanciate one of them though.
>
> Hmm,
>
> How would we handle a guest rebooting on a kernel without XIVE support ?

It will do CAS again and we can change the devices.

> Are you suggesting to create the XICS or XIVE device in the CAS negotiation process ? So, the machine would not have any interrupt controller before CAS. That seems really late to me. grub uses the console for instance.

We start with XICS by default.

> I think it should prepare for both options, start in XIVE legacy mode, which is XICS, then possibly switch to XIVE exploitation mode.
>
> > > > And how that will interact with KVM ?
> > >
> > > I expect we will do the same, which is to create two KVM devices to be able to handle both interrupt controller backends depending on the mode negotiated by the guest.
> >
> > That will be an ungodly mess, I'd rather we only instanciate the right one.
>
> It's rather transparent currently in the emulated version. There are two sets of objects in QEMU, switching is done in CAS. KVM support should not change anything in that area.
>
> I expect the 'xive-kvm' object to get/set states for migration, just like for XICS and to setup the ESB+TIMA memory regions, which is new.

But both XICS and XIVE are completely different kernel KVM devices that will need to "hook" into the same set of internal hooks for things like interrupts being passed through, RTAS calls etc...

How does KVM knows which one to "activate" ?

I don't think the kernel should have both.

> > > > I was thinking the kernel would implement a different KVM device type, ie the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be KVM_DEV_TYPE_XIVE.
> > >
> > > yes. it makes sense. The new device will have a lot in common with the KVM_DEV_TYPE_XICS using kvm_xive_ops.
> >
> > Ben.
> >
On 01/17/2018 12:10 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
>>>>> Also, have we decided how the process of switching between XICS and XIVE will work vs. CAS ?
>>>>
>>>> That's how it is described in the architecture. The current choice is to create both XICS and XIVE objects and choose at CAS which one to use. It relies today on the capability of the pseries machine to allocate IRQ numbers for both interrupt controller backends. These patches have been merged in QEMU.
>>>>
>>>> A change of interrupt mode results in a reset. The device tree is populated accordingly and the ICPs are switched for the model in use.
>>>
>>> For KVM we need to only instanciate one of them though.
>>
>> Hmm,
>>
>> How would we handle a guest rebooting on a kernel without XIVE support ?
>
> It will do CAS again and we can change the devices.

So, we would destroy the previous QEMU ICS object and create a new one in the CAS hcall. That would probably work. There might be some issues in creating and destroying the ICS KVM device, but that can be studied without XIVE.

It used to be considered ugly to create a QEMU device at reset time, so I wonder if this is still the case, because when the machine reaches CAS, we really are beyond reset.

If this is OK, then the next "issue" is to keep in sync the allocated IRQ numbers. The IRQ allocator is now merged at the machine level, so the synchronization is obvious to do when both backend QEMU objects are available. That's the path I took. If both QEMU objects are not available, then we need to scan the IRQ number space in the current interrupt mode and allocate the same IRQs in the newly negotiated mode. Probably OK. I don't see major problems with the current code.

Migration is a problem. We will need both backend QEMU objects to be available anyhow if we want to migrate. So we are back to the current solution creating both QEMU objects but we can try to defer some of the KVM inits and create the KVM device on demand at CAS time.

The next problem is the ICP object that currently needs the KVM device fd to connect the vcpus ... So, we will need to change that also. That is probably the biggest problem today. We need a way to disconnect the vcpu from the KVM device and see how we can defer the connection. I need to make sure this is possible, I can check that without XIVE I think.

>> Are you suggesting to create the XICS or XIVE device in the CAS negotiation process ? So, the machine would not have any interrupt controller before CAS. That seems really late to me. grub uses the console for instance.
>
> We start with XICS by default.

yes.

>> I think it should prepare for both options, start in XIVE legacy mode, which is XICS, then possibly switch to XIVE exploitation mode.
>>
>>>>> And how that will interact with KVM ?
>>>>
>>>> I expect we will do the same, which is to create two KVM devices to be able to handle both interrupt controller backends depending on the mode negotiated by the guest.
>>>
>>> That will be an ungodly mess, I'd rather we only instanciate the right one.
>>
>> It's rather transparent currently in the emulated version. There are two sets of objects in QEMU, switching is done in CAS. KVM support should not change anything in that area.
>>
>> I expect the 'xive-kvm' object to get/set states for migration, just like for XICS and to setup the ESB+TIMA memory regions, which is new.
>
> But both XICS and XIVE are completely different kernel KVM devices that will need to "hook" into the same set of internal hooks for things like interrupts being passed through, RTAS calls etc...
>
> How does KVM knows which one to "activate" ?

Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? I haven't studied all the low level details though.

> I don't think the kernel should have both.

I hear that. From a QEMU perspective, it is much easier to put everything in place for both interrupt modes and let the guest decide what it wants to use. If we choose not to, we will need to find a solution to defer the KVM inits and to disconnect/reconnect the vcpus. For the latter, we could add a KVM_DISABLE_CAP ioctl or maybe better add a new capability like KVM_CAP_IRQ_XIVE to perform the switch.

C.
> How does KVM knows which one to "activate" ?
>
> Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? I haven't studied all the low level details though.

I don't think connecting a vcpu to two different KVM devices makes sense ... So we need to destroy/recreate the KVM device and disconnect/reconnect the vcpus. I will take a closer look.

C.
On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> Migration is a problem. We will need both backend QEMU objects to be available anyhow if we want to migrate. So we are back to the current solution creating both QEMU objects but we can try to defer some of the KVM inits and create the KVM device on demand at CAS time.

Do we have a way to migrate a piece of info from the machine *first* that indicate what type of XICS/XIVE to instanciate ?

> The next problem is the ICP object that currently needs the KVM device fd to connect the vcpus ... So, we will need to change that also. That is probably the biggest problem today. We need a way to disconnect the vpcu from the KVM device and see how we can defer the connection. I need to make sure this is possible, I can check that without XIVE

Ben.
On 01/17/2018 10:27 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
>> Migration is a problem. We will need both backend QEMU objects to be available anyhow if we want to migrate. So we are back to the current solution creating both QEMU objects but we can try to defer some of the KVM inits and create the KVM device on demand at CAS time.
>
> Do we have a way to migrate a piece of info from the machine *first* that indicate what type of XICS/XIVE to instanciate ?

The source and the target machines should have the same realized objects. I think this is the simplest solution to keep the migration framework maintainable.

I don't think it is a problem to call a xics_fini() routine to destroy the XICS KVM device if a new interrupt mode was negotiated in CAS. We would then call a xive_init() routine to create the new XIVE KVM device.

When done, the question boils down to disconnect and reconnect the vcpus to the KVM device. The QEMU CPU ->intc pointer should be updated also but that's a QEMU level problem. Already done.

In the QEMU "icp-kvm" object, the connection to the KVM device is currently forced in the realize routine but we can add some handlers to manage the link. Similar handlers would do the same in the QEMU "nvt-kvm" object when XIVE is on.

If we think this is a possible way to address the problem, I can check the above thinking on a XICS KVM machine and force the init/fini sequence in the CAS negotiation process. I will need a KVM ioctl to destroy a device and maybe a KVM VCPU ioctl to disable a capability.

Cheers,

C.
On Thu, 2018-01-18 at 14:27 +0100, Cédric Le Goater wrote:
> The source and the target machines should have the same realized objects. I think this is the simplest solution to keep the migration framework maintainable.

Yeah well, it all boils down to qemu migration being completely brain dead in relying on an external entity to create the same machine rather than carrying the configuration in the migration stream... ugh.

> I don't think it is a problem to call a xics_fini() routine to destroy the XICS KVM device if a new interrupt mode was negotiated in CAS. We would then call a xive_init() routing to create the new XIVE KVM device.
>
> When done, the question boils down to disconnect and reconnect the vcpus to the KVM device. The QEMU CPU ->intc pointer should be updated also but that's a QEMU level problem. Already done.

The problem is more the in-kernel hooks.

> In the QEMU "icp-kvm" object, the connection to the KVM device is currently forced in the realize routine but we can add some handlers to manage the link. Similar handlers would do the same in the QEMU "nvt-kvm" object when XIVE is on.
>
> If we think this is a possible way to address the problem, I can check the above thinking on a XICS KVM machine and force the init/fini sequence in the CAS negotiation process. I will need a KVM ioctl to destroy a device and maybe a KVM VCPU ioctl to disable a capability.
On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> > Migration is a problem. We will need both backend QEMU objects to be available anyhow if we want to migrate. So we are back to the current solution creating both QEMU objects but we can try to defer some of the KVM inits and create the KVM device on demand at CAS time.
>
> Do we have a way to migrate a piece of info from the machine *first* that indicate what type of XICS/XIVE to instanciate ?

Nope. qemu migration doesn't work like that. Yes, it should, and everyone knows it, but changing it is a really long term project.

> > The next problem is the ICP object that currently needs the KVM device fd to connect the vcpus ... So, we will need to change that also. That is probably the biggest problem today. We need a way to disconnect the vpcu from the KVM device and see how we can defer the connection. I need to make sure this is possible, I can check that without XIVE
>
> Ben.
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> > > Migration is a problem. We will need both backend QEMU objects to be available anyhow if we want to migrate. So we are back to the current solution creating both QEMU objects but we can try to defer some of the KVM inits and create the KVM device on demand at CAS time.
> >
> > Do we have a way to migrate a piece of info from the machine *first* that indicate what type of XICS/XIVE to instanciate ?
>
> Nope. qemu migration doesn't work like that. Yes, it should, and everyone knows it, but changing it is a really long term project.

Well, we have a problem then. It looks like Qemu broken migration is fundamentally incompatible with PAPR and CAS design...

I know we don't migrate the configuration, that's not exactly what I had in mind tho... Can we have some piece of *data* from the machine be migrated first, and use it on the target to reconfigure the interrupt controller before the stream arrives ?

Otherwise, we have indeed not much choice but the horrible wart of creating both interrupt controllers with only one "active".

> > > The next problem is the ICP object that currently needs the KVM device fd to connect the vcpus ...
> > >
> > > Ben.
> >
> >
On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
>>>> Migration is a problem. We will need both backend QEMU objects to be available anyhow if we want to migrate. So we are back to the current solution creating both QEMU objects but we can try to defer some of the KVM inits and create the KVM device on demand at CAS time.
>>>
>>> Do we have a way to migrate a piece of info from the machine *first* that indicate what type of XICS/XIVE to instanciate ?
>>
>> Nope. qemu migration doesn't work like that. Yes, it should, and everyone knows it, but changing it is a really long term project.
>
> Well, we have a problem then. It looks like Qemu broken migration is fundamentally incompatible with PAPR and CAS design...
>
> I know we don't migrate the configuration, that's not exactly what I had in mind tho... Can we have some piece of *data* from the machine be migrated first, and use it on the target to reconfigure the interrupt controller before the stream arrives ?

These days this is done via libvirt - it reads properties it needs via QMP, then sends an XML with everything (the interrupt controller type may be one of such properties), and starts the destination QEMU with the explicit interrupt controller (like -machine pseries,intrc=xive).

Hacking QEMU to do all of this is still in a distant TODO...

> Otherwise, we have indeed no much choice but the horrible wart of creating both interrupt controllers with only one "active".
>
>>>> The next problem is the ICP object that currently needs the KVM device fd to connect the vcpus ... So, we will need to change that also. That is probably the biggest problem today. We need a way to disconnect the vpcu from the KVM device and see how we can defer the connection. I need to make sure this is possible, I can check that without XIVE

--
Alexey
On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
> On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
> > Well, we have a problem then. It looks like Qemu broken migration is fundamentally incompatible with PAPR and CAS design...
> >
> > I know we don't migrate the configuration, that's not exactly what I had in mind tho... Can we have some piece of *data* from the machine be migrated first, and use it on the target to reconfigure the interrupt controller before the stream arrives ?
>
> These days this is done via libvirt - it reads properties it needs via QMP, then sends an XML with everything (the interrupt controller type may be one of such properties), and starts the destination QEMU with the explicit interrupt controller (like -machine pseries,intrc=xive).

Clarification: libvirt will use the user-defined XML configuration to generate the QEMU command line both for the source and the target of the migration, but it will not automagically figure out properties through QMP. So if you want the controller to explicitly show up on the QEMU command line, libvirt should be taught about it.

--
Andrea Bolognani / Red Hat / Virtualization
On Mon, 2018-02-12 at 13:20 +0100, Andrea Bolognani wrote:
> On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
> > On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
> > > Well, we have a problem then. It looks like Qemu broken migration is fundamentally incompatible with PAPR and CAS design...
> > >
> > > I know we don't migrate the configuration, that's not exactly what I had in mind tho... Can we have some piece of *data* from the machine be migrated first, and use it on the target to reconfigure the interrupt controller before the stream arrives ?
> >
> > These days this is done via libvirt - it reads properties it needs via QMP, then sends an XML with everything (the interrupt controller type may be one of such properties), and starts the destination QEMU with the explicit interrupt controller (like -machine pseries,intrc=xive).
>
> Clarification: libvirt will use the user-defined XML configuration to generate the QEMU command line both for the source and the target of the migration, but it will not automagically figure out properties through QMP. So if you want the controller to explicitly show up on the QEMU command line, libvirt should be taught about it.

Which can't work because the guest pretty much decides what it will be early on during the boot process.

So we're back to square 1 having to instanciate both objects in qemu with some kind of "activation" flag.

Cheers,
Ben.
On 13/02/18 01:40, Benjamin Herrenschmidt wrote:
> On Mon, 2018-02-12 at 13:20 +0100, Andrea Bolognani wrote:
>> On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
>>> On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
>>>> Well, we have a problem then. It looks like Qemu broken migration is fundamentally incompatible with PAPR and CAS design...
>>>>
>>>> I know we don't migrate the configuration, that's not exactly what I had in mind tho... Can we have some piece of *data* from the machine be migrated first, and use it on the target to reconfigure the interrupt controller before the stream arrives ?
>>>
>>> These days this is done via libvirt - it reads properties it needs via QMP, then sends an XML with everything (the interrupt controller type may be one of such properties), and starts the destination QEMU with the explicit interrupt controller (like -machine pseries,intrc=xive).
>>
>> Clarification: libvirt will use the user-defined XML configuration to generate the QEMU command line both for the source and the target of the migration, but it will not automagically figure out properties through QMP. So if you want the controller to explicitly show up on the QEMU command line, libvirt should be taught about it.
>
> Which can't work because the guest pretty much decides what it will be early on during the boot process.

At the time of migration the guest has told QEMU what intrc it wants (via cas?) and libvirt can ask QEMU via QMP about that when migrating.

> So we're back to square 1 having to instanciate both objects in qemu with some kind of "activation" flag.
>
> Cheers,
> Ben.
>

--
Alexey
On 02/12/2018 03:40 PM, Benjamin Herrenschmidt wrote: > On Mon, 2018-02-12 at 13:20 +0100, Andrea Bolognani wrote: >> On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote: >>> On 12/02/18 09:55, Benjamin Herrenschmidt wrote: >>>> Well, we have a problem then. It looks like Qemu broken migration is >>>> fundamentally incompatible with PAPR and CAS design... >>>> >>>> I know we don't migrate the configuration, that's not exactly what I >>>> had in mind tho... Can we have some piece of *data* from the machine be >>>> migrated first, and use it on the target to reconfigure the interrupt >>>> controller before the stream arrives ? >>> >>> These days this is done via libvirt - it reads properties it needs via QMP, >>> then sends an XML with everything (the interrupt controller type may be one >>> of such properties), and starts the destination QEMU with the explicit >>> interrupt controller (like -machine pseries,intrc=xive). >> >> Clarification: libvirt will use the user-defined XML configuration >> to generate the QEMU command line both for the source and the target >> of the migration, but it will not automagically figure out properties >> through QMP. So if you want the controller to explicitly show up on >> the QEMU command line, libvirt should be taught about it. > > Which can't work because the guest pretty much decides what it will be > early on during the boot process. > > So we're back to square 1 having to instanciate both objects in qemu > with some kind of "activation" flag. Yes, and the activation flag is the associated bit in CAS OV5: spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT). If a new interrupt mode is negotiated, a machine reset is required, a new device tree is populated, new ICPs are installed, etc. There is a little more to do with KVM and we need to find the right model abstraction for it. Anyhow, it is not a big problem to switch from one mode to another when both objects are around. It is even easier to keep the allocated IRQs in sync, in fact. What problem do you foresee with KVM? This is already solved for irqchip=off. Cheers, C.
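A minimal sketch of the check above, for illustration only: spapr_ovec_test() and OV5_XIVE_EXPLOIT are existing QEMU symbols, but the helper name and the 'active_intc'/'xive' fields on the machine state are assumptions, not code from this series.

    /* Sketch: select the active controller from the CAS-negotiated OV5 bit.
     * spapr_ovec_test()/OV5_XIVE_EXPLOIT are real; the rest is hypothetical. */
    static void spapr_irq_update_active_intc(sPAPRMachineState *spapr)
    {
        if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
            /* the guest negotiated XIVE exploitation mode at CAS */
            spapr->active_intc = OBJECT(spapr->xive);
        } else {
            /* legacy (XICS-compatible) mode */
            spapr->active_intc = OBJECT(spapr->ics);
        }
    }

A machine reset handler would call something like this after CAS, so that device tree population and ICP selection follow the negotiated mode.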
On 02/11/2018 11:55 PM, Benjamin Herrenschmidt wrote: > On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote: >> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote: >>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote: >>>> Migration is a problem. We will need both backend QEMU objects to be >>>> available anyhow if we want to migrate. So we are back to the current >>>> solution creating both QEMU objects but we can try to defer some of the >>>> KVM inits and create the KVM device on demand at CAS time. >>> >>> Do we have a way to migrate a piece of info from the machine *first* >>> that indicate what type of XICS/XIVE to instanciate ? >> >> Nope. qemu migration doesn't work like that. Yes, it should, and >> everyone knows it, but changing it is a really long term project. > > Well, we have a problem then. It looks like Qemu broken migration is > fundamentally incompatible with PAPR and CAS design... > > I know we don't migrate the configuration, that's not exactly what I > had in mind tho... Can we have some piece of *data* from the machine be > migrated first, and use it on the target to reconfigure the interrupt > controller before the stream arrives ? > > Otherwise, we have indeed no much choice but the horrible wart of > creating both interrupt controllers with only one "active". Well, both QEMU model objects would be created, yes, but only one associated KVM device. It's a bit ugly from a QEMU point of view because the KVM initialization is deferred to reset time but, in practice, it results in a couple of calls to: - disconnect the VCPU from the KVM interrupt device - destroy the previous KVM interrupt device (new ioctl) - create the new KVM interrupt device - reconnect the VCPU to the KVM interrupt device I don't think it will be a major problem. What I am uneasy with currently is how to share the same XIVE objects when under KVM and when not. The only difference is in the nature of the MMIO region and the qemu_irq handler. Work in progress. And we have four interrupt modes to support: XICS-KVM, XICS, XIVE-KVM, XIVE. Thanks, C. >>>> The next problem is the ICP object that currently needs the KVM device >>>> fd to connect the vcpus ... So, we will need to change that also. >>>> That is probably the biggest problem today. We need a way to disconnect >>>> the vpcu from the KVM device and see how we can defer the connection. >>>> I need to make sure this is possible, I can check that without XIVE >>> >>> Ben. >>> >> >>
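The four calls listed above could look roughly like the following. This is only a sketch: kvm_create_device() and KVM_DEV_TYPE_XICS exist, KVM_DEV_TYPE_XIVE is the device type proposed in this thread, and the kvmppc_intc_*() helpers plus the 'kvm_intc_fd' field are hypothetical names; in particular, destroying an in-kernel device would need the new ioctl mentioned above.

    /* Hypothetical sketch of the CAS-time switch of the in-kernel device. */
    static int spapr_irq_switch_kvm_intc(sPAPRMachineState *spapr, bool xive)
    {
        CPUState *cs;

        /* 1. disconnect every vCPU from the current in-kernel device */
        CPU_FOREACH(cs) {
            kvmppc_intc_disconnect_vcpu(cs);            /* hypothetical */
        }

        /* 2. destroy the previous KVM interrupt device (needs a new ioctl) */
        kvmppc_intc_destroy(spapr->kvm_intc_fd);        /* hypothetical */

        /* 3. create the newly negotiated device type */
        spapr->kvm_intc_fd = kvm_create_device(kvm_state,
                xive ? KVM_DEV_TYPE_XIVE : KVM_DEV_TYPE_XICS, false);
        if (spapr->kvm_intc_fd < 0) {
            return -1;
        }

        /* 4. reconnect the vCPUs to the new device */
        CPU_FOREACH(cs) {
            kvmppc_intc_connect_vcpu(cs, spapr->kvm_intc_fd); /* hypothetical */
        }
        return 0;
    }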
On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote: > On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote: > > On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote: > > > On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote: > > > > Migration is a problem. We will need both backend QEMU objects to be > > > > available anyhow if we want to migrate. So we are back to the current > > > > solution creating both QEMU objects but we can try to defer some of the > > > > KVM inits and create the KVM device on demand at CAS time. > > > > > > Do we have a way to migrate a piece of info from the machine *first* > > > that indicate what type of XICS/XIVE to instanciate ? > > > > Nope. qemu migration doesn't work like that. Yes, it should, and > > everyone knows it, but changing it is a really long term project. > > Well, we have a problem then. It looks like Qemu broken migration is > fundamentally incompatible with PAPR and CAS design... Hrm, the fit is very clunky certainly, but i think we can make it work. > I know we don't migrate the configuration, that's not exactly what I > had in mind tho... Can we have some piece of *data* from the machine be > migrated first, and use it on the target to reconfigure the interrupt > controller before the stream arrives ? Sorta.. maybe.. but it would probably get really ugly if we don't preserve the usual way object lifetimes work. > Otherwise, we have indeed no much choice but the horrible wart of > creating both interrupt controllers with only one "active". I really think this is the way to go, warts and all. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On 04/12/2018 07:16 AM, David Gibson wrote: > On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote: >> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote: >>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote: >>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote: >>>>> Migration is a problem. We will need both backend QEMU objects to be >>>>> available anyhow if we want to migrate. So we are back to the current >>>>> solution creating both QEMU objects but we can try to defer some of the >>>>> KVM inits and create the KVM device on demand at CAS time. >>>> >>>> Do we have a way to migrate a piece of info from the machine *first* >>>> that indicate what type of XICS/XIVE to instanciate ? >>> >>> Nope. qemu migration doesn't work like that. Yes, it should, and >>> everyone knows it, but changing it is a really long term project. >> >> Well, we have a problem then. It looks like Qemu broken migration is >> fundamentally incompatible with PAPR and CAS design... > > Hrm, the fit is very clunky certainly, but i think we can make it work. > >> I know we don't migrate the configuration, that's not exactly what I >> had in mind tho... Can we have some piece of *data* from the machine be >> migrated first, and use it on the target to reconfigure the interrupt >> controller before the stream arrives ? > > Sorta.. maybe.. but it would probably get really ugly if we don't > preserve the usual way object lifetimes work. > >> Otherwise, we have indeed no much choice but the horrible wart of >> creating both interrupt controllers with only one "active". > > I really think this is the way to go, warts and all. Yes ... KVM makes it a little uglier. A KVM_DEVICE_DESTROY ioctl is needed to clean up the VM side, and a DISABLE_CAP to disconnect the vCPU from the current KVM XIVE/XICS device. I have used an extra arg on ENABLE_CAP for the moment. At the QEMU level, we need to connect/reconnect at reset time to handle possible changes in CAS, and at post_load. Destroying the MemoryRegion is a bit problematic: I have not found a common layout compatible with both the emulated mode (std IO regions) and the KVM mode (ram device regions). C.
On Thu, Apr 12, 2018 at 10:36:10AM +0200, Cédric Le Goater wrote: > On 04/12/2018 07:16 AM, David Gibson wrote: > > On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote: > >> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote: > >>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote: > >>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote: > >>>>> Migration is a problem. We will need both backend QEMU objects to be > >>>>> available anyhow if we want to migrate. So we are back to the current > >>>>> solution creating both QEMU objects but we can try to defer some of the > >>>>> KVM inits and create the KVM device on demand at CAS time. > >>>> > >>>> Do we have a way to migrate a piece of info from the machine *first* > >>>> that indicate what type of XICS/XIVE to instanciate ? > >>> > >>> Nope. qemu migration doesn't work like that. Yes, it should, and > >>> everyone knows it, but changing it is a really long term project. > >> > >> Well, we have a problem then. It looks like Qemu broken migration is > >> fundamentally incompatible with PAPR and CAS design... > > > > Hrm, the fit is very clunky certainly, but i think we can make it work. > > > >> I know we don't migrate the configuration, that's not exactly what I > >> had in mind tho... Can we have some piece of *data* from the machine be > >> migrated first, and use it on the target to reconfigure the interrupt > >> controller before the stream arrives ? > > > > Sorta.. maybe.. but it would probably get really ugly if we don't > > preserve the usual way object lifetimes work. > > > >> Otherwise, we have indeed no much choice but the horrible wart of > >> creating both interrupt controllers with only one "active". > > > > I really think this is the way to go, warts and all. > > > > Yes ... KVM makes it a little uglier. > > A KVM_DEVICE_DESTROY device is needed to cleanup the VM and a > DISABLE_CAP to disconnect the vpcu from the current KVM XIVE/XICS > device. I have used an extra arg on ENABLE_CAP for the moment. > > At the QEMU level, we need to connect/reconnect at reset time to > handle possible changes in CAS, and at post_load. Right. > Destroying the MemoryRegion is a bit problematic, I have not > found a common layout compatible with both the emulated mode > (std IO regions) and the KVM mode (ram device regions) That sounds awkward, I guess we'll discuss the details of this later. Btw, a secondary advantage of starting off with XIVE only under a different machine type is that we can declare that one not to be migration stable until we're ready. So we can merge something that's ok to experiment with, but reserve the right to incompatibly change the migration format until we're confident we're ready and can merge it into the "stable" machine type. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On 04/16/2018 06:29 AM, David Gibson wrote: > On Thu, Apr 12, 2018 at 10:36:10AM +0200, Cédric Le Goater wrote: >> On 04/12/2018 07:16 AM, David Gibson wrote: >>> On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote: >>>> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote: >>>>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote: >>>>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote: >>>>>>> Migration is a problem. We will need both backend QEMU objects to be >>>>>>> available anyhow if we want to migrate. So we are back to the current >>>>>>> solution creating both QEMU objects but we can try to defer some of the >>>>>>> KVM inits and create the KVM device on demand at CAS time. >>>>>> >>>>>> Do we have a way to migrate a piece of info from the machine *first* >>>>>> that indicate what type of XICS/XIVE to instanciate ? >>>>> >>>>> Nope. qemu migration doesn't work like that. Yes, it should, and >>>>> everyone knows it, but changing it is a really long term project. >>>> >>>> Well, we have a problem then. It looks like Qemu broken migration is >>>> fundamentally incompatible with PAPR and CAS design... >>> >>> Hrm, the fit is very clunky certainly, but i think we can make it work. >>> >>>> I know we don't migrate the configuration, that's not exactly what I >>>> had in mind tho... Can we have some piece of *data* from the machine be >>>> migrated first, and use it on the target to reconfigure the interrupt >>>> controller before the stream arrives ? >>> >>> Sorta.. maybe.. but it would probably get really ugly if we don't >>> preserve the usual way object lifetimes work. >>> >>>> Otherwise, we have indeed no much choice but the horrible wart of >>>> creating both interrupt controllers with only one "active". >>> >>> I really think this is the way to go, warts and all. >>> >> >> Yes ... KVM makes it a little uglier. >> >> A KVM_DEVICE_DESTROY device is needed to cleanup the VM and a >> DISABLE_CAP to disconnect the vpcu from the current KVM XIVE/XICS >> device. I have used an extra arg on ENABLE_CAP for the moment. >> >> At the QEMU level, we need to connect/reconnect at reset time to >> handle possible changes in CAS, and at post_load. > > Right. v3 uses the same 'reset' function to set up the interrupts at machine reset time and at post_load. Keep that in mind. Maybe we should have distinct routines. > >> Destroying the MemoryRegion is a bit problematic, I have not >> found a common layout compatible with both the emulated mode >> (std IO regions) and the KVM mode (ram device regions) > > That sounds awkward, I guess we'll discuss the details of this later. I have fixed that in v3. > Btw, a secondary advantage of starting off with XIVE only under a > different machine type is that we can declare that one not to be > migration stable until we're ready. So we can merge something that's > ok to experiment with, but reserve the right to incompatibly change > the migration format until we're confident we're ready and can merge > it into the "stable" machine type. Resetting KVM devices is not the most complex feature to support, but it seems to have an impact on migration. So we might need the non-migratable machine type to fix that. Let's see how v3 is welcomed or not. Thanks, C.
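For the "same routine at reset and at post_load" point above, the incoming side of a migration could simply reuse the reset-time connect path. A sketch, assuming a hypothetical spapr_irq_reset_kvm() helper that re-creates and re-connects the in-kernel device for the negotiated mode:

    static int spapr_irq_post_load(void *opaque, int version_id)
    {
        sPAPRMachineState *spapr = opaque;
        Error *local_err = NULL;

        /* re-create and re-connect the in-kernel device for the interrupt
         * mode that was negotiated before the migration */
        spapr_irq_reset_kvm(spapr, &local_err);         /* hypothetical */
        if (local_err) {
            error_report_err(local_err);
            return -1;
        }
        return 0;
    }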
On Wed, Jan 17, 2018 at 03:39:46PM +0100, Cédric Le Goater wrote: > On 01/17/2018 12:10 PM, Benjamin Herrenschmidt wrote: > > On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote: > >>>>> Also, have we decided how the process of switching between XICS and > >>>>> XIVE will work vs. CAS ? > >>>> > >>>> That's how it is described in the architecture. The current choice is > >>>> to create both XICS and XIVE objects and choose at CAS which one to > >>>> use. It relies today on the capability of the pseries machine to > >>>> allocate IRQ numbers for both interrupt controller backends. These > >>>> patches have been merged in QEMU. > >>>> > >>>> A change of interrupt mode results in a reset. The device tree is > >>>> populated accordingly and the ICPs are switched for the model in > >>>> use. > >>> > >>> For KVM we need to only instanciate one of them though. > >> > >> Hmm, > >> > >> How would we handle a guest rebooting on a kernel without XIVE support ? > > > > It will do CAS again and we can change the devices. > > So, we would destroy the previous QEMU ICS object and create a new one > in the CAS hcall. That would probably work. There might be some issues > in creating and destroying the ICS KVM device, but that can be studied > without XIVE. Adding and removing devices at runtime based on guest requests like this will get really hairy in qemu. As I've said before for the first cut, I think we want to select just one as a machine option to avoid this confusion. Looking further ahead, I think we'll be better off having both the XIVE and XICS models always present (at least minimally) in qemu, but with only one "active" at any given time. Note that having the inactive one destroy and clean up the corresponding KVM devices is fine, as is deallocating as much of its runtime state as we can without changing the notional QOM tree. > > It used to be considered ugly to create a QEMU device at reset time, so > I wonder if this is still the case, because when the machine reaches CAS, > we really are beyond reset. > > If this is OK, then the next "issue" is to keep in sync the allocated > IRQ numbers. The IRQ allocator is now merged at the machine level, so > the synchronization is obvious to do when both backend QEMU objects > are available. that's the path I took. If both QEMU objects are not > available, then we need to scan the IRQ number space in the current > interrupt mode and allocate the same IRQs in the newly negotiated mode. > Probably OK. I don't see major problems with the current code. > > Migration is a problem. We will need both backend QEMU objects to be > available anyhow if we want to migrate. So we are back to the current > solution creating both QEMU objects but we can try to defer some of the > KVM inits and create the KVM device on demand at CAS time. > > The next problem is the ICP object that currently needs the KVM device > fd to connect the vcpus ... So, we will need to change that also. > That is probably the biggest problem today. We need a way to disconnect > the vpcu from the KVM device and see how we can defer the connection. > I need to make sure this is possible, I can check that without XIVE > I think. > > >> Are you suggesting to create the XICS or XIVE device in the CAS negotiation > >> process ? So, the machine would not have any interrupt controller before > >> CAS. That seems really late to me. grub uses the console for instance. > > > > We start with XICS by default. > > yes. 
> > >> I think it should prepare for both options, start in XIVE legacy mode, > >> which is XICS, then possibly switch to XIVE exploitation mode. > >> > >>>>> And how that will interact with KVM ? > >>>> > >>>> I expect we will do the same, which is to create two KVM devices to > >>>> be able to handle both interrupt controller backends depending on the > >>>> mode negotiated by the guest. > >>> > >>> That will be an ungodly mess, I'd rather we only instanciate the right > >>> one. > >> > >> It's rather transparent currently in the emulated version. There are two > >> sets of objects in QEMU, switching is done in CAS. KVM support should not > >> change anything in that area. > >> > >> I expect the 'xive-kvm' object to get/set states for migration, just like > >> for XICS and to setup the ESB+TIMA memory regions, which is new. > > > > But both XICS and XIVE are completely different kernel KVM devices that will > > need to "hook" into the same set of internal hooks for things like interrupts > > being passed through, RTAS calls etc... > > > > How does KVM knows which one to "activate" ? > > Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? > I haven't studied all the low level details though. > > > I don't think the kernel should have both. > > I hear that. From a QEMU perspective, it is much easier to put everything > in place for both interrupt modes and let the guest decide what it wants > to use. > > If we choose not to, we will need to find solution to defer the KVM inits > and to disconnect/reconnect the vcpus. For the latter, we could add a > KVM_DISABLE_CAP ioctl or maybe better add a new capability like > KVM_CAP_IRQ_XIVE to perform the switch. > > > C. > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
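Keeping both backends "minimally present" makes the IRQ number synchronisation mentioned earlier in the thread straightforward: a machine-level claim routine can mark the same number in both. A rough sketch, where spapr_irq_claim() is a hypothetical wrapper, ics_set_irq_type() is the existing XICS helper, and spapr_xive_irq_enable() is the helper introduced by this series (its exact signature is assumed here):

    static void spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi)
    {
        /* XICS side: configure the source in the ICS */
        ics_set_irq_type(spapr->ics, irq - spapr->ics->offset, lsi);

        /* XIVE side: enable the same IRQ number in the source engine,
         * even if this backend is currently inactive */
        spapr_xive_irq_enable(spapr->xive, irq, lsi);
    }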
On 04/12/2018 07:15 AM, David Gibson wrote: > On Wed, Jan 17, 2018 at 03:39:46PM +0100, Cédric Le Goater wrote: >> On 01/17/2018 12:10 PM, Benjamin Herrenschmidt wrote: >>> On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote: >>>>>>> Also, have we decided how the process of switching between XICS and >>>>>>> XIVE will work vs. CAS ? >>>>>> >>>>>> That's how it is described in the architecture. The current choice is >>>>>> to create both XICS and XIVE objects and choose at CAS which one to >>>>>> use. It relies today on the capability of the pseries machine to >>>>>> allocate IRQ numbers for both interrupt controller backends. These >>>>>> patches have been merged in QEMU. >>>>>> >>>>>> A change of interrupt mode results in a reset. The device tree is >>>>>> populated accordingly and the ICPs are switched for the model in >>>>>> use. >>>>> >>>>> For KVM we need to only instanciate one of them though. >>>> >>>> Hmm, >>>> >>>> How would we handle a guest rebooting on a kernel without XIVE support ? >>> >>> It will do CAS again and we can change the devices. >> >> So, we would destroy the previous QEMU ICS object and create a new one >> in the CAS hcall. That would probably work. There might be some issues >> in creating and destroying the ICS KVM device, but that can be studied >> without XIVE. > > Adding and removing devices at runtime based on guest requests like > this will get really hairy in qemu. I confirm ... > As I've said before for the first cut, I think we want to select just > one as a machine option to avoid this confusion. OK > Looking further ahead, I think we'll be better off having both the > XIVE and XICS models always present (at least minimally) in qemu, but > with only one "active" at any given time. Under emulation it is not too complex to support both modes. XIVE and XICS objects are both created, but spapr->ov5_cas filters their usage. However, syncing the change in KVM is more complex. > Note that having the inactive one destroy and clean up the > corresponding KVM devices is fine, as is deallocating as much of its > runtime state as we can without changing the notional QOM tree. Yes. I will try to send a patchset organized this way: - spapr XIVE emulated mode (both modes supported) - XIVE KVM in an exclusive way; the machine will need to be restarted from the command line to change interrupt mode - support for changing the interrupt mode under KVM - powernv device model (rough) C. >> It used to be considered ugly to create a QEMU device at reset time, so >> I wonder if this is still the case, because when the machine reaches CAS, >> we really are beyond reset. >> >> If this is OK, then the next "issue" is to keep in sync the allocated >> IRQ numbers. The IRQ allocator is now merged at the machine level, so >> the synchronization is obvious to do when both backend QEMU objects >> are available. that's the path I took. If both QEMU objects are not >> available, then we need to scan the IRQ number space in the current >> interrupt mode and allocate the same IRQs in the newly negotiated mode. >> Probably OK. I don't see major problems with the current code. >> >> Migration is a problem. We will need both backend QEMU objects to be >> available anyhow if we want to migrate. So we are back to the current >> solution creating both QEMU objects but we can try to defer some of the >> KVM inits and create the KVM device on demand at CAS time. >> >> The next problem is the ICP object that currently needs the KVM device >> fd to connect the vcpus ... 
So, we will need to change that also. >> That is probably the biggest problem today. We need a way to disconnect >> the vpcu from the KVM device and see how we can defer the connection. >> I need to make sure this is possible, I can check that without XIVE >> I think. >> >>>> Are you suggesting to create the XICS or XIVE device in the CAS negotiation >>>> process ? So, the machine would not have any interrupt controller before >>>> CAS. That seems really late to me. grub uses the console for instance. >>> >>> We start with XICS by default. >> >> yes. >> >>>> I think it should prepare for both options, start in XIVE legacy mode, >>>> which is XICS, then possibly switch to XIVE exploitation mode. >>>> >>>>>>> And how that will interact with KVM ? >>>>>> >>>>>> I expect we will do the same, which is to create two KVM devices to >>>>>> be able to handle both interrupt controller backends depending on the >>>>>> mode negotiated by the guest. >>>>> >>>>> That will be an ungodly mess, I'd rather we only instanciate the right >>>>> one. >>>> >>>> It's rather transparent currently in the emulated version. There are two >>>> sets of objects in QEMU, switching is done in CAS. KVM support should not >>>> change anything in that area. >>>> >>>> I expect the 'xive-kvm' object to get/set states for migration, just like >>>> for XICS and to setup the ESB+TIMA memory regions, which is new. >>> >>> But both XICS and XIVE are completely different kernel KVM devices that will >>> need to "hook" into the same set of internal hooks for things like interrupts >>> being passed through, RTAS calls etc... >>> >>> How does KVM knows which one to "activate" ? >> >> Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? >> I haven't studied all the low level details though. >> >>> I don't think the kernel should have both. >> >> I hear that. From a QEMU perspective, it is much easier to put everything >> in place for both interrupt modes and let the guest decide what it wants >> to use. >> >> If we choose not to, we will need to find solution to defer the KVM inits >> and to disconnect/reconnect the vcpus. For the latter, we could add a >> KVM_DISABLE_CAP ioctl or maybe better add a new capability like >> KVM_CAP_IRQ_XIVE to perform the switch. >> >> >> C. >> >
On Wed, Jan 17, 2018 at 10:18:43AM +0100, Cédric Le Goater wrote: > >>> Also, have we decided how the process of switching between XICS and > >>> XIVE will work vs. CAS ? > >> > >> That's how it is described in the architecture. The current choice is > >> to create both XICS and XIVE objects and choose at CAS which one to > >> use. It relies today on the capability of the pseries machine to > >> allocate IRQ numbers for both interrupt controller backends. These > >> patches have been merged in QEMU. > >> > >> A change of interrupt mode results in a reset. The device tree is > >> populated accordingly and the ICPs are switched for the model in > >> use. > > > > For KVM we need to only instanciate one of them though. > > Hmm, > > How would we handle a guest rebooting on a kernel without XIVE support ? > Are you suggesting to create the XICS or XIVE device in the CAS negotiation > process ? So, the machine would not have any interrupt controller before > CAS. That seems really late to me. grub uses the console for instance. > > I think it should prepare for both options, start in XIVE legacy mode, > which is XICS, then possibly switch to XIVE exploitation mode. I think for our first draft we should have XIVE and XICS based platforms as separate machine types (or a machine option, I guess). We do want to allow this to be autonegotiated, but I feel like emphasising that at the beginning is causing unnatural design decisions in the XIVE model itself. > > >>> And how that will interact with KVM ? > >> > >> I expect we will do the same, which is to create two KVM devices to > >> be able to handle both interrupt controller backends depending on the > >> mode negotiated by the guest. > > > > That will be an ungodly mess, I'd rather we only instanciate the right > > one. > > It's rather transparent currently in the emulated version. There are two > sets of objects in QEMU, switching is done in CAS. KVM support should not > change anything in that area. > > I expect the 'xive-kvm' object to get/set states for migration, just like > for XICS and to setup the ESB+TIMA memory regions, which is new. > > C. > > >>> I was > >>> thinking the kernel would implement a different KVM device type, ie > >>> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be > >>> KVM_DEV_TYPE_XIVE. > >> > >> yes. it makes sense. The new device will have a lot in common with the > >> KVM_DEV_TYPE_XICS using kvm_xive_ops. > > > > Ben. > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On 04/12/2018 07:10 AM, David Gibson wrote: > On Wed, Jan 17, 2018 at 10:18:43AM +0100, Cédric Le Goater wrote: >>>>> Also, have we decided how the process of switching between XICS and >>>>> XIVE will work vs. CAS ? >>>> >>>> That's how it is described in the architecture. The current choice is >>>> to create both XICS and XIVE objects and choose at CAS which one to >>>> use. It relies today on the capability of the pseries machine to >>>> allocate IRQ numbers for both interrupt controller backends. These >>>> patches have been merged in QEMU. >>>> >>>> A change of interrupt mode results in a reset. The device tree is >>>> populated accordingly and the ICPs are switched for the model in >>>> use. >>> >>> For KVM we need to only instanciate one of them though. >> >> Hmm, >> >> How would we handle a guest rebooting on a kernel without XIVE support ? >> Are you suggesting to create the XICS or XIVE device in the CAS negotiation >> process ? So, the machine would not have any interrupt controller before >> CAS. That seems really late to me. grub uses the console for instance. >> >> I think it should prepare for both options, start in XIVE legacy mode, >> which is XICS, then possibly switch to XIVE exploitation mode. > > I think for our first draft we should have XIVE and XICS based > platforms as separate machine types (or a machine option, I guess). OK. This is my current choice for KVM. Emulated mode is rather simple to handle, and this is why I have kept the reset after CAS if there is a change in the interrupt mode. > We do want to allow this to be autonegotiated, but I feel like > emphasising that at the beginning is causing unnatural design > decisions in the XIVE model itself. Yes. This is mostly a KVM problem which also has impacts on XICS of course ... C. >> >>>>> And how that will interact with KVM ? >>>> >>>> I expect we will do the same, which is to create two KVM devices to >>>> be able to handle both interrupt controller backends depending on the >>>> mode negotiated by the guest. >>> >>> That will be an ungodly mess, I'd rather we only instanciate the right >>> one. >> >> It's rather transparent currently in the emulated version. There are two >> sets of objects in QEMU, switching is done in CAS. KVM support should not >> change anything in that area. >> >> I expect the 'xive-kvm' object to get/set states for migration, just like >> for XICS and to setup the ESB+TIMA memory regions, which is new. >> >> C. >> >>>>> I was >>>>> thinking the kernel would implement a different KVM device type, ie >>>>> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be >>>>> KVM_DEV_TYPE_XIVE. >>>> >>>> yes. it makes sense. The new device will have a lot in common with the >>>> KVM_DEV_TYPE_XICS using kvm_xive_ops. >>> >>> Ben. >>> >> >
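The "machine option" discussed above could be a simple string property on the pseries machine. The property name "ic-mode", the getter/setter, and the xive_enabled field below are assumptions used for illustration, not code from this series:

    static char *spapr_get_ic_mode(Object *obj, Error **errp)
    {
        sPAPRMachineState *spapr = SPAPR_MACHINE(obj);

        return g_strdup(spapr->xive_enabled ? "xive" : "xics");
    }

    static void spapr_set_ic_mode(Object *obj, const char *value, Error **errp)
    {
        sPAPRMachineState *spapr = SPAPR_MACHINE(obj);

        if (!strcmp(value, "xive")) {
            spapr->xive_enabled = true;
        } else if (!strcmp(value, "xics")) {
            spapr->xive_enabled = false;
        } else {
            error_setg(errp, "Bad value for ic-mode property: %s", value);
        }
    }

    /* registered from the machine instance init with:
     *   object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
     *                           spapr_set_ic_mode, NULL);
     */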
On Thu, Dec 21, 2017 at 11:12:06AM +1100, Benjamin Herrenschmidt wrote: > On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote: > > > > As you've suggested in yourself, I think we might need to more > > explicitly model the different components of the XIVE system. As part > > of that, I think you need to be clearer in this base skeleton about > > exactly what component your XIVE object represents. > > > > If the answer is "the overall thing" I suspect that's not what you > > want - I had one of those for XICs which proved to be a mistake > > (eventually replaced by the XICSFabric interface). > > > > Changing the model later isn't impossible, but doing so without > > breaking migration can be a real pain, so I think it's worth a > > reasonable effort to try and get it right initially. > > Note: we do need to speed things up a bit, as having exploitation mode > in KVM will significantly help with IPI performance among other things. > > I'm about ready to do the KVM bits. The one thing we need to discuss > and figure a good design for is how we map all those interrupt control > pages into qemu. > > Each interrupt (either PCIe pass-through or the "generic XIVE IPIs" > which are used for guest IPIs and for vio/virtio/emulated interrupts) > comes with a "control page" (ESB page) which needs to be mapped into > the guest, and the generic IPIs also come with a trigger page which > needs to be mapped into the guest for guest IPIs or OpenCAPI > interrupts, or just qemu for emulated devices. > > Now that can be thousands of these critters. I certainly don't want to > create thousands of VMAs in qemu and even less thousands of memory > regions in KVM. > > So we need some kind of mechanism by wich a single large VMA gets > mmap'ed into qemu (or maybe a couple of these, but not too many) and > the interrupt pages can be assigned to slots in there and demand > faulted. Ok, I see your point. We'll definitely need to be able to map things in as a block, rather than one by one. > For the generic interrupts, this can probably be covered by KVM, adding > some arch ioctls for allocating IPIs and mmap'ing that region etc... > > For pass-through, it's trickier, we don't want to mmap each irqfd > individually for the above reason, so we want to "link" them to KVM. We > don't want to allow qemu to take control of any arbitrary interrupt in > the system though, so it has to related to the ownership of the irqfd > coming from vfio. > > OpenCAPI I suspect will be its own can of worms... > > Also, have we decided how the process of switching between XICS and > XIVE will work vs. CAS ? And how that will interact with KVM ? I was > thinking the kernel would implement a different KVM device type, ie > the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be > KVM_DEV_TYPE_XIVE. > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On 04/12/2018 07:08 AM, David Gibson wrote: > On Thu, Dec 21, 2017 at 11:12:06AM +1100, Benjamin Herrenschmidt wrote: >> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote: >>> >>> As you've suggested in yourself, I think we might need to more >>> explicitly model the different components of the XIVE system. As part >>> of that, I think you need to be clearer in this base skeleton about >>> exactly what component your XIVE object represents. >>> >>> If the answer is "the overall thing" I suspect that's not what you >>> want - I had one of those for XICs which proved to be a mistake >>> (eventually replaced by the XICSFabric interface). >>> >>> Changing the model later isn't impossible, but doing so without >>> breaking migration can be a real pain, so I think it's worth a >>> reasonable effort to try and get it right initially. >> >> Note: we do need to speed things up a bit, as having exploitation mode >> in KVM will significantly help with IPI performance among other things. >> >> I'm about ready to do the KVM bits. The one thing we need to discuss >> and figure a good design for is how we map all those interrupt control >> pages into qemu. >> >> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs" >> which are used for guest IPIs and for vio/virtio/emulated interrupts) >> comes with a "control page" (ESB page) which needs to be mapped into >> the guest, and the generic IPIs also come with a trigger page which >> needs to be mapped into the guest for guest IPIs or OpenCAPI >> interrupts, or just qemu for emulated devices. >> >> Now that can be thousands of these critters. I certainly don't want to >> create thousands of VMAs in qemu and even less thousands of memory >> regions in KVM. >> >> So we need some kind of mechanism by wich a single large VMA gets >> mmap'ed into qemu (or maybe a couple of these, but not too many) and >> the interrupt pages can be assigned to slots in there and demand >> faulted. > > Ok, I see your point. We'll definitely need to be able to map things > in as a block, rather than one by one. So, the approach taken is to expose a single mmap() to the guest through one ram_device memory region. The size is that of the IRQ number space, which is hardcoded to 4096 (IPIs) + 1024 (virtual device interrupts) in QEMU. We can change that, but the 4K split is important for XICS compatibility. The KVM XIVE device should self-adapt. C. >> For the generic interrupts, this can probably be covered by KVM, adding >> some arch ioctls for allocating IPIs and mmap'ing that region etc... >> >> For pass-through, it's trickier, we don't want to mmap each irqfd >> individually for the above reason, so we want to "link" them to KVM. We >> don't want to allow qemu to take control of any arbitrary interrupt in >> the system though, so it has to related to the ownership of the irqfd >> coming from vfio. >> >> OpenCAPI I suspect will be its own can of worms... >> >> Also, have we decided how the process of switching between XICS and >> XIVE will work vs. CAS ? And how that will interact with KVM ? I was >> thinking the kernel would implement a different KVM device type, ie >> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be >> KVM_DEV_TYPE_XIVE. >> >
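A sketch of the single-mapping approach described above, assuming the in-kernel device exposes the whole ESB space through one mmap() on its fd. The KVM_XIVE_ESB_PAGE_OFFSET constant, the 64K-per-source layout, and the esb_mmio field are assumptions; memory_region_init_ram_device_ptr() is the existing QEMU API:

    #define SPAPR_IRQ_NR      (4096 + 1024)   /* IPIs + device interrupts */
    #define XIVE_ESB_PAGE_LEN (1ull << 16)    /* assuming one 64K ESB page per source */

    static void spapr_xive_map_esb(sPAPRXive *xive, int kvm_fd, Error **errp)
    {
        size_t esb_len = (size_t)SPAPR_IRQ_NR * XIVE_ESB_PAGE_LEN;
        void *addr;

        /* one large mapping for all sources, demand faulted by the kernel */
        addr = mmap(NULL, esb_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                    kvm_fd, KVM_XIVE_ESB_PAGE_OFFSET);   /* hypothetical offset */
        if (addr == MAP_FAILED) {
            error_setg_errno(errp, errno, "Unable to map XIVE ESB pages");
            return;
        }

        /* one region covers every source: no per-IRQ VMA or MemoryRegion */
        memory_region_init_ram_device_ptr(&xive->esb_mmio, OBJECT(xive),
                                          "xive.esb", esb_len, addr);
    }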