[Qemu-devel] [PATCH] docs: provide documentation on the POWER9 XIVE interrupt controller

Cédric Le Goater posted 1 patch 10 weeks ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20190514064627.3838-1-clg@kaod.org
docs/index.rst     |   1 +
docs/ppc/index.rst |  13 ++
docs/ppc/xive.rst  | 344 +++++++++++++++++++++++++++++++++++++++++++++
MAINTAINERS        |   1 +
4 files changed, 359 insertions(+)
create mode 100644 docs/ppc/index.rst
create mode 100644 docs/ppc/xive.rst

Posted by Cédric Le Goater 10 weeks ago
This documents the overall XIVE architecture and gives an overview of
the QEMU models. It also provides documentation on the 'info pic'
command.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 docs/index.rst     |   1 +
 docs/ppc/index.rst |  13 ++
 docs/ppc/xive.rst  | 344 +++++++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS        |   1 +
 4 files changed, 359 insertions(+)
 create mode 100644 docs/ppc/index.rst
 create mode 100644 docs/ppc/xive.rst

diff --git a/docs/index.rst b/docs/index.rst
index 3690955dd1f5..557fe86233e3 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -12,4 +12,5 @@ Welcome to QEMU's documentation!
 
    interop/index
    devel/index
+   ppc/index
 
diff --git a/docs/ppc/index.rst b/docs/ppc/index.rst
new file mode 100644
index 000000000000..146f416ea3a0
--- /dev/null
+++ b/docs/ppc/index.rst
@@ -0,0 +1,13 @@
+.. This is the top level page for the 'ppc' manual
+
+
+QEMU PowerPC Machine and Controller Guide
+=========================================
+
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   xive
diff --git a/docs/ppc/xive.rst b/docs/ppc/xive.rst
new file mode 100644
index 000000000000..90ddde6bf39f
--- /dev/null
+++ b/docs/ppc/xive.rst
@@ -0,0 +1,344 @@
+================================
+POWER9 XIVE interrupt controller
+================================
+
+The POWER9 processor comes with a new interrupt controller
+architecture, called XIVE for "eXternal Interrupt Virtualization
+Engine".
+
+Compared to the previous architecture, the main characteristics of
+XIVE are to support a larger number of interrupt sources and to
+deliver interrupts directly to virtual processors without hypervisor
+assistance. This removes the context switches required for the
+delivery process.
+
+
+Overall architecture
+====================
+
+The XIVE IC is composed of three sub-engines, each taking care of a
+processing layer of external interrupts:
+
+- Interrupt Virtualization Source Engine (IVSE), or Source Controller
+  (SC). These are found in PCI PHBs, in the PSI host bridge
+  controller, but also inside the main controller for the core IPIs
+  and other sub-chips (NX, CAP, NPU) of the chip/processor. They are
+  configured to feed the IVRE with events.
+- Interrupt Virtualization Routing Engine (IVRE) or Virtualization
+  Controller (VC). It handles event coalescing and performs interrupt
+  routing by matching an event source number with an Event
+  Notification Descriptor (END).
+- Interrupt Virtualization Presentation Engine (IVPE) or Presentation
+  Controller (PC). It maintains the interrupt context state of each
+  thread and handles the delivery of the external interrupt to the
+  thread.
+
+::
+
+                XIVE Interrupt Controller
+                +------------------------------------+      IPIs
+                | +---------+ +---------+ +--------+ |    +-------+
+                | |IVRE     | |Common Q | |IVPE    |----> | CORES |
+                | |     esb | |         | |        |----> |       |
+                | |     eas | |  Bridge | |   tctx |----> |       |
+                | |SC   end | |         | |    nvt | |    |       |
+    +------+    | +---------+ +----+----+ +--------+ |    +-+-+-+-+
+    | RAM  |    +------------------|-----------------+      | | |
+    |      |                       |                        | | |
+    |      |                       |                        | | |
+    |      |  +--------------------v------------------------v-v-v--+    other
+    |      <--+                     Power Bus                      +--> chips
+    |  esb |  +---------+-----------------------+------------------+
+    |  eas |            |                       |
+    |  end |         +--|------+                |
+    |  nvt |       +----+----+ |           +----+----+
+    +------+       |IVSE     | |           |IVSE     |
+                   |         | |           |         |
+                   | PQ-bits | |           | PQ-bits |
+                   | local   |-+           |  in VC  |
+                   +---------+             +---------+
+                      PCIe                 NX,NPU,CAPI
+
+
+    PQ-bits: 2 bits source state machine (P:pending Q:queued)
+    esb: Event State Buffer (Array of PQ bits in an IVSE)
+    eas: Event Assignment Structure
+    end: Event Notification Descriptor
+    nvt: Notification Virtual Target
+    tctx: Thread interrupt Context registers
+
+
+
+XIVE internal tables
+--------------------
+
+Each of the sub-engines uses a set of tables to redirect interrupts
+from event sources to CPU threads.
+
+::
+
+                                            +-------+
+    User or O/S                             |  EQ   |
+        or                          +------>|entries|
+    Hypervisor                      |       |  ..   |
+      Memory                        |       +-------+
+                                    |           ^
+                                    |           |
+               +-------------------------------------------------+
+                                    |           |
+    Hypervisor      +------+    +---+--+    +---+--+   +------+
+      Memory        | ESB  |    | EAT  |    | ENDT |   | NVTT |
+     (skiboot)      +----+-+    +----+-+    +----+-+   +------+
+                      ^  |        ^  |        ^  |       ^
+                      |  |        |  |        |  |       |
+               +-------------------------------------------------+
+                      |  |        |  |        |  |       |
+                      |  |        |  |        |  |       |
+                 +----|--|--------|--|--------|--|-+   +-|-----+    +------+
+                 |    |  |        |  |        |  | |   | | tctx|    |Thread|
+     IPI or   ---+    +  v        +  v        +  v |---| +  .. |----->     |
+    HW events    |                                 |   |       |    |      |
+                 |             IVRE                |   | IVPE  |    +------+
+                 +---------------------------------+   +-------+
+
+
+Each IVSE has a 2-bit state machine per source, P for pending and Q
+for queued, that gates the triggering of events. The PQ bits are stored
+in an Event State Buffer (ESB) array and can be controlled by MMIOs.
+
+If the event is let through, the IVRE looks up the Event Assignment
+Structure (EAS) table to find the Event Notification Descriptor (END)
+configured for the source. Each Event Notification Descriptor defines
+a notification path to a CPU and an in-memory Event Queue, in which
+an EQ data entry is enqueued for the O/S to pull.
+
+The IVPE determines if a Notification Virtual Target (NVT) can handle
+the event by scanning the thread contexts of the VCPUs dispatched on
+the processor HW threads. It maintains the interrupt context state of
+each thread in an NVT table.
+
+XIVE thread interrupt context
+-----------------------------
+
+The XIVE presenter can generate four different exceptions to its
+HW threads:
+
+- hypervisor exception
+- O/S exception
+- Event-Based Branch (user level)
+- msgsnd (doorbell)
+
+Each exception has a state independent from the others called a Thread
+Interrupt Management context. This context is a set of registers which
+lets the thread handle priority management and interrupt
+acknowledgment among other things. The most important ones are:
+
+- Pending Interrupt Priority Register (PIPR)
+- Interrupt Pending Buffer            (IPB)
+- Current Processor Priority Register (CPPR)
+- Notification Source Register        (NSR)
+
+TIMA
+~~~~
+
+The Thread Interrupt Management registers are accessible through a
+specific MMIO region, called the Thread Interrupt Management Area
+(TIMA), made of four aligned pages, each exposing a different view of
+the registers. The first page (page address ending in ``0b00``) gives
+access to the entire context and is reserved for the ring 0 view for
+the physical thread context. The second (page address ending in
+``0b01``) is for the hypervisor, ring 1 view. The third (page address
+ending in ``0b10``) is for the operating system, ring 2 view. The
+fourth (page address ending in ``0b11``) is for user level, ring 3 view.
+
+Interrupt flow from an O/S perspective
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After event data has been enqueued in the O/S Event Queue, the IVPE
+raises the bit corresponding to the priority of the pending interrupt
+in the Interrupt Pending Buffer (IPB) register to indicate that an
+event is pending in one of the 8 priority queues. The Pending
+Interrupt Priority Register (PIPR) is also updated using the IPB. This
+register represents the priority of the most favored pending
+notification.
+
+The PIPR is then compared to the Current Processor Priority
+Register (CPPR). If it is more favored (numerically less than), the
+CPU interrupt line is raised and the EO bit of the Notification Source
+Register (NSR) is updated to notify the presence of an exception for
+the O/S. The O/S acknowledges the interrupt with a special load in the
+Thread Interrupt Management Area.
+
+The O/S handles the interrupt and, when done, performs an EOI using an
+MMIO operation on the ESB management page of the associated source.
+
+
+Overview of the QEMU models for XIVE
+====================================
+
+The XiveSource models the IVSE in general, internal and external. It
+handles the source ESBs and the MMIO interface to control them.
+
+The XiveNotifier is a small helper interface interconnecting the
+XiveSource to the XiveRouter.
+
+The XiveRouter is an abstract model acting as a combined IVRE and
+IVPE. It routes event notifications using the EAS and END tables to
+the IVPE sub-engine which does a CAM scan to find a CPU to deliver the
+exception. Storage should be provided by the inheriting classes.
+
+XiveENDSource is a special source object. It exposes the END ESB MMIOs
+of the Event Queues which are used for coalescing event notifications
+and for escalation. It is not used in the field, only to sync the EQ
+cache in OPAL.
+
+Finally, the XiveTCTX contains the interrupt state context of a thread,
+four sets of registers, one for each exception that can be delivered
+to a CPU. These contexts are scanned by the IVPE to find a matching VP
+when a notification is triggered. It also models the Thread Interrupt
+Management Area (TIMA), which exposes the thread context registers to
+the CPU for interrupt management.
+
+
+XIVE for sPAPR (pseries machines)
+=================================
+
+SpaprXive models the XIVE interrupt controller of a ``pseries``
+machine. It inherits from the XiveRouter and provisions storage for
+the EAS and END tables. The NVT table does not need a backend in
+sPAPR. It owns a XiveSource object for the IPIs and the virtual device
+interrupts, a memory region for the TIMA and a XiveENDSource object to
+manage the END ESBs (not used by Linux).
+
+These choices were made to have an sPAPR interrupt controller consistent
+with the one found on bare metal and to facilitate KVM support, the
+main difficulty being the host memory regions exposed to the guest.
+
+CAS Negotiation
+---------------
+
+The interrupt mode advertised by the ``pseries`` machine in the CAS
+negotiation process depends on the CPU type (XIVE requires POWER9) but
+also on the machine property ``ic-mode`` which can take the following
+values: ``xics``, ``xive`` and ``dual``. ``xics`` is currently the
+default mode but it should change in the future.
+
+The chosen interrupt mode is activated after a reconfiguration done
+in a machine reset.
+
+KVM support
+-----------
+
+Two host memory regions are exposed to the guest and require special
+attention at initialization:
+
+- ESB MMIOs
+- Thread Interrupt Management Area (TIMA)
+
+When using the KVM device, these are ``ram device`` memory mappings,
+similar to VFIO, exposed to the guest. The associated VMAs on the
+host are populated dynamically with the appropriate pages using a
+fault handler.
+
+The models use KVM accessors to synchronize the QEMU state with KVM:
+
+- the source configuration (EAT)
+- the END configuration (ENDT)
+- the O/S EQ state (toggle bit and index)
+- the thread interrupt context registers.
+
+Hybrid guests using KVM and an emulated irqchip (``kernel_irqchip=off``)
+are supported.
+
+Monitoring XIVE
+---------------
+
+The state of the XIVE interrupt controller can be queried through the
+monitor command ``info pic``. The output comes in two parts.
+
+First, the state of the thread interrupt context registers is dumped
+for each CPU:
+
+::
+
+   (qemu) info pic
+   CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
+   CPU[0000]: USER    00   00  00    00   00  00  00   00  00000000
+   CPU[0000]:   OS    00   ff  00    00   ff  00  ff   ff  80000400
+   CPU[0000]: POOL    00   00  00    00   00  00  00   00  00000000
+   CPU[0000]: PHYS    00   00  00    00   00  00  00   ff  00000000
+   ...
+
+In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only
+the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM
+line which is set to the VP identifier.
+
+Then comes the routing information which aggregates the EAS and the
+END configuration:
+
+::
+
+   ...
+   LISN         PQ    EISN     CPU/PRIO EQ
+   00000000 MSI --    00000010   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00000001 MSI --    00000010   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00000002 MSI --    00000010   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
+   00000003 MSI --    00000010   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
+   00000004 MSI -Q  M 00000000
+   00000005 MSI -Q  M 00000000
+   00000006 MSI -Q  M 00000000
+   00000007 MSI -Q  M 00000000
+   00001000 MSI --    00000012   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00001001 MSI --    00000013   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
+   00001100 MSI --    00000100   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00001101 MSI -Q  M 00000000
+   00001200 LSI -Q  M 00000000
+   00001201 LSI -Q  M 00000000
+   00001202 LSI -Q  M 00000000
+   00001203 LSI -Q  M 00000000
+   00001300 MSI --    00000102   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
+   00001301 MSI --    00000103   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
+   00001302 MSI --    00000104   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
+
+The source information and configuration:
+
+- The ``LISN`` column outputs the interrupt number of the source in
+  range ``[ 0x0 ... 0x1FFF ]`` and its type: ``MSI`` or ``LSI``
+- The ``PQ`` column reflects the state of the PQ bits of the source:
+
+  - ``--`` source is ready to take events
+  - ``P-`` an event was sent and an EOI is PENDING
+  - ``PQ`` an event was QUEUED
+  - ``-Q`` source is OFF
+
+  An ``M`` indicates that the source is *MASKED* at the EAS level.
+
+The targeting configuration:
+
+- The ``EISN`` column is the event data that will be queued in the event
+  queue of the O/S.
+- The ``CPU/PRIO`` column is the tuple defining the CPU number and
+  priority queue serving the source.
+- The ``EQ`` column outputs:
+
+  - the current index of the event queue / the max number of entries
+  - the O/S event queue address
+  - the toggle bit
+  - the last entries that were pushed in the event queue.
+
+
+
+XIVE for PowerNV
+================
+
+The PnvXIVE model uses the XiveRouter abstract model just like
+SpaprXive. It provides accessors to the EAS, END and NVT tables which
+are stored in the RAM of the PowerNV machine and not in QEMU anymore. It
+owns a set of memory regions for the IC registers, the ESBs, the END
+ESBs, the TIMA and the notification MMIO.
+
+Multichip is supported and the available IVSEs are the internal one
+for the IPIs, the PSI host bridge controller and PHB4.
+
+The next interesting step would be to add escalation events and model
+the VCPU dispatching to support emulated KVM guests.
diff --git a/MAINTAINERS b/MAINTAINERS
index 66ddbda9c958..a896c7407294 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ L: qemu-ppc@nongnu.org
 S: Supported
 F: hw/*/*xive*
 F: include/hw/*/*xive*
+F: docs/ppc/xive.rst
 
 Subsystems
 ----------
-- 
2.20.1

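The PQ states listed in the 'info pic' part of xive.rst above ("--",
"P-", "PQ" and "-Q") can be read as a small state machine per source.
The Python sketch below is one way to picture it. The function names are
hypothetical, and the exact transitions are an assumption derived from
the state descriptions in the patch, not a copy of the QEMU code.

```python
# Illustrative model of the 2-bit PQ source state machine.
# State names follow the 'info pic' legend; function names are made up.
RESET, OFF, PENDING, QUEUED = "--", "-Q", "P-", "PQ"


def esb_trigger(pq):
    """Return (new_state, notify) when an event hits a source."""
    if pq == RESET:
        return PENDING, True    # event is let through to the IVRE
    if pq in (PENDING, QUEUED):
        return QUEUED, False    # coalesced: remembered, not forwarded
    return OFF, False           # source is off: event is dropped


def esb_eoi(pq):
    """Return (new_state, notify) when the O/S performs an EOI."""
    if pq == PENDING:
        return RESET, False     # nothing was queued meanwhile
    if pq == QUEUED:
        return PENDING, True    # replay the coalesced event
    return pq, False            # RESET and OFF are left unchanged


if __name__ == "__main__":
    state = RESET
    for action in (esb_trigger, esb_trigger, esb_eoi, esb_eoi):
        state, notify = action(state)
        print(action.__name__, state, notify)
```

The point of the Q bit in this reading is coalescing: events arriving
while an EOI is still pending are folded into a single replay at EOI
time.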

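The O/S interrupt flow described in xive.rst above (a bit is raised in
the IPB, the PIPR is derived from the IPB, then the PIPR is compared
with the CPPR) can be sketched in a few lines of Python. The helper
names are hypothetical, and the bit layout (priority 0 in the most
significant bit of an 8-bit IPB) is an assumption made for illustration.

```python
# Illustrative model of the IPB / PIPR / CPPR relationship.
# Assumption: 8 priorities, 0 is the most favored, and priority p is
# recorded in the IPB as bit (0x80 >> p).
PRIORITY_MAX = 7


def priority_to_ipb(priority):
    """Return the IPB bit for a priority, or 0 if it is out of range."""
    if priority > PRIORITY_MAX:
        return 0
    return 1 << (PRIORITY_MAX - priority)


def ipb_to_pipr(ipb):
    """Return the most favored pending priority, 0xFF if none is pending."""
    for priority in range(PRIORITY_MAX + 1):
        if ipb & (0x80 >> priority):
            return priority
    return 0xFF


def os_exception_pending(ipb, cppr):
    """True when the PIPR is more favored (numerically less) than the CPPR."""
    return ipb_to_pipr(ipb) < cppr


if __name__ == "__main__":
    ipb = priority_to_ipb(6) | priority_to_ipb(3)
    print(hex(ipb), ipb_to_pipr(ipb), os_exception_pending(ipb, 0xFF))
```

With an empty IPB the sketch yields a PIPR of 0xFF, which is consistent
with the OS line of the 'info pic' sample where IPB is 00 and PIPR is
ff.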
Re: [Qemu-devel] [PATCH] docs: provide documentation on the POWER9 XIVE interrupt controller

Posted by Peter Maydell 10 weeks ago
On Tue, 14 May 2019 at 07:46, Cédric Le Goater <clg@kaod.org> wrote:
>
> This documents the overall XIVE architecture and gives an overview of
> the QEMU models. It also provides documentation on the 'info pic'
> command.
>
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  docs/index.rst     |   1 +
>  docs/ppc/index.rst |  13 ++
>  docs/ppc/xive.rst  | 344 +++++++++++++++++++++++++++++++++++++++++++++
>  MAINTAINERS        |   1 +
>  4 files changed, 359 insertions(+)
>  create mode 100644 docs/ppc/index.rst
>  create mode 100644 docs/ppc/xive.rst

Hi -- it's great to see this documentation. Unfortunately,
where you've put it doesn't match our intended layout for docs.

Each subdirectory of docs/ becomes its own manual, and
the intention is to eventually have five manuals
(as sketched out in https://wiki.qemu.org/Features/Documentation):
 * QEMU user mode emulation -- docs/user
 * QEMU full-system emulation user's guide -- docs/system
 * QEMU full-system emulation management and interoperability guide --
docs/interop
 * QEMU full-system emulation guest hardware specifications  -- docs/specs
 * QEMU developer's guide -- docs/devel

We don't want to have a separate PPC-specific manual.

Currently we only have interop and devel. I have on
my todo list to try to sort out the others, including
figuring out how to transition from our current set
of texinfo-based manuals to this layout.

I'm not sure exactly where this document should live.
From a quick scan it appears to be mixing together
information aimed at several different audiences --
the "Overview of the QEMU models for XIVE" part looks
like information about QEMU internals which belongs
in docs/devel, but some other parts seem to be user
facing information which should go in one of the
other manuals.

thanks
-- PMM

Re: [Qemu-devel] [PATCH] docs: provide documentation on the POWER9 XIVE interrupt controller

Posted by Cédric Le Goater 10 weeks ago
On 5/14/19 11:17 AM, Peter Maydell wrote:
> On Tue, 14 May 2019 at 07:46, Cédric Le Goater <clg@kaod.org> wrote:
>>
>> This documents the overall XIVE architecture and gives an overview of
>> the QEMU models. It also provides documentation on the 'info pic'
>> command.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  docs/index.rst     |   1 +
>>  docs/ppc/index.rst |  13 ++
>>  docs/ppc/xive.rst  | 344 +++++++++++++++++++++++++++++++++++++++++++++
>>  MAINTAINERS        |   1 +
>>  4 files changed, 359 insertions(+)
>>  create mode 100644 docs/ppc/index.rst
>>  create mode 100644 docs/ppc/xive.rst
> 
> Hi -- it's great to see this documentation. Unfortunately,
> where you've put it doesn't match our intended layout for docs.

OK. I guess I need to split the file into multiple parts.
 
> Each subdirectory of docs/ becomes its own manual, and
> the intention is to eventually have five manuals
> (as sketched out in https://wiki.qemu.org/Features/Documentation):
>  * QEMU user mode emulation -- docs/user
>  * QEMU full-system emulation user's guide -- docs/system

Should we put the documentation of machine options under this directory?

>  * QEMU full-system emulation management and interoperability guide --
> docs/interop

There, I could put the 'info pic' documentation.

>  * QEMU full-system emulation guest hardware specifications  -- docs/specs

and there, the low level information on the XIVE controller.

>  * QEMU developer's guide -- docs/devel

and finally, there, some of the documentation on QEMU modeling.  

> We don't want to have a separate PPC-specific manual.

OK.

> Currently we only have interop and devel. I have on
> my todo list to try to sort out the others, including
> figuring out how to transition from our current set
> of texinfo-based manuals to this layout.
> 
> I'm not sure exactly where this document should live.
> From a quick scan it appears to be mixing together
> information aimed at several different audiences --
> the "Overview of the QEMU models for XIVE" part looks
> like information about QEMU internals which belongs
> in docs/devel, but some other parts seem to be user
> facing information which should go in one of the
> other manuals.

What is nice about the single file model is that you find all 
the information related to one topic in one place. 

Can manuals reference each other?

Thanks,

C. 

Re: [Qemu-devel] [Qemu-ppc] [PATCH] docs: provide documentation on the POWER9 XIVE interrupt controller

Posted by Satheesh Rajendran 10 weeks ago
On Tue, May 14, 2019 at 08:46:27AM +0200, Cédric Le Goater wrote:
> This documents the overall XIVE architecture and gives an overview of
> the QEMU models. It also provides documentation on the 'info pic'
> command.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  docs/index.rst     |   1 +
>  docs/ppc/index.rst |  13 ++
>  docs/ppc/xive.rst  | 344 +++++++++++++++++++++++++++++++++++++++++++++
>  MAINTAINERS        |   1 +
>  4 files changed, 359 insertions(+)
>  create mode 100644 docs/ppc/index.rst
>  create mode 100644 docs/ppc/xive.rst

Overall, the doc looks great. I have a few minor suggestions below.
> 
> diff --git a/docs/index.rst b/docs/index.rst
> index 3690955dd1f5..557fe86233e3 100644
> --- a/docs/index.rst
> +++ b/docs/index.rst
> @@ -12,4 +12,5 @@ Welcome to QEMU's documentation!
>  
>     interop/index
>     devel/index
> +   ppc/index
>  
> diff --git a/docs/ppc/index.rst b/docs/ppc/index.rst
> new file mode 100644
> index 000000000000..146f416ea3a0
> --- /dev/null
> +++ b/docs/ppc/index.rst
> @@ -0,0 +1,13 @@
> +.. This is the top level page for the 'ppc' manual
> +
> +
> +QEMU PowerPC Machine and Controller Guide
> +=========================================
> +
> +
> +Contents:
> +
> +.. toctree::
> +   :maxdepth: 2
> +
> +   xive
> diff --git a/docs/ppc/xive.rst b/docs/ppc/xive.rst
> new file mode 100644
> index 000000000000..90ddde6bf39f
> --- /dev/null
> +++ b/docs/ppc/xive.rst
> @@ -0,0 +1,344 @@
> +================================
> +POWER9 XIVE interrupt controller
> +================================
> +
> +The POWER9 processor comes with a new interrupt controller
> +architecture, called XIVE as "eXternal Interrupt Virtualization
> +Engine".
> +
> +Compared to the previous architecture, the main characteristics of
> +XIVE are to support a larger number of interrupt sources and to
> +deliver interrupts directly to virtual processors without hypervisor
> +assistance. This removes the context switches required for the
> +delivery process.
> +
> +
> +Overall architecture
> +====================
> +
> +The XIVE IC is composed of three sub-engines, each taking care of a
> +processing layer of external interrupts:
> +
> +- Interrupt Virtualization Source Engine (IVSE), or Source Controller
> +  (SC). These are found in PCI PHBs, in the PSI host bridge
> +  controller, but also inside the main controller for the core IPIs
> +  and other sub-chips (NX, CAP, NPU) of the chip/processor. They are
> +  configured to feed the IVRE with events.
> +- Interrupt Virtualization Routing Engine (IVRE) or Virtualization
> +  Controller (VC). It handles event coalescing and perform interrupt
> +  routing by matching an event source number with an Event
> +  Notification Descriptor (END).
> +- Interrupt Virtualization Presentation Engine (IVPE) or Presentation
> +  Controller (PC). It maintains the interrupt context state of each
> +  thread and handles the delivery of the external interrupt to the
> +  thread.
> +
> +::
> +
> +                XIVE Interrupt Controller
> +                +------------------------------------+      IPIs
> +                | +---------+ +---------+ +--------+ |    +-------+
> +                | |IVRE     | |Common Q | |IVPE    |----> | CORES |
> +                | |     esb | |         | |        |----> |       |
> +                | |     eas | |  Bridge | |   tctx |----> |       |
> +                | |SC   end | |         | |    nvt | |    |       |
> +    +------+    | +---------+ +----+----+ +--------+ |    +-+-+-+-+
> +    | RAM  |    +------------------|-----------------+      | | |
> +    |      |                       |                        | | |
> +    |      |                       |                        | | |
> +    |      |  +--------------------v------------------------v-v-v--+    other
> +    |      <--+                     Power Bus                      +--> chips
> +    |  esb |  +---------+-----------------------+------------------+
> +    |  eas |            |                       |
> +    |  end |         +--|------+                |
> +    |  nvt |       +----+----+ |           +----+----+
> +    +------+       |IVSE     | |           |IVSE     |
> +                   |         | |           |         |
> +                   | PQ-bits | |           | PQ-bits |
> +                   | local   |-+           |  in VC  |
> +                   +---------+             +---------+
> +                      PCIe                 NX,NPU,CAPI
> +
> +
> +    PQ-bits: 2 bits source state machine (P:pending Q:queued)
> +    esb: Event State Buffer (Array of PQ bits in an IVSE)
> +    eas: Event Assignment Structure
> +    end: Event Notification Descriptor
> +    nvt: Notification Virtual Target
> +    tctx: Thread interrupt Context registers
> +
> +
> +
> +XIVE internal tables
> +--------------------
> +
> +Each of the sub-engines uses a set of tables to redirect interrupts
> +from event sources to CPU threads.
> +
> +::
> +
> +                                            +-------+
> +    User or O/S                             |  EQ   |
> +        or                          +------>|entries|
> +    Hypervisor                      |       |  ..   |
> +      Memory                        |       +-------+
> +                                    |           ^
> +                                    |           |
> +               +-------------------------------------------------+
> +                                    |           |
> +    Hypervisor      +------+    +---+--+    +---+--+   +------+
> +      Memory        | ESB  |    | EAT  |    | ENDT |   | NVTT |
> +     (skiboot)      +----+-+    +----+-+    +----+-+   +------+
> +                      ^  |        ^  |        ^  |       ^
> +                      |  |        |  |        |  |       |
> +               +-------------------------------------------------+
> +                      |  |        |  |        |  |       |
> +                      |  |        |  |        |  |       |
> +                 +----|--|--------|--|--------|--|-+   +-|-----+    +------+
> +                 |    |  |        |  |        |  | |   | | tctx|    |Thread|
> +     IPI or   ---+    +  v        +  v        +  v |---| +  .. |----->     |
> +    HW events    |                                 |   |       |    |      |
> +                 |             IVRE                |   | IVPE  |    +------+
> +                 +---------------------------------+   +-------+
> +
> +
> +The IVSE have a 2-bits state machine, P for pending and Q for queued,
> +for each source that allows events to be triggered. They are stored in
> +an Event State Buffer (ESB) array and can be controlled by MMIOs.
> +
> +If the event is let through, the IVRE looks up in the Event Assignment
> +Structure (EAS) table for an Event Notification Descriptor (END)
> +configured for the source. Each Event Notification Descriptor defines
> +a notification path to a CPU and an in-memory Event Queue, in which
> +will be enqueued an EQ data for the O/S to pull.
> +
> +The IVPE determines if a Notification Virtual Target (NVT) can handle
> +the event by scanning the thread contexts of the VCPUs dispatched on
> +the processor HW threads. It maintains the interrupt context state of
> +each thread in a NVT table.
> +
> +XIVE thread interrupt context
> +-----------------------------
> +
> +The XIVE presenter can generate four different exceptions to its
> +HW threads:
> +
> +- hypervisor exception
> +- O/S exception
> +- Event-Based Branch (user level)
> +- msgsnd (doorbell)
> +
> +Each exception has a state independent from the others called a Thread
> +Interrupt Management context. This context is a set of registers which
> +lets the thread handle priority management and interrupt
> +acknowledgment among other things. The most important ones being :
> +
> +- Interrupt Priority Register  (PIPR)
> +- Interrupt Pending Buffer     (IPB)
> +- Current Processor Priority   (CPPR)
> +- Notification Source Register (NSR)
> +
> +TIMA
> +~~~~
> +
> +The Thread Interrupt Management registers are accessible through a
> +specific MMIO region, called the Thread Interrupt Management Area
> +(TIMA), four aligned pages, each exposing a different view of the
> +registers. First page (page address ending in ``0b00``) gives access
> +to the entire context and is reserved for the ring 0 view for the
> +physical thread context. The second (page address ending in ``0b01``)
> +is for the hypervisor, ring 1 view. The third (page address ending in
> +``0b10``) is for the operating system, ring 2 view. The fourth (page
> +address ending in ``0b11``) is for user level, ring 3 view.
> +
> +Interrupt flow from an O/S perspective
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +After an event data has been enqueued in the O/S Event Queue, the IVPE
> +raises the bit corresponding to the priority of the pending interrupt
> +in the register IBP (Interrupt Pending Buffer) to indicate that an
> +event is pending in one of the 8 priority queues. The Pending
> +Interrupt Priority Register (PIPR) is also updated using the IPB. This
> +register represent the priority of the most favored pending
> +notification.
> +
> +The PIPR is then compared to the the Current Processor Priority
> +Register (CPPR). If it is more favored (numerically less than), the
> +CPU interrupt line is raised and the EO bit of the Notification Source
> +Register (NSR) is updated to notify the presence of an exception for
> +the O/S. The O/S acknowledges the interrupt with a special load in the
> +Thread Interrupt Management Area.
> +
> +The O/S handles the interrupt and when done, performs an EOI using a
> +MMIO operation on the ESB management page of the associate source.
> +
> +
> +Overview of the QEMU models for XIVE
> +====================================
> +
> +The XiveSource models the IVSE in general, internal and external. It
> +handles the source ESBs and the MMIO interface to control them.
> +
> +The XiveNotifier is a small helper interface interconnecting the
> +XiveSource to the XiveRouter.
> +
> +The XiveRouter is an abstract model acting as a combined IVRE and
> +IVPE. It routes event notifications using the EAS and END tables to
> +the IVPE sub-engine which does a CAM scan to find a CPU to deliver the
> +exception. Storage should be provided by the inheriting classes.
> +
> +XiveENDSource is a special source object. It exposes the END ESB MMIOs
> +of the Event Queues which are used for coalescing event notifications
> +and for escalation. It is not used in the field, only to sync the EQ
> +cache in OPAL.
> +
> +Finally, the XiveTCTX contains the interrupt state context of a thread,
> +four sets of registers, one for each exception that can be delivered
> +to a CPU. These contexts are scanned by the IVPE to find a matching VP
> +when a notification is triggered. It also models the Thread Interrupt
> +Management Area (TIMA), which exposes the thread context registers to
> +the CPU for interrupt management.
> +
> +
> +XIVE for sPAPR (pseries machines)
> +=================================
> +
> +SpaprXive models the XIVE interrupt controller of a ``pseries``
> +machine. It inherits from the XiveRouter and provisions storage for
> +the EAS and END tables. The NVT table does not need a backend in
> +sPAPR. SpaprXive owns a XiveSource object for the IPIs and the
> +virtual device interrupts, a memory region for the TIMA and a
> +XiveENDSource object to manage the END ESBs (not used by Linux).
> +
> +These choices were made to have a sPAPR interrupt controller consistent
> +with the one found on baremetal and to facilitate KVM support, the
> +main difficulty being the host memory regions exposed to the guest.
> +
> +CAS Negotiation
> +---------------
> +
> +The interrupt mode advertised by the ``pseries`` machine in the CAS
> +negotiation process depends on the CPU type (XIVE requires POWER9) but
> +also on the machine property ``ic-mode`` which can take the following
> +values: ``xics``, ``xive`` and ``dual``. ``xics`` is currently the
> +default mode but it should change in the future.
> +
> +The chosen interrupt mode is activated after a reconfiguration done
> +during a machine reset.
> +
Can this be included?
The guest uses the device-tree entry ``ibm,arch-vec-5-platform-support`` to decide between XIVE and XICS when ``ic-mode`` is set to ``dual``.
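As a rough illustration of how ``ic-mode`` and the CPU type combine, here is a small Python sketch. The function name is hypothetical, and the fallback of ``dual`` to XICS on a pre-POWER9 CPU is my assumption; the patch only states that XIVE requires POWER9.

```python
def advertised_modes(ic_mode, cpu_is_power9):
    """Return the interrupt modes a machine could advertise during CAS."""
    if ic_mode not in ("xics", "xive", "dual"):
        raise ValueError("unknown ic-mode: %r" % (ic_mode,))
    if ic_mode == "xics":
        return ["xics"]
    if not cpu_is_power9:
        if ic_mode == "xive":
            raise ValueError("XIVE requires a POWER9 CPU")
        # Assumption: 'dual' falls back to XICS on older CPUs.
        return ["xics"]
    if ic_mode == "xive":
        return ["xive"]
    return ["xics", "xive"]  # 'dual': the guest picks one during CAS
```

In the ``dual`` case on POWER9 both modes are offered and the guest makes the final choice.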

> +KVM support
> +-----------
> +
> +Two host memory regions are exposed to the guest and require special
> +attention at initialization :
> +
> +- ESB MMIOs
> +- Thread Interrupt Management Area (TIMA)
> +
> +When using the KVM device, these are `ram device` memory mappings,
> +similarly to VFIO, exposed to the guest and the associated VMAs on the
> +host are populated dynamically with the appropriate pages using a
> +fault handler.
> +
> +The model uses KVM accessors to synchronize the QEMU state with KVM :
> +
> +- the source configuration (EAT)
> +- the END configuration (ENDT)
> +- the O/S EQ state (toggle bit and index)
> +- the thread interrupt context registers.
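For readers wondering what the O/S EQ state (toggle bit and index) refers to, here is a minimal Python sketch of an event queue ring. The class name, the 32-bit entry layout (toggle bit in the top bit, event data in the low 31 bits) and the initial toggle value of 1 are assumptions of mine; they are chosen to be consistent with the ``^1 [ 80000010 ... ]`` sample further down, not taken from the patch.

```python
class EventQueue:
    """Ring of 32-bit entries with a toggle (generation) bit."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.entries = [0] * size
        self.index = 0
        self.toggle = 1  # assumption: the generation starts at 1

    def push(self, data):
        # The toggle bit lets the consumer tell new entries from stale ones.
        self.entries[self.index] = (self.toggle << 31) | (data & 0x7FFFFFFF)
        self.index = (self.index + 1) & (len(self.entries) - 1)
        if self.index == 0:
            self.toggle ^= 1  # flip the toggle each time the ring wraps
```

The pair (toggle, index) is then all that is needed to resume producing or consuming entries, which would explain why it is the state being synchronized.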
> +
> +A hybrid guest using KVM and an emulated irqchip (``kernel_irqchip=off``)

Would some more explanation of ``kernel_irqchip=off`` vs ``kernel_irqchip=on`` (default for full XIVE support) help?
kernel_irqchip=on  - in-kernel accelerated XIVE - better performance
kernel_irqchip=off - fully emulated XIVE in QEMU - lower performance


> +is supported.
> +
> +Monitoring XIVE
> +---------------
> +
> +The state of the XIVE interrupt controller can be queried through the
> +monitor command ``info pic``. The output comes in two parts.
> +
> +First, the state of the thread interrupt context registers is dumped
> +for each CPU :
> +
> +::
> +
> +   (qemu) info pic
> +   CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
> +   CPU[0000]: USER    00   00  00    00   00  00  00   00  00000000
> +   CPU[0000]:   OS    00   ff  00    00   ff  00  ff   ff  80000400
> +   CPU[0000]: POOL    00   00  00    00   00  00  00   00  00000000
> +   CPU[0000]: PHYS    00   00  00    00   00  00  00   ff  00000000
> +   ...
> +
> +In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only
> +the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM
> +line which is set to the VP identifier.
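If it helps, the first part of the output can be post-processed with a short script. The sketch below is only an illustration: the function name is hypothetical and the parsing assumes the exact column layout of the sample above, with every value printed in hexadecimal.

```python
import re

FIELDS = ["NSR", "CPPR", "IPB", "LSMFB", "ACK#", "INC", "AGE", "PIPR", "W2"]
TCTX_LINE = re.compile(r"CPU\[([0-9a-fA-F]+)\]:\s+(USER|OS|POOL|PHYS)\s+(.*)")


def parse_tctx(text):
    """Map (cpu, ring) to a dict of register values from 'info pic' output."""
    rings = {}
    for line in text.splitlines():
        match = TCTX_LINE.search(line)
        if match is None:
            continue  # header line, '...' or routing information
        values = [int(value, 16) for value in match.group(3).split()]
        if len(values) != len(FIELDS):
            continue  # unexpected layout: skip rather than guess
        rings[(int(match.group(1), 16), match.group(2))] = dict(zip(FIELDS, values))
    return rings
```

For the sample above, the ``OS`` ring of CPU 0 would yield a ``W2`` value of ``0x80000400``.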
> +
> +Then comes the routing information which aggregates the EAS and the
> +END configuration:
> +
> +::
> +
> +   ...
> +   LISN         PQ    EISN     CPU/PRIO EQ
> +   00000000 MSI --    00000010   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
> +   00000001 MSI --    00000010   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
> +   00000002 MSI --    00000010   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
> +   00000003 MSI --    00000010   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
> +   00000004 MSI -Q  M 00000000
> +   00000005 MSI -Q  M 00000000
> +   00000006 MSI -Q  M 00000000
> +   00000007 MSI -Q  M 00000000
> +   00001000 MSI --    00000012   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
> +   00001001 MSI --    00000013   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
> +   00001100 MSI --    00000100   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
> +   00001101 MSI -Q  M 00000000
> +   00001200 LSI -Q  M 00000000
> +   00001201 LSI -Q  M 00000000
> +   00001202 LSI -Q  M 00000000
> +   00001203 LSI -Q  M 00000000
> +   00001300 MSI --    00000102   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
> +   00001301 MSI --    00000103   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
> +   00001302 MSI --    00000104   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
> +
> +The source information and configuration:
> +
> +- The ``LISN`` column outputs the interrupt number of the source in

An explanation of how the different ranges of ``LISN`` correspond to
different types of interrupt sources, if applicable, would help.

> +  range ``[ 0x0 ... 0x1FFF ]`` and its type : ``MSI`` or ``LSI``

A short explanation of the `MSI` and `LSI` interrupt types, with
example sources for each, would help.

> +- The ``PQ`` column reflects the state of the PQ bits of the source :
> +
> +  - ``--`` source is ready to take events
> +  - ``P-`` an event was sent and an EOI is PENDING
> +  - ``PQ`` an event was QUEUED
> +  - ``-Q`` source is OFF
> +
> +  An ``M`` indicates that the source is *MASKED* at the EAS level.
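The four PQ states listed above can be read as a small state machine. The Python sketch below shows one plausible set of transitions for a trigger and for an EOI; the function names are hypothetical, and the transitions are my understanding of the usual PQ semantics rather than something stated in this patch.

```python
RESET, OFF, PENDING, QUEUED = "--", "-Q", "P-", "PQ"


def trigger(pq):
    """Return (new_state, forward) when the source fires an event."""
    if pq == RESET:
        return PENDING, True  # first event: forward a notification
    if pq in (PENDING, QUEUED):
        return QUEUED, False  # coalesce while an EOI is outstanding
    return OFF, False  # source is off: the event is dropped


def eoi(pq):
    """Return (new_state, forward) when the O/S performs an EOI."""
    if pq == QUEUED:
        return PENDING, True  # replay the event that was queued
    if pq == OFF:
        return OFF, False
    return RESET, False  # back to ready
```

Read this way, ``PQ`` means that at least one more event arrived before the EOI and will be replayed once the EOI is done.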
> +
> +The targeting configuration :
> +
> +- The ``EISN`` column is the event data that will be queued in the event
> +  queue of the O/S.
> +- The ``CPU/PRIO`` column is the tuple defining the CPU number and
> +  priority queue serving the source.
> +- The ``EQ`` column outputs :
> +
> +  - the current index of the event queue / the maximum number of entries
> +  - the O/S event queue address
> +  - the toggle bit
> +  - the last entries that were pushed in the event queue.
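The column description above maps fairly directly onto a parser. Here is a hedged Python sketch for one routing line; the function and key names are hypothetical, and it assumes the layout of the sample output, with ``CPU/PRIO`` and the EQ index and size printed in decimal and the other numbers in hexadecimal.

```python
import re

ROUTE_LINE = re.compile(
    r"^\s*(?P<lisn>[0-9a-fA-F]{8})\s+(?P<type>MSI|LSI)\s+(?P<pq>[P-][Q-])\s+"
    r"(?P<masked>M\s+)?(?P<eisn>[0-9a-fA-F]{8})"
    r"(?:\s+(?P<cpu>\d+)/(?P<prio>\d+)\s+(?P<index>\d+)/(?P<size>\d+)"
    r"\s+@(?P<addr>[0-9a-fA-F]+)\s+\^(?P<toggle>[01]))?"
)


def parse_route(line):
    """Parse one routing line of 'info pic'; return None if it does not match."""
    match = ROUTE_LINE.match(line)
    if match is None:
        return None
    entry = {
        "lisn": int(match["lisn"], 16),
        "type": match["type"],
        "pq": match["pq"],
        "masked": match["masked"] is not None,
        "eisn": int(match["eisn"], 16),
    }
    if match["cpu"] is not None:  # targeting columns are absent when masked
        entry.update(
            cpu=int(match["cpu"]),
            priority=int(match["prio"]),
            eq_index=int(match["index"]),
            eq_size=int(match["size"]),
            eq_address=int(match["addr"], 16),
            toggle=int(match["toggle"]),
        )
    return entry
```

The trailing list of queue entries (``[ 80000010 ... ]``) is deliberately ignored here since the sample output elides it.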
> +
> +
> +
> +XIVE for PowerNV
> +================
> +
> +The PnvXIVE model uses the XiveRouter abstract model just like
> +SpaprXive. It provides accessors to the EAS, END and NVT tables, which
> +are stored in the memory of the QEMU PowerNV machine and no longer in
> +QEMU itself. It owns a set of memory regions for the IC registers, the
> +ESBs, the END ESBs, the TIMA and the notification MMIO.
> +
> +Multichip is supported and the available IVSEs are the internal one
> +for the IPIs, the PSI host bridge controller and PHB4.
> +
> +The next interesting step would be to add escalation events and model
> +the VCPU dispatching to support emulated KVM guests.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 66ddbda9c958..a896c7407294 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,7 @@ L: qemu-ppc@nongnu.org
>  S: Supported
>  F: hw/*/*xive*
>  F: include/hw/*/*xive*
> +F: docs/ppc/xive.rst
>  
>  Subsystems
>  ----------
> -- 
> 2.20.1
> 
>

Regards,
-Satheesh 


Re: [Qemu-devel] [Qemu-ppc] [PATCH] docs: provide documentation on the POWER9 XIVE interrupt controller

Posted by Cédric Le Goater 10 weeks ago
On 5/14/19 9:33 AM, Satheesh Rajendran wrote:
> On Tue, May 14, 2019 at 08:46:27AM +0200, Cédric Le Goater wrote:
> Overall doc, looks great, have few minor suggestions below.
>> +CAS Negotiation
>> +---------------
>> +
>> +The interrupt mode advertised by the ``pseries`` machine in the CAS
>> +negotiation process depends on the CPU type (XIVE requires POWER9) but
>> +also on the machine property ``ic-mode`` which can take the following
>> +values: ``xics``, ``xive`` and ``dual``. ``xics`` is currently the
>> +default mode but it should change in the future.
>> +
>> +The chosen interrupt mode is activated after a reconfiguration done
>> +during a machine reset.
>> +
> Can this be included?
> The guest uses the device-tree entry ``ibm,arch-vec-5-platform-support`` to decide between XIVE and XICS when ``ic-mode`` is set to ``dual``.

Yes. We can add some more information on the CAS negotiation process
as described by the PAPR specs.

>> +KVM support
>> +-----------
>> +
>> +Two host memory regions are exposed to the guest and require special
>> +attention at initialization :
>> +
>> +- ESB MMIOs
>> +- Thread Interrupt Management Area (TIMA)
>> +
>> +When using the KVM device, these are `ram device` memory mappings,
>> +similarly to VFIO, exposed to the guest and the associated VMAs on the
>> +host are populated dynamically with the appropriate pages using a
>> +fault handler.
>> +
>> +The model uses KVM accessors to synchronize the QEMU state with KVM :
>> +
>> +- the source configuration (EAT)
>> +- the END configuration (ENDT)
>> +- the O/S EQ state (toggle bit and index)
>> +- the thread interrupt context registers.
>> +
>> +A hybrid guest using KVM and an emulated irqchip (``kernel_irqchip=off``)
> 
> Would some more explanation of ``kernel_irqchip=off`` vs ``kernel_irqchip=on`` (default for full XIVE support) help?
> kernel_irqchip=on  - in-kernel accelerated XIVE - better performance
> kernel_irqchip=off - fully emulated XIVE in QEMU - lower performance

This is not XIVE-specific; it is QEMU-level information that is
documented in the man page.
 
>> +is supported.
>> +
>> +The source information and configuration:
>> +
>> +- The ``LISN`` column outputs the interrupt number of the source in
> 
> An explanation of how the different ranges of ``LISN`` correspond to
> different types of interrupt sources, if applicable, would help.

This is not specific to XIVE but, yes, we can add an extra documentation
file describing the sPAPR IRQ ranges.

>> +  range ``[ 0x0 ... 0x1FFF ]`` and its type : ``MSI`` or ``LSI``
> 
> A short explanation of the `MSI` and `LSI` interrupt types, with
> example sources for each, would help.

It does not belong in this file.

Thanks,

C.
 