[PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Andrew Cooper 6 months, 3 weeks ago
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: George Dunlap <George.Dunlap@eu.citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Juergen Gross <jgross@suse.com>
CC: Henry Wang <Henry.Wang@arm.com>

Rendered form:
  https://andrewcoop-xen.readthedocs.io/en/docs-devel/hypervisor-guide/domid-lifecycle.html

I'm not sure why it's using the alabaster theme and not the RTD theme, but I
don't have time to debug that further at this point.

This was written mostly while sat waiting for flights in Nanjing and Beijing.

If while reading this you spot a hole, congratulations.  There are holes which
need fixing...
---
 docs/glossary.rst                         |   9 ++
 docs/hypervisor-guide/domid-lifecycle.rst | 164 ++++++++++++++++++++++
 docs/hypervisor-guide/index.rst           |   1 +
 3 files changed, 174 insertions(+)
 create mode 100644 docs/hypervisor-guide/domid-lifecycle.rst

diff --git a/docs/glossary.rst b/docs/glossary.rst
index 8ddbdab160a1..1fd1de0f0e97 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -50,3 +50,12 @@ Glossary
 
      By default it gets all devices, including all disks and network cards, so
      is responsible for multiplexing guest I/O.
+
+   system domain
+     Abstractions within Xen that are modelled in a similar way to regular
+     :term:`domains<domain>`.  E.g. When there's no work to do, Xen schedules
+     ``DOMID_IDLE`` to put the CPU into a lower power state.
+
+     System domains have :term:`domids<domid>` and are referenced by
+     privileged software for certain control operations, but they do not run
+     guest code.
diff --git a/docs/hypervisor-guide/domid-lifecycle.rst b/docs/hypervisor-guide/domid-lifecycle.rst
new file mode 100644
index 000000000000..d405a321f3c7
--- /dev/null
+++ b/docs/hypervisor-guide/domid-lifecycle.rst
@@ -0,0 +1,164 @@
+.. SPDX-License-Identifier: CC-BY-4.0
+
+Lifecycle of a domid
+====================
+
+Overview
+--------
+
+A :term:`domid` is Xen's numeric identifier for a :term:`domain`.  In any
+operational Xen system, there are one or more domains running.
+
+Domids are 16-bit integers.  Regular domids start from 0, but there are some
+special identifiers, e.g. ``DOMID_SELF``, and :term:`system domains<system
+domain>`, e.g. ``DOMID_IDLE`` starting from 0x7ff0.  Therefore, a Xen system
+can run a maximum of 32k domains concurrently.
+
+.. note::
+
+   Despite being exposed in the domid ABI, the system domains are internal to
+   Xen and do not have lifecycles like regular domains.  Therefore, they are
+   not discussed further in this document.
+
+At system boot, Xen will construct one or more domains.  Kernels and
+configuration for these domains must be provided by the bootloader, or at
+Xen's compile time for more highly integrated solutions.
+
+Correct functioning of the domain lifecycle involves ``xenstored``, and some
+privileged entity which has bound the ``VIRQ_DOM_EXC`` global event channel.
+
+.. note::
+
+   While not a strict requirement for these to be the same entity, it is
+   ``xenstored`` which typically has ``VIRQ_DOM_EXC`` bound.  This document is
+   written assuming the common case.
+
+Creation
+--------
+
+Within Xen, the ``domain_create()`` function is used to allocate and perform
+bare minimum construction of a domain.  The :term:`control domain` accesses
+this functionality via the ``DOMCTL_createdomain`` hypercall.
+
+The final action that ``domain_create()`` performs before returning
+successfully is to enter the new domain into the domlist.  This makes the
+domain "visible" within Xen, allowing the new domid to be successfully
+referenced by other hypercalls.
+
+At this point, the domain exists as far as Xen is concerned, but not usefully
+as a VM yet.  The toolstack performs further construction activities;
+allocating vCPUs, RAM, copying in the initial executable code, etc.  A domain
+is automatically created with one "pause" reference count held, meaning that
+it is not eligible for scheduling.
+
+When the toolstack has finished VM construction, it sends an ``XS_INTRODUCE``
+command to ``xenstored``.  This instructs ``xenstored`` to connect to the
+guest's xenstore ring, and fire the ``@introduceDomain`` watch.  The firing of
+this watch is the signal to all other components which care that a new VM has
+appeared and is about to start running.
+
+When the ``XS_INTRODUCE`` command returns successfully, the final action the
+toolstack performs is to unpause the guest, using the ``DOMCTL_unpausedomain``
+hypercall.  This drops the "pause" reference the domain was originally created
+with, meaning that the vCPU(s) are eligible for scheduling and the domain will
+start executing its first instruction.
+
+.. note::
+
+   It is common for vCPUs other than 0 to be left in an offline state, to be
+   started by actions within the VM.
+
+Termination
+-----------
+
+The VM runs for a period of time, but eventually stops.  It can stop for a
+number of reasons, including:
+
+ * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
+   hypercall.  The hypercall also includes the reason for the shutdown,
+   e.g. ``poweroff``, ``reboot`` or ``crash``.
+
+ * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
+   interrupts disabled is interpreted as a shutdown request as it is a common
+   code pattern for fatal error handling when no better options are available.
+
+ * Indirectly from fatal exceptions.  In some states, execution is unable to
+   continue, e.g. Triple Fault on x86.
+
+ * Directly from the device model, via the ``DMOP_remote_shutdown`` hypercall.
+   E.g. On x86, the 0xcf9 IO port is commonly used to perform platform
+   poweroff, reset or sleep transitions.
+
+ * Directly from the toolstack.  The toolstack is capable of initiating
+   cleanup directly, e.g. ``xl destroy``.  This is typically an administration
+   action of last resort to clean up a domain which malfunctioned but did
+   not terminate properly.
+
+ * Directly from Xen.  Some error handling ends up using ``domain_crash()``
+   when Xen doesn't think it can safely continue running the VM.
+
+Whatever the reason for termination, Xen ends up calling ``domain_shutdown()``
+to set the shutdown reason and deschedule all vCPUs.  Xen also fires the
+``VIRQ_DOM_EXC`` event channel, which is a signal to ``xenstored``.
+
+Upon receiving ``VIRQ_DOM_EXC``, ``xenstored`` re-scans all domains using the
+``SYSCTL_getdomaininfolist`` hypercall.  If any domain has changed state from
+running to shut down, ``xenstored`` fires the ``@releaseDomain`` watch.  The
+firing of this watch is the signal to all other components which care that a
+VM has stopped.
+
+.. note::
+
+   Xen does not treat reboot differently to poweroff; both statuses are
+   forwarded to the toolstack.  It is up to the toolstack to restart the VM,
+   which is typically done by constructing a new domain.
+
+.. note::
+
+   Some shutdowns may not result in the cleanup of a domain.  ``suspend`` for
+   example can be used for snapshotting, and the VM resumes execution in the
+   same domain/domid.  Therefore, a domain can cycle several times between
+   running and "shut down" before moving into the destruction phase.
+
+Destruction
+-----------
+
+The domain object in Xen is reference counted, and survives until all
+references are dropped.
+
+The ``@releaseDomain`` watch is to inform all entities that hold a reference
+on the domain to clean up.  This may include:
+
+ * Paravirtual driver backends having a grant map of the shared ring with the
+   frontend.
+ * A device model with a map of the IOREQ page(s).
+
+The toolstack also has work to do in response to ``@releaseDomain``.  It must
+issue the ``DOMCTL_destroydomain`` hypercall.  This hypercall can take minutes
+of wall-clock time to complete for large domains as, amongst other things, it
+is freeing the domain's RAM back to the system.
+
+The actions triggered by the ``@releaseDomain`` watch are asynchronous.  There
+is no guarantee as to the order in which actions start, or which action is the
+final one to complete.  However, the toolstack can achieve some ordering by
+delaying the ``DOMCTL_destroydomain`` hypercall if necessary.
+
+Freeing
+-------
+
+When the final reference on the domain object is dropped, Xen will remove the
+domain from the domlist.  This means the domid is no longer visible in Xen,
+and no longer able to be referenced by other hypercalls.
+
+Xen then schedules the object for deletion at some point after any concurrent
+hypercalls referencing the domain have completed.
+
+When the object is finally cleaned up, Xen fires the ``VIRQ_DOM_EXC`` event
+channel again, causing ``xenstored`` to rescan and notice that the domain has
+ceased to exist.  It fires the ``@releaseDomain`` watch a second time to
+signal to any components which care that the domain has gone away.
+
+E.g. The second ``@releaseDomain`` is commonly used by paravirtual driver
+backends to shut themselves down.
+
+At this point, the toolstack can reuse the domid for a new domain.
diff --git a/docs/hypervisor-guide/index.rst b/docs/hypervisor-guide/index.rst
index e4393b06975b..af88bcef8313 100644
--- a/docs/hypervisor-guide/index.rst
+++ b/docs/hypervisor-guide/index.rst
@@ -6,6 +6,7 @@ Hypervisor documentation
 .. toctree::
    :maxdepth: 2
 
+   domid-lifecycle
    code-coverage
 
    x86/index

base-commit: dc9d9aa62ddeb14abd5672690d30789829f58f7e
prerequisite-patch-id: 832bdc9a23500d426b4fe11237ae7f6614f2369c
-- 
2.30.2


Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Alejandro Vallejo 6 months, 3 weeks ago
The page is pretty nice overall and I quite liked it. Just a few extra
questions below, as I'm not familiar with this side of things.

On Mon, Oct 16, 2023 at 05:24:50PM +0100, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: George Dunlap <George.Dunlap@eu.citrix.com>
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Wei Liu <wl@xen.org>
> CC: Julien Grall <julien@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Juergen Gross <jgross@suse.com>
> CC: Henry Wang <Henry.Wang@arm.com>
> 
> Rendered form:
>   https://andrewcoop-xen.readthedocs.io/en/docs-devel/hypervisor-guide/domid-lifecycle.html
> 
> I'm not sure why it's using the alibaster theme and not RTD theme, but I
> don't have time to debug that further at this point.
> 
> This was written mostly while sat waiting for flights in Nanjing and Beijing.
> 
> If while reading this you spot a hole, congratulations.  There are holes which
> need fixing...
> ---
>  docs/glossary.rst                         |   9 ++
>  docs/hypervisor-guide/domid-lifecycle.rst | 164 ++++++++++++++++++++++
>  docs/hypervisor-guide/index.rst           |   1 +
>  3 files changed, 174 insertions(+)
>  create mode 100644 docs/hypervisor-guide/domid-lifecycle.rst
> 
> diff --git a/docs/glossary.rst b/docs/glossary.rst
> index 8ddbdab160a1..1fd1de0f0e97 100644
> --- a/docs/glossary.rst
> +++ b/docs/glossary.rst
> @@ -50,3 +50,12 @@ Glossary
>  
>       By default it gets all devices, including all disks and network cards, so
>       is responsible for multiplexing guest I/O.
> +
> +   system domain
> +     Abstractions within Xen that are modelled in a similar way to regular
> +     :term:`domains<domain>`.  E.g. When there's no work to do, Xen schedules
> +     ``DOMID_IDLE`` to put the CPU into a lower power state.
> +
> +     System domains have :term:`domids<domid>` and are referenced by
> +     privileged software for certain control operations, but they do not run
> +     guest code.
> diff --git a/docs/hypervisor-guide/domid-lifecycle.rst b/docs/hypervisor-guide/domid-lifecycle.rst
> new file mode 100644
> index 000000000000..d405a321f3c7
> --- /dev/null
> +++ b/docs/hypervisor-guide/domid-lifecycle.rst
> @@ -0,0 +1,164 @@
> +.. SPDX-License-Identifier: CC-BY-4.0
> +
> +Lifecycle of a domid
> +====================
> +
> +Overview
> +--------
> +
> +A :term:`domid` is Xen's numeric identifier for a :term:`domain`.  In any
> +operational Xen system, there are one or more domains running.
> +
> +Domids are 16-bit integers.  Regular domids start from 0, but there are some
> +special identifiers, e.g. ``DOMID_SELF``, and :term:`system domains<system
> +domain>`, e.g. ``DOMID_IDLE`` starting from 0x7ff0.  Therefore, a Xen system
> +can run a maximum of 32k domains concurrently.
> +
> +.. note::
> +
> +   Despite being exposed in the domid ABI, the system domains are internal to
> +   Xen and do not have lifecycles like regular domains.  Therefore, they are
> +   not discussed further in this document.
> +
> +At system boot, Xen will construct one or more domains.  Kernels and
> +configuration for these domains must be provided by the bootloader, or at
> +Xen's compile time for more highly integrated solutions.
> +
> +Correct functioning of the domain lifecycle involves ``xenstored``, and some
> +privileged entity which has bound the ``VIRQ_DOM_EXC`` global event channel.
> +
> +.. note::
> +
> +   While not a strict requirement for these to be the same entity, it is
> +   ``xenstored`` which typically has ``VIRQ_DOM_EXC`` bound.  This document is
> +   written assuming the common case.
> +
> +Creation
> +--------
> +
> +Within Xen, the ``domain_create()`` function is used to allocate and perform
> +bare minimum construction of a domain.  The :term:`control domain` accesses
> +this functionality via the ``DOMCTL_createdomain`` hypercall.
> +
> +The final action that ``domain_create()`` performs before returning
> +successfully is to enter the new domain into the domlist.  This makes the
> +domain "visible" within Xen, allowing the new domid to be successfully
> +referenced by other hypercalls.
> +
> +At this point, the domain exists as far as Xen is concerned, but not usefully
> +as a VM yet.  The toolstack performs further construction activities;
> +allocating vCPUs, RAM, copying in the initial executable code, etc.  Domains
> +are automatically created with one "pause" reference count held, meaning that
> +it is not eligible for scheduling.
> +
> +When the toolstack has finished VM construction, it send an ``XS_INTRODUCE``
> +command to ``xenstored``.  This instructs ``xenstored`` to connect to the
> +guest's xenstore ring, and fire the ``@introduceDomain`` watch.  The firing of
> +this watch is the signal to all other components which care that a new VM has
> +appeared and is about to start running.
Presumably the xenstore ring is memory shared between xenstore and the
newly created domain. Who establishes that connection? For the case where
xenstore lives in dom0 things are _simpler_ because it lives in the same VM
as the toolstack, but I suspect things are hairier when xenstore is in its
own stubdomain. A description of the grant dance (if any) would be helpful.

In that same line, having mermaid sequence diagrams would make these
descriptions easier to follow:

  https://sphinxcontrib-mermaid-demo.readthedocs.io/en/latest/

> +
> +When the ``XS_INTRODUCE`` command returns successfully, the final action the
Not knowing the internals I find the wording weird, though that might be my
own misunderstanding. I imagine you mean "... when xenstore replies with
the successful completion of the ``XS_INTRODUCE`` command...". Considering
the "xenstore ring" mentioned before, I assume all xenstore comms are
async.

> +toolstack performs is to unpause the guest, using the ``DOMCTL_unpausedomain``
> +hypercall.  This drops the "pause" reference the domain was originally created
> +with, meaning that the vCPU(s) are eligible for scheduling and the domain will
> +start executing its first instruction.
> +
> +.. note::
> +
> +   It is common for vCPUs other than 0 to be left in an offline state, to be
> +   started by actions within the VM.
Peculiar choice of words. I guess you don't want to pinch your fingers
precluding other toolstack implementations doing something different. One
possible way to express it is: "Conventionally, vCPUs other than 0 are left
in an offline state to be started by actions within the VM.
This is non-normative, however, and custom Xen-based systems may
choose to do otherwise."

As is, it's unclear whether the unconventional behaviour is assumed to be a
real possibility, a known existing bug, or uncertainty about the past,
present and future.

> +
> +Termination
> +-----------
> +
> +The VM runs for a period of time, but eventually stops.  It can stop for a
> +number of reasons, including:
> +
> + * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
nit: I would 's/guest kernel/guest', but that's just me. Internally the
kernel may very well be a passive shim where the active intelligence is in
some disaggregated network of userspace components, making the kernel just
an accidental proxy.

> +   hypercall.  The hypercall also includes the reason for the shutdown,
> +   e.g. ``poweroff``, ``reboot`` or ``crash``.
> +
> + * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
> +   interrupts disabled is interpreted as a shutdown request as it is a common
> +   code pattern for fatal error handling when no better options are available.
> +
> + * Indirectly from fatal exceptions.  In some states, execution is unable to
> +   continue, e.g. Triple Fault on x86.
> +
> + * Directly from the device model, via the ``DMOP_remote_shutdown`` hypercall.
> +   E.g. On x86, the 0xcf9 IO port is commonly used to perform platform
> +   poweroff, reset or sleep transitions.
> +
> + * Directly from the toolstack.  The toolstack is capable of initiating
> +   cleanup directly, e.g. ``xl destroy``.  This is typically an administration
> +   action of last resort to clean up a domain which malfunctioned but not
> +   terminated properly.
This one is at a different abstraction layer than the others. The hypercall(s)
being used would be more helpful, along with a line saying that the
toolstack makes use of this through e.g. ``xl destroy``.

> +
> + * Directly from Xen.  Some error handling ends up using ``domain_crash()``
> +   when Xen doesn't think it can safely continue running the VM.
> +
> +Whatever the reason for termination, Xen ends up calling ``domain_shutdown()``
> +to set the shutdown reason and deschedule all vCPUs.  Xen also fires the
> +``VIRQ_DOM_EXC`` event channel, which is a signal to ``xenstored``.
> +
> +Upon receiving ``VIRQ_DOM_EXC``, ``xenstored`` re-scans all domains using the
> +``SYSCTL_getdomaininfolist`` hypercall.  If any domain has changed state from
> +running to shut down, ``xenstored`` fires the ``@releaseDomain`` watch.  The
> +firing of this watch is the signal to all other components which care that a
> +VM has stopped.
> +
> +.. note::
> +
> +   Xen does not treat reboot differently to poweroff; both statuses are
> +   forwarded to the toolstack.  It is up to the toolstack to restart the VM,
> +   which is typically done by constructing a new domain.
> +
> +.. note::
> +
> +   Some shutdowns may not result in the cleanup of a domain.  ``suspend`` for
> +   example can be used for snapshotting, and the VM resumes execution in the
> +   same domain/domid.  Therefore, a domain can cycle several times between
> +   running and "shut down" before moving into the destruction phase.
> +
> +Destruction
> +-----------
> +
> +The domain object in Xen is reference counted, and survives until all
> +references are dropped.
What a "reference" means might help. I'd like to think it means any
pointer to a domain, and any domid in hypervisor memory, but...

> +
> +The ``@releaseDomain`` watch is to inform all entities that hold a reference
> +on the domain to clean up.  This may include:
... this statement leads me to believe only references held by trusted
parties are collected, and by their choice (not by force). What about pages
granted to other domains that may not wish (or be able to) comply?

> +
> + * Paravirtual driver backends having a grant map of the shared ring with the
> +   frontend.
On a related tangent, what happens if your driver domain is compromised?
Does it suddenly hold all your domains (and their RAM!) hostage because it
won't act upon ``@releaseDomain``?

> + * A device model with a map of the IOREQ page(s).
> +
> +The toolstack also has work to do in response to ``@releaseDomain``.  It must
> +issue the ``DOMCTL_destroydomain`` hypercall.  This hypercall can take minutes
> +of wall-clock time to complete for large domains as, amongst other things, it
> +is freeing the domain's RAM back to the system.
> +
> +The actions triggered by the ``@releaseDomain`` watch are asynchronous.  There
> +is no guarantee as to the order in which actions start, or which action is the
> +final one to complete.  However, the toolstack can achieve some ordering by
> +delaying the ``DOMCTL_destroydomain`` hypercall if necessary.
> +
> +Freeing
> +-------
> +
> +When the final reference on the domain object is dropped, Xen will remove the
nit: 's/will remove/removes'
> +domain from the domlist.  This means the domid is no longer visible in Xen,
> +and no longer able to be referenced by other hypercalls.
> +
> +Xen then schedules the object for deletion at some point after any concurrent
> +hypercalls referencing the domain have completed.
> +
> +When the object is finally cleaned up, Xen fires the ``VIRQ_DOM_EXC`` event
> +channel again, causing ``xenstored`` to rescan an notice that the domain has
> +ceased to exist.  It fires the ``@releaseDomain`` watch a second time to
> +signal to any components which care that the domain has gone away.
At which point did the grant tables drop the domid references? Are we relying
on the goodwill of the grant destination?

> +
> +E.g. The second ``@releaseDomain`` is commonly used by paravirtual driver
> +backends to shut themselves down.
> +
> +At this point, the toolstack can reuse the domid for a new domain.
> diff --git a/docs/hypervisor-guide/index.rst b/docs/hypervisor-guide/index.rst
> index e4393b06975b..af88bcef8313 100644
> --- a/docs/hypervisor-guide/index.rst
> +++ b/docs/hypervisor-guide/index.rst
> @@ -6,6 +6,7 @@ Hypervisor documentation
>  .. toctree::
>     :maxdepth: 2
>  
> +   domid-lifecycle
>     code-coverage
>  
>     x86/index
> 
> base-commit: dc9d9aa62ddeb14abd5672690d30789829f58f7e
> prerequisite-patch-id: 832bdc9a23500d426b4fe11237ae7f6614f2369c
> -- 
> 2.30.2
> 
> 

Thanks,
Alejandro
Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Andrew Cooper 6 months, 2 weeks ago
On 17/10/2023 4:09 pm, Alejandro Vallejo wrote:
> The page is pretty nice overall and I quite liked it. Just a few extra
> questions below, as I'm not familiar with this side of things,
>
> On Mon, Oct 16, 2023 at 05:24:50PM +0100, Andrew Cooper wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: George Dunlap <George.Dunlap@eu.citrix.com>
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Stefano Stabellini <sstabellini@kernel.org>
>> CC: Wei Liu <wl@xen.org>
>> CC: Julien Grall <julien@xen.org>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Juergen Gross <jgross@suse.com>
>> CC: Henry Wang <Henry.Wang@arm.com>
>>
>> Rendered form:
>>   https://andrewcoop-xen.readthedocs.io/en/docs-devel/hypervisor-guide/domid-lifecycle.html
>>
>> I'm not sure why it's using the alibaster theme and not RTD theme, but I
>> don't have time to debug that further at this point.
>>
>> This was written mostly while sat waiting for flights in Nanjing and Beijing.
>>
>> If while reading this you spot a hole, congratulations.  There are holes which
>> need fixing...
>> ---
>>  docs/glossary.rst                         |   9 ++
>>  docs/hypervisor-guide/domid-lifecycle.rst | 164 ++++++++++++++++++++++
>>  docs/hypervisor-guide/index.rst           |   1 +
>>  3 files changed, 174 insertions(+)
>>  create mode 100644 docs/hypervisor-guide/domid-lifecycle.rst
>>
>> diff --git a/docs/glossary.rst b/docs/glossary.rst
>> index 8ddbdab160a1..1fd1de0f0e97 100644
>> --- a/docs/glossary.rst
>> +++ b/docs/glossary.rst
>> @@ -50,3 +50,12 @@ Glossary
>>  
>>       By default it gets all devices, including all disks and network cards, so
>>       is responsible for multiplexing guest I/O.
>> +
>> +   system domain
>> +     Abstractions within Xen that are modelled in a similar way to regular
>> +     :term:`domains<domain>`.  E.g. When there's no work to do, Xen schedules
>> +     ``DOMID_IDLE`` to put the CPU into a lower power state.
>> +
>> +     System domains have :term:`domids<domid>` and are referenced by
>> +     privileged software for certain control operations, but they do not run
>> +     guest code.
>> diff --git a/docs/hypervisor-guide/domid-lifecycle.rst b/docs/hypervisor-guide/domid-lifecycle.rst
>> new file mode 100644
>> index 000000000000..d405a321f3c7
>> --- /dev/null
>> +++ b/docs/hypervisor-guide/domid-lifecycle.rst
>> @@ -0,0 +1,164 @@
>> +.. SPDX-License-Identifier: CC-BY-4.0
>> +
>> +Lifecycle of a domid
>> +====================
>> +
>> +Overview
>> +--------
>> +
>> +A :term:`domid` is Xen's numeric identifier for a :term:`domain`.  In any
>> +operational Xen system, there are one or more domains running.
>> +
>> +Domids are 16-bit integers.  Regular domids start from 0, but there are some
>> +special identifiers, e.g. ``DOMID_SELF``, and :term:`system domains<system
>> +domain>`, e.g. ``DOMID_IDLE`` starting from 0x7ff0.  Therefore, a Xen system
>> +can run a maximum of 32k domains concurrently.
>> +
>> +.. note::
>> +
>> +   Despite being exposed in the domid ABI, the system domains are internal to
>> +   Xen and do not have lifecycles like regular domains.  Therefore, they are
>> +   not discussed further in this document.
>> +
>> +At system boot, Xen will construct one or more domains.  Kernels and
>> +configuration for these domains must be provided by the bootloader, or at
>> +Xen's compile time for more highly integrated solutions.
>> +
>> +Correct functioning of the domain lifecycle involves ``xenstored``, and some
>> +privileged entity which has bound the ``VIRQ_DOM_EXC`` global event channel.
>> +
>> +.. note::
>> +
>> +   While not a strict requirement for these to be the same entity, it is
>> +   ``xenstored`` which typically has ``VIRQ_DOM_EXC`` bound.  This document is
>> +   written assuming the common case.
>> +
>> +Creation
>> +--------
>> +
>> +Within Xen, the ``domain_create()`` function is used to allocate and perform
>> +bare minimum construction of a domain.  The :term:`control domain` accesses
>> +this functionality via the ``DOMCTL_createdomain`` hypercall.
>> +
>> +The final action that ``domain_create()`` performs before returning
>> +successfully is to enter the new domain into the domlist.  This makes the
>> +domain "visible" within Xen, allowing the new domid to be successfully
>> +referenced by other hypercalls.
>> +
>> +At this point, the domain exists as far as Xen is concerned, but not usefully
>> +as a VM yet.  The toolstack performs further construction activities;
>> +allocating vCPUs, RAM, copying in the initial executable code, etc.  Domains
>> +are automatically created with one "pause" reference count held, meaning that
>> +it is not eligible for scheduling.
>> +
>> +When the toolstack has finished VM construction, it send an ``XS_INTRODUCE``
>> +command to ``xenstored``.  This instructs ``xenstored`` to connect to the
>> +guest's xenstore ring, and fire the ``@introduceDomain`` watch.  The firing of
>> +this watch is the signal to all other components which care that a new VM has
>> +appeared and is about to start running.
> Presumably the xenstore ring is memory shared between xenstore and the
> newly created domain. Who establishes that connection? For the case where
> xenstore lives in dom0 things are _simpler_ because it lives in the same VM
> as the toolstack, but I suspect things are hairier when xenstore is in its
> own stubdomain. A description of the grant dance (if any), would be helpful.
>
> In that same line, having mermaid sequence diagrams would make these
> descriptions easier to follow:
>
>   https://sphinxcontrib-mermaid-demo.readthedocs.io/en/latest/

"how to connect to Xenstore" is the subject of a different document I
have planned.

I'll likely cross-link it with this doc when it's done, but it's not
something to get mixed up in here, because it's extraordinarily
complicated when all cases are considered.

>> +
>> +When the ``XS_INTRODUCE`` command returns successfully, the final action the
> Not knowing the internals I find the wording weird, though that might be my
> own misunderstanding. I imagine you mean "... when xenstore replies with
> the successful completion of the ``XS_INTRODUCE`` command...". Considering
> the "xenstore ring" mentioned before, I assume all xenstore comms are
> async.

All xenbus commands generate a reply.  Here I technically mean "the
reply from XS_INTRODUCE says success", but the libxenstore library wraps
them as blocking operations.

And now that I'm staring at the code, I notice that libxl fails to check the
return value and assumes success... /sigh
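
As a sketch of what the blocking wrapper looks like from the toolstack side
(illustrative only; introduce_domain() and the ring_mfn/evtchn parameters
here stand in for values set up earlier during construction):

#include <stdbool.h>
#include <stdio.h>
#include <xenstore.h>

/* Sketch: tell xenstored about a freshly built domain and check the
 * reply.  xs_introduce_domain() issues XS_INTRODUCE and blocks until
 * xenstored answers; a false return means the command failed. */
static bool introduce_domain(struct xs_handle *xsh, unsigned int domid,
                             unsigned long ring_mfn, unsigned int evtchn)
{
    if ( !xs_introduce_domain(xsh, domid, ring_mfn, evtchn) )
    {
        fprintf(stderr, "XS_INTRODUCE for dom%u failed\n", domid);
        return false;
    }

    /* Only now is it safe to issue DOMCTL_unpausedomain. */
    return true;
}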

>
>> +toolstack performs is to unpause the guest, using the ``DOMCTL_unpausedomain``
>> +hypercall.  This drops the "pause" reference the domain was originally created
>> +with, meaning that the vCPU(s) are eligible for scheduling and the domain will
>> +start executing its first instruction.
>> +
>> +.. note::
>> +
>> +   It is common for vCPUs other than 0 to be left in an offline state, to be
>> +   started by actions within the VM.
> Peculiar choice of words. I guess you don't want to pinch your fingers
> precluding other toolstack implementations doing something different. One
> possible way to express it is that "Conventionally, other vCPUs other than
> 0 are left in an offline state to be started by actions within the VM.
> This is non-normative, however, and custom Xen-based systems may
> choose to do otherwise."
>
> As is, it's unclear whether the unconventional behaviour is assumed to be a
> real possibility, a known existing bug, or uncertainty about the past,
> present and future.

There are some architectures which only support starting every thread of
a core at once.  (That said, I'm pretty sure OpenSBI already abstracts
this behaviour for kernels on RISC-V.)

When virtualised, we have the ability to undo that misbehaviour and give
the VM a nicer executing environment.
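
To make "started by actions within the VM" concrete, a guest-side sketch for
a PV/PVH-style environment (HYPERVISOR_vcpu_op() is assumed to be the guest
kernel's hypercall wrapper, and the register-state setup is elided):

#include <xen/xen.h>        /* vcpu_guest_context_t (public Xen headers) */
#include <xen/vcpu.h>       /* VCPUOP_initialise, VCPUOP_up */

/* Assumed to be provided by the guest kernel's hypercall layer. */
extern int HYPERVISOR_vcpu_op(int cmd, unsigned int vcpuid, void *extra);

/* Sketch: online a secondary vCPU from inside the VM.  'ctxt' holds the
 * initial register state the new vCPU starts with. */
static int vcpu_online(unsigned int vcpu, vcpu_guest_context_t *ctxt)
{
    int rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, vcpu, ctxt);

    if ( rc )
        return rc;

    /* Flip the vCPU from "offline" to runnable. */
    return HYPERVISOR_vcpu_op(VCPUOP_up, vcpu, NULL);
}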


>
>> +
>> +Termination
>> +-----------
>> +
>> +The VM runs for a period of time, but eventually stops.  It can stop for a
>> +number of reasons, including:
>> +
>> + * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
> nit: I would 's/guest kernel/guest', but that's just me. Internally the
> kernel may very well be a passive shim where the active intelligence is in
> some disaggregated network of userspace components, making the kernel just
> an accidental proxy.

Fine.

>
>> +   hypercall.  The hypercall also includes the reason for the shutdown,
>> +   e.g. ``poweroff``, ``reboot`` or ``crash``.
>> +
>> + * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
>> +   interrupts disabled is interpreted as a shutdown request as it is a common
>> +   code pattern for fatal error handling when no better options are available.
>> +
>> + * Indirectly from fatal exceptions.  In some states, execution is unable to
>> +   continue, e.g. Triple Fault on x86.
>> +
>> + * Directly from the device model, via the ``DMOP_remote_shutdown`` hypercall.
>> +   E.g. On x86, the 0xcf9 IO port is commonly used to perform platform
>> +   poweroff, reset or sleep transitions.
>> +
>> + * Directly from the toolstack.  The toolstack is capable of initiating
>> +   cleanup directly, e.g. ``xl destroy``.  This is typically an administration
>> +   action of last resort to clean up a domain which malfunctioned but not
>> +   terminated properly.
> This one is at a different abstraction layer than the others. The hypercall(s)
> being used would be more helpful, along with a line saying that the
> toolstack makes use of this through e.g: ``xl destory``.

It is a different abstraction, but it's relevant to how a VM may
terminate, and "how to implement xl destroy" isn't.
>> +
>> + * Directly from Xen.  Some error handling ends up using ``domain_crash()``
>> +   when Xen doesn't think it can safely continue running the VM.
>> +
>> +Whatever the reason for termination, Xen ends up calling ``domain_shutdown()``
>> +to set the shutdown reason and deschedule all vCPUs.  Xen also fires the
>> +``VIRQ_DOM_EXC`` event channel, which is a signal to ``xenstored``.
>> +
>> +Upon receiving ``VIRQ_DOM_EXC``, ``xenstored`` re-scans all domains using the
>> +``SYSCTL_getdomaininfolist`` hypercall.  If any domain has changed state from
>> +running to shut down, ``xenstored`` fires the ``@releaseDomain`` watch.  The
>> +firing of this watch is the signal to all other components which care that a
>> +VM has stopped.
>> +
>> +.. note::
>> +
>> +   Xen does not treat reboot differently to poweroff; both statuses are
>> +   forwarded to the toolstack.  It is up to the toolstack to restart the VM,
>> +   which is typically done by constructing a new domain.
>> +
>> +.. note::
>> +
>> +   Some shutdowns may not result in the cleanup of a domain.  ``suspend`` for
>> +   example can be used for snapshotting, and the VM resumes execution in the
>> +   same domain/domid.  Therefore, a domain can cycle several times between
>> +   running and "shut down" before moving into the destruction phase.
>> +
>> +Destruction
>> +-----------
>> +
>> +The domain object in Xen is reference counted, and survives until all
>> +references are dropped.
> What a "reference" means might help. I'd like to think it means any
> pointer to a domain, and any domid in hypervisor memory, but...
>
>> +
>> +The ``@releaseDomain`` watch is to inform all entities that hold a reference
>> +on the domain to clean up.  This may include:
> ... this statement leads me to believe only references held by trusted
> parties are collected, and by their choice (not by force). What about pages
> granted to other domains that may not whish (or be able to) comply?

That's not a question I can reasonably answer here.  There is an
atomic_t refcount in struct domain and that's ultimately what controls
the freeing of the structure, and oustanding mappings are one source
holding a ref, but there are others too.  e.g. there's one ref held for
the domain having a non-zero memory allocation.
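
For the avoidance of doubt, a deliberately simplified model of that refcount
(not the real code; Xen's actual helpers are get_domain()/put_domain() and
carry more state, e.g. handling for a domain already marked as destroyed):

#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model: one counter, incremented by anything that needs the
 * domain to stay around (grant maps, a non-zero memory allocation, an
 * in-flight hypercall, ...). */
struct domain {
    atomic_uint refcnt;
    /* ... */
};

static bool domain_get_ref(struct domain *d)
{
    unsigned int old = atomic_load(&d->refcnt);

    do {
        if ( old == 0 )         /* already being torn down */
            return false;
    } while ( !atomic_compare_exchange_weak(&d->refcnt, &old, old + 1) );

    return true;
}

static void domain_put_ref(struct domain *d)
{
    /* Dropping the final reference is what moves the domain into the
     * "Freeing" phase: removal from the domlist, then deferred deletion. */
    if ( atomic_fetch_sub(&d->refcnt, 1) == 1 )
        ;   /* i.e. the point where Xen schedules the object for deletion */
}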

>
>> +
>> + * Paravirtual driver backends having a grant map of the shared ring with the
>> +   frontend.
> On a related tangent, what happens if your driver domain is compromised?
> Does it suddenly hold all your domains (and their RAM!) hostage because it
> won't act upon ``@releaseDomain``?

Xen has no support for revocable grants.  It has been an issue under
discussion for more than a decade, but nothing has been completed.

If a rogue driver domain holds your memory hostage, tough.  The overall
system can recover by destroying the driver domain; one action in
DOMCTL_destroydomain is to unmap all outstanding mapped grants, which
will allow both domains to be cleaned up.
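
(For reference, the recovery action itself is a single toolstack call; a
sketch using libxc, with the policy of when to pull the trigger elided:)

#include <stdio.h>
#include <xenctrl.h>

/* Sketch: forcibly destroy a misbehaving driver domain.  This issues
 * DOMCTL_destroydomain, which amongst other things unmaps the domain's
 * outstanding grant mappings, releasing the references it held on other
 * domains' memory. */
static int destroy_driver_domain(uint32_t domid)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    int rc;

    if ( !xch )
        return -1;

    rc = xc_domain_destroy(xch, domid);
    if ( rc )
        fprintf(stderr, "destroying dom%u failed: %d\n", domid, rc);

    xc_interface_close(xch);
    return rc;
}
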
>> + * A device model with a map of the IOREQ page(s).
>> +
>> +The toolstack also has work to do in response to ``@releaseDomain``.  It must
>> +issue the ``DOMCTL_destroydomain`` hypercall.  This hypercall can take minutes
>> +of wall-clock time to complete for large domains as, amongst other things, it
>> +is freeing the domain's RAM back to the system.
>> +
>> +The actions triggered by the ``@releaseDomain`` watch are asynchronous.  There
>> +is no guarantee as to the order in which actions start, or which action is the
>> +final one to complete.  However, the toolstack can achieve some ordering by
>> +delaying the ``DOMCTL_destroydomain`` hypercall if necessary.
>> +
>> +Freeing
>> +-------
>> +
>> +When the final reference on the domain object is dropped, Xen will remove the
> nit: 's/will remove/removes'
>> +domain from the domlist.  This means the domid is no longer visible in Xen,
>> +and no longer able to be referenced by other hypercalls.
>> +
>> +Xen then schedules the object for deletion at some point after any concurrent
>> +hypercalls referencing the domain have completed.
>> +
>> +When the object is finally cleaned up, Xen fires the ``VIRQ_DOM_EXC`` event
>> +channel again, causing ``xenstored`` to rescan an notice that the domain has
>> +ceased to exist.  It fires the ``@releaseDomain`` watch a second time to
>> +signal to any components which care that the domain has gone away.
> At which point did the grant tables drop the domid references? Are we relying
> on the goodwill of the grant destination?

No - all of that is done in the previous section.  While grants of the
domain's memory remain mapped, its refcount won't drop to 0.

~Andrew
Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Jan Beulich 6 months, 3 weeks ago
On 16.10.2023 18:24, Andrew Cooper wrote:
> +Creation
> +--------
> +
> +Within Xen, the ``domain_create()`` function is used to allocate and perform
> +bare minimum construction of a domain.  The :term:`control domain` accesses
> +this functionality via the ``DOMCTL_createdomain`` hypercall.
> +
> +The final action that ``domain_create()`` performs before returning
> +successfully is to enter the new domain into the domlist.  This makes the
> +domain "visible" within Xen, allowing the new domid to be successfully
> +referenced by other hypercalls.
> +
> +At this point, the domain exists as far as Xen is concerned, but not usefully
> +as a VM yet.  The toolstack performs further construction activities;
> +allocating vCPUs, RAM, copying in the initial executable code, etc.  Domains
> +are automatically created with one "pause" reference count held, meaning that
> +it is not eligible for scheduling.

Nit: Afaict either "A domain is ..." or "... they are ...". One might also
add "... right away, i.e. until the tool stack asks for the pause count to
be decremented".

> +Termination
> +-----------
> +
> +The VM runs for a period of time, but eventually stops.  It can stop for a
> +number of reasons, including:
> +
> + * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
> +   hypercall.  The hypercall also includes the reason for the shutdown,
> +   e.g. ``poweroff``, ``reboot`` or ``crash``.
> +
> + * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
> +   interrupts disabled is interpreted as a shutdown request as it is a common
> +   code pattern for fatal error handling when no better options are available.

HLT (note btw that this is x86 and HVM specific and hence may want mentioning
as such) is interpreted this way only if all other vCPU-s are also "down"
already.

> + * Indirectly from fatal exceptions.  In some states, execution is unable to
> +   continue, e.g. Triple Fault on x86.

Nit: This again is HVM specific.

> + * Directly from the device model, via the ``DMOP_remote_shutdown`` hypercall.
> +   E.g. On x86, the 0xcf9 IO port is commonly used to perform platform
> +   poweroff, reset or sleep transitions.
> +
> + * Directly from the toolstack.  The toolstack is capable of initiating
> +   cleanup directly, e.g. ``xl destroy``.  This is typically an administration
> +   action of last resort to clean up a domain which malfunctioned but not
> +   terminated properly.

Nit: You're the native speaker, but doesn't this want to be "... but did not
terminate ..."?

> +Destruction
> +-----------
> +
> +The domain object in Xen is reference counted, and survives until all
> +references are dropped.
> +
> +The ``@releaseDomain`` watch is to inform all entities that hold a reference
> +on the domain to clean up.  This may include:
> +
> + * Paravirtual driver backends having a grant map of the shared ring with the
> +   frontend.

Beyond the shared ring(s), other (data) pages may also still have mappings.

Jan
Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Andrew Cooper 6 months, 3 weeks ago
On 17/10/2023 7:34 am, Jan Beulich wrote:
> On 16.10.2023 18:24, Andrew Cooper wrote:
>> +Termination
>> +-----------
>> +
>> +The VM runs for a period of time, but eventually stops.  It can stop for a
>> +number of reasons, including:
>> +
>> + * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
>> +   hypercall.  The hypercall also includes the reason for the shutdown,
>> +   e.g. ``poweroff``, ``reboot`` or ``crash``.
>> +
>> + * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
>> +   interrupts disabled is interpreted as a shutdown request as it is a common
>> +   code pattern for fatal error handling when no better options are available.
> HLT (note btw that this is x86 and HVM specific and hence may want mentioning
> as such) is interpreted this way only if all other vCPU-s are also "down"
> already.
>
>> + * Indirectly from fatal exceptions.  In some states, execution is unable to
>> +   continue, e.g. Triple Fault on x86.
> Nit: This again is HVM specific.

Triple fault, maybe.  Fatal exceptions terminating the VM, no.

For both, these details are not important for the document.  This is a
list of examples, not an exhaustive list.
>> +Destruction
>> +-----------
>> +
>> +The domain object in Xen is reference counted, and survives until all
>> +references are dropped.
>> +
>> +The ``@releaseDomain`` watch is to inform all entities that hold a reference
>> +on the domain to clean up.  This may include:
>> +
>> + * Paravirtual driver backends having a grant map of the shared ring with the
>> +   frontend.
> Beyond the shared ring(s), other (data) pages may also still have mappings.

Yes, but again, these are just examples.  Other data pages should only
be mapped while data is in flight.

~Andrew
Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Jan Beulich 6 months, 3 weeks ago
On 17.10.2023 12:15, Andrew Cooper wrote:
> On 17/10/2023 7:34 am, Jan Beulich wrote:
>> On 16.10.2023 18:24, Andrew Cooper wrote:
>>> +Termination
>>> +-----------
>>> +
>>> +The VM runs for a period of time, but eventually stops.  It can stop for a
>>> +number of reasons, including:
>>> +
>>> + * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
>>> +   hypercall.  The hypercall also includes the reason for the shutdown,
>>> +   e.g. ``poweroff``, ``reboot`` or ``crash``.
>>> +
>>> + * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
>>> +   interrupts disabled is interpreted as a shutdown request as it is a common
>>> +   code pattern for fatal error handling when no better options are available.
>> HLT (note btw that this is x86 and HVM specific and hence may want mentioning
>> as such) is interpreted this way only if all other vCPU-s are also "down"
>> already.
>>
>>> + * Indirectly from fatal exceptions.  In some states, execution is unable to
>>> +   continue, e.g. Triple Fault on x86.
>> Nit: This again is HVM specific.
> 
> Triple fault, maybe.  fatal exceptions terminating the VM, no.
> 
> For both, these details are not important for the document.  This is an
> list of examples, not an exhaustive list.

Of course. But I would recommend avoiding examples which don't
describe technical details correctly.

>>> +Destruction
>>> +-----------
>>> +
>>> +The domain object in Xen is reference counted, and survives until all
>>> +references are dropped.
>>> +
>>> +The ``@releaseDomain`` watch is to inform all entities that hold a reference
>>> +on the domain to clean up.  This may include:
>>> +
>>> + * Paravirtual driver backends having a grant map of the shared ring with the
>>> +   frontend.
>> Beyond the shared ring(s), other (data) pages may also still have mappings.
> 
> Yes, but again, this is just a examples.  Other data pages should only
> be mapped while data is in flight.

Hmm, blkif's persistent grants are explicitly not like that. All I'm asking
for here is to insert another "e.g."

Jan

Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Juergen Gross 6 months, 3 weeks ago
On 16.10.23 18:24, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: George Dunlap <George.Dunlap@eu.citrix.com>
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Wei Liu <wl@xen.org>
> CC: Julien Grall <julien@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Juergen Gross <jgross@suse.com>
> CC: Henry Wang <Henry.Wang@arm.com>
> 
> Rendered form:
>    https://andrewcoop-xen.readthedocs.io/en/docs-devel/hypervisor-guide/domid-lifecycle.html
> 
> I'm not sure why it's using the alibaster theme and not RTD theme, but I
> don't have time to debug that further at this point.
> 
> This was written mostly while sat waiting for flights in Nanjing and Beijing.
> 
> If while reading this you spot a hole, congratulations.  There are holes which
> need fixing...
> ---
>   docs/glossary.rst                         |   9 ++
>   docs/hypervisor-guide/domid-lifecycle.rst | 164 ++++++++++++++++++++++
>   docs/hypervisor-guide/index.rst           |   1 +
>   3 files changed, 174 insertions(+)
>   create mode 100644 docs/hypervisor-guide/domid-lifecycle.rst
> 
> diff --git a/docs/glossary.rst b/docs/glossary.rst
> index 8ddbdab160a1..1fd1de0f0e97 100644
> --- a/docs/glossary.rst
> +++ b/docs/glossary.rst
> @@ -50,3 +50,12 @@ Glossary
>   
>        By default it gets all devices, including all disks and network cards, so
>        is responsible for multiplexing guest I/O.
> +
> +   system domain
> +     Abstractions within Xen that are modelled in a similar way to regular
> +     :term:`domains<domain>`.  E.g. When there's no work to do, Xen schedules
> +     ``DOMID_IDLE`` to put the CPU into a lower power state.
> +
> +     System domains have :term:`domids<domid>` and are referenced by
> +     privileged software for certain control operations, but they do not run
> +     guest code.
> diff --git a/docs/hypervisor-guide/domid-lifecycle.rst b/docs/hypervisor-guide/domid-lifecycle.rst
> new file mode 100644
> index 000000000000..d405a321f3c7
> --- /dev/null
> +++ b/docs/hypervisor-guide/domid-lifecycle.rst
> @@ -0,0 +1,164 @@
> +.. SPDX-License-Identifier: CC-BY-4.0
> +
> +Lifecycle of a domid
> +====================
> +
> +Overview
> +--------
> +
> +A :term:`domid` is Xen's numeric identifier for a :term:`domain`.  In any
> +operational Xen system, there are one or more domains running.
> +
> +Domids are 16-bit integers.  Regular domids start from 0, but there are some
> +special identifiers, e.g. ``DOMID_SELF``, and :term:`system domains<system
> +domain>`, e.g. ``DOMID_IDLE`` starting from 0x7ff0.  Therefore, a Xen system
> +can run a maximum of 32k domains concurrently.
> +
> +.. note::
> +
> +   Despite being exposed in the domid ABI, the system domains are internal to
> +   Xen and do not have lifecycles like regular domains.  Therefore, they are
> +   not discussed further in this document.
> +
> +At system boot, Xen will construct one or more domains.  Kernels and
> +configuration for these domains must be provided by the bootloader, or at
> +Xen's compile time for more highly integrated solutions.
> +
> +Correct functioning of the domain lifecycle involves ``xenstored``, and some
> +privileged entity which has bound the ``VIRQ_DOM_EXC`` global event channel.
> +
> +.. note::
> +
> +   While not a strict requirement for these to be the same entity, it is
> +   ``xenstored`` which typically has ``VIRQ_DOM_EXC`` bound.  This document is
> +   written assuming the common case.
> +
> +Creation
> +--------
> +
> +Within Xen, the ``domain_create()`` function is used to allocate and perform
> +bare minimum construction of a domain.  The :term:`control domain` accesses
> +this functionality via the ``DOMCTL_createdomain`` hypercall.
> +
> +The final action that ``domain_create()`` performs before returning
> +successfully is to enter the new domain into the domlist.  This makes the
> +domain "visible" within Xen, allowing the new domid to be successfully
> +referenced by other hypercalls.
> +
> +At this point, the domain exists as far as Xen is concerned, but not usefully
> +as a VM yet.  The toolstack performs further construction activities;
> +allocating vCPUs, RAM, copying in the initial executable code, etc.  Domains
> +are automatically created with one "pause" reference count held, meaning that
> +it is not eligible for scheduling.
> +
> +When the toolstack has finished VM construction, it send an ``XS_INTRODUCE``

s/send/sends/

> +command to ``xenstored``.  This instructs ``xenstored`` to connect to the
> +guest's xenstore ring, and fire the ``@introduceDomain`` watch.  The firing of
> +this watch is the signal to all other components which care that a new VM has
> +appeared and is about to start running.

A note should be added that the control domain is introduced implicitly by
xenstored, so no XS_INTRODUCE command is needed and no @introduceDomain watch
is being sent for the control domain.

All components interested in the @introduceDomain watch have to find out for
themselves which new domain has appeared, as the watch event doesn't contain
the domid of the new domain.

> +
> +When the ``XS_INTRODUCE`` command returns successfully, the final action the
> +toolstack performs is to unpause the guest, using the ``DOMCTL_unpausedomain``
> +hypercall.  This drops the "pause" reference the domain was originally created
> +with, meaning that the vCPU(s) are eligible for scheduling and the domain will
> +start executing its first instruction.
> +
> +.. note::
> +
> +   It is common for vCPUs other than 0 to be left in an offline state, to be
> +   started by actions within the VM.
> +
> +Termination
> +-----------
> +
> +The VM runs for a period of time, but eventually stops.  It can stop for a
> +number of reasons, including:
> +
> + * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
> +   hypercall.  The hypercall also includes the reason for the shutdown,
> +   e.g. ``poweroff``, ``reboot`` or ``crash``.
> +
> + * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
> +   interrupts disabled is interpreted as a shutdown request as it is a common
> +   code pattern for fatal error handling when no better options are available.
> +
> + * Indirectly from fatal exceptions.  In some states, execution is unable to
> +   continue, e.g. Triple Fault on x86.
> +
> + * Directly from the device model, via the ``DMOP_remote_shutdown`` hypercall.
> +   E.g. On x86, the 0xcf9 IO port is commonly used to perform platform
> +   poweroff, reset or sleep transitions.
> +
> + * Directly from the toolstack.  The toolstack is capable of initiating
> +   cleanup directly, e.g. ``xl destroy``.  This is typically an administration
> +   action of last resort to clean up a domain which malfunctioned but not
> +   terminated properly.
> +
> + * Directly from Xen.  Some error handling ends up using ``domain_crash()``
> +   when Xen doesn't think it can safely continue running the VM.
> +
> +Whatever the reason for termination, Xen ends up calling ``domain_shutdown()``
> +to set the shutdown reason and deschedule all vCPUs.  Xen also fires the
> +``VIRQ_DOM_EXC`` event channel, which is a signal to ``xenstored``.
> +
> +Upon receiving ``VIRQ_DOM_EXC``, ``xenstored`` re-scans all domains using the
> +``SYSCTL_getdomaininfolist`` hypercall.  If any domain has changed state from
> +running to shut down, ``xenstored`` fires the ``@releaseDomain`` watch.  The
> +firing of this watch is the signal to all other components which care that a
> +VM has stopped.

The same as above applies: all components receiving the @releaseDomain watch
event have to find out for themselves which domain has stopped.

> +
> +.. note::
> +
> +   Xen does not treat reboot differently to poweroff; both statuses are
> +   forwarded to the toolstack.  It is up to the toolstack to restart the VM,
> +   which is typically done by constructing a new domain.
> +
> +.. note::
> +
> +   Some shutdowns may not result in the cleanup of a domain.  ``suspend`` for
> +   example can be used for snapshotting, and the VM resumes execution in the
> +   same domain/domid.  Therefore, a domain can cycle several times between
> +   running and "shut down" before moving into the destruction phase.
> +
> +Destruction
> +-----------
> +
> +The domain object in Xen is reference counted, and survives until all
> +references are dropped.
> +
> +The ``@releaseDomain`` watch is to inform all entities that hold a reference
> +on the domain to clean up.  This may include:
> +
> + * Paravirtual driver backends having a grant map of the shared ring with the
> +   frontend.
> + * A device model with a map of the IOREQ page(s).
> +
> +The toolstack also has work to do in response to ``@releaseDomain``.  It must
> +issue the ``DOMCTL_destroydomain`` hypercall.  This hypercall can take minutes
> +of wall-clock time to complete for large domains as, amongst other things, it
> +is freeing the domain's RAM back to the system.
> +
> +The actions triggered by the ``@releaseDomain`` watch are asynchronous.  There
> +is no guarantee as to the order in which actions start, or which action is the
> +final one to complete.  However, the toolstack can achieve some ordering by
> +delaying the ``DOMCTL_destroydomain`` hypercall if necessary.
> +
> +Freeing
> +-------
> +
> +When the final reference on the domain object is dropped, Xen will remove the
> +domain from the domlist.  This means the domid is no longer visible in Xen,
> +and no longer able to be referenced by other hypercalls.
> +
> +Xen then schedules the object for deletion at some point after any concurrent
> +hypercalls referencing the domain have completed.
> +
> +When the object is finally cleaned up, Xen fires the ``VIRQ_DOM_EXC`` event
> +channel again, causing ``xenstored`` to rescan an notice that the domain has

s/an/and/

> +ceased to exist.  It fires the ``@releaseDomain`` watch a second time to
> +signal to any components which care that the domain has gone away.
> +
> +E.g. The second ``@releaseDomain`` is commonly used by paravirtual driver
> +backends to shut themselves down.

There is no guarantee that @releaseDomain will always be fired twice for a
domain ceasing to exist, and multiple domains disappearing might result in
only one @releaseDomain watch being fired. This means that any component
receiving this watch event has to find out not only the domid(s) of the
domains changing state, but also whether they have merely shut down or
are completely gone.

> +
> +At this point, the toolstack can reuse the domid for a new domain.
> diff --git a/docs/hypervisor-guide/index.rst b/docs/hypervisor-guide/index.rst
> index e4393b06975b..af88bcef8313 100644
> --- a/docs/hypervisor-guide/index.rst
> +++ b/docs/hypervisor-guide/index.rst
> @@ -6,6 +6,7 @@ Hypervisor documentation
>   .. toctree::
>      :maxdepth: 2
>   
> +   domid-lifecycle
>      code-coverage
>   
>      x86/index
> 
> base-commit: dc9d9aa62ddeb14abd5672690d30789829f58f7e
> prerequisite-patch-id: 832bdc9a23500d426b4fe11237ae7f6614f2369c


Juergen
Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Andrew Cooper 6 months, 3 weeks ago
On 17/10/2023 6:24 am, Juergen Gross wrote:
> On 16.10.23 18:24, Andrew Cooper wrote:
>> +command to ``xenstored``.  This instructs ``xenstored`` to connect to
>> the
>> +guest's xenstore ring, and fire the ``@introduceDomain`` watch.  The
>> firing of
>> +this watch is the signal to all other components which care that a
>> new VM has
>> +appeared and is about to start running.
> 
> A note should be added that the control domain is introduced implicitly by
> xenstored, so no XS_INTRODUCE command is needed and no @introduceDomain
> watch is being sent for the control domain.

How does this work for a stub xenstored?  It can't know that dom0 is
alive, and is the control domain, and mustn't assume that this is true.

I admit that I've been a bit vague in the areas where I think there are
pre-existing bugs.  This is one area.

I'm planning a separate document on "how to connect to xenstore" seeing
as it is buggy in multiple ways in Linux (causing a deadlock on boot
with a stub xenstored), and made worse by dom0less creating memory
corruption from a 3rd entity into the xenstored<->kernel comms channel.

(And as I've said multiple times already, shuffling code in one of the
two xenstored's doesn't fix the root of the dom0less bug.  It simply
shuffles it around for someone else to trip over.)

> All components interested in the @introduceDomain watch have to find out for
> themselves which new domain has appeared, as the watch event doesn't contain
> the domid of the new domain.

Yes, but we're intending to change that, and it is diverting focus from
the domain's lifecycle.

I suppose I could put in a footnote discussing the single-bit-ness of
the three signals.
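
Something like the following is what every watcher has to do today; a
sketch with libxc, with the "previously known" bookkeeping elided:

#include <stdio.h>
#include <xenctrl.h>

/* Sketch: @introduceDomain/@releaseDomain carry no domid, so a watcher
 * rescans all domains and diffs the result against its own records. */
static void rescan_domains(xc_interface *xch)
{
    xc_domaininfo_t info[256];
    int i, nr = xc_domain_getinfolist(xch, 0, 256, info);

    for ( i = 0; i < nr; i++ )
    {
        if ( info[i].flags & XEN_DOMINF_shutdown )
            printf("dom%u has shut down\n", (unsigned)info[i].domain);
        if ( info[i].flags & XEN_DOMINF_dying )
            printf("dom%u is being destroyed\n", (unsigned)info[i].domain);
    }

    /* Any domid previously tracked but now absent from 'info' has been
     * fully freed - the second @releaseDomain case. */
}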

>> +ceased to exist.  It fires the ``@releaseDomain`` watch a second time to
>> +signal to any components which care that the domain has gone away.
>> +
>> +E.g. The second ``@releaseDomain`` is commonly used by paravirtual
>> driver
>> +backends to shut themselves down.
> 
> There is no guarantee that @releaseDomain will always be fired twice for a
> domain ceasing to exist,

Are you sure?

Because the toolstack needs to listen to @releaseDomain in order to
start cleanup, there will be two distinct @releaseDomain's for an
individual domain.

But an individual @releaseDomain can be relevant for a state change in
more than one domain, so there are not necessarily 2*nr_doms worth of
@releaseDomain's fired.

> and multiple domains disappearing might result in
> only one @releaseDomain watch being fired. This means that any component
> receiving this watch event have not only to find out the domid(s) of the
> domains changing state, but whether they have been shutting down only, or
> are completely gone, too.

All entities holding a reference on the domain will block the second
notification until they have performed their own unmap action.

But for entities which don't hold a reference on the domain, there is a
race condition where its @releaseDomain notification is delivered
sufficiently late that the domid has already disappeared.

It's certainly good coding practice to cope with the domain disappearing
entirely underfoot, but entities without held references don't watch
@releaseDomain in the first place, so I don't think this case occurs in
practice.

~Andrew

Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid
Posted by Juergen Gross 6 months, 3 weeks ago
On 17.10.23 12:09, Andrew Cooper wrote:
> On 17/10/2023 6:24 am, Juergen Gross wrote:
>> On 16.10.23 18:24, Andrew Cooper wrote:
>>> +command to ``xenstored``.  This instructs ``xenstored`` to connect to
>>> the
>>> +guest's xenstore ring, and fire the ``@introduceDomain`` watch.  The
>>> firing of
>>> +this watch is the signal to all other components which care that a
>>> new VM has
>>> +appeared and is about to start running.
>>
>> A note should be added that the control domain is introduced implicitly by
>> xenstored, so no XS_INTRODUCE command is needed and no @introduceDomain
>> watch is being sent for the control domain.
> 
> How does this work for a stub xenstored?  It can't know that dom0 is
> alive, and is the control domain, and mustn't assume that this is true.

A stub xenstored gets the control domain's domid via a boot parameter.

> I admit that I've been a bit vague in the areas where I think there are
> pre-existing bugs.  This is one area.
> 
> I'm planning a separate document on "how to connect to xenstore" seeing
> as it is buggy in multiple ways in Linux (causing a deadlock on boot
> with a stub xenstored), and made worse by dom0less creating memory
> corruption from a 3rd entity into the xenstored<->kernel comms channel.
> 
> (And as I've said multiple times already, shuffling code in one of the
> two xenstored's doesn't fix the root of the dom0less bug.  It simply
> shuffles it around for someone else to trip over.)
> 
>> All components interested in the @introduceDomain watch have to find out for
>> themselves which new domain has appeared, as the watch event doesn't contain
>> the domid of the new domain.
> 
> Yes, but we're intending to change that, and it is diverting focus from
> the domain's lifecycle.
> 
> I suppose I could put in a footnote discussing the single-bit-ness of
> the three signals.

Fine with me. I just wanted to mention this detail.

> 
>>> +ceased to exist.  It fires the ``@releaseDomain`` watch a second time to
>>> +signal to any components which care that the domain has gone away.
>>> +
>>> +E.g. The second ``@releaseDomain`` is commonly used by paravirtual
>>> driver
>>> +backends to shut themselves down.
>>
>> There is no guarantee that @releaseDomain will always be fired twice for a
>> domain ceasing to exist,
> 
> Are you sure?

Yes. Identical pending watch events are allowed to be merged into one.

> Because the toolstack needs to listen to @releaseDomain in order to
> start cleanup, there will be two distinct @releaseDomain's for an
> individual domain.
>
> But an individual @releaseDomain can be relevant for a state change in
> more than one domain, so there are not necessary 2*nr_doms worth of
> @releaseDomain's fired.

Correct.

> 
>> and multiple domains disappearing might result in
>> only one @releaseDomain watch being fired. This means that any component
>> receiving this watch event have not only to find out the domid(s) of the
>> domains changing state, but whether they have been shutting down only, or
>> are completely gone, too.
> 
> All entities holding a reference on the domain will block the second
> notification until they have performed their own unmap action.

You are aware that backends normally don't register for @releaseDomain, but
set a watch on their backend-specific Xenstore node in order to react to
the tool stack removing the backend device nodes?
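
(Roughly this pattern, as a sketch with libxenstore; the path is illustrative
for a vif backend serving device 0 of frontend 'domid':)

#include <stdbool.h>
#include <stdio.h>
#include <xenstore.h>

/* Sketch: a backend watches its own backend directory instead of
 * @releaseDomain.  When the toolstack removes the device nodes, the watch
 * fires and the backend tears down its rings and grant maps. */
static bool watch_backend_nodes(struct xs_handle *xsh, unsigned int domid)
{
    char path[64];

    snprintf(path, sizeof(path), "backend/vif/%u/0", domid);

    /* The token is echoed back by xs_read_watch(), letting the backend
     * tell its watches apart. */
    return xs_watch(xsh, path, "backend-state");
}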

> But for entities which don't hold a reference on the domain, there is a
> race condition where it's @releaseDomain notification is delivered
> sufficiently late that the domid has already disappeared.

Exactly.

> It's certainly good coding practice to cope with the domain disappearing
> entirely underfoot, but entities without held references don't watch
> @releaseDomain in the first place, so I don't think this case occurs in
> practice.

I could easily see use cases where this assumption isn't true, like a daemon
supervising domains in order to respawn them in case they have died.


Juergen