[v2] Add a new acquire resource to query vcpu statistics

[RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics

Posted by Matias Ezequiel Vara Larsen 1 year, 6 months ago

Hello all,

The purpose of this RFC is to get feedback about a new acquire resource that
exposes vcpu statistics for a given domain. The current mechanism to get those
statistics is by querying the hypervisor. This mechanism relies on a hypercall
and holds the domctl spinlock during its execution. When a pv tool like xcp-rrdd
periodically samples these counters, it ends up affecting other paths that share
that spinlock. By using acquire resources, the pv tool only requires a few
hypercalls to set the shared memory region and samples are got without issuing
any other hypercall. The original idea has been suggested by Andrew Cooper to
which I have been discussing about how to implement the current PoC. You can
find the RFC patch series at [1]. The series is rebased on top of stable-4.15.

I am currently a bit blocked on 1) what to expose and 2) how to expose it. For
1), I decided to expose what xcp-rrdd is querying, e.g., XEN_DOMCTL_getvcpuinfo.
More precisely, xcp-rrd gets runstate.time[RUNSTATE_running]. This is a uint64_t
counter. However, the time spent in other states may be interesting too.
Regarding 2), I am not sure if simply using an array of uint64_t is enough or if
a different interface should be exposed. The remaining question is when to get
new values. For the moment, I am updating this counter during
vcpu_runstate_change().

The current series includes a simple pv tool that shows how this new interface is
used. This tool maps the counter and periodically samples it.

Any feedback/help would be appreciated.

Thanks, Matias.

[1] https://github.com/MatiasVara/xen/tree/feature_stats

Changes in v2:
- rework to ensure that consumer fetches consistent data

Changes in v1:
- rework how the resource is allocated and released
- rework when the resource is allocated that happens only when the resource is
  requested 
- rework the structure shared between the tool and Xen to make it extensible to
  new counters and declare it in a public header

There are still the following questions:
   - resource shall be released when there are no more readers otherwise we keep
     updating it during a hot path
   - one frame can host up to 512 vcpus. Should I check to this limit when
     updating? Should it be possible to allocate more than one frame for vcpu
     counters? 

Matias Ezequiel Vara Larsen (2):
  xen/memory : Add a stats_table resource type
  tools/misc: Add xen-vcpus-stats tool

 tools/misc/Makefile          |  6 +++
 tools/misc/xen-vcpus-stats.c | 87 +++++++++++++++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c       |  2 +
 xen/common/memory.c          | 94 ++++++++++++++++++++++++++++++++++++
 xen/common/sched/core.c      | 16 ++++++
 xen/include/public/memory.h  |  3 ++
 xen/include/public/vcpu.h    | 16 ++++++
 xen/include/xen/mm.h         |  2 +
 xen/include/xen/sched.h      |  5 ++
 9 files changed, 231 insertions(+)
 create mode 100644 tools/misc/xen-vcpus-stats.c

-- 
2.25.1

API/ABIs: Re: [RFC PATCH v2 0/2] Add a new acquire resource to query vcpu statistics

Posted by Andrew Cooper 1 year, 2 months ago

A discussion about forward extensible APIs and ABIs here.

First, its important to say that this should be "domain stats" and not
just "vcpu stats".  This is easy to account for now, but hard to
retrofit later.

For the shared memory, we have a minimum granularity of PAGE_SIZE (4k
for now, but this will change in due course on non-x86 architectures),
and a resource-agnostic way of determining the resource size (as a
multiple of the page size).

But there are other things we need to consider:

1) To be sensibly extendable, there needs to be some metadata, and the
domain stats is clearly going to be a different shape to the vcpu stats.

2) We also want to give Xen some flexibility to allocate memory in a
suitable manner for the system.

3) Xen and the userspace consuming this interface will likely be built
from different versions of the header.  This needs to inter-operate with
the common subset of functionality.


So what we want, at least to describe the shape, is something like this:

struct dom_shmem_stats {
    uint32_t dom_size; // sizeof(dom)
    uint32_t vcpu_offs;
    uint32_t vcpu_size; // sizeof(vcpu)
    uint32_t vcpu_stride;
    ...
};

struct vcpu_shmem_stats {
    ...
};

where the data layout shall be that there is one dom structure starting
at 0, and an array of vcpu_stride objects starting at vcpu_offset.

Obviously, some invariants apply.  vcpu_offs >= dom_size, and
vcpu_stride >= vcpu_size.  The total size of the mapping must be larger
than vcpu_offs + vcpus * vcpu_stride  (and no, I intentionally don't
care about the rounding at the end because it doesn't change things in
practice.)

A very simple layout, packing the data as closely as reasonable, might be:

vcpu_offs = roundup(sizeof(dom), cacheline_size)
vcpu_stride = roundup(sizeof(vcpu), cacheline_size);

but Xen might have reason to use some other rounding.  As the dom or
vcpu size approaches a page, then Xen will want to change allocation
scheme to use page size for both, and not vmap the lot as one
consecutive region.



For the stats data itself, there wants to be some indication of what
data Xen is producing, so userspace can know not to bother consuming. 
So something like:

struct $foo_stats {
    ...

#define STATS_A (1u << 0)
#define STATS_B (1u << 2)
    uint32_t stats_active;
   
    struct $foo_stats_a {
        uint32_t single_field;
        ... // maybe other singleton fields
    };

    struct $foo_stats_b {
        uint32_t seq;  // "seq" is more common than "version"
        uint32_t _pad;
        uint64_t field1;
        uint64_t field2;
        ...
    };
};


Forward compatibility rules say that you can only ever append new
fields.  But as hinted at with the stats_active field, its fine to leave
reserved fields around for future use, generally with a rule that
anything reserved shall be 0.

Importantly, this means that offsetof(dom, stats_b) is fixed, and will
inter-operate just fine if e.g. userspace knows about a stats_c that Xen
doesn't know about.

But this does highlight some more invariants.  Xen must not produce any
data outside of [0, vcpu_offs) for dom data, and [base, vcpu_stride) for
vcpu data.

Furthermore, Xen should not produce any data not indicated by the
stats_active field.  That said, if Xen is compiled knowing about
dom->stats_c, and userspace is not, then userspace will observe Xen
advertising a stat it doesn't know, and producing data beyond
userspace's sizeof(dom), but within dom->vcpu_offs.  This is explicitly
fine and expected, and why Xen writes it's sizeof() information into the
dom header.  This allows both sides to agree on the layout even when
they're not compiled from identical copies of the header.



A few closing thoughts.

1) It is wise to have a magic number at the head of each dom and vcpu
struct.  This helps sanity check that both sides have correctly agreed
on the layout, but can also serve double duty as an ABI "version".  If
we screw up spectacularly, and decide that the best course of action is
to break backwards compatibility, then we can change the magic and edit
the structs in a non-forwards-compatible way.

2) We may get to a point of wanting arch specific stats.  This can be
accommodated in the model by having struct arch_{dom,vcpu}_stats at
positions described by the layout at the start of dom.  It would be wise
to leave space (reserved fields) there to use if necessary.  This is
cleaner than deciding that we need to put some new layout fields after
the latest $foo_stats_$N and before $foo_stats_$M.

3) It would be great if we could have something in tools/tests/ which
can attach to a running VM and assess the correctness of the invariants
given.  Better yet if it could compile for each change of the ABI and
assess the correctness for all.


I hope this all makes sense.  I know its not trivial, but there's also
nothing in here which is rocket science, and with a bit of good design
work up front, it will be a flexible interface that we never have to
break backwards compatibility with.

~Andrew