[Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 6 years ago
Currently the "pseries" machine type will (usually) advertise
different pagesizes to the guest when running under KVM and TCG, which
is not how things are supposed to work.

This comes from poor handling of hardware limitations which mean that
under KVM HV the guest is unable to use pagesizes larger than those
backing the guest's RAM on the host side.

The new scheme turns things around by having an explicit machine
parameter controlling the largest page size that the guest is allowed
to use.  This limitation applies regardless of accelerator.  When
we're running on KVM HV we ensure that our backing pages are adequate
to supply the requested guest page sizes, rather than adjusting the
guest page sizes based on what KVM can supply.

This means that in order to use hugepages in a PAPR guest it's
necessary to add a "cap-hpt-mps=24" machine parameter as well as
setting the mem-path correctly.  This is a bit more work on the user
and/or management side, but results in consistent behaviour so I think
it's worth it.
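
For illustration, a rough sketch of such an invocation, assuming the
RFC parameter name above and a 16MiB hugetlbfs mount at
/dev/hugepages-16M (both the exact parameter name and the mount path
are placeholders and may differ on a real system):

  qemu-system-ppc64 \
    -machine pseries,accel=kvm,cap-hpt-mps=24 \
    -m 2048 -mem-prealloc -mem-path /dev/hugepages-16M \
    -display none -serial mon:stdio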

Longer term, we can also use this parameter to control IOMMU page
sizes.  However the restrictions here are even more complicated based
on an intersection of guest, host kernel and hardware capabilities.

This applies on top of my recent series cleaning up the PAPR mode
initialization, which in turn applies on top of my ppc-for-2.13 tree.

David Gibson (7):
  spapr: Maximum (HPT) pagesize property
  spapr: Use maximum page size capability to simplify memory backend
    checking
  target/ppc: Add ppc_hash64_filter_pagesizes()
  spapr: Add cpu_apply hook to capabilities
  spapr: Limit available pagesizes to provide a consistent guest
    environment
  spapr: Don't rewrite mmu capabilities in KVM mode
  spapr_pci: Remove unhelpful pagesize warning

 hw/ppc/spapr.c          |  18 +++---
 hw/ppc/spapr_caps.c     | 125 ++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_cpu_core.c |   4 ++
 hw/ppc/spapr_pci.c      |   7 ---
 include/hw/ppc/spapr.h  |   8 ++-
 target/ppc/kvm.c        | 149 ++++++++++++++++++++++++------------------------
 target/ppc/kvm_ppc.h    |  11 +++-
 target/ppc/mmu-hash64.c |  59 +++++++++++++++++++
 target/ppc/mmu-hash64.h |   3 +
 9 files changed, 287 insertions(+), 97 deletions(-)

-- 
2.14.3


Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 6 years ago
On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> Currently the "pseries" machine type will (usually) advertise
> different pagesizes to the guest when running under KVM and TCG, which
> is not how things are supposed to work.
> 
> This comes from poor handling of hardware limitations which mean that
> under KVM HV the guest is unable to use pagesizes larger than those
> backing the guest's RAM on the host side.
> 
> The new scheme turns things around by having an explicit machine
> parameter controlling the largest page size that the guest is allowed
> to use.  This limitation applies regardless of accelerator.  When
> we're running on KVM HV we ensure that our backing pages are adequate
> to supply the requested guest page sizes, rather than adjusting the
> guest page sizes based on what KVM can supply.
> 
> This means that in order to use hugepages in a PAPR guest it's
> necessary to add a "cap-hpt-mps=24" machine parameter as well as
> setting the mem-path correctly.  This is a bit more work on the user
> and/or management side, but results in consistent behaviour so I think
> it's worth it.

libvirt guests already need to explicitly opt-in to hugepages, so
adding this new option automagically based on that shouldn't be too
difficult.

A couple of questions:

  * I see the option accepts values 12, 16, 24 and 34, with 16
    being the default. I guess 34 corresponds to 1 GiB hugepages?
    Also, in what scenario would 12 be used?

  * The name of the property suggests this setting is only relevant
    for HPT guests. libvirt doesn't really have the notion of HPT
    and RPT, and I'm not really itching to introduce it. Can we
    safely use this option for all guests, even RPT ones?

Thanks.

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 6 years ago
On Thu, Apr 19, 2018 at 05:30:04PM +0200, Andrea Bolognani wrote:
> On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> > Currently the "pseries" machine type will (usually) advertise
> > different pagesizes to the guest when running under KVM and TCG, which
> > is not how things are supposed to work.
> > 
> > This comes from poor handling of hardware limitations which mean that
> > under KVM HV the guest is unable to use pagesizes larger than those
> > backing the guest's RAM on the host side.
> > 
> > The new scheme turns things around by having an explicit machine
> > parameter controlling the largest page size that the guest is allowed
> > to use.  This limitation applies regardless of accelerator.  When
> > we're running on KVM HV we ensure that our backing pages are adequate
> > to supply the requested guest page sizes, rather than adjusting the
> > guest page sizes based on what KVM can supply.
> > 
> > This means that in order to use hugepages in a PAPR guest it's
> > necessary to add a "cap-hpt-mps=24" machine parameter as well as
> > setting the mem-path correctly.  This is a bit more work on the user
> > and/or management side, but results in consistent behaviour so I think
> > it's worth it.
> 
> libvirt guests already need to explicitly opt-in to hugepages, so
> adding this new option automagically based on that shouldn't be too
> difficult.

Right.  We have to be a bit careful with automagic though, because
treating hugepage as a boolean is one of the problems that this
parameter is there to address.

If libvirt were to set the parameter based on the pagesize of the
hugepage mount, then it might not be consistent across a migration
(e.g. p8 to p9).  Now the new code would at least catch that and
safely fail the migration, but that might be confusing to users.

> A couple of questions:
> 
>   * I see the option accepts values 12, 16, 24 and 34, with 16
>     being the default.

In fact it should accept any value >= 12, though the ones that you
list are the interesting ones.  This does mean, for example, that if
it were just set to the hugepage size on a p9, 21 (2MiB), things should
work correctly (in practice it would act identically to setting it to
16).

> I guess 34 corresponds to 1 GiB hugepages?

No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
practice).
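
Just to spell out the correspondence between those shift values and
actual sizes (a trivial shell illustration, nothing QEMU-specific):

  # log2(page size) -> bytes: 12 = 4KiB, 16 = 64KiB, 21 = 2MiB,
  # 24 = 16MiB, 30 = 1GiB, 34 = 16GiB
  for shift in 12 16 21 24 30 34; do
      printf '%2d -> %d bytes\n' "$shift" "$((1 << shift))"
  done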

>     Also, in what scenario would 12 be used?

So RHEL, at least, generally configures ppc64 kernels to use 64kiB
pages, but 4kiB pages are still supported upstream (not sure if there
are any distros that still use that mode).  If your host uses 4kiB
pages you wouldn't be able to start a (KVM HV) guest without setting
this to 12 (or using a 64kiB hugepage mount).
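
As an aside, the host's base page size and default hugepage size can
be checked with standard tools, e.g.:

  getconf PAGESIZE                  # 65536 on a 64kiB kernel, 4096 on 4kiB
  grep Hugepagesize /proc/meminfo   # default hugepage size, e.g. 16384 kB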

>   * The name of the property suggests this setting is only relevant
>     for HPT guests. libvirt doesn't really have the notion of HPT
>     and RPT, and I'm not really itching to introduce it. Can we
>     safely use this option for all guests, even RPT ones?

Yes.  The "hpt" in the main is meant to imply that its restriction
only applies when the guest is in HPT mode, but it can be safely set
in any mode.  In RPT mode guest and host pagesizes are independent of
each other, so we don't have to deal with this mess.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 6 years ago
On Fri, 2018-04-20 at 12:35 +1000, David Gibson wrote:
> On Thu, Apr 19, 2018 at 05:30:04PM +0200, Andrea Bolognani wrote:
> > On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> > > This means that in order to use hugepages in a PAPR guest it's
> > > necessary to add a "cap-hpt-mps=24" machine parameter as well as
> > > setting the mem-path correctly.  This is a bit more work on the user
> > > and/or management side, but results in consistent behaviour so I think
> > > it's worth it.
> > 
> > libvirt guests already need to explicitly opt-in to hugepages, so
> > adding this new option automagically based on that shouldn't be too
> > difficult.
> 
> Right.  We have to be a bit careful with automagic though, because
> treating hugepage as a boolean is one of the problems that this
> parameter is there to address.
> 
> If libvirt were to set the parameter based on the pagesize of the
> hugepage mount, then it might not be consistent across a migration
> (e.g. p8 to p9).  Now the new code would at least catch that and
> safely fail the migration, but that might be confusing to users.

Good point.

I'll have to look into it to be sure, but I think it should be
possible for libvirt to convert a generic

  <memoryBacking>
    <hugepages/>
  </memoryBacking>

to a more specific

  <memoryBacking>
    <hugepages>
      <page size="16384" unit="KiB"/>
    </hugepages>
  </memoryBacking>

by figuring out the page size for the default hugepage mount,
which actually sounds like a good idea regardless. Of course users
would still be able to provide the page size themselves in the
first place.
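
For instance, one possible way to discover the default hugepage size
(just a sketch of the idea, not necessarily how libvirt would end up
implementing it) would be to read it from the kernel:

  # Default hugepage size in KiB, e.g. 16384 on a POWER8 host
  # configured for 16MiB hugepages.
  awk '/^Hugepagesize:/ { print $2 }' /proc/meminfo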

Is the 16 MiB page size available for both POWER8 and POWER9?

> > A couple of questions:
> > 
> >   * I see the option accepts values 12, 16, 24 and 34, with 16
> >     being the default.
> 
> In fact it should accept any value >= 12, though the ones that you
> list are the interesting ones.

Well, I copied them from the QEMU help text, and I kinda assumed
that you wouldn't just list completely random values there O:-)

> This does mean, for example, that if
> it was just set to the hugepage size on a p9, 21 (2MiB) things should
> work correctly (in practice it would act identically to setting it to
> 16).

Wouldn't that lead to different behavior depending on whether you
start the guest on a POWER9 or POWER8 machine? The former would be
able to use 2 MiB hugepages, while the latter would be stuck using
regular 64 KiB pages. Migration of such a guest from POWER9 to
POWER8 wouldn't work because the hugepage allocation couldn't be
fulfilled, but the other way around would probably work and lead to
different page sizes being available inside the guest after a power
cycle, no?

> > I guess 34 corresponds to 1 GiB hugepages?
> 
> No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> practice).

Isn't 1 GiB hugepages support at least being worked on[1]?

> >     Also, in what scenario would 12 be used?
> 
> So RHEL, at least, generally configures ppc64 kernels to use 64kiB
> pages, but 4kiB pages are still supported upstream (not sure if there
> are any distros that still use that mode).  If your host uses 4kiB
> pages you wouldn't be able to start a (KVM HV) guest without setting
> this to 12 (or using a 64kiB hugepage mount).

Mh, that's annoying, as needing to support 4 KiB pages would most
likely mean we'd have to turn this into a stand-alone configuration
knob rather than deriving it entirely from existing ones, which I'd
prefer as it's clearly much more user-friendly.

I'll check out what other distros are doing: if all the major ones
are defaulting to 64 KiB pages these days, it might be reasonable
to do the same and pretend smaller page sizes don't exist at all in
order to avoid the pain of having to tweak yet another knob, even
if that means leaving people compiling their own custom kernels
with 4 KiB page size in the dust.

> >   * The name of the property suggests this setting is only relevant
> >     for HPT guests. libvirt doesn't really have the notion of HPT
> >     and RPT, and I'm not really itching to introduce it. Can we
> >     safely use this option for all guests, even RPT ones?
> 
> Yes.  The "hpt" in the main is meant to imply that its restriction
> only applies when the guest is in HPT mode, but it can be safely set
> in any mode.  In RPT mode guest and host pagesizes are independent of
> each other, so we don't have to deal with this mess.

Good :)


[1] https://patchwork.kernel.org/patch/9729991/
-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 6 years ago
On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 12:35 +1000, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 05:30:04PM +0200, Andrea Bolognani wrote:
> > > On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> > > > This means that in order to use hugepages in a PAPR guest it's
> > > > necessary to add a "cap-hpt-mps=24" machine parameter as well as
> > > > setting the mem-path correctly.  This is a bit more work on the user
> > > > and/or management side, but results in consistent behaviour so I think
> > > > it's worth it.
> > > 
> > > libvirt guests already need to explicitly opt-in to hugepages, so
> > > adding this new option automagically based on that shouldn't be too
> > > difficult.
> > 
> > Right.  We have to be a bit careful with automagic though, because
> > treating hugepage as a boolean is one of the problems that this
> > parameter is there to address.
> > 
> > If libvirt were to set the parameter based on the pagesize of the
> > hugepage mount, then it might not be consistent across a migration
> > (e.g. p8 to p9).  Now the new code would at least catch that and
> > safely fail the migration, but that might be confusing to users.
> 
> Good point.
> 
> I'll have to look into it to be sure, but I think it should be
> possible for libvirt to convert a generic
> 
>   <memoryBacking>
>     <hugepages/>
>   </memoryBacking>
> 
> to a more specific
> 
>   <memoryBacking>
>     <hugepages>
>       <page size="16384" unit="KiB"/>
>     </hugepages>
>   </memoryBacking>
> 
> by figuring out the page size for the default hugepage mount,
> which actually sounds like a good idea regardless. Of course users
> would still be able to provide the page size themselves in the
> first place.

Sounds like a good approach.

> Is the 16 MiB page size available for both POWER8 and POWER9?

No.  That's a big part of what makes this such a mess.  HPT has 16MiB
and 16GiB hugepages, RPT has 2MiB and 1GiB hugepages.  (Well, I guess
technically Power9 does have 16MiB pages - but only in hash mode, which
the host won't be).

I've been looking into whether it's feasible to make a 16MiB hugepage
pool for POWER9 RPT.  The hardware can't actually use that as a
pagesize, but we could still allocate them physically contiguous, map
them using a bunch of 2MiB PTEs in RPT mode and allow them to be
mapped by guests in HPT mode.

I *think* it won't be too hard, but I haven't looked close enough to
rule out horrible gotchas yet.

> > > A couple of questions:
> > > 
> > >   * I see the option accepts values 12, 16, 24 and 34, with 16
> > >     being the default.
> > 
> > In fact it should accept any value >= 12, though the ones that you
> > list are the interesting ones.
> 
> Well, I copied them from the QEMU help text, and I kinda assumed
> that you wouldn't just list completely random values there O:-)

Ah, right, of course.

> > This does mean, for example, that if
> > it was just set to the hugepage size on a p9, 21 (2MiB) things should
> > work correctly (in practice it would act identically to setting it to
> > 16).
> 
> Wouldn't that lead to different behavior depending on whether you
> start the guest on a POWER9 or POWER8 machine? The former would be
> able to use 2 MiB hugepages, while the latter would be stuck using
> regular 64 KiB pages.

Well, no, because 2MiB hugepages aren't a thing in HPT mode.  In RPT
mode it'd be able to use 2MiB hugepages either way, because the
limitations only apply to HPT mode.

> Migration of such a guest from POWER9 to
> POWER8 wouldn't work because the hugepage allocation couldn't be
> fulfilled,

Sort of, you couldn't even get as far as starting the incoming qemu
with hpt-mps=21 on the POWER8 (unless you gave it 16MiB hugepages for
backing).

> but the other way around would probably work and lead to
> different page sizes being available inside the guest after a power
> cycle, no?

Well.. there are a few cases here.  If you migrated p8 -> p9 with
hpt-mps=21 on both ends, you couldn't actually start the guest on the
source without giving it hugepage backing.  In which case it'll be
fine on the p9 with hugepage mapping.

If you had hpt-mps=16 on the source and hpt-mps=21 on the other end,
well, you don't get to count on anything because you changed the VM
definition.  In fact it would work in this case, and you wouldn't even
get new page sizes after restart because HPT mode doesn't support any
pagesizes between 64kiB and 16MiB.

> > > I guess 34 corresponds to 1 GiB hugepages?
> > 
> > No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> > machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> > be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> > practice).
> 
> Isn't 1 GiB hugepages support at least being worked on[1]?

That's for radix mode.  Hash mode has 16MiB and 16GiB, no 1GiB.

> > >     Also, in what scenario would 12 be used?
> > 
> > So RHEL, at least, generally configures ppc64 kernels to use 64kiB
> > pages, but 4kiB pages are still supported upstream (not sure if there
> > are any distros that still use that mode).  If your host uses 4kiB
> > pages you wouldn't be able to start a (KVM HV) guest without setting
> > this to 12 (or using a 64kiB hugepage mount).
> 
> Mh, that's annoying, as needing to support 4 KiB pages would most
> likely mean we'd have to turn this into a stand-alone configuration
> knob rather than deriving it entirely from existing ones, which I'd
> prefer as it's clearly much more user-friendly.

Yeah, there's really no way around it though.  Well other than always
restricting to 4kiB pages by default, which would suck for performance
with guests that want to use 64kiB pages.

> I'll check out what other distros are doing: if all the major ones
> are defaulting to 64 KiB pages these days, it might be reasonable
> to do the same and pretend smaller page sizes don't exist at all in
> order to avoid the pain of having to tweak yet another knob, even
> if that means leaving people compiling their own custom kernels
> with 4 KiB page size in the dust.

That's my guess.

> > >   * The name of the property suggests this setting is only relevant
> > >     for HPT guests. libvirt doesn't really have the notion of HPT
> > >     and RPT, and I'm not really itching to introduce it. Can we
> > >     safely use this option for all guests, even RPT ones?
> > 
> > Yes.  The "hpt" in the main is meant to imply that its restriction
> > only applies when the guest is in HPT mode, but it can be safely set
> > in any mode.  In RPT mode guest and host pagesizes are independent of
> > each other, so we don't have to deal with this mess.
> 
> Good :)
> 
> 
> [1] https://patchwork.kernel.org/patch/9729991/

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 6 years ago
On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > I'll check out what other distros are doing: if all the major ones
> > are defaulting to 64 KiB pages these days, it might be reasonable
> > to do the same and pretend smaller page sizes don't exist at all in
> > order to avoid the pain of having to tweak yet another knob, even
> > if that means leaving people compiling their own custom kernels
> > with 4 KiB page size in the dust.
> 
> That's my guess.

I just checked RHEL 7, Fedora 27, OpenSUSE Leap 42.3, Debian 9 and
Ubuntu 16.04: they all use 64 KiB pages, so I'd conclude leaving
out 4 KiB pages support is basically a non-issue.

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 6 years ago
On Mon, Apr 23, 2018 at 10:31:39AM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > > I'll check out what other distros are doing: if all the major ones
> > > are defaulting to 64 KiB pages these days, it might be reasonable
> > > to do the same and pretend smaller page sizes don't exist at all in
> > > order to avoid the pain of having to tweak yet another knob, even
> > > if that means leaving people compiling their own custom kernels
> > > with 4 KiB page size in the dust.
> > 
> > That's my guess.
> 
> I just checked RHEL 7, Fedora 27, OpenSUSE Leap 42.3, Debian 9 and
> Ubuntu 16.04: they all use 64 KiB pages, so I'd conclude leaving
> out 4 KiB pages support is basically a non-issue.

Good to hear, thanks for checking.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 6 years ago
On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > I'll have to look into it to be sure, but I think it should be
> > possible for libvirt to convert a generic
> > 
> >   <memoryBacking>
> >     <hugepages/>
> >   </memoryBacking>
> > 
> > to a more specific
> > 
> >   <memoryBacking>
> >     <hugepages>
> >       <page size="16384" unit="KiB"/>
> >     </hugepages>
> >   </memoryBacking>
> > 
> > by figuring out the page size for the default hugepage mount,
> > which actually sounds like a good idea regardless. Of course users
> > would still be able to provide the page size themselves in the
> > first place.
> 
> Sounds like a good approach.

Unfortunately it seems like this is not going to be feasible, as
POWER8 is apparently the only platform that enforces a strict
relationship between host page size and guest page size: x86,
aarch64 (and I have to assume POWER9 as well?) can reportedly all
deal gracefully with guests migrating between hosts that have
different hugepage mounts configured.

I need to spend some more time digesting the rest of the
information you provided, but as it stands right now I'm starting
to think this might actually need to be its own, explicit opt-in
knob at the libvirt level too after all :(

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 6 years ago
On Tue, Apr 24, 2018 at 05:35:59PM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > > I'll have to look into it to be sure, but I think it should be
> > > possible for libvirt to convert a generic
> > > 
> > >   <memoryBacking>
> > >     <hugepages/>
> > >   </memoryBacking>
> > > 
> > > to a more specific
> > > 
> > >   <memoryBacking>
> > >     <hugepages>
> > >       <page size="16384" unit="KiB"/>
> > >     </hugepages>
> > >   </memoryBacking>
> > > 
> > > by figuring out the page size for the default hugepage mount,
> > > which actually sounds like a good idea regardless. Of course users
> > > would still be able to provide the page size themselves in the
> > > first place.
> > 
> > Sounds like a good approach.
> 
> Unfortunately it seems like this is not going to be feasible, as
> POWER8 is apparently the only platform that enforces a strict
> relationship between host page size and guest page size: x86,
> aarch64 (and I have to assume POWER9 as well?) can reportedly all
> deal gracefully with guests migrating between hosts that have
> different hugepage mounts configured.

Yes, that's right.  As you guessed, POWER9 will also allow this.. at
least as long as the guest is in radix mode.  If the guest is in hash
mode (which includes but isn't limited to POWER8 compat mode guests)
then it will suffer from the same pagesize limitations as POWER8.

> I need to spend some more time digesting the rest of the
> information you provided, but as it stands right now I'm starting
> to think this might actually need to be its own, explicit opt-in
> knob at the libvirt level too after all :(

Poo.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 5 years, 12 months ago
On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > Is the 16 MiB page size available for both POWER8 and POWER9?
> 
> No.  That's a big part of what makes this such a mess.  HPT has 16MiB
> and 16GiB hugepages, RPT has 2MiB and 1GiB hugepages.  (Well, I guess
> technically Power9 does have 16MiB pages - but only in hash mode, which
> the host won't be).
> 
[...]
> > > This does mean, for example, that if
> > > it was just set to the hugepage size on a p9, 21 (2MiB) things should
> > > work correctly (in practice it would act identically to setting it to
> > > 16).
> > 
> > Wouldn't that lead to different behavior depending on whether you
> > start the guest on a POWER9 or POWER8 machine? The former would be
> > able to use 2 MiB hugepages, while the latter would be stuck using
> > regular 64 KiB pages.
> 
> Well, no, because 2MiB hugepages aren't a thing in HPT mode.  In RPT
> mode it'd be able to use 2MiB hugepages either way, because the
> limitations only apply to HPT mode.
> 
> > Migration of such a guest from POWER9 to
> > POWER8 wouldn't work because the hugepage allocation couldn't be
> > fulfilled,
> 
> Sort of, you couldn't even get as far as starting the incoming qemu
> with hpt-mps=21 on the POWER8 (unless you gave it 16MiB hugepages for
> backing).
> 
> > but the other way around would probably work and lead to
> > different page sizes being available inside the guest after a power
> > cycle, no?
> 
> Well.. there are a few cases here.  If you migrated p8 -> p8 with
> hpt-mps=21 on both ends, you couldn't actually start the guest on the
> source without giving it hugepage backing.  In which case it'll be
> fine on the p9 with hugepage mapping.
> 
> If you had hpt-mps=16 on the source and hpt-mps=21 on the other end,
> well, you don't get to count on anything because you changed the VM
> definition.  In fact it would work in this case, and you wouldn't even
> get new page sizes after restart because HPT mode doesn't support any
> pagesizes between 64kiB and 16MiB.
> 
> > > > I guess 34 corresponds to 1 GiB hugepages?
> > > 
> > > No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> > > machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> > > be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> > > practice).
> > 
> > Isn't 1 GiB hugepages support at least being worked on[1]?
> 
> That's for radix mode.  Hash mode has 16MiB and 16GiB, no 1GiB.

So, I've spent some more time trying to wrap my head around the
whole ordeal. I'm still unclear about some of the details, though;
hopefully you'll be willing to answer a few more questions.

Basically the only page sizes you can have for HPT guests are
4 KiB, 64 KiB, 16 MiB and 16 GiB; in each case, for KVM, you need
the guest memory to be backed by host pages which are at least as
big, or it won't work. The same limitation doesn't apply to either
RPT or TCG guests.

The new parameter would make it possible to make sure you will
actually be able to use the page size you're interested in inside
the guest, by preventing it from starting at all if the host didn't
provide big enough backing pages; it would also ensure the guest
gets access to different page sizes when running using TCG as an
accelerator instead of KVM.

For a KVM guest running on a POWER8 host, the matrix would look
like

    b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
  -------- -------- -------- -------- -------- --------
   64 KiB | 64 KiB | 64 KiB |        |        |        |
  -------- -------- -------- -------- -------- --------
   16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
  -------- -------- -------- -------- -------- --------
   16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
  -------- -------- -------- -------- -------- --------

with backing page sizes from top to bottom, requested max page
sizes from left to right, actual max page sizes in the cells and
empty cells meaning the guest won't be able to start; on a POWER9
machine, the matrix would look like

    b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
  -------- -------- -------- -------- -------- --------
   64 KiB | 64 KiB | 64 KiB |        |        |        |
  -------- -------- -------- -------- -------- --------
    2 MiB | 64 KiB | 64 KiB |        |        |        |
  -------- -------- -------- -------- -------- --------
    1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
  -------- -------- -------- -------- -------- --------

instead, and finally on TCG the backing page size wouldn't matter
and you would simply have

    b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
  -------- -------- -------- -------- -------- --------
          | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
  -------- -------- -------- -------- -------- --------
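
To double-check my understanding, here's a small shell sketch of the
rule I think the tables above encode (my own illustration, not QEMU
code; the backing-size check would only apply to KVM HV, while TCG,
RPT and KVM PR guests skip it):

  # Arguments are log2 values: backing page size shift, requested cap.
  # HPT-supported page size shifts are 12, 16, 24 and 34; the guest's
  # ceiling is the largest of those <= the cap, and under KVM HV it
  # must also fit within the backing page size or the guest can't start.
  hpt_effective_max() {
      backing=$1 cap=$2 allowed=12
      for shift in 16 24 34; do
          [ "$shift" -le "$cap" ] && allowed=$shift
      done
      if [ "$allowed" -gt "$backing" ]; then
          echo "guest cannot start"
      else
          echo "largest usable guest page: 2^$allowed bytes"
      fi
  }
  hpt_effective_max 24 30   # 16 MiB backing, cap 30 (1 GiB) -> 2^24
  hpt_effective_max 16 24   # 64 KiB backing, cap 24 -> cannot start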

Does everything up until here make sense?

While trying to figure out this, one of the things I attempted to
do was run a guest in POWER8 compatibility mode on a POWER9 host
and use hugepages for backing, but that didn't seem to work at
all, possibly hinting at the fact that not all of the above is
actually accurate and I need you to correct me :)

This is the command line I used:

  /usr/libexec/qemu-kvm \
  -machine pseries,accel=kvm \
  -cpu host,compat=power8 \
  -m 2048 \
  -mem-prealloc \
  -mem-path /dev/hugepages \
  -smp 8,sockets=8,cores=1,threads=1 \
  -display none \
  -no-user-config \
  -nodefaults \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=vda \
  -drive file=/var/lib/libvirt/images/huge.qcow2,format=qcow2,if=none,id=vda \
  -serial mon:stdio

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 5 years, 12 months ago
On Wed, Apr 25, 2018 at 06:09:26PM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > > Is the 16 MiB page size available for both POWER8 and POWER9?
> > 
> > No.  That's a big part of what makes this such a mess.  HPT has 16MiB
> > and 16GiB hugepages, RPT has 2MiB and 1GiB hugepages.  (Well, I guess
> > technically Power9 does have 16MiB pages - but only in hash mode, which
> > the host won't be).
> > 
> [...]
> > > > This does mean, for example, that if
> > > > it was just set to the hugepage size on a p9, 21 (2MiB) things should
> > > > work correctly (in practice it would act identically to setting it to
> > > > 16).
> > > 
> > > Wouldn't that lead to different behavior depending on whether you
> > > start the guest on a POWER9 or POWER8 machine? The former would be
> > > able to use 2 MiB hugepages, while the latter would be stuck using
> > > regular 64 KiB pages.
> > 
> > Well, no, because 2MiB hugepages aren't a thing in HPT mode.  In RPT
> > mode it'd be able to use 2MiB hugepages either way, because the
> > limitations only apply to HPT mode.
> > 
> > > Migration of such a guest from POWER9 to
> > > POWER8 wouldn't work because the hugepage allocation couldn't be
> > > fulfilled,
> > 
> > Sort of, you couldn't even get as far as starting the incoming qemu
> > with hpt-mps=21 on the POWER8 (unless you gave it 16MiB hugepages for
> > backing).
> > 
> > > but the other way around would probably work and lead to
> > > different page sizes being available inside the guest after a power
> > > cycle, no?
> > 
> > Well.. there are a few cases here.  If you migrated p8 -> p8 with
> > hpt-mps=21 on both ends, you couldn't actually start the guest on the
> > source without giving it hugepage backing.  In which case it'll be
> > fine on the p9 with hugepage mapping.
> > 
> > If you had hpt-mps=16 on the source and hpt-mps=21 on the other end,
> > well, you don't get to count on anything because you changed the VM
> > definition.  In fact it would work in this case, and you wouldn't even
> > get new page sizes after restart because HPT mode doesn't support any
> > pagesizes between 64kiB and 16MiB.
> > 
> > > > > I guess 34 corresponds to 1 GiB hugepages?
> > > > 
> > > > No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> > > > machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> > > > be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> > > > practice).
> > > 
> > > Isn't 1 GiB hugepages support at least being worked on[1]?
> > 
> > That's for radix mode.  Hash mode has 16MiB and 16GiB, no 1GiB.
> 
> So, I've spent some more time trying to wrap my head around the
> whole ordeal. I'm still unclear about some of the details, though;
> hopefully you'll be willing to answer a few more questions.
> 
> Basically the only page sizes you can have for HPT guests are
> 4 KiB, 64 KiB, 16 MiB and 16 GiB; in each case, for KVM, you need
> the guest memory to be backed by host pages which are at least as
> big, or it won't work. The same limitation doesn't apply to either
> RPT or TCG guests.

That's right.  The limitation also doesn't apply to KVM PR, just KVM
HV.

[If you're interested, the reason for the limitation is that unlike
 x86 or POWER9 there aren't separate sets of gva->gpa and gpa->hpa
 pagetables. Instead there's just a single gva->hpa (hash) pagetable
 that's managed by the _host_.  When the guest wants to create a new
 mapping it uses an hcall to insert a PTE, and the hcall
 implementation translates the gpa into an hpa before inserting it
 into the HPT.  The full contents of the real HPT aren't visible to
 the guest, but the precise slot numbers within it are, so the
 assumption that there's an exact 1:1 correspondence between guest
 PTEs and host PTEs is pretty much baked into the PAPR interface.  So,
 if a hugepage is to be inserted into the guest HPT, then it's also
 being inserted into the host HPT, and needs to be really, truly host
 contiguous]

> The new parameter would make it possible to make sure you will
> actually be able to use the page size you're interested in inside
> the guest, by preventing it from starting at all if the host didn't
> provide big enough backing pages;

That's right

> it would also ensure the guest
> gets access to different page sizes when running using TCG as an
> accelerator instead of KVM.

Uh.. it would ensure the guest *doesn't* get access to different page
sizes in TCG vs. KVM.  Is that what you meant to say?

> For a KVM guest running on a POWER8 host, the matrix would look
> like
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>    64 KiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>    16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
>   -------- -------- -------- -------- -------- --------
>    16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
> 
> with backing page sizes from top to bottom, requested max page
> sizes from left to right, actual max page sizes in the cells and
> empty cells meaning the guest won't be able to start; on a POWER9
> machine, the matrix would look like
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>    64 KiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>     2 MiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>     1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
>   -------- -------- -------- -------- -------- --------
> 
> instead, and finally on TCG the backing page size wouldn't matter
> and you would simply have
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>           | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
> 
> Does everything up until here make sense?

Yes, that all looks right.

> While trying to figure out this, one of the things I attempted to
> do was run a guest in POWER8 compatibility mode on a POWER9 host
> and use hugepages for backing, but that didn't seem to work at
> all, possibly hinting at the fact that not all of the above is
> actually accurate and I need you to correct me :)
> 
> This is the command line I used:
> 
>   /usr/libexec/qemu-kvm \
>   -machine pseries,accel=kvm \
>   -cpu host,compat=power8 \
>   -m 2048 \
>   -mem-prealloc \
>   -mem-path /dev/hugepages \
>   -smp 8,sockets=8,cores=1,threads=1 \
>   -display none \
>   -no-user-config \
>   -nodefaults \
>   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=vda \
>   -drive file=/var/lib/libvirt/images/huge.qcow2,format=qcow2,if=none,id=vda \
>   -serial mon:stdio

Ok, so note that the scheme I'm talking about here is *not* merged as
yet.  The above command line will run the guest with 2MiB backing.

With the existing code that should work, but the guest will only be
able to use 64kiB pages.  If it didn't work at all.. there was a bug
fixed relatively recently that broke all hugepage backing, so you
could try updating to a more recent host kernel.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 5 years, 12 months ago
On Thu, 2018-04-26 at 10:55 +1000, David Gibson wrote:
> On Wed, Apr 25, 2018 at 06:09:26PM +0200, Andrea Bolognani wrote:
> > The new parameter would make it possible to make sure you will
> > actually be able to use the page size you're interested in inside
> > the guest, by preventing it from starting at all if the host didn't
> > provide big enough backing pages;
> 
> That's right
> 
> > it would also ensure the guest
> > gets access to different page sizes when running using TCG as an
> > accelerator instead of KVM.
> 
> Uh.. it would ensure the guest *doesn't* get access to different page
> sizes in TCG vs. KVM.  Is that what you meant to say?

Oops, looks like I accidentally a word there. Of course you got it
right and I meant exactly the opposite of what I actually wrote :/

> > For a KVM guest running on a POWER8 host, the matrix would look
> > like
> > 
> >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> >   -------- -------- -------- -------- -------- --------
> >    16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> >   -------- -------- -------- -------- -------- --------
> >    16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> > 
> > with backing page sizes from top to bottom, requested max page
> > sizes from left to right, actual max page sizes in the cells and
> > empty cells meaning the guest won't be able to start; on a POWER9
> > machine, the matrix would look like
> > 
> >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> >   -------- -------- -------- -------- -------- --------
> >     2 MiB | 64 KiB | 64 KiB |        |        |        |
> >   -------- -------- -------- -------- -------- --------
> >     1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> >   -------- -------- -------- -------- -------- --------
> > 
> > instead, and finally on TCG the backing page size wouldn't matter
> > and you would simply have
> > 
> >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> >           | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> > 
> > Does everything up until here make sense?
> 
> Yes, that all looks right.

Cool.

Unfortunately, that pretty much seals the deal on libvirt *not* being
able to infer the value from other guest settings :(

The only reasonable candidate would be the size of host pages used for
backing guest memory; however

  * TCG, RPT and KVM PR guests can't infer anything from it, as they
    are not tied to it. Having different behaviors for TCG and KVM
    would be easy, but differentiating between HPT KVM HV guests and
    all other kinds is something we can't do at the moment, and that
    we have actively resisted doing in the past;

  * the user might want to limit things further, eg. preventing an
    HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
    using hugepages.

With the second use case in mind: would it make sense, or even be
possible, to make it so the capability works for RPT guests too?

Thinking even further, what about other architectures? Is this
something they might want to do too? The scenario I have in mind is
guests backed by regular pages being prevented from using hugepages
with the rationale that they wouldn't have the same performance
characteristics as if they were backed by hugepages; on the opposite
side of the spectrum, you might want to ensure the pages used to
back guest memory are as big as the biggest page you plan to use in
the guest, in order to guarantee the performance characteristics
fully match expectations.

> > While trying to figure out this, one of the things I attempted to
> > do was run a guest in POWER8 compatibility mode on a POWER9 host
> > and use hugepages for backing, but that didn't seem to work at
> > all, possibly hinting at the fact that not all of the above is
> > actually accurate and I need you to correct me :)
> > [...]
> 
> Ok, so note that the scheme I'm talking about here is *not* merged as
> yet.  The above command line will run the guest with 2MiB backing.
> 
> With the existing code that should work, but the guest will only be
> able to use 64kiB pages.

Understood: even without the ability to limit it further, the max
guest page size is obviously still capped by the backing page size.

> If it didn't work at all.. there was a bug
> fixed relatively recently that broke all hugepage backing, so you
> could try updating to a more recent host kernel.

That was probably it then!

I'll see whether I can get a newer kernel running on the host, but
my primary concern was not having gotten the command line (or the
concepts above) completely wrong :)

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 5 years, 12 months ago
On Thu, Apr 26, 2018 at 10:45:40AM +0200, Andrea Bolognani wrote:
> On Thu, 2018-04-26 at 10:55 +1000, David Gibson wrote:
> > On Wed, Apr 25, 2018 at 06:09:26PM +0200, Andrea Bolognani wrote:
> > > The new parameter would make it possible to make sure you will
> > > actually be able to use the page size you're interested in inside
> > > the guest, by preventing it from starting at all if the host didn't
> > > provide big enough backing pages;
> > 
> > That's right
> > 
> > > it would also ensure the guest
> > > gets access to different page sizes when running using TCG as an
> > > accelerator instead of KVM.
> > 
> > Uh.. it would ensure the guest *doesn't* get access to different page
> > sizes in TCG vs. KVM.  Is that what you meant to say?
> 
> Oops, looks like I accidentally a word there. Of course you got it
> right and I meant exactly the opposite of what I actually wrote :/

:)

> > > For a KVM guest running on a POWER8 host, the matrix would look
> > > like
> > > 
> > >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> > >   -------- -------- -------- -------- -------- --------
> > >    16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> > >   -------- -------- -------- -------- -------- --------
> > >    16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > > 
> > > with backing page sizes from top to bottom, requested max page
> > > sizes from left to right, actual max page sizes in the cells and
> > > empty cells meaning the guest won't be able to start; on a POWER9
> > > machine, the matrix would look like
> > > 
> > >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> > >   -------- -------- -------- -------- -------- --------
> > >     2 MiB | 64 KiB | 64 KiB |        |        |        |
> > >   -------- -------- -------- -------- -------- --------
> > >     1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> > >   -------- -------- -------- -------- -------- --------
> > > 
> > > instead, and finally on TCG the backing page size wouldn't matter
> > > and you would simply have
> > > 
> > >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > >           | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > > 
> > > Does everything up until here make sense?
> > 
> > Yes, that all looks right.
> 
> Cool.
> 
> Unfortunately, that pretty much seals the deal on libvirt *not* being
> able to infer the value from other guest settings :(
> 
> The only reasonable candidate would be the size of host pages used for
> backing guest memory; however

Right.

>   * TCG, RPT and KVM PR guests can't infer anything from it, as they
>     are not tied to it. Having different behaviors for TCG and KVM
>     would be easy, but differentiating between HPT KVM HV guest and
>     all other kinds is something we can't do at the moment, and that
>     in the past have actively resisted doing;

Yeah, I certainly wouldn't recommend that.  It's basically what we're
doing in qemu now, and I want to change that, because it's a bad idea.

It still would be possible to key off the host side hugepage size, but
apply the limit to all VMs - in a sense crippling TCG guests to give
them matching behaviour to KVM guests.

>   * the user might want to limit things further, eg. preventing an
>     HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
>     using hugepages.

Right.. note that with the draft qemu patches a TCG guest will be
prevented from using hugepages *by default* (the default value of the
capability is 16).  You have to explicitly change it to allow
hugepages to be used in a TCG guest (but you don't have to supply
hugepage backing).
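
Concretely, something along these lines (again using the draft
parameter name from this series, so treat it purely as a sketch)
would let a TCG guest use 16MiB pages, with no hugepage backing
required:

  qemu-system-ppc64 \
    -machine pseries,accel=tcg,cap-hpt-mps=24 \
    -m 2048 -display none -serial mon:stdio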

> With the second use case in mind: would it make sense, or even be
> possible, to make it so the capability works for RPT guests too?

Possible, maybe.. I think there's another property where RPT pagesizes
are advertised.  But I think it's a bad idea.  In order to have the
normal HPT case work consistently we need to set the default cap value
to 16 (64kiB page max).  If that applied to RPT guests as well, we'd
be unnecessarily crippling nearly all RPT guests.

> Thinking even further, what about other architectures? Is this
> something they might want to do too? The scenario I have in mind is
> guests backed by regular pages being prevented from using hugepages
> with the rationale that they wouldn't have the same performance
> characteristics as if they were backed by hugepages; on the opposite
> side of the spectrum, you might want to ensure the pages used to
> back guest memory are as big as the biggest page you plan to use in
> the guest, in order to guarantee the performance characteristics
> fully match expectations.

Hm, well, you'd have to ask other arch people if they see a use for
that.  It doesn't look very useful to me.  I don't think libvirt can
or should ensure identical performance characteristics for a guest
across all possible migrations.  But for HPT guests, it's not a matter
of performance characteristics: if it tries to use a large page size
and KVM doesn't have large enough backing pages, the guest will
quickly just freeze on a page fault that can never be satisfied.

> > > While trying to figure out this, one of the things I attempted to
> > > do was run a guest in POWER8 compatibility mode on a POWER9 host
> > > and use hugepages for backing, but that didn't seem to work at
> > > all, possibly hinting at the fact that not all of the above is
> > > actually accurate and I need you to correct me :)
> > > [...]
> > 
> > Ok, so note that the scheme I'm talking about here is *not* merged as
> > yet.  The above command line will run the guest with 2MiB backing.
> > 
> > With the existing code that should work, but the guest will only be
> > able to use 64kiB pages.
> 
> Understood: even without the ability to limit it further, the max
> guest page size is obviously still capped by the backing page size.
> 
> > If it didn't work at all.. there was a bug
> > fixed relatively recently that broke all hugepage backing, so you
> > could try updating to a more recent host kernel.
> 
> That was probably it then!
> 
> I'll see whether I can get a newer kernel running on the host, but
> my primary concern was not having gotten the command line (or the
> concepts above) completely wrong :)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 5 years, 12 months ago
On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> On Thu, Apr 26, 2018 at 10:45:40AM +0200, Andrea Bolognani wrote:
> > Unfortunately, that pretty much seals the deal on libvirt *not* being
> > able to infer the value from other guest settings :(
> > 
> > The only reasonable candidate would be the size of host pages used for
> > backing guest memory; however
> 
> Right.
> 
> >   * TCG, RPT and KVM PR guests can't infer anything from it, as they
> >     are not tied to it. Having different behaviors for TCG and KVM
> >     would be easy, but differentiating between HPT KVM HV guest and
> >     all other kinds is something we can't do at the moment, and that
> >     in the past have actively resisted doing;
> 
> Yeah, I certainly wouldn't recommend that.  It's basically what we're
> doing in qemu now, and I want to change, because it's a bad idea.
> 
> It still would be possible to key off the host side hugepage size, but
> apply the limit to all VMs - in a sense crippling TCG guests to give
> them matching behaviour to KVM guests.

As you yourself mention later...

> >   * the user might want to limit things further, eg. preventing an
> >     HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
> >     using hugepages.
> 
> Right.. note that with the draft qemu patches a TCG guest will be
> prevented from using hugepages *by default* (the default value of the
> capability is 16).  You have to explicitly change it to allow
> hugepages to be used in a TCG guest (but you don't have to supply
> hugepage backing).

... this will already happen. That's okay[1], we can't really
avoid it if we want to ensure consistent behavior between KVM and
TCG.

> > With the second use case in mind: would it make sense, or even be
> > possible, to make it so the capability works for RPT guests too?
> 
> Possible, maybe.. I think there's another property where RPT pagesizes
> are advertised.  But I think it's a bad idea.  In order to have the
> normal HPT case work consistently we need to set the default cap value
> to 16 (64kiB page max).  If that applied to RPT guests as well, we'd
> be unnecessarily crippling nearly all RPT guests.
> 
> > Thinking even further, what about other architectures? Is this
> > something they might want to do too? The scenario I have in mind is
> > guests backed by regular pages being prevented from using hugepages
> > with the rationale that they wouldn't have the same performance
> > characteristics as if they were backed by hugepages; on the opposite
> > side of the spectrum, you might want to ensure the pages used to
> > back guest memory are as big as the biggest page you plan to use in
> > the guest, in order to guarantee the performance characteristics
> > fully match expectations.
> 
> Hm, well, you'd have to ask other arch people if they see a use for
> that.  It doesn't look very useful to me.  I don't think libvirt can
> or should ensure identical performance characteristics for a guest
> across all possible migrations.  But for HPT guests, it's not a matter
> of performance characteristics: if it tries to use a large page size
> and KVM doesn't have large enough backing pages, the guest will
> quickly just freeze on a page fault that can never be satisfied.

I realize only HPT guests *need* this, but I was trying to figure
out whether giving the host administrator more control over the
guest page size could be a useful feature in other cases as well,
as it sounds to me like it's more generally applicable.

Users already need to opt-in to using hugepages in the host; asking
to opt-in to guest hugepages support as well doesn't seem too
outlandish to me.

Even if the specific flags required vary between architectures, we
could expose this in a unified fashion in libvirt. However, if this
is not something people would consider useful, we can just have a
pSeries-specific setting instead.


[1] That's of course assuming you have made sure the restriction
    only applies to the 2.13 machine type forward, and existing
    guests are not affected by the change.
-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 5 years, 12 months ago
On Fri, Apr 27, 2018 at 10:31:10AM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 10:45:40AM +0200, Andrea Bolognani wrote:
> > > Unfortunately, that pretty much seals the deal on libvirt *not* being
> > > able to infer the value from other guest settings :(
> > > 
> > > The only reasonable candidate would be the size of host pages used for
> > > backing guest memory; however
> > 
> > Right.
> > 
> > >   * TCG, RPT and KVM PR guests can't infer anything from it, as they
> > >     are not tied to it. Having different behaviors for TCG and KVM
> > >     would be easy, but differentiating between HPT KVM HV guest and
> > >     all other kinds is something we can't do at the moment, and that
> > >     in the past have actively resisted doing;
> > 
> > Yeah, I certainly wouldn't recommend that.  It's basically what we're
> > doing in qemu now, and I want to change, because it's a bad idea.
> > 
> > It still would be possible to key off the host-side hugepage size, but
> > apply the limit to all VMs - in a sense crippling TCG guests to give
> > them behaviour matching that of KVM guests.
> 
> As you yourself mention later...

Right, I basically already made the decision to cripple TCG for KVM
compatibility at the qemu level (by default, at least).

> > >   * the user might want to limit things further, eg. preventing an
> > >     HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
> > >     using hugepages.
> > 
> > Right.. note that with the draft qemu patches a TCG guest will be
> > prevented from using hugepages *by default* (the default value of the
> > capability is 16).  You have to explicitly change it to allow
> > hugepages to be used in a TCG guest (but you don't have to supply
> > hugepage backing).
> 
> ... this will already happen. That's okay[1], we can't really
> avoid it if we want to ensure consistent behavior between KVM and
> TCG.

So.. regarding [1].  The draft patches *do* change the behaviour on
older machine types.  I'll consider revisiting that, but I'd need to
be convinced.  Basically we have to choose between consistency between
accelerators and consistency between versions.  I think the former is
the better choice; at least I think it is given that we *can* get both
for the overwhelmingly common case in production (KVM HV).

> > > With the second use case in mind: would it make sense, or even be
> > > possible, to make it so the capability works for RPT guests too?
> > 
> > Possible, maybe.. I think there's another property where RPT pagesizes
> > are advertised.  But I think it's a bad idea.  In order to have the
> > normal HPT case work consistently we need to set the default cap value
> > to 16 (64kiB page max).  If that applied to RPT guests as well, we'd
> > be unnecessarily crippling nearly all RPT guests.
> > 
> > > Thinking even further, what about other architectures? Is this
> > > something they might want to do too? The scenario I have in mind is
> > > guests backed by regular pages being prevented from using hugepages
> > > with the rationale that they wouldn't have the same performance
> > > characteristics as if they were backed by hugepages; on the opposite
> > > side of the spectrum, you might want to ensure the pages used to
> > > back guest memory are as big as the biggest page you plan to use in
> > > the guest, in order to guarantee the performance characteristics
> > > fully match expectations.
> > 
> > Hm, well, you'd have to ask other arch people if they see a use for
> > that.  It doesn't look very useful to me.  I don't think libvirt can
> > or should ensure identical performance characteristics for a guest
> > across all possible migrations.  But for HPT guests, it's not a matter
> > of performance characteristics: if it tries to use a large page size
> > and KVM doesn't have large enough backing pages, the guest will
> > quickly just freeze on a page fault that can never be satisfied.
> 
> I realize only HPT guests *need* this, but I was trying to figure
> out whether giving the host administrator more control over the
> guest page size could be a useful feature in other cases as well,
> since it sounds to me like it's more generally applicable.

Perhaps, but I don't see a strong case for it.

> Users already need to opt in to using hugepages on the host; asking
> them to opt in to guest hugepage support as well doesn't seem too
> outlandish to me.
> 
> Even if the specific flags required vary between architectures, we
> could expose this in a unified fashion in libvirt. However, if this
> is not something people would consider useful, we can just have a
> pSeries-specific setting instead.

The trouble is, if we made a generic knob for limiting guest
pagesizes, then I think we'd need *another* pseries-specific knob so
that we can set the HPT and RPT limits differently:

It seems the common case is going to be allowing unlimited RPT
pagesizes (because there's really no downside to that), but limited
HPT pagesizes, so that we can migrate without requiring hugepage
backing everywhere.  A single knob that applied in all cases wouldn't
let us do that.
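
To make that concrete, the split might end up looking something like
this (cap-rpt-mps is purely hypothetical, it doesn't exist in this
series; the values are page shifts, so 16 = 64kiB, 30 = 1GiB):

    -machine pseries,cap-hpt-mps=16                   # HPT capped at 64kiB
    -machine pseries,cap-hpt-mps=16,cap-rpt-mps=30    # hypothetical RPT knob, effectively unlimited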

> [1] That's of course assuming you have made sure the restriction
>     only applies from the 2.13 machine type onwards, and existing
>     guests are not affected by the change.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by Andrea Bolognani 5 years, 11 months ago
On Fri, 2018-04-27 at 22:17 +1000, David Gibson wrote:
> On Fri, Apr 27, 2018 at 10:31:10AM +0200, Andrea Bolognani wrote:
> > On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> > > Right.. note that with the draft qemu patches a TCG guest will be
> > > prevented from using hugepages *by default* (the default value of the
> > > capability is 16).  You have to explicitly change it to allow
> > > hugepages to be used in a TCG guest (but you don't have to supply
> > > hugepage backing).
> > 
> > ... this will already happen. That's okay[1], we can't really
> > avoid it if we want to ensure consistent behavior between KVM and
> > TCG.
> 
> So.. regarding [1].  The draft patches *do* change the behaviour on
> older machine types.  I'll consider revisiting that, but I'd need to
> be convinced.  Basically we have to choose between consistency between
> accelerators and consistency between versions.  I think the former is
> the better choice; at least I think it is given that we *can* get both
> for the overwhelmingly common case in production (KVM HV).

Forgot to answer this point.

I agree that consistency between accelerators is the sane option
going forward, but changing the behavior for old machine types will
cause existing guests which have been using hugepages to lose the
ability to do so after being restarted on a newer QEMU.

Isn't that exactly the kind of scenario versioned machine types are
supposed to prevent?

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Posted by David Gibson 5 years, 10 months ago
On Mon, May 07, 2018 at 03:48:54PM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-27 at 22:17 +1000, David Gibson wrote:
> > On Fri, Apr 27, 2018 at 10:31:10AM +0200, Andrea Bolognani wrote:
> > > On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> > > > Right.. note that with the draft qemu patches a TCG guest will be
> > > > prevented from using hugepages *by default* (the default value of the
> > > > capability is 16).  You have to explicitly change it to allow
> > > > hugepages to be used in a TCG guest (but you don't have to supply
> > > > hugepage backing).
> > > 
> > > ... this will already happen. That's okay[1], we can't really
> > > avoid it if we want to ensure consistent behavior between KVM and
> > > TCG.
> > 
> > So.. regarding [1].  The draft patches *do* change the behaviour on
> > older machine types.  I'll consider revisiting that, but I'd need to
> > be convinced.  Basically we have to choose between consistency between
> > accelerators and consistency between versions.  I think the former is
> > the better choice; at least I think it is given that we *can* get both
> > for the overwhelmingly common case in production (KVM HV).
> 
> Forgot to answer this point.
> 
> I agree that consistency between accelerators is the sane option
> going forward, but changing the behavior for old machine types will
> cause existing guests which have been using hugepages to lose the
> ability to do so after being restarted on a newer QEMU.
> 
> Isn't that exactly the kind of scenario versioned machine types are
> supposed to prevent?

Yeah, it is.  What I was questioning was whether it was important
enough for the case of TCG and PR guests (which have never been as
well supported) to justify keeping the other inconsistency.

On reflection, I think it probably does.. and I also think I have a
way to preserve it without having to keep around masses of the old
code, so I'll adjust that for the next spin.
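
Something along the lines of the existing spapr_caps default_caps
pattern ought to be enough here - the following is only a rough
sketch, with the cap name and the "unlimited" value purely
illustrative, not necessarily what the next spin will actually do:

    static void spapr_machine_2_12_class_options(MachineClass *mc)
    {
        sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);

        spapr_machine_2_13_class_options(mc);
        SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_12);

        /* pseries <= 2.12: don't restrict guest HPT page sizes, so
         * existing hugepage-backed guests keep working after upgrade */
        smc->default_caps.caps[SPAPR_CAP_HPT_MPS] = 64;
    }

That way only new machine types get the restrictive default, and the
old behaviour is preserved without carrying the old code paths around.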

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson