The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the areas are inaccessible (and hence cannot be updated by
Xen) when in guest-user mode.
In preparation of the introduction of new vCPU operations allowing to
register the respective areas (one of the two is x86-specific) by
guest-physical address, flesh out the map/unmap functions.
Noteworthy differences from map_vcpu_info():
- areas can be registered more than once (and de-registered),
- remote vCPU-s are paused rather than checked for being down (which in
principle can change right after the check),
- the domain lock is taken for a much smaller region.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: By using global domain page mappings the demand on the underlying
VA range may increase significantly. I did consider to use per-
domain mappings instead, but they exist for x86 only. Of course we
could have arch_{,un}map_guest_area() aliasing global domain page
mapping functions on Arm and using per-domain mappings on x86. Yet
then again map_vcpu_info() doesn't do so either (albeit that's
likely to be converted subsequently to use map_vcpu_area() anyway).
RFC: In map_guest_area() I'm not checking the P2M type, instead - just
like map_vcpu_info() - solely relying on the type ref acquisition.
Checking for p2m_ram_rw alone would be wrong, as at least
p2m_ram_logdirty ought to also be okay to use here (and in similar
cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
used here (like altp2m_vcpu_enable_ve() does) as well as in
map_vcpu_info(), yet then again the P2M type is stale by the time
it is being looked at anyway without the P2M lock held.
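For context, the populate() hook is what the individual registration handlers
(added by subsequent patches) are expected to supply, to seed the freshly
mapped page before it is installed. Purely as an illustration of the intended
shape - name and body here are hypothetical, not part of this series - such a
hook might look like:

    /* Hypothetical populate() hook: seed a runstate area with the vCPU's
     * current runstate data. */
    static void cf_check runstate_area_populate(void *map, struct vcpu *v)
    {
        struct vcpu_runstate_info *info = map;

        *info = v->runstate;
    }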
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v))
 {
-    return -EOPNOTSUPP;
+    struct domain *currd = v->domain;
+    void *map = NULL;
+    struct page_info *pg = NULL;
+    int rc = 0;
+
+    if ( gaddr )
+    {
+        unsigned long gfn = PFN_DOWN(gaddr);
+        unsigned int align;
+        p2m_type_t p2mt;
+
+        if ( gfn != PFN_DOWN(gaddr + size - 1) )
+            return -ENXIO;
+
+#ifdef CONFIG_COMPAT
+        if ( has_32bit_shinfo(currd) )
+            align = alignof(compat_ulong_t);
+        else
+#endif
+            align = alignof(xen_ulong_t);
+        if ( gaddr & (align - 1) )
+            return -ENXIO;
+
+        rc = check_get_page_from_gfn(currd, _gfn(gfn), false, &p2mt, &pg);
+        if ( rc )
+            return rc;
+
+        if ( !get_page_type(pg, PGT_writable_page) )
+        {
+            put_page(pg);
+            return -EACCES;
+        }
+
+        map = __map_domain_page_global(pg);
+        if ( !map )
+        {
+            put_page_and_type(pg);
+            return -ENOMEM;
+        }
+        map += PAGE_OFFSET(gaddr);
+    }
+
+    if ( v != current )
+    {
+        if ( !spin_trylock(&currd->hypercall_deadlock_mutex) )
+        {
+            rc = -ERESTART;
+            goto unmap;
+        }
+
+        vcpu_pause(v);
+
+        spin_unlock(&currd->hypercall_deadlock_mutex);
+    }
+
+    domain_lock(currd);
+
+    if ( map )
+        populate(map, v);
+
+    SWAP(area->pg, pg);
+    SWAP(area->map, map);
+
+    domain_unlock(currd);
+
+    if ( v != current )
+        vcpu_unpause(v);
+
+ unmap:
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
+
+    return rc;
 }
 
 /*
@@ -1573,6 +1648,22 @@ int map_guest_area(struct vcpu *v, paddr
  */
 void unmap_guest_area(struct vcpu *v, struct guest_area *area)
 {
+    struct domain *d = v->domain;
+    void *map;
+    struct page_info *pg;
+
+    domain_lock(d);
+    map = area->map;
+    area->map = NULL;
+    pg = area->pg;
+    area->pg = NULL;
+    domain_unlock(d);
+
+    if ( pg )
+    {
+        unmap_domain_page_global(map);
+        put_page_and_type(pg);
+    }
 }
 
 int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
On 19/10/2022 8:43 am, Jan Beulich wrote:
> The registration by virtual/linear address has downsides: At least on
> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
> PV domains the areas are inaccessible (and hence cannot be updated by
> Xen) when in guest-user mode.
They're also inaccessible in HVM guests (x86 and ARM) when Meltdown
mitigations are in place.
And let's not get started on the multitude of layering violations that is
guest_memory_policy() for nested virt. In fact, prohibiting any form of
map-by-va is a prerequisite to any rational attempt to make nested virt work.
(In fact, that infrastructure needs purging before any other
architecture picks up stubs too.)
They're also inaccessible in general because no architecture has
hypervisor privilege in a normal user/supervisor split, and you don't
know whether the mapping is over supervisor or user mapping, and
settings like SMAP/PAN can cause the pagewalk to fail even when the
mapping is in place.
There are a lot of good reasons why map-by-va should never have happened.
Even for PV guests, map-by-gfn (and let the guest manage whatever
virtual mappings it wants on its own) would have been closer to the
status quo for how real hardware worked, and critically would have
avoided the restriction that the areas had to live at a globally fixed
position to be useful.
>
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, flesh out the map/unmap functions.
>
> Noteworthy differences from map_vcpu_info():
> - areas can be registered more than once (and de-registered),
When registration by GFN is available, there is never a good reason to
register the same area twice.
The guest maps one MMIO-like region, and then constructs all the regular
virtual addresses mapping it (or not) that it wants.
This API is new, so we can enforce sane behaviour from the outset. In
particular, it will help with ...
> - remote vCPU-s are paused rather than checked for being down (which in
> principle can change right after the check),
> - the domain lock is taken for a much smaller region.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: By using global domain page mappings the demand on the underlying
> VA range may increase significantly. I did consider to use per-
> domain mappings instead, but they exist for x86 only. Of course we
> could have arch_{,un}map_guest_area() aliasing global domain page
> mapping functions on Arm and using per-domain mappings on x86. Yet
> then again map_vcpu_info() doesn't do so either (albeit that's
> likely to be converted subsequently to use map_vcpu_area() anyway).
... this by providing a bound on the amount of vmap() space that can be consumed.
>
> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
> like map_vcpu_info() - solely relying on the type ref acquisition.
> Checking for p2m_ram_rw alone would be wrong, as at least
> p2m_ram_logdirty ought to also be okay to use here (and in similar
> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
> used here (like altp2m_vcpu_enable_ve() does) as well as in
> map_vcpu_info(), yet then again the P2M type is stale by the time
> it is being looked at anyway without the P2M lock held.
Again, another error caused by Xen not knowing the guest physical
address layout. These mappings should be restricted to just RAM regions
and I think we want to enforce that right from the outset.
~Andrew
On 17.01.2023 23:04, Andrew Cooper wrote:
> On 19/10/2022 8:43 am, Jan Beulich wrote:
>> The registration by virtual/linear address has downsides: At least on
>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>> PV domains the areas are inaccessible (and hence cannot be updated by
>> Xen) when in guest-user mode.
>
> They're also inaccessible in HVM guests (x86 and ARM) when Meltdown
> mitigations are in place.
I've added this explicitly, but ...
> And let's not get started on the multitude of layering violations that is
> guest_memory_policy() for nested virt. In fact, prohibiting any form of
> map-by-va is a prerequisite to any rational attempt to make nested virt work.
>
> (In fact, that infrastructure needs purging before any other
> architecture picks up stubs too.)
>
> They're also inaccessible in general because no architecture has
> hypervisor privilege in a normal user/supervisor split, and you don't
> know whether the mapping is over supervisor or user mapping, and
> settings like SMAP/PAN can cause the pagewalk to fail even when the
> mapping is in place.
... I'm now merely saying that there are yet more reasons, rather than
trying to enumerate them all.
>> In preparation of the introduction of new vCPU operations allowing to
>> register the respective areas (one of the two is x86-specific) by
>> guest-physical address, flesh out the map/unmap functions.
>>
>> Noteworthy differences from map_vcpu_info():
>> - areas can be registered more than once (and de-registered),
>
> When registration by GFN is available, there is never a good reason to
> register the same area twice.
Why not? Why shouldn't different entities be permitted to register their
areas, one after the other? This at the very least requires a way to
de-register.
> The guest maps one MMIO-like region, and then constructs all the regular
> virtual addresses mapping it (or not) that it wants.
>
> This API is new, so we can enforce sane behaviour from the outset. In
> particular, it will help with ...
>
>> - remote vCPU-s are paused rather than checked for being down (which in
>> principle can change right after the check),
>> - the domain lock is taken for a much smaller region.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: By using global domain page mappings the demand on the underlying
>> VA range may increase significantly. I did consider to use per-
>> domain mappings instead, but they exist for x86 only. Of course we
>> could have arch_{,un}map_guest_area() aliasing global domain page
>> mapping functions on Arm and using per-domain mappings on x86. Yet
>> then again map_vcpu_info() doesn't do so either (albeit that's
>> likely to be converted subsequently to use map_vcpu_area() anyway).
>
> ... this by providing a bound on the amount of vmap() space that can be consumed.
I'm afraid I don't understand. When re-registering a different area, the
earlier one will be unmapped. The consumption of vmap space cannot grow
(or else we'd have a resource leak and hence an XSA).
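To spell this out, the hand-over in map_guest_area() is (annotated excerpt
from the patch above, comments added here for illustration only):

    SWAP(area->pg, pg);    /* whatever was registered before ends up in pg */
    SWAP(area->map, map);  /* ... and its previous mapping in map */

    domain_unlock(currd);
    ...
 unmap:
    if ( pg )
    {
        unmap_domain_page_global(map);  /* old mapping torn down */
        put_page_and_type(pg);          /* old page references dropped */
    }

So at most one mapping per area exists at any point in time.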
>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>> like map_vcpu_info() - solely relying on the type ref acquisition.
>> Checking for p2m_ram_rw alone would be wrong, as at least
>> p2m_ram_logdirty ought to also be okay to use here (and in similar
>> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>> used here (like altp2m_vcpu_enable_ve() does) as well as in
>> map_vcpu_info(), yet then again the P2M type is stale by the time
>> it is being looked at anyway without the P2M lock held.
>
> Again, another error caused by Xen not knowing the guest physical
> address layout. These mappings should be restricted to just RAM regions
> and I think we want to enforce that right from the outset.
Meaning what exactly in terms of action for me to take? As said, checking
the P2M type is pointless. So without you being more explicit, all I can
take your reply for is merely a comment, with no action on my part (not
even to remove this RFC remark).
Jan
On 18/01/2023 9:55 am, Jan Beulich wrote:
> On 17.01.2023 23:04, Andrew Cooper wrote:
>> On 19/10/2022 8:43 am, Jan Beulich wrote:
>>> The registration by virtual/linear address has downsides: At least on
>>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>>> PV domains the areas are inaccessible (and hence cannot be updated by
>>> Xen) when in guest-user mode.
>> They're also inaccessible in HVM guests (x86 and ARM) when Meltdown
>> mitigations are in place.
> I've added this explicitly, but ...
>
>> And let's not get started on the multitude of layering violations that is
>> guest_memory_policy() for nested virt. In fact, prohibiting any form of
>> map-by-va is a prerequisite to any rational attempt to make nested virt work.
>>
>> (In fact, that infrastructure needs purging before any other
>> architecture picks up stubs too.)
>>
>> They're also inaccessible in general because no architecture has
>> hypervisor privilege in a normal user/supervisor split, and you don't
>> know whether the mapping is over supervisor or user mapping, and
>> settings like SMAP/PAN can cause the pagewalk to fail even when the
>> mapping is in place.
> ... I'm now merely saying that there are yet more reasons, rather than
> trying to enumerate them all.
That's fine. I just wanted to point out that its far more reasons that
were given the first time around.
>>> In preparation of the introduction of new vCPU operations allowing to
>>> register the respective areas (one of the two is x86-specific) by
>>> guest-physical address, flesh out the map/unmap functions.
>>>
>>> Noteworthy differences from map_vcpu_info():
>>> - areas can be registered more than once (and de-registered),
>> When registration by GFN is available, there is never a good reason to
>> register the same area twice.
> Why not? Why shouldn't different entities be permitted to register their
> areas, one after the other? This at the very least requires a way to
> de-register.
Because it's useless and extra complexity. From the point of view of
any guest, it's an MMIO(ish) window that Xen happens to update the
content of.
You don't get systems where you can ask hardware for e.g. "another copy
of the HPET at mfn $foo please".
>> The guest maps one MMIO-like region, and then constructs all the regular
>> virtual addresses mapping it (or not) that it wants.
>>
>> This API is new, so we can enforce sane behaviour from the outset. In
>> particular, it will help with ...
>>
>>> - remote vCPU-s are paused rather than checked for being down (which in
>>> principle can change right after the check),
>>> - the domain lock is taken for a much smaller region.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> RFC: By using global domain page mappings the demand on the underlying
>>> VA range may increase significantly. I did consider to use per-
>>> domain mappings instead, but they exist for x86 only. Of course we
>>> could have arch_{,un}map_guest_area() aliasing global domain page
>>> mapping functions on Arm and using per-domain mappings on x86. Yet
>>> then again map_vcpu_info() doesn't do so either (albeit that's
>>> likely to be converted subsequently to use map_vcpu_area() anyway).
>> ... this by providing a bound on the amount of vmap() space that can be consumed.
> I'm afraid I don't understand. When re-registering a different area, the
> earlier one will be unmapped. The consumption of vmap space cannot grow
> (or else we'd have a resource leak and hence an XSA).
In which case you mean "can be re-registered elsewhere". More
specifically, the area can be moved, and isn't a singleton operation
like map_vcpu_info was.
The wording as presented firmly suggests the presence of an XSA.
>>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>>> like map_vcpu_info() - solely relying on the type ref acquisition.
>>> Checking for p2m_ram_rw alone would be wrong, as at least
>>> p2m_ram_logdirty ought to also be okay to use here (and in similar
>>> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>>> used here (like altp2m_vcpu_enable_ve() does) as well as in
>>> map_vcpu_info(), yet then again the P2M type is stale by the time
>>> it is being looked at anyway without the P2M lock held.
>> Again, another error caused by Xen not knowing the guest physical
>> address layout. These mappings should be restricted to just RAM regions
>> and I think we want to enforce that right from the outset.
> Meaning what exactly in terms of action for me to take? As said, checking
> the P2M type is pointless. So without you being more explicit, all I can
> take your reply for is merely a comment, with no action on my part (not
> even to remove this RFC remark).
There will become a point where it will need to become prohibited to
issue this against something which isn't p2m_type_ram. If we had a sane
idea of the guest physmap, I'd go as far as saying E820_RAM, but that's
clearly not feasible yet.
Even now, absolutely nothing good can possibly come of e.g. trying to
overlay it on the grant table, or a grant mapping.
ram || logdirty ought to exclude most cases we care about the guest
(not) putting the mapping.
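For illustration only, the kind of check meant here would sit right after the
check_get_page_from_gfn() call; whether p2m_is_ram() is the suitable predicate
on every architecture is left open:

    rc = check_get_page_from_gfn(currd, _gfn(gfn), false, &p2mt, &pg);
    if ( rc )
        return rc;

    /* Sketch: refuse anything which isn't plain RAM (incl. logdirty). */
    if ( !p2m_is_ram(p2mt) )
    {
        put_page(pg);
        return -EACCES;
    }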
~Andrew
On 20.01.2023 19:15, Andrew Cooper wrote:
> On 18/01/2023 9:55 am, Jan Beulich wrote:
>> On 17.01.2023 23:04, Andrew Cooper wrote:
>>> On 19/10/2022 8:43 am, Jan Beulich wrote:
>>>> In preparation of the introduction of new vCPU operations allowing to
>>>> register the respective areas (one of the two is x86-specific) by
>>>> guest-physical address, flesh out the map/unmap functions.
>>>>
>>>> Noteworthy differences from map_vcpu_info():
>>>> - areas can be registered more than once (and de-registered),
>>> When registration by GFN is available, there is never a good reason to
>>> register the same area twice.
>> Why not? Why shouldn't different entities be permitted to register their
>> areas, one after the other? This at the very least requires a way to
>> de-register.
>
> Because it's useless and extra complexity.
As to this: Looking at the code I think that I would actually add complexity
(just a little - an extra check) to prevent re-registration. Things come out
more naturally, from what I can tell, by allowing it.
This can also be seen in "common: convert vCPU info area registration" where
I'm actually adding such a (conditional) check to maintain the "no
re-registration" property of the sub-op there. Granted there can be an
argument towards making that check unconditional then ...
Jan
On 20.01.2023 19:15, Andrew Cooper wrote:
> On 18/01/2023 9:55 am, Jan Beulich wrote:
>> On 17.01.2023 23:04, Andrew Cooper wrote:
>>> On 19/10/2022 8:43 am, Jan Beulich wrote:
>>>> Noteworthy differences from map_vcpu_info():
>>>> - areas can be registered more than once (and de-registered),
>>> When registration by GFN is available, there is never a good reason to
>>> register the same area twice.
>> Why not? Why shouldn't different entities be permitted to register their
>> areas, one after the other? This at the very least requires a way to
>> de-register.
>
> Because it's useless and extra complexity. From the point of view of
> any guest, it's an MMIO(ish) window that Xen happens to update the
> content of.
>
> You don't get systems where you can ask hardware for e.g. "another copy
> of the HPET at mfn $foo please".
I/O ports appear in multiple places on many systems. I think MMIO regions
can, too. And then I don't see why there couldn't be a way to actually
control this (via e.g. some chipset specific register).
>>>> RFC: By using global domain page mappings the demand on the underlying
>>>> VA range may increase significantly. I did consider to use per-
>>>> domain mappings instead, but they exist for x86 only. Of course we
>>>> could have arch_{,un}map_guest_area() aliasing global domain page
>>>> mapping functions on Arm and using per-domain mappings on x86. Yet
>>>> then again map_vcpu_info() doesn't do so either (albeit that's
>>>> likely to be converted subsequently to use map_vcpu_area() anyway).
>>> ... this by providing a bound on the amount of vmap() space that can be consumed.
>> I'm afraid I don't understand. When re-registering a different area, the
>> earlier one will be unmapped. The consumption of vmap space cannot grow
>> (or else we'd have a resource leak and hence an XSA).
>
> In which case you mean "can be re-registered elsewhere". More
> specifically, the area can be moved, and isn't a singleton operation
> like map_vcpu_info was.
>
> The wording as presented firmly suggests the presence of an XSA.
You mean the "map_vcpu_info() doesn't do so either"? That talks about the
function not using per-domain mappings. There's no connection at all that
I can see to a missed unmapping, which at this point is the only thing I
can deduce you might be referring to.
>>>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>>>> like map_vcpu_info() - solely relying on the type ref acquisition.
>>>> Checking for p2m_ram_rw alone would be wrong, as at least
>>>> p2m_ram_logdirty ought to also be okay to use here (and in similar
>>>> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>>>> used here (like altp2m_vcpu_enable_ve() does) as well as in
>>>> map_vcpu_info(), yet then again the P2M type is stale by the time
>>>> it is being looked at anyway without the P2M lock held.
>>> Again, another error caused by Xen not knowing the guest physical
>>> address layout. These mappings should be restricted to just RAM regions
>>> and I think we want to enforce that right from the outset.
>> Meaning what exactly in terms of action for me to take? As said, checking
>> the P2M type is pointless. So without you being more explicit, all I can
>> take your reply for is merely a comment, with no action on my part (not
>> even to remove this RFC remark).
>
> There will become a point where it will need to become prohibited to
> issue this against something which isn't p2m_type_ram. If we had a sane
> idea of the guest physmap, I'd go as far as saying E820_RAM, but that's
> clearly not feasible yet.
>
> Even now, absolutely nothing good can possibly come of e.g. trying to
> overlay it on the grant table, or a grant mapping.
>
> ram || logdirty ought to exclude most cases we care about the guest
> (not) putting the mapping.
It's still not clear to me what you want me to do: If I add the P2M type
check here including log-dirty, then this will be inconsistent with what
we do elsewhere _and_ useless code (for the time being). I hope you're
not making a scope-creeping request for me to "fix" all the other places
(I may not have found all) where such a P2M type check is either missing
or failing to include log-dirty.
Jan
Hi Jan,
I am still digesting this series and replying with some initial comments.
On 19/10/2022 09:43, Jan Beulich wrote:
> The registration by virtual/linear address has downsides: At least on
> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
> PV domains the areas are inaccessible (and hence cannot be updated by
> Xen) when in guest-user mode.
>
> In preparation of the introduction of new vCPU operations allowing to
> register the respective areas (one of the two is x86-specific) by
> guest-physical address, flesh out the map/unmap functions.
>
> Noteworthy differences from map_vcpu_info():
> - areas can be registered more than once (and de-registered),
> - remote vCPU-s are paused rather than checked for being down (which in
> principle can change right after the check),
> - the domain lock is taken for a much smaller region.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: By using global domain page mappings the demand on the underlying
> VA range may increase significantly. I did consider to use per-
> domain mappings instead, but they exist for x86 only. Of course we
> could have arch_{,un}map_guest_area() aliasing global domain page
> mapping functions on Arm and using per-domain mappings on x86. Yet
> then again map_vcpu_info() doesn't do so either (albeit that's
> likely to be converted subsequently to use map_vcpu_area() anyway).
>
> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
> like map_vcpu_info() - solely relying on the type ref acquisition.
> Checking for p2m_ram_rw alone would be wrong, as at least
> p2m_ram_logdirty ought to also be okay to use here (and in similar
> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
> used here (like altp2m_vcpu_enable_ve() does) as well as in
> map_vcpu_info(), yet then again the P2M type is stale by the time
> it is being looked at anyway without the P2M lock held.
>
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
> struct guest_area *area,
> void (*populate)(void *dst, struct vcpu *v))
> {
> - return -EOPNOTSUPP;
> + struct domain *currd = v->domain;
> + void *map = NULL;
> + struct page_info *pg = NULL;
> + int rc = 0;
> +
> + if ( gaddr )
0 is technically a valid (guest) physical address on Arm.
> + {
> + unsigned long gfn = PFN_DOWN(gaddr);
This could be gfn_t for adding some type safety.
> + unsigned int align;
> + p2m_type_t p2mt;
> +
> + if ( gfn != PFN_DOWN(gaddr + size - 1) )
> + return -ENXIO;
> +
> +#ifdef CONFIG_COMPAT
> + if ( has_32bit_shinfo(currd) )
> + align = alignof(compat_ulong_t);
> + else
> +#endif
> + align = alignof(xen_ulong_t);
> + if ( gaddr & (align - 1) )
> + return -ENXIO;
> +
> + rc = check_get_page_from_gfn(currd, _gfn(gfn), false, &p2mt, &pg);
> + if ( rc )
> + return rc;
> +
> + if ( !get_page_type(pg, PGT_writable_page) )
> + {
> + put_page(pg);
> + return -EACCES;
> + }
> +
> + map = __map_domain_page_global(pg);
> + if ( !map )
> + {
> + put_page_and_type(pg);
> + return -ENOMEM;
> + }
> + map += PAGE_OFFSET(gaddr);
> + }
> +
> + if ( v != current )
> + {
> + if ( !spin_trylock(&currd->hypercall_deadlock_mutex) )
> + {
> + rc = -ERESTART;
> + goto unmap;
> + }
> +
> + vcpu_pause(v);
AFAIU, the goal of vcpu_pause() here is to guarantee that the "area"
will not be touched by another pCPU. However, looking at the function
context_switch() we have:
sched_context_switched(prev, next);
_update_runstate_area();
The first function will set v->is_running to false (see
vcpu_context_saved()). So I think the "area" could be touched even after
vcpu_pause() has returned.
Therefore, I think we will need _update_runstate_area() (or
update_runstate_area()) to be called first.
> +
> + spin_unlock(&currd->hypercall_deadlock_mutex);
> + }
> +
> + domain_lock(currd);
> +
> + if ( map )
> + populate(map, v);
> +
> + SWAP(area->pg, pg);
> + SWAP(area->map, map);
> +
> + domain_unlock(currd);
> +
> + if ( v != current )
> + vcpu_unpause(v);
> +
> + unmap:
> + if ( pg )
> + {
> + unmap_domain_page_global(map);
> + put_page_and_type(pg);
> + }
> +
> + return rc;
> }
>
> /*
> @@ -1573,6 +1648,22 @@ int map_guest_area(struct vcpu *v, paddr
> */
> void unmap_guest_area(struct vcpu *v, struct guest_area *area)
> {
> + struct domain *d = v->domain;
> + void *map;
> + struct page_info *pg;
AFAIU, the assumption is the vCPU should be paused here. Should we add
an ASSERT()?
> +
> + domain_lock(d);
> + map = area->map;
> + area->map = NULL;
> + pg = area->pg;
> + area->pg = NULL;
> + domain_unlock(d);
> +
> + if ( pg )
> + {
> + unmap_domain_page_global(map);
> + put_page_and_type(pg);
> + }
> }
>
> int default_initialise_vcpu(struct vcpu *v, XEN_GUEST_HANDLE_PARAM(void) arg)
>
Cheers,
--
Julien Grall
On 24/11/2022 9:29 pm, Julien Grall wrote:
> Hi Jan,
>
> I am still digesting this series and replying with some initial comments.
>
> On 19/10/2022 09:43, Jan Beulich wrote:
>> The registration by virtual/linear address has downsides: At least on
>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>> PV domains the areas are inaccessible (and hence cannot be updated by
>> Xen) when in guest-user mode.
>>
>> In preparation of the introduction of new vCPU operations allowing to
>> register the respective areas (one of the two is x86-specific) by
>> guest-physical address, flesh out the map/unmap functions.
>>
>> Noteworthy differences from map_vcpu_info():
>> - areas can be registered more than once (and de-registered),
>> - remote vCPU-s are paused rather than checked for being down (which in
>> principle can change right after the check),
>> - the domain lock is taken for a much smaller region.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: By using global domain page mappings the demand on the underlying
>> VA range may increase significantly. I did consider to use per-
>> domain mappings instead, but they exist for x86 only. Of course we
>> could have arch_{,un}map_guest_area() aliasing global domain page
>> mapping functions on Arm and using per-domain mappings on x86. Yet
>> then again map_vcpu_info() doesn't do so either (albeit that's
>> likely to be converted subsequently to use map_vcpu_area()
>> anyway).
>>
>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>> like map_vcpu_info() - solely relying on the type ref acquisition.
>> Checking for p2m_ram_rw alone would be wrong, as at least
>> p2m_ram_logdirty ought to also be okay to use here (and in similar
>> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>> used here (like altp2m_vcpu_enable_ve() does) as well as in
>> map_vcpu_info(), yet then again the P2M type is stale by the time
>> it is being looked at anyway without the P2M lock held.
>>
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>> struct guest_area *area,
>> void (*populate)(void *dst, struct vcpu *v))
>> {
>> - return -EOPNOTSUPP;
>> + struct domain *currd = v->domain;
>> + void *map = NULL;
>> + struct page_info *pg = NULL;
>> + int rc = 0;
>> +
>> + if ( gaddr )
>
> 0 is technically a valid (guest) physical address on Arm.
It is on x86 too, but that's not why 0 is generally considered an
invalid address.
See the multitude of XSAs, and near-XSAs which have been caused by bad
logic in Xen caused by trying to make a variable held in struct
vcpu/domain have a default value other than 0.
It's not impossible to write such code safely, and in this case I expect
it can be done by the NULLness (or not) of the mapping pointer, rather
than by stashing the gaddr, but history has proved repeatedly that this
is a very fertile source of security bugs.
~Andrew
On 17.01.2023 23:20, Andrew Cooper wrote:
> On 24/11/2022 9:29 pm, Julien Grall wrote:
>> On 19/10/2022 09:43, Jan Beulich wrote:
>>> The registration by virtual/linear address has downsides: At least on
>>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>>> PV domains the areas are inaccessible (and hence cannot be updated by
>>> Xen) when in guest-user mode.
>>>
>>> In preparation of the introduction of new vCPU operations allowing to
>>> register the respective areas (one of the two is x86-specific) by
>>> guest-physical address, flesh out the map/unmap functions.
>>>
>>> Noteworthy differences from map_vcpu_info():
>>> - areas can be registered more than once (and de-registered),
>>> - remote vCPU-s are paused rather than checked for being down (which in
>>> principle can change right after the check),
>>> - the domain lock is taken for a much smaller region.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> RFC: By using global domain page mappings the demand on the underlying
>>> VA range may increase significantly. I did consider to use per-
>>> domain mappings instead, but they exist for x86 only. Of course we
>>> could have arch_{,un}map_guest_area() aliasing global domain page
>>> mapping functions on Arm and using per-domain mappings on x86. Yet
>>> then again map_vcpu_info() doesn't do so either (albeit that's
>>> likely to be converted subsequently to use map_vcpu_area()
>>> anyway).
>>>
>>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>>> like map_vcpu_info() - solely relying on the type ref acquisition.
>>> Checking for p2m_ram_rw alone would be wrong, as at least
>>> p2m_ram_logdirty ought to also be okay to use here (and in similar
>>> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>>> used here (like altp2m_vcpu_enable_ve() does) as well as in
>>> map_vcpu_info(), yet then again the P2M type is stale by the time
>>> it is being looked at anyway without the P2M lock held.
>>>
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>>> struct guest_area *area,
>>> void (*populate)(void *dst, struct vcpu *v))
>>> {
>>> - return -EOPNOTSUPP;
>>> + struct domain *currd = v->domain;
>>> + void *map = NULL;
>>> + struct page_info *pg = NULL;
>>> + int rc = 0;
>>> +
>>> + if ( gaddr )
>>
>> 0 is technically a valid (guest) physical address on Arm.
>
> It is on x86 too, but that's not why 0 is generally considered an
> invalid address.
>
> See the multitude of XSAs, and near-XSAs which have been caused by bad
> logic in Xen caused by trying to make a variable held in struct
> vcpu/domain have a default value other than 0.
>
> It's not impossible to write such code safely, and in this case I expect
> it can be done by the NULLness (or not) of the mapping pointer, rather
> than by stashing the gaddr, but history has proved repeatedly that this
> is a very fertile source of security bugs.
I'm checking a value passed in from the guest here. No checking of internal
state can replace that. The checks on internal state leverage zero-init:
 unmap:
    if ( pg )
    {
        unmap_domain_page_global(map);
        put_page_and_type(pg);
    }
It's also not clear to me whether, like Julien looks to have read it, you
mean to ask that I revert back to using 0 as the "invalid" (i.e. request
for unmap) indicator.
Jan
On 18/01/2023 9:59 am, Jan Beulich wrote:
> On 17.01.2023 23:20, Andrew Cooper wrote:
>> On 24/11/2022 9:29 pm, Julien Grall wrote:
>>> On 19/10/2022 09:43, Jan Beulich wrote:
>>>> The registration by virtual/linear address has downsides: At least on
>>>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>>>> PV domains the areas are inaccessible (and hence cannot be updated by
>>>> Xen) when in guest-user mode.
>>>>
>>>> In preparation of the introduction of new vCPU operations allowing to
>>>> register the respective areas (one of the two is x86-specific) by
>>>> guest-physical address, flesh out the map/unmap functions.
>>>>
>>>> Noteworthy differences from map_vcpu_info():
>>>> - areas can be registered more than once (and de-registered),
>>>> - remote vCPU-s are paused rather than checked for being down (which in
>>>> principle can change right after the check),
>>>> - the domain lock is taken for a much smaller region.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> RFC: By using global domain page mappings the demand on the underlying
>>>> VA range may increase significantly. I did consider to use per-
>>>> domain mappings instead, but they exist for x86 only. Of course we
>>>> could have arch_{,un}map_guest_area() aliasing global domain page
>>>> mapping functions on Arm and using per-domain mappings on x86. Yet
>>>> then again map_vcpu_info() doesn't do so either (albeit that's
>>>> likely to be converted subsequently to use map_vcpu_area()
>>>> anyway).
>>>>
>>>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>>>> like map_vcpu_info() - solely relying on the type ref acquisition.
>>>> Checking for p2m_ram_rw alone would be wrong, as at least
>>>> p2m_ram_logdirty ought to also be okay to use here (and in similar
>>>> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>>>> used here (like altp2m_vcpu_enable_ve() does) as well as in
>>>> map_vcpu_info(), yet then again the P2M type is stale by the time
>>>> it is being looked at anyway without the P2M lock held.
>>>>
>>>> --- a/xen/common/domain.c
>>>> +++ b/xen/common/domain.c
>>>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>>>> struct guest_area *area,
>>>> void (*populate)(void *dst, struct vcpu *v))
>>>> {
>>>> - return -EOPNOTSUPP;
>>>> + struct domain *currd = v->domain;
>>>> + void *map = NULL;
>>>> + struct page_info *pg = NULL;
>>>> + int rc = 0;
>>>> +
>>>> + if ( gaddr )
>>> 0 is technically a valid (guest) physical address on Arm.
>> It is on x86 too, but that's not why 0 is generally considered an
>> invalid address.
>>
>> See the multitude of XSAs, and near-XSAs which have been caused by bad
>> logic in Xen caused by trying to make a variable held in struct
>> vcpu/domain have a default value other than 0.
>>
>> It's not impossible to write such code safely, and in this case I expect
>> it can be done by the NULLness (or not) of the mapping pointer, rather
>> than by stashing the gaddr, but history has proved repeatedly that this
>> is a very fertile source of security bugs.
> I'm checking a value passed in from the guest here. No checking of internal
> state can replace that. The checks on internal state leverage zero-init:
>
>  unmap:
>     if ( pg )
>     {
>         unmap_domain_page_global(map);
>         put_page_and_type(pg);
>     }
>
> It's also not clear to me whether, like Julien looks to have read it, you
> mean to ask that I revert back to using 0 as the "invalid" (i.e. request
> for unmap) indicator.
I'm merely asking you to be extra careful and not add bugs to the error
paths. And it appears that you have done it based on the NULLness of
the mapping pointer, which is fine.
~Andrew
Hi Andrew,
On 17/01/2023 22:20, Andrew Cooper wrote:
> On 24/11/2022 9:29 pm, Julien Grall wrote:
>> Hi Jan,
>>
>> I am still digesting this series and replying with some initial comments.
>>
>> On 19/10/2022 09:43, Jan Beulich wrote:
>>> The registration by virtual/linear address has downsides: At least on
>>> x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
>>> PV domains the areas are inaccessible (and hence cannot be updated by
>>> Xen) when in guest-user mode.
>>>
>>> In preparation of the introduction of new vCPU operations allowing to
>>> register the respective areas (one of the two is x86-specific) by
>>> guest-physical address, flesh out the map/unmap functions.
>>>
>>> Noteworthy differences from map_vcpu_info():
>>> - areas can be registered more than once (and de-registered),
>>> - remote vCPU-s are paused rather than checked for being down (which in
>>> principle can change right after the check),
>>> - the domain lock is taken for a much smaller region.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> RFC: By using global domain page mappings the demand on the underlying
>>> VA range may increase significantly. I did consider to use per-
>>> domain mappings instead, but they exist for x86 only. Of course we
>>> could have arch_{,un}map_guest_area() aliasing global domain page
>>> mapping functions on Arm and using per-domain mappings on x86. Yet
>>> then again map_vcpu_info() doesn't do so either (albeit that's
>>> likely to be converted subsequently to use map_vcpu_area()
>>> anyway).
>>>
>>> RFC: In map_guest_area() I'm not checking the P2M type, instead - just
>>> like map_vcpu_info() - solely relying on the type ref acquisition.
>>> Checking for p2m_ram_rw alone would be wrong, as at least
>>> p2m_ram_logdirty ought to also be okay to use here (and in similar
>>> cases, e.g. in Argo's find_ring_mfn()). p2m_is_pageable() could be
>>> used here (like altp2m_vcpu_enable_ve() does) as well as in
>>> map_vcpu_info(), yet then again the P2M type is stale by the time
>>> it is being looked at anyway without the P2M lock held.
>>>
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>>> struct guest_area *area,
>>> void (*populate)(void *dst, struct vcpu *v))
>>> {
>>> - return -EOPNOTSUPP;
>>> + struct domain *currd = v->domain;
>>> + void *map = NULL;
>>> + struct page_info *pg = NULL;
>>> + int rc = 0;
>>> +
>>> + if ( gaddr )
>>
>> 0 is technically a valid (guest) physical address on Arm.
>
> It is on x86 too, but that's not why 0 is generally considered an
> invalid address.
>
> See the multitude of XSAs, and near-XSAs which have been caused by bad
> logic in Xen caused by trying to make a variable held in struct
> vcpu/domain have a default value other than 0.
>
> It's not impossible to write such code safely, and in this case I expect
> it can be done by the NULLness (or not) of the mapping pointer, rather
> than by stashing the gaddr, but history has proved repeatedly that this
> is a very fertile source of security bugs.
I understand the security concern. However... you are now imposing a
constraint on the guest OS which may be more difficult to address.
Furthermore, we are trying to make a sane ABI. It doesn't look sane to
me to expose our current shortcomings to the guest OS. All the more so
if we decide to relax it in the future: that would not help an OS,
because it will still need to support older Xen...
Cheers,
--
Julien Grall
On 24.11.2022 22:29, Julien Grall wrote:
> On 19/10/2022 09:43, Jan Beulich wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>> struct guest_area *area,
>> void (*populate)(void *dst, struct vcpu *v))
>> {
>> - return -EOPNOTSUPP;
>> + struct domain *currd = v->domain;
>> + void *map = NULL;
>> + struct page_info *pg = NULL;
>> + int rc = 0;
>> +
>> + if ( gaddr )
>
> 0 is technically a valid (guest) physical address on Arm.
I guess it is everywhere; it certainly also is on x86. While perhaps a
little unfortunate in ordering, the public header changes coming only
in the following patches was the best way I could think of to split
this work into reasonable size pieces. There the special meaning of 0
is clearly documented. And I don't really see it as a meaningful
limitation to not allow guests to register such areas at address zero.
>> + {
>> + unsigned long gfn = PFN_DOWN(gaddr);
>
> This could be gfn_t for adding some type safety.
Indeed I did consider doing so, but the resulting code would imo be
less legible. But this difference perhaps isn't significant enough
for me to object to changing, in case you (or others) think the
type safety is really a meaningful gain here.
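For reference, the gfn_t variant would look roughly like this (sketch only):

    gfn_t gfn = _gfn(PFN_DOWN(gaddr));

    if ( !gfn_eq(gfn, _gfn(PFN_DOWN(gaddr + size - 1))) )
        return -ENXIO;

    /* ... alignment check unchanged ... */

    rc = check_get_page_from_gfn(currd, gfn, false, &p2mt, &pg);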
>> + unsigned int align;
>> + p2m_type_t p2mt;
>> +
>> + if ( gfn != PFN_DOWN(gaddr + size - 1) )
>> + return -ENXIO;
>> +
>> +#ifdef CONFIG_COMPAT
>> + if ( has_32bit_shinfo(currd) )
>> + align = alignof(compat_ulong_t);
>> + else
>> +#endif
>> + align = alignof(xen_ulong_t);
>> + if ( gaddr & (align - 1) )
>> + return -ENXIO;
>> +
>> + rc = check_get_page_from_gfn(currd, _gfn(gfn), false, &p2mt, &pg);
>> + if ( rc )
>> + return rc;
>> +
>> + if ( !get_page_type(pg, PGT_writable_page) )
>> + {
>> + put_page(pg);
>> + return -EACCES;
>> + }
>> +
>> + map = __map_domain_page_global(pg);
>> + if ( !map )
>> + {
>> + put_page_and_type(pg);
>> + return -ENOMEM;
>> + }
>> + map += PAGE_OFFSET(gaddr);
>> + }
>> +
>> + if ( v != current )
>> + {
>> + if ( !spin_trylock(&currd->hypercall_deadlock_mutex) )
>> + {
>> + rc = -ERESTART;
>> + goto unmap;
>> + }
>> +
>> + vcpu_pause(v);
>
> AFAIU, the goal of vcpu_pause() here is to guarantee that the "area"
> will not be touched by another pCPU.
Hmm, I find the way you put it a little confusing, but yes. I'd express
it as "The purpose of the vcpu_pause() is to guarantee that the vCPU in
question won't use its own area while the location thereof is being
updated." This includes updates by Xen as well as guest side consumption
of the data (with focus on the former, yes).
> However, looking at the function context_switch() we have:
>
> sched_context_switched(prev, next);
> _update_runstate_area();
With this really being
_update_runstate_area(next);
...
> The first function will set v->is_running to false (see
> vcpu_context_saved()). So I think the "area" could be touched even after
> vcpu_pause() has returned.
>
> Therefore, I think we will need _update_runstate_area() (or
> update_runstate_area()) to be called first.
... I don't see a need for adjustment. The corresponding
_update_runstate_area(prev);
sits quite a bit earlier in context_switch(). (Arm code is quite a bit
different, but this particular aspect looks to apply there as well.)
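In outline (abridged to just the calls relevant here, per the above):

    _update_runstate_area(prev);         /* prev's area written here */
    /* ... context switching proper ... */
    sched_context_switched(prev, next);  /* prev->is_running goes false here */
    _update_runstate_area(next);         /* from here on only next's own area is written */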
>> @@ -1573,6 +1648,22 @@ int map_guest_area(struct vcpu *v, paddr
>> */
>> void unmap_guest_area(struct vcpu *v, struct guest_area *area)
>> {
>> + struct domain *d = v->domain;
>> + void *map;
>> + struct page_info *pg;
>
> AFAIU, the assumption is the vCPU should be paused here.
Yes, as the comment ahead of the function (introduced by an earlier
patch) says.
> Should we add an ASSERT()?
I was going from unmap_vcpu_info(), which had the same requirement,
while also only recording it by way of a comment. I certainly could
add an ASSERT(), but besides this being questionable as to the rules
set forth in ./CODING_STYLE I also view assertions of "paused" state
as being of limited use - the entity in question may become unpaused
on the clock cycle after the check was done. (But yes, such are no
different from e.g. the fair number of spin_is_locked() checks we
have scattered around, which don't really provide guarantees either.)
Jan
Hi Jan,
On 28/11/2022 10:01, Jan Beulich wrote:
> On 24.11.2022 22:29, Julien Grall wrote:
>> On 19/10/2022 09:43, Jan Beulich wrote:
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>>> struct guest_area *area,
>>> void (*populate)(void *dst, struct vcpu *v))
>>> {
>>> - return -EOPNOTSUPP;
>>> + struct domain *currd = v->domain;
>>> + void *map = NULL;
>>> + struct page_info *pg = NULL;
>>> + int rc = 0;
>>> +
>>> + if ( gaddr )
>>
>> 0 is technically a valid (guest) physical address on Arm.
>
> I guess it is everywhere; it certainly also is on x86. While perhaps a
> little unfortunate in ordering, the public header changes coming only
> in the following patches was the best way I could think of to split
> this work into reasonable size pieces. There the special meaning of 0
> is clearly documented. And I don't really see it as a meaningful
> limitation to not allow guests to register such areas at address zero.
I would expect an OS to allocate the region using the generic physical
allocator. This allocator may decide that '0' is a valid address and
return it.
So I think your approach could potentially complicate the OS
implementation. I think it would be better to use an all Fs value as
this cannot be valid for this hypercall (we require at least 4-byte
alignment).
>
>>> + {
>>> + unsigned long gfn = PFN_DOWN(gaddr);
>>
>> This could be gfn_t for adding some type safety.
>
> Indeed I did consider doing so, but the resulting code would imo be
> less legible. But this difference perhaps isn't significant enough
> for me to object to changing, in case you (or others) think the
> type safety is really a meaningful gain here.
In general, my preference is to always use the typesafe version because
it reduces the number of "unsigned long".
[...]
>> The first function will set v->is_running to false (see
>> vcpu_context_saved()). So I think the "area" could be touched even after
>> vcpu_pause() has returned.
>>
>> Therefore, I think we will need _update_runstate_area() (or
>> update_runstate_area()) to be called first.
>
> ... I don't see a need for adjustment. The corresponding
>
> _update_runstate_area(prev);
>
> sits quite a bit earlier in context_switch(). (Arm code is quite a bit
> different, but this particular aspect looks to apply there as well.)
You are right. Sorry I misread the code.
>
>>> @@ -1573,6 +1648,22 @@ int map_guest_area(struct vcpu *v, paddr
>>> */
>>> void unmap_guest_area(struct vcpu *v, struct guest_area *area)
>>> {
>>> + struct domain *d = v->domain;
>>> + void *map;
>>> + struct page_info *pg;
>>
>> AFAIU, the assumption is the vCPU should be paused here.
>
> Yes, as the comment ahead of the function (introduced by an earlier
> patch) says.
Ah, I missed that. Thanks!
>
>> Should we add an ASSERT()?
>
> I was going from unmap_vcpu_info(), which had the same requirement,
> while also only recording it by way of a comment. I certainly could
> add an ASSERT(), but besides this being questionable as to the rules
> set forth in ./CODING_STYLE I also view assertions of "paused" state
> as being of limited use - the entity in question may become unpaused
> on the clock cycle after the check was done.
That's correct. However, that race you mention is unlikely to happen
*every* time. So there is a very high chance the ASSERT() will hit if
miscalled.
> (But yes, such are no
> different from e.g. the fair number of spin_is_locked() checks we
> have scattered around, which don't really provide guarantees either.)
And that's fine to not provide the full guarantee. You are still
probably going to catch 95% (if not more) of the callers that forgot to
call it with the spin lock held.
At least for me, those ASSERTs() were super helpful during development
in more than a few cases.
Cheers,
--
Julien Grall
On 29.11.2022 09:40, Julien Grall wrote:
> On 28/11/2022 10:01, Jan Beulich wrote:
>> On 24.11.2022 22:29, Julien Grall wrote:
>>> On 19/10/2022 09:43, Jan Beulich wrote:
>>>> --- a/xen/common/domain.c
>>>> +++ b/xen/common/domain.c
>>>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>>>> struct guest_area *area,
>>>> void (*populate)(void *dst, struct vcpu *v))
>>>> {
>>>> - return -EOPNOTSUPP;
>>>> + struct domain *currd = v->domain;
>>>> + void *map = NULL;
>>>> + struct page_info *pg = NULL;
>>>> + int rc = 0;
>>>> +
>>>> + if ( gaddr )
>>>
>>> 0 is technically a valid (guest) physical address on Arm.
>>
>> I guess it is everywhere; it certainly also is on x86. While perhaps a
>> little unfortunate in ordering, the public header changes coming only
>> in the following patches was the best way I could think of to split
>> this work into reasonable size pieces. There the special meaning of 0
>> is clearly documented. And I don't really see it as a meaningful
>> limitation to not allow guests to register such areas at address zero.
> I would expect an OS to allocate the region using the generic physical
> allocator. This allocator may decide that '0' is a valid address and
> return it.
>
> So I think your approach could potentially complicate the OS
> implementation. I think it would be better to use an all Fs value as
> this cannot be valid for this hypercall (we require at least 4-byte
> alignment).
Valid point, yet my avoiding of an all-Fs value was specifically with
compat callers in mind - the values would be different for these and
native ones, which would make the check more clumsy (otherwise it
could simply be "if ( ~gaddr )").
>>>> @@ -1573,6 +1648,22 @@ int map_guest_area(struct vcpu *v, paddr
>>>> */
>>>> void unmap_guest_area(struct vcpu *v, struct guest_area *area)
>>>> {
>>>> + struct domain *d = v->domain;
>>>> + void *map;
>>>> + struct page_info *pg;
>>>
>>> AFAIU, the assumption is the vCPU should be paused here.
>>
>> Yes, as the comment ahead of the function (introduced by an earlier
>> patch) says.
>
> Ah, I missed that. Thanks!
>
>>
>>> Should we add an ASSERT()?
>>
>> I was going from unmap_vcpu_info(), which had the same requirement,
>> while also only recording it by way of a comment. I certainly could
>> add an ASSERT(), but besides this being questionable as to the rules
>> set forth in ./CODING_STYLE I also view assertions of "paused" state
>> as being of limited use - the entity in question may become unpaused
>> on the clock cycle after the check was done.
>
> That's correct. However, that race you mention is unlikely to happen
> *every* time. So there is a very high chance the ASSERT() will hit if
> miscalled.
>
>> (But yes, such are no
>> different from e.g. the fair number of spin_is_locked() checks we
>> have scattered around, which don't really provide guarantees either.)
> And that's fine to not provide the full guarantee. You are still
> probably going to catch 95% (if not more) of the callers that forgot to
> call it with the spin lock held.
>
> At least for me, those ASSERTs() were super helpful during development
> in more than a few cases.
Okay, I'll add one then, but in the earlier patch, matching the comment
that's introduced there.
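Something along these lines then (exact shape to be settled in that earlier
patch; a remote vCPU - or its whole domain - has to be paused):

    if ( v != current )
        ASSERT(atomic_read(&v->pause_count) | atomic_read(&d->pause_count));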
Jan
Hi Jan,
On 29/11/2022 14:02, Jan Beulich wrote:
> On 29.11.2022 09:40, Julien Grall wrote:
>> On 28/11/2022 10:01, Jan Beulich wrote:
>>> On 24.11.2022 22:29, Julien Grall wrote:
>>>> On 19/10/2022 09:43, Jan Beulich wrote:
>>>>> --- a/xen/common/domain.c
>>>>> +++ b/xen/common/domain.c
>>>>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>>>>> struct guest_area *area,
>>>>> void (*populate)(void *dst, struct vcpu *v))
>>>>> {
>>>>> - return -EOPNOTSUPP;
>>>>> + struct domain *currd = v->domain;
>>>>> + void *map = NULL;
>>>>> + struct page_info *pg = NULL;
>>>>> + int rc = 0;
>>>>> +
>>>>> + if ( gaddr )
>>>>
>>>> 0 is technically a valid (guest) physical address on Arm.
>>>
>>> I guess it is everywhere; it certainly also is on x86. While perhaps a
>>> little unfortunate in ordering, the public header changes coming only
>>> in the following patches was the best way I could think of to split
>>> this work into reasonable size pieces. There the special meaning of 0
>>> is clearly documented. And I don't really see it as a meaningful
>>> limitation to not allow guests to register such areas at address zero.
>> I would expect an OS to allocate the region using the generic physical
>> allocator. This allocator may decide that '0' is a valid address and
>> return it.
>>
>> So I think your approach could potentially complicate the OS
>> implementation. I think it would be better to use an all Fs value as
>> this cannot be valid for this hypercall (we require at least 4-byte
>> alignment).
>
> Valid point, yet my avoiding of an all-Fs value was specifically with
> compat callers in mind - the values would be different for these and
> native ones, which would make the check more clumsy (otherwise it
> could simply be "if ( ~gaddr )").
Ah I forgot about compat. How about converting the 32-bit Fs to a 64-bit
Fs in the compat code?
This will avoid adding a restriction to the hypercall interface just
because of compat.
>
>>>>> @@ -1573,6 +1648,22 @@ int map_guest_area(struct vcpu *v, paddr
>>>>> */
>>>>> void unmap_guest_area(struct vcpu *v, struct guest_area *area)
>>>>> {
>>>>> + struct domain *d = v->domain;
>>>>> + void *map;
>>>>> + struct page_info *pg;
>>>>
>>>> AFAIU, the assumption is the vCPU should be paused here.
>>>
>>> Yes, as the comment ahead of the function (introduced by an earlier
>>> patch) says.
>>
>> Ah, I missed that. Thanks!
>>
>>>
>>>> Should we add an ASSERT()?
>>>
>>> I was going from unmap_vcpu_info(), which had the same requirement,
>>> while also only recording it by way of a comment. I certainly could
>>> add an ASSERT(), but besides this being questionable as to the rules
>>> set forth in ./CODING_STYLE I also view assertions of "paused" state
>>> as being of limited use - the entity in question may become unpaused
>>> on the clock cycle after the check was done.
>>
>> That's correct. However, that race you mention is unlikely to happen
>> *every* time. So there is a very high chance the ASSERT() will hit if
>> miscalled.
>>
>>> (But yes, such are no
>>> different from e.g. the fair number of spin_is_locked() checks we
>>> have scattered around, which don't really provide guarantees either.)
>> And that's fine to not provide the full guarantee. You are still
>> probably going to catch 95% (if not more) of the callers that forgot to
>> call it with the spin lock held.
>>
>> At least for me, those ASSERTs() were super helpful during development
>> in more than a few cases.
>
> Okay, I'll add one then, but in the earlier patch, matching the comment
> that's introduced there.
Thanks! I still owe you a review for the rest of the series.
Cheers,
--
Julien Grall
On 06.12.2022 19:27, Julien Grall wrote:
> On 29/11/2022 14:02, Jan Beulich wrote:
>> On 29.11.2022 09:40, Julien Grall wrote:
>>> On 28/11/2022 10:01, Jan Beulich wrote:
>>>> On 24.11.2022 22:29, Julien Grall wrote:
>>>>> On 19/10/2022 09:43, Jan Beulich wrote:
>>>>>> --- a/xen/common/domain.c
>>>>>> +++ b/xen/common/domain.c
>>>>>> @@ -1563,7 +1563,82 @@ int map_guest_area(struct vcpu *v, paddr
>>>>>> struct guest_area *area,
>>>>>> void (*populate)(void *dst, struct vcpu *v))
>>>>>> {
>>>>>> - return -EOPNOTSUPP;
>>>>>> + struct domain *currd = v->domain;
>>>>>> + void *map = NULL;
>>>>>> + struct page_info *pg = NULL;
>>>>>> + int rc = 0;
>>>>>> +
>>>>>> + if ( gaddr )
>>>>>
>>>>> 0 is technically a valid (guest) physical address on Arm.
>>>>
>>>> I guess it is everywhere; it certainly also is on x86. While perhaps a
>>>> little unfortunate in ordering, the public header changes coming only
>>>> in the following patches was the best way I could think of to split
>>>> this work into reasonable size pieces. There the special meaning of 0
>>>> is clearly documented. And I don't really see it as a meaningful
>>>> limitation to not allow guests to register such areas at address zero.
>>> I would expect an OS to allocate the region using the generic physical
>>> allocator. This allocator may decide that '0' is a valid address and
>>> return it.
>>>
>>> So I think your approach could potentially complicate the OS
>>> implementation. I think it would be better to use an all Fs value as
>>> this cannot be valid for this hypercall (we require at least 4-byte
>>> alignment).
>>
>> Valid point, yet my avoiding of an all-Fs value was specifically with
>> compat callers in mind - the values would be different for these and
>> native ones, which would make the check more clumsy (otherwise it
>> could simply be "if ( ~gaddr )").
>
> Ah I forgot about compat. How about converting the 32-bit Fs to a 64-bit
> Fs in the compat code?
>
> This will avoid adding a restriction to the hypercall interface just
> because of compat.
Possible, but ugly: Since we're using the uint64_t field of the interface
structure, no translation is presently necessary for
VCPUOP_register_vcpu_time_phys_area (we do have the layer for
VCPUOP_register_runstate_phys_area, because of the size of the pointed
to area being different). Hence what you ask for would mean adding compat
code somewhere. In which case we could as well make the above check
itself depend on whether we're dealing with a compat domain. Which is
what I've named "clumsy" above; still that would seem better to me than
to add actual compat translation code.
Then again, looking at the last patch, switching to what you suggest
would (largely) address one of the RFC remarks there as well. So I guess
I'll make the change to that if() plus associated adjustments elsewhere
(in particular in the last patch there is a similar check which needs to
remain in sync).
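Concretely, the sketch I have in mind (the sentinel's width then depending on
the guest's word size; naming here is illustrative only):

    paddr_t inval = ~(uint64_t)0;

#ifdef CONFIG_COMPAT
    if ( has_32bit_shinfo(currd) )
        inval = ~(uint32_t)0;
#endif

    if ( gaddr != inval )   /* all-Fs (of guest width) would mean "unregister" */
    {
        /* ... mapping/registration path as in the patch ... */
    }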
Jan