[v1] x86/PVH: Dom0 building adjustments

[PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Jan Beulich 4 years, 5 months ago

One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by unintended (for this purpose) code paths. This means we
need to be more careful about the mappings put in place in this range -
mappings should be created exactly once:
- iommu_hwdom_init() comes first; it should avoid the first Mb,
- pvh_populate_p2m() should insert identity mappings only into ranges
  not populated as RAM,
- pvh_setup_acpi() should again avoid the first Mb, which was already
  dealt with at that point.

Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Note: Conflicts with the previously submitted "IOMMU/x86: perform PV
      Dom0 mappings in batches".

--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -430,17 +430,6 @@ static int __init pvh_populate_p2m(struc
     int rc;
 #define MB1_PAGES PFN_DOWN(MB(1))
 
-    /*
-     * Memory below 1MB is identity mapped initially. RAM regions are
-     * populated and copied below, replacing the respective mappings.
-     */
-    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
-    if ( rc )
-    {
-        printk("Failed to identity map low 1MB: %d\n", rc);
-        return rc;
-    }
-
     /* Populate memory map. */
     for ( i = 0; i < d->arch.nr_e820; i++ )
     {
@@ -472,6 +461,23 @@ static int __init pvh_populate_p2m(struc
         }
     }
 
+    /* Non-RAM regions of space below 1MB get identity mapped. */
+    for ( i = rc = 0; i < MB1_PAGES; ++i )
+    {
+        p2m_type_t p2mt;
+
+        if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) )
+            rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K);
+        else
+            ASSERT(p2mt == p2m_ram_rw);
+        put_gfn(d, i);
+        if ( rc )
+        {
+            printk("Failed to identity map PFN %x: %d\n", i, rc);
+            return rc;
+        }
+    }
+
     if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
     {
         /*
@@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
         nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
                           d->arch.e820[i].size);
 
+        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
+        if ( pfn < PFN_DOWN(MB(1)) )
+        {
+            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
+                continue;
+
+            /* This shouldn't happen, but is easy to deal with. */
+            nr_pages -= PFN_DOWN(MB(1)) - pfn;
+            pfn = PFN_DOWN(MB(1));
+        }
+
         rc = modify_identity_mmio(d, pfn, nr_pages, true);
         if ( rc )
         {
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -337,7 +337,13 @@ void __hwdom_init arch_iommu_hwdom_init(
     max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
     top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
 
-    for ( i = 0; i < top; i++ )
+    /*
+     * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
+     * setting up potentially conflicting mappings here.
+     */
+    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
+
+    for ( ; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
         int rc;

Re: [PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Andrew Cooper 4 years, 5 months ago

On 30/08/2021 14:02, Jan Beulich wrote:
> One of the changes comprising the fixes for XSA-378 disallows replacing
> MMIO mappings by unintended (for this purpose) code paths.

I'd drop the brackets.  All it does is confuse the sentence.

>  This means we
> need to be more careful about the mappings put in place in this range -
> mappings should be created exactly once:
> - iommu_hwdom_init() comes first; it should avoid the first Mb,
> - pvh_populate_p2m() should insert identity mappings only into ranges
>   not populated as RAM,
> - pvh_setup_acpi() should again avoid the first Mb, which was already
>   dealt with at that point.

This really is a mess.  It also seems very fragile.

Why is iommu_hwdom_init() separate in the first place?  It only does
mappings up to 4G in the first place, and with this change, it is now
[1M-4G)

All IOMMU modifications should be as a side effect of regular p2m/guest
physmap operations.  I suppose this is another breakage from the PV side
of things.

> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>                            d->arch.e820[i].size);
>  
> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
> +        if ( pfn < PFN_DOWN(MB(1)) )
> +        {
> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
> +                continue;
> +
> +            /* This shouldn't happen, but is easy to deal with. */

I'm not sure this comment is helpful.

Under PVH, there is nothing special about the 1M boundary, and we can
reasonably have something else here or crossing the boundary.

Preferably with this removed, Acked-by: Andrew Cooper
<andrew.cooper3@citrix.com>, but only because this is an emergency fix. 
I really don't think it is an improvement to the logic.

~Andrew

Re: [PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Jan Beulich 4 years, 5 months ago

On 31.08.2021 15:02, Andrew Cooper wrote:
> On 30/08/2021 14:02, Jan Beulich wrote:
>> One of the changes comprising the fixes for XSA-378 disallows replacing
>> MMIO mappings by unintended (for this purpose) code paths.
> 
> I'd drop the brackets.  All it does is confuse the sentence.
> 
>>  This means we
>> need to be more careful about the mappings put in place in this range -
>> mappings should be created exactly once:
>> - iommu_hwdom_init() comes first; it should avoid the first Mb,
>> - pvh_populate_p2m() should insert identity mappings only into ranges
>>   not populated as RAM,
>> - pvh_setup_acpi() should again avoid the first Mb, which was already
>>   dealt with at that point.
> 
> This really is a mess.  It also seems very fragile.

So it seems to me.

> Why is iommu_hwdom_init() separate in the first place?  It only does
> mappings up to 4G in the first place, and with this change, it is now
> [1M-4G)

I guess we'll want to wait for Roger to return to shed some light on
this.

>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>                            d->arch.e820[i].size);
>>  
>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>> +        if ( pfn < PFN_DOWN(MB(1)) )
>> +        {
>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>> +                continue;
>> +
>> +            /* This shouldn't happen, but is easy to deal with. */
> 
> I'm not sure this comment is helpful.
> 
> Under PVH, there is nothing special about the 1M boundary, and we can
> reasonably have something else here or crossing the boundary.

As long as we have this special treatment of the low Mb, the boundary
is meaningful imo. I'd see the comment go away when the handling in
general gets streamlined.

> Preferably with this removed, Acked-by: Andrew Cooper
> <andrew.cooper3@citrix.com>, but only because this is an emergency fix.

Thanks. I see. You'll like patch 2 even less; at least I do. And I'm
not really certain that change is enough to cover all possible
systems.

> I really don't think it is an improvement to the logic.

Yet I suppose you also have no immediate suggestions towards doing
better? Of course right here a full rework is out of scope. But if
there were smaller bits that - if adjusted - would make you feel
better about the change as a whole, I'd be happy to consider making
adjustments.

Jan

Re: [PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Roger Pau Monné 4 years, 5 months ago

On Tue, Aug 31, 2021 at 03:14:06PM +0200, Jan Beulich wrote:

Sorry for the delay, and likely not being of much help right now.

> On 31.08.2021 15:02, Andrew Cooper wrote:
> > On 30/08/2021 14:02, Jan Beulich wrote:
> >> One of the changes comprising the fixes for XSA-378 disallows replacing
> >> MMIO mappings by unintended (for this purpose) code paths.
> > 
> > I'd drop the brackets.  All it does is confuse the sentence.
> > 
> >>  This means we
> >> need to be more careful about the mappings put in place in this range -
> >> mappings should be created exactly once:
> >> - iommu_hwdom_init() comes first; it should avoid the first Mb,
> >> - pvh_populate_p2m() should insert identity mappings only into ranges
> >>   not populated as RAM,
> >> - pvh_setup_acpi() should again avoid the first Mb, which was already
> >>   dealt with at that point.
> > 
> > This really is a mess.  It also seems very fragile.
> 
> So it seems to me.
> 
> > Why is iommu_hwdom_init() separate in the first place?  It only does
> > mappings up to 4G in the first place, and with this change, it is now
> > [1M-4G)
> 
> I guess we'll want to wait for Roger to return to shed some light on
> this.

iommu_hwdom_init should cover from [0, max_pdx], or 4G if max_pdx is
below that.

IIRC first PVH dom0 implementations used to have a behavior more
similar to PV in iommu_hwdom_init, as they would get almost everything
below 4GB that's not RAM identity mapped in the (IOMMU) page tables.

PVH dom0 has since diverged, and now iommu_hwdom_init just identity
maps reserved regions. We could likely move this somewhere else, but
given it's still shared with PV dom0 (by using the same command line
option, dom0-iommu=map-reserved) I think it would just make option
handling more confusing.

One way to simplify things would be to rely on the hardware provided
memory map to correctly have the regions in the low 1M marked as
reserved, so that iommu_hwdom_init would identity map them. Then we
would just need a bit of special handling to duplicate the data in RAM
regions for the low 1M but we could avoid a lot of complexity.  This
however requires trusting the hardware is not missing required regions
in the low 1M.

Another option might be to slightly modify hwdom_iommu_map so that for
PVH it returns true for all non-RAM regions in the low 1M. That would
avoid having to add another loop in pvh_populate_p2m to map those.

Thanks, Roger.

Re: [PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Jan Beulich 4 years, 5 months ago

On 31.08.2021 15:14, Jan Beulich wrote:
> On 31.08.2021 15:02, Andrew Cooper wrote:
>> On 30/08/2021 14:02, Jan Beulich wrote:
>>> One of the changes comprising the fixes for XSA-378 disallows replacing
>>> MMIO mappings by unintended (for this purpose) code paths.
>>
>> I'd drop the brackets.  All it does is confuse the sentence.
>>
>>>  This means we
>>> need to be more careful about the mappings put in place in this range -
>>> mappings should be created exactly once:
>>> - iommu_hwdom_init() comes first; it should avoid the first Mb,
>>> - pvh_populate_p2m() should insert identity mappings only into ranges
>>>   not populated as RAM,
>>> - pvh_setup_acpi() should again avoid the first Mb, which was already
>>>   dealt with at that point.
>>
>> This really is a mess.  It also seems very fragile.
> 
> So it seems to me.
> 
>> Why is iommu_hwdom_init() separate in the first place?  It only does
>> mappings up to 4G in the first place, and with this change, it is now
>> [1M-4G)
> 
> I guess we'll want to wait for Roger to return to shed some light on
> this.
> 
>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>                            d->arch.e820[i].size);
>>>  
>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>> +        {
>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>> +                continue;
>>> +
>>> +            /* This shouldn't happen, but is easy to deal with. */
>>
>> I'm not sure this comment is helpful.
>>
>> Under PVH, there is nothing special about the 1M boundary, and we can
>> reasonably have something else here or crossing the boundary.
> 
> As long as we have this special treatment of the low Mb, the boundary
> is meaningful imo. I'd see the comment go away when the handling in
> general gets streamlined.

I should have added: And as long as Dom0's E820 map gets cloned from
the host's, which will necessarily consider the 1Mb boundary special.

Jan

Re: [PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Andrew Cooper 4 years, 5 months ago

On 31/08/2021 14:19, Jan Beulich wrote:
>>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>>                            d->arch.e820[i].size);
>>>>  
>>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>>> +        {
>>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>>> +                continue;
>>>> +
>>>> +            /* This shouldn't happen, but is easy to deal with. */
>>> I'm not sure this comment is helpful.
>>>
>>> Under PVH, there is nothing special about the 1M boundary, and we can
>>> reasonably have something else here or crossing the boundary.
>> As long as we have this special treatment of the low Mb, the boundary
>> is meaningful imo. I'd see the comment go away when the handling in
>> general gets streamlined.
> I should have added: And as long as Dom0's E820 map gets cloned from
> the host's, which will necessarily consider the 1Mb boundary special.

Not when you're booting virtualised in the first place.

~Andrew

Re: [PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Jan Beulich 4 years, 5 months ago

On 31.08.2021 15:27, Andrew Cooper wrote:
> On 31/08/2021 14:19, Jan Beulich wrote:
>>>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>>>                            d->arch.e820[i].size);
>>>>>  
>>>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>>>> +        {
>>>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>>>> +                continue;
>>>>> +
>>>>> +            /* This shouldn't happen, but is easy to deal with. */
>>>> I'm not sure this comment is helpful.
>>>>
>>>> Under PVH, there is nothing special about the 1M boundary, and we can
>>>> reasonably have something else here or crossing the boundary.
>>> As long as we have this special treatment of the low Mb, the boundary
>>> is meaningful imo. I'd see the comment go away when the handling in
>>> general gets streamlined.
>> I should have added: And as long as Dom0's E820 map gets cloned from
>> the host's, which will necessarily consider the 1Mb boundary special.
> 
> Not when you're booting virtualised in the first place.

You mean when Xen itself runs in PVH (not HVM) mode, and then in turn
has a PVH Dom0? And if the underlying Xen has not in turn cloned
the E820 from the host's? That's surely too exotic a case to warrant
removing this comment. If you insist, I can mention that case as a
possible exception.

Jan

Re: [PATCH 1/4] x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

Posted by Andrew Cooper 4 years, 5 months ago

On 31/08/2021 14:36, Jan Beulich wrote:
> On 31.08.2021 15:27, Andrew Cooper wrote:
>> On 31/08/2021 14:19, Jan Beulich wrote:
>>>>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>>>>                            d->arch.e820[i].size);
>>>>>>  
>>>>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>>>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>>>>> +        {
>>>>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>>>>> +                continue;
>>>>>> +
>>>>>> +            /* This shouldn't happen, but is easy to deal with. */
>>>>> I'm not sure this comment is helpful.
>>>>>
>>>>> Under PVH, there is nothing special about the 1M boundary, and we can
>>>>> reasonably have something else here or crossing the boundary.
>>>> As long as we have this special treatment of the low Mb, the boundary
>>>> is meaningful imo. I'd see the comment go away when the handling in
>>>> general gets streamlined.
>>> I should have added: And as long as Dom0's E820 map gets cloned from
>>> the host's, which will necessarily consider the 1Mb boundary special.
>> Not when you're booting virtualised in the first place.
> You mean when Xen itself runs in PVH (not HVM) mode, and then in turn
> has a PVH Dom0? And if the underlying Xen has not in turn cloned
> the E820 from the host's? That's surely too exotic a case to warrant
> removing this comment. If you insist, I can mention that case as a
> possible exception.

It's a long way from exotic.

Also the magic surrounding the 1M boundary is disappearing on real
hardware with legacy BIOS support being dropped from platforms.


The comment is misleading and should be dropped.  The logic is still
perfectly clear given the outer comment.

~Andrew