[v2] IOMMU: superpage support when not sharing pagetables

[PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Jan Beulich 4 years, 4 months ago

While already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses get taken from another
source in this case. Except that, to match CPU side mappings, by default
we permit r/o ones. This then also means we now deal consistently with
IO-APICs whose MMIO is or is not covered by E820 reserved regions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
[integrated] v1: Integrate into series.
[standalone] v2: Keep IOMMU mappings in sync with CPU ones.

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -253,12 +253,12 @@ void iommu_identity_map_teardown(struct
     }
 }
 
-static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                         unsigned long pfn,
-                                         unsigned long max_pfn)
+static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
+                                                 unsigned long pfn,
+                                                 unsigned long max_pfn)
 {
     mfn_t mfn = _mfn(pfn);
-    unsigned int i, type;
+    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
 
     /*
      * Set up 1:1 mapping for dom0. Default to include only conventional RAM
@@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
      * that fall in unusable ranges for PV Dom0.
      */
     if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return false;
+        return 0;
 
     switch ( type = page_get_ram_type(mfn) )
     {
     case RAM_TYPE_UNUSABLE:
-        return false;
+        return 0;
 
     case RAM_TYPE_CONVENTIONAL:
         if ( iommu_hwdom_strict )
-            return false;
+            return 0;
         break;
 
     default:
         if ( type & RAM_TYPE_RESERVED )
         {
             if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                return false;
+                perms = 0;
         }
-        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
-            return false;
+        else if ( is_hvm_domain(d) )
+            return 0;
+        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
+            perms = 0;
     }
 
     /* Check that it doesn't overlap with the Interrupt Address Range. */
     if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return false;
+        return 0;
     /* ... or the IO-APIC */
-    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
-        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-            return false;
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
+                return 0;
+    }
+    else if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there, so it should also have such established for IOMMUs.
+         */
+        for ( i = 0; i < nr_ioapics; i++ )
+            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
+                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
+                       ? IOMMUF_readable : 0;
+    }
     /*
      * ... or the PCIe MCFG regions.
      * TODO: runtime added MMCFG regions are not checked to make sure they
      * don't overlap with already mapped regions, thus preventing trapping.
      */
     if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return false;
+        return 0;
 
-    return true;
+    return perms;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
@@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
     for ( ; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
+        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
         int rc;
 
-        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
+        if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
-            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
+            rc = set_identity_p2m_entry(d, pfn,
+                                        perms & IOMMUF_writable ? p2m_access_rw
+                                                                : p2m_access_r,
+                                        0);
         else
             rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+                           perms, &flush_flags);
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Roger Pau Monné 4 years, 2 months ago

On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
> While already the case for PVH, there's no reason to treat PV
> differently here, though of course the addresses get taken from another
> source in this case. Except that, to match CPU side mappings, by default
> we permit r/o ones. This then also means we now deal consistently with
> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> [integrated] v1: Integrate into series.
> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
> 
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -253,12 +253,12 @@ void iommu_identity_map_teardown(struct
>      }
>  }
>  
> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
> -                                         unsigned long pfn,
> -                                         unsigned long max_pfn)
> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
> +                                                 unsigned long pfn,
> +                                                 unsigned long max_pfn)
>  {
>      mfn_t mfn = _mfn(pfn);
> -    unsigned int i, type;
> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
>  
>      /*
>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>       * that fall in unusable ranges for PV Dom0.
>       */
>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> -        return false;
> +        return 0;
>  
>      switch ( type = page_get_ram_type(mfn) )
>      {
>      case RAM_TYPE_UNUSABLE:
> -        return false;
> +        return 0;
>  
>      case RAM_TYPE_CONVENTIONAL:
>          if ( iommu_hwdom_strict )
> -            return false;
> +            return 0;
>          break;
>  
>      default:
>          if ( type & RAM_TYPE_RESERVED )
>          {
>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> -                return false;
> +                perms = 0;
>          }
> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> -            return false;
> +        else if ( is_hvm_domain(d) )
> +            return 0;
> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> +            perms = 0;

I'm confused about the reason to set perms = 0 instead of just
returning here. AFAICT perms won't be set to any other value below,
so you might as well just return 0.

>      }
>  
>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> -        return false;
> +        return 0;
>      /* ... or the IO-APIC */
> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> -            return false;
> +    if ( has_vioapic(d) )
> +    {
> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> +                return 0;
> +    }
> +    else if ( is_pv_domain(d) )
> +    {
> +        /*
> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> +         * ones there, so it should also have such established for IOMMUs.
> +         */
> +        for ( i = 0; i < nr_ioapics; i++ )
> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> +                       ? IOMMUF_readable : 0;
> +    }

Note that the emulated vIO-APICs are mapped over the real ones (ie:
using the same base addresses), and hence both loops will end up using
the same regions. I would rather keep them separated anyway, just in
case we decide to somehow change the position of the emulated ones in
the future.

>      /*
>       * ... or the PCIe MCFG regions.
>       * TODO: runtime added MMCFG regions are not checked to make sure they
>       * don't overlap with already mapped regions, thus preventing trapping.
>       */
>      if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
> -        return false;
> +        return 0;
>  
> -    return true;
> +    return perms;
>  }
>  
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>      for ( ; i < top; i++ )
>      {
>          unsigned long pfn = pdx_to_pfn(i);
> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>          int rc;
>  
> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
> +        if ( !perms )
>              rc = 0;
>          else if ( paging_mode_translate(d) )
> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
> +            rc = set_identity_p2m_entry(d, pfn,
> +                                        perms & IOMMUF_writable ? p2m_access_rw
> +                                                                : p2m_access_r,
> +                                        0);
>          else
>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +                           perms, &flush_flags);

You could just call set_identity_p2m_entry uniformly here. It will
DTRT for non-translated guests also, and then hwdom_iommu_map could
perhaps return a p2m_access_t?

Thanks, Roger.

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Jan Beulich 4 years, 2 months ago

On 01.12.2021 10:09, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>       * that fall in unusable ranges for PV Dom0.
>>       */
>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>> -        return false;
>> +        return 0;
>>  
>>      switch ( type = page_get_ram_type(mfn) )
>>      {
>>      case RAM_TYPE_UNUSABLE:
>> -        return false;
>> +        return 0;
>>  
>>      case RAM_TYPE_CONVENTIONAL:
>>          if ( iommu_hwdom_strict )
>> -            return false;
>> +            return 0;
>>          break;
>>  
>>      default:
>>          if ( type & RAM_TYPE_RESERVED )
>>          {
>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>> -                return false;
>> +                perms = 0;
>>          }
>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>> -            return false;
>> +        else if ( is_hvm_domain(d) )
>> +            return 0;
>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>> +            perms = 0;
> 
> I'm confused about the reason to set perms = 0 instead of just
> returning here. AFAICT perms won't be set to any other value below,
> so you might as well just return 0.

This is so that ...

>>      }
>>  
>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>> -        return false;
>> +        return 0;
>>      /* ... or the IO-APIC */
>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>> -            return false;
>> +    if ( has_vioapic(d) )
>> +    {
>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>> +                return 0;
>> +    }
>> +    else if ( is_pv_domain(d) )
>> +    {
>> +        /*
>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>> +         * ones there, so it should also have such established for IOMMUs.
>> +         */
>> +        for ( i = 0; i < nr_ioapics; i++ )
>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>> +                       ? IOMMUF_readable : 0;
>> +    }

... this return, as per the comment, takes precedence over returning
zero.

> Note that the emulated vIO-APICs are mapped over the real ones (ie:
> using the same base addresses), and hence both loops will end up using
> the same regions. I would rather keep them separated anyway, just in
> case we decide to somehow change the position of the emulated ones in
> the future.

Yes - I don't think we should bake any such assumption into the code
here.

>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>>      for ( ; i < top; i++ )
>>      {
>>          unsigned long pfn = pdx_to_pfn(i);
>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>          int rc;
>>  
>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
>> +        if ( !perms )
>>              rc = 0;
>>          else if ( paging_mode_translate(d) )
>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
>> +            rc = set_identity_p2m_entry(d, pfn,
>> +                                        perms & IOMMUF_writable ? p2m_access_rw
>> +                                                                : p2m_access_r,
>> +                                        0);
>>          else
>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>> +                           perms, &flush_flags);
> 
> You could just call set_identity_p2m_entry uniformly here. It will
> DTRT for non-translated guests also, and then hwdom_iommu_map could
> perhaps return a p2m_access_t?

That's an orthogonal change imo, i.e. could be done as a prereq change,
but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
adjusting the code here (and vpci_make_msix_hole()) to call the
translated-only function.

Jan

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Roger Pau Monné 4 years, 2 months ago

On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
> On 01.12.2021 10:09, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
> >> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
> >>       * that fall in unusable ranges for PV Dom0.
> >>       */
> >>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> >> -        return false;
> >> +        return 0;
> >>  
> >>      switch ( type = page_get_ram_type(mfn) )
> >>      {
> >>      case RAM_TYPE_UNUSABLE:
> >> -        return false;
> >> +        return 0;
> >>  
> >>      case RAM_TYPE_CONVENTIONAL:
> >>          if ( iommu_hwdom_strict )
> >> -            return false;
> >> +            return 0;
> >>          break;
> >>  
> >>      default:
> >>          if ( type & RAM_TYPE_RESERVED )
> >>          {
> >>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> >> -                return false;
> >> +                perms = 0;
> >>          }
> >> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> >> -            return false;
> >> +        else if ( is_hvm_domain(d) )
> >> +            return 0;
> >> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> >> +            perms = 0;
> > 
> > I'm confused about the reason to set perms = 0 instead of just
> > returning here. AFAICT perms won't be set to any other value below,
> > so you might as well just return 0.
> 
> This is so that ...
> 
> >>      }
> >>  
> >>      /* Check that it doesn't overlap with the Interrupt Address Range. */
> >>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> >> -        return false;
> >> +        return 0;
> >>      /* ... or the IO-APIC */
> >> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> >> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >> -            return false;
> >> +    if ( has_vioapic(d) )
> >> +    {
> >> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> >> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >> +                return 0;
> >> +    }
> >> +    else if ( is_pv_domain(d) )
> >> +    {
> >> +        /*
> >> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> >> +         * ones there, so it should also have such established for IOMMUs.
> >> +         */
> >> +        for ( i = 0; i < nr_ioapics; i++ )
> >> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> >> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> >> +                       ? IOMMUF_readable : 0;
> >> +    }
> 
> ... this return, as per the comment, takes precedence over returning
> zero.

I see. This is because you want to map those in the IOMMU page tables
even if the IO-APIC ranges are outside of a reserved region.

I have to admit this is kind of weird, because the purpose of this
function is to add mappings for all memory below 4G, and/or for all
reserved regions.

I also wonder whether we should kind of generalize the handling of RO
regions in the IOMMU for PV dom0 by using mmio_ro_ranges instead? Ie:
we could loop around the RO ranges in PV dom0 build and map them.

FWIW MSI-X tables are also RO, but adding and removing those to the
IOMMU might be quite complex as we have to track the memory decoding
and MSI-X enable bits.

And we are likely missing a check for iomem_access_permitted in
hwdom_iommu_map?

> >> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>      for ( ; i < top; i++ )
> >>      {
> >>          unsigned long pfn = pdx_to_pfn(i);
> >> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >>          int rc;
> >>  
> >> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
> >> +        if ( !perms )
> >>              rc = 0;
> >>          else if ( paging_mode_translate(d) )
> >> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
> >> +            rc = set_identity_p2m_entry(d, pfn,
> >> +                                        perms & IOMMUF_writable ? p2m_access_rw
> >> +                                                                : p2m_access_r,
> >> +                                        0);
> >>          else
> >>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> >> +                           perms, &flush_flags);
> > 
> > You could just call set_identity_p2m_entry uniformly here. It will
> > DTRT for non-translated guests also, and then hwdom_iommu_map could
> > perhaps return a p2m_access_t?
> 
> That's an orthogonal change imo, i.e. could be done as a prereq change,
> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
> adjusting the code here 

I would rather adjust the code here to just call
set_identity_p2m_entry instead of differentiating between PV and
HVM.

> (and vpci_make_msix_hole()) to call the
> translated-only function.

This one does make sense, as vpci is strictly HVM only.

Thanks, Roger.

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Jan Beulich 4 years, 2 months ago

On 01.12.2021 11:32, Roger Pau Monné wrote:
> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>       * that fall in unusable ranges for PV Dom0.
>>>>       */
>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>> -        return false;
>>>> +        return 0;
>>>>  
>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>      {
>>>>      case RAM_TYPE_UNUSABLE:
>>>> -        return false;
>>>> +        return 0;
>>>>  
>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>          if ( iommu_hwdom_strict )
>>>> -            return false;
>>>> +            return 0;
>>>>          break;
>>>>  
>>>>      default:
>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>          {
>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>> -                return false;
>>>> +                perms = 0;
>>>>          }
>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>> -            return false;
>>>> +        else if ( is_hvm_domain(d) )
>>>> +            return 0;
>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>> +            perms = 0;
>>>
>>> I'm confused about the reason to set perms = 0 instead of just
>>> returning here. AFAICT perms won't be set to any other value below,
>>> so you might as well just return 0.
>>
>> This is so that ...
>>
>>>>      }
>>>>  
>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>> -        return false;
>>>> +        return 0;
>>>>      /* ... or the IO-APIC */
>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>> -            return false;
>>>> +    if ( has_vioapic(d) )
>>>> +    {
>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>> +                return 0;
>>>> +    }
>>>> +    else if ( is_pv_domain(d) )
>>>> +    {
>>>> +        /*
>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>> +         */
>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>> +                       ? IOMMUF_readable : 0;
>>>> +    }
>>
>> ... this return, as per the comment, takes precedence over returning
>> zero.
> 
> I see. This is because you want to map those in the IOMMU page tables
> even if the IO-APIC ranges are outside of a reserved region.
> 
> I have to admit this is kind of weird, because the purpose of this
> function is to add mappings for all memory below 4G, and/or for all
> reserved regions.

Well, that was what it started out as. The purpose here is to be consistent
about IO-APICs: Either have them all mapped, or none of them. Since we map
them in the CPU page tables and since Andrew asked for the two mappings to
be consistent, this is the only way to satisfy the requests. Personally I'd
be okay with not mapping IO-APICs here (but then regardless of whether they
are covered by a reserved region).

> I also wonder whether we should kind of generalize the handling of RO
> regions in the IOMMU for PV dom0 by using mmio_ro_ranges instead? Ie:
> we could loop around the RO ranges in PV dom0 build and map them.

We shouldn't, for example because of ...

> FWIW MSI-X tables are also RO, but adding and removing those to the
> IOMMU might be quite complex as we have to track the memory decoding
> and MSI-X enable bits.

... these: Dom0 shouldn't have a need for mappings of these tables. It's
bad enough that we need to map them in the CPU page tables.

But yes, if the goal is to map stuff uniformly in CPU and IOMMU, then
what you suggest would look to be a reasonable approach.

> And we are likely missing a check for iomem_access_permitted in
> hwdom_iommu_map?

This would be a documentation-only check: The pages have permissions
removed when not in mmio_ro_ranges (see dom0_setup_permissions()).
IOW their presence there is an indication of permissions having been
granted.

>>>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>      for ( ; i < top; i++ )
>>>>      {
>>>>          unsigned long pfn = pdx_to_pfn(i);
>>>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>>>          int rc;
>>>>  
>>>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
>>>> +        if ( !perms )
>>>>              rc = 0;
>>>>          else if ( paging_mode_translate(d) )
>>>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
>>>> +            rc = set_identity_p2m_entry(d, pfn,
>>>> +                                        perms & IOMMUF_writable ? p2m_access_rw
>>>> +                                                                : p2m_access_r,
>>>> +                                        0);
>>>>          else
>>>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>>>> +                           perms, &flush_flags);
>>>
>>> You could just call set_identity_p2m_entry uniformly here. It will
>>> DTRT for non-translated guests also, and then hwdom_iommu_map could
>>> perhaps return a p2m_access_t?
>>
>> That's an orthogonal change imo, i.e. could be done as a prereq change,
>> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
>> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
>> adjusting the code here 
> 
> I would rather adjust the code here to just call
> set_identity_p2m_entry instead of differentiating between PV and
> HVM.

I'm a little hesitant, in particular considering your suggestion to
then have hwdom_iommu_map() return p2m_access_t. Andrew has made quite
clear that the use of p2m_access_* here and in a number of other places
is actually an abuse.

Plus - forgot about this in my earlier reply - there would also be a
conflict with the next patch in this series, where larger orders will
get passed to iommu_map(), while set_identity_p2m_entry() has no
respective parameter yet (and imo it isn't urgent for it to gain one).

Jan

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Roger Pau Monné 4 years, 2 months ago

On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
> On 01.12.2021 11:32, Roger Pau Monné wrote:
> > On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
> >> On 01.12.2021 10:09, Roger Pau Monné wrote:
> >>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
> >>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
> >>>>       * that fall in unusable ranges for PV Dom0.
> >>>>       */
> >>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> >>>> -        return false;
> >>>> +        return 0;
> >>>>  
> >>>>      switch ( type = page_get_ram_type(mfn) )
> >>>>      {
> >>>>      case RAM_TYPE_UNUSABLE:
> >>>> -        return false;
> >>>> +        return 0;
> >>>>  
> >>>>      case RAM_TYPE_CONVENTIONAL:
> >>>>          if ( iommu_hwdom_strict )
> >>>> -            return false;
> >>>> +            return 0;
> >>>>          break;
> >>>>  
> >>>>      default:
> >>>>          if ( type & RAM_TYPE_RESERVED )
> >>>>          {
> >>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> >>>> -                return false;
> >>>> +                perms = 0;
> >>>>          }
> >>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> >>>> -            return false;
> >>>> +        else if ( is_hvm_domain(d) )
> >>>> +            return 0;
> >>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> >>>> +            perms = 0;
> >>>
> >>> I'm confused about the reason to set perms = 0 instead of just
> >>> returning here. AFAICT perms won't be set to any other value below,
> >>> so you might as well just return 0.
> >>
> >> This is so that ...
> >>
> >>>>      }
> >>>>  
> >>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
> >>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> >>>> -        return false;
> >>>> +        return 0;
> >>>>      /* ... or the IO-APIC */
> >>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> >>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>>> -            return false;
> >>>> +    if ( has_vioapic(d) )
> >>>> +    {
> >>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> >>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>>> +                return 0;
> >>>> +    }
> >>>> +    else if ( is_pv_domain(d) )
> >>>> +    {
> >>>> +        /*
> >>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> >>>> +         * ones there, so it should also have such established for IOMMUs.
> >>>> +         */
> >>>> +        for ( i = 0; i < nr_ioapics; i++ )
> >>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> >>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> >>>> +                       ? IOMMUF_readable : 0;
> >>>> +    }
> >>
> >> ... this return, as per the comment, takes precedence over returning
> >> zero.
> > 
> > I see. This is because you want to map those in the IOMMU page tables
> > even if the IO-APIC ranges are outside of a reserved region.
> > 
> > I have to admit this is kind of weird, because the purpose of this
> > function is to add mappings for all memory below 4G, and/or for all
> > reserved regions.
> 
> Well, that was what it started out as. The purpose here is to be consistent
> about IO-APICs: Either have them all mapped, or none of them. Since we map
> them in the CPU page tables and since Andrew asked for the two mappings to
> be consistent, this is the only way to satisfy the requests. Personally I'd
> be okay with not mapping IO-APICs here (but then regardless of whether they
> are covered by a reserved region).

I'm unsure of the best way to deal with this, it seems like both
the CPU and the IOMMU page tables would never be equal for PV dom0,
because we have no intention to map the MSI-X tables in RO mode in the
IOMMU page tables.

I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
page tables, but I also don't see much benefit of doing it unless we
have a user-case for it. The IO-APIC handling in PV is already
different from native, so I would be fine if we add a comment noting
that while the IO-APIC is mappable to the CPU page tables as RO it's
not present in the IOMMU page tables (and then adjust hwdom_iommu_map
to prevent it's mapping).

I think we should also prevent mapping of the LAPIC, the HPET and the
HyperTransport range in case they fall into a reserved region?

TBH looks like we should be using iomem_access_permitted in
arch_iommu_hwdom_init? (not just for the IO-APIC, but for other device
ranges)

> >>>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>      for ( ; i < top; i++ )
> >>>>      {
> >>>>          unsigned long pfn = pdx_to_pfn(i);
> >>>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >>>>          int rc;
> >>>>  
> >>>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
> >>>> +        if ( !perms )
> >>>>              rc = 0;
> >>>>          else if ( paging_mode_translate(d) )
> >>>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
> >>>> +            rc = set_identity_p2m_entry(d, pfn,
> >>>> +                                        perms & IOMMUF_writable ? p2m_access_rw
> >>>> +                                                                : p2m_access_r,
> >>>> +                                        0);
> >>>>          else
> >>>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >>>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> >>>> +                           perms, &flush_flags);
> >>>
> >>> You could just call set_identity_p2m_entry uniformly here. It will
> >>> DTRT for non-translated guests also, and then hwdom_iommu_map could
> >>> perhaps return a p2m_access_t?
> >>
> >> That's an orthogonal change imo, i.e. could be done as a prereq change,
> >> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
> >> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
> >> adjusting the code here 
> > 
> > I would rather adjust the code here to just call
> > set_identity_p2m_entry instead of differentiating between PV and
> > HVM.
> 
> I'm a little hesitant, in particular considering your suggestion to
> then have hwdom_iommu_map() return p2m_access_t. Andrew has made quite
> clear that the use of p2m_access_* here and in a number of other places
> is actually an abuse.
> 
> Plus - forgot about this in my earlier reply - there would also be a
> conflict with the next patch in this series, where larger orders will
> get passed to iommu_map(), while set_identity_p2m_entry() has no
> respective parameter yet (and imo it isn't urgent for it to gain one).

I've seen now as the iommu_map path is modified to handle ranges
instead of single pages. Long term we also want to expand this range
handling to the HVM branch.

Thanks, Roger.

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Jan Beulich 4 years, 2 months ago

On 02.12.2021 16:12, Roger Pau Monné wrote:
> On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
>> On 01.12.2021 11:32, Roger Pau Monné wrote:
>>> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>>>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>>>       * that fall in unusable ranges for PV Dom0.
>>>>>>       */
>>>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>  
>>>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>>>      {
>>>>>>      case RAM_TYPE_UNUSABLE:
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>  
>>>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>>>          if ( iommu_hwdom_strict )
>>>>>> -            return false;
>>>>>> +            return 0;
>>>>>>          break;
>>>>>>  
>>>>>>      default:
>>>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>>>          {
>>>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>>>> -                return false;
>>>>>> +                perms = 0;
>>>>>>          }
>>>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>> -            return false;
>>>>>> +        else if ( is_hvm_domain(d) )
>>>>>> +            return 0;
>>>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>> +            perms = 0;
>>>>>
>>>>> I'm confused about the reason to set perms = 0 instead of just
>>>>> returning here. AFAICT perms won't be set to any other value below,
>>>>> so you might as well just return 0.
>>>>
>>>> This is so that ...
>>>>
>>>>>>      }
>>>>>>  
>>>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>      /* ... or the IO-APIC */
>>>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>> -            return false;
>>>>>> +    if ( has_vioapic(d) )
>>>>>> +    {
>>>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>> +                return 0;
>>>>>> +    }
>>>>>> +    else if ( is_pv_domain(d) )
>>>>>> +    {
>>>>>> +        /*
>>>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>>>> +         */
>>>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>>>> +                       ? IOMMUF_readable : 0;
>>>>>> +    }
>>>>
>>>> ... this return, as per the comment, takes precedence over returning
>>>> zero.
>>>
>>> I see. This is because you want to map those in the IOMMU page tables
>>> even if the IO-APIC ranges are outside of a reserved region.
>>>
>>> I have to admit this is kind of weird, because the purpose of this
>>> function is to add mappings for all memory below 4G, and/or for all
>>> reserved regions.
>>
>> Well, that was what it started out as. The purpose here is to be consistent
>> about IO-APICs: Either have them all mapped, or none of them. Since we map
>> them in the CPU page tables and since Andrew asked for the two mappings to
>> be consistent, this is the only way to satisfy the requests. Personally I'd
>> be okay with not mapping IO-APICs here (but then regardless of whether they
>> are covered by a reserved region).
> 
> I'm unsure of the best way to deal with this, it seems like both
> the CPU and the IOMMU page tables would never be equal for PV dom0,
> because we have no intention to map the MSI-X tables in RO mode in the
> IOMMU page tables.
> 
> I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
> page tables, but I also don't see much benefit of doing it unless we
> have a user-case for it. The IO-APIC handling in PV is already
> different from native, so I would be fine if we add a comment noting
> that while the IO-APIC is mappable to the CPU page tables as RO it's
> not present in the IOMMU page tables (and then adjust hwdom_iommu_map
> to prevent it's mapping).

Andrew, you did request both mappings to get in sync - thoughts?

> I think we should also prevent mapping of the LAPIC, the HPET and the
> HyperTransport range in case they fall into a reserved region?

Probably.

> TBH looks like we should be using iomem_access_permitted in
> arch_iommu_hwdom_init? (not just for the IO-APIC, but for other device
> ranges)

In general - perhaps. Not sure though whether to switch to doing so
right here.

>>>>>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>>>      for ( ; i < top; i++ )
>>>>>>      {
>>>>>>          unsigned long pfn = pdx_to_pfn(i);
>>>>>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>>>>>          int rc;
>>>>>>  
>>>>>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
>>>>>> +        if ( !perms )
>>>>>>              rc = 0;
>>>>>>          else if ( paging_mode_translate(d) )
>>>>>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
>>>>>> +            rc = set_identity_p2m_entry(d, pfn,
>>>>>> +                                        perms & IOMMUF_writable ? p2m_access_rw
>>>>>> +                                                                : p2m_access_r,
>>>>>> +                                        0);
>>>>>>          else
>>>>>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>>>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>>>>>> +                           perms, &flush_flags);
>>>>>
>>>>> You could just call set_identity_p2m_entry uniformly here. It will
>>>>> DTRT for non-translated guests also, and then hwdom_iommu_map could
>>>>> perhaps return a p2m_access_t?
>>>>
>>>> That's an orthogonal change imo, i.e. could be done as a prereq change,
>>>> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
>>>> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
>>>> adjusting the code here 
>>>
>>> I would rather adjust the code here to just call
>>> set_identity_p2m_entry instead of differentiating between PV and
>>> HVM.
>>
>> I'm a little hesitant, in particular considering your suggestion to
>> then have hwdom_iommu_map() return p2m_access_t. Andrew has made quite
>> clear that the use of p2m_access_* here and in a number of other places
>> is actually an abuse.
>>
>> Plus - forgot about this in my earlier reply - there would also be a
>> conflict with the next patch in this series, where larger orders will
>> get passed to iommu_map(), while set_identity_p2m_entry() has no
>> respective parameter yet (and imo it isn't urgent for it to gain one).
> 
> I've seen now as the iommu_map path is modified to handle ranges
> instead of single pages. Long term we also want to expand this range
> handling to the HVM branch.

Long (or maybe better mid) term, yes.

Jan

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Andrew Cooper 4 years, 2 months ago

On 02/12/2021 15:28, Jan Beulich wrote:
> On 02.12.2021 16:12, Roger Pau Monné wrote:
>> On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
>>> On 01.12.2021 11:32, Roger Pau Monné wrote:
>>>> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>>>>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>>>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>>>>       * that fall in unusable ranges for PV Dom0.
>>>>>>>       */
>>>>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>>>>> -        return false;
>>>>>>> +        return 0;
>>>>>>>  
>>>>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>>>>      {
>>>>>>>      case RAM_TYPE_UNUSABLE:
>>>>>>> -        return false;
>>>>>>> +        return 0;
>>>>>>>  
>>>>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>>>>          if ( iommu_hwdom_strict )
>>>>>>> -            return false;
>>>>>>> +            return 0;
>>>>>>>          break;
>>>>>>>  
>>>>>>>      default:
>>>>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>>>>          {
>>>>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>>>>> -                return false;
>>>>>>> +                perms = 0;
>>>>>>>          }
>>>>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>> -            return false;
>>>>>>> +        else if ( is_hvm_domain(d) )
>>>>>>> +            return 0;
>>>>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>> +            perms = 0;
>>>>>> I'm confused about the reason to set perms = 0 instead of just
>>>>>> returning here. AFAICT perms won't be set to any other value below,
>>>>>> so you might as well just return 0.
>>>>> This is so that ...
>>>>>
>>>>>>>      }
>>>>>>>  
>>>>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>>>>> -        return false;
>>>>>>> +        return 0;
>>>>>>>      /* ... or the IO-APIC */
>>>>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>> -            return false;
>>>>>>> +    if ( has_vioapic(d) )
>>>>>>> +    {
>>>>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>> +                return 0;
>>>>>>> +    }
>>>>>>> +    else if ( is_pv_domain(d) )
>>>>>>> +    {
>>>>>>> +        /*
>>>>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>>>>> +         */
>>>>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>>>>> +                       ? IOMMUF_readable : 0;
>>>>>>> +    }
>>>>> ... this return, as per the comment, takes precedence over returning
>>>>> zero.
>>>> I see. This is because you want to map those in the IOMMU page tables
>>>> even if the IO-APIC ranges are outside of a reserved region.
>>>>
>>>> I have to admit this is kind of weird, because the purpose of this
>>>> function is to add mappings for all memory below 4G, and/or for all
>>>> reserved regions.
>>> Well, that was what it started out as. The purpose here is to be consistent
>>> about IO-APICs: Either have them all mapped, or none of them. Since we map
>>> them in the CPU page tables and since Andrew asked for the two mappings to
>>> be consistent, this is the only way to satisfy the requests. Personally I'd
>>> be okay with not mapping IO-APICs here (but then regardless of whether they
>>> are covered by a reserved region).
>> I'm unsure of the best way to deal with this, it seems like both
>> the CPU and the IOMMU page tables would never be equal for PV dom0,
>> because we have no intention to map the MSI-X tables in RO mode in the
>> IOMMU page tables.
>>
>> I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
>> page tables, but I also don't see much benefit of doing it unless we
>> have a user-case for it. The IO-APIC handling in PV is already
>> different from native, so I would be fine if we add a comment noting
>> that while the IO-APIC is mappable to the CPU page tables as RO it's
>> not present in the IOMMU page tables (and then adjust hwdom_iommu_map
>> to prevent it's mapping).
> Andrew, you did request both mappings to get in sync - thoughts?

Lets step back to first principles.

On real hardware, there is no such thing as read-only-ness of the
physical address space.  Anything like that is a device which accepts
and discards writes.

It's not clear what a real hardware platform would do in this scenario,
but from reading some of the platform docs, I suspect the System Address
Decoder would provide a symmetric view of the hardware address space,
but this doesn't mean that UBOX would tolerate memory accesses uniformly
from all sources.  Also, there's nothing to say that all platforms
behave the same.


For HVM with shared-pt, the CPU and IOMMU mappings really are
identical.  The IOMMU really will get a read-only mapping of real MMCFG,
and holes for fully-emulated devices, which would suffer a IOMMU fault
if targetted.

For HVM without shared-pt, the translations are mostly kept in sync, but
the permissions in the CPU mappings may be reduced for e.g. logdirty
reasons.

For PV guests, things are mostly like the HVM shared-pt case, except
we've got the real IO-APICs mapped read-only, and no fully-emulated devices.


Putting the real IO-APICs in the IOMMU is about as short sighted as
letting the PV guest see them to begin with, but there is nothing
fundamentally wrong with letting a PV guest do a DMA read of the
IO-APIC, seeing as we let it do a CPU read.  (And whether the platform
will even allow it, is a different matter.)


However, it is really important for there to not be a load of special
casing (all undocumented, naturally) keeping the CPU and IOMMU views
different.  It is an error that the views were ever different
(translation wise), and the only legitimate permission difference I can
think of is to support logdirty mode for migration.  (Introspection
protection for device-enabled VMs will be left as an exercise to
whomever first wants to use it.)

Making the guest physical address space view consistent between the CPU
and device is a "because its obviously the correct thing to do" issue. 
Deciding "well it makes no sense for you to have an IO mapping of $FOO"
is a matter of policy that Xen has no legitimate right to be enforcing.

~Andrew

Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Posted by Jan Beulich 4 years, 2 months ago

On 02.12.2021 20:16, Andrew Cooper wrote:
> On 02/12/2021 15:28, Jan Beulich wrote:
>> On 02.12.2021 16:12, Roger Pau Monné wrote:
>>> On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
>>>> On 01.12.2021 11:32, Roger Pau Monné wrote:
>>>>> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>>>>>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>>>>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>>>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>>>>>       * that fall in unusable ranges for PV Dom0.
>>>>>>>>       */
>>>>>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>>>>>> -        return false;
>>>>>>>> +        return 0;
>>>>>>>>  
>>>>>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>>>>>      {
>>>>>>>>      case RAM_TYPE_UNUSABLE:
>>>>>>>> -        return false;
>>>>>>>> +        return 0;
>>>>>>>>  
>>>>>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>>>>>          if ( iommu_hwdom_strict )
>>>>>>>> -            return false;
>>>>>>>> +            return 0;
>>>>>>>>          break;
>>>>>>>>  
>>>>>>>>      default:
>>>>>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>>>>>          {
>>>>>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>>>>>> -                return false;
>>>>>>>> +                perms = 0;
>>>>>>>>          }
>>>>>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>>> -            return false;
>>>>>>>> +        else if ( is_hvm_domain(d) )
>>>>>>>> +            return 0;
>>>>>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>>> +            perms = 0;
>>>>>>> I'm confused about the reason to set perms = 0 instead of just
>>>>>>> returning here. AFAICT perms won't be set to any other value below,
>>>>>>> so you might as well just return 0.
>>>>>> This is so that ...
>>>>>>
>>>>>>>>      }
>>>>>>>>  
>>>>>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>>>>>> -        return false;
>>>>>>>> +        return 0;
>>>>>>>>      /* ... or the IO-APIC */
>>>>>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>>> -            return false;
>>>>>>>> +    if ( has_vioapic(d) )
>>>>>>>> +    {
>>>>>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>>> +                return 0;
>>>>>>>> +    }
>>>>>>>> +    else if ( is_pv_domain(d) )
>>>>>>>> +    {
>>>>>>>> +        /*
>>>>>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>>>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>>>>>> +         */
>>>>>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>>>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>>>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>>>>>> +                       ? IOMMUF_readable : 0;
>>>>>>>> +    }
>>>>>> ... this return, as per the comment, takes precedence over returning
>>>>>> zero.
>>>>> I see. This is because you want to map those in the IOMMU page tables
>>>>> even if the IO-APIC ranges are outside of a reserved region.
>>>>>
>>>>> I have to admit this is kind of weird, because the purpose of this
>>>>> function is to add mappings for all memory below 4G, and/or for all
>>>>> reserved regions.
>>>> Well, that was what it started out as. The purpose here is to be consistent
>>>> about IO-APICs: Either have them all mapped, or none of them. Since we map
>>>> them in the CPU page tables and since Andrew asked for the two mappings to
>>>> be consistent, this is the only way to satisfy the requests. Personally I'd
>>>> be okay with not mapping IO-APICs here (but then regardless of whether they
>>>> are covered by a reserved region).
>>> I'm unsure of the best way to deal with this, it seems like both
>>> the CPU and the IOMMU page tables would never be equal for PV dom0,
>>> because we have no intention to map the MSI-X tables in RO mode in the
>>> IOMMU page tables.
>>>
>>> I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
>>> page tables, but I also don't see much benefit of doing it unless we
>>> have a user-case for it. The IO-APIC handling in PV is already
>>> different from native, so I would be fine if we add a comment noting
>>> that while the IO-APIC is mappable to the CPU page tables as RO it's
>>> not present in the IOMMU page tables (and then adjust hwdom_iommu_map
>>> to prevent it's mapping).
>> Andrew, you did request both mappings to get in sync - thoughts?
> 
> Lets step back to first principles.
> 
> On real hardware, there is no such thing as read-only-ness of the
> physical address space.  Anything like that is a device which accepts
> and discards writes.
> 
> It's not clear what a real hardware platform would do in this scenario,
> but from reading some of the platform docs, I suspect the System Address
> Decoder would provide a symmetric view of the hardware address space,
> but this doesn't mean that UBOX would tolerate memory accesses uniformly
> from all sources.  Also, there's nothing to say that all platforms
> behave the same.
> 
> 
> For HVM with shared-pt, the CPU and IOMMU mappings really are
> identical.  The IOMMU really will get a read-only mapping of real MMCFG,
> and holes for fully-emulated devices, which would suffer a IOMMU fault
> if targetted.
> 
> For HVM without shared-pt, the translations are mostly kept in sync, but
> the permissions in the CPU mappings may be reduced for e.g. logdirty
> reasons.
> 
> For PV guests, things are mostly like the HVM shared-pt case, except
> we've got the real IO-APICs mapped read-only, and no fully-emulated devices.
> 
> 
> Putting the real IO-APICs in the IOMMU is about as short sighted as
> letting the PV guest see them to begin with, but there is nothing
> fundamentally wrong with letting a PV guest do a DMA read of the
> IO-APIC, seeing as we let it do a CPU read.  (And whether the platform
> will even allow it, is a different matter.)
> 
> 
> However, it is really important for there to not be a load of special
> casing (all undocumented, naturally) keeping the CPU and IOMMU views
> different.  It is an error that the views were ever different
> (translation wise), and the only legitimate permission difference I can
> think of is to support logdirty mode for migration.  (Introspection
> protection for device-enabled VMs will be left as an exercise to
> whomever first wants to use it.)
> 
> Making the guest physical address space view consistent between the CPU
> and device is a "because its obviously the correct thing to do" issue. 
> Deciding "well it makes no sense for you to have an IO mapping of $FOO"
> is a matter of policy that Xen has no legitimate right to be enforcing.

To summarize: You continue to think it's better to map the IO-APICs r/o
also in the IOMMU, despite there not being any practical need for these
mappings (the CPU ones get permitted as a workaround only, after all).
Please correct me if that's a wrong understanding of your reply. And I
take it that you're aware that CPU mappings get inserted only upon Dom0's
request, whereas IOMMU mappings get created once during boot (the
inconsistent form of which had been present prior to this patch).

Any decision here would then imo also want to apply to e.g. the HPET
region, which we have a mode for where Dom0 can map it r/o. And the
MSI-X tables and PBAs (which get dynamically entered into mmio_ro_ranges).

Jan