One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by unintended (for this purpose) code paths. This means we
need to be more careful about the mappings put in place in this range -
mappings should be created exactly once:
- iommu_hwdom_init() comes first; it should avoid the first Mb,
- pvh_populate_p2m() should insert identity mappings only into ranges
not populated as RAM,
- pvh_setup_acpi() should again avoid the first Mb, which was already
dealt with at that point.
Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Note: Conflicts with the previously submitted "IOMMU/x86: perform PV
Dom0 mappings in batches".
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -430,17 +430,6 @@ static int __init pvh_populate_p2m(struc
     int rc;
 #define MB1_PAGES PFN_DOWN(MB(1))
 
-    /*
-     * Memory below 1MB is identity mapped initially. RAM regions are
-     * populated and copied below, replacing the respective mappings.
-     */
-    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
-    if ( rc )
-    {
-        printk("Failed to identity map low 1MB: %d\n", rc);
-        return rc;
-    }
-
     /* Populate memory map. */
     for ( i = 0; i < d->arch.nr_e820; i++ )
     {
@@ -472,6 +461,23 @@ static int __init pvh_populate_p2m(struc
         }
     }
 
+    /* Non-RAM regions of space below 1MB get identity mapped. */
+    for ( i = rc = 0; i < MB1_PAGES; ++i )
+    {
+        p2m_type_t p2mt;
+
+        if ( mfn_eq(get_gfn_query(d, i, &p2mt), INVALID_MFN) )
+            rc = set_mmio_p2m_entry(d, _gfn(i), _mfn(i), PAGE_ORDER_4K);
+        else
+            ASSERT(p2mt == p2m_ram_rw);
+        put_gfn(d, i);
+        if ( rc )
+        {
+            printk("Failed to identity map PFN %x: %d\n", i, rc);
+            return rc;
+        }
+    }
+
     if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
     {
         /*
@@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
         nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
                           d->arch.e820[i].size);
 
+        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
+        if ( pfn < PFN_DOWN(MB(1)) )
+        {
+            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
+                continue;
+
+            /* This shouldn't happen, but is easy to deal with. */
+            nr_pages -= PFN_DOWN(MB(1)) - pfn;
+            pfn = PFN_DOWN(MB(1));
+        }
+
         rc = modify_identity_mmio(d, pfn, nr_pages, true);
         if ( rc )
         {
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -337,7 +337,13 @@ void __hwdom_init arch_iommu_hwdom_init(
         max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
     top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
 
-    for ( i = 0; i < top; i++ )
+    /*
+     * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
+     * setting up potentially conflicting mappings here.
+     */
+    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
+
+    for ( ; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
         int rc;
On 30/08/2021 14:02, Jan Beulich wrote:
> One of the changes comprising the fixes for XSA-378 disallows replacing
> MMIO mappings by unintended (for this purpose) code paths.

I'd drop the brackets.  All it does is confuse the sentence.

> This means we
> need to be more careful about the mappings put in place in this range -
> mappings should be created exactly once:
> - iommu_hwdom_init() comes first; it should avoid the first Mb,
> - pvh_populate_p2m() should insert identity mappings only into ranges
>   not populated as RAM,
> - pvh_setup_acpi() should again avoid the first Mb, which was already
>   dealt with at that point.

This really is a mess.  It also seems very fragile.

Why is iommu_hwdom_init() separate in the first place?  It only does
mappings up to 4G in the first place, and with this change, it is now
[1M-4G)

All IOMMU modifications should be as a side effect of regular p2m/guest
physmap operations.  I suppose this is another breakage from the PV side
of things.

> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>                            d->arch.e820[i].size);
>
> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
> +        if ( pfn < PFN_DOWN(MB(1)) )
> +        {
> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
> +                continue;
> +
> +            /* This shouldn't happen, but is easy to deal with. */

I'm not sure this comment is helpful.

Under PVH, there is nothing special about the 1M boundary, and we can
reasonably have something else here or crossing the boundary.

Preferably with this removed, Acked-by: Andrew Cooper
<andrew.cooper3@citrix.com>, but only because this is an emergency fix.
I really don't think it is an improvement to the logic.

~Andrew
On 31.08.2021 15:02, Andrew Cooper wrote:
> On 30/08/2021 14:02, Jan Beulich wrote:
>> One of the changes comprising the fixes for XSA-378 disallows replacing
>> MMIO mappings by unintended (for this purpose) code paths.
>
> I'd drop the brackets.  All it does is confuse the sentence.
>
>> This means we
>> need to be more careful about the mappings put in place in this range -
>> mappings should be created exactly once:
>> - iommu_hwdom_init() comes first; it should avoid the first Mb,
>> - pvh_populate_p2m() should insert identity mappings only into ranges
>>   not populated as RAM,
>> - pvh_setup_acpi() should again avoid the first Mb, which was already
>>   dealt with at that point.
>
> This really is a mess.  It also seems very fragile.

So it seems to me.

> Why is iommu_hwdom_init() separate in the first place?  It only does
> mappings up to 4G in the first place, and with this change, it is now
> [1M-4G)

I guess we'll want to wait for Roger to return to shed some light on
this.

>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>                            d->arch.e820[i].size);
>>
>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>> +        if ( pfn < PFN_DOWN(MB(1)) )
>> +        {
>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>> +                continue;
>> +
>> +            /* This shouldn't happen, but is easy to deal with. */
>
> I'm not sure this comment is helpful.
>
> Under PVH, there is nothing special about the 1M boundary, and we can
> reasonably have something else here or crossing the boundary.

As long as we have this special treatment of the low Mb, the boundary
is meaningful imo.  I'd see the comment go away when the handling in
general gets streamlined.

> Preferably with this removed, Acked-by: Andrew Cooper
> <andrew.cooper3@citrix.com>, but only because this is an emergency fix.

Thanks.  I see.  You'll like patch 2 even less; at least I do.  And I'm
not really certain that change is enough to cover all possible systems.

> I really don't think it is an improvement to the logic.

Yet I suppose you also have no immediate suggestions towards doing
better?  Of course right here a full rework is out of scope.  But if
there were smaller bits that - if adjusted - would make you feel better
about the change as a whole, I'd be happy to consider making
adjustments.

Jan
On Tue, Aug 31, 2021 at 03:14:06PM +0200, Jan Beulich wrote:

Sorry for the delay, and likely not being of much help right now.

> On 31.08.2021 15:02, Andrew Cooper wrote:
> > On 30/08/2021 14:02, Jan Beulich wrote:
> >> One of the changes comprising the fixes for XSA-378 disallows replacing
> >> MMIO mappings by unintended (for this purpose) code paths.
> >
> > I'd drop the brackets.  All it does is confuse the sentence.
> >
> >> This means we
> >> need to be more careful about the mappings put in place in this range -
> >> mappings should be created exactly once:
> >> - iommu_hwdom_init() comes first; it should avoid the first Mb,
> >> - pvh_populate_p2m() should insert identity mappings only into ranges
> >>   not populated as RAM,
> >> - pvh_setup_acpi() should again avoid the first Mb, which was already
> >>   dealt with at that point.
> >
> > This really is a mess.  It also seems very fragile.
>
> So it seems to me.
>
> > Why is iommu_hwdom_init() separate in the first place?  It only does
> > mappings up to 4G in the first place, and with this change, it is now
> > [1M-4G)
>
> I guess we'll want to wait for Roger to return to shed some light on
> this.

iommu_hwdom_init should cover from [0, max_pdx], or 4G if max_pdx is
below that.

IIRC first PVH dom0 implementations used to have a behavior more
similar to PV in iommu_hwdom_init, as they would get almost everything
below 4GB that's not RAM identity mapped in the (IOMMU) page tables.
PVH dom0 has since diverged, and now iommu_hwdom_init just identity
maps reserved regions.

We could likely move this somewhere else, but given it's still shared
with PV dom0 (by using the same command line option,
dom0-iommu=map-reserved) I think it would just make option handling
more confusing.

One way to simplify things would be to rely on the hardware provided
memory map to correctly have the regions in the low 1M marked as
reserved, so that iommu_hwdom_init would identity map them.  Then we
would just need a bit of special handling to duplicate the data in RAM
regions for the low 1M but we could avoid a lot of complexity.  This
however requires trusting the hardware is not missing required regions
in the low 1M.

Another option might be to slightly modify hwdom_iommu_map so that for
PVH it returns true for all non-RAM regions in the low 1M.  That would
avoid having to add another loop in pvh_populate_p2m to map those.

Thanks, Roger.
On 31.08.2021 15:14, Jan Beulich wrote:
> On 31.08.2021 15:02, Andrew Cooper wrote:
>> On 30/08/2021 14:02, Jan Beulich wrote:
>>> One of the changes comprising the fixes for XSA-378 disallows replacing
>>> MMIO mappings by unintended (for this purpose) code paths.
>>
>> I'd drop the brackets.  All it does is confuse the sentence.
>>
>>> This means we
>>> need to be more careful about the mappings put in place in this range -
>>> mappings should be created exactly once:
>>> - iommu_hwdom_init() comes first; it should avoid the first Mb,
>>> - pvh_populate_p2m() should insert identity mappings only into ranges
>>>   not populated as RAM,
>>> - pvh_setup_acpi() should again avoid the first Mb, which was already
>>>   dealt with at that point.
>>
>> This really is a mess.  It also seems very fragile.
>
> So it seems to me.
>
>> Why is iommu_hwdom_init() separate in the first place?  It only does
>> mappings up to 4G in the first place, and with this change, it is now
>> [1M-4G)
>
> I guess we'll want to wait for Roger to return to shed some light on
> this.
>
>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>                            d->arch.e820[i].size);
>>>
>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>> +        {
>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>> +                continue;
>>> +
>>> +            /* This shouldn't happen, but is easy to deal with. */
>>
>> I'm not sure this comment is helpful.
>>
>> Under PVH, there is nothing special about the 1M boundary, and we can
>> reasonably have something else here or crossing the boundary.
>
> As long as we have this special treatment of the low Mb, the boundary
> is meaningful imo.  I'd see the comment go away when the handling in
> general gets streamlined.

I should have added: And as long as Dom0's E820 map gets cloned from
the host's, which will necessarily consider the 1Mb boundary special.

Jan
On 31/08/2021 14:19, Jan Beulich wrote:
>>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>>                            d->arch.e820[i].size);
>>>>
>>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>>> +        {
>>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>>> +                continue;
>>>> +
>>>> +            /* This shouldn't happen, but is easy to deal with. */
>>> I'm not sure this comment is helpful.
>>>
>>> Under PVH, there is nothing special about the 1M boundary, and we can
>>> reasonably have something else here or crossing the boundary.
>> As long as we have this special treatment of the low Mb, the boundary
>> is meaningful imo.  I'd see the comment go away when the handling in
>> general gets streamlined.
> I should have added: And as long as Dom0's E820 map gets cloned from
> the host's, which will necessarily consider the 1Mb boundary special.

Not when you're booting virtualised in the first place.

~Andrew
On 31.08.2021 15:27, Andrew Cooper wrote:
> On 31/08/2021 14:19, Jan Beulich wrote:
>>>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>>>                            d->arch.e820[i].size);
>>>>>
>>>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>>>> +        {
>>>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>>>> +                continue;
>>>>> +
>>>>> +            /* This shouldn't happen, but is easy to deal with. */
>>>> I'm not sure this comment is helpful.
>>>>
>>>> Under PVH, there is nothing special about the 1M boundary, and we can
>>>> reasonably have something else here or crossing the boundary.
>>> As long as we have this special treatment of the low Mb, the boundary
>>> is meaningful imo.  I'd see the comment go away when the handling in
>>> general gets streamlined.
>> I should have added: And as long as Dom0's E820 map gets cloned from
>> the host's, which will necessarily consider the 1Mb boundary special.
>
> Not when you're booting virtualised in the first place.

You mean when Xen itself runs in PVH (not HVM) mode, and then in turn
has a PVH Dom0?  And if the underlying Xen has not in turn cloned the
E820 from the host's?  That's surely too exotic a case to warrant
removing this comment.  If you insist, I can mention that case as a
possible exception.

Jan
On 31/08/2021 14:36, Jan Beulich wrote:
> On 31.08.2021 15:27, Andrew Cooper wrote:
>> On 31/08/2021 14:19, Jan Beulich wrote:
>>>>>> @@ -1095,6 +1101,17 @@ static int __init pvh_setup_acpi(struct
>>>>>>          nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>>>>                            d->arch.e820[i].size);
>>>>>>
>>>>>> +        /* Memory below 1MB has been dealt with by pvh_populate_p2m(). */
>>>>>> +        if ( pfn < PFN_DOWN(MB(1)) )
>>>>>> +        {
>>>>>> +            if ( pfn + nr_pages <= PFN_DOWN(MB(1)) )
>>>>>> +                continue;
>>>>>> +
>>>>>> +            /* This shouldn't happen, but is easy to deal with. */
>>>>> I'm not sure this comment is helpful.
>>>>>
>>>>> Under PVH, there is nothing special about the 1M boundary, and we can
>>>>> reasonably have something else here or crossing the boundary.
>>>> As long as we have this special treatment of the low Mb, the boundary
>>>> is meaningful imo.  I'd see the comment go away when the handling in
>>>> general gets streamlined.
>>> I should have added: And as long as Dom0's E820 map gets cloned from
>>> the host's, which will necessarily consider the 1Mb boundary special.
>> Not when you're booting virtualised in the first place.
> You mean when Xen itself runs in PVH (not HVM) mode, and then in turn
> has a PVH Dom0?  And if the underlying Xen has not in turn cloned the
> E820 from the host's?  That's surely too exotic a case to warrant
> removing this comment.  If you insist, I can mention that case as a
> possible exception.

It's a long way from exotic.

Also the magic surrounding the 1M boundary is disappearing on real
hardware with legacy BIOS support being dropped from platforms.

The comment is misleading and should be dropped.  The logic is still
perfectly clear given the outer comment.

~Andrew