update_paging_mode() in the AMD IOMMU code expects to be invoked with the PCI devices lock held. The check occurring only when the mode actually needs updating, the violation of this rule by the majority of callers did go unnoticed until per-domain IOMMU setup was changed to do away with on-demand creation of IOMMU page tables. Unfortunately the only half way reasonable fix to this that I could come up with requires more re-work than would seem desirable at this time of the release process, but addressing the issue seems unavoidable to me as its manifestation is a regression from the IOMMU page table setup re-work. The change also isn't without risk of further regressions - if in patch 2 I've missed a code path that would also need to invoke the new hook, then this might mean non- working guests (with passed-through devices on AMD hardware). 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page 2: introduce GFN notification for translated domains 3: AMD/IOMMU: use notify_dfn() hook to update paging mode Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 06/11/2019 15:16, Jan Beulich wrote: > update_paging_mode() in the AMD IOMMU code expects to be invoked with > the PCI devices lock held. The check occurring only when the mode > actually needs updating, the violation of this rule by the majority > of callers did go unnoticed until per-domain IOMMU setup was changed > to do away with on-demand creation of IOMMU page tables. > > Unfortunately the only half way reasonable fix to this that I could > come up with requires more re-work than would seem desirable at this > time of the release process, but addressing the issue seems > unavoidable to me as its manifestation is a regression from the > IOMMU page table setup re-work. The change also isn't without risk > of further regressions - if in patch 2 I've missed a code path that > would also need to invoke the new hook, then this might mean non- > working guests (with passed-through devices on AMD hardware). > > 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page > 2: introduce GFN notification for translated domains > 3: AMD/IOMMU: use notify_dfn() hook to update paging mode Having now looked at all three, why don't we just delete the dynamic height of AMD IOMMU pagetables? This series looks suspiciously like it is adding new common infrastructure to work around the fact we're doing something fairly dumb to being with. Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain, and a substantial reduction in the complexity of the IOMMU code. ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 06.11.2019 18:31, Andrew Cooper wrote: > On 06/11/2019 15:16, Jan Beulich wrote: >> update_paging_mode() in the AMD IOMMU code expects to be invoked with >> the PCI devices lock held. The check occurring only when the mode >> actually needs updating, the violation of this rule by the majority >> of callers did go unnoticed until per-domain IOMMU setup was changed >> to do away with on-demand creation of IOMMU page tables. >> >> Unfortunately the only half way reasonable fix to this that I could >> come up with requires more re-work than would seem desirable at this >> time of the release process, but addressing the issue seems >> unavoidable to me as its manifestation is a regression from the >> IOMMU page table setup re-work. The change also isn't without risk >> of further regressions - if in patch 2 I've missed a code path that >> would also need to invoke the new hook, then this might mean non- >> working guests (with passed-through devices on AMD hardware). >> >> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page >> 2: introduce GFN notification for translated domains >> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode > > Having now looked at all three, why don't we just delete the dynamic > height of AMD IOMMU pagetables? > > This series looks suspiciously like it is adding new common > infrastructure to work around the fact we're doing something fairly dumb > to being with. > > Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain, > and a substantial reduction in the complexity of the IOMMU code. Yet an additional level of page walks hardware has to perform. Also 4 levels won't cover all possible 52 address bits. And finally, the more applicable your "substantial reduction", the less suitable such a change may be at this point of the release (but I didn't look at this side of things in any detail, so it may well not be an issue). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 07/11/2019 07:36, Jan Beulich wrote: > On 06.11.2019 18:31, Andrew Cooper wrote: >> On 06/11/2019 15:16, Jan Beulich wrote: >>> update_paging_mode() in the AMD IOMMU code expects to be invoked with >>> the PCI devices lock held. The check occurring only when the mode >>> actually needs updating, the violation of this rule by the majority >>> of callers did go unnoticed until per-domain IOMMU setup was changed >>> to do away with on-demand creation of IOMMU page tables. >>> >>> Unfortunately the only half way reasonable fix to this that I could >>> come up with requires more re-work than would seem desirable at this >>> time of the release process, but addressing the issue seems >>> unavoidable to me as its manifestation is a regression from the >>> IOMMU page table setup re-work. The change also isn't without risk >>> of further regressions - if in patch 2 I've missed a code path that >>> would also need to invoke the new hook, then this might mean non- >>> working guests (with passed-through devices on AMD hardware). >>> >>> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page >>> 2: introduce GFN notification for translated domains >>> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode >> Having now looked at all three, why don't we just delete the dynamic >> height of AMD IOMMU pagetables? >> >> This series looks suspiciously like it is adding new common >> infrastructure to work around the fact we're doing something fairly dumb >> to being with. >> >> Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain, >> and a substantial reduction in the complexity of the IOMMU code. > Yet an additional level of page walks hardware has to perform. Also > 4 levels won't cover all possible 52 address bits. And finally, the > more applicable your "substantial reduction", the less suitable such > a change may be at this point of the release (but I didn't look at > this side of things in any detail, so it may well not be an issue). There is, in practice, no such thing as an HVM guest using 2 levels. The VRAM just below the 4G boundary will force a resize to 3 levels during domain construction, and as a 1-line fix for 4.13, this probably isn't the worst idea going. There are no AMD systems which support >48 bit PA space, so 4 levels is sufficient for now, but fundamentally details such as the size of GPA space should be specified in domain_create() and remain static for the lifetime of the domain. As far as I can tell, it is only AMD systems with IOMMUs which permit the PA space to be variable, and I still can't help but feeling that this series is attempting to work around a problem we shouldn't have in the first place. ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 07.11.2019 13:49, Andrew Cooper wrote: > On 07/11/2019 07:36, Jan Beulich wrote: >> On 06.11.2019 18:31, Andrew Cooper wrote: >>> On 06/11/2019 15:16, Jan Beulich wrote: >>>> update_paging_mode() in the AMD IOMMU code expects to be invoked with >>>> the PCI devices lock held. The check occurring only when the mode >>>> actually needs updating, the violation of this rule by the majority >>>> of callers did go unnoticed until per-domain IOMMU setup was changed >>>> to do away with on-demand creation of IOMMU page tables. >>>> >>>> Unfortunately the only half way reasonable fix to this that I could >>>> come up with requires more re-work than would seem desirable at this >>>> time of the release process, but addressing the issue seems >>>> unavoidable to me as its manifestation is a regression from the >>>> IOMMU page table setup re-work. The change also isn't without risk >>>> of further regressions - if in patch 2 I've missed a code path that >>>> would also need to invoke the new hook, then this might mean non- >>>> working guests (with passed-through devices on AMD hardware). >>>> >>>> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page >>>> 2: introduce GFN notification for translated domains >>>> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode >>> Having now looked at all three, why don't we just delete the dynamic >>> height of AMD IOMMU pagetables? >>> >>> This series looks suspiciously like it is adding new common >>> infrastructure to work around the fact we're doing something fairly dumb >>> to being with. >>> >>> Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain, >>> and a substantial reduction in the complexity of the IOMMU code. >> Yet an additional level of page walks hardware has to perform. Also >> 4 levels won't cover all possible 52 address bits. And finally, the >> more applicable your "substantial reduction", the less suitable such >> a change may be at this point of the release (but I didn't look at >> this side of things in any detail, so it may well not be an issue). > > There is, in practice, no such thing as an HVM guest using 2 levels. > The VRAM just below the 4G boundary will force a resize to 3 levels > during domain construction, and as a 1-line fix for 4.13, this probably > isn't the worst idea going. So here (with the 1-line fix remark) you talk about 3 levels. Yet switching the 2 that we start from to 3 won't fix anything, as we still may need to go to 4 for huge guests. Such a change would merely eliminate the indeed pretty pointless move from 2 to 3 which now happens for all domains as their memory gets populated. > There are no AMD systems which support >48 bit PA space, so 4 levels is > sufficient for now, but fundamentally details such as the size of GPA > space should be specified in domain_create() and remain static for the > lifetime of the domain. I agree GPA dimensions ought to be static. But the number-of-levels adjustment the code does has nothing to do with variable GPA boundaries. Even for a domain with a, say, 36-bit GFN space it may be beneficial to run with only 3 levels, as long as it has "little" enough memory assigned. In fact the number of levels the hardware has to walk is the one aspect you don't even touch in your reply. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 06/11/2019 16:16, Jan Beulich wrote: > update_paging_mode() in the AMD IOMMU code expects to be invoked with > the PCI devices lock held. The check occurring only when the mode > actually needs updating, the violation of this rule by the majority > of callers did go unnoticed until per-domain IOMMU setup was changed > to do away with on-demand creation of IOMMU page tables. > > Unfortunately the only half way reasonable fix to this that I could > come up with requires more re-work than would seem desirable at this > time of the release process, but addressing the issue seems > unavoidable to me as its manifestation is a regression from the > IOMMU page table setup re-work. The change also isn't without risk > of further regressions - if in patch 2 I've missed a code path that > would also need to invoke the new hook, then this might mean non- > working guests (with passed-through devices on AMD hardware). > > 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page > 2: introduce GFN notification for translated domains > 3: AMD/IOMMU: use notify_dfn() hook to update paging mode > > Jan > Hi Jan, I just tested and I don't get the "pcidevs" message any more. I assume this only was a fix for that issue, so it's probably expected that the other issue: AMD-Vi: INVALID_DEV_REQUEST 00000800 8a000000 f8000840 000000fd and malfunctioning device in one of the guests. is still around. -- Sander _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 06.11.2019 19:29, Sander Eikelenboom wrote: > On 06/11/2019 16:16, Jan Beulich wrote: >> update_paging_mode() in the AMD IOMMU code expects to be invoked with >> the PCI devices lock held. The check occurring only when the mode >> actually needs updating, the violation of this rule by the majority >> of callers did go unnoticed until per-domain IOMMU setup was changed >> to do away with on-demand creation of IOMMU page tables. >> >> Unfortunately the only half way reasonable fix to this that I could >> come up with requires more re-work than would seem desirable at this >> time of the release process, but addressing the issue seems >> unavoidable to me as its manifestation is a regression from the >> IOMMU page table setup re-work. The change also isn't without risk >> of further regressions - if in patch 2 I've missed a code path that >> would also need to invoke the new hook, then this might mean non- >> working guests (with passed-through devices on AMD hardware). >> >> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page >> 2: introduce GFN notification for translated domains >> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode > > I just tested and I don't get the "pcidevs" message any more. Thanks for testing the series. > I assume this only was a fix for that issue, so it's probably expected > that the other issue: > AMD-Vi: INVALID_DEV_REQUEST 00000800 8a000000 f8000840 000000fd > and malfunctioning device in one of the guests. > is still around. Indeed. Someone (possibly me) still needs to look into the other one. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Unmapping a page which has never been mapped should be a no-op (note how it already is in case there was no root page table allocated). There's in particular no need to grow the number of page table levels in use, and there's also no need to allocate intermediate page tables except when needing to split a large page. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -176,7 +176,7 @@ void iommu_dte_set_guest_cr3(struct amd_ * page tables. */ static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn, - unsigned long pt_mfn[]) + unsigned long pt_mfn[], bool map) { struct amd_iommu_pte *pde, *next_table_vaddr; unsigned long next_table_mfn; @@ -189,6 +189,13 @@ static int iommu_pde_from_dfn(struct dom BUG_ON( table == NULL || level < 1 || level > 6 ); + /* + * A frame number past what the current page tables can represent can't + * possibly have a mapping. + */ + if ( dfn >> (PTE_PER_TABLE_SHIFT * level) ) + return 0; + next_table_mfn = mfn_x(page_to_mfn(table)); if ( level == 1 ) @@ -246,6 +253,9 @@ static int iommu_pde_from_dfn(struct dom /* Install lower level page table for non-present entries */ else if ( !pde->pr ) { + if ( !map ) + return 0; + if ( next_table_mfn == 0 ) { table = alloc_amd_iommu_pgtable(); @@ -404,7 +414,7 @@ int amd_iommu_map_page(struct domain *d, } } - if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn) || (pt_mfn[1] == 0) ) + if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn, true) || (pt_mfn[1] == 0) ) { spin_unlock(&hd->arch.mapping_lock); AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n", @@ -439,24 +449,7 @@ int amd_iommu_unmap_page(struct domain * return 0; } - /* Since HVM domain is initialized with 2 level IO page table, - * we might need a deeper page table for lager dfn now */ - if ( is_hvm_domain(d) ) - { - int rc = update_paging_mode(d, dfn_x(dfn)); - - if ( rc ) - { - spin_unlock(&hd->arch.mapping_lock); - AMD_IOMMU_DEBUG("Update page mode failed dfn = %"PRI_dfn"\n", - dfn_x(dfn)); - if ( rc != -EADDRNOTAVAIL ) - domain_crash(d); - return rc; - } - } - - if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn) || (pt_mfn[1] == 0) ) + if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn, false) ) { spin_unlock(&hd->arch.mapping_lock); AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n", @@ -465,8 +458,11 @@ int amd_iommu_unmap_page(struct domain * return -EFAULT; } - /* mark PTE as 'page not present' */ - *flush_flags |= clear_iommu_pte_present(pt_mfn[1], dfn_x(dfn)); + if ( pt_mfn[1] ) + { + /* Mark PTE as 'page not present'. */ + *flush_flags |= clear_iommu_pte_present(pt_mfn[1], dfn_x(dfn)); + } spin_unlock(&hd->arch.mapping_lock); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 06/11/2019 15:18, Jan Beulich wrote: > Unmapping a page which has never been mapped should be a no-op (note how > it already is in case there was no root page table allocated). Which function are you talking about here? iommu_pde_from_dfn() will BUG() if no root was set up. > There's > in particular no need to grow the number of page table levels in use, > and there's also no need to allocate intermediate page tables except > when needing to split a large page. To be honest, I've never been convinced that dynamically changing the number of levels in the AMD IOMMU tables is clever. It should be fixed at 4 (like everything else) and suddenly a lot of runtime complexity disappears. (I'm fairly confident that we'll need a domain create parameter to support 5 level paging in a rational way, so we won't even include walk-length gymnastics then either.) > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 06.11.2019 18:12, Andrew Cooper wrote: > On 06/11/2019 15:18, Jan Beulich wrote: >> Unmapping a page which has never been mapped should be a no-op (note how >> it already is in case there was no root page table allocated). > > Which function are you talking about here? iommu_pde_from_dfn() will > BUG() if no root was set up. amd_iommu_unmap_page() has such a check first thing. >> There's >> in particular no need to grow the number of page table levels in use, >> and there's also no need to allocate intermediate page tables except >> when needing to split a large page. > > To be honest, I've never been convinced that dynamically changing the > number of levels in the AMD IOMMU tables is clever. It should be fixed > at 4 (like everything else) and suddenly a lot of runtime complexity > disappears. (I'm fairly confident that we'll need a domain create > parameter to support 5 level paging in a rational way, so we won't even > include walk-length gymnastics then either.) 5-level paging for the CPU 1st-stage-translation is imo pretty orthogonal to needing 5 levels of paging for 2nd-stage-translation (which also is what the IOMMU code here is about). >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Thanks, Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On Wed, 6 Nov 2019 at 15:20, Jan Beulich <jbeulich@suse.com> wrote: > > Unmapping a page which has never been mapped should be a no-op (note how > it already is in case there was no root page table allocated). There's > in particular no need to grow the number of page table levels in use, > and there's also no need to allocate intermediate page tables except > when needing to split a large page. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> > > --- a/xen/drivers/passthrough/amd/iommu_map.c > +++ b/xen/drivers/passthrough/amd/iommu_map.c > @@ -176,7 +176,7 @@ void iommu_dte_set_guest_cr3(struct amd_ > * page tables. > */ > static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn, > - unsigned long pt_mfn[]) > + unsigned long pt_mfn[], bool map) > { > struct amd_iommu_pte *pde, *next_table_vaddr; > unsigned long next_table_mfn; > @@ -189,6 +189,13 @@ static int iommu_pde_from_dfn(struct dom > > BUG_ON( table == NULL || level < 1 || level > 6 ); > > + /* > + * A frame number past what the current page tables can represent can't > + * possibly have a mapping. > + */ > + if ( dfn >> (PTE_PER_TABLE_SHIFT * level) ) > + return 0; > + > next_table_mfn = mfn_x(page_to_mfn(table)); > > if ( level == 1 ) > @@ -246,6 +253,9 @@ static int iommu_pde_from_dfn(struct dom > /* Install lower level page table for non-present entries */ > else if ( !pde->pr ) > { > + if ( !map ) > + return 0; > + > if ( next_table_mfn == 0 ) > { > table = alloc_amd_iommu_pgtable(); > @@ -404,7 +414,7 @@ int amd_iommu_map_page(struct domain *d, > } > } > > - if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn) || (pt_mfn[1] == 0) ) > + if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn, true) || (pt_mfn[1] == 0) ) > { > spin_unlock(&hd->arch.mapping_lock); > AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n", > @@ -439,24 +449,7 @@ int amd_iommu_unmap_page(struct domain * > return 0; > } > > - /* Since HVM domain is initialized with 2 level IO page table, > - * we might need a deeper page table for lager dfn now */ > - if ( is_hvm_domain(d) ) > - { > - int rc = update_paging_mode(d, dfn_x(dfn)); > - > - if ( rc ) > - { > - spin_unlock(&hd->arch.mapping_lock); > - AMD_IOMMU_DEBUG("Update page mode failed dfn = %"PRI_dfn"\n", > - dfn_x(dfn)); > - if ( rc != -EADDRNOTAVAIL ) > - domain_crash(d); > - return rc; > - } > - } > - > - if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn) || (pt_mfn[1] == 0) ) > + if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn, false) ) > { > spin_unlock(&hd->arch.mapping_lock); > AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n", > @@ -465,8 +458,11 @@ int amd_iommu_unmap_page(struct domain * > return -EFAULT; > } > > - /* mark PTE as 'page not present' */ > - *flush_flags |= clear_iommu_pte_present(pt_mfn[1], dfn_x(dfn)); > + if ( pt_mfn[1] ) > + { > + /* Mark PTE as 'page not present'. */ > + *flush_flags |= clear_iommu_pte_present(pt_mfn[1], dfn_x(dfn)); > + } > > spin_unlock(&hd->arch.mapping_lock); > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xenproject.org > https://lists.xenproject.org/mailman/listinfo/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 06.11.19 16:18, Jan Beulich wrote: > Unmapping a page which has never been mapped should be a no-op (note how > it already is in case there was no root page table allocated). There's > in particular no need to grow the number of page table levels in use, > and there's also no need to allocate intermediate page tables except > when needing to split a large page. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com> Juergen _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
In order for individual IOMMU drivers (and from an abstract pov also architectures) to be able to adjust their data structures ahead of time when they might cover only a sub-range of all possible GFNs, introduce a notification call used by various code paths potentially installing a fresh mapping of a never used GFN (for a particular domain). Note that in gnttab_transfer() the notification and lock re-acquire handling is best effort only (the guest may not be able to make use of the new page in case of failure, but that's in line with the lack of a return value check of guest_physmap_add_page() itself). Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/hvm/dom0_build.c +++ b/xen/arch/x86/hvm/dom0_build.c @@ -173,7 +173,8 @@ static int __init pvh_populate_memory_ra continue; } - rc = guest_physmap_add_page(d, _gfn(start), page_to_mfn(page), + rc = notify_gfn(d, _gfn(start + (1UL << order) - 1)) ?: + guest_physmap_add_page(d, _gfn(start), page_to_mfn(page), order); if ( rc != 0 ) { --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -4286,9 +4286,17 @@ static int hvmop_set_param( if ( a.value > SHUTDOWN_MAX ) rc = -EINVAL; break; + case HVM_PARAM_IOREQ_SERVER_PFN: - d->arch.hvm.ioreq_gfn.base = a.value; + if ( d->arch.hvm.params[HVM_PARAM_NR_IOREQ_SERVER_PAGES] ) + rc = notify_gfn( + d, + _gfn(a.value + d->arch.hvm.params + [HVM_PARAM_NR_IOREQ_SERVER_PAGES] - 1)); + if ( !rc ) + d->arch.hvm.ioreq_gfn.base = a.value; break; + case HVM_PARAM_NR_IOREQ_SERVER_PAGES: { unsigned int i; @@ -4299,6 +4307,9 @@ static int hvmop_set_param( rc = -EINVAL; break; } + rc = notify_gfn(d, _gfn(d->arch.hvm.ioreq_gfn.base + a.value - 1)); + if ( rc ) + break; for ( i = 0; i < a.value; i++ ) set_bit(i, &d->arch.hvm.ioreq_gfn.mask); @@ -4312,7 +4323,11 @@ static int hvmop_set_param( BUILD_BUG_ON(HVM_PARAM_BUFIOREQ_PFN > sizeof(d->arch.hvm.ioreq_gfn.legacy_mask) * 8); if ( a.value ) - set_bit(a.index, &d->arch.hvm.ioreq_gfn.legacy_mask); + { + rc = notify_gfn(d, _gfn(a.value)); + if ( !rc ) + set_bit(a.index, &d->arch.hvm.ioreq_gfn.legacy_mask); + } break; case HVM_PARAM_X87_FIP_WIDTH: --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -946,6 +946,16 @@ map_grant_ref( return; } + if ( paging_mode_translate(ld) /* && (op->flags & GNTMAP_host_map) */ && + (rc = notify_gfn(ld, gaddr_to_gfn(op->host_addr))) ) + { + gdprintk(XENLOG_INFO, "notify(%"PRI_gfn") -> %d\n", + gfn_x(gaddr_to_gfn(op->host_addr)), rc); + op->status = GNTST_general_error; + return; + BUILD_BUG_ON(GNTST_okay); + } + if ( unlikely((rd = rcu_lock_domain_by_id(op->dom)) == NULL) ) { gdprintk(XENLOG_INFO, "Could not find domain %d\n", op->dom); @@ -2123,6 +2133,7 @@ gnttab_transfer( { bool_t okay; int rc; + gfn_t gfn; if ( i && hypercall_preempt_check() ) return i; @@ -2300,21 +2311,52 @@ gnttab_transfer( act = active_entry_acquire(e->grant_table, gop.ref); if ( evaluate_nospec(e->grant_table->gt_version == 1) ) + gfn = _gfn(shared_entry_v1(e->grant_table, gop.ref).frame); + else + gfn = _gfn(shared_entry_v2(e->grant_table, gop.ref).full_page.frame); + + if ( paging_mode_translate(e) ) { - grant_entry_v1_t *sha = &shared_entry_v1(e->grant_table, gop.ref); + gfn_t gfn2; + + active_entry_release(act); + grant_read_unlock(e->grant_table); + + rc = notify_gfn(e, gfn); + if ( rc ) + printk(XENLOG_G_WARNING + "%pd: gref %u: xfer GFN %"PRI_gfn" may be inaccessible (%d)\n", + e, gop.ref, gfn_x(gfn), rc); + + grant_read_lock(e->grant_table); + act = active_entry_acquire(e->grant_table, gop.ref); - guest_physmap_add_page(e, _gfn(sha->frame), mfn, 0); - if ( !paging_mode_translate(e) ) - sha->frame = mfn_x(mfn); + if ( evaluate_nospec(e->grant_table->gt_version == 1) ) + gfn2 = _gfn(shared_entry_v1(e->grant_table, gop.ref).frame); + else + gfn2 = _gfn(shared_entry_v2(e->grant_table, gop.ref). + full_page.frame); + + if ( !gfn_eq(gfn, gfn2) ) + { + printk(XENLOG_G_WARNING + "%pd: gref %u: xfer GFN went %"PRI_gfn" -> %"PRI_gfn"\n", + e, gop.ref, gfn_x(gfn), gfn_x(gfn2)); + gfn = gfn2; + } } - else - { - grant_entry_v2_t *sha = &shared_entry_v2(e->grant_table, gop.ref); - guest_physmap_add_page(e, _gfn(sha->full_page.frame), mfn, 0); - if ( !paging_mode_translate(e) ) - sha->full_page.frame = mfn_x(mfn); + guest_physmap_add_page(e, gfn, mfn, 0); + + if ( !paging_mode_translate(e) ) + { + if ( evaluate_nospec(e->grant_table->gt_version == 1) ) + shared_entry_v1(e->grant_table, gop.ref).frame = mfn_x(mfn); + else + shared_entry_v2(e->grant_table, gop.ref).full_page.frame = + mfn_x(mfn); } + smp_wmb(); shared_entry_header(e->grant_table, gop.ref)->flags |= GTF_transfer_completed; --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -203,6 +203,10 @@ static void populate_physmap(struct memo if ( unlikely(__copy_from_guest_offset(&gpfn, a->extent_list, i, 1)) ) goto out; + if ( paging_mode_translate(d) && + notify_gfn(d, _gfn(gpfn + (1U << a->extent_order) - 1)) ) + goto out; + if ( a->memflags & MEMF_populate_on_demand ) { /* Disallow populating PoD pages on oneself. */ @@ -745,6 +749,10 @@ static long memory_exchange(XEN_GUEST_HA continue; } + if ( paging_mode_translate(d) ) + rc = notify_gfn(d, + _gfn(gpfn + (1U << exch.out.extent_order) - 1)); + mfn = page_to_mfn(page); guest_physmap_add_page(d, _gfn(gpfn), mfn, exch.out.extent_order); @@ -813,12 +821,20 @@ int xenmem_add_to_physmap(struct domain extra.foreign_domid = DOMID_INVALID; if ( xatp->space != XENMAPSPACE_gmfn_range ) - return xenmem_add_to_physmap_one(d, xatp->space, extra, + return notify_gfn(d, _gfn(xatp->gpfn)) ?: + xenmem_add_to_physmap_one(d, xatp->space, extra, xatp->idx, _gfn(xatp->gpfn)); if ( xatp->size < start ) return -EILSEQ; + if ( !start && xatp->size ) + { + rc = notify_gfn(d, _gfn(xatp->gpfn + xatp->size - 1)); + if ( rc ) + return rc; + } + xatp->idx += start; xatp->gpfn += start; xatp->size -= start; @@ -891,7 +907,8 @@ static int xenmem_add_to_physmap_batch(s extent, 1)) ) return -EFAULT; - rc = xenmem_add_to_physmap_one(d, xatpb->space, + rc = notify_gfn(d, _gfn(gpfn)) ?: + xenmem_add_to_physmap_one(d, xatpb->space, xatpb->u, idx, _gfn(gpfn)); --- a/xen/drivers/passthrough/iommu.c +++ b/xen/drivers/passthrough/iommu.c @@ -530,6 +530,14 @@ void iommu_share_p2m_table(struct domain iommu_get_ops()->share_p2m(d); } +int iommu_notify_gfn(struct domain *d, gfn_t gfn) +{ + const struct iommu_ops *ops = dom_iommu(d)->platform_ops; + + return need_iommu_pt_sync(d) && ops->notify_dfn + ? iommu_call(ops, notify_dfn, d, _dfn(gfn_x(gfn))) : 0; +} + void iommu_crash_shutdown(void) { if ( !iommu_crash_disable ) --- a/xen/include/xen/iommu.h +++ b/xen/include/xen/iommu.h @@ -237,6 +237,8 @@ struct iommu_ops { int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn, unsigned int *flags); + int __must_check (*notify_dfn)(struct domain *d, dfn_t dfn); + void (*free_page_table)(struct page_info *); #ifdef CONFIG_X86 @@ -331,6 +333,7 @@ void iommu_crash_shutdown(void); int iommu_get_reserved_device_memory(iommu_grdm_t *, void *); void iommu_share_p2m_table(struct domain *d); +int __must_check iommu_notify_gfn(struct domain *d, gfn_t gfn); #ifdef CONFIG_HAS_PCI int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d, --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -1039,6 +1039,11 @@ static always_inline bool is_iommu_enabl return evaluate_nospec(d->options & XEN_DOMCTL_CDF_iommu); } +static inline int __must_check notify_gfn(struct domain *d, gfn_t gfn) +{ + return /* arch_notify_gfn(d, gfn) ?: */ iommu_notify_gfn(d, gfn); +} + extern bool sched_smt_power_savings; extern bool sched_disable_smt_switching; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 11/6/19 3:19 PM, Jan Beulich wrote: > In order for individual IOMMU drivers (and from an abstract pov also > architectures) to be able to adjust their data structures ahead of time > when they might cover only a sub-range of all possible GFNs, introduce > a notification call used by various code paths potentially installing a > fresh mapping of a never used GFN (for a particular domain). So trying to reverse engineer what's going on here, you mean to say something like this: --- Individual IOMMU drivers contain adjuct data structures for gfn ranges contained in the main p2m. For efficiency, these adjuct data structures often cover only a subset of the gfn range. Installing a fresh mapping of a never-used gfn may require these ranges to be expanded. Doing this when the p2m entry is first updated may be problematic because <reasons>. To fix this, implement notify_gfn(), to be called when Xen first becomes aware that a potentially new gfn may be about to be used. This will notify the IOMMU driver about the new gfn, allowing it to expand the data structures. It may return -ERESTART (?) for long-running operations, in which case the operation should be restarted or a different error if the expansion of the data structure is not possible. In the latter case, the entire operation should fail. --- Is that about right? Note I've had to make a lot of guesses here about the functionality and intent. > Note that in gnttab_transfer() the notification and lock re-acquire > handling is best effort only (the guest may not be able to make use of > the new page in case of failure, but that's in line with the lack of a > return value check of guest_physmap_add_page() itself). Is there a reason we can't just return an error to the caller? -George _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 07.11.2019 12:35, George Dunlap wrote: > On 11/6/19 3:19 PM, Jan Beulich wrote: >> In order for individual IOMMU drivers (and from an abstract pov also >> architectures) to be able to adjust their data structures ahead of time >> when they might cover only a sub-range of all possible GFNs, introduce >> a notification call used by various code paths potentially installing a >> fresh mapping of a never used GFN (for a particular domain). > > So trying to reverse engineer what's going on here, you mean to say > something like this: > > --- > Individual IOMMU drivers contain adjuct data structures for gfn ranges > contained in the main p2m. For efficiency, these adjuct data structures > often cover only a subset of the gfn range. Installing a fresh mapping > of a never-used gfn may require these ranges to be expanded. Doing this > when the p2m entry is first updated may be problematic because <reasons>. > > To fix this, implement notify_gfn(), to be called when Xen first becomes > aware that a potentially new gfn may be about to be used. This will > notify the IOMMU driver about the new gfn, allowing it to expand the > data structures. It may return -ERESTART (?) for long-running > operations, in which case the operation should be restarted or a > different error if the expansion of the data structure is not possible. > In the latter case, the entire operation should fail. > --- > > Is that about right? With the exception of the -ERESTART / long running operations aspect, yes. Plus assuming you mean "adjunct" (not "adjuct", which my dictionary doesn't know about). > Note I've had to make a lot of guesses here about > the functionality and intent. Well, even after seeing your longer description, I don't see what mine doesn't say. >> Note that in gnttab_transfer() the notification and lock re-acquire >> handling is best effort only (the guest may not be able to make use of >> the new page in case of failure, but that's in line with the lack of a >> return value check of guest_physmap_add_page() itself). > > Is there a reason we can't just return an error to the caller? Rolling back what has been done by that point would seem rather difficult, which I guess is the reason why the code was written the way it is (prior to my change). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 11/7/19 11:47 AM, Jan Beulich wrote: > On 07.11.2019 12:35, George Dunlap wrote: >> On 11/6/19 3:19 PM, Jan Beulich wrote: >>> In order for individual IOMMU drivers (and from an abstract pov also >>> architectures) to be able to adjust their data structures ahead of time >>> when they might cover only a sub-range of all possible GFNs, introduce >>> a notification call used by various code paths potentially installing a >>> fresh mapping of a never used GFN (for a particular domain). >> >> So trying to reverse engineer what's going on here, you mean to say >> something like this: >> >> --- >> Individual IOMMU drivers contain adjuct data structures for gfn ranges >> contained in the main p2m. For efficiency, these adjuct data structures >> often cover only a subset of the gfn range. Installing a fresh mapping >> of a never-used gfn may require these ranges to be expanded. Doing this >> when the p2m entry is first updated may be problematic because <reasons>. >> >> To fix this, implement notify_gfn(), to be called when Xen first becomes >> aware that a potentially new gfn may be about to be used. This will >> notify the IOMMU driver about the new gfn, allowing it to expand the >> data structures. It may return -ERESTART (?) for long-running >> operations, in which case the operation should be restarted or a >> different error if the expansion of the data structure is not possible. >> In the latter case, the entire operation should fail. >> --- >> >> Is that about right? > > With the exception of the -ERESTART / long running operations aspect, > yes. Plus assuming you mean "adjunct" (not "adjuct", which my > dictionary doesn't know about). > >> Note I've had to make a lot of guesses here about >> the functionality and intent. > > Well, even after seeing your longer description, I don't see what mine > doesn't say * "Ahead of time" -- ahead of what? * Why do things need to be done ahead of time, rather than at the time (for whatever it is)? (I couldn't even really guess at this, which is why I put "<reasons>".) * To me "notify" doesn't in any way imply that the operation can fail. Most modern notifications are FYI only, with no opportunity to prevent the thing from happening. (That's not to say that notify is an inappropriate name -- just that by itself it doesn't imply the ability to cancel, which seems like a major factor to understanding the intent of the patch.) >>> Note that in gnttab_transfer() the notification and lock re-acquire >>> handling is best effort only (the guest may not be able to make use of >>> the new page in case of failure, but that's in line with the lack of a >>> return value check of guest_physmap_add_page() itself). >> >> Is there a reason we can't just return an error to the caller? > > Rolling back what has been done by that point would seem rather > difficult, which I guess is the reason why the code was written the > way it is (prior to my change). The phrasing made me think that you were changing it to be best-effort, rather than following suit with existing functionality. Maybe: "Note that before this patch, in gnttab_transfer(), once <condition> happens, further errors modifying the physmap are ignored (presumably because it would be too complicated to try to roll back at that point). This patch follows suit by ignoring failed notify_gfn()s, simply printing out a warning that the gfn may not be accessible due to the failure." -George _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
On 07.11.2019 13:10, George Dunlap wrote: > On 11/7/19 11:47 AM, Jan Beulich wrote: >> On 07.11.2019 12:35, George Dunlap wrote: >>> On 11/6/19 3:19 PM, Jan Beulich wrote: >>>> In order for individual IOMMU drivers (and from an abstract pov also >>>> architectures) to be able to adjust their data structures ahead of time >>>> when they might cover only a sub-range of all possible GFNs, introduce >>>> a notification call used by various code paths potentially installing a >>>> fresh mapping of a never used GFN (for a particular domain). >>> >>> So trying to reverse engineer what's going on here, you mean to say >>> something like this: >>> >>> --- >>> Individual IOMMU drivers contain adjuct data structures for gfn ranges >>> contained in the main p2m. For efficiency, these adjuct data structures >>> often cover only a subset of the gfn range. Installing a fresh mapping >>> of a never-used gfn may require these ranges to be expanded. Doing this >>> when the p2m entry is first updated may be problematic because <reasons>. >>> >>> To fix this, implement notify_gfn(), to be called when Xen first becomes >>> aware that a potentially new gfn may be about to be used. This will >>> notify the IOMMU driver about the new gfn, allowing it to expand the >>> data structures. It may return -ERESTART (?) for long-running >>> operations, in which case the operation should be restarted or a >>> different error if the expansion of the data structure is not possible. >>> In the latter case, the entire operation should fail. >>> --- >>> >>> Is that about right? >> >> With the exception of the -ERESTART / long running operations aspect, >> yes. Plus assuming you mean "adjunct" (not "adjuct", which my >> dictionary doesn't know about). >> >>> Note I've had to make a lot of guesses here about >>> the functionality and intent. >> >> Well, even after seeing your longer description, I don't see what mine >> doesn't say > > * "Ahead of time" -- ahead of what? I replaced "time" by "actual mapping requests", realizing that I'm implying too much here of what is the subject of the next patch. > * Why do things need to be done ahead of time, rather than at the time > (for whatever it is)? (I couldn't even really guess at this, which is > why I put "<reasons>".) This "why" imo really is the subject of the next patch, and hence gets explained there. > * To me "notify" doesn't in any way imply that the operation can fail. > Most modern notifications are FYI only, with no opportunity to prevent > the thing from happening. (That's not to say that notify is an > inappropriate name -- just that by itself it doesn't imply the ability > to cancel, which seems like a major factor to understanding the intent > of the patch.) I'm up for different names; "notify" is what I could think of. It being able to fail is in line with our more abstract notifier infrastructure (inherited from Linux) also allowing for NOTIFY_BAD. >>>> Note that in gnttab_transfer() the notification and lock re-acquire >>>> handling is best effort only (the guest may not be able to make use of >>>> the new page in case of failure, but that's in line with the lack of a >>>> return value check of guest_physmap_add_page() itself). >>> >>> Is there a reason we can't just return an error to the caller? >> >> Rolling back what has been done by that point would seem rather >> difficult, which I guess is the reason why the code was written the >> way it is (prior to my change). > > The phrasing made me think that you were changing it to be best-effort, > rather than following suit with existing functionality. > > Maybe: > > "Note that before this patch, in gnttab_transfer(), once <condition> > happens, further errors modifying the physmap are ignored (presumably > because it would be too complicated to try to roll back at that point). > This patch follows suit by ignoring failed notify_gfn()s, simply > printing out a warning that the gfn may not be accessible due to the > failure." Thanks, I'll use this in a slightly extended form. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
update_paging_mode() expects to be invoked with the PCI devices lock held. The check occurring only when the mode actually needs updating, the violation of this rule by the majority of callers did go unnoticed until per-domain IOMMU setup was changed to do away with on-demand creation of IOMMU page tables. Acquiring the necessary lock in amd_iommu_map_page() or intermediate layers in generic IOMMU code is not possible - we'd risk all sorts of lock order violations. Hence the call to update_paging_mode() gets pulled out of the function, to be invoked instead from the new notify_dfn() hook, where no potentially conflicting locks are being held by the callers. Similarly the call to amd_iommu_alloc_root() gets pulled out - now that we receive notification of all DFN range increases, there's no need anymore to do this check when actually mapping a page. Note that this ought to result in a small performance improvement as well: The hook often gets invoked just once for larger blocks of pages, so rather than going through amd_iommu_alloc_root() and update_paging_mode() once per page, we may now invoke it just once per batch. Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -383,35 +383,16 @@ int amd_iommu_map_page(struct domain *d, unsigned int flags, unsigned int *flush_flags) { struct domain_iommu *hd = dom_iommu(d); - int rc; unsigned long pt_mfn[7]; memset(pt_mfn, 0, sizeof(pt_mfn)); spin_lock(&hd->arch.mapping_lock); - rc = amd_iommu_alloc_root(hd); - if ( rc ) + if ( !hd->arch.root_table ) { spin_unlock(&hd->arch.mapping_lock); - AMD_IOMMU_DEBUG("Root table alloc failed, dfn = %"PRI_dfn"\n", - dfn_x(dfn)); - domain_crash(d); - return rc; - } - - /* Since HVM domain is initialized with 2 level IO page table, - * we might need a deeper page table for wider dfn now */ - if ( is_hvm_domain(d) ) - { - if ( update_paging_mode(d, dfn_x(dfn)) ) - { - spin_unlock(&hd->arch.mapping_lock); - AMD_IOMMU_DEBUG("Update page mode failed dfn = %"PRI_dfn"\n", - dfn_x(dfn)); - domain_crash(d); - return -EFAULT; - } + return -ENODATA; } if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn, true) || (pt_mfn[1] == 0) ) @@ -468,6 +449,48 @@ int amd_iommu_unmap_page(struct domain * return 0; } + +int amd_iommu_notify_dfn(struct domain *d, dfn_t dfn) +{ + struct domain_iommu *hd = dom_iommu(d); + int rc; + + ASSERT(is_hvm_domain(d)); + + /* + * Since HVM domain is initialized with 2 level IO page table, + * we might need a deeper page table for wider dfn now. + */ + pcidevs_lock(); + spin_lock(&hd->arch.mapping_lock); + + rc = amd_iommu_alloc_root(hd); + if ( rc ) + { + spin_unlock(&hd->arch.mapping_lock); + pcidevs_unlock(); + AMD_IOMMU_DEBUG("Root table alloc failed, dfn = %"PRI_dfn" (rc %d)\n", + dfn_x(dfn), rc); + domain_crash(d); + return rc; + } + + rc = update_paging_mode(d, dfn_x(dfn)); + if ( rc ) + { + spin_unlock(&hd->arch.mapping_lock); + pcidevs_unlock(); + AMD_IOMMU_DEBUG("Update paging mode failed dfn %"PRI_dfn" (rc %d)\n", + dfn_x(dfn), rc); + domain_crash(d); + return rc; + } + + spin_unlock(&hd->arch.mapping_lock); + pcidevs_unlock(); + + return 0; +} static unsigned long flush_count(unsigned long dfn, unsigned int page_count, unsigned int order) --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -628,6 +628,7 @@ static const struct iommu_ops __initcons .teardown = amd_iommu_domain_destroy, .map_page = amd_iommu_map_page, .unmap_page = amd_iommu_unmap_page, + .notify_dfn = amd_iommu_notify_dfn, .iotlb_flush = amd_iommu_flush_iotlb_pages, .iotlb_flush_all = amd_iommu_flush_iotlb_all, .free_page_table = deallocate_page_table, --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h @@ -61,6 +61,7 @@ int __must_check amd_iommu_map_page(stru int __must_check amd_iommu_unmap_page(struct domain *d, dfn_t dfn, unsigned int *flush_flags); int __must_check amd_iommu_alloc_root(struct domain_iommu *hd); +int __must_check amd_iommu_notify_dfn(struct domain *d, dfn_t dfn); int amd_iommu_reserve_domain_unity_map(struct domain *domain, paddr_t phys_addr, unsigned long size, int iw, int ir); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
© 2016 - 2024 Red Hat, Inc.