When all entries of a page table are contiguous (and carry identical
attributes), the table can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.
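
For illustration only: the condition being detected is that every one
of the table's PTE_NUM entries is present, carries identical
attributes, and maps its share of the region contiguously. The
self-contained sketch below is not the actual
pt_update_contig_markers() logic; the flat pte64_t layout and the
PTE_ADDR_MASK / PTE_ATTR_MASK split are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define PTE_NUM        512                    /* entries per table (9-bit stride) */
#define PTE_ADDR_MASK  0x000ffffffffff000ULL  /* illustrative: bits 12-51 hold the frame address */
#define PTE_ATTR_MASK  (~PTE_ADDR_MASK)       /* illustrative: everything else is "attributes" */

typedef uint64_t pte64_t;

/*
 * A level-N table can be folded into a single superpage entry at level
 * N+1 iff every slot is populated with identical attributes and slot i
 * maps exactly the i-th chunk of the covered region, i.e. addresses
 * advance by the size mapped per entry at this level.
 */
static bool table_is_coalescible(const pte64_t pt[PTE_NUM],
                                 uint64_t bytes_per_entry)
{
    uint64_t base = pt[0] & PTE_ADDR_MASK;
    uint64_t attr = pt[0] & PTE_ATTR_MASK;

    for ( unsigned int i = 0; i < PTE_NUM; ++i )
    {
        if ( (pt[i] & PTE_ATTR_MASK) != attr )
            return false;                     /* differing attributes */
        if ( (pt[i] & PTE_ADDR_MASK) != base + i * bytes_per_entry )
            return false;                     /* not contiguous */
    }

    /* All-empty (non-present) tables are the subject of the freeing
       logic earlier in the series, not of coalescing. */
    return attr != 0;
}

The patch itself doesn't rescan the table like this; it relies on the
contiguity markers maintained by pt_update_contig_markers() as
individual entries are updated.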
The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
for whenever we (and obviously hardware) start supporting 512G mappings.
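
To make the latent trap concrete: level_to_offset_bits(l) expands to
12 + (l - 1) * LEVEL_STRIDE, so re-coalescing a level-3 table into a
512G entry would shift the mask by 30 bits. The standalone snippet
below (its macro copies merely mirror this header; it is not part of
the patch) shows why the mask needs to be unsigned long for that to
work:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define LEVEL_STRIDE            9
#define PTE_NUM                 (1 << LEVEL_STRIDE)
#define LEVEL_MASK              (PTE_NUM - 1UL)  /* unsigned long, as after this patch */
#define level_to_offset_bits(l) (12 + ((l) - 1) * LEVEL_STRIDE)

int main(void)
{
    /*
     * Aligning the address for a 2M superpage (replacing a level-1
     * table): shift count 12.  0x1ff << 12 still fits in a 32-bit int,
     * so the old plain-int LEVEL_MASK ((1 << LEVEL_STRIDE) - 1)
     * happened to work here, as it did for 1G (shift count 21).
     */
    printf("2M mask:   %#" PRIx64 "\n",
           (uint64_t)~(LEVEL_MASK << level_to_offset_bits(1)));

    /*
     * 512G (replacing a level-3 table) shifts by 30: 0x1ff << 30 needs
     * 39 bits and would overflow a 32-bit int before ever being widened
     * for the 64-bit PTE.  With the unsigned long mask the whole
     * expression stays 64-bit.
     */
    printf("512G mask: %#" PRIx64 "\n",
           (uint64_t)~(LEVEL_MASK << level_to_offset_bits(3)));

    return 0;
}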
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
Unlike the freeing of all-empty page tables, this causes quite a bit of
back and forth for PV domains, due to their mapping/unmapping of pages
when they get converted to/from being page tables. It may therefore be
worth considering delaying re-coalescing a little, to avoid doing so
when the superpage would otherwise get split again soon afterwards. But
I think this would better be the subject of a separate change anyway.
Of course this could also be helped by more "aware" kernel-side
behavior: kernels could avoid immediately mapping freed page tables
writable again, in anticipation of re-using that same page for another
page table elsewhere.
---
v4: Re-base over changes earlier in the series.
v3: New.
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2216,14 +2216,35 @@ static int __must_check cf_check intel_i
* While the (ab)use of PTE_kind_table here allows to save some work in
* the function, the main motivation for it is that it avoids a so far
* unexplained hang during boot (while preparing Dom0) on a Westmere
- * based laptop.
+ * based laptop. This also has the intended effect of terminating the
+ * loop when super pages aren't supported anymore at the next level.
*/
- pt_update_contig_markers(&page->val,
- address_level_offset(dfn_to_daddr(dfn), level),
- level,
- (hd->platform_ops->page_sizes &
- (1UL << level_to_offset_bits(level + 1))
- ? PTE_kind_leaf : PTE_kind_table));
+ while ( pt_update_contig_markers(&page->val,
+ address_level_offset(dfn_to_daddr(dfn), level),
+ level,
+ (hd->platform_ops->page_sizes &
+ (1UL << level_to_offset_bits(level + 1))
+ ? PTE_kind_leaf : PTE_kind_table)) )
+ {
+ struct page_info *pg = maddr_to_page(pg_maddr);
+
+ unmap_vtd_domain_page(page);
+
+ new.val &= ~(LEVEL_MASK << level_to_offset_bits(level));
+ dma_set_pte_superpage(new);
+
+ pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level,
+ flush_flags, false);
+ BUG_ON(pg_maddr < PAGE_SIZE);
+
+ page = map_vtd_domain_page(pg_maddr);
+ pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
+ *pte = new;
+ iommu_sync_cache(pte, sizeof(*pte));
+
+ *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
+ iommu_queue_free_pgtable(hd, pg);
+ }
spin_unlock(&hd->arch.mapping_lock);
unmap_vtd_domain_page(page);
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -232,7 +232,7 @@ struct context_entry {
/* page table handling */
#define LEVEL_STRIDE (9)
-#define LEVEL_MASK ((1 << LEVEL_STRIDE) - 1)
+#define LEVEL_MASK (PTE_NUM - 1UL)
#define PTE_NUM (1 << LEVEL_STRIDE)
#define level_to_agaw(val) ((val) - 2)
#define agaw_to_level(val) ((val) + 2)