From nobody Fri Dec 19 14:09:17 2025
From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org, pasha.tatashin@soleen.com,
    linux-kernel@vger.kernel.org, rientjes@google.com, dwmw2@infradead.org,
    baolu.lu@linux.intel.com, joro@8bytes.org, will@kernel.org,
    robin.murphy@arm.com, iommu@lists.linux.dev
Subject: [RFC 1/3] iommu/intel: Use page->refcount to count number of entries in IOMMU
Date: Thu, 21 Dec 2023 03:19:13 +0000
Message-ID: <20231221031915.619337-2-pasha.tatashin@soleen.com>
In-Reply-To: <20231221031915.619337-1-pasha.tatashin@soleen.com>
References: <20231221031915.619337-1-pasha.tatashin@soleen.com>

In order to be able to efficiently free empty page table levels, count
the number of entries in each page table by incrementing and
decrementing the refcount every time a PTE is inserted into or removed
from the page table.

For this to work correctly, add two helper functions, dma_clear_pte()
and dma_set_pte(), in which the counting is performed. Also, modify the
code so that every page table entry is always updated through these two
new functions.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
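A note on the invariant this establishes: a page table page starts with a
refcount of 1, so after this patch its refcount is always 1 plus the number
of present entries, and "refcount == 1" means the table is empty. Below is
a minimal, self-contained userspace model of the scheme (an illustration,
not the kernel code: C11 atomics stand in for struct page's refcount and
page_ref_inc_return()/page_ref_dec_return(); all names are made up for the
sketch):

#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 512     /* one 4KiB table of 8-byte PTEs */

/*
 * Model of a page table page: its PTEs plus the owning page's refcount.
 * The page starts at refcount 1, so the invariant is
 * refcount == 1 + number-of-present-entries.
 */
struct pt_page {
        _Atomic uint64_t pte[ENTRIES];
        atomic_int refcount;
};

/* Set a PTE only if it was clear; bump the count on success. */
static uint64_t model_set_pte(struct pt_page *pg, int i, uint64_t val)
{
        uint64_t old = 0;

        if (atomic_compare_exchange_strong(&pg->pte[i], &old, val))
                atomic_fetch_add(&pg->refcount, 1);
        return old;     /* non-zero: someone else set it first */
}

/* Clear a PTE; drop the count if an entry was actually present. */
static void model_clear_pte(struct pt_page *pg, int i)
{
        if (atomic_exchange(&pg->pte[i], 0))
                atomic_fetch_sub(&pg->refcount, 1);
}

int main(void)
{
        static struct pt_page pg = { .refcount = 1 };   /* empty table */

        model_set_pte(&pg, 0, 0x1003);  /* low bits model the present bits */
        model_set_pte(&pg, 7, 0x2003);
        model_clear_pte(&pg, 0);
        model_clear_pte(&pg, 7);

        /* Back to 1: the table is empty and could be freed. */
        assert(atomic_load(&pg.refcount) == 1);
        printf("empty again, refcount == %d\n", atomic_load(&pg.refcount));
        return 0;
}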
 drivers/iommu/intel/iommu.c | 40 +++++++++++++++++++++---------------
 drivers/iommu/intel/iommu.h | 41 +++++++++++++++++++++++++++++++------
 2 files changed, 58 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 897159dba47d..4688ef797161 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -949,7 +949,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
 		if (domain->use_first_level)
 			pteval |= DMA_FL_PTE_XD | DMA_FL_PTE_US | DMA_FL_PTE_ACCESS;
 
-		if (cmpxchg64(&pte->val, 0ULL, pteval))
+		if (dma_set_pte(pte, pteval))
 			/* Someone else set it while we were thinking; use theirs. */
 			free_pgtable_page(tmp_page);
 		else
@@ -1021,7 +1021,8 @@ static void dma_pte_clear_range(struct dmar_domain *domain,
 			continue;
 		}
 		do {
-			dma_clear_pte(pte);
+			if (dma_pte_present(pte))
+				dma_clear_pte(pte);
 			start_pfn += lvl_to_nr_pages(large_page);
 			pte++;
 		} while (start_pfn <= last_pfn && !first_pte_in_page(pte));
@@ -1062,7 +1063,8 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
 	 */
 	if (level < retain_level && !(start_pfn > level_pfn ||
 	      last_pfn < level_pfn + level_size(level) - 1)) {
-		dma_clear_pte(pte);
+		if (dma_pte_present(pte))
+			dma_clear_pte(pte);
 		domain_flush_cache(domain, pte, sizeof(*pte));
 		free_pgtable_page(level_pte);
 	}
@@ -1093,12 +1095,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 	}
 }
 
-/* When a page at a given level is being unlinked from its parent, we don't
-   need to *modify* it at all. All we need to do is make a list of all the
-   pages which can be freed just as soon as we've flushed the IOTLB and we
-   know the hardware page-walk will no longer touch them.
-   The 'pte' argument is the *parent* PTE, pointing to the page that is to
-   be freed. */
+/*
+ * A given page at a given level is being unlinked from its parent.
+ * We need to make a list of all the pages which can be freed just as soon as
+ * we've flushed the IOTLB and we know the hardware page-walk will no longer
+ * touch them. The 'pte' argument is the *parent* PTE, pointing to the page
+ * that is to be freed.
+ */
 static void dma_pte_list_pagetables(struct dmar_domain *domain,
 				    int level, struct dma_pte *pte,
 				    struct list_head *freelist)
@@ -1106,17 +1109,20 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 	struct page *pg;
 
 	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
-	list_add_tail(&pg->lru, freelist);
-
-	if (level == 1)
-		return;
-
 	pte = page_address(pg);
+
 	do {
-		if (dma_pte_present(pte) && !dma_pte_superpage(pte))
-			dma_pte_list_pagetables(domain, level - 1, pte, freelist);
+		if (dma_pte_present(pte)) {
+			if (level > 1 && !dma_pte_superpage(pte)) {
+				dma_pte_list_pagetables(domain, level - 1, pte,
+							freelist);
+			}
+			dma_clear_pte(pte);
+		}
 		pte++;
 	} while (!first_pte_in_page(pte));
+
+	list_add_tail(&pg->lru, freelist);
 }
 
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
@@ -2244,7 +2250,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			/* We don't need lock here, nobody else
 			 * touches the iova range
 			 */
-			tmp = cmpxchg64_local(&pte->val, 0ULL, pteval);
+			tmp = dma_set_pte(pte, pteval);
 			if (tmp) {
 				static int dumps = 5;
 				pr_crit("ERROR: DMA PTE for vPFN 0x%lx already set (to %llx not %llx)\n",
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index ce030c5b5772..f1ea508f45bd 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -802,11 +802,6 @@ struct dma_pte {
 	u64 val;
 };
 
-static inline void dma_clear_pte(struct dma_pte *pte)
-{
-	pte->val = 0;
-}
-
 static inline u64 dma_pte_addr(struct dma_pte *pte)
 {
 #ifdef CONFIG_64BIT
@@ -818,9 +813,43 @@ static inline u64 dma_pte_addr(struct dma_pte *pte)
 #endif
 }
 
+#define DMA_PTEVAL_PRESENT(pteval) (((pteval) & 3) != 0)
 static inline bool dma_pte_present(struct dma_pte *pte)
 {
-	return (pte->val & 3) != 0;
+	return DMA_PTEVAL_PRESENT(pte->val);
+}
+
+static inline void dma_clear_pte(struct dma_pte *pte)
+{
+	u64 old_pteval;
+
+	old_pteval = xchg(&pte->val, 0ULL);
+	if (DMA_PTEVAL_PRESENT(old_pteval)) {
+		struct page *pg = virt_to_page(pte);
+		int rc = page_ref_dec_return(pg);
+
+		WARN_ON_ONCE(rc > 512 || rc < 1);
+	} else {
+		/* Ensure that we cleared a valid entry from the page table */
+		WARN_ON(1);
+	}
+}
+
+static inline u64 dma_set_pte(struct dma_pte *pte, u64 pteval)
+{
+	u64 old_pteval;
+
+	/* Ensure we are about to set a valid entry in the page table */
+	WARN_ON(!DMA_PTEVAL_PRESENT(pteval));
+	old_pteval = cmpxchg64(&pte->val, 0ULL, pteval);
+	if (old_pteval == 0) {
+		struct page *pg = virt_to_page(pte);
+		int rc = page_ref_inc_return(pg);
+
+		WARN_ON_ONCE(rc > 513 || rc < 2);
+	}
+
+	return old_pteval;
 }
 
 static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
-- 
2.43.0.472.g3155946c3a-goog

From nobody Fri Dec 19 14:09:17 2025
From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org, pasha.tatashin@soleen.com,
    linux-kernel@vger.kernel.org, rientjes@google.com, dwmw2@infradead.org,
    baolu.lu@linux.intel.com, joro@8bytes.org, will@kernel.org,
    robin.murphy@arm.com, iommu@lists.linux.dev
Subject: [RFC 2/3] iommu/intel: synchronize page table map and unmap operations
Date: Thu, 21 Dec 2023 03:19:14 +0000
Message-ID: <20231221031915.619337-3-pasha.tatashin@soleen.com>
In-Reply-To: <20231221031915.619337-1-pasha.tatashin@soleen.com>
References: <20231221031915.619337-1-pasha.tatashin@soleen.com>

Since we are going to update parent page table entries when lower-level
page tables become empty and are added to the free list, we need a way
to synchronize the operation.

Use domain->pgd_lock to protect all map and unmap operations. This is a
reader/writer lock. At the beginning everything runs in read-only mode;
later, when freeing page tables on unmap is added, a writer section
will be added as well.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
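The intended locking discipline, sketched as a standalone userspace model
(POSIX rwlocks instead of the kernel's rwlock_t; the function names and
placeholder bodies are illustrative, not the driver code):

#include <pthread.h>

/* Illustrative: one lock per domain, as in dmar_domain->pgd_lock. */
static pthread_rwlock_t pgd_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Map/unmap walkers may run concurrently; they all take the read side. */
static void walk_for_map_or_unmap(void)
{
        pthread_rwlock_rdlock(&pgd_lock);
        /* ... the pfn_to_dma_pte()/dma_pte_clear_level() walk goes here ... */
        pthread_rwlock_unlock(&pgd_lock);
}

/* Unlinking a shared interior table (added in patch 3) needs exclusion. */
static void free_empty_level(void)
{
        pthread_rwlock_wrlock(&pgd_lock);
        /* ... clear the parent PTE and move the page to a freelist ... */
        pthread_rwlock_unlock(&pgd_lock);
}

int main(void)
{
        walk_for_map_or_unmap();
        free_empty_level();
        return 0;
}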
 drivers/iommu/intel/iommu.c | 21 +++++++++++++++++++--
 drivers/iommu/intel/iommu.h |  3 +++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4688ef797161..733f25b277a3 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1082,11 +1082,13 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 				   unsigned long last_pfn,
 				   int retain_level)
 {
+	read_lock(&domain->pgd_lock);
 	dma_pte_clear_range(domain, start_pfn, last_pfn);
 
 	/* We don't need lock here; nobody else touches the iova range */
 	dma_pte_free_level(domain, agaw_to_level(domain->agaw), retain_level,
 			   domain->pgd, 0, start_pfn, last_pfn);
+	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -1179,9 +1181,11 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 	    WARN_ON(start_pfn > last_pfn))
 		return;
 
+	read_lock(&domain->pgd_lock);
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
 			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2217,6 +2221,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 
 	pteval = ((phys_addr_t)phys_pfn << VTD_PAGE_SHIFT) | attr;
 
+	read_lock(&domain->pgd_lock);
 	while (nr_pages > 0) {
 		uint64_t tmp;
 
@@ -2226,8 +2231,10 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 
 			pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,
 					     gfp);
-			if (!pte)
+			if (!pte) {
+				read_unlock(&domain->pgd_lock);
 				return -ENOMEM;
+			}
 			first_pte = pte;
 
 			lvl_pages = lvl_to_nr_pages(largepage_lvl);
@@ -2287,6 +2294,7 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			pte = NULL;
 		}
 	}
+	read_unlock(&domain->pgd_lock);
 
 	return 0;
 }
@@ -4013,6 +4021,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 	domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
 	if (!domain->pgd)
 		return -ENOMEM;
+	rwlock_init(&domain->pgd_lock);
 	domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
 	return 0;
 }
@@ -4247,11 +4256,15 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	unsigned long start_pfn, last_pfn;
 	int level = 0;
 
+	read_lock(&dmar_domain->pgd_lock);
 	/* Cope with horrid API which requires us to unmap more than the
 	   size argument if it happens to be a large-page mapping. */
 	if (unlikely(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
-				     &level, GFP_ATOMIC)))
+				     &level, GFP_ATOMIC))) {
+		read_unlock(&dmar_domain->pgd_lock);
 		return 0;
+	}
+	read_unlock(&dmar_domain->pgd_lock);
 
 	if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))
 		size = VTD_PAGE_SIZE << level_to_offset_bits(level);
@@ -4315,8 +4328,10 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 	int level = 0;
 	u64 phys = 0;
 
+	read_lock(&dmar_domain->pgd_lock);
 	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
 			     GFP_ATOMIC);
+	read_unlock(&dmar_domain->pgd_lock);
 	if (pte && dma_pte_present(pte))
 		phys = dma_pte_addr(pte) +
 			(iova & (BIT_MASK(level_to_offset_bits(level) +
@@ -4919,8 +4934,10 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 		struct dma_pte *pte;
 		int lvl = 0;
 
+		read_lock(&dmar_domain->pgd_lock);
 		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
 				     GFP_ATOMIC);
+		read_unlock(&dmar_domain->pgd_lock);
 		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
 		if (!pte || !dma_pte_present(pte)) {
 			iova += pgsize;
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index f1ea508f45bd..cb0577ec5166 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -618,6 +618,9 @@ struct dmar_domain {
 		struct {
 			/* virtual address */
 			struct dma_pte	*pgd;
+
+			/* Synchronizes pgd map/unmap operations */
+			rwlock_t	pgd_lock;
 			/* max guest address width */
 			int		gaw;
 			/*
-- 
2.43.0.472.g3155946c3a-goog

From nobody Fri Dec 19 14:09:17 2025
From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org, pasha.tatashin@soleen.com,
    linux-kernel@vger.kernel.org, rientjes@google.com, dwmw2@infradead.org,
    baolu.lu@linux.intel.com, joro@8bytes.org, will@kernel.org,
    robin.murphy@arm.com, iommu@lists.linux.dev
Subject: [RFC 3/3] iommu/intel: free empty page tables on unmaps
Date: Thu, 21 Dec 2023 03:19:15 +0000
Message-ID: <20231221031915.619337-4-pasha.tatashin@soleen.com>
In-Reply-To: <20231221031915.619337-1-pasha.tatashin@soleen.com>
References: <20231221031915.619337-1-pasha.tatashin@soleen.com>

When page tables become empty, add them to the freelist so that they
can also be freed. This means that page tables outside of the
immediate IOVA range might be freed as well; therefore, the writer
lock is taken only in the case where such page tables are actually
going to be freed.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
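The reader-to-writer promotion this patch performs inside
dma_pte_clear_level() is the subtle part. Below is a compressed userspace
model of just that dance (illustrative names only; a pthreads rwlock and a
C11 atomic stand in for the kernel's pgd_lock and page_count()/
page_ref_inc()):

#include <pthread.h>
#include <stdatomic.h>

static pthread_rwlock_t pgd_lock = PTHREAD_RWLOCK_INITIALIZER;

struct pt_page {
        atomic_int refcount;    /* 1 == empty table (see patch 1) */
};

/* Called with pgd_lock held for reading, when a child table looks empty. */
static void maybe_free_child(struct pt_page *npage)
{
        if (atomic_load(&npage->refcount) != 1)
                return;

        /* Pin the child so it survives while the reader lock is dropped. */
        atomic_fetch_add(&npage->refcount, 1);
        pthread_rwlock_unlock(&pgd_lock);

        pthread_rwlock_wrlock(&pgd_lock);
        /* Recheck under the writer lock: 2 == our pin plus the base count. */
        if (atomic_load(&npage->refcount) == 2) {
                /* ... clear the parent PTE, queue npage on the freelist ... */
        }
        pthread_rwlock_unlock(&pgd_lock);

        pthread_rwlock_rdlock(&pgd_lock);
        atomic_fetch_sub(&npage->refcount, 1);
}

int main(void)
{
        struct pt_page child = { .refcount = 1 };

        pthread_rwlock_rdlock(&pgd_lock);       /* unmap walk runs as a reader */
        maybe_free_child(&child);
        pthread_rwlock_unlock(&pgd_lock);
        return 0;
}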
 drivers/iommu/intel/iommu.c | 92 +++++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 733f25b277a3..141dc106fb01 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1130,7 +1130,7 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				struct dma_pte *pte, unsigned long pfn,
 				unsigned long start_pfn, unsigned long last_pfn,
-				struct list_head *freelist)
+				struct list_head *freelist, int *freed_level)
 {
 	struct dma_pte *first_pte = NULL, *last_pte = NULL;
 
@@ -1156,11 +1156,48 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				first_pte = pte;
 			last_pte = pte;
 		} else if (level > 1) {
+			struct dma_pte *npte = phys_to_virt(dma_pte_addr(pte));
+			struct page *npage = virt_to_page(npte);
+
 			/* Recurse down into a level that isn't *entirely* obsolete */
-			dma_pte_clear_level(domain, level - 1,
-					    phys_to_virt(dma_pte_addr(pte)),
+			dma_pte_clear_level(domain, level - 1, npte,
 					    level_pfn, start_pfn, last_pfn,
-					    freelist);
+					    freelist, freed_level);
+
+			/*
+			 * Free the next-level page table if it became empty.
+			 *
+			 * We are only holding the reader lock, and it is
+			 * possible that other threads are accessing the page
+			 * table as readers as well. We can free a page table
+			 * that is outside of the requested IOVA range only if
+			 * we grab the writer lock. Since we need to drop the
+			 * reader lock, we increment the refcount of npage so
+			 * that it (and the current page table) does not
+			 * disappear due to concurrent unmapping threads.
+			 *
+			 * Store the maximum size of the freed page table into
+			 * freed_level, so the size of the IOTLB flush can be
+			 * determined.
+			 */
+			if (freed_level && page_count(npage) == 1) {
+				page_ref_inc(npage);
+				read_unlock(&domain->pgd_lock);
+				write_lock(&domain->pgd_lock);
+				if (page_count(npage) == 2) {
+					dma_clear_pte(pte);
+
+					if (!first_pte)
+						first_pte = pte;
+
+					last_pte = pte;
+					list_add_tail(&npage->lru, freelist);
+					*freed_level = level;
+				}
+				write_unlock(&domain->pgd_lock);
+				read_lock(&domain->pgd_lock);
+				page_ref_dec(npage);
+			}
 		}
 next:
 		pfn = level_pfn + level_size(level);
@@ -1175,7 +1212,8 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
    the page tables, and may have cached the intermediate levels. The
    pages can only be freed after the IOTLB flush has been done. */
 static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
-			 unsigned long last_pfn, struct list_head *freelist)
+			 unsigned long last_pfn, struct list_head *freelist,
+			 int *level)
 {
 	if (WARN_ON(!domain_pfn_supported(domain, last_pfn)) ||
 	    WARN_ON(start_pfn > last_pfn))
@@ -1184,7 +1222,8 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 		return;
 
 	read_lock(&domain->pgd_lock);
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+			    domain->pgd, 0, start_pfn, last_pfn, freelist,
+			    level);
 	read_unlock(&domain->pgd_lock);
 
 	/* free pgd */
@@ -1524,11 +1563,11 @@ static void domain_flush_pasid_iotlb(struct intel_iommu *iommu,
 
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
 				  struct dmar_domain *domain,
-				  unsigned long pfn, unsigned int pages,
+				  unsigned long pfn, unsigned long pages,
 				  int ih, int map)
 {
-	unsigned int aligned_pages = __roundup_pow_of_two(pages);
-	unsigned int mask = ilog2(aligned_pages);
+	unsigned long aligned_pages = __roundup_pow_of_two(pages);
+	unsigned long mask = ilog2(aligned_pages);
 	uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
 	u16 did = domain_id_iommu(domain, iommu);
 
@@ -1872,7 +1911,8 @@ static void domain_exit(struct dmar_domain *domain)
 	if (domain->pgd) {
 		LIST_HEAD(freelist);
 
-		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
+		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist,
+			     NULL);
 		put_pages_list(&freelist);
 	}
 
@@ -3579,7 +3619,8 @@ static int intel_iommu_memory_notifier(struct notifier_block *nb,
 			struct intel_iommu *iommu;
 			LIST_HEAD(freelist);
 
-			domain_unmap(si_domain, start_vpfn, last_vpfn, &freelist);
+			domain_unmap(si_domain, start_vpfn, last_vpfn,
+				     &freelist, NULL);
 
 			rcu_read_lock();
 			for_each_active_iommu(iommu, drhd)
@@ -4253,6 +4294,7 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 				struct iommu_iotlb_gather *gather)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	bool queued = iommu_iotlb_gather_queued(gather);
 	unsigned long start_pfn, last_pfn;
 	int level = 0;
 
@@ -4272,7 +4314,16 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	start_pfn = iova >> VTD_PAGE_SHIFT;
 	last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
 
-	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist);
+	/*
+	 * Pass level only if !queued, which means we will do an iotlb
+	 * flush callback before freeing pages from the freelist.
+	 *
+	 * When level is passed, domain_unmap will attempt to add empty
+	 * page tables to the freelist, and pass back the level number of
+	 * the highest page table that was added to the freelist.
+	 */
+	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist,
+		     queued ? NULL : &level);
 
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;
@@ -4281,8 +4332,21 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	 * We do not use page-selective IOTLB invalidation in flush queue,
 	 * so there is no need to track page and sync iotlb.
 	 */
-	if (!iommu_iotlb_gather_queued(gather))
-		iommu_iotlb_gather_add_page(domain, gather, iova, size);
+	if (!queued) {
+		size_t sz = size;
+
+		/*
+		 * Increase iova and sz for flushing if level was returned,
+		 * as it means we are also freeing some page tables.
+		 */
+		if (level) {
+			unsigned long pgsize = level_size(level) << VTD_PAGE_SHIFT;
+
+			iova = ALIGN_DOWN(iova, pgsize);
+			sz = ALIGN(size, pgsize);
+		}
+		iommu_iotlb_gather_add_page(domain, gather, iova, sz);
+	}
 
 	return size;
 }
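To make the flush-widening arithmetic at the end of intel_iommu_unmap()
concrete, here is a small standalone example. It assumes the usual VT-d
geometry (4KiB base pages, 9 bits per level) and reimplements the kernel's
ALIGN_DOWN()/ALIGN() with plain bit masks; level_bytes() is a local helper,
not a kernel macro:

#include <stdint.h>
#include <stdio.h>

#define VTD_PAGE_SHIFT  12      /* 4KiB base pages */
#define LEVEL_STRIDE    9       /* 512 entries per table */

/* Bytes covered by one PTE at 'level' (level 1 == 4KiB). */
static uint64_t level_bytes(int level)
{
        return UINT64_C(1) << (VTD_PAGE_SHIFT + LEVEL_STRIDE * (level - 1));
}

int main(void)
{
        uint64_t iova = 0x201000, size = 0x3000;
        int freed_level = 2;                    /* a 2MiB-covering table was freed */
        uint64_t pg = level_bytes(freed_level); /* 0x200000 */

        /* Widen the invalidation to cover the freed table's whole range. */
        uint64_t flush_iova = iova & ~(pg - 1);          /* ALIGN_DOWN: 0x200000 */
        uint64_t flush_sz = (size + pg - 1) & ~(pg - 1); /* ALIGN:      0x200000 */

        printf("flush [%#llx, +%#llx)\n",
               (unsigned long long)flush_iova, (unsigned long long)flush_sz);
        return 0;
}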
-- 
2.43.0.472.g3155946c3a-goog