From: Zhangyuhao
To: David Hildenbrand, Andrew Morton, Jason Gunthorpe, John Hubbard, Peter Xu, Joerg Roedel, Will Deacon, Robin Murphy
CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, iommu@lists.linux.dev
Subject: RE: Issues with Pinning User Pages for SVA on IOMMUs Lacking IOPF
Date: Tue, 2 Sep 2025 13:04:51 +0000
Message-ID: <54be598d93404e7185ecfbe49f7fe93c@huawei.com>
References: <08c869037c6d4bd2bd9bc5e2bcfb7e7c@huawei.com>

[Adding the linux-kernel mailing list for visibility]

Best,
Yuhao

-----Original Message-----
From: David Hildenbrand
Sent: Monday, September 1, 2025 10:34 PM
To: Zhangyuhao; Andrew Morton; Jason Gunthorpe; John Hubbard; Peter Xu; Joerg Roedel; Will Deacon; Robin Murphy
Subject: Re: Issues with Pinning User Pages for SVA on IOMMUs Lacking IOPF

On 01.09.25 15:43, Zhangyuhao wrote:
> Hello Linux kernel community,

Hi,

>
> Current IOMMU SVA support relies on hardware IOPF (IO Page Fault). We have observed that certain IOMMU devices do not support IOPF,
> but we are still exploring how to enable SVA in such scenarios.
>
> To address this, we attempted to pin memory to prevent device accesses from triggering IO page faults.
>
> Solution 1: User-space mlock + madvise(MADV_POPULATE_WRITE)
>
>     if (madvise(buf, size, MADV_POPULATE_WRITE) != 0) {
>         free(buf);
>         return 1;
>     }
>     if (mlock(buf, size) != 0) {
>         free(buf);
>         return 1;
>     }
>
> Result: Page faults still occurred, due to page migration.

Yes, NUMA hinting might similarly affect this (even when the page is not migrated).

>
> Solution 2: Kernel-space pin via ioctl
>
>     ret = pin_user_pages_fast(cur_base, npages, FOLL_LONGTERM, page_list);
>
> Result: Page faults occurred occasionally, traced to NUMA balancing marking pages as invalid.

Ah, there you talk about NUMA balancing.

>
> To solve the problem, we used FOLL_LONGTERM | FOLL_HONOR_NUMA_FAULT to pin user pages.
>

See prot_numa_skip(): we skip DMA-pinned folios in COW mappings only. So if you had a !COW mapping (e.g., MAP_SHARED shmem), that wouldn't work reliably, I think.

I think we could change that without causing too much harm.

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 113b489858341..17809c8604f25 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -137,8 +137,11 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
 		goto skip;
 
 	/* Also skip shared copy-on-write pages */
-	if (is_cow_mapping(vma->vm_flags) &&
-	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
+	if (is_cow_mapping(vma->vm_flags) && folio_maybe_mapped_shared(folio))
+		goto skip;
+
+	/* Folios that are pinned cannot be migrated either way. */
+	if (folio_maybe_dma_pinned(folio))
 		goto skip;
 
 	/*

> This approach has been tested and successfully prevents IO page faults so far.
>
> We would like guidance from the community:
>
> Can this approach reliably prevent all IO page faults?

See the case above regarding non-COW mappings.

We essentially need to make sure that we don't (temporarily) unmap for migration/reclaim/split/whatever if a folio is maybe pinned.
We back out in all cases (unexpected reference), but we'll have to sanity-check whether we reject maybe_pinned folios early, to not temporarily unmap.

>
> Is there a better or recommended method to pin user pages for SVA?

Most use cases use longterm pinnings and then configure the IOMMU manually. Then, it does not really matter what happens to your process page tables.

So your use case is rather new :)

But yes, a longterm pinning while resolving NUMA-hinting faults should in theory work.

We just have to make sure that everybody else plays nice early with DMA-pinned folios.

--
Cheers

David / dhildenb