From: Tianyu Lan
To: hch@infradead.org, m.szyprowski@samsung.com, robin.murphy@arm.com,
	michael.h.kelley@microsoft.com, kys@microsoft.com
Cc: Tianyu Lan, iommu@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org, vkuznets@redhat.com,
	brijesh.singh@amd.com, konrad.wilk@oracle.com, hch@lst.de,
	wei.liu@kernel.org, parri.andrea@gmail.com, thomas.lendacky@amd.com,
	linux-hyperv@vger.kernel.org, andi.kleen@intel.com,
	kirill.shutemov@intel.com, Andi Kleen
Subject: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
Date: Thu, 28 Apr 2022 10:14:28 -0400
Message-Id: <20220428141429.1637028-2-ltykernel@gmail.com>
In-Reply-To: <20220428141429.1637028-1-ltykernel@gmail.com>
References: <20220428141429.1637028-1-ltykernel@gmail.com>

From: Tianyu Lan

Traditionally, swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb has only a
single lock, and under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock.

This patch splits the swiotlb into individual areas, each with its own
lock. Map/allocate requests are spread evenly across the areas, and each
allocation is freed back to the area it was taken from. This prepares
for resolving the overhead of the single spinlock among a device's
queues: each device may later get its own io tlb mem and bounce buffer
pool.

The idea comes from Andi Kleen's patch
(https://github.com/intel/tdx/commit/4529b5784c141782c72ec9bd9a92df2b68cb7d45),
reworked here so that it can also work for an individual device's io tlb
mem. The device driver may choose the number of areas according to the
device's queue count.

Based-on-idea-by: Andi Kleen
Signed-off-by: Tianyu Lan
---
 include/linux/swiotlb.h |  25 ++++++
 kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++---------
 2 files changed, 162 insertions(+), 36 deletions(-)
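To make the intended locking model concrete before the diff, here is a
minimal userspace sketch of the scheme this patch adopts, not the kernel
code itself: the slot array is split into areas, each guarded by its own
lock, and allocations start from a round-robin cursor so concurrent
mappings usually take different locks. All names and sizes below are
illustrative.

#include <pthread.h>
#include <stdatomic.h>

#define NSLABS      4096
#define NUM_AREAS   8			/* must be a power of 2 */
#define AREA_NSLABS (NSLABS / NUM_AREAS)

struct area {
	pthread_spinlock_t lock;
	unsigned long used;		/* slots allocated in this area */
	unsigned int index;		/* next search start in this area */
};

static struct area areas[NUM_AREAS];
static atomic_uint area_start;		/* round-robin cursor */

static void init_areas(void)
{
	for (int i = 0; i < NUM_AREAS; i++)
		pthread_spin_init(&areas[i].lock, PTHREAD_PROCESS_PRIVATE);
}

/* Try the preferred area first, then fall over to the others. */
static int alloc_slots(unsigned int nslots)
{
	unsigned int start = atomic_fetch_add(&area_start, 1) % NUM_AREAS;
	unsigned int i = start;

	do {
		struct area *a = &areas[i];

		pthread_spin_lock(&a->lock);
		if (AREA_NSLABS - a->used >= nslots) {
			/* Global slot = area base + offset inside area. */
			int slot = i * AREA_NSLABS + a->index;

			a->index = (a->index + nslots) % AREA_NSLABS;
			a->used += nslots;
			pthread_spin_unlock(&a->lock);
			return slot;
		}
		pthread_spin_unlock(&a->lock);
		i = (i + 1) % NUM_AREAS;
	} while (i != start);

	return -1;			/* every area is full */
}

int main(void)
{
	init_areas();
	return alloc_slots(4) < 0;	/* expect a slot from some area */
}

The point of the round-robin cursor is that two CPUs mapping at the same
time normally start in different areas and never touch the same lock;
the fallback loop only matters when an area is exhausted.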
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..489c249da434 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -62,6 +62,24 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
 
+/**
+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used: The number of used IO TLB slots in this area.
+ * @area_index: The index of this area in the pool.
+ * @index: The slot index to start searching in this area for the next
+ *	round.
+ * @lock: The lock to protect the above data structures in the map and
+ *	unmap calls.
+ */
+struct io_tlb_area {
+	unsigned long used;
+	unsigned int area_index;
+	unsigned int index;
+	spinlock_t lock;
+};
+
 /**
  * struct io_tlb_mem - IO TLB Memory Pool Descriptor
  *
@@ -89,6 +107,9 @@ extern enum swiotlb_force swiotlb_force;
  * @late_alloc:	%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:	%true if the pool is used for memory allocation
+ * @num_areas: The number of areas in the pool.
+ * @area_start: The area index to start searching in the next round.
+ * @area_nslabs: The number of slots in each area.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -102,6 +123,10 @@ struct io_tlb_mem {
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
+	unsigned int num_areas;
+	unsigned int area_start;
+	unsigned int area_nslabs;
+	struct io_tlb_area *areas;
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e2ef0864eb1e..00a16f540f20 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -62,6 +62,8 @@
 
 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
 
+#define NUM_AREAS_DEFAULT 1
+
 static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
@@ -70,6 +72,25 @@ struct io_tlb_mem io_tlb_default_mem;
 phys_addr_t swiotlb_unencrypted_base;
 
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
+static unsigned long default_area_num = NUM_AREAS_DEFAULT;
+
+static int swiotlb_setup_areas(struct io_tlb_mem *mem,
+		unsigned int num_areas, unsigned long nslabs)
+{
+	if (nslabs < 1 || !is_power_of_2(num_areas)) {
+		pr_err("swiotlb: Invalid number of areas %u.\n", num_areas);
+		return -EINVAL;
+	}
+
+	/*
+	 * The pool is split evenly; the last area ends up smaller than
+	 * the rest if nslabs is not a multiple of num_areas.
+	 */
+	mem->area_start = 0;
+	mem->num_areas = num_areas;
+	mem->area_nslabs = nslabs / num_areas;
+	return 0;
+}
 
 static int __init
 setup_io_tlb_npages(char *str)
@@ -114,6 +135,8 @@ void __init swiotlb_adjust_size(unsigned long size)
 		return;
 	size = ALIGN(size, IO_TLB_SIZE);
 	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
+			    default_nslabs);
 	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
@@ -195,7 +218,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 				    unsigned long nslabs, bool late_alloc)
 {
 	void *vaddr = phys_to_virt(start);
-	unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
 
 	mem->nslabs = nslabs;
 	mem->start = start;
@@ -206,8 +230,13 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	if (swiotlb_force_bounce)
 		mem->force_bounce = true;
 
-	spin_lock_init(&mem->lock);
-	for (i = 0; i < mem->nslabs; i++) {
+	for (i = 0, j = 0; i < mem->nslabs; i++) {
+		if (!(i % mem->area_nslabs)) {
+			mem->areas[j].index = 0;
+			spin_lock_init(&mem->areas[j].lock);
+			j++;
+		}
+
 		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
@@ -272,6 +301,13 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
 
+	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
+			    default_nslabs);
+	mem->areas = memblock_alloc(sizeof(struct io_tlb_area) * mem->num_areas,
+				    SMP_CACHE_BYTES);
+	if (!mem->areas)
+		panic("%s: Failed to allocate mem->areas.\n", __func__);
+
 	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
 	mem->force_bounce = flags & SWIOTLB_FORCE;
 
@@ -296,7 +332,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
 	unsigned long bytes;
 	unsigned char *vstart = NULL;
-	unsigned int order;
+	unsigned int order, area_order;
 	int rc = 0;
 
 	if (swiotlb_force_disable)
@@ -334,18 +370,32 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 		goto retry;
 	}
 
+	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num, nslabs);
+
+	area_order = get_order(array_size(sizeof(*mem->areas),
+					  default_area_num));
+	mem->areas = (struct io_tlb_area *)
+		__get_free_pages(GFP_KERNEL | __GFP_ZERO, area_order);
+	if (!mem->areas)
+		goto error_area;
+
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
-	if (!mem->slots) {
-		free_pages((unsigned long)vstart, order);
-		return -ENOMEM;
-	}
+	if (!mem->slots)
+		goto error_slots;
 
 	set_memory_decrypted((unsigned long)vstart, bytes >> PAGE_SHIFT);
 	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, true);
 
 	swiotlb_print_info();
 	return 0;
+
+error_slots:
+	free_pages((unsigned long)mem->areas, area_order);
+error_area:
+	free_pages((unsigned long)vstart, order);
+	return -ENOMEM;
 }
 
 void __init swiotlb_exit(void)
@@ -353,6 +403,7 @@ void __init swiotlb_exit(void)
 	struct io_tlb_mem *mem = &io_tlb_default_mem;
 	unsigned long tbl_vaddr;
 	size_t tbl_size, slots_size;
+	unsigned int area_order;
 
 	if (swiotlb_force_bounce)
 		return;
@@ -367,9 +418,14 @@ void __init swiotlb_exit(void)
 
 	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
 	if (mem->late_alloc) {
+		area_order = get_order(array_size(sizeof(*mem->areas),
+						  mem->num_areas));
+		free_pages((unsigned long)mem->areas, area_order);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
+		memblock_free_late(__pa(mem->areas),
+				   mem->num_areas * sizeof(struct io_tlb_area));
 		memblock_free_late(mem->start, tbl_size);
 		memblock_free_late(__pa(mem->slots), slots_size);
 	}
@@ -472,9 +528,9 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
 	return nr_slots(boundary_mask + 1);
 }
 
-static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
+static unsigned int wrap_area_index(struct io_tlb_mem *mem, unsigned int index)
 {
-	if (index >= mem->nslabs)
+	if (index >= mem->area_nslabs)
 		return 0;
 	return index;
 }
@@ -483,10 +539,13 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
 /*
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-			      size_t alloc_size, unsigned int alloc_align_mask)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+				 struct io_tlb_area *area,
+				 int area_index,
+				 struct device *dev, phys_addr_t orig_addr,
+				 size_t alloc_size,
+				 unsigned int alloc_align_mask)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
 		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -497,8 +556,11 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	unsigned int index, wrap, count = 0, i;
 	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
 	unsigned long flags;
+	unsigned int slot_base;
+	unsigned int slot_index;
 
 	BUG_ON(!nslots);
+	BUG_ON(area_index >= mem->num_areas);
 
 	/*
 	 * For mappings with an alignment requirement don't bother looping to
@@ -510,16 +572,20 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		stride = max(stride, stride << (PAGE_SHIFT - IO_TLB_SHIFT));
 	stride = max(stride, (alloc_align_mask >> IO_TLB_SHIFT) + 1);
 
-	spin_lock_irqsave(&mem->lock, flags);
-	if (unlikely(nslots > mem->nslabs - mem->used))
+	spin_lock_irqsave(&area->lock, flags);
+	if (unlikely(nslots > mem->area_nslabs - area->used))
 		goto not_found;
 
-	index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
+	slot_base = area_index * mem->area_nslabs;
+	index = wrap = wrap_area_index(mem, ALIGN(area->index, stride));
+
 	do {
+		slot_index = slot_base + index;
+
 		if (orig_addr &&
-		    (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
-			(orig_addr & iotlb_align_mask)) {
-			index = wrap_index(mem, index + 1);
+		    (slot_addr(tbl_dma_addr, slot_index) &
+		     iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
+			index = wrap_area_index(mem, index + 1);
 			continue;
 		}
 
@@ -528,26 +594,26 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		 * contiguous buffers, we allocate the buffers from that slot
 		 * and mark the entries as '0' indicating unavailable.
 		 */
-		if (!iommu_is_span_boundary(index, nslots,
+		if (!iommu_is_span_boundary(slot_index, nslots,
 					    nr_slots(tbl_dma_addr),
 					    max_slots)) {
-			if (mem->slots[index].list >= nslots)
+			if (mem->slots[slot_index].list >= nslots)
 				goto found;
 		}
-		index = wrap_index(mem, index + stride);
+		index = wrap_area_index(mem, index + stride);
 	} while (index != wrap);
 
 not_found:
-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);
 	return -1;
 
 found:
-	for (i = index; i < index + nslots; i++) {
+	for (i = slot_index; i < slot_index + nslots; i++) {
 		mem->slots[i].list = 0;
 		mem->slots[i].alloc_size =
-			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
+			alloc_size - (offset + ((i - slot_index) << IO_TLB_SHIFT));
 	}
-	for (i = index - 1;
+	for (i = slot_index - 1;
 	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
 	     mem->slots[i].list; i--)
 		mem->slots[i].list = ++count;
@@ -555,14 +621,45 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	/*
 	 * Update the indices to avoid searching in the next round.
 	 */
-	if (index + nslots < mem->nslabs)
-		mem->index = index + nslots;
+	if (index + nslots < mem->area_nslabs)
+		area->index = index + nslots;
 	else
-		mem->index = 0;
-	mem->used += nslots;
+		area->index = 0;
+	area->used += nslots;
+	spin_unlock_irqrestore(&area->lock, flags);
+	return slot_index;
+}
 
-	spin_unlock_irqrestore(&mem->lock, flags);
-	return index;
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+			      size_t alloc_size, unsigned int alloc_align_mask)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	int start, i, index;
+
+	i = start = mem->area_start;
+	mem->area_start = (mem->area_start + 1) % mem->num_areas;
+
+	do {
+		index = swiotlb_do_find_slots(mem, mem->areas + i, i,
+					      dev, orig_addr, alloc_size,
+					      alloc_align_mask);
+		if (index >= 0)
+			return index;
+		if (++i >= mem->num_areas)
+			i = 0;
+	} while (i != start);
+
+	return -1;
+}
+
+static unsigned long mem_used(struct io_tlb_mem *mem)
+{
+	int i;
+	unsigned long used = 0;
+
+	for (i = 0; i < mem->num_areas; i++)
+		used += mem->areas[i].used;
+	return used;
 }
 
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
@@ -594,7 +691,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-				 alloc_size, mem->nslabs, mem->used);
+				 alloc_size, mem->nslabs, mem_used(mem));
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
@@ -624,6 +721,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
 	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
+	int aindex = index / mem->area_nslabs;
+	struct io_tlb_area *area = &mem->areas[aindex];
 	int count, i;
 
 	/*
@@ -632,7 +731,9 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	 * While returning the entries to the free list, we merge the entries
 	 * with slots below and above the pool being returned.
 	 */
-	spin_lock_irqsave(&mem->lock, flags);
+	BUG_ON(aindex >= mem->num_areas);
+
+	spin_lock_irqsave(&area->lock, flags);
 	if (index + nslots < ALIGN(index + 1, IO_TLB_SEGSIZE))
 		count = mem->slots[index + nslots].list;
 	else
@@ -656,8 +757,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
 	     mem->slots[i].list; i--)
 		mem->slots[i].list = ++count;
-	mem->used -= nslots;
-	spin_unlock_irqrestore(&mem->lock, flags);
+	area->used -= nslots;
+	spin_unlock_irqrestore(&area->lock, flags);
 }
 
 /*
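As a quick sanity check of the area indexing used above
(swiotlb_release_slots() derives the owning area from the global slot
index), the mapping is plain integer arithmetic. A standalone
illustration with made-up sizes, not kernel code:

#include <assert.h>

#define NSLABS      32768		/* e.g. 64 MB pool of 2 KB slots */
#define NUM_AREAS   8
#define AREA_NSLABS (NSLABS / NUM_AREAS)

int main(void)
{
	unsigned int slot   = 9000;		  /* global slot index   */
	unsigned int aindex = slot / AREA_NSLABS; /* owning area: 2      */
	unsigned int local  = slot % AREA_NSLABS; /* offset in that area */

	/* slot_base + local offset recovers the global index. */
	assert(aindex * AREA_NSLABS + local == slot);
	return 0;
}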
-- 
2.25.1


From: Tianyu Lan
To: hch@infradead.org, m.szyprowski@samsung.com, robin.murphy@arm.com,
	michael.h.kelley@microsoft.com, kys@microsoft.com
Cc: Tianyu Lan, iommu@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org, vkuznets@redhat.com,
	brijesh.singh@amd.com, konrad.wilk@oracle.com, hch@lst.de,
	wei.liu@kernel.org, parri.andrea@gmail.com, thomas.lendacky@amd.com,
	linux-hyperv@vger.kernel.org, andi.kleen@intel.com,
	kirill.shutemov@intel.com
Subject: [RFC PATCH 2/2] swiotlb: Add device bounce buffer allocation interface
Date: Thu, 28 Apr 2022 10:14:29 -0400
Message-Id: <20220428141429.1637028-3-ltykernel@gmail.com>
In-Reply-To: <20220428141429.1637028-1-ltykernel@gmail.com>
References: <20220428141429.1637028-1-ltykernel@gmail.com>

From: Tianyu Lan
In SEV/TDX confidential VMs, device DMA transactions need to use the
swiotlb bounce buffer to share data with the host/hypervisor. The
swiotlb spinlock introduces overhead among devices if they share the
same io tlb mem. To avoid this, introduce swiotlb_device_allocate() to
allocate a per-device bounce buffer from the default io tlb pool and to
set up areas according to the input queue number. A device may have
multiple IO queues, and setting up the same number of io tlb areas may
help to resolve the spinlock overhead among those queues.

Introduce an IO TLB block unit (2MB) for allocating big bounce buffers
from the default pool for devices; the IO TLB segment (256KB) is too
small for that purpose.

Signed-off-by: Tianyu Lan
---
 include/linux/swiotlb.h |  33 ++++++++
 kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 203 insertions(+), 3 deletions(-)
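As a usage sketch only (a hypothetical driver, not part of this
series): a driver with per-queue DMA could carve out its own bounce
pool at probe time, sized in IO_TLB_BLOCK_UNIT granularity, and hand it
back on remove. Only swiotlb_device_allocate() and
swiotlb_device_free() come from this patch; the device, queue count and
pool size are made up.

#include <linux/device.h>
#include <linux/swiotlb.h>

#define FOO_NR_QUEUES	8	/* hypothetical hardware queue count */

static int foo_probe(struct device *dev)
{
	int ret;

	/*
	 * One io_tlb_area per hardware queue; 16 MB is rounded up
	 * internally to whole IO_TLB_BLOCK_UNIT (2 MB) blocks.
	 */
	ret = swiotlb_device_allocate(dev, FOO_NR_QUEUES, 16 * 1024 * 1024);
	if (ret)
		/* Not fatal: the device keeps using the default pool. */
		dev_warn(dev, "no private bounce pool, using default swiotlb\n");

	return 0;
}

static void foo_remove(struct device *dev)
{
	/* Return the blocks to the default pool. */
	swiotlb_device_free(dev);
}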
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 489c249da434..380bd1ce3d0f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -31,6 +31,14 @@ struct scatterlist;
 #define IO_TLB_SHIFT 11
 #define IO_TLB_SIZE (1 << IO_TLB_SHIFT)
 
+/*
+ * IO TLB block as the device bounce buffer allocation unit. This
+ * allows a device to allocate a bounce buffer from the default io
+ * tlb pool.
+ */
+#define IO_TLB_BLOCKSIZE   (8 * IO_TLB_SEGSIZE)
+#define IO_TLB_BLOCK_UNIT  (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)
+
 /* default to 64MB */
 #define IO_TLB_DEFAULT_SIZE (64UL<<20)
 
@@ -72,11 +80,13 @@ extern enum swiotlb_force swiotlb_force;
  * @index: The slot index to start searching in this area for the next
  *	round.
  * @lock: The lock to protect the above data structures in the map and
  *	unmap calls.
+ * @block_index: The block index to start searching in this area for
+ *	the next round.
  */
 struct io_tlb_area {
 	unsigned long used;
 	unsigned int area_index;
 	unsigned int index;
+	unsigned int block_index;
 	spinlock_t lock;
 };
 
@@ -110,6 +120,7 @@ struct io_tlb_area {
  * @num_areas: The number of areas in the pool.
  * @area_start: The area index to start searching in the next round.
  * @area_nslabs: The number of slots in each area.
+ * @area_block_number: The number of blocks in each area.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -126,7 +137,14 @@ struct io_tlb_mem {
 	unsigned int num_areas;
 	unsigned int area_start;
 	unsigned int area_nslabs;
+	unsigned int area_block_number;
+	struct io_tlb_mem *parent;
 	struct io_tlb_area *areas;
+	struct io_tlb_block {
+		size_t alloc_size;
+		unsigned long start_slot;
+		unsigned int list;
+	} *block;
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
@@ -155,6 +173,10 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size);
+void swiotlb_device_free(struct device *dev);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -187,6 +209,17 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+
+static inline void swiotlb_device_free(struct device *dev)
+{
+}
+
+static inline int swiotlb_device_allocate(struct device *dev,
+					  unsigned int area_num,
+					  unsigned long size)
+{
+	return -ENOMEM;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 00a16f540f20..7b95a140694a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -218,7 +218,8 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 				    unsigned long nslabs, bool late_alloc)
 {
 	void *vaddr = phys_to_virt(start);
-	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
+	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j, k;
+	unsigned int block_list;
 
 	mem->nslabs = nslabs;
@@ -226,6 +226,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	mem->end = mem->start + bytes;
 	mem->index = 0;
 	mem->late_alloc = late_alloc;
+	mem->area_block_number = nslabs / (IO_TLB_BLOCKSIZE * mem->num_areas);
 
 	if (swiotlb_force_bounce)
 		mem->force_bounce = true;
 
@@ -233,10 +234,19 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
-	for (i = 0, j = 0; i < mem->nslabs; i++) {
+	for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
 		if (!(i % mem->area_nslabs)) {
 			mem->areas[j].index = 0;
+			mem->areas[j].block_index = 0;
 			spin_lock_init(&mem->areas[j].lock);
+			block_list = mem->area_block_number;
 			j++;
 		}
 
+		if (!(i % IO_TLB_BLOCKSIZE)) {
+			mem->block[k].alloc_size = 0;
+			mem->block[k].list = block_list--;
+			k++;
+		}
+
 		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
@@ -308,6 +317,12 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 	if (!mem->areas)
 		panic("%s: Failed to allocate mem->areas.\n", __func__);
 
+	mem->block = memblock_alloc(sizeof(struct io_tlb_block) *
+				    (default_nslabs / IO_TLB_BLOCKSIZE),
+				    SMP_CACHE_BYTES);
+	if (!mem->block)
+		panic("%s: Failed to allocate mem->block.\n", __func__);
+
 	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
 	mem->force_bounce = flags & SWIOTLB_FORCE;
 
@@ -332,7 +347,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
 	unsigned long bytes;
 	unsigned char *vstart = NULL;
-	unsigned int order, area_order;
+	unsigned int order, area_order, block_order;
 	int rc = 0;
 
 	if (swiotlb_force_disable)
@@ -380,6 +395,13 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	if (!mem->areas)
 		goto error_area;
 
+	block_order = get_order(array_size(sizeof(*mem->block),
+					   nslabs / IO_TLB_BLOCKSIZE));
+	mem->block = (struct io_tlb_block *)
+		__get_free_pages(GFP_KERNEL | __GFP_ZERO, block_order);
+	if (!mem->block)
+		goto error_block;
+
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
 	if (!mem->slots)
@@ -392,6 +414,8 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	return 0;
 
 error_slots:
+	free_pages((unsigned long)mem->block, block_order);
+error_block:
 	free_pages((unsigned long)mem->areas, area_order);
 error_area:
 	free_pages((unsigned long)vstart, order);
@@ -403,7 +427,7 @@ void __init swiotlb_exit(void)
 	struct io_tlb_mem *mem = &io_tlb_default_mem;
 	unsigned long tbl_vaddr;
 	size_t tbl_size, slots_size;
-	unsigned int area_order;
+	unsigned int area_order, block_order;
 
 	if (swiotlb_force_bounce)
 		return;
@@ -421,6 +445,9 @@ void __init swiotlb_exit(void)
 		area_order = get_order(array_size(sizeof(*mem->areas),
 						  mem->num_areas));
 		free_pages((unsigned long)mem->areas, area_order);
+		block_order = get_order(array_size(sizeof(*mem->block),
+						   mem->nslabs / IO_TLB_BLOCKSIZE));
+		free_pages((unsigned long)mem->block, block_order);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
@@ -863,6 +890,146 @@ static int __init __maybe_unused swiotlb_create_default_debugfs(void)
 late_initcall(swiotlb_create_default_debugfs);
 #endif
 
+static void swiotlb_free_block(struct io_tlb_mem *mem,
+			       phys_addr_t start, unsigned int block_num)
+{
+	unsigned int start_slot = (start - mem->start) >> IO_TLB_SHIFT;
+	unsigned int area_index = start_slot / mem->area_nslabs;
+	unsigned int block_index = start_slot / IO_TLB_BLOCKSIZE;
+	unsigned int area_block_index = block_index % mem->area_block_number;
+	struct io_tlb_area *area = &mem->areas[area_index];
+	unsigned long flags;
+	int count, i;
+
+	spin_lock_irqsave(&area->lock, flags);
+	if (area_block_index + block_num < mem->area_block_number)
+		count = mem->block[block_index + block_num].list;
+	else
+		count = 0;
+
+	for (i = block_index + block_num - 1; i >= (int)block_index; i--) {
+		mem->block[i].list = ++count;
+		/* Todo: recover slot->list and alloc_size here. */
+	}
+
+	/* Merge with any free blocks below in the same area. */
+	for (i = block_index - 1;
+	     i >= (int)(block_index - area_block_index) && mem->block[i].list;
+	     i--)
+		mem->block[i].list = ++count;
+
+	spin_unlock_irqrestore(&area->lock, flags);
+}
+
+void swiotlb_device_free(struct device *dev)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *parent_mem = mem->parent;
+
+	swiotlb_free_block(parent_mem, mem->start,
+			   mem->nslabs / IO_TLB_BLOCKSIZE);
+	dev->dma_io_tlb_mem = parent_mem;
+	kfree(mem->block);
+	kfree(mem->areas);
+	kfree(mem->slots);
+	kfree(mem);
+}
+
+static struct page *swiotlb_alloc_block(struct io_tlb_mem *mem,
+					unsigned int block_num)
+{
+	unsigned int area_index, block_index, nslot;
+	phys_addr_t tlb_addr;
+	struct io_tlb_area *area;
+	unsigned long flags;
+	int i, j;
+
+	if (!mem || !mem->block)
+		return NULL;
+
+	area_index = mem->area_start;
+	mem->area_start = (mem->area_start + 1) % mem->num_areas;
+	area = &mem->areas[area_index];
+
+	spin_lock_irqsave(&area->lock, flags);
+	block_index = area_index * mem->area_block_number + area->block_index;
+
+	/* Todo: Search more blocks. */
+	if (mem->block[block_index].list < block_num) {
+		spin_unlock_irqrestore(&area->lock, flags);
+		return NULL;
+	}
+
+	/* Update the block and slot free lists. */
+	for (i = block_index; i < block_index + block_num; i++) {
+		mem->block[i].list = 0;
+		for (j = 0; j < IO_TLB_BLOCKSIZE; j++) {
+			nslot = i * IO_TLB_BLOCKSIZE + j;
+			mem->slots[nslot].list = 0;
+			mem->slots[nslot].alloc_size = IO_TLB_SIZE;
+		}
+	}
+
+	area->block_index += block_num;
+	area->used += block_num * IO_TLB_BLOCKSIZE;
+	spin_unlock_irqrestore(&area->lock, flags);
+
+	tlb_addr = slot_addr(mem->start, block_index * IO_TLB_BLOCKSIZE);
+	return pfn_to_page(PFN_DOWN(tlb_addr));
+}
+
+/*
+ * swiotlb_device_allocate - Allocate a bounce buffer for a device from
+ * the default io tlb pool. The allocation size is rounded up to whole
+ * IO_TLB_BLOCK_UNITs.
+ */
+int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size)
+{
+	struct io_tlb_mem *mem, *parent_mem = dev->dma_io_tlb_mem;
+	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_BLOCKSIZE);
+	struct page *page;
+	int ret = -ENOMEM;
+
+	page = swiotlb_alloc_block(parent_mem, nslabs / IO_TLB_BLOCKSIZE);
+	if (!page)
+		return -ENOMEM;
+
+	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	if (!mem)
+		goto error_mem;
+
+	mem->slots = kzalloc(array_size(sizeof(*mem->slots), nslabs),
+			     GFP_KERNEL);
+	if (!mem->slots)
+		goto error_slots;
+
+	ret = swiotlb_setup_areas(mem, area_num, nslabs);
+	if (ret)
+		goto error_areas;
+
+	ret = -ENOMEM;
+	mem->areas = kcalloc(area_num, sizeof(struct io_tlb_area),
+			     GFP_KERNEL);
+	if (!mem->areas)
+		goto error_areas;
+
+	mem->block = kcalloc(nslabs / IO_TLB_BLOCKSIZE,
+			     sizeof(struct io_tlb_block),
+			     GFP_KERNEL);
+	if (!mem->block)
+		goto error_block;
+
+	swiotlb_init_io_tlb_mem(mem, page_to_phys(page), nslabs, true);
+	mem->force_bounce = true;
+	mem->for_alloc = true;
+
+	mem->vaddr = parent_mem->vaddr + page_to_phys(page) - parent_mem->start;
+	mem->parent = parent_mem;
+	dev->dma_io_tlb_mem = mem;
+	return 0;
+
+error_block:
+	kfree(mem->areas);
+error_areas:
+	kfree(mem->slots);
+error_slots:
+	kfree(mem);
+error_mem:
+	swiotlb_free_block(parent_mem, page_to_phys(page),
+			   nslabs / IO_TLB_BLOCKSIZE);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(swiotlb_device_allocate);
+
 #ifdef CONFIG_DMA_RESTRICTED_POOL
 
 struct page *swiotlb_alloc(struct device *dev, size_t size)
-- 
2.25.1
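For reference, the size relationships behind the block concept in this
patch, restated outside the series as compile-time checks. The constants
mirror the kernel's definitions (IO_TLB_SEGSIZE is 128 in
include/linux/swiotlb.h); the checks themselves are illustrative.

#include <assert.h>

#define IO_TLB_SHIFT	  11				/* 2 KB slot     */
#define IO_TLB_SIZE	  (1 << IO_TLB_SHIFT)
#define IO_TLB_SEGSIZE	  128				/* slots/segment */
#define IO_TLB_BLOCKSIZE  (8 * IO_TLB_SEGSIZE)		/* slots/block   */
#define IO_TLB_BLOCK_UNIT (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)

/* A segment is 128 * 2 KB = 256 KB; a block is 8 segments = 2 MB. */
static_assert(IO_TLB_SEGSIZE * IO_TLB_SIZE == 256 * 1024, "segment size");
static_assert(IO_TLB_BLOCK_UNIT == 2 * 1024 * 1024, "block size");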