From: Leon Romanovsky
To: Jason Gunthorpe
Cc: linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
 Yishai Hadas, Or Har-Toov, Michael Guralnik, Edward Srouji
Subject: [PATCH rdma-next] IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
Date: Thu, 20 Nov 2025 16:49:28 +0200
Message-ID: <20251120-reduce-ksm-v1-1-6864bfc814dc@kernel.org>

From: Yishai Hadas

Enabling 5-level paging (LA57) increases TASK_SIZE on x86_64 from 2^47
to 2^56. This affects implicit ODP, which uses TASK_SIZE to calculate
the number of IMR KSM entries. As a result, the number of entries and
the memory usage for KSM mkeys increase drastically:

- With 2^47 TASK_SIZE: 0x20000 entries (~2MB)
- With 2^56 TASK_SIZE: 0x4000000 entries (~1GB)

Previously this issue could occur only on systems where LA57 was
enabled manually, but commit 7212b58d6d71 ("x86/mm/64: Make 5-level
paging support unconditional") now enables LA57 by default on all
supported systems, making the impact widespread.

To mitigate this, increase the size each MTT entry maps from 1GB to
16GB when 5-level paging is enabled. This reduces the number of KSM
entries and lowers the memory usage on LA57 systems from 1GB to 64MB
per IMR.
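For illustration, the arithmetic above can be reproduced with the
following standalone userspace sketch (not part of the patch). The
entry count is simply TASK_SIZE divided by the per-entry size; the
16-byte entry size is an assumption matching the indirect descriptor
(struct mlx5_ksm) layout:

/* Sketch: IMR KSM entry count and mkey memory vs. TASK_SIZE. */
#include <stdio.h>
#include <stdint.h>

#define KSM_ENTRY_BYTES 16ULL	/* assumed sizeof(struct mlx5_ksm) */

int main(void)
{
	static const struct { int task_bits, entry_shift; } c[] = {
		{ 47, 30 },	/* 4-level paging, 1GB per entry */
		{ 56, 30 },	/* 5-level paging, 1GB per entry */
		{ 56, 34 },	/* 5-level paging, 16GB per entry */
	};

	for (unsigned int i = 0; i < sizeof(c) / sizeof(c[0]); i++) {
		uint64_t entries = 1ULL << (c[i].task_bits - c[i].entry_shift);
		uint64_t mbytes = (entries * KSM_ENTRY_BYTES) >> 20;

		printf("TASK_SIZE 2^%d, entry 2^%d: %#llx entries (~%lluMB)\n",
		       c[i].task_bits, c[i].entry_shift,
		       (unsigned long long)entries,
		       (unsigned long long)mbytes);
	}
	return 0;
}

This prints the 0x20000/~2MB and 0x4000000/~1024MB figures quoted
above, and 0x400000/~64MB for the 16GB entry size chosen below.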
As 'mlx5_imr_mtt_size' can now be larger than 32 bits, switch the
'step' variable in populate_klm() from int to u64 to prevent overflow.

In addition, since populate_klm() actually handles KSM and not KLM (it
is used only by implicit ODP), rename it and the internal structures
accordingly, and drop the byte_count handling, which is not relevant
for KSM: the page size in KSM is fixed for all entries and comes from
the log_page_size of the mkey.

Note: on platforms where the calculated value of
'mlx5_imr_ksm_page_shift' exceeds the maximum the firmware allows to be
changed over UMR, or where the calculated value of 'log_va_pages' is
higher than expected, the implicit ODP capability is simply turned off.

Co-developed-by: Or Har-Toov
Signed-off-by: Or Har-Toov
Signed-off-by: Yishai Hadas
Reviewed-by: Michael Guralnik
Signed-off-by: Edward Srouji
Signed-off-by: Leon Romanovsky
---
 drivers/infiniband/hw/mlx5/odp.c | 89 +++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 6441abdf1f3b..e71ee3d52eb0 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -97,33 +97,28 @@ struct mlx5_pagefault {
  * a pagefault.
  */
 #define MMU_NOTIFIER_TIMEOUT 1000
 
-#define MLX5_IMR_MTT_BITS (30 - PAGE_SHIFT)
-#define MLX5_IMR_MTT_SHIFT (MLX5_IMR_MTT_BITS + PAGE_SHIFT)
-#define MLX5_IMR_MTT_ENTRIES BIT_ULL(MLX5_IMR_MTT_BITS)
-#define MLX5_IMR_MTT_SIZE BIT_ULL(MLX5_IMR_MTT_SHIFT)
-#define MLX5_IMR_MTT_MASK (~(MLX5_IMR_MTT_SIZE - 1))
-
-#define MLX5_KSM_PAGE_SHIFT MLX5_IMR_MTT_SHIFT
-
 static u64 mlx5_imr_ksm_entries;
+static u64 mlx5_imr_mtt_entries;
+static u64 mlx5_imr_mtt_size;
+static u8 mlx5_imr_mtt_shift;
+static u8 mlx5_imr_ksm_page_shift;
 
-static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
+static void populate_ksm(struct mlx5_ksm *pksm, size_t idx, size_t nentries,
 			 struct mlx5_ib_mr *imr, int flags)
 {
 	struct mlx5_core_dev *dev = mr_to_mdev(imr)->mdev;
-	struct mlx5_klm *end = pklm + nentries;
-	int step = MLX5_CAP_ODP(dev, mem_page_fault) ? MLX5_IMR_MTT_SIZE : 0;
+	struct mlx5_ksm *end = pksm + nentries;
+	u64 step = MLX5_CAP_ODP(dev, mem_page_fault) ? mlx5_imr_mtt_size : 0;
 	__be32 key = MLX5_CAP_ODP(dev, mem_page_fault) ?
 			     cpu_to_be32(imr->null_mmkey.key) :
 			     mr_to_mdev(imr)->mkeys.null_mkey;
 	u64 va =
-		MLX5_CAP_ODP(dev, mem_page_fault) ? idx * MLX5_IMR_MTT_SIZE : 0;
+		MLX5_CAP_ODP(dev, mem_page_fault) ? idx * mlx5_imr_mtt_size : 0;
 
 	if (flags & MLX5_IB_UPD_XLT_ZAP) {
-		for (; pklm != end; pklm++, idx++, va += step) {
-			pklm->bcount = cpu_to_be32(MLX5_IMR_MTT_SIZE);
-			pklm->key = key;
-			pklm->va = cpu_to_be64(va);
+		for (; pksm != end; pksm++, idx++, va += step) {
+			pksm->key = key;
+			pksm->va = cpu_to_be64(va);
 		}
 		return;
 	}
@@ -147,16 +142,15 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
 	 */
 	lockdep_assert_held(&to_ib_umem_odp(imr->umem)->umem_mutex);
 
-	for (; pklm != end; pklm++, idx++, va += step) {
+	for (; pksm != end; pksm++, idx++, va += step) {
 		struct mlx5_ib_mr *mtt = xa_load(&imr->implicit_children, idx);
 
-		pklm->bcount = cpu_to_be32(MLX5_IMR_MTT_SIZE);
 		if (mtt) {
-			pklm->key = cpu_to_be32(mtt->ibmr.lkey);
-			pklm->va = cpu_to_be64(idx * MLX5_IMR_MTT_SIZE);
+			pksm->key = cpu_to_be32(mtt->ibmr.lkey);
+			pksm->va = cpu_to_be64(idx * mlx5_imr_mtt_size);
 		} else {
-			pklm->key = key;
-			pklm->va = cpu_to_be64(va);
+			pksm->key = key;
+			pksm->va = cpu_to_be64(va);
 		}
 	}
 }
@@ -201,7 +195,7 @@ int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
 			 struct mlx5_ib_mr *mr, int flags)
 {
 	if (flags & MLX5_IB_UPD_XLT_INDIRECT) {
-		populate_klm(xlt, idx, nentries, mr, flags);
+		populate_ksm(xlt, idx, nentries, mr, flags);
 		return 0;
 	} else {
 		return populate_mtt(xlt, idx, nentries, mr, flags);
@@ -226,7 +220,7 @@ static void free_implicit_child_mr_work(struct work_struct *work)
 
 	mutex_lock(&odp_imr->umem_mutex);
 	mlx5r_umr_update_xlt(mr->parent,
-			     ib_umem_start(odp) >> MLX5_IMR_MTT_SHIFT, 1, 0,
+			     ib_umem_start(odp) >> mlx5_imr_mtt_shift, 1, 0,
 			     MLX5_IB_UPD_XLT_INDIRECT | MLX5_IB_UPD_XLT_ATOMIC);
 	mutex_unlock(&odp_imr->umem_mutex);
 	mlx5_ib_dereg_mr(&mr->ibmr, NULL);
@@ -237,7 +231,7 @@ static void destroy_unused_implicit_child_mr(struct mlx5_ib_mr *mr)
 {
 	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
-	unsigned long idx = ib_umem_start(odp) >> MLX5_IMR_MTT_SHIFT;
+	unsigned long idx = ib_umem_start(odp) >> mlx5_imr_mtt_shift;
 	struct mlx5_ib_mr *imr = mr->parent;
 
 	/*
@@ -425,7 +419,10 @@ static void internal_fill_odp_caps(struct mlx5_ib_dev *dev)
 	if (MLX5_CAP_GEN(dev->mdev, fixed_buffer_size) &&
 	    MLX5_CAP_GEN(dev->mdev, null_mkey) &&
 	    MLX5_CAP_GEN(dev->mdev, umr_extended_translation_offset) &&
-	    !MLX5_CAP_GEN(dev->mdev, umr_indirect_mkey_disabled))
+	    !MLX5_CAP_GEN(dev->mdev, umr_indirect_mkey_disabled) &&
+	    mlx5_imr_ksm_entries != 0 &&
+	    !(mlx5_imr_ksm_page_shift >
+	      get_max_log_entity_size_cap(dev, MLX5_MKC_ACCESS_MODE_KSM)))
 		caps->general_caps |= IB_ODP_SUPPORT_IMPLICIT;
 }
 
@@ -476,14 +473,14 @@ static struct mlx5_ib_mr *implicit_get_child_mr(struct mlx5_ib_mr *imr,
 	int err;
 
 	odp = ib_umem_odp_alloc_child(to_ib_umem_odp(imr->umem),
-				      idx * MLX5_IMR_MTT_SIZE,
-				      MLX5_IMR_MTT_SIZE, &mlx5_mn_ops);
+				      idx * mlx5_imr_mtt_size,
+				      mlx5_imr_mtt_size, &mlx5_mn_ops);
 	if (IS_ERR(odp))
 		return ERR_CAST(odp);
 
 	mr = mlx5_mr_cache_alloc(dev, imr->access_flags,
 				 MLX5_MKC_ACCESS_MODE_MTT,
-				 MLX5_IMR_MTT_ENTRIES);
+				 mlx5_imr_mtt_entries);
 	if (IS_ERR(mr)) {
 		ib_umem_odp_release(odp);
 		return mr;
@@ -495,7 +492,7 @@ static struct mlx5_ib_mr *implicit_get_child_mr(struct mlx5_ib_mr *imr,
 	mr->umem = &odp->umem;
 	mr->ibmr.lkey = mr->mmkey.key;
 	mr->ibmr.rkey = mr->mmkey.key;
-	mr->ibmr.iova = idx * MLX5_IMR_MTT_SIZE;
+	mr->ibmr.iova = idx * mlx5_imr_mtt_size;
 	mr->parent = imr;
 	odp->private = mr;
 
@@ -506,7 +503,7 @@ static struct mlx5_ib_mr *implicit_get_child_mr(struct mlx5_ib_mr *imr,
 	refcount_set(&mr->mmkey.usecount, 2);
 
 	err = mlx5r_umr_update_xlt(mr, 0,
-				   MLX5_IMR_MTT_ENTRIES,
+				   mlx5_imr_mtt_entries,
 				   PAGE_SHIFT,
 				   MLX5_IB_UPD_XLT_ZAP |
 				   MLX5_IB_UPD_XLT_ENABLE);
@@ -611,7 +608,7 @@ struct mlx5_ib_mr *mlx5_ib_alloc_implicit_mr(struct mlx5_ib_pd *pd,
 	struct mlx5_ib_mr *imr;
 	int err;
 
-	if (!mlx5r_umr_can_load_pas(dev, MLX5_IMR_MTT_ENTRIES * PAGE_SIZE))
+	if (!mlx5r_umr_can_load_pas(dev, mlx5_imr_mtt_entries * PAGE_SIZE))
 		return ERR_PTR(-EOPNOTSUPP);
 
 	umem_odp = ib_umem_odp_alloc_implicit(&dev->ib_dev, access_flags);
@@ -647,7 +644,7 @@ struct mlx5_ib_mr *mlx5_ib_alloc_implicit_mr(struct mlx5_ib_pd *pd,
 
 	err = mlx5r_umr_update_xlt(imr, 0,
 				   mlx5_imr_ksm_entries,
-				   MLX5_KSM_PAGE_SHIFT,
+				   mlx5_imr_ksm_page_shift,
 				   MLX5_IB_UPD_XLT_INDIRECT |
 				   MLX5_IB_UPD_XLT_ZAP |
 				   MLX5_IB_UPD_XLT_ENABLE);
@@ -750,20 +747,20 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
 				 struct ib_umem_odp *odp_imr, u64 user_va,
 				 size_t bcnt, u32 *bytes_mapped, u32 flags)
 {
-	unsigned long end_idx = (user_va + bcnt - 1) >> MLX5_IMR_MTT_SHIFT;
+	unsigned long end_idx = (user_va + bcnt - 1) >> mlx5_imr_mtt_shift;
 	unsigned long upd_start_idx = end_idx + 1;
 	unsigned long upd_len = 0;
 	unsigned long npages = 0;
 	int err;
 	int ret;
 
-	if (unlikely(user_va >= mlx5_imr_ksm_entries * MLX5_IMR_MTT_SIZE ||
-		     mlx5_imr_ksm_entries * MLX5_IMR_MTT_SIZE - user_va < bcnt))
+	if (unlikely(user_va >= mlx5_imr_ksm_entries * mlx5_imr_mtt_size ||
+		     mlx5_imr_ksm_entries * mlx5_imr_mtt_size - user_va < bcnt))
 		return -EFAULT;
 
 	/* Fault each child mr that intersects with our interval. */
 	while (bcnt) {
-		unsigned long idx = user_va >> MLX5_IMR_MTT_SHIFT;
+		unsigned long idx = user_va >> mlx5_imr_mtt_shift;
 		struct ib_umem_odp *umem_odp;
 		struct mlx5_ib_mr *mtt;
 		u64 len;
@@ -1924,9 +1921,25 @@ void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *dev)
 
 int mlx5_ib_odp_init(void)
 {
+	u32 log_va_pages = ilog2(TASK_SIZE) - PAGE_SHIFT;
+	u8 mlx5_imr_mtt_bits;
+
+	/* 48 is default ARM64 VA space and covers X86 4-level paging which is 47 */
+	if (log_va_pages <= 48 - PAGE_SHIFT)
+		mlx5_imr_mtt_shift = 30;
+	/* 56 is x86-64, 5-level paging */
+	else if (log_va_pages <= 56 - PAGE_SHIFT)
+		mlx5_imr_mtt_shift = 34;
+	else
+		return 0;
+
+	mlx5_imr_mtt_size = BIT_ULL(mlx5_imr_mtt_shift);
+	mlx5_imr_mtt_bits = mlx5_imr_mtt_shift - PAGE_SHIFT;
+	mlx5_imr_mtt_entries = BIT_ULL(mlx5_imr_mtt_bits);
 	mlx5_imr_ksm_entries = BIT_ULL(get_order(TASK_SIZE) -
-				       MLX5_IMR_MTT_BITS);
+				       mlx5_imr_mtt_bits);
 
+	mlx5_imr_ksm_page_shift = mlx5_imr_mtt_shift;
 	return 0;
 }
 

---
base-commit: d056bc45b62b5981ebcd18c4303a915490b8ebe9
change-id: 20251103-reduce-ksm-a091ca606e8b

Best regards,
-- 
Leon Romanovsky