[PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

D. Wythe posted 3 patches 2 weeks, 1 day ago
Posted by D. Wythe 2 weeks, 1 day ago
find_vm_area() provides a way to find the vm_struct associated with a
virtual address. Export this symbol to modules so that modularized
subsystems can perform lookups on vmalloc addresses.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 mm/vmalloc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..3eb9fe761c34 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
 
 	return va->vm;
 }
+EXPORT_SYMBOL_GPL(find_vm_area);
 
 /**
  * remove_vm_area - find and remove a continuous kernel virtual area
-- 
2.45.0
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Uladzislau Rezki 2 weeks ago
On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> find_vm_area() provides a way to find the vm_struct associated with a
> virtual address. Export this symbol to modules so that modularized
> subsystems can perform lookups on vmalloc addresses.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  mm/vmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ecbac900c35f..3eb9fe761c34 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
>  
>  	return va->vm;
>  }
> +EXPORT_SYMBOL_GPL(find_vm_area);
>  
This is internal. We can not just export it.

--
Uladzislau Rezki
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 2 weeks ago
On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > find_vm_area() provides a way to find the vm_struct associated with a
> > virtual address. Export this symbol to modules so that modularized
> > subsystems can perform lookups on vmalloc addresses.
> > 
> > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > ---
> >  mm/vmalloc.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index ecbac900c35f..3eb9fe761c34 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> >  
> >  	return va->vm;
> >  }
> > +EXPORT_SYMBOL_GPL(find_vm_area);
> >  
> This is internal. We can not just export it.
> 
> --
> Uladzislau Rezki

Hi Uladzislau,

Thank you for the feedback. I agree that we should avoid exposing
internal implementation details like struct vm_struct to external
subsystems.

Following Christoph's suggestion, I'm planning to encapsulate the page
order lookup into a minimal helper instead:

unsigned int vmalloc_page_order(const void *addr)
{
	struct vm_struct *vm;

	vm = find_vm_area(addr);
	return vm ? vm->page_order : 0;
}
EXPORT_SYMBOL_GPL(vmalloc_page_order);

Does this approach look reasonable to you? It would keep the vm_struct
layout private while satisfying the optimization needs of SMC.

Thanks,
D. Wythe
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Uladzislau Rezki 1 week, 6 days ago
Hello, D. Wythe!

> On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > find_vm_area() provides a way to find the vm_struct associated with a
> > > virtual address. Export this symbol to modules so that modularized
> > > subsystems can perform lookups on vmalloc addresses.
> > > 
> > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > ---
> > >  mm/vmalloc.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index ecbac900c35f..3eb9fe761c34 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > >  
> > >  	return va->vm;
> > >  }
> > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > >  
> > This is internal. We can not just export it.
> > 
> > --
> > Uladzislau Rezki
> 
> Hi Uladzislau,
> 
> Thank you for the feedback. I agree that we should avoid exposing
> internal implementation details like struct vm_struct to external
> subsystems.
> 
> Following Christoph's suggestion, I'm planning to encapsulate the page
> order lookup into a minimal helper instead:
> 
> unsigned int vmalloc_page_order(const void *addr){
> 	struct vm_struct *vm;
>  	vm = find_vm_area(addr);
> 	return vm ? vm->page_order : 0;
> }
> EXPORT_SYMBOL_GPL(vmalloc_page_order);
> 
> Does this approach look reasonable to you? It would keep the vm_struct
> layout private while satisfying the optimization needs of SMC.
> 
Could you please clarify why you need info about page_order? I have not
looked at your second patch.

Thanks!

--
Uladzislau Rezki
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 6 days ago
On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote:
> Hello, D. Wythe!
> 
> > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > > find_vm_area() provides a way to find the vm_struct associated with a
> > > > virtual address. Export this symbol to modules so that modularized
> > > > subsystems can perform lookups on vmalloc addresses.
> > > > 
> > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > > ---
> > > >  mm/vmalloc.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index ecbac900c35f..3eb9fe761c34 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > > >  
> > > >  	return va->vm;
> > > >  }
> > > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > > >  
> > > This is internal. We can not just export it.
> > > 
> > > --
> > > Uladzislau Rezki
> > 
> > Hi Uladzislau,
> > 
> > Thank you for the feedback. I agree that we should avoid exposing
> > internal implementation details like struct vm_struct to external
> > subsystems.
> > 
> > Following Christoph's suggestion, I'm planning to encapsulate the page
> > order lookup into a minimal helper instead:
> > 
> > unsigned int vmalloc_page_order(const void *addr){
> > 	struct vm_struct *vm;
> >  	vm = find_vm_area(addr);
> > 	return vm ? vm->page_order : 0;
> > }
> > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > 
> > Does this approach look reasonable to you? It would keep the vm_struct
> > layout private while satisfying the optimization needs of SMC.
> > 
> Could you please clarify why you need info about page_order? I have not
> looked at your second patch.
> 
> Thanks!
> 
> --
> Uladzislau Rezki

Hi Uladzislau,

This stems from optimizing memory registration in SMC-R. To provide the
RDMA hardware with direct access to memory buffers, we must register
them with the NIC. During this process, the hardware generates one MTT
entry for each physically contiguous block. Since these hardware entries
are a finite and scarce resource, and SMC currently defaults to a 4KB
registration granularity, a single 2MB buffer consumes 512 entries. In
high-concurrency scenarios, this inefficiency quickly exhausts NIC
resources and becomes a major bottleneck for system scalability.

To address this, we intend to use vmalloc_huge(). When it successfully
allocates high-order pages, the vmalloc area is backed by a sequence of
physically contiguous chunks (e.g., 2MB each). If we know this
page_order, we can register these larger physical blocks instead of
individual 4KB pages, reducing MTT consumption from 512 entries down to
1 for every 2MB of memory (with page_order == 9).

However, the result of vmalloc_huge() is currently opaque to the caller.
We cannot determine whether it successfully allocated huge pages or fell
back to 4KB pages based solely on the returned pointer. Therefore, we
need a helper function to query the actual page order, enabling SMC-R to
adapt its registration logic to the underlying physical layout.

I hope this clarifies our design motivation!

Best regards,
D. Wythe
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Leon Romanovsky 1 week, 3 days ago
On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote:
> On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote:
> > Hello, D. Wythe!
> > 
> > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > > > find_vm_area() provides a way to find the vm_struct associated with a
> > > > > virtual address. Export this symbol to modules so that modularized
> > > > > subsystems can perform lookups on vmalloc addresses.
> > > > > 
> > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > > > ---
> > > > >  mm/vmalloc.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index ecbac900c35f..3eb9fe761c34 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > > > >  
> > > > >  	return va->vm;
> > > > >  }
> > > > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > > > >  
> > > > This is internal. We can not just export it.
> > > > 
> > > > --
> > > > Uladzislau Rezki
> > > 
> > > Hi Uladzislau,
> > > 
> > > Thank you for the feedback. I agree that we should avoid exposing
> > > internal implementation details like struct vm_struct to external
> > > subsystems.
> > > 
> > > Following Christoph's suggestion, I'm planning to encapsulate the page
> > > order lookup into a minimal helper instead:
> > > 
> > > unsigned int vmalloc_page_order(const void *addr){
> > > 	struct vm_struct *vm;
> > >  	vm = find_vm_area(addr);
> > > 	return vm ? vm->page_order : 0;
> > > }
> > > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > > 
> > > Does this approach look reasonable to you? It would keep the vm_struct
> > > layout private while satisfying the optimization needs of SMC.
> > > 
> > Could you please clarify why you need info about page_order? I have not
> > looked at your second patch.
> > 
> > Thanks!
> > 
> > --
> > Uladzislau Rezki
> 
> Hi Uladzislau,
> 
> This stems from optimizing memory registration in SMC-R. To provide the
> RDMA hardware with direct access to memory buffers, we must register
> them with the NIC. During this process, the hardware generates one MTT
> entry for each physically contiguous block. Since these hardware entries
> are a finite and scarce resource, and SMC currently defaults to a 4KB
> registration granularity, a single 2MB buffer consumes 512 entries. In
> high-concurrency scenarios, this inefficiency quickly exhausts NIC
> resources and becomes a major bottleneck for system scalability.

I believe this complexity can be avoided by using the RDMA MR pool API,
as other ULPs do, for example NVMe.

Thanks

> 
> To address this, we intend to use vmalloc_huge(). When it successfully
> allocates high-order pages, the vmalloc area is backed by a sequence of
> physically contiguous chunks (e.g., 2MB each). If we know this
> page_order, we can register these larger physical blocks instead of
> individual 4KB pages, reducing MTT consumption from 512 entries down to
> 1 for every 2MB of memory (with page_order == 9).
> 
> However, the result of vmalloc_huge() is currently opaque to the caller.
> We cannot determine whether it successfully allocated huge pages or fell
> back to 4KB pages based solely on the returned pointer. Therefore, we
> need a helper function to query the actual page order, enabling SMC-R to
> adapt its registration logic to the underlying physical layout.
> 
> I hope this clarifies our design motivation!
> 
> Best regards,
> D. Wythe
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 3 days ago
On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote:
> On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote:
> > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote:
> > > Hello, D. Wythe!
> > > 
> > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > > > > find_vm_area() provides a way to find the vm_struct associated with a
> > > > > > virtual address. Export this symbol to modules so that modularized
> > > > > > subsystems can perform lookups on vmalloc addresses.
> > > > > > 
> > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > > > > ---
> > > > > >  mm/vmalloc.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index ecbac900c35f..3eb9fe761c34 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > > > > >  
> > > > > >  	return va->vm;
> > > > > >  }
> > > > > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > > > > >  
> > > > > This is internal. We can not just export it.
> > > > > 
> > > > > --
> > > > > Uladzislau Rezki
> > > > 
> > > > Hi Uladzislau,
> > > > 
> > > > Thank you for the feedback. I agree that we should avoid exposing
> > > > internal implementation details like struct vm_struct to external
> > > > subsystems.
> > > > 
> > > > Following Christoph's suggestion, I'm planning to encapsulate the page
> > > > order lookup into a minimal helper instead:
> > > > 
> > > > unsigned int vmalloc_page_order(const void *addr){
> > > > 	struct vm_struct *vm;
> > > >  	vm = find_vm_area(addr);
> > > > 	return vm ? vm->page_order : 0;
> > > > }
> > > > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > > > 
> > > > Does this approach look reasonable to you? It would keep the vm_struct
> > > > layout private while satisfying the optimization needs of SMC.
> > > > 
> > > Could you please clarify why you need info about page_order? I have not
> > > looked at your second patch.
> > > 
> > > Thanks!
> > > 
> > > --
> > > Uladzislau Rezki
> > 
> > Hi Uladzislau,
> > 
> > This stems from optimizing memory registration in SMC-R. To provide the
> > RDMA hardware with direct access to memory buffers, we must register
> > them with the NIC. During this process, the hardware generates one MTT
> > entry for each physically contiguous block. Since these hardware entries
> > are a finite and scarce resource, and SMC currently defaults to a 4KB
> > registration granularity, a single 2MB buffer consumes 512 entries. In
> > high-concurrency scenarios, this inefficiency quickly exhausts NIC
> > resources and becomes a major bottleneck for system scalability.
> 
> I believe this complexity can be avoided by using the RDMA MR pool API,
> as other ULPs do, for example NVMe.
> 
> Thanks
> 

Hi Leon,

Am I correct in assuming you are suggesting mr_pool to limit the number
of MRs as a way to cap MTTE consumption?

However, our goal is to maximize the total registered memory within
the MTTE limits rather than to cap it. In SMC-R, each connection
occupies a configurable, fixed-size registered buffer; consequently,
the more memory we can register, the more concurrent connections
we can support.

By leveraging vmalloc_huge() and the proposed helper to increase the
page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
physical block. This significantly reduces the total number of entries
required to map the same amount of memory, allowing us to serve more
connections under the same hardware constraints.

D. Wythe
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Jason Gunthorpe 1 week, 2 days ago
On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote:

> By leveraging vmalloc_huge() and the proposed helper to increase the
> page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
> physical block.

This doesn't seem right. If your goal is to take a vmalloc() pointer
and convert it to an MR via a scatterlist and ib_map_mr_sg(), then you
should be asking for a helper to convert a kernel pointer into a
scatterlist.

Even if you do this in a naive way and call the
sg_alloc_append_table_from_pages() function it will automatically join
physically contiguous ranges together for you.

From there you can check the resulting scatterlist and compute the
page_size to pass to ib_map_mr_sg().

No need to ask the MM for anything other than the list of physicals to
build the scatterlist with.

Still, I wouldn't mind seeing a helper to convert a kernel pointer
into a scatterlist, because I have seen that open-coded in a few
places, and maybe there are ways to optimize it using more information
from the MM; but any such APIs should be used only by this helper, not
exposed to drivers.

Jason
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 1 day ago
On Wed, Jan 28, 2026 at 02:06:29PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote:
> 
> > By leveraging vmalloc_huge() and the proposed helper to increase the
> > page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
> > physical block.
> 
> This doesn't seem right, if your goal is to take a vmalloc() pointer
> and convert it to a MR via a scatterlist and ib_map_mr_sg() then you
> should be asking for a helper to convert a kernel pointer into a
> scatterlist.
> 
> Even if you do this in a naive way and call the
> sg_alloc_append_table_from_pages() function it will automatically join
> physically contiguous ranges together for you.
> 
> From there you can check the resulting scatterlist and compute the
> page_size to pass to ib_map_mr_sg().
> 
> No need to ask the MM for anything other than the list of physicals to
> build the scatterlist with.
> 
> Still, I wouldn't mind seeing a helper to convert a kernel pointer
> into a scatterlist because I have see that opencoded in a few places,
> and maybe there are ways to optimize that using more information from
> the MM - but it should be APIs used only by this helper not exposed to
> drivers.
> 
> Jason

Hi Jason,

To be honest, I was previously unaware of the
sg_alloc_append_table_from_pages() function, although I had indeed
considered manually calculating the size of contiguous physical blocks.
The reason I proposed the MM helper is that SMC is not a driver; it
utilizes vmalloc() for memory allocation and is thus in direct contact
with the MM. From this perspective, having the MM provide the
page_order seemed the most straightforward approach.

Given the significant opposition and our plans to transition SMC to newer
APIs in the future anyway, I agree that introducing this helper now is
less justified.

I will follow your suggestion and update the next version accordingly.

Thanks.
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Jason Gunthorpe 1 week, 1 day ago
On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote:

> > From there you can check the resulting scatterlist and compute the
> > page_size to pass to ib_map_mr_sg().

I should clarify that this is done after DMA mapping the scatterlist;
DMA mapping can improve the page size.

And maybe the core code should help compute the MR's target page size
for a scatterlist. We already have code to do this in umem, and it is
pretty tricky considering the IOVA-related rules.

Jason
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 1 day ago
On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote:
> 
> > > From there you can check the resulting scatterlist and compute the
> > > page_size to pass to ib_map_mr_sg().
> 
> I should clarify this is done after DMA mapping the scatterlist. dma
> mapping can improve the page size.
> 
> And maybe the core code should be helping compute the MR's target page
> size for a scatterlist.. We already have code to do this in umem, and
> it is a pretty bit tricky considering the IOVA related rules.
>

Hi Jason,

After a deep dive into ib_umem_find_best_pgsz(), I have to say it is
much more subtle than it first appears. The IOVA-to-PA relative offset
rules, in particular, make it quite easy to get wrong.

While SMC could duplicate this logic, it is certainly not ideal for
maintenance. Are there any plans to refactor this into a generic RDMA
core helper—for instance, one that can determine the best page size
directly from an sg_table or scatterlist?

Best regards,
D. Wythe

Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Jason Gunthorpe 1 week ago
On Fri, Jan 30, 2026 at 04:51:31PM +0800, D. Wythe wrote:
> On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote:
> > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote:
> > 
> > > > From there you can check the resulting scatterlist and compute the
> > > > page_size to pass to ib_map_mr_sg().
> > 
> > I should clarify this is done after DMA mapping the scatterlist. dma
> > mapping can improve the page size.
> > 
> > And maybe the core code should be helping compute the MR's target page
> > size for a scatterlist.. We already have code to do this in umem, and
> > it is a pretty bit tricky considering the IOVA related rules.
> >
> 
> Hi Jason,
> 
> After a deep dive into ib_umem_find_best_pgsz(), I have to say it is
> much more subtle than it first appears. The IOVA-to-PA relative offset
> rules, in particular, make it quite easy to get wrong.
> 
> While SMC could duplicate this logic, it is certainly not ideal for
> maintenance. Are there any plans to refactor this into a generic RDMA
> core helper—for instance, one that can determine the best page size
> directly from an sg_table or scatterlist?

I have not heard of anyone touching this.

It looks like there are only two users in the kernel that pass
something other than PAGE_SIZE, so it seems nobody has cared about
this till now.

With high order folios being more common it seems like something
missing.

However, I wonder what the drivers do with the input page size;
segmenting a scatterlist is a bit hard, and we have helpers for that
already too.

It is a bigger project but probably the right thing is to remove the
page size input, wrap the scatterlist in a umem and fixup the drivers
to use the existing umem support for building mtts, splitting
scatterlists into blocks and so on.

The kernel side here has been left alone for a long time..

Jason
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 4 days, 1 hour ago
On Fri, Jan 30, 2026 at 11:16:36AM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 30, 2026 at 04:51:31PM +0800, D. Wythe wrote:
> > On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote:
> > > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote:
> > > 
> > > > > From there you can check the resulting scatterlist and compute the
> > > > > page_size to pass to ib_map_mr_sg().
> > > 
> > > I should clarify this is done after DMA mapping the scatterlist. dma
> > > mapping can improve the page size.
> > > 
> > > And maybe the core code should be helping compute the MR's target page
> > > size for a scatterlist.. We already have code to do this in umem, and
> > > it is a pretty bit tricky considering the IOVA related rules.
> > >
> > 
> > Hi Jason,
> > 
> > After a deep dive into ib_umem_find_best_pgsz(), I have to say it is
> > much more subtle than it first appears. The IOVA-to-PA relative offset
> > rules, in particular, make it quite easy to get wrong.
> > 
> > While SMC could duplicate this logic, it is certainly not ideal for
> > maintenance. Are there any plans to refactor this into a generic RDMA
> > core helper—for instance, one that can determine the best page size
> > directly from an sg_table or scatterlist?
> 
> I have not heard of anyone touching this.
> 
> It looks like there are only two users in the kernel that pass
> something other than PAGE_SIZE, so it seems nobody has cared about
> this till now.
> 
> With high order folios being more common it seems like something
> missing.
> 
> However, I wonder what the drivers do with the input page size, 
> segmenting a scatterlist is a bit hard and we have helpers for that
> already too.
> 
> It is a bigger project but probably the right thing is to remove the
> page size input, wrap the scatterlist in a umem and fixup the drivers
> to use the existing umem support for building mtts, splitting
> scatterlists into blocks and so on.
> 
> The kernel side here has been left alone for a long time..

I am also curious about the original design intent behind requiring the 
caller to explicitly pass `page_size`. From what I can see, its primary 
role is to define the memory size per MTTE, but calculating the optimal 
value is surprisingly complex.

I completely agree that providing an automatic way to optimize or 
calculate the best page size should be the responsibility of the drivers
or the RDMA core themselves. Handling such low-level hardware-related 
details in a ULP like SMC feels misplaced.

Since it appears this isn't a high-priority issue for the community at
the moment, and a proper fix requires a much larger architectural effort 
in the RDMA core, I will withdraw this patch series. 

I'll keep an eye on the RDMA subsystem's progress and see if a more 
generic solution emerges in the future.

Thanks,
D. Wythe


Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Leon Romanovsky 1 week, 2 days ago
On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote:
> On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote:
> > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote:
> > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote:
> > > > Hello, D. Wythe!
> > > > 
> > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > > > > > find_vm_area() provides a way to find the vm_struct associated with a
> > > > > > > virtual address. Export this symbol to modules so that modularized
> > > > > > > subsystems can perform lookups on vmalloc addresses.
> > > > > > > 
> > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > > > > > ---
> > > > > > >  mm/vmalloc.c | 1 +
> > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > 
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index ecbac900c35f..3eb9fe761c34 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > > > > > >  
> > > > > > >  	return va->vm;
> > > > > > >  }
> > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > > > > > >  
> > > > > > This is internal. We can not just export it.
> > > > > > 
> > > > > > --
> > > > > > Uladzislau Rezki
> > > > > 
> > > > > Hi Uladzislau,
> > > > > 
> > > > > Thank you for the feedback. I agree that we should avoid exposing
> > > > > internal implementation details like struct vm_struct to external
> > > > > subsystems.
> > > > > 
> > > > > Following Christoph's suggestion, I'm planning to encapsulate the page
> > > > > order lookup into a minimal helper instead:
> > > > > 
> > > > > unsigned int vmalloc_page_order(const void *addr){
> > > > > 	struct vm_struct *vm;
> > > > >  	vm = find_vm_area(addr);
> > > > > 	return vm ? vm->page_order : 0;
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > > > > 
> > > > > Does this approach look reasonable to you? It would keep the vm_struct
> > > > > layout private while satisfying the optimization needs of SMC.
> > > > > 
> > > > Could you please clarify why you need info about page_order? I have not
> > > > looked at your second patch.
> > > > 
> > > > Thanks!
> > > > 
> > > > --
> > > > Uladzislau Rezki
> > > 
> > > Hi Uladzislau,
> > > 
> > > This stems from optimizing memory registration in SMC-R. To provide the
> > > RDMA hardware with direct access to memory buffers, we must register
> > > them with the NIC. During this process, the hardware generates one MTT
> > > entry for each physically contiguous block. Since these hardware entries
> > > are a finite and scarce resource, and SMC currently defaults to a 4KB
> > > registration granularity, a single 2MB buffer consumes 512 entries. In
> > > high-concurrency scenarios, this inefficiency quickly exhausts NIC
> > > resources and becomes a major bottleneck for system scalability.
> > 
> > I believe this complexity can be avoided by using the RDMA MR pool API,
> > as other ULPs do, for example NVMe.
> > 
> > Thanks
> > 
> 
> Hi Leon,
> 
> Am I correct in assuming you are suggesting mr_pool to limit the number
> of MRs as a way to cap MTTE consumption?

I don't see this as a limit, but as something that is considered
standard practice to reduce MTT consumption.

> 
> However, our goal is to maximize the total registered memory within
> the MTTE limits rather than to cap it. In SMC-R, each connection
> occupies a configurable, fixed-size registered buffer; consequently,
> the more memory we can register, the more concurrent connections
> we can support.

It is not a cap, but a more efficient use of existing resources.

> 
> By leveraging vmalloc_huge() and the proposed helper to increase the
> page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
> physical block. This significantly reduces the total number of entries
> required to map the same amount of memory, allowing us to serve more
> connections under the same hardware constraints
> 
> D. Wythe
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 2 days ago
On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote:
> On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote:
> > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote:
> > > > [...]
> > > > 
> > > > Hi Uladzislau,
> > > > 
> > > > This stems from optimizing memory registration in SMC-R. To provide the
> > > > RDMA hardware with direct access to memory buffers, we must register
> > > > them with the NIC. During this process, the hardware generates one MTT
> > > > entry for each physically contiguous block. Since these hardware entries
> > > > are a finite and scarce resource, and SMC currently defaults to a 4KB
> > > > registration granularity, a single 2MB buffer consumes 512 entries. In
> > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC
> > > > resources and becomes a major bottleneck for system scalability.
> > > 
> > > I believe this complexity can be avoided by using the RDMA MR pool API,
> > > as other ULPs do, for example NVMe.
> > > 
> > > Thanks
> > > 
> > 
> > Hi Leon,
> > 
> > Am I correct in assuming you are suggesting mr_pool to limit the number
> > of MRs as a way to cap MTTE consumption?
> 
> I don't see this a limit, but something that is considered standard
> practice to reduce MTT consumption.
> 
> > 
> > However, our goal is to maximize the total registered memory within
> > the MTTE limits rather than to cap it. In SMC-R, each connection
> > occupies a configurable, fixed-size registered buffer; consequently,
> > the more memory we can register, the more concurrent connections
> > we can support.
> 
> It is not cap, but more efficient use of existing resources.

Got it. While an MR pool may be more standard practice, it doesn't
address our specific bottleneck. In fact, SMC already has its own internal
MR reuse; our core issue remains reducing MTTE consumption by increasing the
registration granularity to maximize the memory mapped per MTT entry.
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Leon Romanovsky 1 week, 2 days ago
On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote:
> [...]
> Got it. While an MR pool may be more standard practice, it doesn't
> address our specific bottleneck. In fact, SMC already has its own internal
> MR reuse; our core issue remains reducing MTTE consumption by increasing the
> registration granularity to maximize the memory mapped per MTT entry.

And this is something MR pools can handle as well. We are going in circles,
so let's summarize.

I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the
existing ULP API used by NVMe, NFS, and others, rather than maintaining its
own internal logic.

I also do not know whether vmalloc_page_order() is an appropriate solution;
I only want to show that we can probably achieve the same result without
introducing a new function.

Thanks
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 1 day ago
On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote:
> [...]
> And this is something MR pools can handle as well. We are going in circles,
> so let's summarize.

I believe a few points need to be clarified here:

> 
> I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the
> existing ULP API used by NVMe, NFS, and others, rather than maintaining its
> own internal logic.

SMC is not opposed to adopting newer RDMA interfaces; in fact, I have
already planned a gradual migration to the updated RDMA APIs. We are
currently in the process of adapting to ib_cqe, for instance. As long as
functionality remains intact, there is no reason to oppose changes that
reduce maintenance overhead or provide additional gains, but such a
transition takes time.

> 
> I also do not know whether vmalloc_page_order() is an appropriate solution;
> I only want to show that we can probably achieve the same result without
> introducing a new function.

Regarding the specific issue under discussion, I believe the newer RDMA
APIs you mentioned do not solve my problem, at least for now. My
understanding is that regardless of how MRs are pooled, the core
requirement is to increase the page_size parameter of ib_map_mr_sg to
maximize the physical size mapped per MTTE. From the code I have
examined, I see no evidence of these new APIs using a page_size other
than 4KB.
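
For illustration, raising the granularity ultimately comes down to the page_size argument of ib_map_mr_sg(). A rough sketch (not buildable as-is: vmalloc_page_order() is the helper proposed in this thread, and buf/sgt/mr are hypothetical SMC-R variables):

```c
/*
 * Sketch only: feed the vmalloc backing page order into the MR mapping.
 * vmalloc_page_order() is the proposed helper; buf, sgt, mr and
 * sg_nents stand in for SMC-R's buffer state.
 */
unsigned int order = vmalloc_page_order(buf);
unsigned int page_size = PAGE_SIZE << order;	/* 2MB when order == 9 */
int n;

/* One MTT entry then covers page_size bytes instead of 4KB. */
n = ib_map_mr_sg(mr, sgt->sgl, sg_nents, NULL, page_size);
```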

Of course, regardless of whether this issue currently exists, it is
something the RDMA community can resolve. However, as I mentioned,
adapting to a new API takes time. Until a complete transition is
achieved, we need to allow some necessary updates to SMC.

Thanks
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Leon Romanovsky 1 week, 1 day ago
On Thu, Jan 29, 2026 at 07:03:23PM +0800, D. Wythe wrote:
> On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote:
> > On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote:
> > > On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote:
> > > > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote:
> > > > > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote:
> > > > > > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote:
> > > > > > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote:
> > > > > > > > Hello, D. Wythe!
> > > > > > > > 
> > > > > > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > > > > > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > > > > > > > > > find_vm_area() provides a way to find the vm_struct associated with a
> > > > > > > > > > > virtual address. Export this symbol to modules so that modularized
> > > > > > > > > > > subsystems can perform lookups on vmalloc addresses.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  mm/vmalloc.c | 1 +
> > > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > > > > > index ecbac900c35f..3eb9fe761c34 100644
> > > > > > > > > > > --- a/mm/vmalloc.c
> > > > > > > > > > > +++ b/mm/vmalloc.c
> > > > > > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > > > > > > > > > >  
> > > > > > > > > > >  	return va->vm;
> > > > > > > > > > >  }
> > > > > > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > > > > > > > > > >  
> > > > > > > > > > This is internal. We can not just export it.
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Uladzislau Rezki
> > > > > > > > > 
> > > > > > > > > Hi Uladzislau,
> > > > > > > > > 
> > > > > > > > > Thank you for the feedback. I agree that we should avoid exposing
> > > > > > > > > internal implementation details like struct vm_struct to external
> > > > > > > > > subsystems.
> > > > > > > > > 
> > > > > > > > > Following Christoph's suggestion, I'm planning to encapsulate the page
> > > > > > > > > order lookup into a minimal helper instead:
> > > > > > > > > 
> > > > > > > > > unsigned int vmalloc_page_order(const void *addr){
> > > > > > > > > 	struct vm_struct *vm;
> > > > > > > > >  	vm = find_vm_area(addr);
> > > > > > > > > 	return vm ? vm->page_order : 0;
> > > > > > > > > }
> > > > > > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > > > > > > > > 
> > > > > > > > > Does this approach look reasonable to you? It would keep the vm_struct
> > > > > > > > > layout private while satisfying the optimization needs of SMC.
> > > > > > > > > 
> > > > > > > > Could you please clarify why you need info about page_order? I have not
> > > > > > > > looked at your second patch.
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Uladzislau Rezki
> > > > > > > 
> > > > > > > Hi Uladzislau,
> > > > > > > 
> > > > > > > This stems from optimizing memory registration in SMC-R. To provide the
> > > > > > > RDMA hardware with direct access to memory buffers, we must register
> > > > > > > them with the NIC. During this process, the hardware generates one MTT
> > > > > > > entry for each physically contiguous block. Since these hardware entries
> > > > > > > are a finite and scarce resource, and SMC currently defaults to a 4KB
> > > > > > > registration granularity, a single 2MB buffer consumes 512 entries. In
> > > > > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC
> > > > > > > resources and becomes a major bottleneck for system scalability.
> > > > > > 
> > > > > > I believe this complexity can be avoided by using the RDMA MR pool API,
> > > > > > as other ULPs do, for example NVMe.
> > > > > > 
> > > > > > Thanks
> > > > > > 
> > > > > 
> > > > > Hi Leon,
> > > > > 
> > > > > Am I correct in assuming you are suggesting mr_pool to limit the number
> > > > > of MRs as a way to cap MTTE consumption?
> > > > 
> > > > I don't see this a limit, but something that is considered standard
> > > > practice to reduce MTT consumption.
> > > > 
> > > > > 
> > > > > However, our goal is to maximize the total registered memory within
> > > > > the MTTE limits rather than to cap it. In SMC-R, each connection
> > > > > occupies a configurable, fixed-size registered buffer; consequently,
> > > > > the more memory we can register, the more concurrent connections
> > > > > we can support.
> > > > 
> > > > It is not cap, but more efficient use of existing resources.
> > > 
> > > Got it. While MRs pool might be more standard practice, but it doesn't
> > > address our specific bottleneck. In fact, smc already has its own internal
> > > MR reuse; our core issue remains reducing MTTE consumption by increasing the
> > > registration granularity to maximize the memory size mapped per MTT entry.
> > 
> > And this is something MR pools can handle as well. We are going in circles,
> > so let's summarize.
> 
> I believe some points need to be thoroughly clarified here:
> 
> > 
> > I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the
> > existing ULP API used by NVMe, NFS, and others, rather than maintaining its
> > own internal logic.
> 
> SMC is not opposed to adopting newer RDMA interfaces; in fact, I have
> already planned a gradual migration to the updated RDMA APIs. We are
> currently in the process of adapting to ib_cqe, for instance. As long as
> functionality remains intact, there is no reason to oppose changes that
> reduce maintenance overhead or provide additional gains, but such a
> transition takes time.
> 
> > 
> > I also do not know whether vmalloc_page_order() is an appropriate solution;
> > I only want to show that we can probably achieve the same result without
> > introducing a new function.
> 
> > Regarding the specific issue under discussion, I believe the newer RDMA
> > APIs you mentioned do not solve my problem, at least for now. My
> > understanding is that regardless of how MRs are pooled, the core
> > requirement is to increase the page_size parameter of ib_map_mr_sg to
> > maximize the physical size mapped per MTTE. From the code I have
> > examined, I see no evidence of these new APIs using a page_size other
> > than 4KB.
> > 
> > Of course, regardless of whether this issue currently exists, it is
> > something the RDMA community can resolve. However, as I mentioned,
> > adapting to a new API takes time. Until a complete transition is
> > achieved, we need to allow some necessary updates to SMC.

I disagree with that statement.

SMC‑R has a long history of re‑implementing existing RDMA ULP APIs, and
not always correctly.

https://lore.kernel.org/netdev/20170510072627.12060-1-hch@lst.de/
https://lore.kernel.org/netdev/20241105112313.GE311159@unreal/#t

Thanks

Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 1 day ago
On Thu, Jan 29, 2026 at 02:22:02PM +0200, Leon Romanovsky wrote:
> [...]
> I disagree with that statement.
> 
> SMC‑R has a long history of re‑implementing existing RDMA ULP APIs, and
> not always correctly.
> 
> https://lore.kernel.org/netdev/20170510072627.12060-1-hch@lst.de/
> https://lore.kernel.org/netdev/20241105112313.GE311159@unreal/#t
>

I see that this discussion has moved beyond the technical scope of the
patch into historical design critiques. I do not wish to engage in a
debate over SMC's history, nor am I responsible for those past
decisions.

I will discontinue the conversation here.

Thanks.
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Uladzislau Rezki 1 week, 4 days ago
Hello, D. Wythe!

> > > [...]
> > > 
> > > Hi Uladzislau,
> > > 
> > > Thank you for the feedback. I agree that we should avoid exposing
> > > internal implementation details like struct vm_struct to external
> > > subsystems.
> > > 
> > > Following Christoph's suggestion, I'm planning to encapsulate the page
> > > order lookup into a minimal helper instead:
> > > 
> > > unsigned int vmalloc_page_order(const void *addr)
> > > {
> > > 	struct vm_struct *vm = find_vm_area(addr);
> > > 
> > > 	return vm ? vm->page_order : 0;
> > > }
> > > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > > 
> > > Does this approach look reasonable to you? It would keep the vm_struct
> > > layout private while satisfying the optimization needs of SMC.
> > > 
> > Could you please clarify why you need info about page_order? I have not
> > looked at your second patch.
> > 
> > Thanks!
> > 
> > --
> > Uladzislau Rezki
> 
> Hi Uladzislau,
> 
> [...]
> 
> To address this, we intend to use vmalloc_huge(). When it successfully
> allocates high-order pages, the vmalloc area is backed by a sequence of
> physically contiguous chunks (e.g., 2MB each). If we know this
> page_order, we can register these larger physical blocks instead of
> individual 4KB pages, reducing MTT consumption from 512 entries down to
> 1 for every 2MB of memory (with page_order == 9).
> 
> However, the result of vmalloc_huge() is currently opaque to the caller.
> We cannot determine whether it successfully allocated huge pages or fell
> back to 4KB pages based solely on the returned pointer. Therefore, we
> need a helper function to query the actual page order, enabling SMC-R to
> adapt its registration logic to the underlying physical layout.
> 
> I hope this clarifies our design motivation!
> 
Appreciate the explanation. Yes, it clarifies the intention.

As for the proposed patch above:

- page_order is only available if CONFIG_HAVE_ARCH_HUGE_VMALLOC is defined;
- It makes sense to get the node, grab its spin-lock, find the VM, save
  the page_order and release the lock.

You can have a look at the vmalloc_dump_obj(void *object) function.
We use try-spinlock there, whereas you need a plain spin-lock. But the
idea is the same.

--
Uladzislau Rezki
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by D. Wythe 1 week, 4 days ago
On Mon, Jan 26, 2026 at 11:28:46AM +0100, Uladzislau Rezki wrote:
> Hello, D. Wythe!
> 
> > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > > > > find_vm_area() provides a way to find the vm_struct associated with a
> > > > > > virtual address. Export this symbol to modules so that modularized
> > > > > > subsystems can perform lookups on vmalloc addresses.
> > > > > > 
> > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > > > > ---
> > > > > >  mm/vmalloc.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index ecbac900c35f..3eb9fe761c34 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > > > > >  
> > > > > >  	return va->vm;
> > > > > >  }
> > > > > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > > > > >  
> > > > > This is internal. We can not just export it.
> > > > > 
> > > > > --
> > > > > Uladzislau Rezki
> > > > 
> > > > Hi Uladzislau,
> > > > 
> > > > Thank you for the feedback. I agree that we should avoid exposing
> > > > internal implementation details like struct vm_struct to external
> > > > subsystems.
> > > > 
> > > > Following Christoph's suggestion, I'm planning to encapsulate the page
> > > > order lookup into a minimal helper instead:
> > > > 
> > > > unsigned int vmalloc_page_order(const void *addr)
> > > > {
> > > > 	struct vm_struct *vm;
> > > > 
> > > > 	vm = find_vm_area(addr);
> > > > 	return vm ? vm->page_order : 0;
> > > > }
> > > > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > > > 
> > > > Does this approach look reasonable to you? It would keep the vm_struct
> > > > layout private while satisfying the optimization needs of SMC.
> > > > 
> > > Could you please clarify why you need info about page_order? I have not
> > > looked at your second patch.
> > > 
> > > Thanks!
> > > 
> > > --
> > > Uladzislau Rezki
> > 
> > Hi Uladzislau,
> > 
> > This stems from optimizing memory registration in SMC-R. To provide the
> > RDMA hardware with direct access to memory buffers, we must register
> > them with the NIC. During this process, the hardware generates one MTT
> > entry for each physically contiguous block. Since these hardware entries
> > are a finite and scarce resource, and SMC currently defaults to a 4KB
> > registration granularity, a single 2MB buffer consumes 512 entries. In
> > high-concurrency scenarios, this inefficiency quickly exhausts NIC
> > resources and becomes a major bottleneck for system scalability.
> > 
> > To address this, we intend to use vmalloc_huge(). When it successfully
> > allocates high-order pages, the vmalloc area is backed by a sequence of
> > physically contiguous chunks (e.g., 2MB each). If we know this
> > page_order, we can register these larger physical blocks instead of
> > individual 4KB pages, reducing MTT consumption from 512 entries down to
> > 1 for every 2MB of memory (with page_order == 9).
> > 
> > However, the result of vmalloc_huge() is currently opaque to the caller.
> > We cannot determine whether it successfully allocated huge pages or fell
> > back to 4KB pages based solely on the returned pointer. Therefore, we
> > need a helper function to query the actual page order, enabling SMC-R to
> > adapt its registration logic to the underlying physical layout.
> > 
> > I hope this clarifies our design motivation!
> > 
> Thanks for the explanation. Yes, it clarifies the intention.
> 
> As for the proposed patch above:
> 
> - A page_order is available only if CONFIG_HAVE_ARCH_HUGE_VMALLOC is defined;
> - It makes sense to get the node, grab its spin-lock, find the VM, save
>   the page_order and release the lock.
> 
> You can have a look at the vmalloc_dump_obj(void *object) function.
> We use a try-spinlock there, whereas you need a plain spin-lock, but
> the idea is the same.
> 
> --
> Uladzislau Rezki

Hi Uladzislau,

Thanks very much for the detailed guidance, especially on the correct
locking pattern. This is extremely helpful. I will follow it and send
a v2 patch series with the new helper implemented in mm/vmalloc.c.

Thanks again for your support.

Best regards,
D. Wythe
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Uladzislau Rezki 1 week, 4 days ago
On Mon, Jan 26, 2026 at 08:02:26PM +0800, D. Wythe wrote:
> On Mon, Jan 26, 2026 at 11:28:46AM +0100, Uladzislau Rezki wrote:
> > Hello, D. Wythe!
> > 
> > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > > > > > > find_vm_area() provides a way to find the vm_struct associated with a
> > > > > > > virtual address. Export this symbol to modules so that modularized
> > > > > > > subsystems can perform lookups on vmalloc addresses.
> > > > > > > 
> > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > > > > > ---
> > > > > > >  mm/vmalloc.c | 1 +
> > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > 
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index ecbac900c35f..3eb9fe761c34 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
> > > > > > >  
> > > > > > >  	return va->vm;
> > > > > > >  }
> > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area);
> > > > > > >  
> > > > > > This is internal. We can not just export it.
> > > > > > 
> > > > > > --
> > > > > > Uladzislau Rezki
> > > > > 
> > > > > Hi Uladzislau,
> > > > > 
> > > > > Thank you for the feedback. I agree that we should avoid exposing
> > > > > internal implementation details like struct vm_struct to external
> > > > > subsystems.
> > > > > 
> > > > > Following Christoph's suggestion, I'm planning to encapsulate the page
> > > > > order lookup into a minimal helper instead:
> > > > > 
> > > > > unsigned int vmalloc_page_order(const void *addr)
> > > > > {
> > > > > 	struct vm_struct *vm;
> > > > > 
> > > > > 	vm = find_vm_area(addr);
> > > > > 	return vm ? vm->page_order : 0;
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order);
> > > > > 
> > > > > Does this approach look reasonable to you? It would keep the vm_struct
> > > > > layout private while satisfying the optimization needs of SMC.
> > > > > 
> > > > Could you please clarify why you need info about page_order? I have not
> > > > looked at your second patch.
> > > > 
> > > > Thanks!
> > > > 
> > > > --
> > > > Uladzislau Rezki
> > > 
> > > Hi Uladzislau,
> > > 
> > > This stems from optimizing memory registration in SMC-R. To provide the
> > > RDMA hardware with direct access to memory buffers, we must register
> > > them with the NIC. During this process, the hardware generates one MTT
> > > entry for each physically contiguous block. Since these hardware entries
> > > are a finite and scarce resource, and SMC currently defaults to a 4KB
> > > registration granularity, a single 2MB buffer consumes 512 entries. In
> > > high-concurrency scenarios, this inefficiency quickly exhausts NIC
> > > resources and becomes a major bottleneck for system scalability.
> > > 
> > > To address this, we intend to use vmalloc_huge(). When it successfully
> > > allocates high-order pages, the vmalloc area is backed by a sequence of
> > > physically contiguous chunks (e.g., 2MB each). If we know this
> > > page_order, we can register these larger physical blocks instead of
> > > individual 4KB pages, reducing MTT consumption from 512 entries down to
> > > 1 for every 2MB of memory (with page_order == 9).
> > > 
> > > However, the result of vmalloc_huge() is currently opaque to the caller.
> > > We cannot determine whether it successfully allocated huge pages or fell
> > > back to 4KB pages based solely on the returned pointer. Therefore, we
> > > need a helper function to query the actual page order, enabling SMC-R to
> > > adapt its registration logic to the underlying physical layout.
> > > 
> > > I hope this clarifies our design motivation!
> > > 
> > Thanks for the explanation. Yes, it clarifies the intention.
> > 
> > As for the proposed patch above:
> > 
> > - A page_order is available only if CONFIG_HAVE_ARCH_HUGE_VMALLOC is defined;
> > - It makes sense to get the node, grab its spin-lock, find the VM, save
> >   the page_order and release the lock.
> > 
> > You can have a look at the vmalloc_dump_obj(void *object) function.
> > We use a try-spinlock there, whereas you need a plain spin-lock, but
> > the idea is the same.
> > 
> > --
> > Uladzislau Rezki
> 
> Hi Uladzislau,
> 
> Thanks very much for the detailed guidance, especially on the correct
> locking pattern. This is extremely helpful. I will follow it and send
> a v2 patch series with the new helper implemented in mm/vmalloc.c.
> 
> Thanks again for your support.
> 
Welcome!

--
Uladzislau Rezki
Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()
Posted by Christoph Hellwig 2 weeks ago
On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> find_vm_area() provides a way to find the vm_struct associated with a
> virtual address. Export this symbol to modules so that modularized
> subsystems can perform lookups on vmalloc addresses.

No, they have absolutely no business doing that.  This functionality
is very intentionally kept private.