[PATCH RFC] arm64/mm: Lift the cma address limit when CONFIG_DMA_NUMA_CMA=y
Posted by Feng Tang 7 months ago
When porting a CMA-related usage from an x86_64 server to an arm64
server, the "cma=4G" setup failed on arm64, because arm64 imposes a 4G
(32-bit) address limit on the CMA reservation.

The limit is reasonable given device DMA requirements, but for NUMA
servers with CONFIG_DMA_NUMA_CMA enabled it is not required, as that
config already allows CMA areas to be reserved on different NUMA nodes
whose memory very likely lies beyond the 4G limit.

Lift the CMA limit for platforms with such a configuration.

Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
---
 arch/arm64/mm/init.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index b99bf3980fc6..661758678cc4 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -312,6 +312,7 @@ void __init arm64_memblock_init(void)
 void __init bootmem_init(void)
 {
 	unsigned long min, max;
+	phys_addr_t cma_limit;
 
 	min = PFN_UP(memblock_start_of_DRAM());
 	max = PFN_DOWN(memblock_end_of_DRAM());
@@ -343,8 +344,14 @@ void __init bootmem_init(void)
 
 	/*
 	 * Reserve the CMA area after arm64_dma_phys_limit was initialised.
+	 *
+	 * When CONFIG_DMA_NUMA_CMA is enabled, the system may have CMA
+	 * areas reserved on different NUMA nodes, which likely lie beyond
+	 * the 32-bit limit, so use (PHYS_MASK + 1) as the CMA limit.
 	 */
-	dma_contiguous_reserve(arm64_dma_phys_limit);
+	cma_limit = IS_ENABLED(CONFIG_DMA_NUMA_CMA) ?
+			(PHYS_MASK + 1) : arm64_dma_phys_limit;
+	dma_contiguous_reserve(cma_limit);
 
 	/*
 	 * request_standard_resources() depends on crashkernel's memory being
-- 
2.39.5 (Apple Git-154)
Re: [PATCH RFC] arm64/mm: Lift the cma address limit when CONFIG_DMA_NUMA_CMA=y
Posted by Catalin Marinas 6 months, 1 week ago
On Wed, May 21, 2025 at 09:47:01AM +0800, Feng Tang wrote:
> When porting an cma related usage from x86_64 server to arm64 server,
> the "cma=4G" setup failed on arm64, and the reason is arm64 has 4G (32bit)
> address limit for cma reservation.
> 
> The limit is reasonable due to device DMA requirement, but for NUMA
> servers which have CONFIG_DMA_NUMA_CMA enabled, the limit is not required
> as that config already allows cma area to be reserved on different NUMA
> nodes whose memory very likely goes beyond 4G limit.
> 
> Lift the cma limit for platform with such configuration.

I don't think that's the right fix. Those devices that have a NUMA node
associated may be ok to address memory beyond 4GB. The default for
NUMA_NO_NODE devices is still dma_contiguous_default_area. I also don't
like to make such run-time decisions on the config option.

That said, maybe we should make the under-4G CMA allocation a best
effort. In the arch code, if that failed, attempt the allocation again
with a limit of 0 and maybe do a pr_notice() that CMA allocation in the
DMA zone failed.
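
Very roughly, and purely as a sketch (dma_contiguous_reserve() currently
returns void, so it would first have to report whether the default area
was actually reserved):

	/* Best effort: try to place the default CMA area below 4G first. */
	if (dma_contiguous_reserve(arm64_dma_phys_limit)) {
		pr_notice("CMA reservation in the DMA zone failed, retrying without an address limit\n");
		dma_contiguous_reserve(0);
	}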

Adding Robin in case he has a different view.

> Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
> ---
>  arch/arm64/mm/init.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index b99bf3980fc6..661758678cc4 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -312,6 +312,7 @@ void __init arm64_memblock_init(void)
>  void __init bootmem_init(void)
>  {
>  	unsigned long min, max;
> +	phys_addr_t cma_limit;
>  
>  	min = PFN_UP(memblock_start_of_DRAM());
>  	max = PFN_DOWN(memblock_end_of_DRAM());
> @@ -343,8 +344,14 @@ void __init bootmem_init(void)
>  
>  	/*
>  	 * Reserve the CMA area after arm64_dma_phys_limit was initialised.
> +	 *
> +	 * When CONFIG_DMA_NUMA_CMA is enabled, system may have CMA reserved
> +	 * area in different NUMA nodes, which likely goes beyond the 32bit
> +	 * limit, thus use (PHYS_MASK+1) as cma limit.
>  	 */
> -	dma_contiguous_reserve(arm64_dma_phys_limit);
> +	cma_limit = IS_ENABLED(CONFIG_DMA_NUMA_CMA) ?
> +			(PHYS_MASK + 1) : arm64_dma_phys_limit;
> +	dma_contiguous_reserve(cma_limit);
>  
>  	/*
>  	 * request_standard_resources() depends on crashkernel's memory being
> -- 
> 2.39.5 (Apple Git-154)
Re: [PATCH RFC] arm64/mm: Lift the cma address limit when CONFIG_DMA_NUMA_CMA=y
Posted by Robin Murphy 6 months, 1 week ago
On 2025-06-10 6:10 pm, Catalin Marinas wrote:
> On Wed, May 21, 2025 at 09:47:01AM +0800, Feng Tang wrote:
>> When porting an cma related usage from x86_64 server to arm64 server,
>> the "cma=4G" setup failed on arm64, and the reason is arm64 has 4G (32bit)
>> address limit for cma reservation.
>>
>> The limit is reasonable due to device DMA requirement, but for NUMA
>> servers which have CONFIG_DMA_NUMA_CMA enabled, the limit is not required
>> as that config already allows cma area to be reserved on different NUMA
>> nodes whose memory very likely goes beyond 4G limit.
>>
>> Lift the cma limit for platform with such configuration.
> 
> I don't think that's the right fix. Those devices that have a NUMA node
> associated may be ok to address memory beyond 4GB. The default for
> NUMA_NO_NODE devices is still dma_contiguous_default_area. I also don't
> like to make such run-time decisions on the config option.

Indeed, the fact that the kernel was built with the option enabled says 
nothing at all about the needs of whatever system we're actually running 
on, so that's definitely wrong. This one is also the kind of option 
which may well be enabled in a multi-platform distro kernel, since it 
only adds a tiny amount of code with no functional impact on systems 
which don't explicitly opt in, but offers a useful benefit to those 
which can and do.

Furthermore, the justification doesn't add up at all - if the relevant 
devices could use the per-NUMA-node CMA areas, then... why not just have 
them use the per-NUMA-node CMA areas, no kernel change needed (and maybe 
a slight performance bonus too)? On the other hand, where those areas 
may or may not be allocated is entirely meaningless to NUMA_NO_NODE 
devices which wouldn't use them anyway.

> That said, maybe we should make the under-4G CMA allocation a best
> effort. In the arch code, if that failed, attempt the allocation again
> with a limit of 0 and maybe do a pr_notice() that CMA allocation in the
> DMA zone failed.

TBH given that the command-line parameter can specify placement as well 
as size, I think it would make a lot of sense to allow that to override 
the default limit provided by the arch code. That would give users the 
most flexibility, at the minor cost of having to accept the consequences 
if they do specify something which ends up not working for some devices. 
Otherwise I fear that any attempt to make the code itself cleverer will 
just lead down a rabbit-hole of trying to second-guess the user's intent 
- if the size doesn't fit the limit, who says it's right to increase the 
limit rather than reduce the size? And so on...

Thanks,
Robin.

> 
> Adding Robin in case he has a different view.
> 
>> Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
>> ---
>>   arch/arm64/mm/init.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index b99bf3980fc6..661758678cc4 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -312,6 +312,7 @@ void __init arm64_memblock_init(void)
>>   void __init bootmem_init(void)
>>   {
>>   	unsigned long min, max;
>> +	phys_addr_t cma_limit;
>>   
>>   	min = PFN_UP(memblock_start_of_DRAM());
>>   	max = PFN_DOWN(memblock_end_of_DRAM());
>> @@ -343,8 +344,14 @@ void __init bootmem_init(void)
>>   
>>   	/*
>>   	 * Reserve the CMA area after arm64_dma_phys_limit was initialised.
>> +	 *
>> +	 * When CONFIG_DMA_NUMA_CMA is enabled, system may have CMA reserved
>> +	 * area in different NUMA nodes, which likely goes beyond the 32bit
>> +	 * limit, thus use (PHYS_MASK+1) as cma limit.
>>   	 */
>> -	dma_contiguous_reserve(arm64_dma_phys_limit);
>> +	cma_limit = IS_ENABLED(CONFIG_DMA_NUMA_CMA) ?
>> +			(PHYS_MASK + 1) : arm64_dma_phys_limit;
>> +	dma_contiguous_reserve(cma_limit);
>>   
>>   	/*
>>   	 * request_standard_resources() depends on crashkernel's memory being
>> -- 
>> 2.39.5 (Apple Git-154)
Re: [PATCH RFC] arm64/mm: Lift the cma address limit when CONFIG_DMA_NUMA_CMA=y
Posted by Feng Tang 6 months, 1 week ago
Add Marek Szyprowski

Thanks Catalin and Robin for the comments and suggestions!

On Tue, Jun 10, 2025 at 08:46:38PM +0100, Robin Murphy wrote:
> On 2025-06-10 6:10 pm, Catalin Marinas wrote:
> > On Wed, May 21, 2025 at 09:47:01AM +0800, Feng Tang wrote:
> > > When porting an cma related usage from x86_64 server to arm64 server,
> > > the "cma=4G" setup failed on arm64, and the reason is arm64 has 4G (32bit)
> > > address limit for cma reservation.
> > > 
> > > The limit is reasonable due to device DMA requirement, but for NUMA
> > > servers which have CONFIG_DMA_NUMA_CMA enabled, the limit is not required
> > > as that config already allows cma area to be reserved on different NUMA
> > > nodes whose memory very likely goes beyond 4G limit.
> > > 
> > > Lift the cma limit for platform with such configuration.
> > 
> > I don't think that's the right fix. Those devices that have a NUMA node
> > associated may be ok to address memory beyond 4GB. The default for
> > NUMA_NO_NODE devices is still dma_contiguous_default_area. I also don't
> > like to make such run-time decisions on the config option.
> 
> Indeed, the fact that the kernel was built with the option enabled says
> nothing at all about the needs of whatever system we're actually running on,
> so that's definitely wrong. This one is also the kind of option which may
> well be enabled in a multi-platform distro kernel, since it only adds a tiny
> amount of code with no functional impact on systems which don't explicitly
> opt in, but offers a useful benefit to those which can and do.

Yep, the analysis from you two makes sense to me. Will drop this patch.

> Furthermore, the justification doesn't add up at all - if the relevant
> devices could use the per-NUMA-node CMA areas, then... why not just have
> them use the per-NUMA-node CMA areas, no kernel change needed (and maybe a
> slight performance bonus too)? On the other hand, where those areas may or
> may not be allocated is entirely meaningless to NUMA_NO_NODE devices which
> wouldn't use them anyway.

The usage model ported from x86_64 is to pass "cma=4G@4G" on the cmdline
and call something like dma_alloc_from_contiguous(NULL, 1 << 18, 18, false)
to get a huge buffer from 'dma_contiguous_default_area'.
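
Roughly, as a sketch only (with a NULL dev, dev_get_cma_area() falls back
to dma_contiguous_default_area, and the alignment order is capped at
CONFIG_CMA_ALIGNMENT internally):

	#include <linux/dma-map-ops.h>

	/* Request 2^18 pages (1GiB with 4KiB pages) from the default CMA area. */
	struct page *page = dma_alloc_from_contiguous(NULL, 1 << 18, 18, false);

	if (page) {
		/* ... use the buffer ... */
		dma_release_from_contiguous(NULL, page, 1 << 18);
	}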

btw, I really like the 'numa_cma=' cmdline option, which helps our cma
usage a lot.

> > That said, maybe we should make the under-4G CMA allocation a best
> > effort. In the arch code, if that failed, attempt the allocation again
> > with a limit of 0 and maybe do a pr_notice() that CMA allocation in the
> > DMA zone failed.
> 
> TBH given that the command-line parameter can specify placement as well as
> size, I think it would make a lot of sense to allow that to override the
> default limit provided by the arch code. That would give users the most
> flexibility, at the minor cost of having to accept the consequences if they
> do specify something which ends up not working for some devices. Otherwise I
> fear that any attempt to make the code itself cleverer will just lead down a
> rabbit-hole of trying to second-guess the user's intent - if the size
> doesn't fit the limit, who says it's right to increase the limit rather than
> reduce the size? And so on...

Strongly agree. Some platforms may need the 32-bit limit, while many
others don't and shouldn't have to suffer from it. This kind of
flexibility should benefit users widely.

Something like below?
---
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 8df0dfaaca18..6a93ad3e024d 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -222,7 +222,12 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 	if (size_cmdline != -1) {
 		selected_size = size_cmdline;
 		selected_base = base_cmdline;
-		selected_limit = min_not_zero(limit_cmdline, limit);
+
+		selected_limit = limit_cmdline ?: limit;
+		if (limit_cmdline > limit)
+			pr_notice("User-set cma limit [%pa] is bigger than the architectural limit [%pa], using the former\n",
+				  &limit_cmdline, &limit);
+
 		if (base_cmdline + size_cmdline == limit_cmdline)
 			fixed = true;
 	} else {


Thanks,
Feng

> 
> Thanks,
> Robin.
> 
> > 
> > Adding Robin in case he has a different view.
> > 
> > > Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
> > > ---
> > >   arch/arm64/mm/init.c | 9 ++++++++-
> > >   1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > index b99bf3980fc6..661758678cc4 100644
> > > --- a/arch/arm64/mm/init.c
> > > +++ b/arch/arm64/mm/init.c
> > > @@ -312,6 +312,7 @@ void __init arm64_memblock_init(void)
> > >   void __init bootmem_init(void)
> > >   {
> > >   	unsigned long min, max;
> > > +	phys_addr_t cma_limit;
> > >   	min = PFN_UP(memblock_start_of_DRAM());
> > >   	max = PFN_DOWN(memblock_end_of_DRAM());
> > > @@ -343,8 +344,14 @@ void __init bootmem_init(void)
> > >   	/*
> > >   	 * Reserve the CMA area after arm64_dma_phys_limit was initialised.
> > > +	 *
> > > +	 * When CONFIG_DMA_NUMA_CMA is enabled, system may have CMA reserved
> > > +	 * area in different NUMA nodes, which likely goes beyond the 32bit
> > > +	 * limit, thus use (PHYS_MASK+1) as cma limit.
> > >   	 */
> > > -	dma_contiguous_reserve(arm64_dma_phys_limit);
> > > +	cma_limit = IS_ENABLED(CONFIG_DMA_NUMA_CMA) ?
> > > +			(PHYS_MASK + 1) : arm64_dma_phys_limit;
> > > +	dma_contiguous_reserve(cma_limit);
> > >   	/*
> > >   	 * request_standard_resources() depends on crashkernel's memory being
> > > -- 
> > > 2.39.5 (Apple Git-154)
Re: [PATCH RFC] arm64/mm: Lift the cma address limit when CONFIG_DMA_NUMA_CMA=y
Posted by Robin Murphy 6 months, 1 week ago
On 2025-06-11 5:08 am, Feng Tang wrote:
> Add Marek Szyprowski
> 
> Thanks Catalin and Robin for the comments and suggestions!
> 
> On Tue, Jun 10, 2025 at 08:46:38PM +0100, Robin Murphy wrote:
>> On 2025-06-10 6:10 pm, Catalin Marinas wrote:
>>> On Wed, May 21, 2025 at 09:47:01AM +0800, Feng Tang wrote:
>>>> When porting an cma related usage from x86_64 server to arm64 server,
>>>> the "cma=4G" setup failed on arm64, and the reason is arm64 has 4G (32bit)
>>>> address limit for cma reservation.
>>>>
>>>> The limit is reasonable due to device DMA requirement, but for NUMA
>>>> servers which have CONFIG_DMA_NUMA_CMA enabled, the limit is not required
>>>> as that config already allows cma area to be reserved on different NUMA
>>>> nodes whose memory very likely goes beyond 4G limit.
>>>>
>>>> Lift the cma limit for platform with such configuration.
>>>
>>> I don't think that's the right fix. Those devices that have a NUMA node
>>> associated may be ok to address memory beyond 4GB. The default for
>>> NUMA_NO_NODE devices is still dma_contiguous_default_area. I also don't
>>> like to make such run-time decisions on the config option.
>>
>> Indeed, the fact that the kernel was built with the option enabled says
>> nothing at all about the needs of whatever system we're actually running on,
>> so that's definitely wrong. This one is also the kind of option which may
>> well be enabled in a multi-platform distro kernel, since it only adds a tiny
>> amount of code with no functional impact on systems which don't explicitly
>> opt in, but offers a useful benefit to those which can and do.
> 
> Yep, the analysis from you two make sense to me. Will drop this patch.
> 
>> Furthermore, the justification doesn't add up at all - if the relevant
>> devices could use the per-NUMA-node CMA areas, then... why not just have
>> them use the per-NUMA-node CMA areas, no kernel change needed (and maybe a
>> slight performance bonus too)? On the other hand, where those areas may or
>> may not be allocated is entirely meaningless to NUMA_NO_NODE devices which
>> wouldn't use them anyway.
> 
> The usage model ported from x86_64 is use "cma=4G@4G" in cmdline, and use
> something like dma_alloc_from_continguous(NULL, 1 << 18, 18, false) to
> get a huge buffer from 'dma_contiguous_default_area'
> 
> btw, I really like the 'numa_cma=' cmdline option, which helps our cma
> usage a lot.
> 
>>> That said, maybe we should make the under-4G CMA allocation a best
>>> effort. In the arch code, if that failed, attempt the allocation again
>>> with a limit of 0 and maybe do a pr_notice() that CMA allocation in the
>>> DMA zone failed.
>>
>> TBH given that the command-line parameter can specify placement as well as
>> size, I think it would make a lot of sense to allow that to override the
>> default limit provided by the arch code. That would give users the most
>> flexibility, at the minor cost of having to accept the consequences if they
>> do specify something which ends up not working for some devices. Otherwise I
>> fear that any attempt to make the code itself cleverer will just lead down a
>> rabbit-hole of trying to second-guess the user's intent - if the size
>> doesn't fit the limit, who says it's right to increase the limit rather than
>> reduce the size? And so on...
> 
> Strongly agree. Some platforms may have the 32bit limit, and many other
> platforms which don't have to suffer from the limit. This kind of
> flexibility should benefit users widely.
> 
> Something like below?

Pretty much, although personally I wouldn't even bother with the message 
- if the user has gone out of their way to override the default 
behaviour on their system, then logically it's because they know what 
that default is and why it didn't suit them, so it seems unlikely that 
they need reminding of that in one particular subset of override 
conditions. We still report the actual address where the CMA region ends 
up being allocated, and that's what really matters in the end.

Thanks,
Robin.

> ---
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index 8df0dfaaca18..6a93ad3e024d 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -222,7 +222,12 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>   	if (size_cmdline != -1) {
>   		selected_size = size_cmdline;
>   		selected_base = base_cmdline;
> -		selected_limit = min_not_zero(limit_cmdline, limit);
> +
> +		selected_limit = limit_cmdline ?: limit;
> +		if (limit_cmdline > limit)
> +			pr_notice("User set cma limit [0x%llx] bigger than architectual value [0x%llx], will use the former\n",
> +				limit_cmdline, limit);
> +
>   		if (base_cmdline + size_cmdline == limit_cmdline)
>   			fixed = true;
>   	} else {
> 
> 
> Thanks,
> Feng
> 
>>
>> Thanks,
>> Robin.
>>
>>>
>>> Adding Robin in case he has a different view.
>>>
>>>> Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
>>>> ---
>>>>    arch/arm64/mm/init.c | 9 ++++++++-
>>>>    1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>>>> index b99bf3980fc6..661758678cc4 100644
>>>> --- a/arch/arm64/mm/init.c
>>>> +++ b/arch/arm64/mm/init.c
>>>> @@ -312,6 +312,7 @@ void __init arm64_memblock_init(void)
>>>>    void __init bootmem_init(void)
>>>>    {
>>>>    	unsigned long min, max;
>>>> +	phys_addr_t cma_limit;
>>>>    	min = PFN_UP(memblock_start_of_DRAM());
>>>>    	max = PFN_DOWN(memblock_end_of_DRAM());
>>>> @@ -343,8 +344,14 @@ void __init bootmem_init(void)
>>>>    	/*
>>>>    	 * Reserve the CMA area after arm64_dma_phys_limit was initialised.
>>>> +	 *
>>>> +	 * When CONFIG_DMA_NUMA_CMA is enabled, system may have CMA reserved
>>>> +	 * area in different NUMA nodes, which likely goes beyond the 32bit
>>>> +	 * limit, thus use (PHYS_MASK+1) as cma limit.
>>>>    	 */
>>>> -	dma_contiguous_reserve(arm64_dma_phys_limit);
>>>> +	cma_limit = IS_ENABLED(CONFIG_DMA_NUMA_CMA) ?
>>>> +			(PHYS_MASK + 1) : arm64_dma_phys_limit;
>>>> +	dma_contiguous_reserve(cma_limit);
>>>>    	/*
>>>>    	 * request_standard_resources() depends on crashkernel's memory being
>>>> -- 
>>>> 2.39.5 (Apple Git-154)