Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
exposed a bug in the interaction between nokaslr and zone device memory,
as seen on a system with an AMD iGPU and dGPU (see [1]).
The root cause of the issue is that the GPU driver registers a zone
device private memory region. When KASLR is disabled or the above commit
is applied, direct_map_physmem_end is set much higher than 10 TiB,
typically to the 64 TiB address. When zone device private memory is added
to the system via add_pages(), it bumps max_pfn up to the same
value. This causes dma_addressing_limited() to return true, since the
device cannot address memory all the way up to max_pfn.
This caused a regression for games played on the iGPU, as it resulted in
the DMA32 zone being used for GPU allocations.
Fix this by not bumping up max_pfn on x86 systems when pgmap is passed
into add_pages(). The presence of pgmap is used to determine whether
device private memory is being added via add_pages().
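
To make the failure mode concrete, here is a standalone userspace sketch.
It only models the logic; this is not the kernel's dma_addressing_limited()
implementation, and the 44-bit device mask and memory sizes are made-up
values:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/*
 * Model: a device is "addressing limited" when its DMA mask cannot
 * reach the highest physical address implied by max_pfn.
 */
static bool addressing_limited(uint64_t dev_dma_mask, uint64_t max_pfn)
{
	return dev_dma_mask < ((max_pfn << PAGE_SHIFT) - 1);
}

int main(void)
{
	uint64_t mask_44bit = (1ULL << 44) - 1;            /* device reaches 16 TiB */
	uint64_t pfn_16gib = (16ULL << 30) >> PAGE_SHIFT;  /* actual RAM            */
	uint64_t pfn_64tib = (64ULL << 40) >> PAGE_SHIFT;  /* bumped by add_pages() */

	printf("max_pfn at 16 GiB: limited=%d\n",
	       addressing_limited(mask_44bit, pfn_16gib)); /* prints 0 */
	printf("max_pfn at 64 TiB: limited=%d\n",
	       addressing_limited(mask_44bit, pfn_64tib)); /* prints 1 */
	return 0;
}

Once the check flips to "limited", the driver falls back to bounce
buffers and DMA32 allocations, which is the regression described above.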
More details:
devm_request_mem_region() and request_free_mem_region() are used to
request device private memory. iomem_resource is passed as the base
resource with start and end parameters. iomem_resource's end depends on
several factors, including the platform and virtualization. On bare-metal
x86, for example, this value is determined by boot_cpu_data.x86_phys_bits,
which can change depending on support for MKTME. By default it is the
same as log2(direct_map_physmem_end), which is 46 to 52 bits depending
on the number of levels in the page table. The allocation routines use
iomem_resource's end and direct_map_physmem_end to figure out where to
allocate the region.
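
For context, here is a rough sketch of the registration path described
above, loosely following what zone device drivers do. The function name
and my_pagemap_ops are made up for illustration; error unwinding and the
dev_pagemap_ops callbacks that MEMORY_DEVICE_PRIVATE requires
(migrate_to_ram(), page_free()) are elided:

#include <linux/device.h>
#include <linux/ioport.h>
#include <linux/memremap.h>

static const struct dev_pagemap_ops my_pagemap_ops = {
	/* .migrate_to_ram and .page_free elided */
};

static struct dev_pagemap pagemap;

static int register_device_private(struct device *dev, unsigned long size)
{
	struct resource *res;

	/*
	 * Carve a free physical range out of iomem_resource; the end of
	 * iomem_resource is the value derived from
	 * boot_cpu_data.x86_phys_bits on bare-metal x86.
	 */
	res = request_free_mem_region(&iomem_resource, size, "device-private");
	if (IS_ERR(res))
		return PTR_ERR(res);

	pagemap.type = MEMORY_DEVICE_PRIVATE;
	pagemap.range.start = res->start;
	pagemap.range.end = res->end;
	pagemap.nr_range = 1;
	pagemap.ops = &my_pagemap_ops;

	/*
	 * memremap_pages() is the path that reaches add_pages() with
	 * params->pgmap set, which is what the fix keys off.
	 */
	return PTR_ERR_OR_ZERO(memremap_pages(&pagemap, dev_to_node(dev)));
}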
arch/powerpc is also impacted by this bug, but this patch does not fix
the issue for powerpc.
Testing:
1. Tested on a virtual machine with test_hmm for zone device insertion
2. A previous version of this patch was tested by Bert; see [2]
Link: https://lore.kernel.org/lkml/20250310112206.4168-1-spasswolf@web.de/ [1]
Link: https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/ [2]
Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
I've left powerpc out of this regression change due to the time required
to set up and test via QEMU. I wanted to address the regression quickly.
arch/x86/mm/init_64.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dce60767124f..cc60b57473a4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -970,9 +970,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	ret = __add_pages(nid, start_pfn, nr_pages, params);
 	WARN_ON_ONCE(ret);
 
-	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
-				  nr_pages << PAGE_SHIFT);
+	/*
+	 * add_pages() is called by memremap_pages() for adding device private
+	 * pages. Do not bump up max_pfn in the device private path: max_pfn
+	 * changes affect dma_addressing_limited(). dma_addressing_limited()
+	 * returning true when max_pfn lies beyond the device's addressable
+	 * memory can force device drivers to use bounce buffers and impact
+	 * their performance.
+	 */
+	if (!params->pgmap)
+		/* update max_pfn, max_low_pfn and high_memory */
+		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
+					  nr_pages << PAGE_SHIFT);
 
 	return ret;
 }
--
2.48.1
[ add Gregory and linux-mm ]
[ full context for new Cc: ]
Balbir Singh wrote:
> Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
> exposed a bug in the interaction between nokaslr and zone device memory,
> as seen on a system with an AMD iGPU and dGPU (see [1]).
> The root cause of the issue is that the GPU driver registers a zone
> device private memory region. When KASLR is disabled or the above commit
> is applied, direct_map_physmem_end is set much higher than 10 TiB,
> typically to the 64 TiB address. When zone device private memory is added
> to the system via add_pages(), it bumps max_pfn up to the same
> value. This causes dma_addressing_limited() to return true, since the
> device cannot address memory all the way up to max_pfn.
>
> This caused a regression for games played on the iGPU, as it resulted in
> the DMA32 zone being used for GPU allocations.
>
> Fix this by not bumping up max_pfn on x86 systems when pgmap is passed
> into add_pages(). The presence of pgmap is used to determine whether
> device private memory is being added via add_pages().
>
> More details:
>
> devm_request_mem_region() and request_free_mem_region() are used to
> request device private memory. iomem_resource is passed as the base
> resource with start and end parameters. iomem_resource's end depends on
> several factors, including the platform and virtualization. On bare-metal
> x86, for example, this value is determined by boot_cpu_data.x86_phys_bits,
> which can change depending on support for MKTME. By default it is the
> same as log2(direct_map_physmem_end), which is 46 to 52 bits depending
> on the number of levels in the page table. The allocation routines use
> iomem_resource's end and direct_map_physmem_end to figure out where to
> allocate the region.
>
> arch/powerpc is also impacted by this bug, but this patch does not fix
> the issue for powerpc.
>
> Testing:
> 1. Tested on a virtual machine with test_hmm for zone device insertion
> 2. A previous version of this patch was tested by Bert; see [2]
>
> Link: https://lore.kernel.org/lkml/20250310112206.4168-1-spasswolf@web.de/ [1]
> Link: https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/ [2]
> Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Kees Cook <kees@kernel.org>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Bert Karwatzki <spasswolf@web.de>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
> I've left powerpc out of this regression change due to the time required
> to set up and test via QEMU. I wanted to address the regression quickly.
>
>
> arch/x86/mm/init_64.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index dce60767124f..cc60b57473a4 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -970,9 +970,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>  	ret = __add_pages(nid, start_pfn, nr_pages, params);
>  	WARN_ON_ONCE(ret);
>
> -	/* update max_pfn, max_low_pfn and high_memory */
> -	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
> -				  nr_pages << PAGE_SHIFT);
> +	/*
> +	 * add_pages() is called by memremap_pages() for adding device private
> +	 * pages. Do not bump up max_pfn in the device private path: max_pfn
> +	 * changes affect dma_addressing_limited(). dma_addressing_limited()
> +	 * returning true when max_pfn lies beyond the device's addressable
> +	 * memory can force device drivers to use bounce buffers and impact
> +	 * their performance.
> +	 */
> +	if (!params->pgmap)
> +		/* update max_pfn, max_low_pfn and high_memory */
> +		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
> +					  nr_pages << PAGE_SHIFT);
The comment says that this adjustment is only for the device-private
case, but it applies to all driver-managed device memory.
Why not actually do what the comment says and limit this to
DEVICE_PRIVATE? I.e.:
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0e4270e20fad..4cc8175f9ffd 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -989,7 +989,7 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	 * addressable memory can force device drivers to use bounce buffers
 	 * and impact their performance negatively:
 	 */
-	if (!params->pgmap)
+	if (!params->pgmap || params->pgmap->type != MEMORY_DEVICE_PRIVATE)
 		/* update max_pfn, max_low_pfn and high_memory */
 		update_end_of_memory_vars(start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
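
For reference, these are the pgmap types such a check discriminates
between, abbreviated from include/linux/memremap.h (comments
paraphrased):

enum memory_type {
	MEMORY_DEVICE_PRIVATE = 1,	/* device memory, not addressable by the CPU */
	MEMORY_DEVICE_COHERENT,		/* device memory, coherently addressable */
	MEMORY_DEVICE_FS_DAX,		/* host memory, fs-dax */
	MEMORY_DEVICE_GENERIC,		/* host memory, e.g. dax/kmem */
	MEMORY_DEVICE_PCI_P2PDMA,	/* PCI BAR memory for peer-to-peer DMA */
};

Only MEMORY_DEVICE_PRIVATE is guaranteed to be CPU-inaccessible, which is
why keying the max_pfn decision off the type is more precise than keying
it off the mere presence of a pgmap.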
On 12/2/25 09:11, dan.j.williams@intel.com wrote:
> [ add Gregory and linux-mm ]
>
> [ full context for new Cc: ]
> Balbir Singh wrote:
>> Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
>> exposed a bug in the interaction between nokaslr and zone device memory,
>> as seen on a system with an AMD iGPU and dGPU (see [1]).
>> The root cause of the issue is that the GPU driver registers a zone
>> device private memory region. When KASLR is disabled or the above commit
>> is applied, direct_map_physmem_end is set much higher than 10 TiB,
>> typically to the 64 TiB address. When zone device private memory is added
>> to the system via add_pages(), it bumps max_pfn up to the same
>> value. This causes dma_addressing_limited() to return true, since the
>> device cannot address memory all the way up to max_pfn.
>>
>> This caused a regression for games played on the iGPU, as it resulted in
>> the DMA32 zone being used for GPU allocations.
>>
>> Fix this by not bumping up max_pfn on x86 systems when pgmap is passed
>> into add_pages(). The presence of pgmap is used to determine whether
>> device private memory is being added via add_pages().
>>
>> More details:
>>
>> devm_request_mem_region() and request_free_mem_region() are used to
>> request device private memory. iomem_resource is passed as the base
>> resource with start and end parameters. iomem_resource's end depends on
>> several factors, including the platform and virtualization. On bare-metal
>> x86, for example, this value is determined by boot_cpu_data.x86_phys_bits,
>> which can change depending on support for MKTME. By default it is the
>> same as log2(direct_map_physmem_end), which is 46 to 52 bits depending
>> on the number of levels in the page table. The allocation routines use
>> iomem_resource's end and direct_map_physmem_end to figure out where to
>> allocate the region.
>>
>> arch/powerpc is also impacted by this bug, but this patch does not fix
>> the issue for powerpc.
>>
>> Testing:
>> 1. Tested on a virtual machine with test_hmm for zone device insertion
>> 2. A previous version of this patch was tested by Bert; see [2]
>>
>> Link: https://lore.kernel.org/lkml/20250310112206.4168-1-spasswolf@web.de/ [1]
>> Link: https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/ [2]
>> Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
>>
>> Cc: "Christian König" <christian.koenig@amd.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Kees Cook <kees@kernel.org>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Bert Karwatzki <spasswolf@web.de>
>> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>> I've left powerpc out of this regression change due to the time required
>> to set up and test via QEMU. I wanted to address the regression quickly.
>>
>>
>> arch/x86/mm/init_64.c | 15 ++++++++++++---
>> 1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index dce60767124f..cc60b57473a4 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -970,9 +970,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>>  	ret = __add_pages(nid, start_pfn, nr_pages, params);
>>  	WARN_ON_ONCE(ret);
>>
>> -	/* update max_pfn, max_low_pfn and high_memory */
>> -	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> -				  nr_pages << PAGE_SHIFT);
>> +	/*
>> +	 * add_pages() is called by memremap_pages() for adding device private
>> +	 * pages. Do not bump up max_pfn in the device private path: max_pfn
>> +	 * changes affect dma_addressing_limited(). dma_addressing_limited()
>> +	 * returning true when max_pfn lies beyond the device's addressable
>> +	 * memory can force device drivers to use bounce buffers and impact
>> +	 * their performance.
>> +	 */
>> +	if (!params->pgmap)
>> +		/* update max_pfn, max_low_pfn and high_memory */
>> +		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> +					  nr_pages << PAGE_SHIFT);
>
> The comment says that this adjustment is only for the device-private
> case, but it applies to all driver-managed device memory.
>
> Why not actually do what the comment says and limit this to
> DEVICE_PRIVATE? I.e.:
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0e4270e20fad..4cc8175f9ffd 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -989,7 +989,7 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>  	 * addressable memory can force device drivers to use bounce buffers
>  	 * and impact their performance negatively:
>  	 */
> -	if (!params->pgmap)
> +	if (!params->pgmap || params->pgmap->type != MEMORY_DEVICE_PRIVATE)
>  		/* update max_pfn, max_low_pfn and high_memory */
>  		update_end_of_memory_vars(start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
>
When I audited the code at the time, I did notice that max_pfn was already set to the upper
end of physical memory (even when nothing is hot-unplugged, because those regions were parsed
and added via memblock_add()), and that all hotplug regions changing max_pfn were coming
from the device private path.

I agree we should check for pgmap->type, so this is definitely the right fix.

Reviewed-by: Balbir Singh <balbirs@nvidia.com>
On Mon, Dec 01, 2025 at 02:11:35PM -0800, dan.j.williams@intel.com wrote:
> [ add Gregory and linux-mm ]
>
> [ full context for new Cc: ]
> Balbir Singh wrote:
> > Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
> > exposed a bug in the interaction between nokaslr and zone device memory,
> > as seen on a system with an AMD iGPU and dGPU (see [1]).
> > The root cause of the issue is that the GPU driver registers a zone
^^^^^^^^^^^^^^ which one, the iGPU
or the dGPU? Or are they managed by the
same driver?
(sorry, stickler for vagueness)
> > Fix this by not bumping up max_pfn on x86 systems when pgmap is passed
> > into add_pages(). The presence of pgmap is used to determine whether
> > device private memory is being added via add_pages().
> >
Concur with Dan's take below here: please check for DEVICE_PRIVATE so as
not to affect DEVICE_COHERENT. Or, if there is a reason to affect
DEVICE_COHERENT, please explain that here.
> > arch/powerpc is also impacted by this bug, but this patch does not fix
> > the issue for powerpc.
> >
> > I've left powerpc out of this regression change due to the time required
> > to set up and test via QEMU. I wanted to address the regression quickly.
> >
At least +Cc ppc folks to take a look?
+Cc: linux-ppc-dev
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0e4270e20fad..4cc8175f9ffd 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -989,7 +989,7 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>  	 * addressable memory can force device drivers to use bounce buffers
>  	 * and impact their performance negatively:
>  	 */
> -	if (!params->pgmap)
> +	if (!params->pgmap || params->pgmap->type != MEMORY_DEVICE_PRIVATE)
>  		/* update max_pfn, max_low_pfn and high_memory */
>  		update_end_of_memory_vars(start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
>
This looks better to me.
~Gregory
* Balbir Singh <balbirs@nvidia.com> wrote:

>  arch/x86/mm/init_64.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index dce60767124f..cc60b57473a4 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -970,9 +970,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>  	ret = __add_pages(nid, start_pfn, nr_pages, params);
>  	WARN_ON_ONCE(ret);
>
> -	/* update max_pfn, max_low_pfn and high_memory */
> -	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
> -				  nr_pages << PAGE_SHIFT);
> +	/*
> +	 * add_pages() is called by memremap_pages() for adding device private
> +	 * pages. Do not bump up max_pfn in the device private path: max_pfn
> +	 * changes affect dma_addressing_limited(). dma_addressing_limited()
> +	 * returning true when max_pfn lies beyond the device's addressable
> +	 * memory can force device drivers to use bounce buffers and impact
> +	 * their performance.
> +	 */
> +	if (!params->pgmap)
> +		/* update max_pfn, max_low_pfn and high_memory */
> +		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
> +					  nr_pages << PAGE_SHIFT);

So given that device private pages are not supposed to be mapped
directly, not including these PFNs in max_pfn absolutely sounds like
the correct fix to me.

But wouldn't the abnormally high max_pfn also cause us to create a too
large direct mapping to cover it, or does something save us there? Such
an overly large mapping would increase kernel page table size rather
substantially on non-gbpages systems, AFAICS.

Say we create a 16TB mapping on a 16GB system - 1024x larger: to map
16 TB with largepages requires 8,388,608 largepage mappings (!), which
with 8-byte page table entries takes up ~64MB of unswappable RAM. (!!)

Is my math off, or am I misunderstanding something here?

Anyway, I've applied your fix to tip:x86/urgent with a few edits to the
comments and the changelog, but I've also expanded the Cc: list of the
commit liberally, in hope of getting more reviews for this fix. :-)

Thanks,

	Ingo
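
As a quick sanity check of the arithmetic above, a standalone userspace
calculation, assuming 2 MiB largepages and 8-byte page table entries:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t map_size = 16ULL << 40;	/* 16 TiB direct mapping */
	uint64_t largepage = 2ULL << 20;	/* 2 MiB PMD mappings    */
	uint64_t entries = map_size / largepage;

	/* prints: 8388608 mappings, 64 MiB of page table entries */
	printf("%llu mappings, %llu MiB of page table entries\n",
	       (unsigned long long)entries,
	       (unsigned long long)(entries * 8 >> 20));
	return 0;
}

So the math in the mail above checks out: 16 TiB / 2 MiB = 8,388,608
mappings, and 8,388,608 * 8 bytes = 64 MiB.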
On 4/1/25 19:57, Ingo Molnar wrote:
>
> * Balbir Singh <balbirs@nvidia.com> wrote:
>
>>  arch/x86/mm/init_64.c | 15 ++++++++++++---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index dce60767124f..cc60b57473a4 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -970,9 +970,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>>  	ret = __add_pages(nid, start_pfn, nr_pages, params);
>>  	WARN_ON_ONCE(ret);
>>
>> -	/* update max_pfn, max_low_pfn and high_memory */
>> -	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> -				  nr_pages << PAGE_SHIFT);
>> +	/*
>> +	 * add_pages() is called by memremap_pages() for adding device private
>> +	 * pages. Do not bump up max_pfn in the device private path: max_pfn
>> +	 * changes affect dma_addressing_limited(). dma_addressing_limited()
>> +	 * returning true when max_pfn lies beyond the device's addressable
>> +	 * memory can force device drivers to use bounce buffers and impact
>> +	 * their performance.
>> +	 */
>> +	if (!params->pgmap)
>> +		/* update max_pfn, max_low_pfn and high_memory */
>> +		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> +					  nr_pages << PAGE_SHIFT);
>
> So given that device private pages are not supposed to be mapped
> directly, not including these PFNs in max_pfn absolutely sounds like
> the correct fix to me.
>
> But wouldn't the abnormally high max_pfn also cause us to create a too
> large direct mapping to cover it, or does something save us there? Such
> an overly large mapping would increase kernel page table size rather
> substantially on non-gbpages systems, AFAICS.
>
> Say we create a 16TB mapping on a 16GB system - 1024x larger: to map
> 16 TB with largepages requires 8,388,608 largepage mappings (!), which
> with 8-byte page table entries takes up ~64MB of unswappable RAM. (!!)
>
> Is my math off, or am I misunderstanding something here?
>

That is a valid point, but that only applies if we cover all of max_pfn
with the direct mapping (I can't seem to remember if we do so with
sparsemem).

> Anyway, I've applied your fix to tip:x86/urgent with a few edits to the
> comments and the changelog, but I've also expanded the Cc: list of the
> commit liberally, in hope of getting more reviews for this fix. :-)
>

Thanks, and I'd like to get broader testing as well. I am also inclined
to send an RFC to add a WARN_ON_ONCE() if dma_addressing_limited()
returns true on 64-bit systems; I'm not sure whether the DMA folks would
be inclined to take it, or how often it really happens on existing
systems.

Balbir Singh
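
A minimal sketch of the WARN_ON_ONCE() idea floated above (hypothetical,
not an actual patch; where the check would live and its exact condition
are assumptions):

	/*
	 * Hypothetical: flag 64-bit capable devices that still come out
	 * addressing-limited, e.g. because max_pfn was inflated.
	 */
	if (IS_ENABLED(CONFIG_64BIT) &&
	    dma_get_mask(dev) == DMA_BIT_MASK(64))
		WARN_ON_ONCE(dma_addressing_limited(dev));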
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 7170130e4c72ce0caa0cb42a1627c635cc262821
Gitweb: https://git.kernel.org/tip/7170130e4c72ce0caa0cb42a1627c635cc262821
Author: Balbir Singh <balbirs@nvidia.com>
AuthorDate: Tue, 01 Apr 2025 11:07:52 +11:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 01 Apr 2025 10:52:38 +02:00
x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers
As Bert Karwatzki reported, the following recent commit causes a
performance regression on AMD iGPU and dGPU systems:
7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
It exposed a bug with nokaslr and zone device interaction.
The root cause of the bug is that the GPU driver registers a zone
device private memory region. When KASLR is disabled or the above commit
is applied, direct_map_physmem_end is set much higher than 10 TiB,
typically to the 64 TiB address. When zone device private memory is added
to the system via add_pages(), it bumps max_pfn up to the same
value. This causes dma_addressing_limited() to return true, since the
device cannot address memory all the way up to max_pfn.
This caused a regression for games played on the iGPU, as it resulted in
the DMA32 zone being used for GPU allocations.
Fix this by not bumping up max_pfn on x86 systems when pgmap is passed
into add_pages(). The presence of pgmap is used to determine whether
device private memory is being added via add_pages().
More details:
devm_request_mem_region() and request_free_mem_region() are used to
request device private memory. iomem_resource is passed as the base
resource with start and end parameters. iomem_resource's end depends on
several factors, including the platform and virtualization. On bare-metal
x86, for example, this value is determined by boot_cpu_data.x86_phys_bits,
which can change depending on support for MKTME. By default it is the
same as log2(direct_map_physmem_end), which is 46 to 52 bits depending
on the number of levels in the page table. The allocation routines use
iomem_resource's end and direct_map_physmem_end to figure out where to
allocate the region.
[ arch/powerpc is also impacted by this problem, but this patch does not fix
the issue for PowerPC. ]
Testing:
1. Tested on a virtual machine with test_hmm for zone device insertion
2. A previous version of this patch was tested by Bert; please see:
https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/
[ mingo: Clarified the comments and the changelog. ]
Reported-by: Bert Karwatzki <spasswolf@web.de>
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Link: https://lore.kernel.org/r/20250401000752.249348-1-balbirs@nvidia.com
---
arch/x86/mm/init_64.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 519aa53..821a0b5 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -959,9 +959,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	ret = __add_pages(nid, start_pfn, nr_pages, params);
 	WARN_ON_ONCE(ret);
 
-	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
-				  nr_pages << PAGE_SHIFT);
+	/*
+	 * Special case: add_pages() is called by memremap_pages() for adding device
+	 * private pages. Do not bump up max_pfn in the device private path,
+	 * because max_pfn changes affect dma_addressing_limited().
+	 *
+	 * dma_addressing_limited() returning true when max_pfn is the device's
+	 * addressable memory can force device drivers to use bounce buffers
+	 * and impact their performance negatively:
+	 */
+	if (!params->pgmap)
+		/* update max_pfn, max_low_pfn and high_memory */
+		update_end_of_memory_vars(start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
 
 	return ret;
 }