Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
logic") removed an optimization introduced by commit 756398750e11
("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
was not called out explicitly in the message of the first commit, so it's
not entirely clear whether removing the optimization happened
inadvertently or not.
As the original commit message of the optimization explains, there is no
point considering the children of a subtree in find_next_iomem_res() if
the top-level range does not match. Reinstating the optimization results
in significant performance improvements on systems with very large iomem
maps when mmapping /dev/mem.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
kernel/resource.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/resource.c b/kernel/resource.c
index b9fa2a4ce089..e4e9bac12e6e 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -341,6 +341,8 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
unsigned long flags, unsigned long desc,
struct resource *res)
{
+ /* Skip children until we find a top level range that matches */
+ bool skip_children = true;
struct resource *p;
if (!res)
@@ -351,7 +353,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
read_lock(&resource_lock);
- for_each_resource(&iomem_resource, p, false) {
+ for_each_resource(&iomem_resource, p, skip_children) {
/* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
@@ -362,6 +364,12 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
if (p->end < start)
continue;
+ /*
+ * We found a top level range that matches what we are looking
+ * for. Time to start checking children too.
+ */
+ skip_children = false;
+
/* Found a match, break */
if (is_type_match(p, flags, desc))
break;
--
2.43.0
On 11/24/25 17:53, Ilias Stamatis wrote:
> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> logic") removed an optimization introduced by commit 756398750e11
> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> was not called out in the message of the first commit explicitly so it's
> not entirely clear whether removing the optimization happened
> inadvertently or not.
Remembering the history, we have some things where the top might not
fully describe what the lower levels do.
One such example is found here:
Author: Dan Williams <dan.j.williams@intel.com>
Date: Thu Feb 16 00:36:02 2023 -0800
dax/kmem: Fix leak of memory-hotplug resources
While experimenting with CXL region removal the following corruption of
/proc/iomem appeared.
Before:
f010000000-f04fffffff : CXL Window 0
f010000000-f02fffffff : region4
f010000000-f02fffffff : dax4.0
f010000000-f02fffffff : System RAM (kmem)
The CXL Windows will certainly not match System RAM, as one example.
How would your change affect such cases?
--
Cheers
David
On 11/25/25 09:09, David Hildenbrand (Red Hat) wrote:
> On 11/24/25 17:53, Ilias Stamatis wrote:
>> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
>> logic") removed an optimization introduced by commit 756398750e11
>> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
>> was not called out in the message of the first commit explicitly so it's
>> not entirely clear whether removing the optimization happened
>> inadvertently or not.
>
> Remembering the history, we have some things where the top might not
> fully describe what the lower levels do.
>
> An example is for example found here:
>
> Author: Dan Williams <dan.j.williams@intel.com>
> Date: Thu Feb 16 00:36:02 2023 -0800
>
> dax/kmem: Fix leak of memory-hotplug resources
>
> While experimenting with CXL region removal the following corruption of
> /proc/iomem appeared.
>
> Before:
> f010000000-f04fffffff : CXL Window 0
> f010000000-f02fffffff : region4
> f010000000-f02fffffff : dax4.0
> f010000000-f02fffffff : System RAM (kmem)
>
> The CXL Windows will certainly not match System RAM, as one example.
>
> How would your change affect such cases?
Looking into the details: since we only check that the actual range
matches, not the type, I assume this is fine.
So yeah, that makes sense to me. I guess I removed it as part of
97523a4edb7b by accident, when I primarily wanted to remove the
first_lvl parameter.
Given that the above still works as expected:
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> logic") removed an optimization introduced by commit 756398750e11
> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> was not called out in the message of the first commit explicitly so it's
> not entirely clear whether removing the optimization happened
> inadvertently or not.
>
> As the original commit message of the optimization explains there is no
> point considering the children of a subtree in find_next_iomem_res() if
> the top level range does not match. Reinstating the optimization results
> in significant performance improvements in systems with very large iomem
> maps when mmaping /dev/mem.
It would be great if we could quantify "significant performance
improvements"?
On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
>
> > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > logic") removed an optimization introduced by commit 756398750e11
> > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > was not called out in the message of the first commit explicitly so it's
> > not entirely clear whether removing the optimization happened
> > inadvertently or not.
> >
> > As the original commit message of the optimization explains there is no
> > point considering the children of a subtree in find_next_iomem_res() if
> > the top level range does not match. Reinstating the optimization results
> > in significant performance improvements in systems with very large iomem
> > maps when mmaping /dev/mem.
>
> It would be great if we could quantify "significant performance
> improvements"?
Hi Andrew and Andy,
You are right to call that out and apologies for leaving it vague.
I've done my testing with older kernel versions on systems where `wc -l
/proc/iomem` can return ~5k. In that environment I see mmapping parts of
/dev/mem take 700-1500μs without the optimisation and 10-50μs with it.
The real-world use case we care about is hypervisor live update, where having
to do lots of these mmap() calls serially can significantly affect the guest
downtime when each call costs 20-30x more.
> It also would be good to know which exact function(s) is a bottleneck.
Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(), the
full call stack being:
find_next_iomem_res+0x3b ([kernel.kallsyms])
walk_system_ram_range+0x98 ([kernel.kallsyms])
pat_pagerange_is_ram+0x6e ([kernel.kallsyms])
reserve_pfn_range+0x47 ([kernel.kallsyms])
track_pfn_remap+0xb6 ([kernel.kallsyms])
remap_pfn_range+0x3b ([kernel.kallsyms])
mmap_mem+0x9e ([kernel.kallsyms])
mm_struct_mmap_region+0x1f3 ([kernel.kallsyms])
mmap_region+0xa3 ([kernel.kallsyms])
do_mmap+0x3ea ([kernel.kallsyms])
vm_mmap_pgoff+0xa2 ([kernel.kallsyms])
ksys_mmap_pgoff+0xec ([kernel.kallsyms])
do_syscall_64+0x29 ([kernel.kallsyms])
entry_SYSCALL_64_after_hwframe+0x4b ([kernel.kallsyms])
Thanks,
Ilias
On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> >
> > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > logic") removed an optimization introduced by commit 756398750e11
> > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > was not called out in the message of the first commit explicitly so it's
> > > not entirely clear whether removing the optimization happened
> > > inadvertently or not.
> > >
> > > As the original commit message of the optimization explains there is no
> > > point considering the children of a subtree in find_next_iomem_res() if
> > > the top level range does not match. Reinstating the optimization results
> > > in significant performance improvements in systems with very large iomem
> > > maps when mmaping /dev/mem.
> >
> > It would be great if we could quantify "significant performance
> > improvements"?
>
> Hi Andrew and Andy,
>
> You are right to call that out and apologies for leaving it vague.
>
> I've done my testing with older kernel versions in systems where `wc -l
> /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> optimisation.
>
> The real-world use case we care about is hypervisor live update where having to
> do lots of these mmaps() serially can significantly affect the guest downtime
> if the cost is 20-30x.
Thanks for providing this information.
> > It also would be good to know which exact function(s) is a bottleneck.
>
> Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
Have you investigated the possibility of putting that check back directly
into the culprit?
> the full call stack being:
>
> find_next_iomem_res+0x3b ([kernel.kallsyms])
> walk_system_ram_range+0x98 ([kernel.kallsyms])
> pat_pagerange_is_ram+0x6e ([kernel.kallsyms])
> reserve_pfn_range+0x47 ([kernel.kallsyms])
> track_pfn_remap+0xb6 ([kernel.kallsyms])
> remap_pfn_range+0x3b ([kernel.kallsyms])
> mmap_mem+0x9e ([kernel.kallsyms])
> mm_struct_mmap_region+0x1f3 ([kernel.kallsyms])
> mmap_region+0xa3 ([kernel.kallsyms])
--
With Best Regards,
Andy Shevchenko
On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > >
> > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > logic") removed an optimization introduced by commit 756398750e11
> > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > was not called out in the message of the first commit explicitly so it's
> > > > not entirely clear whether removing the optimization happened
> > > > inadvertently or not.
> > > >
> > > > As the original commit message of the optimization explains there is no
> > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > the top level range does not match. Reinstating the optimization results
> > > > in significant performance improvements in systems with very large iomem
> > > > maps when mmaping /dev/mem.
> > >
> > > It would be great if we could quantify "significant performance
> > > improvements"?
> >
> > Hi Andrew and Andy,
> >
> > You are right to call that out and apologies for leaving it vague.
> >
> > I've done my testing with older kernel versions in systems where `wc -l
> > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > optimisation.
> >
> > The real-world use case we care about is hypervisor live update where having to
> > do lots of these mmaps() serially can significantly affect the guest downtime
> > if the cost is 20-30x.
>
> Thanks for providing this information.
>
> > > It also would be good to know which exact function(s) is a bottleneck.
> >
> > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
>
> Have you investigated possibility to return that check directly into
> the culprit?
>
>
I'm sorry, I don't understand this. Could you please clarify what you mean?
What do you consider to be the culprit and which check do you refer to?
On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > >
> > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > was not called out in the message of the first commit explicitly so it's
> > > > > not entirely clear whether removing the optimization happened
> > > > > inadvertently or not.
> > > > >
> > > > > As the original commit message of the optimization explains there is no
> > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > the top level range does not match. Reinstating the optimization results
> > > > > in significant performance improvements in systems with very large iomem
> > > > > maps when mmaping /dev/mem.
> > > >
> > > > It would be great if we could quantify "significant performance
> > > > improvements"?
> > >
> > > Hi Andrew and Andy,
> > >
> > > You are right to call that out and apologies for leaving it vague.
> > >
> > > I've done my testing with older kernel versions in systems where `wc -l
> > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > optimisation.
> > >
> > > The real-world use case we care about is hypervisor live update where having to
> > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > if the cost is 20-30x.
> >
> > Thanks for providing this information.
> >
> > > > It also would be good to know which exact function(s) is a bottleneck.
> > >
> > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> >
> > Have you investigated possibility to return that check directly into
> > the culprit?
>
> I'm sorry, I don't understand this. Could you please clarify what you mean?
> What do you consider to be the culprit and which check do you refer to?
The mentioned patch removed the check for siblings from next_resource().
The function that your test case complains about is find_next_iomem_res().
Hence, have you tried reinstating the (removed) check from next_resource()
in find_next_iomem_res() to see if it helps?
--
With Best Regards,
Andy Shevchenko
On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > > >
> > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > not entirely clear whether removing the optimization happened
> > > > > > inadvertently or not.
> > > > > >
> > > > > > As the original commit message of the optimization explains there is no
> > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > in significant performance improvements in systems with very large iomem
> > > > > > maps when mmaping /dev/mem.
> > > > >
> > > > > It would be great if we could quantify "significant performance
> > > > > improvements"?
> > > >
> > > > Hi Andrew and Andy,
> > > >
> > > > You are right to call that out and apologies for leaving it vague.
> > > >
> > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > optimisation.
> > > >
> > > > The real-world use case we care about is hypervisor live update where having to
> > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > if the cost is 20-30x.
> > >
> > > Thanks for providing this information.
> > >
> > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > >
> > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > >
> > > Have you investigated possibility to return that check directly into
> > > the culprit?
> >
> > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > What do you consider to be the culprit and which check do you refer to?
>
> The mentioned patch removed the check for siblings from next_resource().
> The function that your test case complains about is find_next_iomem_res().
> Hence, have you tried to reinstantiate the (removed) check from next_resource()
> in find_next_iomem_res() and see if it helps?
>
next_resource() does accept a 'skip_children' parameter in the latest kernel
today, which is equivalent to the 'sibling_only' parameter in the older kernels.
And the for_each_resource() macro currently used in find_next_iomem_res() calls
next_resource().
Hope that makes sense.
On Mon, Nov 24, 2025 at 11:30:46PM +0000, Stamatis, Ilias wrote:
> On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> > On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > > > >
> > > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > > not entirely clear whether removing the optimization happened
> > > > > > > inadvertently or not.
> > > > > > >
> > > > > > > As the original commit message of the optimization explains there is no
> > > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > > in significant performance improvements in systems with very large iomem
> > > > > > > maps when mmaping /dev/mem.
> > > > > >
> > > > > > It would be great if we could quantify "significant performance
> > > > > > improvements"?
> > > > >
> > > > > Hi Andrew and Andy,
> > > > >
> > > > > You are right to call that out and apologies for leaving it vague.
> > > > >
> > > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > > optimisation.
> > > > >
> > > > > The real-world use case we care about is hypervisor live update where having to
> > > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > > if the cost is 20-30x.
> > > >
> > > > Thanks for providing this information.
> > > >
> > > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > > >
> > > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > > >
> > > > Have you investigated possibility to return that check directly into
> > > > the culprit?
> > >
> > > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > > What do you consider to be the culprit and which check do you refer to?
> >
> > The mentioned patch removed the check for siblings from next_resource().
> > The function that your test case complains about is find_next_iomem_res().
> > Hence, have you tried to reinstantiate the (removed) check from next_resource()
> > in find_next_iomem_res() and see if it helps?
>
> next_resource() does accept a 'skip_children' parameter in the latest kernel
> today which is equivalent to the 'sibling_only' parameter in the older
> kernels.
It used to be
if (sibling_only)
return p->sibling;
if (p->child)
return p->child;
...
and became (in the latest kernels)
if (!skip_children && p->child)
return p->child;
...
Can you elaborate on how they are equivalent?
TL;DR: I don't think they are.
> And the for_each_resource() macro currently used in find_next_iomem_res()
> calls next_resource().
>
> Hope that makes sense.
--
With Best Regards,
Andy Shevchenko
On Tue, 2025-11-25 at 08:50 +0200, andriy.shevchenko@linux.intel.com wrote:
> On Mon, Nov 24, 2025 at 11:30:46PM +0000, Stamatis, Ilias wrote:
> > On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > > > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
> > > > > > >
> > > > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > > > not entirely clear whether removing the optimization happened
> > > > > > > > inadvertently or not.
> > > > > > > >
> > > > > > > > As the original commit message of the optimization explains there is no
> > > > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > > > in significant performance improvements in systems with very large iomem
> > > > > > > > maps when mmaping /dev/mem.
> > > > > > >
> > > > > > > It would be great if we could quantify "significant performance
> > > > > > > improvements"?
> > > > > >
> > > > > > Hi Andrew and Andy,
> > > > > >
> > > > > > You are right to call that out and apologies for leaving it vague.
> > > > > >
> > > > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > > > optimisation.
> > > > > >
> > > > > > The real-world use case we care about is hypervisor live update where having to
> > > > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > > > if the cost is 20-30x.
> > > > >
> > > > > Thanks for providing this information.
> > > > >
> > > > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > > > >
> > > > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > > > >
> > > > > Have you investigated possibility to return that check directly into
> > > > > the culprit?
> > > >
> > > > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > > > What do you consider to be the culprit and which check do you refer to?
> > >
> > > The mentioned patch removed the check for siblings from next_resource().
> > > The function that your test case complains about is find_next_iomem_res().
> > > Hence, have you tried to reinstantiate the (removed) check from next_resource()
> > > in find_next_iomem_res() and see if it helps?
> >
> > next_resource() does accept a 'skip_children' parameter in the latest kernel
> > today which is equivalent to the 'sibling_only' parameter in the older
> > kernels.
>
> It used to be
>
> if (sibling_only)
> return p->sibling;
>
> if (p->child)
> return p->child;
> ...
>
This returns p->sibling if sibling_only == true.
The return value might also be NULL.
>
> and become (in the latest kernels)
>
> if (!skip_children && p->child)
> return p->child;
> ...
>
>
if (!skip_children && p->child)
return p->child;
while (!p->sibling && p->parent) {
p = p->parent;
if (p == subtree_root)
return NULL;
}
return p->sibling;
This is the full function in the latest kernel. If skip_children == true and
there is a sibling, it also returns p->sibling.
If p->sibling is NULL, it will try to move up to the parent. In the case of
find_next_iomem_res() the parent will be iomem_resource, in which case the
(p == subtree_root) path is taken and we return NULL (the same as the
p->sibling == NULL case above).
> Can you elaborate how are they interoperable?
>
> TL;DR: I don't think it's an equivalent.
On Tue, Nov 25, 2025 at 09:56:36AM +0000, Stamatis, Ilias wrote:
> On Tue, 2025-11-25 at 08:50 +0200, andriy.shevchenko@linux.intel.com wrote:
> > On Mon, Nov 24, 2025 at 11:30:46PM +0000, Stamatis, Ilias wrote:
> > > On Mon, 2025-11-24 at 21:52 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > On Mon, Nov 24, 2025 at 07:35:31PM +0000, Stamatis, Ilias wrote:
> > > > > On Mon, 2025-11-24 at 20:55 +0200, andriy.shevchenko@linux.intel.com wrote:
> > > > > > On Mon, Nov 24, 2025 at 06:01:35PM +0000, Stamatis, Ilias wrote:
> > > > > > > On Mon, 2025-11-24 at 08:58 -0800, Andrew Morton wrote:
> > > > > > > > On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
...
> > > > > > > > > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > > > > > > > > logic") removed an optimization introduced by commit 756398750e11
> > > > > > > > > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > > > > > > > > was not called out in the message of the first commit explicitly so it's
> > > > > > > > > not entirely clear whether removing the optimization happened
> > > > > > > > > inadvertently or not.
> > > > > > > > >
> > > > > > > > > As the original commit message of the optimization explains there is no
> > > > > > > > > point considering the children of a subtree in find_next_iomem_res() if
> > > > > > > > > the top level range does not match. Reinstating the optimization results
> > > > > > > > > in significant performance improvements in systems with very large iomem
> > > > > > > > > maps when mmaping /dev/mem.
> > > > > > > >
> > > > > > > > It would be great if we could quantify "significant performance
> > > > > > > > improvements"?
> > > > > > >
> > > > > > > I've done my testing with older kernel versions in systems where `wc -l
> > > > > > > /proc/iomem` can return ~5k. In that environment I see mmaping parts of
> > > > > > > /dev/mem taking 700-1500μs without the optimisation and 10-50μs with the
> > > > > > > optimisation.
> > > > > > >
> > > > > > > The real-world use case we care about is hypervisor live update where having to
> > > > > > > do lots of these mmaps() serially can significantly affect the guest downtime
> > > > > > > if the cost is 20-30x.
> > > > > >
> > > > > > Thanks for providing this information.
> > > > > >
> > > > > > > > It also would be good to know which exact function(s) is a bottleneck.
> > > > > > >
> > > > > > > Perf tracing shows that ~95% of CPU time is spent in find_next_iomem_res(),
> > > > > >
> > > > > > Have you investigated possibility to return that check directly into
> > > > > > the culprit?
> > > > >
> > > > > I'm sorry, I don't understand this. Could you please clarify what you mean?
> > > > > What do you consider to be the culprit and which check do you refer to?
> > > >
> > > > The mentioned patch removed the check for siblings from next_resource().
> > > > The function that your test case complains about is find_next_iomem_res().
> > > > Hence, have you tried to reinstantiate the (removed) check from next_resource()
> > > > in find_next_iomem_res() and see if it helps?
> > >
> > > next_resource() does accept a 'skip_children' parameter in the latest kernel
> > > today which is equivalent to the 'sibling_only' parameter in the older
> > > kernels.
> >
> > It used to be
> >
> > if (sibling_only)
> > return p->sibling;
> >
> > if (p->child)
> > return p->child;
> > ...
>
> This returns p->sibling if sibling_only == true.
> The return value might also be NULL.
>
> > and become (in the latest kernels)
> >
> > if (!skip_children && p->child)
> > return p->child;
> > ...
>
> if (!skip_children && p->child)
> return p->child;
> while (!p->sibling && p->parent) {
> p = p->parent;
> if (p == subtree_root)
> return NULL;
> }
> return p->sibling;
>
> This is the full function on the latest kernel. If skip_children == true and
> there is a sibling, it also returns p->sibling.
>
> If p->sibling is NULL, it'll try to get the parent. In the case of
> find_next_iomem_res() the parent will be iomem_resource, in which case the if
> (p == subtree_root) path is taken and we return NULL (same as the case of
> p->sibling being NULL above).
Thanks for the elaboration.
Please summarise this, add the performance test results and send a v2.
Seems okay to me.
> > Can you elaborate how are they interoperable?
> >
> > TL;DR: I don't think it's an equivalent.
So, it's not a literal equivalent, but it behaves in a very similar way.
--
With Best Regards,
Andy Shevchenko
On Tue, 2025-11-25 at 12:23 +0200, andriy.shevchenko@linux.intel.com wrote:
> Thanks for elaboration.
>
> Please summarise this, add the performance test results and send a v2.
> Seems okay to me.
How does this look?
No code change, but happy to send a v2 if that makes things easier for merging.
Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
logic") removed an optimization introduced by commit 756398750e11
("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
was not called out in the message of the first commit explicitly so it's
not entirely clear whether removing the optimization happened
inadvertently or not.
As the original commit message of the optimization explains there is no
point considering the children of a subtree in find_next_iomem_res() if
the top level range does not match.
Reinstating the optimization results in performance improvements on
systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem
on such platforms takes 700-1500μs without the optimization and 10-50μs
with it.
Note that even though commit 97523a4edb7b removed the 'sibling_only'
parameter from next_resource(), newer kernels have basically reinstated
it under the name 'skip_children'.
Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u
Fixes: 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic")
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
On Tue, Nov 25, 2025 at 02:23:38PM +0000, Stamatis, Ilias wrote:
> On Tue, 2025-11-25 at 12:23 +0200, andriy.shevchenko@linux.intel.com wrote:
> > Thanks for elaboration.
> >
> > Please summarise this, add the performance test results and send a v2.
> > Seems okay to me.
>
> How does this look?
>
> No code change, but happy to send a v2 if that makes things easier for merging.
Yes, please do.
> Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
resource: Reinstate "avoid unnecessary lookups in find_next_iomem_res()"
(always keep the leading prefix).
> Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> logic") removed an optimization introduced by commit 756398750e11
> ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> was not called out in the message of the first commit explicitly so it's
> not entirely clear whether removing the optimization happened
> inadvertently or not.
>
> As the original commit message of the optimization explains there is no
> point considering the children of a subtree in find_next_iomem_res() if
> the top level range does not match.
>
> Reinstating the optimization results in performance improvements in
> systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem
> in such platforms takes 700-1500μs without the optimisation and 10-50μs
> with the optimisation.
>
> Note that even though commit 97523a4edb7b removed the 'sibling_only'
> parameter from next_resource(), newer kernels have basically reinstated
> it under the name 'skip_children'.
>
> Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u
> Fixes: 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic")
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
The rest LGTM.
Also, if required, make a comment in the code aligned with this new commit
message.
--
With Best Regards,
Andy Shevchenko
On Mon, Nov 24, 2025 at 08:58:16AM -0800, Andrew Morton wrote:
> On Mon, 24 Nov 2025 16:53:49 +0000 Ilias Stamatis <ilstam@amazon.com> wrote:
>
> > Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only
> > logic") removed an optimization introduced by commit 756398750e11
> > ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
> > was not called out in the message of the first commit explicitly so it's
> > not entirely clear whether removing the optimization happened
> > inadvertently or not.
> >
> > As the original commit message of the optimization explains there is no
> > point considering the children of a subtree in find_next_iomem_res() if
> > the top level range does not match. Reinstating the optimization results
> > in significant performance improvements in systems with very large iomem
> > maps when mmaping /dev/mem.
>
> It would be great if we could quantify "significant performance
> improvements"?
+1. It also would be good to know which exact function(s) is a bottleneck.
The mentioned change updated a handful of them.
--
With Best Regards,
Andy Shevchenko