We can skip child resources when the parent resource does not cover
the range: child resources always lie within their parent, so if the
parent ends before the range we are looking for starts, none of its
descendants can intersect the range either.
This should help vmf_insert_* users on x86, such as several DRM drivers.
On my AMD Ryzen 5 7520C, when streaming data from cpu memory into amdgpu
bo, the throughput goes from 5.1GB/s to 6.6GB/s. perf report says
34.69%--__do_fault
34.60%--amdgpu_gem_fault
34.00%--ttm_bo_vm_fault_reserved
32.95%--vmf_insert_pfn_prot
25.89%--track_pfn_insert
24.35%--lookup_memtype
21.77%--pat_pagerange_is_ram
20.80%--walk_system_ram_range
17.42%--find_next_iomem_res
before this change, and
26.67%--__do_fault
26.57%--amdgpu_gem_fault
25.83%--ttm_bo_vm_fault_reserved
24.40%--vmf_insert_pfn_prot
14.30%--track_pfn_insert
12.20%--lookup_memtype
9.34%--pat_pagerange_is_ram
8.22%--walk_system_ram_range
5.09%--find_next_iomem_res
after.
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
---
kernel/resource.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/resource.c b/kernel/resource.c
index fcbca39dbc450..19b84b4f9a577 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -326,6 +326,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
unsigned long flags, unsigned long desc,
struct resource *res)
{
+ bool skip_children = false;
struct resource *p;
if (!res)
@@ -336,7 +337,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
read_lock(&resource_lock);
- for_each_resource(&iomem_resource, p, false) {
+ for_each_resource(&iomem_resource, p, skip_children) {
/* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
@@ -344,8 +345,11 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
}
/* Skip until we find a range that matches what we look for */
- if (p->end < start)
+ if (p->end < start) {
+ skip_children = true;
continue;
+ }
+ skip_children = false;
if ((p->flags & flags) != flags)
continue;
--
2.45.1.288.g0e0cd299f1-goog
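For readers less familiar with this code: for_each_resource() is a
depth-first walk of the resource tree. Paraphrased from
kernel/resource.c (treat the actual tree as authoritative), it looks
roughly like this:

	/* Descend into children unless the caller asked to skip them;
	 * otherwise advance to the next sibling, climbing back up the
	 * tree when a subtree is exhausted. */
	static struct resource *next_resource(struct resource *p,
					      bool skip_children)
	{
		if (!skip_children && p->child)
			return p->child;
		while (!p->sibling && p->parent)
			p = p->parent;
		return p->sibling;
	}

	#define for_each_resource(_root, _p, _skip_children)		\
		for ((_p) = (_root)->child; (_p);			\
		     (_p) = next_resource(_p, _skip_children))

With skip_children set, a whole non-matching subtree is stepped over in
a single iteration instead of being visited node by node, which is
where the win in the perf numbers above comes from.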
On Thu, May 30, 2024 at 10:36:57PM -0700, Chia-I Wu wrote:
> We can skip child resources when the parent resource does not cover
> the range: child resources always lie within their parent, so if the
> parent ends before the range we are looking for starts, none of its
> descendants can intersect the range either.
>
> This should help vmf_insert_* users on x86, such as several DRM drivers.
> On my AMD Ryzen 5 7520C, when streaming data from cpu memory into amdgpu
> bo, the throughput goes from 5.1GB/s to 6.6GB/s. perf report says
>
> 34.69%--__do_fault
> 34.60%--amdgpu_gem_fault
> 34.00%--ttm_bo_vm_fault_reserved
> 32.95%--vmf_insert_pfn_prot
> 25.89%--track_pfn_insert
> 24.35%--lookup_memtype
> 21.77%--pat_pagerange_is_ram
> 20.80%--walk_system_ram_range
> 17.42%--find_next_iomem_res
>
> before this change, and
>
> 26.67%--__do_fault
> 26.57%--amdgpu_gem_fault
> 25.83%--ttm_bo_vm_fault_reserved
> 24.40%--vmf_insert_pfn_prot
> 14.30%--track_pfn_insert
> 12.20%--lookup_memtype
> 9.34%--pat_pagerange_is_ram
> 8.22%--walk_system_ram_range
> 5.09%--find_next_iomem_res
>
> after.

That's great, but why is walk_system_ram_range() being called so often?

Shouldn't that be a "set up the device" only type of thing? Why hammer
on lookup_memtype() when you already know the memtype: you just did the
same lookup for the previous frame.

This feels like it could be optimized to just "don't call these things",
which would make it go faster, right?

What am I missing here, why does this always have to be calculated all
the time? Resource mappings change rarely, if at all, over the lifetime
of a system. Constantly recalculating something that never changes feels
odd to me.

thanks,

greg k-h
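Greg's suggestion amounts to memoizing the lookup instead of walking
the resource tree on every fault. A minimal userspace sketch of the
idea, with every name made up for illustration (this is not the
kernel's PAT code, and a real version would need locking and
invalidation when resources change):

	#include <stdbool.h>

	/* Hypothetical single-entry cache in front of an expensive
	 * range -> memory-type lookup. */
	struct memtype_cache {
		unsigned long start, end;	/* range the answer covers */
		int type;			/* cached lookup result */
		bool valid;
	};

	static int cached_lookup(struct memtype_cache *c,
				 unsigned long start, unsigned long end,
				 int (*slow_path)(unsigned long, unsigned long))
	{
		/* Hit: the queried range is inside the cached one. */
		if (c->valid && start >= c->start && end <= c->end)
			return c->type;

		/* Miss: do the expensive walk once and remember it. */
		c->type = slow_path(start, end);
		c->start = start;
		c->end = end;
		c->valid = true;
		return c->type;
	}

Whether something like this can be done safely in the PAT code is
exactly the open question in the thread.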
On Tue, Jun 4, 2024 at 8:41 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Thu, May 30, 2024 at 10:36:57PM -0700, Chia-I Wu wrote:
> > We can skip child resources when the parent resource does not cover
> > the range: child resources always lie within their parent, so if the
> > parent ends before the range we are looking for starts, none of its
> > descendants can intersect the range either.
> >
> > This should help vmf_insert_* users on x86, such as several DRM drivers.
> > On my AMD Ryzen 5 7520C, when streaming data from cpu memory into amdgpu
> > bo, the throughput goes from 5.1GB/s to 6.6GB/s. perf report says
> >
> > 34.69%--__do_fault
> > 34.60%--amdgpu_gem_fault
> > 34.00%--ttm_bo_vm_fault_reserved
> > 32.95%--vmf_insert_pfn_prot
> > 25.89%--track_pfn_insert
> > 24.35%--lookup_memtype
> > 21.77%--pat_pagerange_is_ram
> > 20.80%--walk_system_ram_range
> > 17.42%--find_next_iomem_res
> >
> > before this change, and
> >
> > 26.67%--__do_fault
> > 26.57%--amdgpu_gem_fault
> > 25.83%--ttm_bo_vm_fault_reserved
> > 24.40%--vmf_insert_pfn_prot
> > 14.30%--track_pfn_insert
> > 12.20%--lookup_memtype
> > 9.34%--pat_pagerange_is_ram
> > 8.22%--walk_system_ram_range
> > 5.09%--find_next_iomem_res
> >
> > after.
>
> That's great, but why is walk_system_ram_range() being called so often?
>
> Shouldn't that be a "set up the device" only type of thing? Why hammer
> on lookup_memtype() when you already know the memtype: you just did the
> same lookup for the previous frame.
>
> This feels like it could be optimized to just "don't call these things",
> which would make it go faster, right?
>
> What am I missing here, why does this always have to be calculated all
> the time? Resource mappings change rarely, if at all, over the lifetime
> of a system. Constantly recalculating something that never changes feels
> odd to me.

Yeah, that would be even better. I am not familiar with the x86 PAT
code, so I will have to defer that to those more familiar with the
matter.

>
> thanks,
>
> greg k-h
On Thu, May 30, 2024 at 10:36:57PM -0700, Chia-I Wu wrote:
> We can skip child resources when the parent resource does not cover
> the range: child resources always lie within their parent, so if the
> parent ends before the range we are looking for starts, none of its
> descendants can intersect the range either.

> This should help vmf_insert_* users on x86, such as several DRM drivers.

vmf_insert_*()

> On my AMD Ryzen 5 7520C, when streaming data from cpu memory into amdgpu
> bo, the throughput goes from 5.1GB/s to 6.6GB/s. perf report says

Also in the $Subj (and pay attention to the prefix):

"resource: ... find_next_iomem_res()"

--
With Best Regards,
Andy Shevchenko
On Thu, May 30, 2024 at 10:36:57PM -0700, Chia-I Wu wrote:
> We can skip child resources when the parent resource does not cover
> the range: child resources always lie within their parent, so if the
> parent ends before the range we are looking for starts, none of its
> descendants can intersect the range either.
>
> This should help vmf_insert_* users on x86, such as several DRM drivers.
> On my AMD Ryzen 5 7520C, when streaming data from cpu memory into amdgpu
> bo, the throughput goes from 5.1GB/s to 6.6GB/s. perf report says
>
> 34.69%--__do_fault
> 34.60%--amdgpu_gem_fault
> 34.00%--ttm_bo_vm_fault_reserved
> 32.95%--vmf_insert_pfn_prot
> 25.89%--track_pfn_insert
> 24.35%--lookup_memtype
> 21.77%--pat_pagerange_is_ram
> 20.80%--walk_system_ram_range
> 17.42%--find_next_iomem_res
>
> before this change, and
>
> 26.67%--__do_fault
> 26.57%--amdgpu_gem_fault
> 25.83%--ttm_bo_vm_fault_reserved
> 24.40%--vmf_insert_pfn_prot
> 14.30%--track_pfn_insert
> 12.20%--lookup_memtype
> 9.34%--pat_pagerange_is_ram
> 8.22%--walk_system_ram_range
> 5.09%--find_next_iomem_res
>
> after.

Is there any documentation that explicitly says that child resources
must not extend outside their parent's range? Do we have test cases?
(Either way, they need to be added / expanded.)

P.S. I'm not so sure about this change. It needs thorough testing,
esp. in the PCI case. Cc'ing Ilpo.

--
With Best Regards,
Andy Shevchenko
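On the containment question: as far as I can tell, the insertion path
(__request_resource()) rejects a resource that falls outside its
parent, so children are contained by construction. Here is a small
self-contained userspace model, with simplified names and not kernel
code, that checks the pruned walk against a full walk on such a tree:

	#include <assert.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct res {
		unsigned long start, end;
		struct res *parent, *child, *sibling;
	};

	/* Same shape as the kernel's next_resource(): children first
	 * unless skipped, then siblings, then back up the tree. */
	static struct res *next_res(struct res *p, bool skip_children)
	{
		if (!skip_children && p->child)
			return p->child;
		while (!p->sibling && p->parent)
			p = p->parent;
		return p->sibling;
	}

	static int count_overlaps(struct res *root, unsigned long start,
				  unsigned long end, bool prune)
	{
		struct res *p;
		bool skip = false;
		int hits = 0;

		for (p = root->child; p; p = next_res(p, skip)) {
			if (p->start > end)
				break;
			if (p->end < start) {
				skip = prune;	/* what the patch does */
				continue;
			}
			skip = false;
			hits++;
		}
		return hits;
	}

	int main(void)
	{
		/* Two top-level ranges; 'a' has a child contained in it. */
		struct res root = { .start = 0, .end = ~0UL };
		struct res a  = { .start = 0x1000, .end = 0x1fff, .parent = &root };
		struct res a0 = { .start = 0x1000, .end = 0x10ff, .parent = &a };
		struct res b  = { .start = 0x8000, .end = 0x8fff, .parent = &root };

		root.child = &a;
		a.child = &a0;
		a.sibling = &b;

		/* Pruning must not change the answer, only the work done. */
		assert(count_overlaps(&root, 0x8000, 0x80ff, false) ==
		       count_overlaps(&root, 0x8000, 0x80ff, true));
		assert(count_overlaps(&root, 0x1000, 0x10ff, false) ==
		       count_overlaps(&root, 0x1000, 0x10ff, true));
		printf("pruned walk matches full walk\n");
		return 0;
	}

A proper in-tree test would presumably live next to the existing
resource tests (kernel/resource_kunit.c) and exercise
find_next_iomem_res() through its public callers, since the function
itself is static.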