Annotate vital static information into kmemdump:
- node_data

Information on these variables is stored in a dedicated kmemdump section.

Register dynamic information into kmemdump:
- dynamic node data for each node

This information is allocated for each node as a physical address, so
call kmemdump_phys_alloc_size(), which allocates a unique kmemdump uid
and registers the virtual address.
Signed-off-by: Eugen Hristev <eugen.hristev@linaro.org>
---
mm/numa.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/numa.c b/mm/numa.c
index 7d5e06fe5bd4..88cada571171 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -4,9 +4,11 @@
 #include <linux/printk.h>
 #include <linux/numa.h>
 #include <linux/numa_memblks.h>
+#include <linux/kmemdump.h>
 
 struct pglist_data *node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(node_data);
+KMEMDUMP_VAR_CORE(node_data, MAX_NUMNODES * sizeof(struct pglist_data));
 
 /* Allocate NODE_DATA for a node on the local memory */
 void __init alloc_node_data(int nid)
@@ -16,7 +18,8 @@ void __init alloc_node_data(int nid)
 	int tnid;
 
 	/* Allocate node data. Try node-local memory and then any node. */
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	nd_pa = kmemdump_phys_alloc_size(nd_size, memblock_phys_alloc_try_nid,
+					 nd_size, SMP_CACHE_BYTES, nid);
 	if (!nd_pa)
 		panic("Cannot allocate %zu bytes for node %d data\n",
 		      nd_size, nid);
--
2.43.0
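[Editorial note: the review below contrasts the wrapper call above with registering *after* allocating. A minimal userspace sketch of that alternative, with memblock and the kmemdump registration both mocked (all helper names here are hypothetical, not the real kernel API):]

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t phys_addr_t;

/* Mock of memblock_phys_alloc_try_nid(): hands out fake physical
 * addresses from a static pool instead of real memblock memory. */
static char fake_pool[4096];
static size_t fake_off;

static phys_addr_t memblock_phys_alloc_try_nid(size_t size, size_t align, int nid)
{
	phys_addr_t pa = (phys_addr_t)(uintptr_t)&fake_pool[fake_off];

	(void)align;
	(void)nid;
	fake_off += size;
	return pa;
}

/* Mock registration: record the last (addr, size) pair handed to kmemdump. */
static phys_addr_t last_reg_pa;
static size_t last_reg_size;

static void kmemdump_register_phys(phys_addr_t pa, size_t size)
{
	last_reg_pa = pa;
	last_reg_size = size;
}

/* Register-after-allocating: the call site stays a plain allocation,
 * followed by an explicit registration -- no allocator wrapper needed. */
phys_addr_t alloc_node_data_sketch(size_t nd_size, int nid)
{
	phys_addr_t nd_pa = memblock_phys_alloc_try_nid(nd_size, 64, nid);

	if (nd_pa)
		kmemdump_register_phys(nd_pa, nd_size);
	return nd_pa;
}
```

The trade-off debated below is exactly this: the wrapper keeps allocation and registration atomic in one call, while the two-step form keeps the allocator's signature untouched.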
On 24.07.25 15:55, Eugen Hristev wrote:
> Annotate vital static information into kmemdump:
> - node_data
[...]
> 	/* Allocate node data. Try node-local memory and then any node. */
> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> +	nd_pa = kmemdump_phys_alloc_size(nd_size, memblock_phys_alloc_try_nid,
> +					 nd_size, SMP_CACHE_BYTES, nid);

Do we really want to wrap memblock allocations in such a way? :/

Gah, no, no no.

Can't we pass that as some magical flag, or just ... register *after*
allocating?

--
Cheers,

David / dhildenb
Hello,

On 7/30/25 16:52, David Hildenbrand wrote:
> On 24.07.25 15:55, Eugen Hristev wrote:
[...]
> Do we really want to wrap memblock allocations in such a way? :/
>
> Gah, no, no no.
>
> Can't we pass that as some magical flag, or just ... register *after*
> allocating?

Thanks for looking into my patch.

Yes, registering after is also an option. Initially this is how I
designed the kmemdump API. I also had in mind to add a flag, but, after
discussing with Thomas Gleixner, he came up with the macro wrapper idea
here:

https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/

Do you think we can continue that discussion, or maybe start it here?

Eugen
On 30.07.25 15:57, Eugen Hristev wrote:
[...]
> Yes, registering after is also an option. Initially this is how I
> designed the kmemdump API. I also had in mind to add a flag, but, after
> discussing with Thomas Gleixner, he came up with the macro wrapper idea
> here:
> https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/
> Do you think we can continue that discussion, or maybe start it here?

Yeah, I don't like that, but I can see how we ended up here.

I also don't quite like the idea that we must encode here what to
include in a dump and what not ...

For the vmcore we construct it at runtime in
crash_save_vmcoreinfo_init(), where we e.g., have

	VMCOREINFO_STRUCT_SIZE(pglist_data);

Could we similarly have some place where we construct what to dump,
just not using the current values, but the memory ranges?

Did you consider that?

--
Cheers,

David / dhildenb
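[Editorial note: the vmcoreinfo-style idea raised here — one central site that records memory ranges instead of annotations scattered through subsystems — could be sketched roughly as follows. This is a mock; `kmemdump_range`, `KMEMDUMP_RANGE`, and the table name are hypothetical, and kernel variables are replaced with stand-ins:]

```c
#include <stddef.h>

/* Stand-ins for the kernel variables kmemdump would cover. */
static struct { long nr_free; } node_data_mock[4];
static char log_buf_mock[256];

struct kmemdump_range {
	const char *name;
	void *start;
	size_t len;
};

/* Record a symbol's (start, length) pair -- analogous in spirit to
 * VMCOREINFO_STRUCT_SIZE(), but capturing a range, not a value. */
#define KMEMDUMP_RANGE(sym) { #sym, &sym, sizeof(sym) }

/* One central table built in kmemdump core, so the subsystems that own
 * these variables never need to include a kmemdump header. */
static const struct kmemdump_range kmemdump_ranges[] = {
	KMEMDUMP_RANGE(node_data_mock),
	KMEMDUMP_RANGE(log_buf_mock),
};

size_t kmemdump_total_len(void)
{
	size_t i, total = 0;

	for (i = 0; i < sizeof(kmemdump_ranges) / sizeof(kmemdump_ranges[0]); i++)
		total += kmemdump_ranges[i].len;
	return total;
}
```

The limitation Eugen raises later in the thread applies here too: a static table like this only works for symbols visible at the central site, not for private statics or dynamically allocated buffers.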
On Wed 30-07-25 16:04:28, David Hildenbrand wrote:
> On 30.07.25 15:57, Eugen Hristev wrote:
[...]
> I also don't quite like the idea that we must encode here what to
> include in a dump and what not ...
>
> For the vmcore we construct it at runtime in
> crash_save_vmcoreinfo_init(), where we e.g., have
>
> 	VMCOREINFO_STRUCT_SIZE(pglist_data);
>
> Could we similarly have some place where we construct what to dump,
> just not using the current values, but the memory ranges?

All those symbols are part of kallsyms, right? Can we just use kallsyms
infrastructure and a list of symbols to get what we need from there?

In other words the list of symbols to be completely external to the code
that is defining them?

--
Michal Hocko
SUSE Labs
On 8/4/25 13:54, Michal Hocko wrote:
> On Wed 30-07-25 16:04:28, David Hildenbrand wrote:
[...]
> All those symbols are part of kallsyms, right? Can we just use kallsyms
> infrastructure and a list of symbols to get what we need from there?
>
> In other words the list of symbols to be completely external to the code
> that is defining them?

Some static symbols are indeed part of kallsyms. But some symbols are
not exported, for example patch 20/29, where printk related symbols are
not to be exported. Another example is with static variables, like in
patch 17/29, not exported as symbols, but required for the dump.
Dynamic memory regions also have to be considered; have a look for
example at patch 23/29, where dynamically allocated memory needs to be
registered.

Do you think that I should move all kallsyms related symbols annotation
into a separate place and keep it in place for the static/dynamic
regions?

Thanks for looking into my patch,
Eugen
On 04.08.25 13:06, Eugen Hristev wrote:
> On 8/4/25 13:54, Michal Hocko wrote:
[...]
> Do you think that I should move all kallsyms related symbols annotation
> into a separate place and keep it in place for the static/dynamic
> regions?

If you want to use a symbol from kmemdump, then make that symbol
available to kmemdump.

IOW, if we were to rip out kmemdump tomorrow, we wouldn't have to touch
any non-kmemdump-specific files.

--
Cheers,

David / dhildenb
On 8/4/25 15:18, David Hildenbrand wrote:
> On 04.08.25 13:06, Eugen Hristev wrote:
[...]
> If you want to use a symbol from kmemdump, then make that symbol
> available to kmemdump.

That's what I am doing, registering symbols with kmemdump.
Maybe I do not understand what you mean; do you have any suggestion for
the static variables case (symbols not exported)?

> IOW, if we were to rip out kmemdump tomorrow, we wouldn't have to touch
> any non-kmemdump-specific files.
On 04.08.25 14:29, Eugen Hristev wrote:
> On 8/4/25 15:18, David Hildenbrand wrote:
[...]
> That's what I am doing, registering symbols with kmemdump.
> Maybe I do not understand what you mean; do you have any suggestion for
> the static variables case (symbols not exported)?

Let's use patch #20 as example:

What I am thinking is that you would not include "linux/kmemdump.h" and
not leak all of that KMEMDUMP_ stuff into all these files/subsystems
that couldn't care less about kmemdump.

Instead of doing

	static struct printk_ringbuffer printk_rb_dynamic;

You'd do

	struct printk_ringbuffer printk_rb_dynamic;

and have it in some header file, from where kmemdump could look up the
address.

So you move the logic of what goes into a dump from the subsystems to
the kmemdump core.

--
Cheers,

David / dhildenb
On 8/4/25 15:49, David Hildenbrand wrote:
> On 04.08.25 14:29, Eugen Hristev wrote:
[...]
> So you move the logic of what goes into a dump from the subsystems to
> the kmemdump core.

That works if the people maintaining these systems agree with it.
Attempts to export symbols from printk e.g. have been nacked:

https://lore.kernel.org/all/20250218-175733-neomutt-senozhatsky@chromium.org/

So I am unsure whether just removing the static and adding them into
header files would be more acceptable.

Added in CC Christoph Hellwig and Sergey Senozhatsky; maybe they could
tell us directly whether they like or dislike this approach, as kmemdump
would be builtin and would not require exports.

One other thing to mention is the fact that the printk code dynamically
allocates memory that would need to be registered. There is no mechanism
for kmemdump to know when this process has been completed (or even if it
happened at all, because it happens on demand in certain conditions).

Thanks!
On 04.08.25 15:03, Eugen Hristev wrote:
> On 8/4/25 15:49, David Hildenbrand wrote:
[...]
> That works if the people maintaining these systems agree with it.
> Attempts to export symbols from printk e.g. have been nacked:
>
> https://lore.kernel.org/all/20250218-175733-neomutt-senozhatsky@chromium.org/

Do you really need the EXPORT_SYMBOL?

Can't you just not export symbols, building the relevant kmemdump part
into the core, not as a module?

IIRC, kernel/vmcore_info.c is never built as a module, as it also
accesses non-exported symbols.

> So I am unsure whether just removing the static and adding them into
> header files would be more acceptable.
>
> Added in CC Christoph Hellwig and Sergey Senozhatsky; maybe they could
> tell us directly whether they like or dislike this approach, as kmemdump
> would be builtin and would not require exports.
>
> One other thing to mention is the fact that the printk code dynamically
> allocates memory that would need to be registered. There is no mechanism
> for kmemdump to know when this process has been completed (or even if it
> happened at all, because it happens on demand in certain conditions).

If we are talking about memblock allocations, they sure are finished at
the time ... the buddy is up.

So it's just a matter of placing yourself late in the init stage where
the buddy is already up and running.

I assume dumping any dynamically allocated stuff through the buddy is
out of the picture for now.

--
Cheers,

David / dhildenb
On 8/4/25 16:26, David Hildenbrand wrote: > On 04.08.25 15:03, Eugen Hristev wrote: >> >> >> On 8/4/25 15:49, David Hildenbrand wrote: >>> On 04.08.25 14:29, Eugen Hristev wrote: >>>> >>>> >>>> On 8/4/25 15:18, David Hildenbrand wrote: >>>>> On 04.08.25 13:06, Eugen Hristev wrote: >>>>>> >>>>>> >>>>>> On 8/4/25 13:54, Michal Hocko wrote: >>>>>>> On Wed 30-07-25 16:04:28, David Hildenbrand wrote: >>>>>>>> On 30.07.25 15:57, Eugen Hristev wrote: >>>>>>> [...] >>>>>>>>> Yes, registering after is also an option. Initially this is how I >>>>>>>>> designed the kmemdump API, I also had in mind to add a flag, but, after >>>>>>>>> discussing with Thomas Gleixner, he came up with the macro wrapper idea >>>>>>>>> here: >>>>>>>>> https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/ >>>>>>>>> Do you think we can continue that discussion , or maybe start it here ? >>>>>>>> >>>>>>>> Yeah, I don't like that, but I can see how we ended up here. >>>>>>>> >>>>>>>> I also don't quite like the idea that we must encode here what to include in >>>>>>>> a dump and what not ... >>>>>>>> >>>>>>>> For the vmcore we construct it at runtime in crash_save_vmcoreinfo_init(), >>>>>>>> where we e.g., have >>>>>>>> >>>>>>>> VMCOREINFO_STRUCT_SIZE(pglist_data); >>>>>>>> >>>>>>>> Could we similar have some place where we construct what to dump similarly, >>>>>>>> just not using the current values, but the memory ranges? >>>>>>> >>>>>>> All those symbols are part of kallsyms, right? Can we just use kallsyms >>>>>>> infrastructure and a list of symbols to get what we need from there? >>>>>>> >>>>>>> In other words the list of symbols to be completely external to the code >>>>>>> that is defining them? >>>>>> >>>>>> Some static symbols are indeed part of kallsyms. But some symbols are >>>>>> not exported, for example patch 20/29, where printk related symbols are >>>>>> not to be exported. 
Another example is with static variables, like in >>>>>> patch 17/29 , not exported as symbols, but required for the dump. >>>>>> Dynamic memory regions are not have to also be considered, have a look >>>>>> for example at patch 23/29 , where dynamically allocated memory needs to >>>>>> be registered. >>>>>> >>>>>> Do you think that I should move all kallsyms related symbols annotation >>>>>> into a separate place and keep it for the static/dynamic regions in place ? >>>>> >>>>> If you want to use a symbol from kmemdump, then make that symbol >>>>> available to kmemdump. >>>> >>>> That's what I am doing, registering symbols with kmemdump. >>>> Maybe I do not understand what you mean, do you have any suggestion for >>>> the static variables case (symbols not exported) ? >>> >>> Let's use patch #20 as example: >>> >>> What I am thinking is that you would not include "linux/kmemdump.h" and >>> not leak all of that KMEMDUMP_ stuff in all these files/subsystems that >>> couldn't less about kmemdump. >>> >>> Instead of doing >>> >>> static struct printk_ringbuffer printk_rb_dynamic; >>> >>> You'd do >>> >>> struct printk_ringbuffer printk_rb_dynamic; >>> >>> and have it in some header file, from where kmemdump could lookup the >>> address. >>> >>> So you move the logic of what goes into a dump from the subsystems to >>> the kmemdump core. >>> >> >> That works if the people maintaining these systems agree with it. >> Attempts to export symbols from printk e.g. have been nacked : >> >> https://lore.kernel.org/all/20250218-175733-neomutt-senozhatsky@chromium.org/ > > Do you really need the EXPORT_SYMBOL? > > Can't you just not export symbols, building the relevant kmemdump part > into the core not as a module. > > IIRC, kernel/vmcore_info.c is never built as a module, as it also > accesses non-exported symbols. Hello David, I am looking again into this, and there are some things which in my opinion would be difficult to achieve. 
For example I looked into my patch #11 , which adds the `runqueues` into kmemdump. The runqueues is a variable of `struct rq` which is defined in kernel/sched/sched.h , which is not supposed to be included outside of sched. Now moving all the struct definition outside of sched.h into another public header would be rather painful and I don't think it's a really good option (The struct would be needed to compute the sizeof inside vmcoreinfo). Secondly, it would also imply moving all the nested struct definitions outside as well. I doubt this is something that we want for the sched subsys. How the subsys is designed, out of my understanding, is to keep these internal structs opaque outside of it. From my perspective it's much simpler and cleaner to just add the kmemdump annotation macro inside the sched/core.c as it's done in my patch. This macro translates to a noop if kmemdump is not selected. How do you see this done another way ? > >> >> So I am unsure whether just removing the static and adding them into >> header files would be more acceptable. >> >> Added in CC Cristoph Hellwig and Sergey Senozhatsky maybe they could >> tell us directly whether they like or dislike this approach, as kmemdump >> would be builtin and would not require exports. >> >> One other thing to mention is the fact that the printk code dynamically >> allocates memory that would need to be registered. There is no mechanism >> for kmemdump to know when this process has been completed (or even if it >> was at all, because it happens on demand in certain conditions). > > If we are talking about memblock allocations, they sure are finished at > the time ... the buddy is up. > > So it's just a matter of placing yourself late in the init stage where > the buddy is already up and running. > > I assume dumping any dynamically allocated stuff through the buddy is > out of the picture for now. > The dumping mechanism needs to work for dynamically allocated stuff, and right now, it works for e.g. 
printk, if the buffer is dynamically allocated later on in the boot
process.

To have this working outside of printk, it would be required to walk
through all the printk structs/allocations and select the required info.
Is this something that we want to do outside of printk ?

E.g. for the printk panic-dump case, the whole dumping is done by
registering a dumper that does the job inside printk. There is no
mechanism walking through printk data in another subsystem (in my
example, pstore). So for me it is logical to register the data inside
printk. Does this make sense ?
>>
>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>> accesses non-exported symbols.
>
> Hello David,
>
> I am looking again into this, and there are some things which in my
> opinion would be difficult to achieve.
> For example I looked into my patch #11 , which adds the `runqueues` into
> kmemdump.
>
> The runqueues is a variable of `struct rq` which is defined in
> kernel/sched/sched.h , which is not supposed to be included outside of
> sched.
> Now moving all the struct definition outside of sched.h into another
> public header would be rather painful and I don't think it's a really
> good option (The struct would be needed to compute the sizeof inside
> vmcoreinfo). Secondly, it would also imply moving all the nested struct
> definitions outside as well. I doubt this is something that we want for
> the sched subsys. How the subsys is designed, out of my understanding,
> is to keep these internal structs opaque outside of it.

All the kmemdump module needs is a start and a length, correct? So the
only tricky part is getting the length.

One could just add a const variable that holds this information, or even
better, a simple helper function to calculate that.

Maybe someone else reading along has a better idea.

Interestingly, runqueues is a percpu variable, which makes me wonder if
what you had would work as intended (maybe it does, not sure).

>
> From my perspective it's much simpler and cleaner to just add the
> kmemdump annotation macro inside the sched/core.c as it's done in my
> patch. This macro translates to a noop if kmemdump is not selected.

I really don't like how we are spreading kmemdump all over the kernel,
and adding complexity with __section when really, all we need is a place
to obtain a start and a length.

So we should explore if there is anything easier possible.

>>
>>>
>>> So I am unsure whether just removing the static and adding them into
>>> header files would be more acceptable.
>>>
>>> Added in CC Cristoph Hellwig and Sergey Senozhatsky maybe they could
>>> tell us directly whether they like or dislike this approach, as kmemdump
>>> would be builtin and would not require exports.
>>>
>>> One other thing to mention is the fact that the printk code dynamically
>>> allocates memory that would need to be registered. There is no mechanism
>>> for kmemdump to know when this process has been completed (or even if it
>>> was at all, because it happens on demand in certain conditions).
>>
>> If we are talking about memblock allocations, they sure are finished at
>> the time ... the buddy is up.
>>
>> So it's just a matter of placing yourself late in the init stage where
>> the buddy is already up and running.
>>
>> I assume dumping any dynamically allocated stuff through the buddy is
>> out of the picture for now.
>>
>
> The dumping mechanism needs to work for dynamically allocated stuff, and
> right now, it works for e.g. printk, if the buffer is dynamically
> allocated later on in the boot process.

You are talking about the memblock_alloc() result, correct? Like

new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);

The current version is always stored in

static char *log_buf = __log_buf;


Once early boot is done and memblock gets torn down, you can just use
log_buf and be sure that it will not change anymore.

>
> To have this working outside of printk, it would be required to walk
> through all the printk structs/allocations and select the required info.
> Is this something that we want to do outside of printk ?

I don't follow, please elaborate.

How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
given that you run your initialization after setup_log_buf() ?

--
Cheers

David / dhildenb
On 8/25/25 16:20, David Hildenbrand wrote:
>
>>>
>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>> accesses non-exported symbols.
>>
>> Hello David,
>>
>> I am looking again into this, and there are some things which in my
>> opinion would be difficult to achieve.
>> For example I looked into my patch #11 , which adds the `runqueues` into
>> kmemdump.
>>
>> The runqueues is a variable of `struct rq` which is defined in
>> kernel/sched/sched.h , which is not supposed to be included outside of
>> sched.
>> Now moving all the struct definition outside of sched.h into another
>> public header would be rather painful and I don't think it's a really
>> good option (The struct would be needed to compute the sizeof inside
>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>> definitions outside as well. I doubt this is something that we want for
>> the sched subsys. How the subsys is designed, out of my understanding,
>> is to keep these internal structs opaque outside of it.
>
> All the kmemdump module needs is a start and a length, correct? So the
> only tricky part is getting the length.

I also have in mind the kernel user case. How would a kernel programmer
want to add some kernel structs/info/buffers into kmemdump such that the
dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
enough.
Otherwise maybe the programmer has to write helpers to compute lengths
etc, and stitch them into kmemdump core.
I am not saying it's impossible, but just tiresome perhaps.

>
> One could just add a const variable that holds this information, or even
> better, a simple helper function to calculate that.
>
> Maybe someone else reading along has a better idea.

This could work, but it requires again adding some code into the
specific subsystem. E.g. struct_rq_get_size()
I am open to ideas , and thank you very much for your thoughts.
>
> Interestingly, runqueues is a percpu variable, which makes me wonder if
> what you had would work as intended (maybe it does, not sure).

I would not really need to dump the runqueues. But the crash tool which
I am using for testing, requires it. Without the runqueues it will not
progress further to load the kernel dump.
So I am not really sure what it does with the runqueues, but it works.
Perhaps using crash/gdb more, to actually do something with this data,
would give more insight about its utility.
For me, it is a prerequisite to run crash, and then to be able to
extract the log buffer from the dump.

>
>>
>> From my perspective it's much simpler and cleaner to just add the
>> kmemdump annotation macro inside the sched/core.c as it's done in my
>> patch. This macro translates to a noop if kmemdump is not selected.
>
> I really don't like how we are spreading kmemdump all over the kernel,
> and adding complexity with __section when really, all we need is a place
> to obtain a start and a length.
>

I understand. The section idea was suggested by Thomas. Initially I was
skeptic, but I like how it turned out.

> So we should explore if there is anything easier possible.
>
>>>
>>>>
>>>> So I am unsure whether just removing the static and adding them into
>>>> header files would be more acceptable.
>>>
>>> So it's just a matter of placing yourself late in the init stage where
>>> the buddy is already up and running.
>>>
>>> I assume dumping any dynamically allocated stuff through the buddy is
>>> out of the picture for now.
>>>
>>
>> The dumping mechanism needs to work for dynamically allocated stuff, and
>> right now, it works for e.g. printk, if the buffer is dynamically
>> allocated later on in the boot process.
>
> You are talking about the memblock_alloc() result, correct? Like
>
> new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
>
> The current version is always stored in
>
> static char *log_buf = __log_buf;
>
>
> Once early boot is done and memblock gets torn down, you can just use
> log_buf and be sure that it will not change anymore.
>
>>
>> To have this working outside of printk, it would be required to walk
>> through all the printk structs/allocations and select the required info.
>> Is this something that we want to do outside of printk ?
>
> I don't follow, please elaborate.
>
> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
> given that you run your initialization after setup_log_buf() ?
>
>

My initial thought was the same. However I got some feedback from Petr
Mladek here :

https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/

Where he explained how to register the structs correctly.
It can be that setup_log_buf is called again at a later time perhaps.
On 25.08.25 15:36, Eugen Hristev wrote:
>
>
> On 8/25/25 16:20, David Hildenbrand wrote:
>>
>>>>
>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>> accesses non-exported symbols.
>>>
>>> Hello David,
>>>
>>> I am looking again into this, and there are some things which in my
>>> opinion would be difficult to achieve.
>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>> kmemdump.
>>>
>>> The runqueues is a variable of `struct rq` which is defined in
>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>> sched.
>>> Now moving all the struct definition outside of sched.h into another
>>> public header would be rather painful and I don't think it's a really
>>> good option (The struct would be needed to compute the sizeof inside
>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>> definitions outside as well. I doubt this is something that we want for
>>> the sched subsys. How the subsys is designed, out of my understanding,
>>> is to keep these internal structs opaque outside of it.
>>
>> All the kmemdump module needs is a start and a length, correct? So the
>> only tricky part is getting the length.
>
> I also have in mind the kernel user case. How would a kernel programmer
> want to add some kernel structs/info/buffers into kmemdump such that the
> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
> enough.

The other way around, why should anybody have a saying in adding their
data to kmemdump? Why do we have that all over the kernel?

Is your mechanism really so special?

A single composer should take care of that, and it's really just start +
len of physical memory areas.

> Otherwise maybe the programmer has to write helpers to compute lengths
> etc, and stitch them into kmemdump core.
> I am not saying it's impossible, but just tiresome perhaps.

In your patch set, how many of these instances did you encounter where
that was a problem?
>>
>> One could just add a const variable that holds this information, or even
>> better, a simple helper function to calculate that.
>>
>> Maybe someone else reading along has a better idea.
>
> This could work, but it requires again adding some code into the
> specific subsystem. E.g. struct_rq_get_size()
> I am open to ideas , and thank you very much for your thoughts.
>
>>
>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>> what you had would work as intended (maybe it does, not sure).
>
> I would not really need to dump the runqueues. But the crash tool which
> I am using for testing, requires it. Without the runqueues it will not
> progress further to load the kernel dump.
> So I am not really sure what it does with the runqueues, but it works.
> Perhaps using crash/gdb more, to actually do something with this data,
> would give more insight about its utility.
> For me, it is a prerequisite to run crash, and then to be able to
> extract the log buffer from the dump.

I have the faint recollection that percpu vars might not be stored in a
single contiguous physical memory area, but maybe my memory is just
wrong, that's why I was raising it.

>
>>
>>>
>>> From my perspective it's much simpler and cleaner to just add the
>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>> patch. This macro translates to a noop if kmemdump is not selected.
>>
>> I really don't like how we are spreading kmemdump all over the kernel,
>> and adding complexity with __section when really, all we need is a place
>> to obtain a start and a length.
>>
>
> I understand. The section idea was suggested by Thomas. Initially I was
> skeptic, but I like how it turned out.

Yeah, I don't like it. Taste differs ;)

I am in particular unhappy about custom memblock wrappers.

[...]

>>>
>>> To have this working outside of printk, it would be required to walk
>>> through all the printk structs/allocations and select the required info.
>>> Is this something that we want to do outside of printk ?
>>
>> I don't follow, please elaborate.
>>
>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>> given that you run your initialization after setup_log_buf() ?
>>
>>
>
> My initial thought was the same. However I got some feedback from Petr
> Mladek here :
>
> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>
> Where he explained how to register the structs correctly.
> It can be that setup_log_buf is called again at a later time perhaps.
>

setup_log_buf() is a __init function, so there is only a certain time
frame where it can be called.

In particular, once the buddy is up, memblock allocations are impossible
and it would be deeply flawed to call this function again.

Let's not over-engineer this.

Peter is on CC, so hopefully he can share his thoughts.

--
Cheers

David / dhildenb
On 8/25/25 16:58, David Hildenbrand wrote:
> On 25.08.25 15:36, Eugen Hristev wrote:
>>
>>
>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>
>>>>>
>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>> accesses non-exported symbols.
>>>>
>>>> Hello David,
>>>>
>>>> I am looking again into this, and there are some things which in my
>>>> opinion would be difficult to achieve.
>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>> kmemdump.
>>>>
>>>> The runqueues is a variable of `struct rq` which is defined in
>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>> sched.
>>>> Now moving all the struct definition outside of sched.h into another
>>>> public header would be rather painful and I don't think it's a really
>>>> good option (The struct would be needed to compute the sizeof inside
>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>> definitions outside as well. I doubt this is something that we want for
>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>> is to keep these internal structs opaque outside of it.
>>>
>>> All the kmemdump module needs is a start and a length, correct? So the
>>> only tricky part is getting the length.
>>
>> I also have in mind the kernel user case. How would a kernel programmer
>> want to add some kernel structs/info/buffers into kmemdump such that the
>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>> enough.
>
> The other way around, why should anybody have a saying in adding their
> data to kmemdump? Why do we have that all over the kernel?
>
> Is your mechanism really so special?
>
> A single composer should take care of that, and it's really just start +
> len of physical memory areas.
>
>> Otherwise maybe the programmer has to write helpers to compute lengths
>> etc, and stitch them into kmemdump core.
>> I am not saying it's impossible, but just tiresome perhaps.
>
> In your patch set, how many of these instances did you encounter where
> that was a problem?
>
>>>
>>> One could just add a const variable that holds this information, or even
>>> better, a simple helper function to calculate that.
>>>
>>> Maybe someone else reading along has a better idea.
>>
>> This could work, but it requires again adding some code into the
>> specific subsystem. E.g. struct_rq_get_size()
>> I am open to ideas , and thank you very much for your thoughts.
>>
>>>
>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>> what you had would work as intended (maybe it does, not sure).
>>
>> I would not really need to dump the runqueues. But the crash tool which
>> I am using for testing, requires it. Without the runqueues it will not
>> progress further to load the kernel dump.
>> So I am not really sure what it does with the runqueues, but it works.
>> Perhaps using crash/gdb more, to actually do something with this data,
>> would give more insight about its utility.
>> For me, it is a prerequisite to run crash, and then to be able to
>> extract the log buffer from the dump.
>
> I have the faint recollection that percpu vars might not be stored in a
> single contiguous physical memory area, but maybe my memory is just
> wrong, that's why I was raising it.
>
>>
>>>
>>>>
>>>> From my perspective it's much simpler and cleaner to just add the
>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>
>>> I really don't like how we are spreading kmemdump all over the kernel,
>>> and adding complexity with __section when really, all we need is a place
>>> to obtain a start and a length.
>>>
>>
>> I understand. The section idea was suggested by Thomas. Initially I was
>> skeptic, but I like how it turned out.
>
> Yeah, I don't like it. Taste differs ;)
>
> I am in particular unhappy about custom memblock wrappers.
>
> [...]
>
>>>>
>>>> To have this working outside of printk, it would be required to walk
>>>> through all the printk structs/allocations and select the required info.
>>>> Is this something that we want to do outside of printk ?
>>>
>>> I don't follow, please elaborate.
>>>
>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>> given that you run your initialization after setup_log_buf() ?
>>>
>>>
>>
>> My initial thought was the same. However I got some feedback from Petr
>> Mladek here :
>>
>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>
>> Where he explained how to register the structs correctly.
>> It can be that setup_log_buf is called again at a later time perhaps.
>>
>
> setup_log_buf() is a __init function, so there is only a certain time
> frame where it can be called.
>
> In particular, once the buddy is up, memblock allocations are impossible
> and it would be deeply flawed to call this function again.
>
> Let's not over-engineer this.
>
> Peter is on CC, so hopefully he can share his thoughts.
>
Hello David,
I tested out this snippet (on top of my series, so you can see what I
changed):
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18ba6c1e174f..7ac4248a00e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -67,7 +67,6 @@
#include <linux/wait_api.h>
#include <linux/workqueue_api.h>
#include <linux/livepatch_sched.h>
-#include <linux/kmemdump.h>
#ifdef CONFIG_PREEMPT_DYNAMIC
# ifdef CONFIG_GENERIC_IRQ_ENTRY
@@ -120,7 +119,12 @@
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
+
+size_t runqueues_get_size(void);
+size_t runqueues_get_size(void)
+{
+ return sizeof(runqueues);
+}
#ifdef CONFIG_SCHED_PROXY_EXEC
DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index d808c5e67f35..c6dd2d6e96dd 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -24,6 +24,12 @@
#include "kallsyms_internal.h"
#include "kexec_internal.h"
+typedef void* kmemdump_opaque_t;
+
+size_t runqueues_get_size(void);
+
+extern kmemdump_opaque_t runqueues;
+
/* vmcoreinfo stuff */
unsigned char *vmcoreinfo_data;
size_t vmcoreinfo_size;
@@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
(void *)vmcoreinfo_data, vmcoreinfo_size);
+ kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
+ (void *)&runqueues, runqueues_get_size());
+
return 0;
}
With this, no more .section, no kmemdump code into sched, however, there
are few things :
First the size function, which is quite dull and doesn't fit into the
sched very much.
Second, having the extern with a different "opaque" type to avoid
exposing the struct rq definition, which is quite hackish.
What do you think ?
My opinion is that it's ugly, but maybe you have some better idea how to
write this nicer ?
( I am also not 100 % sure if I did this the way you wanted).
Thanks for helping out,
Eugen
On 27.08.25 13:59, Eugen Hristev wrote:
>
>
> On 8/25/25 16:58, David Hildenbrand wrote:
>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>
>>>
>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>
>>>>>>
>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>> accesses non-exported symbols.
>>>>>
>>>>> Hello David,
>>>>>
>>>>> I am looking again into this, and there are some things which in my
>>>>> opinion would be difficult to achieve.
>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>> kmemdump.
>>>>>
>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>> sched.
>>>>> Now moving all the struct definition outside of sched.h into another
>>>>> public header would be rather painful and I don't think it's a really
>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>> definitions outside as well. I doubt this is something that we want for
>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>> is to keep these internal structs opaque outside of it.
>>>>
>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>> only tricky part is getting the length.
>>>
>>> I also have in mind the kernel user case. How would a kernel programmer
>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>> enough.
>>
>> The other way around, why should anybody have a saying in adding their
>> data to kmemdump? Why do we have that all over the kernel?
>>
>> Is your mechanism really so special?
>>
>> A single composer should take care of that, and it's really just start +
>> len of physical memory areas.
>>
>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>> etc, and stitch them into kmemdump core.
>>> I am not saying it's impossible, but just tiresome perhaps.
>>
>> In your patch set, how many of these instances did you encounter where
>> that was a problem?
>>
>>>>
>>>> One could just add a const variable that holds this information, or even
>>>> better, a simple helper function to calculate that.
>>>>
>>>> Maybe someone else reading along has a better idea.
>>>
>>> This could work, but it requires again adding some code into the
>>> specific subsystem. E.g. struct_rq_get_size()
>>> I am open to ideas , and thank you very much for your thoughts.
>>>
>>>>
>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>> what you had would work as intended (maybe it does, not sure).
>>>
>>> I would not really need to dump the runqueues. But the crash tool which
>>> I am using for testing, requires it. Without the runqueues it will not
>>> progress further to load the kernel dump.
>>> So I am not really sure what it does with the runqueues, but it works.
>>> Perhaps using crash/gdb more, to actually do something with this data,
>>> would give more insight about its utility.
>>> For me, it is a prerequisite to run crash, and then to be able to
>>> extract the log buffer from the dump.
>>
>> I have the faint recollection that percpu vars might not be stored in a
>> single contiguous physical memory area, but maybe my memory is just
>> wrong, that's why I was raising it.
>>
>>>
>>>>
>>>>>
>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>
>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>> and adding complexity with __section when really, all we need is a place
>>>> to obtain a start and a length.
>>>>
>>>
>>> I understand. The section idea was suggested by Thomas. Initially I was
>>> skeptic, but I like how it turned out.
>>
>> Yeah, I don't like it. Taste differs ;)
>>
>> I am in particular unhappy about custom memblock wrappers.
>>
>> [...]
>>
>>>>>
>>>>> To have this working outside of printk, it would be required to walk
>>>>> through all the printk structs/allocations and select the required info.
>>>>> Is this something that we want to do outside of printk ?
>>>>
>>>> I don't follow, please elaborate.
>>>>
>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>> given that you run your initialization after setup_log_buf() ?
>>>>
>>>>
>>>
>>> My initial thought was the same. However I got some feedback from Petr
>>> Mladek here :
>>>
>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>
>>> Where he explained how to register the structs correctly.
>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>
>>
>> setup_log_buf() is a __init function, so there is only a certain time
>> frame where it can be called.
>>
>> In particular, once the buddy is up, memblock allocations are impossible
>> and it would be deeply flawed to call this function again.
>>
>> Let's not over-engineer this.
>>
>> Peter is on CC, so hopefully he can share his thoughts.
>>
>
> Hello David,
>
> I tested out this snippet (on top of my series, so you can see what I
> changed):
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 18ba6c1e174f..7ac4248a00e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -67,7 +67,6 @@
> #include <linux/wait_api.h>
> #include <linux/workqueue_api.h>
> #include <linux/livepatch_sched.h>
> -#include <linux/kmemdump.h>
>
> #ifdef CONFIG_PREEMPT_DYNAMIC
> # ifdef CONFIG_GENERIC_IRQ_ENTRY
> @@ -120,7 +119,12 @@
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
> +
> +size_t runqueues_get_size(void);
> +size_t runqueues_get_size(void)
> +{
> + return sizeof(runqueues);
> +}
>
> #ifdef CONFIG_SCHED_PROXY_EXEC
> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index d808c5e67f35..c6dd2d6e96dd 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -24,6 +24,12 @@
> #include "kallsyms_internal.h"
> #include "kexec_internal.h"
>
> +typedef void* kmemdump_opaque_t;
> +
> +size_t runqueues_get_size(void);
> +
> +extern kmemdump_opaque_t runqueues;
I would have tried that through:
struct rq;
extern struct rq runqueues;
But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
not the way we would want to handle that.
> /* vmcoreinfo stuff */
> unsigned char *vmcoreinfo_data;
> size_t vmcoreinfo_size;
> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>
> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
> (void *)vmcoreinfo_data, vmcoreinfo_size);
> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
> + (void *)&runqueues, runqueues_get_size());
> +
> return 0;
> }
>
> With this, no more .section, no kmemdump code into sched, however, there
> are few things :
I would really just do here something like the following:
/**
* sched_get_runqueues_area - obtain the runqueues area for dumping
* @start: ...
* @size: ...
*
* The obtained area is only to be used for dumping purposes.
*/
void sched_get_runqueues_area(void **start, size_t *size)
{
	*start = &runqueues;
	*size = sizeof(runqueues);
}
might be cleaner.
Having said that, if you realize that there is a fundamental issue with
what I propose, please speak up.
So far, I feel like there are only limited number of "suboptimal" cases
of this kind, but I might be wrong of course.
--
Cheers
David / dhildenb
On 8/27/25 15:18, David Hildenbrand wrote:
> On 27.08.25 13:59, Eugen Hristev wrote:
>>
>>
>> On 8/25/25 16:58, David Hildenbrand wrote:
>>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>>
>>>>
>>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>>
>>>>>>>
>>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>>> accesses non-exported symbols.
>>>>>>
>>>>>> Hello David,
>>>>>>
>>>>>> I am looking again into this, and there are some things which in my
>>>>>> opinion would be difficult to achieve.
>>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>>> kmemdump.
>>>>>>
>>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>>> sched.
>>>>>> Now moving all the struct definition outside of sched.h into another
>>>>>> public header would be rather painful and I don't think it's a really
>>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>>> definitions outside as well. I doubt this is something that we want for
>>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>>> is to keep these internal structs opaque outside of it.
>>>>>
>>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>>> only tricky part is getting the length.
>>>>
>>>> I also have in mind the kernel user case. How would a kernel programmer
>>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>>> enough.
>>>
>>> The other way around, why should anybody have a saying in adding their
>>> data to kmemdump? Why do we have that all over the kernel?
>>>
>>> Is your mechanism really so special?
>>>
>>> A single composer should take care of that, and it's really just start +
>>> len of physical memory areas.
>>>
>>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>>> etc, and stitch them into kmemdump core.
>>>> I am not saying it's impossible, but just tiresome perhaps.
>>>
>>> In your patch set, how many of these instances did you encounter where
>>> that was a problem?
>>>
>>>>>
>>>>> One could just add a const variable that holds this information, or even
>>>>> better, a simple helper function to calculate that.
>>>>>
>>>>> Maybe someone else reading along has a better idea.
>>>>
>>>> This could work, but it requires again adding some code into the
>>>> specific subsystem. E.g. struct_rq_get_size()
>>>> I am open to ideas , and thank you very much for your thoughts.
>>>>
>>>>>
>>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>>> what you had would work as intended (maybe it does, not sure).
>>>>
>>>> I would not really need to dump the runqueues. But the crash tool which
>>>> I am using for testing, requires it. Without the runqueues it will not
>>>> progress further to load the kernel dump.
>>>> So I am not really sure what it does with the runqueues, but it works.
>>>> Perhaps using crash/gdb more, to actually do something with this data,
>>>> would give more insight about its utility.
>>>> For me, it is a prerequisite to run crash, and then to be able to
>>>> extract the log buffer from the dump.
>>>
>>> I have the faint recollection that percpu vars might not be stored in a
>>> single contiguous physical memory area, but maybe my memory is just
>>> wrong, that's why I was raising it.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>>
>>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>>> and adding complexity with __section when really, all we need is a place
>>>>> to obtain a start and a length.
>>>>>
>>>>
>>>> I understand. The section idea was suggested by Thomas. Initially I was
>>>> skeptic, but I like how it turned out.
>>>
>>> Yeah, I don't like it. Taste differs ;)
>>>
>>> I am in particular unhappy about custom memblock wrappers.
>>>
>>> [...]
>>>
>>>>>>
>>>>>> To have this working outside of printk, it would be required to walk
>>>>>> through all the printk structs/allocations and select the required info.
>>>>>> Is this something that we want to do outside of printk ?
>>>>>
>>>>> I don't follow, please elaborate.
>>>>>
>>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>>> given that you run your initialization after setup_log_buf() ?
>>>>>
>>>>>
>>>>
>>>> My initial thought was the same. However I got some feedback from Petr
>>>> Mladek here :
>>>>
>>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>>
>>>> Where he explained how to register the structs correctly.
>>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>>
>>>
>>> setup_log_buf() is a __init function, so there is only a certain time
>>> frame where it can be called.
>>>
>>> In particular, once the buddy is up, memblock allocations are impossible
>>> and it would be deeply flawed to call this function again.
>>>
>>> Let's not over-engineer this.
>>>
>>> Peter is on CC, so hopefully he can share his thoughts.
>>>
>>
>> Hello David,
>>
>> I tested out this snippet (on top of my series, so you can see what I
>> changed):
>>
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 18ba6c1e174f..7ac4248a00e5 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -67,7 +67,6 @@
>> #include <linux/wait_api.h>
>> #include <linux/workqueue_api.h>
>> #include <linux/livepatch_sched.h>
>> -#include <linux/kmemdump.h>
>>
>> #ifdef CONFIG_PREEMPT_DYNAMIC
>> # ifdef CONFIG_GENERIC_IRQ_ENTRY
>> @@ -120,7 +119,12 @@
>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>>
>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
>> +
>> +size_t runqueues_get_size(void);
>> +size_t runqueues_get_size(void)
>> +{
>> + return sizeof(runqueues);
>> +}
>>
>> #ifdef CONFIG_SCHED_PROXY_EXEC
>> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
>> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
>> index d808c5e67f35..c6dd2d6e96dd 100644
>> --- a/kernel/vmcore_info.c
>> +++ b/kernel/vmcore_info.c
>> @@ -24,6 +24,12 @@
>> #include "kallsyms_internal.h"
>> #include "kexec_internal.h"
>>
>> +typedef void* kmemdump_opaque_t;
>> +
>> +size_t runqueues_get_size(void);
>> +
>> +extern kmemdump_opaque_t runqueues;
>
> I would have tried that through:
>
> struct rq;
> extern struct rq runqueues;
>
> But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
> not the way we would want to handle that.
>
>> /* vmcoreinfo stuff */
>> unsigned char *vmcoreinfo_data;
>> size_t vmcoreinfo_size;
>> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>>
>> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
>> (void *)vmcoreinfo_data, vmcoreinfo_size);
>> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
>> + (void *)&runqueues, runqueues_get_size());
>> +
>> return 0;
>> }
>>
>> With this, no more .section, no kmemdump code into sched, however, there
>> are few things :
>
> I would really just do here something like the following:
>
> /**
> * sched_get_runqueues_area - obtain the runqueues area for dumping
> * @start: ...
> * @size: ...
> *
> * The obtained area is only to be used for dumping purposes.
> */
> void sched_get_runqueues_area(void *start, size_t size)
> {
> start = &runqueues;
> size = sizeof(runqueues);
> }
>
> might be cleaner.
>
How about this in the header:
#define DECLARE_DUMP_AREA_FUNC(subsys, symbol) \
void subsys ## _get_ ## symbol ##_area(void **start, size_t *size);
#define DEFINE_DUMP_AREA_FUNC(subsys, symbol) \
void subsys ## _get_ ## symbol ##_area(void **start, size_t *size)\
{\
*start = &symbol;\
*size = sizeof(symbol);\
}
then, in sched just
DECLARE_DUMP_AREA_FUNC(sched, runqueues);
DEFINE_DUMP_AREA_FUNC(sched, runqueues);
or a single macro that wraps both.
would make it shorter and neater.
What do you think ?
>
> Having said that, if you realize that there is a fundamental issue with
> what I propose, please speak up.
>
> So far, I feel like there are only limited number of "suboptimal" cases
> of this kind, but I might be wrong of course.
>
On 27.08.25 16:08, Eugen Hristev wrote:
>
>
> On 8/27/25 15:18, David Hildenbrand wrote:
>> On 27.08.25 13:59, Eugen Hristev wrote:
>>>
>>>
>>> On 8/25/25 16:58, David Hildenbrand wrote:
>>>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>>>
>>>>>
>>>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>>>
>>>>>>>>
>>>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>>>> accesses non-exported symbols.
>>>>>>>
>>>>>>> Hello David,
>>>>>>>
>>>>>>> I am looking again into this, and there are some things which in my
>>>>>>> opinion would be difficult to achieve.
>>>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>>>> kmemdump.
>>>>>>>
>>>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>>>> sched.
>>>>>>> Now moving all the struct definition outside of sched.h into another
>>>>>>> public header would be rather painful and I don't think it's a really
>>>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>>>> definitions outside as well. I doubt this is something that we want for
>>>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>>>> is to keep these internal structs opaque outside of it.
>>>>>>
>>>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>>>> only tricky part is getting the length.
>>>>>
>>>>> I also have in mind the kernel user case. How would a kernel programmer
>>>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>>>> enough.
>>>>
>>>> The other way around, why should anybody have a saying in adding their
>>>> data to kmemdump? Why do we have that all over the kernel?
>>>>
>>>> Is your mechanism really so special?
>>>>
>>>> A single composer should take care of that, and it's really just start +
>>>> len of physical memory areas.
>>>>
>>>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>>>> etc, and stitch them into kmemdump core.
>>>>> I am not saying it's impossible, but just tiresome perhaps.
>>>>
>>>> In your patch set, how many of these instances did you encounter where
>>>> that was a problem?
>>>>
>>>>>>
>>>>>> One could just add a const variable that holds this information, or even
>>>>>> better, a simple helper function to calculate that.
>>>>>>
>>>>>> Maybe someone else reading along has a better idea.
>>>>>
>>>>> This could work, but it requires again adding some code into the
>>>>> specific subsystem. E.g. struct_rq_get_size()
>>>>> I am open to ideas , and thank you very much for your thoughts.
>>>>>
>>>>>>
>>>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>>>> what you had would work as intended (maybe it does, not sure).
>>>>>
>>>>> I would not really need to dump the runqueues. But the crash tool which
>>>>> I am using for testing, requires it. Without the runqueues it will not
>>>>> progress further to load the kernel dump.
>>>>> So I am not really sure what it does with the runqueues, but it works.
>>>>> Perhaps using crash/gdb more, to actually do something with this data,
>>>>> would give more insight about its utility.
>>>>> For me, it is a prerequisite to run crash, and then to be able to
>>>>> extract the log buffer from the dump.
>>>>
>>>> I have the faint recollection that percpu vars might not be stored in a
>>>> single contiguous physical memory area, but maybe my memory is just
>>>> wrong, that's why I was raising it.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>>>
>>>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>>>> and adding complexity with __section when really, all we need is a place
>>>>>> to obtain a start and a length.
>>>>>>
>>>>>
>>>>> I understand. The section idea was suggested by Thomas. Initially I was
>>>>> skeptic, but I like how it turned out.
>>>>
>>>> Yeah, I don't like it. Taste differs ;)
>>>>
>>>> I am in particular unhappy about custom memblock wrappers.
>>>>
>>>> [...]
>>>>
>>>>>>>
>>>>>>> To have this working outside of printk, it would be required to walk
>>>>>>> through all the printk structs/allocations and select the required info.
>>>>>>> Is this something that we want to do outside of printk ?
>>>>>>
>>>>>> I don't follow, please elaborate.
>>>>>>
>>>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>>>> given that you run your initialization after setup_log_buf() ?
>>>>>>
>>>>>>
>>>>>
>>>>> My initial thought was the same. However I got some feedback from Petr
>>>>> Mladek here :
>>>>>
>>>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>>>
>>>>> Where he explained how to register the structs correctly.
>>>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>>>
>>>>
>>>> setup_log_buf() is a __init function, so there is only a certain time
>>>> frame where it can be called.
>>>>
>>>> In particular, once the buddy is up, memblock allocations are impossible
>>>> and it would be deeply flawed to call this function again.
>>>>
>>>> Let's not over-engineer this.
>>>>
>>>> Peter is on CC, so hopefully he can share his thoughts.
>>>>
>>>
>>> Hello David,
>>>
>>> I tested out this snippet (on top of my series, so you can see what I
>>> changed):
>>>
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 18ba6c1e174f..7ac4248a00e5 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -67,7 +67,6 @@
>>> #include <linux/wait_api.h>
>>> #include <linux/workqueue_api.h>
>>> #include <linux/livepatch_sched.h>
>>> -#include <linux/kmemdump.h>
>>>
>>> #ifdef CONFIG_PREEMPT_DYNAMIC
>>> # ifdef CONFIG_GENERIC_IRQ_ENTRY
>>> @@ -120,7 +119,12 @@
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>>>
>>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>>> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
>>> +
>>> +size_t runqueues_get_size(void);
>>> +size_t runqueues_get_size(void)
>>> +{
>>> + return sizeof(runqueues);
>>> +}
>>>
>>> #ifdef CONFIG_SCHED_PROXY_EXEC
>>> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
>>> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
>>> index d808c5e67f35..c6dd2d6e96dd 100644
>>> --- a/kernel/vmcore_info.c
>>> +++ b/kernel/vmcore_info.c
>>> @@ -24,6 +24,12 @@
>>> #include "kallsyms_internal.h"
>>> #include "kexec_internal.h"
>>>
>>> +typedef void* kmemdump_opaque_t;
>>> +
>>> +size_t runqueues_get_size(void);
>>> +
>>> +extern kmemdump_opaque_t runqueues;
>>
>> I would have tried that through:
>>
>> struct rq;
>> extern struct rq runqueues;
>>
>> But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
>> not the way we would want to handle that.
>>
>>> /* vmcoreinfo stuff */
>>> unsigned char *vmcoreinfo_data;
>>> size_t vmcoreinfo_size;
>>> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>>>
>>> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
>>> (void *)vmcoreinfo_data, vmcoreinfo_size);
>>> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
>>> + (void *)&runqueues, runqueues_get_size());
>>> +
>>> return 0;
>>> }
>>>
>>> With this, no more .section, no kmemdump code into sched, however, there
>>> are few things :
>>
>> I would really just do here something like the following:
>>
>> /**
>> * sched_get_runqueues_area - obtain the runqueues area for dumping
>> * @start: ...
>> * @size: ...
>> *
>> * The obtained area is only to be used for dumping purposes.
>> */
>> void sched_get_runqueues_area(void *start, size_t size)
>> {
>> start = &runqueues;
>> size = sizeof(runqueues);
>> }
>>
>> might be cleaner.
>>
>
> How about this in the header:
>
> #define DECLARE_DUMP_AREA_FUNC(subsys, symbol) \
> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size);
>
> #define DEFINE_DUMP_AREA_FUNC(subsys, symbol) \
> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size)\
> {\
> *start = &symbol;\
> *size = sizeof(symbol);\
> }
>
> then, in sched just
>
> DECLARE_DUMP_AREA_FUNC(sched, runqueues);
> DEFINE_DUMP_AREA_FUNC(sched, runqueues);
>
> or a single macro that wraps both.
>
> would make it shorter and neater.
>
> What do you think ?
Looks a bit over-engineered, and will require us to import a header
(likely kmemdump.h) in these files, which I don't really enjoy.
I would start simple, without any such macro-magic. It's a very simple
function after all, and likely you won't end up having many of these?
--
Cheers
David / dhildenb
On 8/27/25 23:06, David Hildenbrand wrote:
> On 27.08.25 16:08, Eugen Hristev wrote:
>>
>>
>> On 8/27/25 15:18, David Hildenbrand wrote:
>>> On 27.08.25 13:59, Eugen Hristev wrote:
>>>>
>>>>
>>>> On 8/25/25 16:58, David Hildenbrand wrote:
>>>>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>>>>
>>>>>>
>>>>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>>>>> accesses non-exported symbols.
>>>>>>>>
>>>>>>>> Hello David,
>>>>>>>>
>>>>>>>> I am looking again into this, and there are some things which in my
>>>>>>>> opinion would be difficult to achieve.
>>>>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>>>>> kmemdump.
>>>>>>>>
>>>>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>>>>> sched.
>>>>>>>> Now moving all the struct definition outside of sched.h into another
>>>>>>>> public header would be rather painful and I don't think it's a really
>>>>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>>>>> definitions outside as well. I doubt this is something that we want for
>>>>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>>>>> is to keep these internal structs opaque outside of it.
>>>>>>>
>>>>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>>>>> only tricky part is getting the length.
>>>>>>
>>>>>> I also have in mind the kernel user case. How would a kernel programmer
>>>>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>>>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>>>>> enough.
>>>>>
>>>>> The other way around, why should anybody have a saying in adding their
>>>>> data to kmemdump? Why do we have that all over the kernel?
>>>>>
>>>>> Is your mechanism really so special?
>>>>>
>>>>> A single composer should take care of that, and it's really just start +
>>>>> len of physical memory areas.
>>>>>
>>>>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>>>>> etc, and stitch them into kmemdump core.
>>>>>> I am not saying it's impossible, but just tiresome perhaps.
>>>>>
>>>>> In your patch set, how many of these instances did you encounter where
>>>>> that was a problem?
>>>>>
>>>>>>>
>>>>>>> One could just add a const variable that holds this information, or even
>>>>>>> better, a simple helper function to calculate that.
>>>>>>>
>>>>>>> Maybe someone else reading along has a better idea.
>>>>>>
>>>>>> This could work, but it requires again adding some code into the
>>>>>> specific subsystem. E.g. struct_rq_get_size()
>>>>>> I am open to ideas , and thank you very much for your thoughts.
>>>>>>
>>>>>>>
>>>>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>>>>> what you had would work as intended (maybe it does, not sure).
>>>>>>
>>>>>> I would not really need to dump the runqueues. But the crash tool which
>>>>>> I am using for testing, requires it. Without the runqueues it will not
>>>>>> progress further to load the kernel dump.
>>>>>> So I am not really sure what it does with the runqueues, but it works.
>>>>>> Perhaps using crash/gdb more, to actually do something with this data,
>>>>>> would give more insight about its utility.
>>>>>> For me, it is a prerequisite to run crash, and then to be able to
>>>>>> extract the log buffer from the dump.
>>>>>
>>>>> I have the faint recollection that percpu vars might not be stored in a
>>>>> single contiguous physical memory area, but maybe my memory is just
>>>>> wrong, that's why I was raising it.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>>>>
>>>>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>>>>> and adding complexity with __section when really, all we need is a place
>>>>>>> to obtain a start and a length.
>>>>>>>
>>>>>>
>>>>>> I understand. The section idea was suggested by Thomas. Initially I was
>>>>>> skeptic, but I like how it turned out.
>>>>>
>>>>> Yeah, I don't like it. Taste differs ;)
>>>>>
>>>>> I am in particular unhappy about custom memblock wrappers.
>>>>>
>>>>> [...]
>>>>>
>>>>>>>>
>>>>>>>> To have this working outside of printk, it would be required to walk
>>>>>>>> through all the printk structs/allocations and select the required info.
>>>>>>>> Is this something that we want to do outside of printk ?
>>>>>>>
>>>>>>> I don't follow, please elaborate.
>>>>>>>
>>>>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>>>>> given that you run your initialization after setup_log_buf() ?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> My initial thought was the same. However I got some feedback from Petr
>>>>>> Mladek here :
>>>>>>
>>>>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>>>>
>>>>>> Where he explained how to register the structs correctly.
>>>>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>>>>
>>>>>
>>>>> setup_log_buf() is a __init function, so there is only a certain time
>>>>> frame where it can be called.
>>>>>
>>>>> In particular, once the buddy is up, memblock allocations are impossible
>>>>> and it would be deeply flawed to call this function again.
>>>>>
>>>>> Let's not over-engineer this.
>>>>>
>>>>> Peter is on CC, so hopefully he can share his thoughts.
>>>>>
>>>>
>>>> Hello David,
>>>>
>>>> I tested out this snippet (on top of my series, so you can see what I
>>>> changed):
>>>>
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 18ba6c1e174f..7ac4248a00e5 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -67,7 +67,6 @@
>>>> #include <linux/wait_api.h>
>>>> #include <linux/workqueue_api.h>
>>>> #include <linux/livepatch_sched.h>
>>>> -#include <linux/kmemdump.h>
>>>>
>>>> #ifdef CONFIG_PREEMPT_DYNAMIC
>>>> # ifdef CONFIG_GENERIC_IRQ_ENTRY
>>>> @@ -120,7 +119,12 @@
>>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>>>>
>>>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>>>> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
>>>> +
>>>> +size_t runqueues_get_size(void);
>>>> +size_t runqueues_get_size(void)
>>>> +{
>>>> + return sizeof(runqueues);
>>>> +}
>>>>
>>>> #ifdef CONFIG_SCHED_PROXY_EXEC
>>>> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
>>>> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
>>>> index d808c5e67f35..c6dd2d6e96dd 100644
>>>> --- a/kernel/vmcore_info.c
>>>> +++ b/kernel/vmcore_info.c
>>>> @@ -24,6 +24,12 @@
>>>> #include "kallsyms_internal.h"
>>>> #include "kexec_internal.h"
>>>>
>>>> +typedef void* kmemdump_opaque_t;
>>>> +
>>>> +size_t runqueues_get_size(void);
>>>> +
>>>> +extern kmemdump_opaque_t runqueues;
>>>
>>> I would have tried that through:
>>>
>>> struct rq;
>>> extern struct rq runqueues;
>>>
>>> But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
>>> not the way we would want to handle that.
>>>
>>>> /* vmcoreinfo stuff */
>>>> unsigned char *vmcoreinfo_data;
>>>> size_t vmcoreinfo_size;
>>>> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>>>>
>>>> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
>>>> (void *)vmcoreinfo_data, vmcoreinfo_size);
>>>> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
>>>> + (void *)&runqueues, runqueues_get_size());
>>>> +
>>>> return 0;
>>>> }
>>>>
>>>> With this, no more .section, no kmemdump code into sched, however, there
>>>> are few things :
>>>
>>> I would really just do here something like the following:
>>>
>>> /**
>>> * sched_get_runqueues_area - obtain the runqueues area for dumping
>>> * @start: ...
>>> * @size: ...
>>> *
>>> * The obtained area is only to be used for dumping purposes.
>>> */
>>> void sched_get_runqueues_area(void *start, size_t size)
>>> {
>>> start = &runqueues;
>>> size = sizeof(runqueues);
>>> }
>>>
>>> might be cleaner.
>>>
>>
>> How about this in the header:
>>
>> #define DECLARE_DUMP_AREA_FUNC(subsys, symbol) \
>> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size);
>>
>> #define DEFINE_DUMP_AREA_FUNC(subsys, symbol) \
>> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size)\
>> {\
>> *start = &symbol;\
>> *size = sizeof(symbol);\
>> }
>>
>> then, in sched just
>>
>> DECLARE_DUMP_AREA_FUNC(sched, runqueues);
>> DEFINE_DUMP_AREA_FUNC(sched, runqueues);
>>
>> or a single macro that wraps both.
>>
>> would make it shorter and neater.
>>
>> What do you think ?
>
> Looks a bit over-engineered, and will require us to import a header
> (likely kmemdump.h) in these files, which I don't really enjoy.
>
> I would start simple, without any such macro-magic. It's a very simple
> function after all, and likely you won't end up having many of these?
>
Thanks David, I will do it as you suggested and see what comes out of it.
I have one side question you might know much better to answer:
As we have a start and a size for each region, this start is a virtual
address. The firmware/coprocessor that reads the memory and dumps it,
requires physical addresses. What do you suggest to use to retrieve that
address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
better lm_alias ?
As kmemdump is agnostic of the region of the memory the `start` comes
from, and it should be portable and platform independent.
Thanks again,
Eugen
>>> What do you think ?
>>
>> Looks a bit over-engineered, and will require us to import a header
>> (likely kmemdump.h) in these files, which I don't really enjoy.
>>
>> I would start simple, without any such macro-magic. It's a very simple
>> function after all, and likely you won't end up having many of these?
>>
>
> Thanks David, I will do it as you suggested and see what comes out of it.
>
> I have one side question you might know much better to answer:
> As we have a start and a size for each region, this start is a virtual
> address. The firmware/coprocessor that reads the memory and dumps it,
> requires physical addresses.

Right. I was asking myself the same question while reviewing: should we
directly export physical ranges here instead of virtual ones. I guess
virtual ones is ok.

> What do you suggest to use to retrieve that
> address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
> better lm_alias ?

All areas should either come from memblock or be global variables, right?

IIRC, virt_to_phys() should work for these. Did you run into any
problems with them or why do you think virt_to_phys could be problematic?

--
Cheers

David / dhildenb
On 9/1/25 13:01, David Hildenbrand wrote:
>>>> What do you think ?
>>>
>>> Looks a bit over-engineered, and will require us to import a header
>>> (likely kmemdump.h) in these files, which I don't really enjoy.
>>>
>>> I would start simple, without any such macro-magic. It's a very simple
>>> function after all, and likely you won't end up having many of these?
>>>
>>
>> Thanks David, I will do it as you suggested and see what comes out of it.
>>
>> I have one side question you might know much better to answer:
>> As we have a start and a size for each region, this start is a virtual
>> address. The firmware/coprocessor that reads the memory and dumps it,
>> requires physical addresses.
>
> Right. I was asking myself the same question while reviewing: should we
> directly export physical ranges here instead of virtual ones. I guess
> virtual ones is ok.

In patch 22/29, some areas are registered using
memblock_phys_alloc_try_nid() which allocates physical.
In this case, phys_to_virt() didn't work for me, it was returning a
wrong address. I used __va() and this worked. So there is a difference
between them.

>
> What do you suggest to use to retrieve that
>> address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
>> better lm_alias ?
>
> All areas should either come from memblock or be global variables, right?

I would like to be able to register from anywhere. For example someone
debugging their driver, to just register kmalloc'ed struct.
Other use case is to register dma coherent CMA areas.

>
> IIRC, virt_to_phys() should work for these. Did you run into any
> problems with them or why do you think virt_to_phys could be problematic?
>

I am pondering about whether it would work in all cases, considering
its source code comments that it shall not be used because it does not
work for any address.

Someone also reported its unavailability like this:

drivers/debug/kmemdump_coreimage.c:67:24: error: call to undeclared
function 'virt_to_phys'; ISO C99 and later do not support implicit
function declarations [-Wimplicit-function-declaration]

I am yet to figure out which config fails.
On 01.09.25 14:02, Eugen Hristev wrote:
> On 9/1/25 13:01, David Hildenbrand wrote:
>>>>> What do you think ?
>>>>
>>>> Looks a bit over-engineered, and will require us to import a header
>>>> (likely kmemdump.h) in these files, which I don't really enjoy.
>>>>
>>>> I would start simple, without any such macro-magic. It's a very simple
>>>> function after all, and likely you won't end up having many of these?
>>>>
>>>
>>> Thanks David, I will do it as you suggested and see what comes out of it.
>>>
>>> I have one side question you might know much better to answer:
>>> As we have a start and a size for each region, this start is a virtual
>>> address. The firmware/coprocessor that reads the memory and dumps it,
>>> requires physical addresses.
>>
>> Right. I was asking myself the same question while reviewing: should we
>> directly export physical ranges here instead of virtual ones. I guess
>> virtual ones is ok.
>
> In patch 22/29, some areas are registered using
> memblock_phys_alloc_try_nid() which allocates physical.
> In this case, phys_to_virt() didn't work for me, it was returning a
> wrong address. I used __va() and this worked. So there is a difference
> between them.

memblock_alloc_internal() calls memblock_alloc_range_nid() to then
perform a phys_to_virt().

memblock_phys_alloc_try_nid() calls memblock_alloc_range_nid() without
the phys_to_virt().

So it's rather surprising that a phys_to_virt() would not work in that case.

Maybe for these cases where you export the area through a new helper,
you can just export the physical addr + length instead. Then, it's also
clear that this area is actually physically contiguous.

>>
>> What do you suggest to use to retrieve that
>>> address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
>>> better lm_alias ?
>>
>> All areas should either come from memblock or be global variables, right?
>
> I would like to be able to register from anywhere. For example someone
> debugging their driver, to just register kmalloc'ed struct.
> Other use case is to register dma coherent CMA areas.

Then probably better to export physical addresses (that you need either
way) directly from the helpers you have to add.

>
>>
>> IIRC, virt_to_phys() should work for these. Did you run into any
>> problems with them or why do you think virt_to_phys could be problematic?
>>
>
> I am pondering about whether it would work in all cases, considering
> its source code comments that it shall not be used because it does not
> work for any address.

Yeah, it does for example not work for kernel stacks IIRC.

--
Cheers

David / dhildenb
On 04.08.25 12:54, Michal Hocko wrote:
> On Wed 30-07-25 16:04:28, David Hildenbrand wrote:
>> On 30.07.25 15:57, Eugen Hristev wrote:
> [...]
>>> Yes, registering after is also an option. Initially this is how I
>>> designed the kmemdump API, I also had in mind to add a flag, but, after
>>> discussing with Thomas Gleixner, he came up with the macro wrapper idea
>>> here:
>>> https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/
>>> Do you think we can continue that discussion, or maybe start it here ?
>>
>> Yeah, I don't like that, but I can see how we ended up here.
>>
>> I also don't quite like the idea that we must encode here what to
>> include in a dump and what not ...
>>
>> For the vmcore we construct it at runtime in crash_save_vmcoreinfo_init(),
>> where we e.g., have
>>
>> VMCOREINFO_STRUCT_SIZE(pglist_data);
>>
>> Could we similarly have some place where we construct what to dump,
>> just not using the current values, but the memory ranges?
>
> All those symbols are part of kallsyms, right? Can we just use kallsyms
> infrastructure and a list of symbols to get what we need from there?
>
> In other words the list of symbols to be completely external to the code
> that is defining them?

That was the idea. All we should need is the start+size of the ranges.

No need to have these kmemdump specifics all over the kernel.

--
Cheers,

David / dhildenb