Annotate vital static information into kmemdump:
- node_data

Information on these variables is stored in a dedicated kmemdump section.

Register dynamic information into kmemdump:
- dynamic node data for each node

This information is allocated for each node as a physical address, so
call kmemdump_phys_alloc_size(), which allocates a unique kmemdump uid
and registers the virtual address.
Signed-off-by: Eugen Hristev <eugen.hristev@linaro.org>
---
mm/numa.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/numa.c b/mm/numa.c
index 7d5e06fe5bd4..88cada571171 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -4,9 +4,11 @@
 #include <linux/printk.h>
 #include <linux/numa.h>
 #include <linux/numa_memblks.h>
+#include <linux/kmemdump.h>
 
 struct pglist_data *node_data[MAX_NUMNODES];
 EXPORT_SYMBOL(node_data);
+KMEMDUMP_VAR_CORE(node_data, MAX_NUMNODES * sizeof(struct pglist_data));
 
 /* Allocate NODE_DATA for a node on the local memory */
 void __init alloc_node_data(int nid)
@@ -16,7 +18,8 @@ void __init alloc_node_data(int nid)
 	int tnid;
 
 	/* Allocate node data. Try node-local memory and then any node. */
-	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	nd_pa = kmemdump_phys_alloc_size(nd_size, memblock_phys_alloc_try_nid,
+					 nd_size, SMP_CACHE_BYTES, nid);
 	if (!nd_pa)
 		panic("Cannot allocate %zu bytes for node %d data\n",
 		      nd_size, nid);
--
2.43.0
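[Editorial note: the review below contrasts the wrapper call above with registering *after* allocating. A minimal userspace sketch of that alternative, with memblock and the kmemdump registration both mocked (all helper names here are hypothetical, not the real kernel API):]

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t phys_addr_t;

/* Mock of memblock_phys_alloc_try_nid(): hands out fake physical
 * addresses from a static pool instead of real memblock memory. */
static char fake_pool[4096];
static size_t fake_off;

static phys_addr_t memblock_phys_alloc_try_nid(size_t size, size_t align, int nid)
{
	phys_addr_t pa = (phys_addr_t)(uintptr_t)&fake_pool[fake_off];

	(void)align;
	(void)nid;
	fake_off += size;
	return pa;
}

/* Mock registration: record the last (addr, size) pair handed to kmemdump. */
static phys_addr_t last_reg_pa;
static size_t last_reg_size;

static void kmemdump_register_phys(phys_addr_t pa, size_t size)
{
	last_reg_pa = pa;
	last_reg_size = size;
}

/* Register-after-allocating: the call site stays a plain allocation,
 * followed by an explicit registration -- no allocator wrapper needed. */
phys_addr_t alloc_node_data_sketch(size_t nd_size, int nid)
{
	phys_addr_t nd_pa = memblock_phys_alloc_try_nid(nd_size, 64, nid);

	if (nd_pa)
		kmemdump_register_phys(nd_pa, nd_size);
	return nd_pa;
}
```

The trade-off debated below is exactly this: the wrapper keeps allocation and registration atomic in one call, while the two-step form keeps the allocator's signature untouched.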
On 24.07.25 15:55, Eugen Hristev wrote:
> Annotate vital static information into kmemdump:
> - node_data
[...]
> 	/* Allocate node data. Try node-local memory and then any node. */
> -	nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> +	nd_pa = kmemdump_phys_alloc_size(nd_size, memblock_phys_alloc_try_nid,
> +					 nd_size, SMP_CACHE_BYTES, nid);

Do we really want to wrap memblock allocations in such a way? :/

Gah, no, no no.

Can't we pass that as some magical flag, or just ... register *after*
allocating?

--
Cheers,

David / dhildenb
Hello,

On 7/30/25 16:52, David Hildenbrand wrote:
> On 24.07.25 15:55, Eugen Hristev wrote:
[...]
> Do we really want to wrap memblock allocations in such a way? :/
>
> Gah, no, no no.
>
> Can't we pass that as some magical flag, or just ... register *after*
> allocating?

Thanks for looking into my patch.

Yes, registering after is also an option. Initially this is how I
designed the kmemdump API. I also had in mind to add a flag, but, after
discussing with Thomas Gleixner, he came up with the macro wrapper idea
here:

https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/

Do you think we can continue that discussion, or maybe start it here?

Eugen
On 30.07.25 15:57, Eugen Hristev wrote:
[...]
> Yes, registering after is also an option. Initially this is how I
> designed the kmemdump API. I also had in mind to add a flag, but, after
> discussing with Thomas Gleixner, he came up with the macro wrapper idea
> here:
> https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/
> Do you think we can continue that discussion, or maybe start it here?

Yeah, I don't like that, but I can see how we ended up here.

I also don't quite like the idea that we must encode here what to
include in a dump and what not ...

For the vmcore we construct it at runtime in
crash_save_vmcoreinfo_init(), where we e.g., have

	VMCOREINFO_STRUCT_SIZE(pglist_data);

Could we similarly have some place where we construct what to dump,
just not using the current values, but the memory ranges?

Did you consider that?

--
Cheers,

David / dhildenb
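[Editorial note: the vmcoreinfo-style idea raised here — one central site that records memory ranges instead of annotations scattered through subsystems — could be sketched roughly as follows. This is a mock; `kmemdump_range`, `KMEMDUMP_RANGE`, and the table name are hypothetical, and kernel variables are replaced with stand-ins:]

```c
#include <stddef.h>

/* Stand-ins for the kernel variables kmemdump would cover. */
static struct { long nr_free; } node_data_mock[4];
static char log_buf_mock[256];

struct kmemdump_range {
	const char *name;
	void *start;
	size_t len;
};

/* Record a symbol's (start, length) pair -- analogous in spirit to
 * VMCOREINFO_STRUCT_SIZE(), but capturing a range, not a value. */
#define KMEMDUMP_RANGE(sym) { #sym, &sym, sizeof(sym) }

/* One central table built in kmemdump core, so the subsystems that own
 * these variables never need to include a kmemdump header. */
static const struct kmemdump_range kmemdump_ranges[] = {
	KMEMDUMP_RANGE(node_data_mock),
	KMEMDUMP_RANGE(log_buf_mock),
};

size_t kmemdump_total_len(void)
{
	size_t i, total = 0;

	for (i = 0; i < sizeof(kmemdump_ranges) / sizeof(kmemdump_ranges[0]); i++)
		total += kmemdump_ranges[i].len;
	return total;
}
```

The limitation Eugen raises later in the thread applies here too: a static table like this only works for symbols visible at the central site, not for private statics or dynamically allocated buffers.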
On Wed 30-07-25 16:04:28, David Hildenbrand wrote:
> On 30.07.25 15:57, Eugen Hristev wrote:
[...]
> I also don't quite like the idea that we must encode here what to
> include in a dump and what not ...
>
> For the vmcore we construct it at runtime in
> crash_save_vmcoreinfo_init(), where we e.g., have
>
> 	VMCOREINFO_STRUCT_SIZE(pglist_data);
>
> Could we similarly have some place where we construct what to dump,
> just not using the current values, but the memory ranges?

All those symbols are part of kallsyms, right? Can we just use kallsyms
infrastructure and a list of symbols to get what we need from there?

In other words the list of symbols to be completely external to the code
that is defining them?

--
Michal Hocko
SUSE Labs
On 8/4/25 13:54, Michal Hocko wrote:
> On Wed 30-07-25 16:04:28, David Hildenbrand wrote:
[...]
> All those symbols are part of kallsyms, right? Can we just use kallsyms
> infrastructure and a list of symbols to get what we need from there?
>
> In other words the list of symbols to be completely external to the code
> that is defining them?

Some static symbols are indeed part of kallsyms. But some symbols are
not exported, for example patch 20/29, where printk related symbols are
not to be exported. Another example is with static variables, like in
patch 17/29, not exported as symbols, but required for the dump.
Dynamic memory regions also have to be considered; have a look for
example at patch 23/29, where dynamically allocated memory needs to be
registered.

Do you think that I should move all kallsyms related symbols annotation
into a separate place and keep it in place for the static/dynamic
regions?

Thanks for looking into my patch,
Eugen
On 04.08.25 13:06, Eugen Hristev wrote:
> On 8/4/25 13:54, Michal Hocko wrote:
[...]
> Do you think that I should move all kallsyms related symbols annotation
> into a separate place and keep it in place for the static/dynamic
> regions?

If you want to use a symbol from kmemdump, then make that symbol
available to kmemdump.

IOW, if we were to rip out kmemdump tomorrow, we wouldn't have to touch
any non-kmemdump-specific files.

--
Cheers,

David / dhildenb
On 8/4/25 15:18, David Hildenbrand wrote:
> On 04.08.25 13:06, Eugen Hristev wrote:
[...]
> If you want to use a symbol from kmemdump, then make that symbol
> available to kmemdump.

That's what I am doing, registering symbols with kmemdump.
Maybe I do not understand what you mean; do you have any suggestion for
the static variables case (symbols not exported)?

> IOW, if we were to rip out kmemdump tomorrow, we wouldn't have to touch
> any non-kmemdump-specific files.
On 04.08.25 14:29, Eugen Hristev wrote:
> On 8/4/25 15:18, David Hildenbrand wrote:
[...]
> That's what I am doing, registering symbols with kmemdump.
> Maybe I do not understand what you mean; do you have any suggestion for
> the static variables case (symbols not exported)?

Let's use patch #20 as example:

What I am thinking is that you would not include "linux/kmemdump.h" and
not leak all of that KMEMDUMP_ stuff into all these files/subsystems
that couldn't care less about kmemdump.

Instead of doing

	static struct printk_ringbuffer printk_rb_dynamic;

You'd do

	struct printk_ringbuffer printk_rb_dynamic;

and have it in some header file, from where kmemdump could look up the
address.

So you move the logic of what goes into a dump from the subsystems to
the kmemdump core.

--
Cheers,

David / dhildenb
On 8/4/25 15:49, David Hildenbrand wrote:
> On 04.08.25 14:29, Eugen Hristev wrote:
[...]
> So you move the logic of what goes into a dump from the subsystems to
> the kmemdump core.

That works if the people maintaining these systems agree with it.
Attempts to export symbols from printk e.g. have been nacked:

https://lore.kernel.org/all/20250218-175733-neomutt-senozhatsky@chromium.org/

So I am unsure whether just removing the static and adding them into
header files would be more acceptable.

Added in CC Christoph Hellwig and Sergey Senozhatsky; maybe they could
tell us directly whether they like or dislike this approach, as kmemdump
would be builtin and would not require exports.

One other thing to mention is the fact that the printk code dynamically
allocates memory that would need to be registered. There is no mechanism
for kmemdump to know when this process has been completed (or even if it
happened at all, because it happens on demand in certain conditions).

Thanks!
On 04.08.25 15:03, Eugen Hristev wrote:
> On 8/4/25 15:49, David Hildenbrand wrote:
[...]
> That works if the people maintaining these systems agree with it.
> Attempts to export symbols from printk e.g. have been nacked:
>
> https://lore.kernel.org/all/20250218-175733-neomutt-senozhatsky@chromium.org/

Do you really need the EXPORT_SYMBOL?

Can't you just not export symbols, building the relevant kmemdump part
into the core, not as a module?

IIRC, kernel/vmcore_info.c is never built as a module, as it also
accesses non-exported symbols.

> So I am unsure whether just removing the static and adding them into
> header files would be more acceptable.
>
> Added in CC Christoph Hellwig and Sergey Senozhatsky; maybe they could
> tell us directly whether they like or dislike this approach, as kmemdump
> would be builtin and would not require exports.
>
> One other thing to mention is the fact that the printk code dynamically
> allocates memory that would need to be registered. There is no mechanism
> for kmemdump to know when this process has been completed (or even if it
> happened at all, because it happens on demand in certain conditions).

If we are talking about memblock allocations, they sure are finished at
the time ... the buddy is up.

So it's just a matter of placing yourself late in the init stage where
the buddy is already up and running.

I assume dumping any dynamically allocated stuff through the buddy is
out of the picture for now.

--
Cheers,

David / dhildenb
On 8/4/25 16:26, David Hildenbrand wrote: > On 04.08.25 15:03, Eugen Hristev wrote: >> >> >> On 8/4/25 15:49, David Hildenbrand wrote: >>> On 04.08.25 14:29, Eugen Hristev wrote: >>>> >>>> >>>> On 8/4/25 15:18, David Hildenbrand wrote: >>>>> On 04.08.25 13:06, Eugen Hristev wrote: >>>>>> >>>>>> >>>>>> On 8/4/25 13:54, Michal Hocko wrote: >>>>>>> On Wed 30-07-25 16:04:28, David Hildenbrand wrote: >>>>>>>> On 30.07.25 15:57, Eugen Hristev wrote: >>>>>>> [...] >>>>>>>>> Yes, registering after is also an option. Initially this is how I >>>>>>>>> designed the kmemdump API, I also had in mind to add a flag, but, after >>>>>>>>> discussing with Thomas Gleixner, he came up with the macro wrapper idea >>>>>>>>> here: >>>>>>>>> https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/ >>>>>>>>> Do you think we can continue that discussion , or maybe start it here ? >>>>>>>> >>>>>>>> Yeah, I don't like that, but I can see how we ended up here. >>>>>>>> >>>>>>>> I also don't quite like the idea that we must encode here what to include in >>>>>>>> a dump and what not ... >>>>>>>> >>>>>>>> For the vmcore we construct it at runtime in crash_save_vmcoreinfo_init(), >>>>>>>> where we e.g., have >>>>>>>> >>>>>>>> VMCOREINFO_STRUCT_SIZE(pglist_data); >>>>>>>> >>>>>>>> Could we similar have some place where we construct what to dump similarly, >>>>>>>> just not using the current values, but the memory ranges? >>>>>>> >>>>>>> All those symbols are part of kallsyms, right? Can we just use kallsyms >>>>>>> infrastructure and a list of symbols to get what we need from there? >>>>>>> >>>>>>> In other words the list of symbols to be completely external to the code >>>>>>> that is defining them? >>>>>> >>>>>> Some static symbols are indeed part of kallsyms. But some symbols are >>>>>> not exported, for example patch 20/29, where printk related symbols are >>>>>> not to be exported. 
Another example is with static variables, like in >>>>>> patch 17/29 , not exported as symbols, but required for the dump. >>>>>> Dynamic memory regions are not have to also be considered, have a look >>>>>> for example at patch 23/29 , where dynamically allocated memory needs to >>>>>> be registered. >>>>>> >>>>>> Do you think that I should move all kallsyms related symbols annotation >>>>>> into a separate place and keep it for the static/dynamic regions in place ? >>>>> >>>>> If you want to use a symbol from kmemdump, then make that symbol >>>>> available to kmemdump. >>>> >>>> That's what I am doing, registering symbols with kmemdump. >>>> Maybe I do not understand what you mean, do you have any suggestion for >>>> the static variables case (symbols not exported) ? >>> >>> Let's use patch #20 as example: >>> >>> What I am thinking is that you would not include "linux/kmemdump.h" and >>> not leak all of that KMEMDUMP_ stuff in all these files/subsystems that >>> couldn't less about kmemdump. >>> >>> Instead of doing >>> >>> static struct printk_ringbuffer printk_rb_dynamic; >>> >>> You'd do >>> >>> struct printk_ringbuffer printk_rb_dynamic; >>> >>> and have it in some header file, from where kmemdump could lookup the >>> address. >>> >>> So you move the logic of what goes into a dump from the subsystems to >>> the kmemdump core. >>> >> >> That works if the people maintaining these systems agree with it. >> Attempts to export symbols from printk e.g. have been nacked : >> >> https://lore.kernel.org/all/20250218-175733-neomutt-senozhatsky@chromium.org/ > > Do you really need the EXPORT_SYMBOL? > > Can't you just not export symbols, building the relevant kmemdump part > into the core not as a module. > > IIRC, kernel/vmcore_info.c is never built as a module, as it also > accesses non-exported symbols. Hello David, I am looking again into this, and there are some things which in my opinion would be difficult to achieve. 
For example I looked into my patch #11 , which adds the `runqueues` into kmemdump. The runqueues is a variable of `struct rq` which is defined in kernel/sched/sched.h , which is not supposed to be included outside of sched. Now moving all the struct definition outside of sched.h into another public header would be rather painful and I don't think it's a really good option (The struct would be needed to compute the sizeof inside vmcoreinfo). Secondly, it would also imply moving all the nested struct definitions outside as well. I doubt this is something that we want for the sched subsys. How the subsys is designed, out of my understanding, is to keep these internal structs opaque outside of it. From my perspective it's much simpler and cleaner to just add the kmemdump annotation macro inside the sched/core.c as it's done in my patch. This macro translates to a noop if kmemdump is not selected. How do you see this done another way ? > >> >> So I am unsure whether just removing the static and adding them into >> header files would be more acceptable. >> >> Added in CC Cristoph Hellwig and Sergey Senozhatsky maybe they could >> tell us directly whether they like or dislike this approach, as kmemdump >> would be builtin and would not require exports. >> >> One other thing to mention is the fact that the printk code dynamically >> allocates memory that would need to be registered. There is no mechanism >> for kmemdump to know when this process has been completed (or even if it >> was at all, because it happens on demand in certain conditions). > > If we are talking about memblock allocations, they sure are finished at > the time ... the buddy is up. > > So it's just a matter of placing yourself late in the init stage where > the buddy is already up and running. > > I assume dumping any dynamically allocated stuff through the buddy is > out of the picture for now. > The dumping mechanism needs to work for dynamically allocated stuff, and right now, it works for e.g. 
printk, if the buffer is dynamically allocated later on in the boot
process.

To have this working outside of printk, it would be required to walk
through all the printk structs/allocations and select the required info.
Is this something that we want to do outside of printk ?

E.g. for the printk panic-dump case, the whole dumping is done by
registering a dumper that does the job inside printk. There is no
mechanism walking through printk data in another subsystem (in my
example, pstore). So for me it is logical to register the data inside
printk. Does this make sense ?
>>
>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>> accesses non-exported symbols.
>
> Hello David,
>
> I am looking again into this, and there are some things which in my
> opinion would be difficult to achieve.
> For example I looked into my patch #11 , which adds the `runqueues` into
> kmemdump.
>
> The runqueues is a variable of `struct rq` which is defined in
> kernel/sched/sched.h , which is not supposed to be included outside of
> sched.
> Now moving all the struct definition outside of sched.h into another
> public header would be rather painful and I don't think it's a really
> good option (The struct would be needed to compute the sizeof inside
> vmcoreinfo). Secondly, it would also imply moving all the nested struct
> definitions outside as well. I doubt this is something that we want for
> the sched subsys. How the subsys is designed, out of my understanding,
> is to keep these internal structs opaque outside of it.

All the kmemdump module needs is a start and a length, correct? So the
only tricky part is getting the length.

One could just add a const variable that holds this information, or even
better, a simple helper function to calculate that.

Maybe someone else reading along has a better idea.

Interestingly, runqueues is a percpu variable, which makes me wonder if
what you had would work as intended (maybe it does, not sure).

>
> From my perspective it's much simpler and cleaner to just add the
> kmemdump annotation macro inside the sched/core.c as it's done in my
> patch. This macro translates to a noop if kmemdump is not selected.

I really don't like how we are spreading kmemdump all over the kernel,
and adding complexity with __section when really, all we need is a place
to obtain a start and a length.

So we should explore if there is anything easier possible.

>>
>>>
>>> So I am unsure whether just removing the static and adding them into
>>> header files would be more acceptable.
>>>
>>> Added in CC Cristoph Hellwig and Sergey Senozhatsky maybe they could
>>> tell us directly whether they like or dislike this approach, as kmemdump
>>> would be builtin and would not require exports.
>>>
>>> One other thing to mention is the fact that the printk code dynamically
>>> allocates memory that would need to be registered. There is no mechanism
>>> for kmemdump to know when this process has been completed (or even if it
>>> was at all, because it happens on demand in certain conditions).
>>
>> If we are talking about memblock allocations, they sure are finished at
>> the time ... the buddy is up.
>>
>> So it's just a matter of placing yourself late in the init stage where
>> the buddy is already up and running.
>>
>> I assume dumping any dynamically allocated stuff through the buddy is
>> out of the picture for now.
>>
>
> The dumping mechanism needs to work for dynamically allocated stuff, and
> right now, it works for e.g. printk, if the buffer is dynamically
> allocated later on in the boot process.

You are talking about the memblock_alloc() result, correct? Like

new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);

The current version is always stored in

static char *log_buf = __log_buf;


Once early boot is done and memblock gets torn down, you can just use
log_buf and be sure that it will not change anymore.

>
> To have this working outside of printk, it would be required to walk
> through all the printk structs/allocations and select the required info.
> Is this something that we want to do outside of printk ?

I don't follow, please elaborate.

How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
given that you run your initialization after setup_log_buf() ?

--
Cheers

David / dhildenb
On 8/25/25 16:20, David Hildenbrand wrote:
>
>>>
>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>> accesses non-exported symbols.
>>
>> Hello David,
>>
>> I am looking again into this, and there are some things which in my
>> opinion would be difficult to achieve.
>> For example I looked into my patch #11 , which adds the `runqueues` into
>> kmemdump.
>>
>> The runqueues is a variable of `struct rq` which is defined in
>> kernel/sched/sched.h , which is not supposed to be included outside of
>> sched.
>> Now moving all the struct definition outside of sched.h into another
>> public header would be rather painful and I don't think it's a really
>> good option (The struct would be needed to compute the sizeof inside
>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>> definitions outside as well. I doubt this is something that we want for
>> the sched subsys. How the subsys is designed, out of my understanding,
>> is to keep these internal structs opaque outside of it.
>
> All the kmemdump module needs is a start and a length, correct? So the
> only tricky part is getting the length.

I also have in mind the kernel user case. How would a kernel programmer
want to add some kernel structs/info/buffers into kmemdump such that the
dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
enough.
Otherwise maybe the programmer has to write helpers to compute lengths
etc, and stitch them into kmemdump core.
I am not saying it's impossible, but just tiresome perhaps.

>
> One could just add a const variable that holds this information, or even
> better, a simple helper function to calculate that.
>
> Maybe someone else reading along has a better idea.

This could work, but it requires again adding some code into the
specific subsystem. E.g. struct_rq_get_size()
I am open to ideas , and thank you very much for your thoughts.
>
> Interestingly, runqueues is a percpu variable, which makes me wonder if
> what you had would work as intended (maybe it does, not sure).

I would not really need to dump the runqueues. But the crash tool which
I am using for testing, requires it. Without the runqueues it will not
progress further to load the kernel dump.
So I am not really sure what it does with the runqueues, but it works.
Perhaps using crash/gdb more, to actually do something with this data,
would give more insight about its utility.
For me, it is a prerequisite to run crash, and then to be able to
extract the log buffer from the dump.

>
>>
>> From my perspective it's much simpler and cleaner to just add the
>> kmemdump annotation macro inside the sched/core.c as it's done in my
>> patch. This macro translates to a noop if kmemdump is not selected.
>
> I really don't like how we are spreading kmemdump all over the kernel,
> and adding complexity with __section when really, all we need is a place
> to obtain a start and a length.
>

I understand. The section idea was suggested by Thomas. Initially I was
skeptic, but I like how it turned out.

> So we should explore if there is anything easier possible.
>
>>>
>>>>
>>>> So I am unsure whether just removing the static and adding them into
>>>> header files would be more acceptable.
>>>
>>> So it's just a matter of placing yourself late in the init stage where
>>> the buddy is already up and running.
>>>
>>> I assume dumping any dynamically allocated stuff through the buddy is
>>> out of the picture for now.
>>>
>>
>> The dumping mechanism needs to work for dynamically allocated stuff, and
>> right now, it works for e.g. printk, if the buffer is dynamically
>> allocated later on in the boot process.
>
> You are talking about the memblock_alloc() result, correct? Like
>
> new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
>
> The current version is always stored in
>
> static char *log_buf = __log_buf;
>
>
> Once early boot is done and memblock gets torn down, you can just use
> log_buf and be sure that it will not change anymore.
>
>>
>> To have this working outside of printk, it would be required to walk
>> through all the printk structs/allocations and select the required info.
>> Is this something that we want to do outside of printk ?
>
> I don't follow, please elaborate.
>
> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
> given that you run your initialization after setup_log_buf() ?
>
>

My initial thought was the same. However I got some feedback from Petr
Mladek here :

https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/

Where he explained how to register the structs correctly.
It can be that setup_log_buf is called again at a later time perhaps.
On 25.08.25 15:36, Eugen Hristev wrote:
>
>
> On 8/25/25 16:20, David Hildenbrand wrote:
>>
>>>>
>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>> accesses non-exported symbols.
>>>
>>> Hello David,
>>>
>>> I am looking again into this, and there are some things which in my
>>> opinion would be difficult to achieve.
>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>> kmemdump.
>>>
>>> The runqueues is a variable of `struct rq` which is defined in
>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>> sched.
>>> Now moving all the struct definition outside of sched.h into another
>>> public header would be rather painful and I don't think it's a really
>>> good option (The struct would be needed to compute the sizeof inside
>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>> definitions outside as well. I doubt this is something that we want for
>>> the sched subsys. How the subsys is designed, out of my understanding,
>>> is to keep these internal structs opaque outside of it.
>>
>> All the kmemdump module needs is a start and a length, correct? So the
>> only tricky part is getting the length.
>
> I also have in mind the kernel user case. How would a kernel programmer
> want to add some kernel structs/info/buffers into kmemdump such that the
> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
> enough.

The other way around, why should anybody have a saying in adding their
data to kmemdump? Why do we have that all over the kernel?

Is your mechanism really so special?

A single composer should take care of that, and it's really just start +
len of physical memory areas.

> Otherwise maybe the programmer has to write helpers to compute lengths
> etc, and stitch them into kmemdump core.
> I am not saying it's impossible, but just tiresome perhaps.

In your patch set, how many of these instances did you encounter where
that was a problem?
>>
>> One could just add a const variable that holds this information, or even
>> better, a simple helper function to calculate that.
>>
>> Maybe someone else reading along has a better idea.
>
> This could work, but it requires again adding some code into the
> specific subsystem. E.g. struct_rq_get_size()
> I am open to ideas , and thank you very much for your thoughts.
>
>>
>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>> what you had would work as intended (maybe it does, not sure).
>
> I would not really need to dump the runqueues. But the crash tool which
> I am using for testing, requires it. Without the runqueues it will not
> progress further to load the kernel dump.
> So I am not really sure what it does with the runqueues, but it works.
> Perhaps using crash/gdb more, to actually do something with this data,
> would give more insight about its utility.
> For me, it is a prerequisite to run crash, and then to be able to
> extract the log buffer from the dump.

I have the faint recollection that percpu vars might not be stored in a
single contiguous physical memory area, but maybe my memory is just
wrong, that's why I was raising it.

>
>>
>>>
>>> From my perspective it's much simpler and cleaner to just add the
>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>> patch. This macro translates to a noop if kmemdump is not selected.
>>
>> I really don't like how we are spreading kmemdump all over the kernel,
>> and adding complexity with __section when really, all we need is a place
>> to obtain a start and a length.
>>
>
> I understand. The section idea was suggested by Thomas. Initially I was
> skeptic, but I like how it turned out.

Yeah, I don't like it. Taste differs ;)

I am in particular unhappy about custom memblock wrappers.

[...]

>>>
>>> To have this working outside of printk, it would be required to walk
>>> through all the printk structs/allocations and select the required info.
>>> Is this something that we want to do outside of printk ?
>>
>> I don't follow, please elaborate.
>>
>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>> given that you run your initialization after setup_log_buf() ?
>>
>>
>
> My initial thought was the same. However I got some feedback from Petr
> Mladek here :
>
> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>
> Where he explained how to register the structs correctly.
> It can be that setup_log_buf is called again at a later time perhaps.
>

setup_log_buf() is a __init function, so there is only a certain time
frame where it can be called.

In particular, once the buddy is up, memblock allocations are impossible
and it would be deeply flawed to call this function again.

Let's not over-engineer this.

Peter is on CC, so hopefully he can share his thoughts.

--
Cheers

David / dhildenb
On 8/25/25 16:58, David Hildenbrand wrote:
> On 25.08.25 15:36, Eugen Hristev wrote:
>>
>>
>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>
>>>>>
>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>> accesses non-exported symbols.
>>>>
>>>> Hello David,
>>>>
>>>> I am looking again into this, and there are some things which in my
>>>> opinion would be difficult to achieve.
>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>> kmemdump.
>>>>
>>>> The runqueues is a variable of `struct rq` which is defined in
>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>> sched.
>>>> Now moving all the struct definition outside of sched.h into another
>>>> public header would be rather painful and I don't think it's a really
>>>> good option (The struct would be needed to compute the sizeof inside
>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>> definitions outside as well. I doubt this is something that we want for
>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>> is to keep these internal structs opaque outside of it.
>>>
>>> All the kmemdump module needs is a start and a length, correct? So the
>>> only tricky part is getting the length.
>>
>> I also have in mind the kernel user case. How would a kernel programmer
>> want to add some kernel structs/info/buffers into kmemdump such that the
>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>> enough.
>
> The other way around, why should anybody have a saying in adding their
> data to kmemdump? Why do we have that all over the kernel?
>
> Is your mechanism really so special?
>
> A single composer should take care of that, and it's really just start +
> len of physical memory areas.
>
>> Otherwise maybe the programmer has to write helpers to compute lengths
>> etc, and stitch them into kmemdump core.
>> I am not saying it's impossible, but just tiresome perhaps.
>
> In your patch set, how many of these instances did you encounter where
> that was a problem?
>
>>>
>>> One could just add a const variable that holds this information, or even
>>> better, a simple helper function to calculate that.
>>>
>>> Maybe someone else reading along has a better idea.
>>
>> This could work, but it requires again adding some code into the
>> specific subsystem. E.g. struct_rq_get_size()
>> I am open to ideas , and thank you very much for your thoughts.
>>
>>>
>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>> what you had would work as intended (maybe it does, not sure).
>>
>> I would not really need to dump the runqueues. But the crash tool which
>> I am using for testing, requires it. Without the runqueues it will not
>> progress further to load the kernel dump.
>> So I am not really sure what it does with the runqueues, but it works.
>> Perhaps using crash/gdb more, to actually do something with this data,
>> would give more insight about its utility.
>> For me, it is a prerequisite to run crash, and then to be able to
>> extract the log buffer from the dump.
>
> I have the faint recollection that percpu vars might not be stored in a
> single contiguous physical memory area, but maybe my memory is just
> wrong, that's why I was raising it.
>
>>
>>>
>>>>
>>>> From my perspective it's much simpler and cleaner to just add the
>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>
>>> I really don't like how we are spreading kmemdump all over the kernel,
>>> and adding complexity with __section when really, all we need is a place
>>> to obtain a start and a length.
>>>
>>
>> I understand. The section idea was suggested by Thomas. Initially I was
>> skeptic, but I like how it turned out.
>
> Yeah, I don't like it. Taste differs ;)
>
> I am in particular unhappy about custom memblock wrappers.
>
> [...]
>
>>>>
>>>> To have this working outside of printk, it would be required to walk
>>>> through all the printk structs/allocations and select the required info.
>>>> Is this something that we want to do outside of printk ?
>>>
>>> I don't follow, please elaborate.
>>>
>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>> given that you run your initialization after setup_log_buf() ?
>>>
>>>
>>
>> My initial thought was the same. However I got some feedback from Petr
>> Mladek here :
>>
>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>
>> Where he explained how to register the structs correctly.
>> It can be that setup_log_buf is called again at a later time perhaps.
>>
>
> setup_log_buf() is a __init function, so there is only a certain time
> frame where it can be called.
>
> In particular, once the buddy is up, memblock allocations are impossible
> and it would be deeply flawed to call this function again.
>
> Let's not over-engineer this.
>
> Peter is on CC, so hopefully he can share his thoughts.
>
Hello David,
I tested out this snippet (on top of my series, so you can see what I
changed):
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18ba6c1e174f..7ac4248a00e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -67,7 +67,6 @@
#include <linux/wait_api.h>
#include <linux/workqueue_api.h>
#include <linux/livepatch_sched.h>
-#include <linux/kmemdump.h>
#ifdef CONFIG_PREEMPT_DYNAMIC
# ifdef CONFIG_GENERIC_IRQ_ENTRY
@@ -120,7 +119,12 @@
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
+
+size_t runqueues_get_size(void);
+size_t runqueues_get_size(void)
+{
+ return sizeof(runqueues);
+}
#ifdef CONFIG_SCHED_PROXY_EXEC
DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index d808c5e67f35..c6dd2d6e96dd 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -24,6 +24,12 @@
#include "kallsyms_internal.h"
#include "kexec_internal.h"
+typedef void* kmemdump_opaque_t;
+
+size_t runqueues_get_size(void);
+
+extern kmemdump_opaque_t runqueues;
+
/* vmcoreinfo stuff */
unsigned char *vmcoreinfo_data;
size_t vmcoreinfo_size;
@@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
(void *)vmcoreinfo_data, vmcoreinfo_size);
+ kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
+ (void *)&runqueues, runqueues_get_size());
+
return 0;
}
With this, no more .section, no kmemdump code into sched, however, there
are few things :
First the size function, which is quite dull and doesn't fit into the
sched very much.
Second, having the extern with a different "opaque" type to avoid
exposing the struct rq definition, which is quite hackish.
What do you think ?
My opinion is that it's ugly, but maybe you have some better idea how to
write this nicer ?
( I am also not 100 % sure if I did this the way you wanted).
Thanks for helping out,
Eugen
On 27.08.25 13:59, Eugen Hristev wrote:
>
>
> On 8/25/25 16:58, David Hildenbrand wrote:
>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>
>>>
>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>
>>>>>>
>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>> accesses non-exported symbols.
>>>>>
>>>>> Hello David,
>>>>>
>>>>> I am looking again into this, and there are some things which in my
>>>>> opinion would be difficult to achieve.
>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>> kmemdump.
>>>>>
>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>> sched.
>>>>> Now moving all the struct definition outside of sched.h into another
>>>>> public header would be rather painful and I don't think it's a really
>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>> definitions outside as well. I doubt this is something that we want for
>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>> is to keep these internal structs opaque outside of it.
>>>>
>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>> only tricky part is getting the length.
>>>
>>> I also have in mind the kernel user case. How would a kernel programmer
>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>> enough.
>>
>> The other way around, why should anybody have a saying in adding their
>> data to kmemdump? Why do we have that all over the kernel?
>>
>> Is your mechanism really so special?
>>
>> A single composer should take care of that, and it's really just start +
>> len of physical memory areas.
>>
>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>> etc, and stitch them into kmemdump core.
>>> I am not saying it's impossible, but just tiresome perhaps.
>>
>> In your patch set, how many of these instances did you encounter where
>> that was a problem?
>>
>>>>
>>>> One could just add a const variable that holds this information, or even
>>>> better, a simple helper function to calculate that.
>>>>
>>>> Maybe someone else reading along has a better idea.
>>>
>>> This could work, but it requires again adding some code into the
>>> specific subsystem. E.g. struct_rq_get_size()
>>> I am open to ideas , and thank you very much for your thoughts.
>>>
>>>>
>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>> what you had would work as intended (maybe it does, not sure).
>>>
>>> I would not really need to dump the runqueues. But the crash tool which
>>> I am using for testing, requires it. Without the runqueues it will not
>>> progress further to load the kernel dump.
>>> So I am not really sure what it does with the runqueues, but it works.
>>> Perhaps using crash/gdb more, to actually do something with this data,
>>> would give more insight about its utility.
>>> For me, it is a prerequisite to run crash, and then to be able to
>>> extract the log buffer from the dump.
>>
>> I have the faint recollection that percpu vars might not be stored in a
>> single contiguous physical memory area, but maybe my memory is just
>> wrong, that's why I was raising it.
>>
>>>
>>>>
>>>>>
>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>
>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>> and adding complexity with __section when really, all we need is a place
>>>> to obtain a start and a length.
>>>>
>>>
>>> I understand. The section idea was suggested by Thomas. Initially I was
>>> skeptic, but I like how it turned out.
>>
>> Yeah, I don't like it. Taste differs ;)
>>
>> I am in particular unhappy about custom memblock wrappers.
>>
>> [...]
>>
>>>>>
>>>>> To have this working outside of printk, it would be required to walk
>>>>> through all the printk structs/allocations and select the required info.
>>>>> Is this something that we want to do outside of printk ?
>>>>
>>>> I don't follow, please elaborate.
>>>>
>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>> given that you run your initialization after setup_log_buf() ?
>>>>
>>>>
>>>
>>> My initial thought was the same. However I got some feedback from Petr
>>> Mladek here :
>>>
>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>
>>> Where he explained how to register the structs correctly.
>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>
>>
>> setup_log_buf() is a __init function, so there is only a certain time
>> frame where it can be called.
>>
>> In particular, once the buddy is up, memblock allocations are impossible
>> and it would be deeply flawed to call this function again.
>>
>> Let's not over-engineer this.
>>
>> Peter is on CC, so hopefully he can share his thoughts.
>>
>
> Hello David,
>
> I tested out this snippet (on top of my series, so you can see what I
> changed):
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 18ba6c1e174f..7ac4248a00e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -67,7 +67,6 @@
> #include <linux/wait_api.h>
> #include <linux/workqueue_api.h>
> #include <linux/livepatch_sched.h>
> -#include <linux/kmemdump.h>
>
> #ifdef CONFIG_PREEMPT_DYNAMIC
> # ifdef CONFIG_GENERIC_IRQ_ENTRY
> @@ -120,7 +119,12 @@
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
> +
> +size_t runqueues_get_size(void);
> +size_t runqueues_get_size(void)
> +{
> + return sizeof(runqueues);
> +}
>
> #ifdef CONFIG_SCHED_PROXY_EXEC
> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index d808c5e67f35..c6dd2d6e96dd 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -24,6 +24,12 @@
> #include "kallsyms_internal.h"
> #include "kexec_internal.h"
>
> +typedef void* kmemdump_opaque_t;
> +
> +size_t runqueues_get_size(void);
> +
> +extern kmemdump_opaque_t runqueues;
I would have tried that through:
struct rq;
extern struct rq runqueues;
But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
not the way we would want to handle that.
> /* vmcoreinfo stuff */
> unsigned char *vmcoreinfo_data;
> size_t vmcoreinfo_size;
> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>
> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
> (void *)vmcoreinfo_data, vmcoreinfo_size);
> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
> + (void *)&runqueues, runqueues_get_size());
> +
> return 0;
> }
>
> With this, no more .section, no kmemdump code into sched, however, there
> are few things :
I would really just do here something like the following:
/**
* sched_get_runqueues_area - obtain the runqueues area for dumping
* @start: ...
* @size: ...
*
* The obtained area is only to be used for dumping purposes.
*/
void sched_get_runqueues_area(void **start, size_t *size)
{
	*start = &runqueues;
	*size = sizeof(runqueues);
}
might be cleaner.
Having said that, if you realize that there is a fundamental issue with
what I propose, please speak up.
So far, I feel like there are only limited number of "suboptimal" cases
of this kind, but I might be wrong of course.
--
Cheers
David / dhildenb
On 8/27/25 15:18, David Hildenbrand wrote:
> On 27.08.25 13:59, Eugen Hristev wrote:
>>
>>
>> On 8/25/25 16:58, David Hildenbrand wrote:
>>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>>
>>>>
>>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>>
>>>>>>>
>>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>>> accesses non-exported symbols.
>>>>>>
>>>>>> Hello David,
>>>>>>
>>>>>> I am looking again into this, and there are some things which in my
>>>>>> opinion would be difficult to achieve.
>>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>>> kmemdump.
>>>>>>
>>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>>> sched.
>>>>>> Now moving all the struct definition outside of sched.h into another
>>>>>> public header would be rather painful and I don't think it's a really
>>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>>> definitions outside as well. I doubt this is something that we want for
>>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>>> is to keep these internal structs opaque outside of it.
>>>>>
>>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>>> only tricky part is getting the length.
>>>>
>>>> I also have in mind the kernel user case. How would a kernel programmer
>>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>>> enough.
>>>
>>> The other way around, why should anybody have a saying in adding their
>>> data to kmemdump? Why do we have that all over the kernel?
>>>
>>> Is your mechanism really so special?
>>>
>>> A single composer should take care of that, and it's really just start +
>>> len of physical memory areas.
>>>
>>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>>> etc, and stitch them into kmemdump core.
>>>> I am not saying it's impossible, but just tiresome perhaps.
>>>
>>> In your patch set, how many of these instances did you encounter where
>>> that was a problem?
>>>
>>>>>
>>>>> One could just add a const variable that holds this information, or even
>>>>> better, a simple helper function to calculate that.
>>>>>
>>>>> Maybe someone else reading along has a better idea.
>>>>
>>>> This could work, but it requires again adding some code into the
>>>> specific subsystem. E.g. struct_rq_get_size()
>>>> I am open to ideas , and thank you very much for your thoughts.
>>>>
>>>>>
>>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>>> what you had would work as intended (maybe it does, not sure).
>>>>
>>>> I would not really need to dump the runqueues. But the crash tool which
>>>> I am using for testing, requires it. Without the runqueues it will not
>>>> progress further to load the kernel dump.
>>>> So I am not really sure what it does with the runqueues, but it works.
>>>> Perhaps using crash/gdb more, to actually do something with this data,
>>>> would give more insight about its utility.
>>>> For me, it is a prerequisite to run crash, and then to be able to
>>>> extract the log buffer from the dump.
>>>
>>> I have the faint recollection that percpu vars might not be stored in a
>>> single contiguous physical memory area, but maybe my memory is just
>>> wrong, that's why I was raising it.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>>
>>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>>> and adding complexity with __section when really, all we need is a place
>>>>> to obtain a start and a length.
>>>>>
>>>>
>>>> I understand. The section idea was suggested by Thomas. Initially I was
>>>> skeptic, but I like how it turned out.
>>>
>>> Yeah, I don't like it. Taste differs ;)
>>>
>>> I am in particular unhappy about custom memblock wrappers.
>>>
>>> [...]
>>>
>>>>>>
>>>>>> To have this working outside of printk, it would be required to walk
>>>>>> through all the printk structs/allocations and select the required info.
>>>>>> Is this something that we want to do outside of printk ?
>>>>>
>>>>> I don't follow, please elaborate.
>>>>>
>>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>>> given that you run your initialization after setup_log_buf() ?
>>>>>
>>>>>
>>>>
>>>> My initial thought was the same. However I got some feedback from Petr
>>>> Mladek here :
>>>>
>>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>>
>>>> Where he explained how to register the structs correctly.
>>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>>
>>>
>>> setup_log_buf() is a __init function, so there is only a certain time
>>> frame where it can be called.
>>>
>>> In particular, once the buddy is up, memblock allocations are impossible
>>> and it would be deeply flawed to call this function again.
>>>
>>> Let's not over-engineer this.
>>>
>>> Peter is on CC, so hopefully he can share his thoughts.
>>>
>>
>> Hello David,
>>
>> I tested out this snippet (on top of my series, so you can see what I
>> changed):
>>
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 18ba6c1e174f..7ac4248a00e5 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -67,7 +67,6 @@
>> #include <linux/wait_api.h>
>> #include <linux/workqueue_api.h>
>> #include <linux/livepatch_sched.h>
>> -#include <linux/kmemdump.h>
>>
>> #ifdef CONFIG_PREEMPT_DYNAMIC
>> # ifdef CONFIG_GENERIC_IRQ_ENTRY
>> @@ -120,7 +119,12 @@
>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>>
>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
>> +
>> +size_t runqueues_get_size(void);
>> +size_t runqueues_get_size(void)
>> +{
>> + return sizeof(runqueues);
>> +}
>>
>> #ifdef CONFIG_SCHED_PROXY_EXEC
>> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
>> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
>> index d808c5e67f35..c6dd2d6e96dd 100644
>> --- a/kernel/vmcore_info.c
>> +++ b/kernel/vmcore_info.c
>> @@ -24,6 +24,12 @@
>> #include "kallsyms_internal.h"
>> #include "kexec_internal.h"
>>
>> +typedef void* kmemdump_opaque_t;
>> +
>> +size_t runqueues_get_size(void);
>> +
>> +extern kmemdump_opaque_t runqueues;
>
> I would have tried that through:
>
> struct rq;
> extern struct rq runqueues;
>
> But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
> not the way we would want to handle that.
>
>> /* vmcoreinfo stuff */
>> unsigned char *vmcoreinfo_data;
>> size_t vmcoreinfo_size;
>> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>>
>> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
>> (void *)vmcoreinfo_data, vmcoreinfo_size);
>> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
>> + (void *)&runqueues, runqueues_get_size());
>> +
>> return 0;
>> }
>>
>> With this, no more .section, no kmemdump code into sched, however, there
>> are few things :
>
> I would really just do here something like the following:
>
> /**
> * sched_get_runqueues_area - obtain the runqueues area for dumping
> * @start: ...
> * @size: ...
> *
> * The obtained area is only to be used for dumping purposes.
> */
> void sched_get_runqueues_area(void *start, size_t size)
> {
> start = &runqueues;
> size = sizeof(runqueues);
> }
>
> might be cleaner.
>
How about this in the header:
#define DECLARE_DUMP_AREA_FUNC(subsys, symbol) \
void subsys ## _get_ ## symbol ##_area(void **start, size_t *size);
#define DEFINE_DUMP_AREA_FUNC(subsys, symbol) \
void subsys ## _get_ ## symbol ##_area(void **start, size_t *size)\
{\
*start = &symbol;\
*size = sizeof(symbol);\
}
then, in sched just
DECLARE_DUMP_AREA_FUNC(sched, runqueues);
DEFINE_DUMP_AREA_FUNC(sched, runqueues);
or a single macro that wraps both.
would make it shorter and neater.
What do you think ?
>
> Having said that, if you realize that there is a fundamental issue with
> what I propose, please speak up.
>
> So far, I feel like there are only limited number of "suboptimal" cases
> of this kind, but I might be wrong of course.
>
On 27.08.25 16:08, Eugen Hristev wrote:
>
>
> On 8/27/25 15:18, David Hildenbrand wrote:
>> On 27.08.25 13:59, Eugen Hristev wrote:
>>>
>>>
>>> On 8/25/25 16:58, David Hildenbrand wrote:
>>>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>>>
>>>>>
>>>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>>>
>>>>>>>>
>>>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>>>> accesses non-exported symbols.
>>>>>>>
>>>>>>> Hello David,
>>>>>>>
>>>>>>> I am looking again into this, and there are some things which in my
>>>>>>> opinion would be difficult to achieve.
>>>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>>>> kmemdump.
>>>>>>>
>>>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>>>> sched.
>>>>>>> Now moving all the struct definition outside of sched.h into another
>>>>>>> public header would be rather painful and I don't think it's a really
>>>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>>>> definitions outside as well. I doubt this is something that we want for
>>>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>>>> is to keep these internal structs opaque outside of it.
>>>>>>
>>>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>>>> only tricky part is getting the length.
>>>>>
>>>>> I also have in mind the kernel user case. How would a kernel programmer
>>>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>>>> enough.
>>>>
>>>> The other way around, why should anybody have a saying in adding their
>>>> data to kmemdump? Why do we have that all over the kernel?
>>>>
>>>> Is your mechanism really so special?
>>>>
>>>> A single composer should take care of that, and it's really just start +
>>>> len of physical memory areas.
>>>>
>>>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>>>> etc, and stitch them into kmemdump core.
>>>>> I am not saying it's impossible, but just tiresome perhaps.
>>>>
>>>> In your patch set, how many of these instances did you encounter where
>>>> that was a problem?
>>>>
>>>>>>
>>>>>> One could just add a const variable that holds this information, or even
>>>>>> better, a simple helper function to calculate that.
>>>>>>
>>>>>> Maybe someone else reading along has a better idea.
>>>>>
>>>>> This could work, but it requires again adding some code into the
>>>>> specific subsystem. E.g. struct_rq_get_size()
>>>>> I am open to ideas , and thank you very much for your thoughts.
>>>>>
>>>>>>
>>>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>>>> what you had would work as intended (maybe it does, not sure).
>>>>>
>>>>> I would not really need to dump the runqueues. But the crash tool which
>>>>> I am using for testing, requires it. Without the runqueues it will not
>>>>> progress further to load the kernel dump.
>>>>> So I am not really sure what it does with the runqueues, but it works.
>>>>> Perhaps using crash/gdb more, to actually do something with this data,
>>>>> would give more insight about its utility.
>>>>> For me, it is a prerequisite to run crash, and then to be able to
>>>>> extract the log buffer from the dump.
>>>>
>>>> I have the faint recollection that percpu vars might not be stored in a
>>>> single contiguous physical memory area, but maybe my memory is just
>>>> wrong, that's why I was raising it.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>>>
>>>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>>>> and adding complexity with __section when really, all we need is a place
>>>>>> to obtain a start and a length.
>>>>>>
>>>>>
>>>>> I understand. The section idea was suggested by Thomas. Initially I was
>>>>> skeptic, but I like how it turned out.
>>>>
>>>> Yeah, I don't like it. Taste differs ;)
>>>>
>>>> I am in particular unhappy about custom memblock wrappers.
>>>>
>>>> [...]
>>>>
>>>>>>>
>>>>>>> To have this working outside of printk, it would be required to walk
>>>>>>> through all the printk structs/allocations and select the required info.
>>>>>>> Is this something that we want to do outside of printk ?
>>>>>>
>>>>>> I don't follow, please elaborate.
>>>>>>
>>>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>>>> given that you run your initialization after setup_log_buf() ?
>>>>>>
>>>>>>
>>>>>
>>>>> My initial thought was the same. However I got some feedback from Petr
>>>>> Mladek here :
>>>>>
>>>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>>>
>>>>> Where he explained how to register the structs correctly.
>>>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>>>
>>>>
>>>> setup_log_buf() is a __init function, so there is only a certain time
>>>> frame where it can be called.
>>>>
>>>> In particular, once the buddy is up, memblock allocations are impossible
>>>> and it would be deeply flawed to call this function again.
>>>>
>>>> Let's not over-engineer this.
>>>>
>>>> Peter is on CC, so hopefully he can share his thoughts.
>>>>
>>>
>>> Hello David,
>>>
>>> I tested out this snippet (on top of my series, so you can see what I
>>> changed):
>>>
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 18ba6c1e174f..7ac4248a00e5 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -67,7 +67,6 @@
>>> #include <linux/wait_api.h>
>>> #include <linux/workqueue_api.h>
>>> #include <linux/livepatch_sched.h>
>>> -#include <linux/kmemdump.h>
>>>
>>> #ifdef CONFIG_PREEMPT_DYNAMIC
>>> # ifdef CONFIG_GENERIC_IRQ_ENTRY
>>> @@ -120,7 +119,12 @@
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>>>
>>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>>> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
>>> +
>>> +size_t runqueues_get_size(void);
>>> +size_t runqueues_get_size(void)
>>> +{
>>> + return sizeof(runqueues);
>>> +}
>>>
>>> #ifdef CONFIG_SCHED_PROXY_EXEC
>>> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
>>> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
>>> index d808c5e67f35..c6dd2d6e96dd 100644
>>> --- a/kernel/vmcore_info.c
>>> +++ b/kernel/vmcore_info.c
>>> @@ -24,6 +24,12 @@
>>> #include "kallsyms_internal.h"
>>> #include "kexec_internal.h"
>>>
>>> +typedef void* kmemdump_opaque_t;
>>> +
>>> +size_t runqueues_get_size(void);
>>> +
>>> +extern kmemdump_opaque_t runqueues;
>>
>> I would have tried that through:
>>
>> struct rq;
>> extern struct rq runqueues;
>>
>> But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
>> not the way we would want to handle that.
>>
>>> /* vmcoreinfo stuff */
>>> unsigned char *vmcoreinfo_data;
>>> size_t vmcoreinfo_size;
>>> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>>>
>>> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
>>> (void *)vmcoreinfo_data, vmcoreinfo_size);
>>> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
>>> + (void *)&runqueues, runqueues_get_size());
>>> +
>>> return 0;
>>> }
>>>
>>> With this, no more .section, no kmemdump code into sched, however, there
>>> are few things :
>>
>> I would really just do here something like the following:
>>
>> /**
>> * sched_get_runqueues_area - obtain the runqueues area for dumping
>> * @start: ...
>> * @size: ...
>> *
>> * The obtained area is only to be used for dumping purposes.
>> */
>> void sched_get_runqueues_area(void *start, size_t size)
>> {
>> start = &runqueues;
>> size = sizeof(runqueues);
>> }
>>
>> might be cleaner.
>>
>
> How about this in the header:
>
> #define DECLARE_DUMP_AREA_FUNC(subsys, symbol) \
> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size);
>
> #define DEFINE_DUMP_AREA_FUNC(subsys, symbol) \
> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size)\
> {\
> *start = &symbol;\
> *size = sizeof(symbol);\
> }
>
> then, in sched just
>
> DECLARE_DUMP_AREA_FUNC(sched, runqueues);
> DEFINE_DUMP_AREA_FUNC(sched, runqueues);
>
> or a single macro that wraps both.
>
> would make it shorter and neater.
>
> What do you think ?
Looks a bit over-engineered, and will require us to import a header
(likely kmemdump.h) in these files, which I don't really enjoy.
I would start simple, without any such macro-magic. It's a very simple
function after all, and likely you won't end up having many of these?
--
Cheers
David / dhildenb
On 8/27/25 23:06, David Hildenbrand wrote:
> On 27.08.25 16:08, Eugen Hristev wrote:
>>
>>
>> On 8/27/25 15:18, David Hildenbrand wrote:
>>> On 27.08.25 13:59, Eugen Hristev wrote:
>>>>
>>>>
>>>> On 8/25/25 16:58, David Hildenbrand wrote:
>>>>> On 25.08.25 15:36, Eugen Hristev wrote:
>>>>>>
>>>>>>
>>>>>> On 8/25/25 16:20, David Hildenbrand wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> IIRC, kernel/vmcore_info.c is never built as a module, as it also
>>>>>>>>> accesses non-exported symbols.
>>>>>>>>
>>>>>>>> Hello David,
>>>>>>>>
>>>>>>>> I am looking again into this, and there are some things which in my
>>>>>>>> opinion would be difficult to achieve.
>>>>>>>> For example I looked into my patch #11 , which adds the `runqueues` into
>>>>>>>> kmemdump.
>>>>>>>>
>>>>>>>> The runqueues is a variable of `struct rq` which is defined in
>>>>>>>> kernel/sched/sched.h , which is not supposed to be included outside of
>>>>>>>> sched.
>>>>>>>> Now moving all the struct definition outside of sched.h into another
>>>>>>>> public header would be rather painful and I don't think it's a really
>>>>>>>> good option (The struct would be needed to compute the sizeof inside
>>>>>>>> vmcoreinfo). Secondly, it would also imply moving all the nested struct
>>>>>>>> definitions outside as well. I doubt this is something that we want for
>>>>>>>> the sched subsys. How the subsys is designed, out of my understanding,
>>>>>>>> is to keep these internal structs opaque outside of it.
>>>>>>>
>>>>>>> All the kmemdump module needs is a start and a length, correct? So the
>>>>>>> only tricky part is getting the length.
>>>>>>
>>>>>> I also have in mind the kernel user case. How would a kernel programmer
>>>>>> want to add some kernel structs/info/buffers into kmemdump such that the
>>>>>> dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
>>>>>> enough.
>>>>>
>>>>> The other way around, why should anybody have a saying in adding their
>>>>> data to kmemdump? Why do we have that all over the kernel?
>>>>>
>>>>> Is your mechanism really so special?
>>>>>
>>>>> A single composer should take care of that, and it's really just start +
>>>>> len of physical memory areas.
>>>>>
>>>>>> Otherwise maybe the programmer has to write helpers to compute lengths
>>>>>> etc, and stitch them into kmemdump core.
>>>>>> I am not saying it's impossible, but just tiresome perhaps.
>>>>>
>>>>> In your patch set, how many of these instances did you encounter where
>>>>> that was a problem?
>>>>>
>>>>>>>
>>>>>>> One could just add a const variable that holds this information, or even
>>>>>>> better, a simple helper function to calculate that.
>>>>>>>
>>>>>>> Maybe someone else reading along has a better idea.
>>>>>>
>>>>>> This could work, but it requires again adding some code into the
>>>>>> specific subsystem. E.g. struct_rq_get_size()
>>>>>> I am open to ideas , and thank you very much for your thoughts.
>>>>>>
>>>>>>>
>>>>>>> Interestingly, runqueues is a percpu variable, which makes me wonder if
>>>>>>> what you had would work as intended (maybe it does, not sure).
>>>>>>
>>>>>> I would not really need to dump the runqueues. But the crash tool which
>>>>>> I am using for testing, requires it. Without the runqueues it will not
>>>>>> progress further to load the kernel dump.
>>>>>> So I am not really sure what it does with the runqueues, but it works.
>>>>>> Perhaps using crash/gdb more, to actually do something with this data,
>>>>>> would give more insight about its utility.
>>>>>> For me, it is a prerequisite to run crash, and then to be able to
>>>>>> extract the log buffer from the dump.
>>>>>
>>>>> I have the faint recollection that percpu vars might not be stored in a
>>>>> single contiguous physical memory area, but maybe my memory is just
>>>>> wrong, that's why I was raising it.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> From my perspective it's much simpler and cleaner to just add the
>>>>>>>> kmemdump annotation macro inside the sched/core.c as it's done in my
>>>>>>>> patch. This macro translates to a noop if kmemdump is not selected.
>>>>>>>
>>>>>>> I really don't like how we are spreading kmemdump all over the kernel,
>>>>>>> and adding complexity with __section when really, all we need is a place
>>>>>>> to obtain a start and a length.
>>>>>>>
>>>>>>
>>>>>> I understand. The section idea was suggested by Thomas. Initially I was
>>>>>> skeptic, but I like how it turned out.
>>>>>
>>>>> Yeah, I don't like it. Taste differs ;)
>>>>>
>>>>> I am in particular unhappy about custom memblock wrappers.
>>>>>
>>>>> [...]
>>>>>
>>>>>>>>
>>>>>>>> To have this working outside of printk, it would be required to walk
>>>>>>>> through all the printk structs/allocations and select the required info.
>>>>>>>> Is this something that we want to do outside of printk ?
>>>>>>>
>>>>>>> I don't follow, please elaborate.
>>>>>>>
>>>>>>> How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
>>>>>>> given that you run your initialization after setup_log_buf() ?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> My initial thought was the same. However I got some feedback from Petr
>>>>>> Mladek here :
>>>>>>
>>>>>> https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/
>>>>>>
>>>>>> Where he explained how to register the structs correctly.
>>>>>> It can be that setup_log_buf is called again at a later time perhaps.
>>>>>>
>>>>>
>>>>> setup_log_buf() is a __init function, so there is only a certain time
>>>>> frame where it can be called.
>>>>>
>>>>> In particular, once the buddy is up, memblock allocations are impossible
>>>>> and it would be deeply flawed to call this function again.
>>>>>
>>>>> Let's not over-engineer this.
>>>>>
>>>>> Peter is on CC, so hopefully he can share his thoughts.
>>>>>
>>>>
>>>> Hello David,
>>>>
>>>> I tested out this snippet (on top of my series, so you can see what I
>>>> changed):
>>>>
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 18ba6c1e174f..7ac4248a00e5 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -67,7 +67,6 @@
>>>> #include <linux/wait_api.h>
>>>> #include <linux/workqueue_api.h>
>>>> #include <linux/livepatch_sched.h>
>>>> -#include <linux/kmemdump.h>
>>>>
>>>> #ifdef CONFIG_PREEMPT_DYNAMIC
>>>> # ifdef CONFIG_GENERIC_IRQ_ENTRY
>>>> @@ -120,7 +119,12 @@
>>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>>>>
>>>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>>>> -KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
>>>> +
>>>> +size_t runqueues_get_size(void);
>>>> +size_t runqueues_get_size(void)
>>>> +{
>>>> + return sizeof(runqueues);
>>>> +}
>>>>
>>>> #ifdef CONFIG_SCHED_PROXY_EXEC
>>>> DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
>>>> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
>>>> index d808c5e67f35..c6dd2d6e96dd 100644
>>>> --- a/kernel/vmcore_info.c
>>>> +++ b/kernel/vmcore_info.c
>>>> @@ -24,6 +24,12 @@
>>>> #include "kallsyms_internal.h"
>>>> #include "kexec_internal.h"
>>>>
>>>> +typedef void* kmemdump_opaque_t;
>>>> +
>>>> +size_t runqueues_get_size(void);
>>>> +
>>>> +extern kmemdump_opaque_t runqueues;
>>>
>>> I would have tried that through:
>>>
>>> struct rq;
>>> extern struct rq runqueues;
>>>
>>> But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
>>> not the way we would want to handle that.
>>>
>>>> /* vmcoreinfo stuff */
>>>> unsigned char *vmcoreinfo_data;
>>>> size_t vmcoreinfo_size;
>>>> @@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>>>>
>>>> kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
>>>> (void *)vmcoreinfo_data, vmcoreinfo_size);
>>>> + kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
>>>> + (void *)&runqueues, runqueues_get_size());
>>>> +
>>>> return 0;
>>>> }
>>>>
>>>> With this, no more .section, no kmemdump code into sched, however, there
>>>> are few things :
>>>
>>> I would really just do here something like the following:
>>>
>>> /**
>>> * sched_get_runqueues_area - obtain the runqueues area for dumping
>>> * @start: ...
>>> * @size: ...
>>> *
>>> * The obtained area is only to be used for dumping purposes.
>>> */
>>> void sched_get_runqueues_area(void *start, size_t size)
>>> {
>>> start = &runqueues;
>>> size = sizeof(runqueues);
>>> }
>>>
>>> might be cleaner.
>>>
>>
>> How about this in the header:
>>
>> #define DECLARE_DUMP_AREA_FUNC(subsys, symbol) \
>> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size);
>>
>> #define DEFINE_DUMP_AREA_FUNC(subsys, symbol) \
>> void subsys ## _get_ ## symbol ##_area(void **start, size_t *size)\
>> {\
>> *start = &symbol;\
>> *size = sizeof(symbol);\
>> }
>>
>> then, in sched just
>>
>> DECLARE_DUMP_AREA_FUNC(sched, runqueues);
>> DEFINE_DUMP_AREA_FUNC(sched, runqueues);
>>
>> or a single macro that wraps both.
>>
>> would make it shorter and neater.
>>
>> What do you think ?
>
> Looks a bit over-engineered, and will require us to import a header
> (likely kmemdump.h) in these files, which I don't really enjoy.
>
> I would start simple, without any such macro-magic. It's a very simple
> function after all, and likely you won't end up having many of these?
>
Thanks David, I will do it as you suggested and see what comes out of it.
I have one side question you might know much better to answer:
As we have a start and a size for each region, this start is a virtual
address. The firmware/coprocessor that reads the memory and dumps it,
requires physical addresses. What do you suggest to use to retrieve that
address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
better lm_alias ?
As kmemdump is agnostic of the region of the memory the `start` comes
from, and it should be portable and platform independent.
Thanks again,
Eugen
>>> What do you think ?
>>
>> Looks a bit over-engineered, and will require us to import a header
>> (likely kmemdump.h) in these files, which I don't really enjoy.
>>
>> I would start simple, without any such macro-magic. It's a very simple
>> function after all, and likely you won't end up having many of these?
>>
>
> Thanks David, I will do it as you suggested and see what comes out of it.
>
> I have one side question you might know much better to answer:
> As we have a start and a size for each region, this start is a virtual
> address. The firmware/coprocessor that reads the memory and dumps it,
> requires physical addresses.

Right. I was asking myself the same question while reviewing: should we
directly export physical ranges here instead of virtual ones. I guess
virtual ones is ok.

> What do you suggest to use to retrieve that
> address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
> better lm_alias ?

All areas should either come from memblock or be global variables, right?

IIRC, virt_to_phys() should work for these. Did you run into any
problems with them or why do you think virt_to_phys could be problematic?

--
Cheers

David / dhildenb
On 9/1/25 13:01, David Hildenbrand wrote:
>>>> What do you think ?
>>>
>>> Looks a bit over-engineered, and will require us to import a header
>>> (likely kmemdump.h) in these files, which I don't really enjoy.
>>>
>>> I would start simple, without any such macro-magic. It's a very simple
>>> function after all, and likely you won't end up having many of these?
>>>
>>
>> Thanks David, I will do it as you suggested and see what comes out of it.
>>
>> I have one side question you might know much better to answer:
>> As we have a start and a size for each region, this start is a virtual
>> address. The firmware/coprocessor that reads the memory and dumps it,
>> requires physical addresses.
>
> Right. I was asking myself the same question while reviewing: should we
> directly export physical ranges here instead of virtual ones. I guess
> virtual ones is ok.

In patch 22/29, some areas are registered using
memblock_phys_alloc_try_nid() which allocates physical.
In this case, phys_to_virt() didn't work for me, it was returning a
wrong address. I used __va() and this worked. So there is a difference
between them.

>
> What do you suggest to use to retrieve that
>> address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
>> better lm_alias ?
>
> All areas should either come from memblock or be global variables, right?

I would like to be able to register from anywhere. For example someone
debugging their driver, to just register kmalloc'ed struct.
Other use case is to register dma coherent CMA areas.

>
> IIRC, virt_to_phys() should work for these. Did you run into any
> problems with them or why do you think virt_to_phys could be problematic?
>

I am pondering about whether it would work in all cases, considering
its source code comments that it shall not be used because it does not
work for any address.

Someone also reported its unavailability like this:

drivers/debug/kmemdump_coreimage.c:67:24: error: call to undeclared
function 'virt_to_phys'; ISO C99 and later do not support implicit
function declarations [-Wimplicit-function-declaration]

I am yet to figure out which config fails.
On 01.09.25 14:02, Eugen Hristev wrote:
> On 9/1/25 13:01, David Hildenbrand wrote:
>>>>> What do you think ?
>>>>
>>>> Looks a bit over-engineered, and will require us to import a header
>>>> (likely kmemdump.h) in these files, which I don't really enjoy.
>>>>
>>>> I would start simple, without any such macro-magic. It's a very simple
>>>> function after all, and likely you won't end up having many of these?
>>>>
>>>
>>> Thanks David, I will do it as you suggested and see what comes out of it.
>>>
>>> I have one side question you might know much better to answer:
>>> As we have a start and a size for each region, this start is a virtual
>>> address. The firmware/coprocessor that reads the memory and dumps it,
>>> requires physical addresses.
>>
>> Right. I was asking myself the same question while reviewing: should we
>> directly export physical ranges here instead of virtual ones. I guess
>> virtual ones is ok.
>
> In patch 22/29, some areas are registered using
> memblock_phys_alloc_try_nid() which allocates physical.
> In this case, phys_to_virt() didn't work for me, it was returning a
> wrong address. I used __va() and this worked. So there is a difference
> between them.

memblock_alloc_internal() calls memblock_alloc_range_nid() to then
perform a phys_to_virt().

memblock_phys_alloc_try_nid() calls memblock_alloc_range_nid() without
the phys_to_virt().

So it's rather surprising that a phys_to_virt() would not work in that case.

Maybe for these cases where you export the area through a new helper,
you can just export the physical addr + length instead. Then, it's also
clear that this area is actually physically contiguous.

>>
>> What do you suggest to use to retrieve that
>>> address ? virt_to_phys might be problematic, __pa or __pa_symbol? or
>>> better lm_alias ?
>>
>> All areas should either come from memblock or be global variables, right?
>
> I would like to be able to register from anywhere. For example someone
> debugging their driver, to just register kmalloc'ed struct.
> Other use case is to register dma coherent CMA areas.

Then probably better to export physical addresses (that you need either
way) directly from the helpers you have to add.

>
>>
>> IIRC, virt_to_phys() should work for these. Did you run into any
>> problems with them or why do you think virt_to_phys could be problematic?
>>
>
> I am pondering about whether it would work in all cases, considering
> its source code comments that it shall not be used because it does not
> work for any address.

Yeah, it does for example not work for kernel stacks IIRC.

--
Cheers

David / dhildenb
On 04.08.25 12:54, Michal Hocko wrote:
> On Wed 30-07-25 16:04:28, David Hildenbrand wrote:
>> On 30.07.25 15:57, Eugen Hristev wrote:
> [...]
>>> Yes, registering after is also an option. Initially this is how I
>>> designed the kmemdump API, I also had in mind to add a flag, but, after
>>> discussing with Thomas Gleixner, he came up with the macro wrapper idea
>>> here:
>>> https://lore.kernel.org/lkml/87ikkzpcup.ffs@tglx/
>>> Do you think we can continue that discussion, or maybe start it here ?
>>
>> Yeah, I don't like that, but I can see how we ended up here.
>>
>> I also don't quite like the idea that we must encode here what to
>> include in a dump and what not ...
>>
>> For the vmcore we construct it at runtime in crash_save_vmcoreinfo_init(),
>> where we e.g., have
>>
>> VMCOREINFO_STRUCT_SIZE(pglist_data);
>>
>> Could we similarly have some place where we construct what to dump,
>> just not using the current values, but the memory ranges?
>
> All those symbols are part of kallsyms, right? Can we just use kallsyms
> infrastructure and a list of symbols to get what we need from there?
>
> In other words the list of symbols to be completely external to the code
> that is defining them?

That was the idea. All we should need is the start+size of the ranges.

No need to have these kmemdump specifics all over the kernel.

--
Cheers,

David / dhildenb