Large cores may be truncated in some scenarios, such as with daemons
with stop timeouts that are not large enough or lack of disk space. This
impacts debuggability with large core dumps since critical information
necessary to form a usable backtrace, such as stacks and shared library
information, is omitted.
We attempted to figure out which VMAs are needed to create a useful
backtrace, and it turned out to be a non-trivial problem. Instead, we
try simply sorting the VMAs by size, which has the intended effect.
By sorting VMAs by dump size and dumping in that order, we have a
simple, yet effective heuristic.
Signed-off-by: Brian Mak <makb@juniper.net>
---
Hi all,
Still need to run rr tests on this, per Kees Cook's suggestion, will
update back once done. GDB and readelf show that this patch works
without issue though.
Thanks,
Brian Mak
v3: Edited commit message to better convey alternative solution as
non-trivial
Moved sorting logic to fs/coredump.c to make it in place
Above edits suggested by Eric Biederman <ebiederm@xmission.com>
v2: Edited commit message to include more reasoning for sorting VMAs
Removed conditional VMA sorting with debugfs knob
Above edits suggested by Eric Biederman <ebiederm@xmission.com>
fs/coredump.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/fs/coredump.c b/fs/coredump.c
index 7f12ff6ad1d3..33c5ac53ab31 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -18,6 +18,7 @@
#include <linux/personality.h>
#include <linux/binfmts.h>
#include <linux/coredump.h>
+#include <linux/sort.h>
#include <linux/sched/coredump.h>
#include <linux/sched/signal.h>
#include <linux/sched/task_stack.h>
@@ -1191,6 +1192,18 @@ static void free_vma_snapshot(struct coredump_params *cprm)
}
}
+static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
+{
+ const struct core_vma_metadata *vma_meta_lhs = vma_meta_lhs_ptr;
+ const struct core_vma_metadata *vma_meta_rhs = vma_meta_rhs_ptr;
+
+ if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
+ return -1;
+ if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
+ return 1;
+ return 0;
+}
+
/*
* Under the mmap_lock, take a snapshot of relevant information about the task's
* VMAs.
@@ -1253,5 +1266,8 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
cprm->vma_data_size += m->dump_size;
}
+ sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
+ cmp_vma_size, NULL);
+
return true;
}
base-commit: eb5e56d1491297e0881c95824e2050b7c205f0d4
--
2.25.1
Hey Brian and folks

> […]
> backtrace, and it turned out to be a non-trivial problem. Instead, we
> try simply sorting the VMAs by size, which has the intended effect.
> […]
> Still need to run rr tests on this, per Kees Cook's suggestion, will
> update back once done. GDB and readelf show that this patch works
> without issue though.

I think in your testing, you probably did not try the eu-stack tool
from the elfutils package, because I think I found a bug:

Current elfutils cannot symbolize core dumps created by Linux 6.12+.
I noticed this because systemd-coredump(8) uses elfutils, and when
a program crashed on my machine, syslog did not show function names.

I reported this issue with elfutils at:
https://sourceware.org/bugzilla/show_bug.cgi?id=32713

…but figured it would be good to give a heads-up here, too.

Is this breakage sufficient reason to revert the commit?
Or are we saying userspace just needs to be updated to cope?

Thanks
Best regards
Michael
On Feb 18, 2025, at 12:54 AM, Michael Stapelberg <michael@stapelberg.ch> wrote:

> I think in your testing, you probably did not try the eu-stack tool
> from the elfutils package, because I think I found a bug:

Hi Michael,

Thanks for the report. I can confirm that this issue does seem to be
from this commit. I tested it with Juniper's Linux kernel with and
without the changes.

You're correct that the original testing done did not include the
eu-stack tool.

> Current elfutils cannot symbolize core dumps created by Linux 6.12+.
> I noticed this because systemd-coredump(8) uses elfutils, and when
> a program crashed on my machine, syslog did not show function names.
>
> I reported this issue with elfutils at:
> https://urldefense.com/v3/__https://sourceware.org/bugzilla/show_bug.cgi?id=32713__;!!NEt6yMaO-gk!DbttKuHxkBdrV4Cj9axM3ED6mlBHXeQGY3NVzvfDlthl-K39e9QIrZcwT8iCXLRu0OivWRGgficcD-aCuus$
> …but figured it would be good to give a heads-up here, too.
>
> Is this breakage sufficient reason to revert the commit?
> Or are we saying userspace just needs to be updated to cope?

The way I see it is that, as long as we're in compliance with the
applicable ELF specifications, then the issue lies with userspace apps
to ensure that they are not making additional erroneous assumptions.

However, Eric mentioned a while ago in v1 of this patch that he believes
that the ELF specification requires program headers be written in memory
order. Digging through the ELF specifications, I found that any loadable
segment entries in the program header table must be sorted on the
virtual address of the first byte of which the segment resides in
memory.

This indicates that we have deviated from the ELF specification with
this commit. One thing we can do to remedy this is to have program
headers sorted according to the specification, but then continue dumping
in VMA size ordering. This would make the dumping logic significantly
more complex though.

Seeing how most popular userspace apps, with the exception of eu-stack,
seem to work, we could also just leave it, and tell userspace apps to
fix it on their end.

Eric and Kees, thoughts? I'm open to going either way.

Best,
Brian
On Tue 18-02-25 19:53:51, Brian Mak wrote:
> On Feb 18, 2025, at 12:54 AM, Michael Stapelberg <michael@stapelberg.ch> wrote:
>
> > I think in your testing, you probably did not try the eu-stack tool
> > from the elfutils package, because I think I found a bug:
>
> Hi Michael,
>
> Thanks for the report. I can confirm that this issue does seem to be
> from this commit. I tested it with Juniper's Linux kernel with and
> without the changes.
>
> You're correct that the original testing done did not include the
> eu-stack tool.
>
> > Current elfutils cannot symbolize core dumps created by Linux 6.12+.
> > I noticed this because systemd-coredump(8) uses elfutils, and when
> > a program crashed on my machine, syslog did not show function names.
> >
> > I reported this issue with elfutils at:
> > https://urldefense.com/v3/__https://sourceware.org/bugzilla/show_bug.cgi?id=32713__;!!NEt6yMaO-gk!DbttKuHxkBdrV4Cj9axM3ED6mlBHXeQGY3NVzvfDlthl-K39e9QIrZcwT8iCXLRu0OivWRGgficcD-aCuus$
> > …but figured it would be good to give a heads-up here, too.
> >
> > Is this breakage sufficient reason to revert the commit?
> > Or are we saying userspace just needs to be updated to cope?
>
> The way I see it is that, as long as we're in compliance with the
> applicable ELF specifications, then the issue lies with userspace apps
> to ensure that they are not making additional erroneous assumptions.
>
> However, Eric mentioned a while ago in v1 of this patch that he believes
> that the ELF specification requires program headers be written in memory
> order. Digging through the ELF specifications, I found that any loadable
> segment entries in the program header table must be sorted on the
> virtual address of the first byte of which the segment resides in
> memory.
>
> This indicates that we have deviated from the ELF specification with
> this commit. One thing we can do to remedy this is to have program
> headers sorted according to the specification, but then continue dumping
> in VMA size ordering. This would make the dumping logic significantly
> more complex though.
>
> Seeing how most popular userspace apps, with the exception of eu-stack,
> seem to work, we could also just leave it, and tell userspace apps to
> fix it on their end.

Well, it does not seem eu-stack is that unpopular and we really try
hard to avoid user visible regressions. So I think we should revert the
change. Also the fact that the patch breaks ELF spec is an indication
there may be other tools that would get confused by this and another
reason for a revert...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Feb 19, 2025 at 05:20:17PM +0100, Jan Kara wrote:
> On Tue 18-02-25 19:53:51, Brian Mak wrote:
> > On Feb 18, 2025, at 12:54 AM, Michael Stapelberg <michael@stapelberg.ch> wrote:
> >
> > > I think in your testing, you probably did not try the eu-stack tool
> > > from the elfutils package, because I think I found a bug:
> >
> > Hi Michael,
> >
> > Thanks for the report. I can confirm that this issue does seem to be
> > from this commit. I tested it with Juniper's Linux kernel with and
> > without the changes.
> >
> > You're correct that the original testing done did not include the
> > eu-stack tool.
> >
> > > Current elfutils cannot symbolize core dumps created by Linux 6.12+.
> > > I noticed this because systemd-coredump(8) uses elfutils, and when
> > > a program crashed on my machine, syslog did not show function names.
> > >
> > > I reported this issue with elfutils at:
> > > https://urldefense.com/v3/__https://sourceware.org/bugzilla/show_bug.cgi?id=32713__;!!NEt6yMaO-gk!DbttKuHxkBdrV4Cj9axM3ED6mlBHXeQGY3NVzvfDlthl-K39e9QIrZcwT8iCXLRu0OivWRGgficcD-aCuus$
> > > …but figured it would be good to give a heads-up here, too.
> > >
> > > Is this breakage sufficient reason to revert the commit?
> > > Or are we saying userspace just needs to be updated to cope?
> >
> > The way I see it is that, as long as we're in compliance with the
> > applicable ELF specifications, then the issue lies with userspace apps
> > to ensure that they are not making additional erroneous assumptions.
> >
> > However, Eric mentioned a while ago in v1 of this patch that he believes
> > that the ELF specification requires program headers be written in memory
> > order. Digging through the ELF specifications, I found that any loadable
> > segment entries in the program header table must be sorted on the
> > virtual address of the first byte of which the segment resides in
> > memory.
> >
> > This indicates that we have deviated from the ELF specification with
> > this commit. One thing we can do to remedy this is to have program
> > headers sorted according to the specification, but then continue dumping
> > in VMA size ordering. This would make the dumping logic significantly
> > more complex though.
> >
> > Seeing how most popular userspace apps, with the exception of eu-stack,
> > seem to work, we could also just leave it, and tell userspace apps to
> > fix it on their end.
>
> Well, it does not seem eu-stack is that unpopular and we really try hard to
> avoid user visible regressions. So I think we should revert the change. Also
> the fact that the patch breaks ELF spec is an indication there may be other
> tools that would get confused by this and another reason for a revert...
Yeah, I think we need to make this a tunable. Updating the kernel breaks
elfutils, which isn't some weird custom corner case. :P
So, while it took a few months, here is a report of breakage that I said
we'd need to watch for[1]. :)
Is anyone able to test this patch? And Brian, will setting a sysctl be
okay for your use-case?
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index a43b78b4b646..35d5d86cff69 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -222,6 +222,17 @@ and ``core_uses_pid`` is set, then .PID will be appended to
the filename.
+core_sort_vma
+=============
+
+The default coredump writes VMAs in address order. By setting
+``core_sort_vma`` to 1, VMAs will be written from smallest size
+to largest size. This is known to break at least elfutils, but
+can be handy when dealing with very large (and truncated)
+coredumps where the more useful debugging details are included
+in the smaller VMAs.
+
+
ctrl-alt-del
============
diff --git a/fs/coredump.c b/fs/coredump.c
index 591700e1b2ce..4375c70144d0 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -63,6 +63,7 @@ static void free_vma_snapshot(struct coredump_params *cprm);
static int core_uses_pid;
static unsigned int core_pipe_limit;
+static unsigned int core_sort_vma;
static char core_pattern[CORENAME_MAX_SIZE] = "core";
static int core_name_size = CORENAME_MAX_SIZE;
unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
@@ -1026,6 +1027,15 @@ static const struct ctl_table coredump_sysctls[] = {
.extra1 = (unsigned int *)&core_file_note_size_min,
.extra2 = (unsigned int *)&core_file_note_size_max,
},
+ {
+ .procname = "core_sort_vma",
+ .data = &core_sort_vma,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
};
static int __init init_fs_coredump_sysctls(void)
@@ -1256,8 +1266,9 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
cprm->vma_data_size += m->dump_size;
}
- sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
- cmp_vma_size, NULL);
+ if (core_sort_vma)
+ sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
+ cmp_vma_size, NULL);
return true;
}
-Kees
[1] https://lore.kernel.org/all/202408092104.FCE51021@keescook/
--
Kees Cook
On Wed, 19 Feb 2025 at 11:52, Kees Cook <kees@kernel.org> wrote:
>
> Yeah, I think we need to make this a tunable. Updating the kernel breaks
> elfutils, which isn't some weird custom corner case. :P
I wonder if we could also make the default be "no sorting" if the
vma's are all fairly small...
IOW, only trigger the new behavior when it actually *matters*.
We already have the code to count how big the core dump is, it's that
cprm->vma_data_size += m->dump_size;
in dump_vma_snapshot() thing, so I think this could all basically be a
one-liner that does the sort() call only if that vma_data_size is
larger than the core-dump limit, or something like that?
That way, the normal case could basically work for everybody, and the
system tunable would be only for people who want to force a certain
situation.
Something trivial like this (ENTIRELY UNTESTED) patch, perhaps:
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1256,6 +1256,10 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
cprm->vma_data_size += m->dump_size;
}
+ /* Only sort the vmas by size if they don't all fit in the core dump */
+ if (cprm->vma_data_size < cprm->limit)
+ return true;
+
sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
cmp_vma_size, NULL);
Hmm?
Linus
On Wed, Feb 19, 2025 at 04:39:41PM -0800, Linus Torvalds wrote:
> On Wed, 19 Feb 2025 at 11:52, Kees Cook <kees@kernel.org> wrote:
> >
> > Yeah, I think we need to make this a tunable. Updating the kernel breaks
> > elfutils, which isn't some weird custom corner case. :P
>
> I wonder if we could also make the default be "no sorting" if the
> vma's are all fairly small...
>
> IOW, only trigger the new behavior when it actually *matters*.
>
> We already have the code to count how big the core dump is, it's that
>
> cprm->vma_data_size += m->dump_size;
>
> in dump_vma_snapshot() thing, so I think this could all basically be a
> one-liner that does the sort() call only if that vma_data_size is
> larger than the core-dump limit, or something like that?
>
> That way, the normal case could basically work for everybody, and the
> system tunable would be only for people who want to force a certain
> situation.
>
> Something trivial like this (ENTIRELY UNTESTED) patch, perhaps:
>
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -1256,6 +1256,10 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
> cprm->vma_data_size += m->dump_size;
> }
>
> + /* Only sort the vmas by size if they don't all fit in the core dump */
> + if (cprm->vma_data_size < cprm->limit)
> + return true;
> +
> sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
> cmp_vma_size, NULL);
>
> Hmm?

Oh! That's a good idea. In theory, a truncated dump is going to be
traditionally "unusable", so a sort shouldn't hurt tools that are
expecting a complete dump.

Brian, are you able to test this for your case?

-Kees

--
Kees Cook
On Feb 19, 2025, at 5:36 PM, Kees Cook <kees@kernel.org> wrote:
>> We already have the code to count how big the core dump is, it's that
>>
>> cprm->vma_data_size += m->dump_size;
>>
>> in dump_vma_snapshot() thing, so I think this could all basically be a
>> one-liner that does the sort() call only if that vma_data_size is
>> larger than the core-dump limit, or something like that?
...
> Oh! That's a good idea. In theory, a truncated dump is going to be
> traditionally "unusable", so a sort shouldn't hurt tools that are
> expecting a complete dump.
>
> Brian, are you able to test this for your case?
Hi Kees and Linus,
I like the idea, but the suggested patch seems to have issues in
practice.
First, the vma_data_size does not include the ELF header, program header
table, notes, etc. That's not a terribly big issue though. We can live
with that since it makes the estimated core dump size *smaller*. An even
bigger problem is that the vma_data_size doesn't take into account the
sparseness of the core dump, while the core dump size limit does.
If a generated core dump is very sparse, the vma_data_size will be much
larger than the actual size on disk of the core dump, triggering the
sorting logic earlier than expected.
One thing we can do though is to iterate through the pages for all VMAs
and see if get_dump_page() returns NULL. Then, we use that information
to calculate a more accurate predicted core dump size.
Patch is below. Thoughts?
Best,
Brian
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index a43b78b4b646..dd49a89a62d3 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -212,6 +212,17 @@ pid>/``).
This value defaults to 0.
+core_sort_vma
+=============
+
+The default coredump writes VMAs in address order. By setting
+``core_sort_vma`` to 1, VMAs will be written from smallest size
+to largest size. This is known to break at least elfutils, but
+can be handy when dealing with very large (and truncated)
+coredumps where the more useful debugging details are included
+in the smaller VMAs.
+
+
core_uses_pid
=============
diff --git a/fs/coredump.c b/fs/coredump.c
index 591700e1b2ce..496cc7234aa7 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -63,6 +63,7 @@ static void free_vma_snapshot(struct coredump_params *cprm);
static int core_uses_pid;
static unsigned int core_pipe_limit;
+static unsigned int core_sort_vma;
static char core_pattern[CORENAME_MAX_SIZE] = "core";
static int core_name_size = CORENAME_MAX_SIZE;
unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
@@ -1026,6 +1027,15 @@ static const struct ctl_table coredump_sysctls[] = {
.extra1 = (unsigned int *)&core_file_note_size_min,
.extra2 = (unsigned int *)&core_file_note_size_max,
},
+ {
+ .procname = "core_sort_vma",
+ .data = &core_sort_vma,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
};
static int __init init_fs_coredump_sysctls(void)
@@ -1204,6 +1214,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
struct mm_struct *mm = current->mm;
VMA_ITERATOR(vmi, mm, 0);
int i = 0;
+ size_t sparse_vma_dump_size = 0;
/*
* Once the stack expansion code is fixed to not change VMA bounds
@@ -1241,6 +1252,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
for (i = 0; i < cprm->vma_count; i++) {
struct core_vma_metadata *m = cprm->vma_meta + i;
+ unsigned long addr;
if (m->dump_size == DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER) {
char elfmag[SELFMAG];
@@ -1254,10 +1266,27 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
}
cprm->vma_data_size += m->dump_size;
+ sparse_vma_dump_size += m->dump_size;
+
+ /* Subtract zero pages from the sparse_vma_dump_size. */
+ for (addr = m->start; addr < m->start + m->dump_size; addr += PAGE_SIZE) {
+ struct page *page = get_dump_page(addr);
+
+ if (!page)
+ sparse_vma_dump_size -= PAGE_SIZE;
+ else
+ put_page(page);
+ }
}
- sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
- cmp_vma_size, NULL);
+ /*
+ * Only sort the vmas by size if:
+ * a) the sysctl is set to do so, or
+ * b) the vmas don't all fit within the core dump size limit.
+ */
+ if (core_sort_vma || sparse_vma_dump_size >= cprm->limit)
+ sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
+ cmp_vma_size, NULL);
return true;
}
On Thu, Feb 20, 2025 at 10:59:06PM +0000, Brian Mak wrote:
> One thing we can do though is to iterate through the pages for all VMAs
> and see if get_dump_page() returns NULL. Then, we use that information
> to calculate a more accurate predicted core dump size.
>
> Patch is below. Thoughts?

I've pushed this to -next for a few days of testing, and if it's all
good, I'll send it to Linus next week for -rc5 (and -stable).

https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-linus/execve&id=ff41385709f01519a97379ce7671ee4e91e301e1

-Kees

--
Kees Cook
On Feb 19, 2025, at 11:52 AM, Kees Cook <kees@kernel.org> wrote:

> Is anyone able to test this patch? And Brian, will setting a sysctl be
> okay for your use-case?

Hi Kees,

I've verified that the sysctl tunable works as expected. readelf is
showing that the VMAs are unsorted by default, with the tunable set to
0. When the tunable is set to 1, the VMAs are sorted. I also verified
that the backtraces for unsorted and sorted cores are viewable in GDB.

The backtrace reported by eu-stack shows up fine in the unsorted case,
when attempting to reproduce with Michael's steps. As expected, I see
the same error as Michael in the sorted case, when reproducing with his
steps.

Interestingly enough, in the sorted case, I found that if the crashing
program is not linked statically, eu-stack will work fine. However, if
the crashing program is linked statically, eu-stack will throw an
error, as reported.

Anyway, this patch looks pretty good.

> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index a43b78b4b646..35d5d86cff69 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -222,6 +222,17 @@ and ``core_uses_pid`` is set, then .PID will be appended to
> the filename.
>
>
> +core_sort_vma
> +=============
> +
> +The default coredump writes VMAs in address order. By setting
> +``core_sort_vma`` to 1, VMAs will be written from smallest size
> +to largest size. This is known to break at least elfutils, but
> +can be handy when dealing with very large (and truncated)
> +coredumps where the more useful debugging details are included
> +in the smaller VMAs.
> +
> +

Just one comment here, this should go up one entry to maintain
alphabetical ordering.

Thanks,
Brian
On Feb 19, 2025, at 11:52 AM, Kees Cook <kees@kernel.org> wrote:

> Yeah, I think we need to make this a tunable. Updating the kernel breaks
> elfutils, which isn't some weird custom corner case. :P
>
> So, while it took a few months, here is a report of breakage that I said
> we'd need to watch for[1]. :)
>
> Is anyone able to test this patch? And Brian, will setting a sysctl be
> okay for your use-case?

Hi Kees,

Yes, a sysctl tunable would be good here. I can test this patch in the
next day or two.

I will also scratch up a patch to bring us back into compliance with
the ELF specifications, and see if that fixes the userspace breakage
with elfutils, while not breaking gdb or rr.

Thanks,
Brian
On Feb 19, 2025, at 12:38 PM, Brian Mak <makb@juniper.net> wrote
> I will also scratch up a patch to bring us back into compliance with the
> ELF specifications, and see if that fixes the userspace breakage with
> elfutils, while not breaking gdb or rr.
I did scratch up something for this to fix up the program header
ordering, but it seems eu-stack is still broken, even with the fix. GDB
continues to work fine with the fix.
Given that there's no known utilities that get fixed as a result of the
program header sorting, I'm not sure if it's worth taking the patch.
Maybe we can just proceed with the sysctl + sorting if the core dump
size limit is hit, and leave it at that. Thoughts?
The program header ordering fix is below if someone wants to peek at it.
Best,
Brian
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 8054f44d39cf..8cf2bbc3cedf 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2021,6 +2021,7 @@ static int elf_core_dump(struct coredump_params *cprm)
struct elf_shdr *shdr4extnum = NULL;
Elf_Half e_phnum;
elf_addr_t e_shoff;
+ struct elf_phdr *phdrs = NULL;
/*
* The number of segs are recored into ELF header as 16bit value.
@@ -2084,7 +2085,11 @@ static int elf_core_dump(struct coredump_params *cprm)
if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
goto end_coredump;
- /* Write program headers for segments dump */
+ phdrs = kvmalloc_array(cprm->vma_count, sizeof(*phdrs), GFP_KERNEL);
+ if (!phdrs)
+ goto end_coredump;
+
+ /* Construct sorted program headers for segments dump */
for (i = 0; i < cprm->vma_count; i++) {
struct core_vma_metadata *meta = cprm->vma_meta + i;
struct elf_phdr phdr;
@@ -2104,8 +2109,14 @@ static int elf_core_dump(struct coredump_params *cprm)
if (meta->flags & VM_EXEC)
phdr.p_flags |= PF_X;
phdr.p_align = ELF_EXEC_PAGESIZE;
+ phdrs[meta->index] = phdr;
+ }
+
+ /* Write program headers for segments dump */
+ for (i = 0; i < cprm->vma_count; i++) {
+ struct elf_phdr *phdr = phdrs + i;
- if (!dump_emit(cprm, &phdr, sizeof(phdr)))
+ if (!dump_emit(cprm, phdr, sizeof(*phdr)))
goto end_coredump;
}
@@ -2140,6 +2151,7 @@ static int elf_core_dump(struct coredump_params *cprm)
end_coredump:
free_note_info(&info);
+ kvfree(phdrs);
kfree(shdr4extnum);
kfree(phdr4note);
return has_dumped;
diff --git a/fs/coredump.c b/fs/coredump.c
index 591700e1b2ce..0ddd75c3a914 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1226,6 +1226,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
while ((vma = coredump_next_vma(&vmi, vma, gate_vma)) != NULL) {
struct core_vma_metadata *m = cprm->vma_meta + i;
+ m->index = i;
m->start = vma->vm_start;
m->end = vma->vm_end;
m->flags = vma->vm_flags;
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 77e6e195d1d6..cf1b9e53cd1e 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -9,6 +9,7 @@
#ifdef CONFIG_COREDUMP
struct core_vma_metadata {
+ unsigned int index;
unsigned long start, end;
unsigned long flags;
unsigned long dump_size;
On Sat, Feb 22, 2025 at 02:13:06AM +0000, Brian Mak wrote:
> On Feb 19, 2025, at 12:38 PM, Brian Mak <makb@juniper.net> wrote:
>
> > I will also scratch up a patch to bring us back into compliance with the
> > ELF specifications, and see if that fixes the userspace breakage with
> > elfutils, while not breaking gdb or rr.
>
> I did scratch up something for this to fix up the program header
> ordering, but it seems eu-stack is still broken, even with the fix. GDB
> continues to work fine with the fix.

Okay, thanks for testing this!

> Given that there's no known utilities that get fixed as a result of the
> program header sorting, I'm not sure if it's worth taking the patch.
> Maybe we can just proceed with the sysctl + sorting if the core dump
> size limit is hit, and leave it at that. Thoughts?

Yeah, I like that this will automatically kick on under the condition
where the coredump will already be unreadable by some tools. And having
the sysctl means it can be enabled for testing, etc.

-Kees

--
Kees Cook
> Seeing how most popular userspace apps, with the exception of eu-stack,
> seem to work, we could also just leave it, and tell userspace apps to
> fix it on their end.

I'm not sure we do know that most work. 6.12 was released in November,
we're only getting this report about elfutils now. It's not like it's
been years with no complaints.
Brian Mak <makb@juniper.net> writes:
> Large cores may be truncated in some scenarios, such as with daemons
> with stop timeouts that are not large enough or lack of disk space. This
> impacts debuggability with large core dumps since critical information
> necessary to form a usable backtrace, such as stacks and shared library
> information, are omitted.
>
> We attempted to figure out which VMAs are needed to create a useful
> backtrace, and it turned out to be a non-trivial problem. Instead, we
> try simply sorting the VMAs by size, which has the intended effect.
>
> By sorting VMAs by dump size and dumping in that order, we have a
> simple, yet effective heuristic.
To make finding the history easier I would include:
v1: https://lkml.kernel.org/r/CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net
v2: https://lkml.kernel.org/r/C21B229F-D1E6-4E44-B506-A5ED4019A9DE@juniper.net
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
As Kees has already picked this up this is quite possibly silly.
But *shrug* that was when I was out.
> Signed-off-by: Brian Mak <makb@juniper.net>
> ---
>
> Hi all,
>
> Still need to run rr tests on this, per Kees Cook's suggestion, will
> update back once done. GDB and readelf show that this patch works
> without issue though.
>
> Thanks,
> Brian Mak
>
> v3: Edited commit message to better convey alternative solution as
> non-trivial
>
> Moved sorting logic to fs/coredump.c to make it in place
>
> Above edits suggested by Eric Biederman <ebiederm@xmission.com>
>
> v2: Edited commit message to include more reasoning for sorting VMAs
>
> Removed conditional VMA sorting with debugfs knob
>
> Above edits suggested by Eric Biederman <ebiederm@xmission.com>
>
> fs/coredump.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 7f12ff6ad1d3..33c5ac53ab31 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -18,6 +18,7 @@
> #include <linux/personality.h>
> #include <linux/binfmts.h>
> #include <linux/coredump.h>
> +#include <linux/sort.h>
> #include <linux/sched/coredump.h>
> #include <linux/sched/signal.h>
> #include <linux/sched/task_stack.h>
> @@ -1191,6 +1192,18 @@ static void free_vma_snapshot(struct coredump_params *cprm)
> }
> }
>
> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
> +{
> + const struct core_vma_metadata *vma_meta_lhs = vma_meta_lhs_ptr;
> + const struct core_vma_metadata *vma_meta_rhs = vma_meta_rhs_ptr;
> +
> + if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
> + return -1;
> + if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
> + return 1;
> + return 0;
> +}
> +
> /*
> * Under the mmap_lock, take a snapshot of relevant information about the task's
> * VMAs.
> @@ -1253,5 +1266,8 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
> cprm->vma_data_size += m->dump_size;
> }
>
> + sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
> + cmp_vma_size, NULL);
> +
> return true;
> }
>
> base-commit: eb5e56d1491297e0881c95824e2050b7c205f0d4
On Sat, Aug 10, 2024 at 07:28:44AM -0500, Eric W. Biederman wrote:
> Brian Mak <makb@juniper.net> writes:
>
> > Large cores may be truncated in some scenarios, such as with daemons
> > with stop timeouts that are not large enough or lack of disk space. This
> > impacts debuggability with large core dumps since critical information
> > necessary to form a usable backtrace, such as stacks and shared library
> > information, are omitted.
> >
> > We attempted to figure out which VMAs are needed to create a useful
> > backtrace, and it turned out to be a non-trivial problem. Instead, we
> > try simply sorting the VMAs by size, which has the intended effect.
> >
> > By sorting VMAs by dump size and dumping in that order, we have a
> > simple, yet effective heuristic.
>
> To make finding the history easier I would include:
> v1: https://lkml.kernel.org/r/CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net
> v2: https://lkml.kernel.org/r/C21B229F-D1E6-4E44-B506-A5ED4019A9DE@juniper.net
>
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> As Kees has already picked this up this is quite possibly silly.
> But *shrug* that was when I was out.

I've updated the trailers. Thanks for the review!

-Kees

--
Kees Cook
On Aug 12, 2024, at 11:05 AM, Kees Cook <kees@kernel.org> wrote:

> On Sat, Aug 10, 2024 at 07:28:44AM -0500, Eric W. Biederman wrote:
>> Brian Mak <makb@juniper.net> writes:
>>
>>> [...]
>>
>> To make finding the history easier I would include:
>> v1: https://lkml.kernel.org/r/CB8195AE-518D-44C9-9841-B2694A5C4002@juniper.net
>> v2: https://lkml.kernel.org/r/C21B229F-D1E6-4E44-B506-A5ED4019A9DE@juniper.net
>>
>> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> As Kees has already picked this up this is quite possibly silly.
>> But *shrug* that was when I was out.
>
> I've updated the trailers. Thanks for the review!

Hi Kees,

Thanks! I think you added it to the wrong commit though. Please double
check and update accordingly.

Regarding the sioc tests from earlier, I've reached a point where I
think I have a compatible virtual NIC (no more ioctl errors), but it's
giving me a mismatched-registers error, causing the test to fail. I can
see this same test failure on a vanilla kernel with my setup, so this is
probably either an environment issue or a bug in rr or the tests. Since
all the other tests pass, I'm just going to leave it at that.

Best,
Brian Mak
On Mon, Aug 12, 2024 at 06:21:15PM +0000, Brian Mak wrote:
> On Aug 12, 2024, at 11:05 AM, Kees Cook <kees@kernel.org> wrote:
>
> > [...]
> >
> > I've updated the trailers. Thanks for the review!
>
> Hi Kees,
>
> Thanks! I think you added it to the wrong commit though.

Ugh. Time for more coffee. Thanks; fixed. I need to update my "b4" --
it was hanging doing the trailers update so I did it myself manually...
That'll teach me. ;)

> tests. Since all the other tests pass, I'm just going to leave it at
> that.

Yeah, I think you're good. Thank you for taking the time to test rr!

--
Kees Cook
On Tue, 06 Aug 2024 18:16:02 +0000, Brian Mak wrote:
> Large cores may be truncated in some scenarios, such as with daemons
> with stop timeouts that are not large enough or lack of disk space. This
> impacts debuggability with large core dumps since critical information
> necessary to form a usable backtrace, such as stacks and shared library
> information, are omitted.
>
> We attempted to figure out which VMAs are needed to create a useful
> backtrace, and it turned out to be a non-trivial problem. Instead, we
> try simply sorting the VMAs by size, which has the intended effect.
>
> [...]
While waiting on rr test validation, and since we're at the start of the
dev cycle, I figure let's get this into -next ASAP to see if anything
else pops out. We can drop/revise if there are problems. (And as always,
I will add any Acks/Reviews/etc that show up on the thread.)
Applied to for-next/execve, thanks!
[1/1] binfmt_elf: Dump smaller VMAs first in ELF cores
https://git.kernel.org/kees/c/9c531dfdc1bc
Take care,
--
Kees Cook
On Aug 6, 2024, at 10:21 PM, Kees Cook <kees@kernel.org> wrote:

> On Tue, 06 Aug 2024 18:16:02 +0000, Brian Mak wrote:
>> [...]
>
> While waiting on rr test validation, and since we're at the start of the
> dev cycle, I figure let's get this into -next ASAP to see if anything
> else pops out. We can drop/revise if there are problems. (And as always,
> I will add any Acks/Reviews/etc that show up on the thread.)
>
> Applied to for-next/execve, thanks!
>
> [1/1] binfmt_elf: Dump smaller VMAs first in ELF cores
>       https://git.kernel.org/kees/c/9c531dfdc1bc

Thanks, Kees! And, thanks Linus + Eric for taking the time to comment on
this.

Regarding the rr tests, it was not an easy task to get the environment
set up to do this, but I did it and was able to run the tests. The rr
tests require a lot of kernel config options, and there's no list
documenting what's needed anywhere...

All the tests pass except for the sioc and sioc-no-syscallbuf tests.
However, these test failures are due to an incompatibility with the
network adapter I'm using: it seems rr only likes older network
adapters. I've switched my virtualized network adapter twice now, and
each time, the test gets a bit further than the previous time. I will
continue trying different network adapters until something hopefully
works.

In any case, since this error isn't directly related to my changes and
the rest of the tests pass, I think we can be pretty confident that this
change is not breaking rr.

Best,
Brian Mak
On Sat, Aug 10, 2024 at 12:52:16AM +0000, Brian Mak wrote:
> [...]
>
> Regarding the rr tests, it was not an easy task to get the environment
> set up to do this, but I did it and was able to run the tests. The rr
> tests require a lot of kernel config options and there's no list
> documenting what's needed anywhere...

Thanks for suffering through that!

> All the tests pass except for the sioc and sioc-no-syscallbuf tests.
> However, these test failures are due to an incompatibility with the
> network adapter I'm using. [...] In any case, since this error isn't
> directly related to my changes and the rest of the tests pass, then I
> think we can be pretty confident that this change is not breaking rr.

Perfect! Okay, we'll keep our eyes open for any reports of breakage. :)

-Kees

--
Kees Cook
On Tue, 6 Aug 2024 at 11:16, Brian Mak <makb@juniper.net> wrote:
>
> @@ -1253,5 +1266,8 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
> cprm->vma_data_size += m->dump_size;
> }
>
> + sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
> + cmp_vma_size, NULL);
> +
> return true;
> }
Hmm. Realistically we only dump core in ELF, and the order of the
segments shouldn't matter.
But I wonder if we should do this in the ->core_dump() function
itself, in case it would have mattered for other dump formats?
IOW, instead of being at the bottom of dump_vma_snapshot(), maybe the
sorting should be at the top of elf_core_dump()?
And yes, in practice I doubt we'll ever have other dump formats, and
no, a.out isn't doing some miraculous comeback either.
But I bet you didn't test elf_fdpic_core_dump() even if I bet it (a)
works and (b) nobody cares.
So moving it to the ELF side might be conceptually the right thing to do?
(Or is there some reason it needs to be done at snapshot time that I
just didn't fully appreciate?)
Linus
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 6 Aug 2024 at 11:16, Brian Mak <makb@juniper.net> wrote:
>
> [...]
>
> So moving it to the ELF side might be conceptually the right thing to do?
>
> (Or is there some reason it needs to be done at snapshot time that I
> just didn't fully appreciate?)

I asked him to perform this at snapshot time. Plus it is obvious at
snapshot time that you can change the allocated array, while it is not
so obvious in the ->core_dump methods.

I would argue that the long term maintainable thing to do is to merge
elf_core_dump and elf_fdpic_core_dump and put all of the code in
fs/coredump.c. Performing the sort at snapshot time avoids introducing
one extra reason why the two ELF implementations of coredumping are
different.

I did read through the ELF FDPIC code quickly, and it looks like it
should just work no matter which order the VMAs are dumped in, just
like the other ELF coredump code does.

My practical concern is that someone has a coredump tool that walks
through the program headers and short-circuits the walk because it
assumes the program headers are all written in address order. But the
only way to find one of those is to just try it.

Eric
On Fri, 9 Aug 2024 at 07:40, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I asked him to perform this at snapshot time. Plus it is obvious at
> snapshot time that you can change the allocated array, while it is
> not so obvious in the ->core_dump methods.
Fair enough. The days when we supported a.out dumps are obviously long
long gone, and aren't coming back.
So I am not adamant that it has to be done in the dumper, and I
probably just have that historical "we have multiple different dumpers
with different rules" mindset that isn't really relevant any more.
> I would argue that the long term maintainable thing to do is to
> merge elf_core_dump and elf_fdpic_core_dump and put all of the code
> in fs/coredump.c
I wouldn't object. It's not like there's any foreseeable new core dump
format that we'd expect, and the indirection makes the code flow
harder to follow.
Not that most people look at this code a lot.
Linus