dump_user_range() supports sparse core dumps by skipping anonymous pages
which have not been modified. If get_dump_page() returns NULL, the page
is skipped rather than written to the core dump with dump_emit_page().
Sadly, dump_emit_page() contains the only check for dump_interrupted(),
so when dumping a very large sparse region, the core dump becomes
effectively uninterruptible. This can be observed with the following
test program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
	char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
	printf("%p %m\n", mem);
	if (mem != MAP_FAILED) {
		mem[0] = 1;
	}
	abort();
}
The program allocates 1 TiB of anonymous memory, touches one page of it,
and aborts. During the core dump, SIGKILL has no effect. It takes
about 30 seconds to finish the dump, burning 100% CPU.
This issue naturally arises with things like Address Sanitizer, which
allocate a large sparse region of virtual address space for their shadow
memory.
Fix it by checking dump_interrupted() explicitly in dump_user_range().
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
---
fs/coredump.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/coredump.c b/fs/coredump.c
index d48edb37bc35..fd29d3f15f1e 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -950,6 +950,10 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
 			}
 		} else {
 			dump_skip(cprm, PAGE_SIZE);
+			if (dump_interrupted()) {
+				dump_page_free(dump_page);
+				return 0;
+			}
 		}
 		cond_resched();
 	}
--
2.48.1
On Wed 15-01-25 23:05:38, Tavian Barnes wrote:
> dump_user_range() supports sparse core dumps by skipping anonymous pages
> which have not been modified. If get_dump_page() returns NULL, the page
> is skipped rather than written to the core dump with dump_emit_page().
>
> Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> so when dumping a very large sparse region, the core dump becomes
> effectively uninterruptible.
[...]
> Signed-off-by: Tavian Barnes <tavianator@tavianator.com>

Thanks for the patch! The idea looks good to me as a quick fix, one
suggestion for improvement below:

> diff --git a/fs/coredump.c b/fs/coredump.c
> index d48edb37bc35..fd29d3f15f1e 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -950,6 +950,10 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
> 			}
> 		} else {
> 			dump_skip(cprm, PAGE_SIZE);
> +			if (dump_interrupted()) {
> +				dump_page_free(dump_page);
> +				return 0;
> +			}

So rather than doing the check here, I'd do it before cond_resched() below
and remove the check from dump_emit_page(). That way we have the
interruption handling all in one place.

> 		}
> 		cond_resched();
> 	}

Bonus points for unifying the exit paths from the loop (perhaps as a
separate cleanup patch) like:

		if (page)
			ret = dump_emit_page(...)
		else
			dump_skip(...)
		if (dump_interrupted())
			ret = 0;
		if (!ret)
			break;
		cond_resched();
	}
	dump_page_free(dump_page);
	return ret;

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
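Filled out, the loop Jan sketches above might look roughly like the following.
This is an untested sketch only: it reuses the dump_page_alloc(),
dump_page_copy(), dump_page_free() and dump_emit_page() helpers that already
exist in fs/coredump.c, and it assumes the dump_interrupted() check is moved
out of dump_emit_page() as suggested.

int dump_user_range(struct coredump_params *cprm, unsigned long start,
		    unsigned long len)
{
	unsigned long addr;
	struct page *dump_page;
	int ret = 1;

	dump_page = dump_page_alloc();
	if (!dump_page)
		return 0;

	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
		struct page *page = get_dump_page(addr);

		if (page) {
			/* Copy into the preallocated buffer and write it out. */
			ret = dump_emit_page(cprm, dump_page_copy(page, dump_page));
			put_page(page);
		} else {
			/* Untouched page: leave a hole instead of writing data. */
			dump_skip(cprm, PAGE_SIZE);
		}
		/* Single place that reacts to write failure or a pending SIGKILL. */
		if (dump_interrupted())
			ret = 0;
		if (!ret)
			break;
		cond_resched();
	}

	dump_page_free(dump_page);
	return ret;
}

With this shape the sparse (dump_skip) path hits the dump_interrupted() check
on every iteration as well, which is the point of the fix.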
On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> dump_user_range() supports sparse core dumps by skipping anonymous pages
> which have not been modified. If get_dump_page() returns NULL, the page
> is skipped rather than written to the core dump with dump_emit_page().
[...]
> The program allocates 1 TiB of anonymous memory, touches one page of it,
> and aborts. During the core dump, SIGKILL has no effect. It takes
> about 30 seconds to finish the dump, burning 100% CPU.

While the patch makes sense to me, this should not be taking anywhere
near this much time and plausibly after unscrewing it will stop being a
factor.

So I had a look with a profiler:

- 99.89% 0.00% a.out
entry_SYSCALL_64_after_hwframe
do_syscall_64
syscall_exit_to_user_mode
arch_do_signal_or_restart
- get_signal
- 99.89% do_coredump
- 99.88% elf_core_dump
- dump_user_range
- 98.12% get_dump_page
- 64.19% __get_user_pages
- 40.92% gup_vma_lookup
- find_vma
- mt_find
4.21% __rcu_read_lock
1.33% __rcu_read_unlock
- 3.14% check_vma_flags
0.68% vma_is_secretmem
0.61% __cond_resched
0.60% vma_pgtable_walk_end
0.59% vma_pgtable_walk_begin
0.58% no_page_table
- 15.13% down_read_killable
0.69% __cond_resched
13.84% up_read
0.58% __cond_resched

Almost 29% of time is spent relocking the mmap semaphore in
__get_user_pages. This most likely can operate locklessly in the fast
path. Even if somehow not, chances are the lock can be held across
multiple calls.

mt_find spends most of its time issuing a rep stos of 48 bytes (would
be faster to rep mov 6 times instead). This is the compiler being nasty,
I'll maybe look into it.

However, I strongly suspect the current iteration method is just slow
due to repeat mt_find calls and The Right Approach(tm) would make this
entire thing finish within milliseconds by iterating the maple tree
instead, but then the mm folk would have to be consulted on how to
approach this and it may be time consuming to implement.

Sorting out relocking should be an easily achievable & measurable win
(no interest on my end).
On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> Almost 29% of time is spent relocking the mmap semaphore in
> __get_user_pages. This most likely can operate locklessly in the fast
> path. Even if somehow not, chances are the lock can be held across
> multiple calls.
[...]
> Sorting out relocking should be an easily achievable & measurable win
> (no interest on my end).

As much as I agree the code is dumb, doing what you suggest with mmap_sem
isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
as that will cause lock inversion between mmap_sem and whatever filesystem
locks we have to take. So the fix would have to involve processing larger
batches of address space at once (which should also somewhat amortize the
__get_user_pages() setup costs). Not that hard to do but I wanted to spell
it out in case someone wants to pick up this todo item :)

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@suse.cz> wrote:
> As much as I agree the code is dumb, doing what you suggest with mmap_sem
> isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> as that will cause lock inversion between mmap_sem and whatever filesystem
> locks we have to take. So the fix would have to involve processing larger
> batches of address space at once (which should also somewhat amortize the
> __get_user_pages() setup costs). Not that hard to do but I wanted to spell
> it out in case someone wants to pick up this todo item :)

Is the lock really needed to begin with?

Suppose it is.

In this context there are next to no pages found, but there is a
gazillion relocks as the entire VA is being walked.

Bare minimum patch which would already significantly help would start
with the lock held and only relock if there is a page to dump, should
be very easy to add.

I however vote for someone mm-savvy to point out an easy way (if any)
to just iterate pages which are there instead.

--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Jan 16, 2025 at 01:04:42PM +0100, Mateusz Guzik wrote:
> In this context there are next to no pages found, but there is a
> gazillion relocks as the entire VA is being walked.

Do I understand correctly that all the relocks are to look up the VMA
associated with each address, one page at a time? That's especially
wasteful as dump_user_range() is called separately for each VMA, so it's
going to find the same VMA every time anyway.

> Bare minimum patch which would already significantly help would start
> with the lock held and only relock if there is a page to dump, should
> be very easy to add.

That seems like a good idea.

> I however vote for someone mm-savvy to point out an easy way (if any)
> to just iterate pages which are there instead.

It seems like some of the <linux/pagewalk.h> APIs might be relevant?
Not sure which one has the right semantics. Can we just use
folio_walk_start()?

I guess the main complexity is every time we find a page, we have to
stop the walk, unlock mmap_sem, call dump_emit_page(), and restart the
walk from the next address. Maybe an mm expert can weigh in.
On Thu, Jan 16, 2025 at 6:04 PM Tavian Barnes <tavianator@tavianator.com> wrote:
> On Thu, Jan 16, 2025 at 01:04:42PM +0100, Mateusz Guzik wrote:
> > In this context there are next to no pages found, but there is a
> > gazillion relocks as the entire VA is being walked.
>
> Do I understand correctly that all the relocks are to look up the VMA
> associated with each address, one page at a time? That's especially
> wasteful as dump_user_range() is called separately for each VMA, so it's
> going to find the same VMA every time anyway.

It indeed is a loop over vmas, and then over the entire range with
PAGE_SIZE'd steps.

> > I however vote for someone mm-savvy to point out an easy way (if any)
> > to just iterate pages which are there instead.
>
> It seems like some of the <linux/pagewalk.h> APIs might be relevant?
> Not sure which one has the right semantics. Can we just use
> folio_walk_start()?
>
> I guess the main complexity is every time we find a page, we have to
> stop the walk, unlock mmap_sem, call dump_emit_page(), and restart the
> walk from the next address. Maybe an mm expert can weigh in.

I don't know the way; based on my epsilon understanding of the area I
*suspect* walking the maple tree would do it.

--
Mateusz Guzik <mjguzik gmail.com>
Dumping processes with large allocated and mostly not-faulted areas is
very slow.
Borrowing a test case from Tavian Barnes:
int main(void) {
	char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
	printf("%p %m\n", mem);
	if (mem != MAP_FAILED) {
		mem[0] = 1;
	}
	abort();
}
That's 1TB of almost completely not-populated area.
On my test box it takes 13-14 seconds to dump.
The profile shows:
- 99.89% 0.00% a.out
entry_SYSCALL_64_after_hwframe
do_syscall_64
syscall_exit_to_user_mode
arch_do_signal_or_restart
- get_signal
- 99.89% do_coredump
- 99.88% elf_core_dump
- dump_user_range
- 98.12% get_dump_page
- 64.19% __get_user_pages
- 40.92% gup_vma_lookup
- find_vma
- mt_find
4.21% __rcu_read_lock
1.33% __rcu_read_unlock
- 3.14% check_vma_flags
0.68% vma_is_secretmem
0.61% __cond_resched
0.60% vma_pgtable_walk_end
0.59% vma_pgtable_walk_begin
0.58% no_page_table
- 15.13% down_read_killable
0.69% __cond_resched
13.84% up_read
0.58% __cond_resched
Almost 29% of the time is spent relocking the mmap semaphore between
calls to get_dump_page() which find nothing.
Whacking that results in times of 10 seconds (down from 13-14).
While here make the thing killable.
The real problem is the page-sized iteration and the real fix would
patch it up instead. It is left as an exercise for the mm-familiar
reader.
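For scale, with 4 KiB pages the 1 TiB region above is 2^40 / 2^12 = 2^28
(roughly 268 million) pages, so the current loop makes on the order of 268
million get_dump_page() calls, each taking and dropping mmap_lock and redoing
the vma lookup, while only a single page of the region actually lands in the
dump.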
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
Minimally tested, very plausible I missed something.
 arch/arm64/kernel/elfcore.c |  3 ++-
 fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
 include/linux/mm.h          |  2 +-
 mm/gup.c                    |  5 ++---
 4 files changed, 37 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2e94d20c4ac7..b735f4c2fe5e 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -27,9 +27,10 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 	int ret = 1;
 	unsigned long addr;
 	void *tags = NULL;
+	int locked = 0;
 
 	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
-		struct page *page = get_dump_page(addr);
+		struct page *page = get_dump_page(addr, &locked);
 
 		/*
 		 * get_dump_page() returns NULL when encountering an empty
diff --git a/fs/coredump.c b/fs/coredump.c
index d48edb37bc35..84cf76f0d5b6 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -925,14 +925,23 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
 {
 	unsigned long addr;
 	struct page *dump_page;
+	int locked, ret;
 
 	dump_page = dump_page_alloc();
 	if (!dump_page)
 		return 0;
 
+	ret = 0;
+	locked = 0;
 	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
 		struct page *page;
 
+		if (!locked) {
+			if (mmap_read_lock_killable(current->mm))
+				goto out;
+			locked = 1;
+		}
+
 		/*
 		 * To avoid having to allocate page tables for virtual address
 		 * ranges that have never been used yet, and also to make it
@@ -940,21 +949,38 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		 * NULL when encountering an empty page table entry that would
 		 * otherwise have been filled with the zero page.
 		 */
-		page = get_dump_page(addr);
+		page = get_dump_page(addr, &locked);
 		if (page) {
+			if (locked) {
+				mmap_read_unlock(current->mm);
+				locked = 0;
+			}
 			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));
 			put_page(page);
-			if (stop) {
-				dump_page_free(dump_page);
-				return 0;
-			}
+			if (stop)
+				goto out;
 		} else {
 			dump_skip(cprm, PAGE_SIZE);
 		}
+
+		if (dump_interrupted())
+			goto out;
+
+		if (!need_resched())
+			continue;
+		if (locked) {
+			mmap_read_unlock(current->mm);
+			locked = 0;
+		}
 		cond_resched();
 	}
+	ret = 1;
+out:
+	if (locked)
+		mmap_read_unlock(current->mm);
+
 	dump_page_free(dump_page);
-	return 1;
+	return ret;
 }
 #endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 75c9b4f46897..7df0d9200d8c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2633,7 +2633,7 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
 
 struct kvec;
-struct page *get_dump_page(unsigned long addr);
+struct page *get_dump_page(unsigned long addr, int *locked);
 
 bool folio_mark_dirty(struct folio *folio);
 bool folio_mark_dirty_lock(struct folio *folio);
diff --git a/mm/gup.c b/mm/gup.c
index 2304175636df..f3be2aa43543 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2266,13 +2266,12 @@ EXPORT_SYMBOL(fault_in_readable);
  * Called without mmap_lock (takes and releases the mmap_lock by itself).
  */
 #ifdef CONFIG_ELF_CORE
-struct page *get_dump_page(unsigned long addr)
+struct page *get_dump_page(unsigned long addr, int *locked)
 {
 	struct page *page;
-	int locked = 0;
 	int ret;
 
-	ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
+	ret = __get_user_pages_locked(current->mm, addr, 1, &page, locked,
 				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
 	return (ret == 1) ? page : NULL;
 }
--
2.43.0