uprobes: document mmap_lock, don't abuse get_user_pages_remote()

[PATCH 1/2] uprobes: document the usage of mm->mmap_lock

Posted by Oleg Nesterov 1 year, 7 months ago

The comment above uprobe_write_opcode() is wrong, unapply_uprobe() calls
it under mmap_read_lock() and this is correct.

And it is completely unclear why register_for_each_vma() takes mmap_lock
for writing, add a comment to explain that mmap_write_lock() is needed to
avoid the following race:

	- A task T hits the bp installed by uprobe and calls
	  find_active_uprobe()

	- uprobe_unregister() removes this uprobe/bp

	- T calls find_uprobe() which returns NULL

	- another uprobe_register() installs the bp at the same address

	- T calls is_trap_at_addr() which returns true

	- T returns to handle_swbp() and gets SIGTRAP.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/events/uprobes.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..d52b624a50fa 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  * @vaddr: the virtual address to store the opcode.
  * @opcode: opcode to be written at @vaddr.
  *
- * Called with mm->mmap_lock held for write.
+ * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
@@ -1046,7 +1046,12 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 
 		if (err && is_register)
 			goto free;
-
+		/*
+		 * We take mmap_lock for writing to avoid the race with
+		 * find_active_uprobe(), install_breakpoint() must not
+		 * make is_trap_at_addr() true right after find_uprobe()
+		 * returns NULL.
+		 */
 		mmap_write_lock(mm);
 		vma = find_vma(mm, info->vaddr);
 		if (!vma || !valid_vma(vma, is_register) ||
-- 
2.25.1.362.g51ebf55

Re: [PATCH 1/2] uprobes: document the usage of mm->mmap_lock

Posted by Masami Hiramatsu (Google) 1 year, 7 months ago

On Wed, 10 Jul 2024 16:00:45 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> The comment above uprobe_write_opcode() is wrong, unapply_uprobe() calls
> it under mmap_read_lock() and this is correct.
> 
> And it is completely unclear why register_for_each_vma() takes mmap_lock
> for writing, add a comment to explain that mmap_write_lock() is needed to
> avoid the following race:
> 
> 	- A task T hits the bp installed by uprobe and calls
> 	  find_active_uprobe()
> 
> 	- uprobe_unregister() removes this uprobe/bp
> 
> 	- T calls find_uprobe() which returns NULL
> 
> 	- another uprobe_register() installs the bp at the same address
> 
> 	- T calls is_trap_at_addr() which returns true
> 
> 	- T returns to handle_swbp() and gets SIGTRAP.
> 
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/events/uprobes.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2c83ba776fc7..d52b624a50fa 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
>   * @vaddr: the virtual address to store the opcode.
>   * @opcode: opcode to be written at @vaddr.
>   *
> - * Called with mm->mmap_lock held for write.
> + * Called with mm->mmap_lock held for read or write.
>   * Return 0 (success) or a negative errno.
>   */
>  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> @@ -1046,7 +1046,12 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>  
>  		if (err && is_register)
>  			goto free;
> -
> +		/*
> +		 * We take mmap_lock for writing to avoid the race with
> +		 * find_active_uprobe(), install_breakpoint() must not
> +		 * make is_trap_at_addr() true right after find_uprobe()
> +		 * returns NULL.

Sorry, I couldn't catch the latter part. What is the relationship of
taking the mmap_lock and install_breakpoint() and is_trap_at_addr() here?

You meant that find_active_uprobe() is using find_uprobe() which searches
for the uprobe in the rbtree? But it seems uprobe is already inserted to the rbtree
in alloc_uprobe() so find_uprobe() will not return NULL here, right?

Thank you,

> +		 */
>  		mmap_write_lock(mm);
>  		vma = find_vma(mm, info->vaddr);
>  		if (!vma || !valid_vma(vma, is_register) ||
> -- 
> 2.25.1.362.g51ebf55
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

Re: [PATCH 1/2] uprobes: document the usage of mm->mmap_lock

Posted by Oleg Nesterov 1 year, 7 months ago

On 07/10, Masami Hiramatsu wrote:
>
> On Wed, 10 Jul 2024 16:00:45 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > The comment above uprobe_write_opcode() is wrong, unapply_uprobe() calls
> > it under mmap_read_lock() and this is correct.
> >
> > And it is completely unclear why register_for_each_vma() takes mmap_lock
> > for writing, add a comment to explain that mmap_write_lock() is needed to
> > avoid the following race:
> >
> > 	- A task T hits the bp installed by uprobe and calls
> > 	  find_active_uprobe()
> >
> > 	- uprobe_unregister() removes this uprobe/bp
> >
> > 	- T calls find_uprobe() which returns NULL
> >
> > 	- another uprobe_register() installs the bp at the same address
> >
> > 	- T calls is_trap_at_addr() which returns true
> >
> > 	- T returns to handle_swbp() and gets SIGTRAP.

...

> >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > @@ -1046,7 +1046,12 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> >
> >  		if (err && is_register)
> >  			goto free;
> > -
> > +		/*
> > +		 * We take mmap_lock for writing to avoid the race with
> > +		 * find_active_uprobe(), install_breakpoint() must not
> > +		 * make is_trap_at_addr() true right after find_uprobe()
> > +		 * returns NULL.
>
> Sorry, I couldn't catch the latter part. What is the relationship of
> taking the mmap_lock and install_breakpoint() and is_trap_at_addr() here?

Please see the changelog above, it tries to explain this race in more
detail...

> You meant that find_active_uprobe() is using find_uprobe() which searches
> for the uprobe in the rbtree?

Yes,

> But it seems uprobe is already inserted to the rbtree
> in alloc_uprobe() so find_uprobe() will not return NULL here, right?

uprobe_register() -> alloc_uprobe() can come after
find_active_uprobe() -> find_uprobe() returns NULL.

Now, if uprobe_register() -> register_for_each_vma() used mmap_read_lock(), it
could do install_breakpoint() before find_active_uprobe() calls is_trap_at_addr().

In this case find_active_uprobe() returns with uprobe == NULL and is_swbp == 1,
handle_swbp() treats this case as the "normal" int3 without uprobe and does

	if (!uprobe) {
		if (is_swbp > 0) {
			/* No matching uprobe; signal SIGTRAP. */
			force_sig(SIGTRAP);

Does this answer your question?

Oleg.
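
The interleaving described above can be replayed as a small user-space model. This is only a sketch under stated assumptions, not kernel code: `struct probe_state`, `decide()` and `run_reader()` are hypothetical names, the two booleans stand in for the rbtree lookup and for the instruction bytes at the probed address, and the "restart" outcome is a simplification, since the excerpt above only shows the SIGTRAP branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one probed address in one mm (hypothetical, not kernel code). */
struct probe_state {
	bool in_tree;      /* would find_uprobe() return non-NULL? */
	bool trap_in_text; /* would is_trap_at_addr() return true? */
};

enum swbp_result { SWBP_HANDLED, SWBP_RESTART, SWBP_SIGTRAP };

/* Mirrors the decision quoted from handle_swbp(); RESTART is an assumption. */
static enum swbp_result decide(bool uprobe_found, bool is_swbp)
{
	if (uprobe_found)
		return SWBP_HANDLED;
	return is_swbp ? SWBP_SIGTRAP : SWBP_RESTART;
}

/*
 * Replay the changelog scenario. If install_inside_reader_section is true,
 * install_breakpoint() slips in between the reader's two checks, which is
 * what a read-locked register path would allow. If it is false, the install
 * waits until the reader is done, as with mmap_write_lock().
 */
static enum swbp_result run_reader(bool install_inside_reader_section)
{
	struct probe_state s = { .in_tree = true, .trap_in_text = true };
	bool found, is_swbp;

	/* T has hit the bp; uprobe_unregister() removes the uprobe and the bp. */
	s.in_tree = false;
	s.trap_in_text = false;

	/* T: find_uprobe() returns NULL. */
	found = s.in_tree;

	/* New uprobe_register(): alloc_uprobe() inserts into the rbtree. */
	s.in_tree = true;
	if (install_inside_reader_section)
		s.trap_in_text = true; /* install_breakpoint() races in */

	/* T: is_trap_at_addr(). */
	is_swbp = s.trap_in_text;

	if (!install_inside_reader_section)
		s.trap_in_text = true; /* install_breakpoint() after the reader */

	/* Either way the new probe ends up fully installed. */
	assert(s.in_tree && s.trap_in_text);

	return decide(found, is_swbp);
}
```

In this model `run_reader(true)` ends in `SWBP_SIGTRAP`, the spurious signal from the changelog, while `run_reader(false)` ends in `SWBP_RESTART`, so the task simply re-executes the instruction and hits the newly installed breakpoint.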

Re: [PATCH 1/2] uprobes: document the usage of mm->mmap_lock

Posted by Masami Hiramatsu (Google) 1 year, 7 months ago

On Wed, 10 Jul 2024 17:10:07 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 07/10, Masami Hiramatsu wrote:
> >
> > On Wed, 10 Jul 2024 16:00:45 +0200
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > The comment above uprobe_write_opcode() is wrong, unapply_uprobe() calls
> > > it under mmap_read_lock() and this is correct.
> > >
> > > And it is completely unclear why register_for_each_vma() takes mmap_lock
> > > for writing, add a comment to explain that mmap_write_lock() is needed to
> > > avoid the following race:
> > >
> > > 	- A task T hits the bp installed by uprobe and calls
> > > 	  find_active_uprobe()
> > >
> > > 	- uprobe_unregister() removes this uprobe/bp
> > >
> > > 	- T calls find_uprobe() which returns NULL
> > >
> > > 	- another uprobe_register() installs the bp at the same address
> > >
> > > 	- T calls is_trap_at_addr() which returns true
> > >
> > > 	- T returns to handle_swbp() and gets SIGTRAP.
> 
> ...
> 
> > >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > @@ -1046,7 +1046,12 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> > >
> > >  		if (err && is_register)
> > >  			goto free;
> > > -
> > > +		/*
> > > +		 * We take mmap_lock for writing to avoid the race with
> > > +		 * find_active_uprobe(), install_breakpoint() must not
> > > +		 * make is_trap_at_addr() true right after find_uprobe()
> > > +		 * returns NULL.
> >
> > Sorry, I couldn't catch the latter part. What is the relationship of
> > taking the mmap_lock and install_breakpoint() and is_trap_at_addr() here?
> 
> Please see the changelog above, it tries to explain this race in more
> detail...

OK, but it seems we should write the above longer explanation here.
What about a comment like this?

/*
 * We take mmap_lock for writing to avoid the race with
 * find_active_uprobe() and is_trap_at_addr() in reader
 * side.
 * If the reader, which hits a swbp and is handling it,
 * does not take mmap_lock for reading, it is possible
 * that find_active_uprobe() returns NULL (because
 * uprobe_unregister() removes uprobes right before that),
 * but is_trap_at_addr() can return true afterwards (because
 * another thread calls uprobe_register() on the same address).
 * This causes unexpected SIGTRAP on reader thread.
 * Taking mmap_lock avoids this race.
 */

> 
> > You meant that find_active_uprobe() is using find_uprobe() which searches
> > for the uprobe in the rbtree?
> 
> Yes,
> 
> > But it seems uprobe is already inserted to the rbtree
> > in alloc_uprobe() so find_uprobe() will not return NULL here, right?
> 
> uprobe_register() -> alloc_uprobe() can come after
> find_active_uprobe() -> find_uprobe() returns NULL.
> 
> Now, if uprobe_register() -> register_for_each_vma() used mmap_read_lock(), it
> could do install_breakpoint() before find_active_uprobe() calls is_trap_at_addr().
> 
> In this case find_active_uprobe() returns with uprobe == NULL and is_swbp == 1,
> handle_swbp() treats this case as the "normal" int3 without uprobe and does
> 
> 	if (!uprobe) {
> 		if (is_swbp > 0) {
> 			/* No matching uprobe; signal SIGTRAP. */
> 			force_sig(SIGTRAP);
> 
> Does this answer your question?

Yes, thanks for the explanation.

Thank you!

> 
> Oleg.
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

Re: [PATCH 1/2] uprobes: document the usage of mm->mmap_lock

Posted by Oleg Nesterov 1 year, 7 months ago

On 07/11, Masami Hiramatsu wrote:
>
> On Wed, 10 Jul 2024 17:10:07 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > > @@ -1046,7 +1046,12 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> > > >
> > > >  		if (err && is_register)
> > > >  			goto free;
> > > > -
> > > > +		/*
> > > > +		 * We take mmap_lock for writing to avoid the race with
> > > > +		 * find_active_uprobe(), install_breakpoint() must not
> > > > +		 * make is_trap_at_addr() true right after find_uprobe()
> > > > +		 * returns NULL.
> > >

...

> OK, but it seems we should write the above longer explanation here.
> What about a comment like this?

Well, I am biased, but your version looks much more confusing to me...

> /*
>  * We take mmap_lock for writing to avoid the race with
>  * find_active_uprobe() and is_trap_at_addr() in reader
>  * side.
>  * If the reader, which hits a swbp and is handling it,
>  * does not take mmap_lock for reading,

this looks as if the reader which hits a swbp takes mmap_lock for reading
because of this race. No, find_active_uprobe() needs mmap_read_lock() for
vma_lookup, get_user_pages, etc.

> it is possible
>  * that find_active_uprobe() returns NULL (because
>  * uprobe_unregister() removes uprobes right before that),
>  * but is_trap_at_addr() can return true afterwards (because
>  * another thread calls uprobe_register() on the same address).
     ^^^^^^^^^^^^^^^
We are the thread which called uprobe_register(), we are going to
do install_breakpoint().

And btw, not that I think this makes sense, but register_for_each_vma()
could probably do

	if (is_register)
		mmap_write_lock(mm);
	else
		mmap_read_lock(mm);

Oleg.
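
The conditional locking floated above would also need a matching conditional unlock. A minimal user-space sketch of that shape follows, assuming a toy lock that only tracks its mode: `struct toy_mm` and all `toy_*` helpers are hypothetical stand-ins for `mm->mmap_lock` and the `mmap_*_lock()` / `mmap_*_unlock()` calls, and nothing here is taken from the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for mm->mmap_lock; it tracks the lock mode only (hypothetical). */
struct toy_mm {
	int readers;
	bool writer;
	bool bp_installed;
};

static void toy_write_lock(struct toy_mm *mm)
{
	assert(!mm->writer && mm->readers == 0);
	mm->writer = true;
}

static void toy_write_unlock(struct toy_mm *mm)
{
	assert(mm->writer);
	mm->writer = false;
}

static void toy_read_lock(struct toy_mm *mm)
{
	assert(!mm->writer);
	mm->readers++;
}

static void toy_read_unlock(struct toy_mm *mm)
{
	assert(mm->readers > 0);
	mm->readers--;
}

/*
 * Register takes the lock for writing so the install cannot interleave with
 * a reader; unregister only needs it for reading. The unlock must match the
 * mode that was taken.
 */
static void toy_update_bp(struct toy_mm *mm, bool is_register)
{
	if (is_register)
		toy_write_lock(mm);
	else
		toy_read_lock(mm);

	mm->bp_installed = is_register;

	if (is_register)
		toy_write_unlock(mm);
	else
		toy_read_unlock(mm);
}
```

Calling `toy_update_bp(&mm, true)` and then `toy_update_bp(&mm, false)` on a zero-initialized `struct toy_mm` leaves the lock released in both cases; the extra branch on the unlock side is part of the cost of this variant.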

Re: [PATCH 1/2] uprobes: document the usage of mm->mmap_lock

Posted by Masami Hiramatsu (Google) 1 year, 7 months ago

On Thu, 11 Jul 2024 11:49:40 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 07/11, Masami Hiramatsu wrote:
> >
> > On Wed, 10 Jul 2024 17:10:07 +0200
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > > >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > > > @@ -1046,7 +1046,12 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> > > > >
> > > > >  		if (err && is_register)
> > > > >  			goto free;
> > > > > -
> > > > > +		/*
> > > > > +		 * We take mmap_lock for writing to avoid the race with
> > > > > +		 * find_active_uprobe(), install_breakpoint() must not
> > > > > +		 * make is_trap_at_addr() true right after find_uprobe()
> > > > > +		 * returns NULL.
> > > >
> 
> ...
> 
> > OK, but it seems we should write the above longer explanation here.
> > What about a comment like this?
> 
> Well, I am biased, but your version looks much more confusing to me...
> 
> > /*
> >  * We take mmap_lock for writing to avoid the race with
> >  * find_active_uprobe() and is_trap_at_addr() in reader
> >  * side.
> >  * If the reader, which hits a swbp and is handling it,
> >  * does not take mmap_lock for reading,
> 
> this looks as if the reader which hits a swbp takes mmap_lock for reading
> because of this race. No, find_active_uprobe() needs mmap_read_lock() for
> vma_lookup, get_user_pages, etc.

OK, so it is for looking up the VMA. (But in the end, this lock protects both
the VMAs and uprobes, right?)

> 
> > it is possible
> >  * that find_active_uprobe() returns NULL (because
> >  * uprobe_unregister() removes uprobes right before that),
> >  * but is_trap_at_addr() can return true afterwards (because
> >  * another thread calls uprobe_register() on the same address).
>      ^^^^^^^^^^^^^^^
> We are the thread which called uprobe_register(), we are going to
> do install_breakpoint().

Ah, yes :)

What about this?

	 * We take mmap_lock for writing to avoid the race with
	 * find_active_uprobe(), which takes mmap_lock for reading.
	 * Thus this install_breakpoint() must not make
	 * is_trap_at_addr() true right after find_uprobe()
	 * returns NULL in find_active_uprobe().


> 
> And btw, not that I think this makes sense, but register_for_each_vma()
> could probably do
> 
> 	if (is_register)
> 		mmap_write_lock(mm);
> 	else
> 		mmap_read_lock(mm);

Agreed.

Thank you,

> 
> Oleg.
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

Re: [PATCH 1/2] uprobes: document the usage of mm->mmap_lock

Posted by Oleg Nesterov 1 year, 7 months ago

On 07/11, Masami Hiramatsu wrote:
>
> What about this?
>
> 	 * We take mmap_lock for writing to avoid the race with
> 	 * find_active_uprobe(), which takes mmap_lock for reading.
> 	 * Thus this install_breakpoint() must not make
> 	 * is_trap_at_addr() true right after find_uprobe()
> 	 * returns NULL in find_active_uprobe().

Thanks! will change.

Oleg.