Allow multiple callers to install pages simultaneously by downgrading
the mmap_sem to non-exclusive mode. Races to the same PTE are handled
using get_user_pages_remote() to retrieve the already installed page.
This method significantly reduces contention in the mmap semaphore.
To ensure safety, vma_lookup() is used (instead of alloc->vma) to avoid
operating on an isolated VMA. In addition, zap_page_range_single() is
called under the alloc->mutex to avoid racing with the shrinker.
Many thanks to Barry Song who posted a similar approach [1].
Link: https://lore.kernel.org/all/20240902225009.34576-1-21cnbao@gmail.com/ [1]
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
---
drivers/android/binder_alloc.c | 64 +++++++++++++++++++++-------------
1 file changed, 40 insertions(+), 24 deletions(-)
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 7241bf4a3ff2..2ab520c285b3 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -221,26 +221,14 @@ static int binder_install_single_page(struct binder_alloc *alloc,
 				      struct binder_lru_page *lru_page,
 				      unsigned long addr)
 {
+	struct vm_area_struct *vma;
 	struct page *page;
-	int ret = 0;
+	long npages;
+	int ret;
 
 	if (!mmget_not_zero(alloc->mm))
 		return -ESRCH;
 
-	/*
-	 * Protected with mmap_sem in write mode as multiple tasks
-	 * might race to install the same page.
-	 */
-	mmap_write_lock(alloc->mm);
-	if (binder_get_installed_page(lru_page))
-		goto out;
-
-	if (!alloc->vma) {
-		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
-		ret = -ESRCH;
-		goto out;
-	}
-
 	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
 	if (!page) {
 		pr_err("%d: failed to allocate page\n", alloc->pid);
@@ -248,19 +236,47 @@ static int binder_install_single_page(struct binder_alloc *alloc,
 		goto out;
 	}
 
-	ret = vm_insert_page(alloc->vma, addr, page);
-	if (ret) {
+	mmap_read_lock(alloc->mm);
+	vma = vma_lookup(alloc->mm, addr);
+	if (!vma || vma != alloc->vma) {
+		__free_page(page);
+		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
+		ret = -ESRCH;
+		goto unlock;
+	}
+
+	ret = vm_insert_page(vma, addr, page);
+	switch (ret) {
+	case -EBUSY:
+		/*
+		 * EBUSY is ok. Someone installed the pte first but the
+		 * lru_page->page_ptr has not been updated yet. Discard
+		 * our page and look up the one already installed.
+		 */
+		ret = 0;
+		__free_page(page);
+		npages = get_user_pages_remote(alloc->mm, addr, 1, 0, &page, NULL);
+		if (npages <= 0) {
+			pr_err("%d: failed to find page at offset %lx\n",
+			       alloc->pid, addr - alloc->buffer);
+			ret = -ESRCH;
+			break;
+		}
+		fallthrough;
+	case 0:
+		/* Mark page installation complete and safe to use */
+		binder_set_installed_page(lru_page, page);
+		break;
+	default:
+		__free_page(page);
 		pr_err("%d: %s failed to insert page at offset %lx with %d\n",
 		       alloc->pid, __func__, addr - alloc->buffer, ret);
-		__free_page(page);
 		ret = -ENOMEM;
-		goto out;
+		break;
 	}
-
-	/* Mark page installation complete and safe to use */
-	binder_set_installed_page(lru_page, page);
+unlock:
+	mmap_read_unlock(alloc->mm);
 out:
-	mmap_write_unlock(alloc->mm);
 	mmput_async(alloc->mm);
 	return ret;
 }
@@ -1091,7 +1107,6 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
 	trace_binder_unmap_kernel_end(alloc, index);
 
 	list_lru_isolate(lru, item);
-	mutex_unlock(&alloc->mutex);
 	spin_unlock(lock);
 
 	if (vma) {
@@ -1102,6 +1117,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
 		trace_binder_unmap_user_end(alloc, index);
 	}
 
+	mutex_unlock(&alloc->mutex);
 	mmap_read_unlock(mm);
 	mmput_async(mm);
 	__free_page(page_to_free);
--
2.47.0.277.g8800431eea-goog
On 08.11.24 20:10, Carlos Llamas wrote:

[...]

> +	ret = vm_insert_page(vma, addr, page);
> +	switch (ret) {
> +	case -EBUSY:
> +		/*
> +		 * EBUSY is ok. Someone installed the pte first but the
> +		 * lru_page->page_ptr has not been updated yet. Discard
> +		 * our page and look up the one already installed.
> +		 */
> +		ret = 0;
> +		__free_page(page);
> +		npages = get_user_pages_remote(alloc->mm, addr, 1, 0, &page, NULL);

This will trigger a page fault if we don't find what we expect (are
races with e.g., MADV_DONTNEED possible?), is that really desired or
not a problem?

-- 
Cheers,

David / dhildenb
On Tue, Nov 12, 2024 at 12:10:20PM +0100, David Hildenbrand wrote:
> On 08.11.24 20:10, Carlos Llamas wrote:
>> +		npages = get_user_pages_remote(alloc->mm, addr, 1, 0, &page, NULL);
>
> This will trigger a page fault if we don't find what we expect (are
> races with e.g., MADV_DONTNEED possible?), is that really desired or
> not a problem?

This is fine. As of now, binder blocks its page faults:

	static vm_fault_t binder_vm_fault(struct vm_fault *vmf)
	{
		return VM_FAULT_SIGBUS;
	}

If we race with something like MADV_DONTNEED then we would just
propagate the -EFAULT error. I could add FOLL_NOFAULT to the gup remote
call to make it evident we don't care though.

-- 
Carlos Llamas
On 12.11.24 15:43, Carlos Llamas wrote:
> This is fine. As of now, binder blocks its page faults:
>
>	static vm_fault_t binder_vm_fault(struct vm_fault *vmf)
>	{
>		return VM_FAULT_SIGBUS;
>	}
>
> If we race with something like MADV_DONTNEED then we would just
> propagate the -EFAULT error. I could add FOLL_NOFAULT to the gup remote
> call to make it evident we don't care though.

Might make it clearer ... or at least add a comment on how this is
supposed to work. :)

-- 
Cheers,

David / dhildenb
On Tue, Nov 12, 2024 at 03:53:51PM +0100, David Hildenbrand wrote:
> Might make it clearer ... or at least add a comment on how this is
> supposed to work. :)

Sounds good. I'll add both the FOLL_NOFAULT flag and a comment for v4.

Cheers,
Carlos Llamas
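For illustration, a minimal sketch of the v4 change agreed above might
look like the following (this is not the actual v4 patch; it assumes
the six-argument get_user_pages_remote() already used in the diff and
the documented FOLL_NOFAULT semantics, i.e. GUP fails rather than
faulting in a missing page):

	/*
	 * Sketch only: FOLL_NOFAULT makes GUP bail out instead of
	 * invoking the fault handler. binder_vm_fault() returns
	 * VM_FAULT_SIGBUS anyway, so a PTE zapped by a racing
	 * MADV_DONTNEED simply surfaces as an error here.
	 */
	npages = get_user_pages_remote(alloc->mm, addr, 1, FOLL_NOFAULT,
				       &page, NULL);
	if (npages <= 0) {
		pr_err("%d: failed to find page at offset %lx\n",
		       alloc->pid, addr - alloc->buffer);
		ret = -ESRCH;
		break;
	}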
* Carlos Llamas <cmllamas@google.com> [241108 14:11]:
> Allow multiple callers to install pages simultaneously by downgrading
> the mmap_sem to non-exclusive mode.

Since we actually allow downgrading of a lock, it would be more clear to
say that you are changing from a write lock to a read lock. I was
searching for the lock downgrade in this patch :)

> Races to the same PTE are handled
> using get_user_pages_remote() to retrieve the already installed page.
> This method significantly reduces contention in the mmap semaphore.

[...]

> -	ret = vm_insert_page(alloc->vma, addr, page);
> -	if (ret) {
> +	mmap_read_lock(alloc->mm);

Ah, I debated bringing this up, but you can do this without the mmap
read lock. You can use rcu and the per-vma locking like in the page
fault path. If you look at how that is done using lock_vma_under_rcu()
(mm/memory.c) then you can see that pages are installed today without
the mmap lock (on some architectures - which includes x86_64 and
arm64).

userfaultfd had to do something like this as well, and created some
abstraction to facilitate either mmap or rcu locking, based on the arch
support. Going to rcu really helped performance there [1]. There was
also a chance of the vma being missed, so it is checked again under the
mmap read lock, but realistically that never happens and exists to
ensure correctness.

You also mention the shrinker and using the alloc->mutex; well, zapping
pages can also be done under the vma-specific lock - if that helps?

Freeing page tables is different though, that needs more locking, but I
don't think this is an issue for you.

[1]. https://lore.kernel.org/all/20240215182756.3448972-1-lokeshgidra@google.com/

> +	vma = vma_lookup(alloc->mm, addr);

Thank you for not saving a pointer anymore.
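For reference, the per-VMA locking pattern Liam describes (used by the
page fault path, and by userfaultfd through its own wrapper) looks
roughly like the sketch below. It assumes CONFIG_PER_VMA_LOCK and the
lock_vma_under_rcu()/vma_end_read() API from mm/memory.c; the
binder-side variable names are illustrative, not taken from the actual
series:

	struct vm_area_struct *vma;

	/* Fast path: per-VMA read lock taken under RCU, no mmap_lock. */
	vma = lock_vma_under_rcu(alloc->mm, addr);
	if (vma) {
		ret = vm_insert_page(vma, addr, page);
		vma_end_read(vma);
	} else {
		/*
		 * Fallback: the VMA was contended or being modified.
		 * Recheck under the mmap read lock for correctness.
		 */
		mmap_read_lock(alloc->mm);
		vma = vma_lookup(alloc->mm, addr);
		ret = vma ? vm_insert_page(vma, addr, page) : -ESRCH;
		mmap_read_unlock(alloc->mm);
	}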
On Fri, Nov 08, 2024 at 03:41:41PM -0500, Liam R. Howlett wrote:
> * Carlos Llamas <cmllamas@google.com> [241108 14:11]:
> > Allow multiple callers to install pages simultaneously by downgrading
> > the mmap_sem to non-exclusive mode.
>
> Since we actually allow downgrading of a lock, it would be more clear to
> say that you are changing from a write lock to a read lock. I was
> searching for the lock downgrade in this patch :)

ah, I can see how the wording here can be misleading. I'll rephrase
that to avoid confusion.

> > -	ret = vm_insert_page(alloc->vma, addr, page);
> > -	if (ret) {
> > +	mmap_read_lock(alloc->mm);
>
> Ah, I debated bringing this up, but you can do this without the mmap
> read lock. You can use rcu and the per-vma locking like in the page
> fault path. If you look at how that is done using lock_vma_under_rcu()
> (mm/memory.c) then you can see that pages are installed today without
> the mmap lock (on some architectures - which includes x86_64 and
> arm64).

Right, per-vma locking is implemented in patch 7 of this series:
https://lore.kernel.org/all/20241108191057.3288442-8-cmllamas@google.com/

> userfaultfd had to do something like this as well, and created some
> abstraction to facilitate either mmap or rcu locking, based on the arch
> support. Going to rcu really helped performance there [1]. There was
> also a chance of the vma being missed, so it is checked again under the
> mmap read lock, but realistically that never happens and exists to
> ensure correctness.

hmm, if there are more users of this pattern in the future then perhaps
it might be worth adding a common helper?

> You also mention the shrinker and using the alloc->mutex; well, zapping
> pages can also be done under the vma-specific lock - if that helps?
>
> Freeing page tables is different though, that needs more locking, but I
> don't think this is an issue for you.
>
> [1]. https://lore.kernel.org/all/20240215182756.3448972-1-lokeshgidra@google.com/

ha, I was not aware of this. I'll have a look, thanks.

> > +	vma = vma_lookup(alloc->mm, addr);
>
> Thank you for not saving a pointer anymore.

lol, thanks to you for initiating this effort!

Cheers,
Carlos Llamas
* Carlos Llamas <cmllamas@google.com> [241108 17:26]:

...

> > Ah, I debated bringing this up, but you can do this without the mmap
> > read lock. You can use rcu and the per-vma locking like in the page
> > fault path. If you look at how that is done using lock_vma_under_rcu()
> > (mm/memory.c) then you can see that pages are installed today without
> > the mmap lock (on some architectures - which includes x86_64 and
> > arm64).
>
> Right, per-vma locking is implemented in patch 7 of this series:
> https://lore.kernel.org/all/20241108191057.3288442-8-cmllamas@google.com/

Oh, sorry I missed that. I had not pulled the entire patch set for
review due to other tasks, but wanted to say something before this went
in.

Are there any performance numbers on how much this helps?

> > userfaultfd had to do something like this as well, and created some
> > abstraction to facilitate either mmap or rcu locking, based on the
> > arch support. Going to rcu really helped performance there [1]. There
> > was also a chance of the vma being missed, so it is checked again
> > under the mmap read lock, but realistically that never happens and
> > exists to ensure correctness.
>
> hmm, if there are more users of this pattern in the future then perhaps
> it might be worth adding a common helper?

Yes. But the userfaultfd case is a bit different in that it could need
to deal with a lazy init issue with anon vmas, iirc. Some of the fine
details have prevented common helpers so far.

> ha, I was not aware of this. I'll have a look, thanks.

Happy to help.

Liam