Allow multiple callers to install pages simultaneously by switching the
mmap_sem from write-mode to read-mode. Races to the same PTE are handled
using get_user_pages_remote() to retrieve the already installed page.
This method significantly reduces contention in the mmap semaphore.

To ensure safety, vma_lookup() is used (instead of alloc->vma) to avoid
operating on an isolated VMA. In addition, zap_page_range_single() is
called under the alloc->mutex to avoid racing with the shrinker.

Many thanks to Barry Song who posted a similar approach [1].

Link: https://lore.kernel.org/all/20240902225009.34576-1-21cnbao@gmail.com/ [1]
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
---
drivers/android/binder_alloc.c | 65 +++++++++++++++++++++-------------
1 file changed, 41 insertions(+), 24 deletions(-)
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 52f6aa3232e1..f26283c2c768 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -221,26 +221,14 @@ static int binder_install_single_page(struct binder_alloc *alloc,
struct binder_lru_page *lru_page,
unsigned long addr)
{
+ struct vm_area_struct *vma;
struct page *page;
- int ret = 0;
+ long npages;
+ int ret;
if (!mmget_not_zero(alloc->mm))
return -ESRCH;
- /*
- * Protected with mmap_sem in write mode as multiple tasks
- * might race to install the same page.
- */
- mmap_write_lock(alloc->mm);
- if (binder_get_installed_page(lru_page))
- goto out;
-
- if (!alloc->vma) {
- pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
- ret = -ESRCH;
- goto out;
- }
-
page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
if (!page) {
pr_err("%d: failed to allocate page\n", alloc->pid);
@@ -248,19 +236,48 @@ static int binder_install_single_page(struct binder_alloc *alloc,
goto out;
}
- ret = vm_insert_page(alloc->vma, addr, page);
- if (ret) {
+ mmap_read_lock(alloc->mm);
+ vma = vma_lookup(alloc->mm, addr);
+ if (!vma || vma != alloc->vma) {
+ __free_page(page);
+ pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
+ ret = -ESRCH;
+ goto unlock;
+ }
+
+ ret = vm_insert_page(vma, addr, page);
+ switch (ret) {
+ case -EBUSY:
+ /*
+ * EBUSY is ok. Someone installed the pte first but the
+ * lru_page->page_ptr has not been updated yet. Discard
+ * our page and look up the one already installed.
+ */
+ ret = 0;
+ __free_page(page);
+ npages = get_user_pages_remote(alloc->mm, addr, 1,
+ FOLL_NOFAULT, &page, NULL);
+ if (npages <= 0) {
+ pr_err("%d: failed to find page at offset %lx\n",
+ alloc->pid, addr - alloc->buffer);
+ ret = -ESRCH;
+ break;
+ }
+ fallthrough;
+ case 0:
+ /* Mark page installation complete and safe to use */
+ binder_set_installed_page(lru_page, page);
+ break;
+ default:
+ __free_page(page);
pr_err("%d: %s failed to insert page at offset %lx with %d\n",
alloc->pid, __func__, addr - alloc->buffer, ret);
- __free_page(page);
ret = -ENOMEM;
- goto out;
+ break;
}
-
- /* Mark page installation complete and safe to use */
- binder_set_installed_page(lru_page, page);
+unlock:
+ mmap_read_unlock(alloc->mm);
out:
- mmap_write_unlock(alloc->mm);
mmput_async(alloc->mm);
return ret;
}
@@ -1090,7 +1107,6 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
trace_binder_unmap_kernel_end(alloc, index);
list_lru_isolate(lru, item);
- mutex_unlock(&alloc->mutex);
spin_unlock(&lru->lock);
if (vma) {
@@ -1101,6 +1117,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
trace_binder_unmap_user_end(alloc, index);
}
+ mutex_unlock(&alloc->mutex);
mmap_read_unlock(mm);
mmput_async(mm);
__free_page(page_to_free);
--
2.47.0.338.g60cca15819-goog

On Tue, Dec 3, 2024 at 10:55 PM Carlos Llamas <cmllamas@google.com> wrote:
>
> Allow multiple callers to install pages simultaneously by switching the
> mmap_sem from write-mode to read-mode. Races to the same PTE are handled
> using get_user_pages_remote() to retrieve the already installed page.
> This method significantly reduces contention in the mmap semaphore.
>
> To ensure safety, vma_lookup() is used (instead of alloc->vma) to avoid
> operating on an isolated VMA. In addition, zap_page_range_single() is
> called under the alloc->mutex to avoid racing with the shrinker.

How do you avoid racing with the shrinker? You don't hold the mutex
when binder_install_single_page is called.

E.g. consider this execution:

1. binder_alloc_new_buf finishes allocating the struct binder_buffer
and unlocks the mutex.
2. Shrinker starts running, locks the mutex, sets the page pointer to
NULL and unlocks the lru spinlock. The mutex is still held.
3. binder_install_buffer_pages is called and since the page pointer is
NULL, binder_install_single_page is called.
4. binder_install_single_page allocates a page and tries to
vm_insert_page it. It gets an EBUSY error because the shrinker has not
yet called zap_page_range_single.
5. binder_install_single_page looks up the page with
get_user_pages_remote. The page is written back to the pages array.
6. The shrinker calls zap_page_range_single followed by
binder_free_page(page_to_free).
7. The page has now been freed and zapped, but it's in the page array. UAF.

Is there something I'm missing?

Alice

On Wed, Dec 04, 2024 at 10:59:19AM +0100, Alice Ryhl wrote:
> On Tue, Dec 3, 2024 at 10:55 PM Carlos Llamas <cmllamas@google.com> wrote:
> >
> > Allow multiple callers to install pages simultaneously by switching the
> > mmap_sem from write-mode to read-mode. Races to the same PTE are handled
> > using get_user_pages_remote() to retrieve the already installed page.
> > This method significantly reduces contention in the mmap semaphore.
> >
> > To ensure safety, vma_lookup() is used (instead of alloc->vma) to avoid
> > operating on an isolated VMA. In addition, zap_page_range_single() is
> > called under the alloc->mutex to avoid racing with the shrinker.
>
> How do you avoid racing with the shrinker? You don't hold the mutex
> when binder_install_single_page is called.
>
> E.g. consider this execution:
>
> 1. binder_alloc_new_buf finishes allocating the struct binder_buffer
> and unlocks the mutex.

By the time the mutex is released in binder_alloc_new_buf() all the
pages that will be used have been removed from the freelist and the
shrinker will have no access to them.

> 2. Shrinker starts running, locks the mutex, sets the page pointer to
> NULL and unlocks the lru spinlock. The mutex is still held.
> 3. binder_install_buffer_pages is called and since the page pointer is
> NULL, binder_install_single_page is called.
> 4. binder_install_single_page allocates a page and tries to
> vm_insert_page it. It gets an EBUSY error because the shrinker has not
> yet called zap_page_range_single.
> 5. binder_install_single_page looks up the page with
> get_user_pages_remote. The page is written back to the pages array.
> 6. The shrinker calls zap_page_range_single followed by
> binder_free_page(page_to_free).
> 7. The page has now been freed and zapped, but it's in the page array. UAF.
>
> Is there something I'm missing?

I think that would be the call to binder_lru_freelist_del().

On Wed, Dec 4, 2024 at 2:39 PM Carlos Llamas <cmllamas@google.com> wrote:
>
> On Wed, Dec 04, 2024 at 10:59:19AM +0100, Alice Ryhl wrote:
> > On Tue, Dec 3, 2024 at 10:55 PM Carlos Llamas <cmllamas@google.com> wrote:
> > >
> > > Allow multiple callers to install pages simultaneously by switching the
> > > mmap_sem from write-mode to read-mode. Races to the same PTE are handled
> > > using get_user_pages_remote() to retrieve the already installed page.
> > > This method significantly reduces contention in the mmap semaphore.
> > >
> > > To ensure safety, vma_lookup() is used (instead of alloc->vma) to avoid
> > > operating on an isolated VMA. In addition, zap_page_range_single() is
> > > called under the alloc->mutex to avoid racing with the shrinker.
> >
> > How do you avoid racing with the shrinker? You don't hold the mutex
> > when binder_install_single_page is called.
> >
> > E.g. consider this execution:
> >
> > 1. binder_alloc_new_buf finishes allocating the struct binder_buffer
> > and unlocks the mutex.
>
> By the time the mutex is released in binder_alloc_new_buf() all the
> pages that will be used have been removed from the freelist and the
> shrinker will have no access to them.
>
> > 2. Shrinker starts running, locks the mutex, sets the page pointer to
> > NULL and unlocks the lru spinlock. The mutex is still held.
> > 3. binder_install_buffer_pages is called and since the page pointer is
> > NULL, binder_install_single_page is called.
> > 4. binder_install_single_page allocates a page and tries to
> > vm_insert_page it. It gets an EBUSY error because the shrinker has not
> > yet called zap_page_range_single.
> > 5. binder_install_single_page looks up the page with
> > get_user_pages_remote. The page is written back to the pages array.
> > 6. The shrinker calls zap_page_range_single followed by
> > binder_free_page(page_to_free).
> > 7. The page has now been freed and zapped, but it's in the page array. UAF.
> >
> > Is there something I'm missing?
>
> I think that would be the call to binder_lru_freelist_del().

Tricky stuff. But okay, I buy it. This logic works.

Alice
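
The argument the thread settles on can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: struct demo_page, freelist_del(), shrinker_scan() and run_freelist_demo() are hypothetical names, and the real binder_lru_freelist_del() works on list_lru entries with alloc->mutex held. The point it shows is that pages claimed by a buffer leave the freelist before the buffer is handed out, so a later shrinker pass cannot pick them, which is why step 2 of the scenario above cannot happen for those pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Hypothetical per-page state, loosely modelled on struct binder_lru_page. */
struct demo_page {
	bool installed;		/* a backing page is present */
	bool on_freelist;	/* visible to the shrinker */
};

static struct demo_page pages[NPAGES];

/*
 * Analogue of binder_lru_freelist_del(): in the driver this happens with
 * the allocator mutex held, before the new buffer is returned.
 */
static void freelist_del(size_t start, size_t end)
{
	assert(start <= end && end <= NPAGES);
	for (size_t i = start; i < end; i++)
		pages[i].on_freelist = false;
}

/* Analogue of the shrinker callback: it only sees pages on the freelist. */
static size_t shrinker_scan(void)
{
	size_t reclaimed = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		if (pages[i].on_freelist && pages[i].installed) {
			pages[i].installed = false;
			pages[i].on_freelist = false;
			reclaimed++;
		}
	}
	return reclaimed;
}

/* Claims pages 0 and 1 for a buffer, then runs the shrinker once. */
size_t run_freelist_demo(void)
{
	for (size_t i = 0; i < NPAGES; i++)
		pages[i] = (struct demo_page){ .installed = true,
					       .on_freelist = true };

	freelist_del(0, 2);
	return shrinker_scan();
}
```

In this model run_freelist_demo() reclaims only the two pages that were left on the freelist, and pages 0 and 1 stay installed.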