[PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range

Posted by Anthony Yznaga 10 months, 1 week ago
Unlike the mm of a task, an mshare host mm is not updated on context
switch. In particular this means that mm_cpumask is never updated
which results in TLB flushes for updates to mshare PTEs only being
done on the local CPU. To ensure entries are flushed for non-local
TLBs, set up an mmu notifier on the mshare mm and use the
.arch_invalidate_secondary_tlbs callback to flush all TLBs.
arch_invalidate_secondary_tlbs guarantees that TLB entries will be
flushed before pages are freed when unmapping pages in an mshare region.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index 6bdbcfa8deea..792d86c61042 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,8 +16,10 @@
 #include <linux/fs.h>
 #include <linux/fs_context.h>
 #include <linux/mman.h>
+#include <linux/mmu_notifier.h>
 #include <uapi/linux/magic.h>
 #include <linux/falloc.h>
+#include <asm/tlbflush.h>
 
 const unsigned long mshare_align = P4D_SIZE;
 const unsigned long mshare_base = mshare_align;
@@ -29,6 +31,17 @@ struct mshare_data {
 	unsigned long start;
 	unsigned long size;
 	unsigned long flags;
+	struct mmu_notifier mn;
+};
+
+static void mshare_invalidate_tlbs(struct mmu_notifier *mn, struct mm_struct *mm,
+				   unsigned long start, unsigned long end)
+{
+	flush_tlb_all();
+}
+
+static const struct mmu_notifier_ops mshare_mmu_ops = {
+	.arch_invalidate_secondary_tlbs = mshare_invalidate_tlbs,
 };
 
 static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -237,6 +250,10 @@ msharefs_fill_mm(struct inode *inode)
 	m_data->mm = mm;
 	m_data->start = mshare_base;
 	inode->i_private = m_data;
+	m_data->mn.ops = &mshare_mmu_ops;
+	ret = mmu_notifier_register(&m_data->mn, mm);
+	if (ret)
+		goto err_free;
 
 	return 0;
 
-- 
2.43.5
Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
Posted by Jann Horn 8 months, 2 weeks ago
On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> Unlike the mm of a task, an mshare host mm is not updated on context
> switch. In particular this means that mm_cpumask is never updated
> which results in TLB flushes for updates to mshare PTEs only being
> done on the local CPU. To ensure entries are flushed for non-local
> TLBs, set up an mmu notifier on the mshare mm and use the
> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
> flushed before pages are freed when unmapping pages in an mshare region.

Thanks for working on this, I think this is a really nice feature.

An issue that I think this series doesn't address is:
There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
want to be notified on changes to an mshare VMA; if those are not
invoked, we could get UAF of page contents. So either we propagate MMU
notifier invocations in the host mm into the mshare regions that use
it, or we'd have to somehow prevent a process from using MMU notifiers
and mshare at the same time.
Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
Posted by Anthony Yznaga 8 months, 2 weeks ago

On 5/30/25 7:41 AM, Jann Horn wrote:
> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> Unlike the mm of a task, an mshare host mm is not updated on context
>> switch. In particular this means that mm_cpumask is never updated
>> which results in TLB flushes for updates to mshare PTEs only being
>> done on the local CPU. To ensure entries are flushed for non-local
>> TLBs, set up an mmu notifier on the mshare mm and use the
>> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
>> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
>> flushed before pages are freed when unmapping pages in an mshare region.
> 
> Thanks for working on this, I think this is a really nice feature.
> 
> An issue that I think this series doesn't address is:
> There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
> want to be notified on changes to an mshare VMA; if those are not
> invoked, we could get UAF of page contents. So either we propagate MMU
> notifier invocations in the host mm into the mshare regions that use
> it, or we'd have to somehow prevent a process from using MMU notifiers
> and mshare at the same time.

Thanks, Jann. I've noted this as an issue. Ultimately I think the 
notifier calls will need to be propagated. It's going to be tricky, but 
I have some ideas.
Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
Posted by Jann Horn 8 months, 2 weeks ago
On Fri, May 30, 2025 at 6:30 PM Anthony Yznaga
<anthony.yznaga@oracle.com> wrote:
> On 5/30/25 7:41 AM, Jann Horn wrote:
> > On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> >> Unlike the mm of a task, an mshare host mm is not updated on context
> >> switch. In particular this means that mm_cpumask is never updated
> >> which results in TLB flushes for updates to mshare PTEs only being
> >> done on the local CPU. To ensure entries are flushed for non-local
> >> TLBs, set up an mmu notifier on the mshare mm and use the
> >> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
> >> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
> >> flushed before pages are freed when unmapping pages in an mshare region.
> >
> > Thanks for working on this, I think this is a really nice feature.
> >
> > An issue that I think this series doesn't address is:
> > There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
> > want to be notified on changes to an mshare VMA; if those are not
> > invoked, we could get UAF of page contents. So either we propagate MMU
> > notifier invocations in the host mm into the mshare regions that use
> > it, or we'd have to somehow prevent a process from using MMU notifiers
> > and mshare at the same time.
>
> Thanks, Jann. I've noted this as an issue. Ultimately I think the
> notifier calls will need to be propagated. It's going to be tricky, but
> I have some ideas.

Very naively I think you could basically register your own notifier on
the host mm that has notifier callbacks vaguely like this that walk
the rmap of the mshare file and invoke nested mmu notifiers on each
VMA that maps the file, basically like unmap_mapping_pages() except
that you replace unmap_mapping_range_vma() with a notifier invocation?

static int mshare_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
		const struct mmu_notifier_range *range)
{
	struct vm_area_struct *vma;
	pgoff_t first_index, last_index;

	if (range->end < host_mm->mmap_base)
		return 0;
	first_index = (max(range->start, host_mm->mmap_base) -
		       host_mm->mmap_base) / PAGE_SIZE;
	last_index = (range->end - host_mm->mmap_base) / PAGE_SIZE;
	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, last_index) {
		struct mmu_notifier_range nested_range;

		[... same math as in unmap_mapping_range_tree ...]
		mmu_notifier_range_init(&nested_range, range->event, vma->vm_mm,
					nested_start, nested_end);
		mmu_notifier_invalidate_range_start(&nested_range);
	}
	i_mmap_unlock_read(mapping);
	return 0;
}

And ensure that when mm_take_all_locks() encounters an mshare VMA, it
basically recursively does mm_take_all_locks() on the mshare host mm?

I think that might be enough to make it work, and the rest beyond that
would be optimizations?
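
Very roughly, that recursion might look something like this; note that
vma_is_mshare() and vma_mshare_host_mm() are made-up names standing in
for whatever helpers the series ends up providing, so this is only a
sketch of the idea, not a working implementation:

/*
 * Hypothetical sketch only: after taking all locks in the caller's mm,
 * recurse into the host mm of any mshare VMA encountered so that the
 * host mm's i_mmap locks and mmu notifiers are held as well.
 */
static int mm_take_all_locks_recursive(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);
	int ret;

	ret = mm_take_all_locks(mm);
	if (ret)
		return ret;

	for_each_vma(vmi, vma) {
		if (vma_is_mshare(vma)) {
			/* recurse into the mshare host mm */
			ret = mm_take_all_locks_recursive(vma_mshare_host_mm(vma));
			if (ret) {
				mm_drop_all_locks(mm);
				return ret;
			}
		}
	}
	return 0;
}

The unlock path would need the matching recursive mm_drop_all_locks(),
and some care to avoid lock-ordering issues if multiple processes map
the same mshare region.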
Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
Posted by Anthony Yznaga 8 months, 2 weeks ago

On 5/30/25 10:46 AM, Jann Horn wrote:
> On Fri, May 30, 2025 at 6:30 PM Anthony Yznaga
> <anthony.yznaga@oracle.com> wrote:
>> On 5/30/25 7:41 AM, Jann Horn wrote:
>>> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>>> Unlike the mm of a task, an mshare host mm is not updated on context
>>>> switch. In particular this means that mm_cpumask is never updated
>>>> which results in TLB flushes for updates to mshare PTEs only being
>>>> done on the local CPU. To ensure entries are flushed for non-local
>>>> TLBs, set up an mmu notifier on the mshare mm and use the
>>>> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
>>>> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
>>>> flushed before pages are freed when unmapping pages in an mshare region.
>>>
>>> Thanks for working on this, I think this is a really nice feature.
>>>
>>> An issue that I think this series doesn't address is:
>>> There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
>>> want to be notified on changes to an mshare VMA; if those are not
>>> invoked, we could get UAF of page contents. So either we propagate MMU
>>> notifier invocations in the host mm into the mshare regions that use
>>> it, or we'd have to somehow prevent a process from using MMU notifiers
>>> and mshare at the same time.
>>
>> Thanks, Jann. I've noted this as an issue. Ultimately I think the
>> notifiers calls will need to be propagated. It's going to be tricky, but
>> I have some ideas.
> 
> Very naively I think you could basically register your own notifier on
> the host mm that has notifier callbacks vaguely like this that walk
> the rmap of the mshare file and invoke nested mmu notifiers on each
> VMA that maps the file, basically like unmap_mapping_pages() except
> that you replace unmap_mapping_range_vma() with a notifier invocation?
> 
> static int mshare_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>      const struct mmu_notifier_range *range)
> {
>    struct vm_area_struct *vma;
>    pgoff_t first_index, last_index;
> 
>    if (range->end < host_mm->mmap_base)
>      return 0;
>    first_index = (max(range->start, host_mm->mmap_base) -
> host_mm->mmap_base) / PAGE_SIZE;
>    last_index = (range->end - host_mm->mmap_base) / PAGE_SIZE;
>    i_mmap_lock_read(mapping);
>    vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, last_index) {
>      struct mmu_notifier_range nested_range;
> 
>      [... same math as in unmap_mapping_range_tree ...]
>      mmu_notifier_range_init(&nested_range, range->event, vma->vm_mm,
> nested_start, nested_end);
>      mmu_notifier_invalidate_range_start(&nested_range);
>    }
>    i_mmap_unlock_read(mapping);
> }
> 
> And ensure that when mm_take_all_locks() encounters an mshare VMA, it
> basically recursively does mm_take_all_locks() on the mshare host mm?
> 
> I think that might be enough to make it work, and the rest beyond that
> would be optimizations?

I figured the vma interval tree would need to be walked. I hadn't 
considered mm_take_all_locks(), though. This is definitely a good 
starting point. Thanks for this!