On 5/30/25 10:46 AM, Jann Horn wrote:
> On Fri, May 30, 2025 at 6:30 PM Anthony Yznaga
> <anthony.yznaga@oracle.com> wrote:
>> On 5/30/25 7:41 AM, Jann Horn wrote:
>>> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>>> Unlike the mm of a task, an mshare host mm is not updated on context
>>>> switch. In particular, this means that mm_cpumask is never updated,
>>>> which results in TLB flushes for updates to mshare PTEs only being
>>>> done on the local CPU. To ensure entries are flushed for non-local
>>>> TLBs, set up an mmu notifier on the mshare mm and use the
>>>> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
>>>> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
>>>> flushed before pages are freed when unmapping pages in an mshare region.
>>>
>>> Thanks for working on this, I think this is a really nice feature.
>>>
>>> An issue that I think this series doesn't address is:
>>> There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
>>> want to be notified on changes to an mshare VMA; if those are not
>>> invoked, we could get UAF of page contents. So either we propagate MMU
>>> notifier invocations in the host mm into the mshare regions that use
>>> it, or we'd have to somehow prevent a process from using MMU notifiers
>>> and mshare at the same time.
>>
>> Thanks, Jann. I've noted this as an issue. Ultimately I think the
>> notifier calls will need to be propagated. It's going to be tricky, but
>> I have some ideas.
>
> Very naively I think you could basically register your own notifier on
> the host mm that has notifier callbacks vaguely like this that walk
> the rmap of the mshare file and invoke nested mmu notifiers on each
> VMA that maps the file, basically like unmap_mapping_pages() except
> that you replace unmap_mapping_range_vma() with a notifier invocation?
>
> static int mshare_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 const struct mmu_notifier_range *range)
> {
>         struct vm_area_struct *vma;
>         pgoff_t first_index, last_index;
>
>         if (range->end < host_mm->mmap_base)
>                 return 0;
>         first_index = (max(range->start, host_mm->mmap_base) -
>                        host_mm->mmap_base) / PAGE_SIZE;
>         last_index = (range->end - host_mm->mmap_base) / PAGE_SIZE;
>         i_mmap_lock_read(mapping);
>         vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, last_index) {
>                 struct mmu_notifier_range nested_range;
>
>                 [... same math as in unmap_mapping_range_tree ...]
>                 mmu_notifier_range_init(&nested_range, range->event, vma->vm_mm,
>                                         nested_start, nested_end);
>                 mmu_notifier_invalidate_range_start(&nested_range);
>         }
>         i_mmap_unlock_read(mapping);
>
>         return 0;
> }
>
> And ensure that when mm_take_all_locks() encounters an mshare VMA, it
> basically recursively does mm_take_all_locks() on the mshare host mm?
>
> I think that might be enough to make it work, and the rest beyond that
> would be optimizations?
I figured the vma interval tree would need to be walked. I hadn't
considered mm_take_all_locks(), though. This is definitely a good
starting point. Thanks for this!
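
For the mm_take_all_locks() side, I'm picturing something roughly like
the untested sketch below: when the walk hits an mshare VMA, recurse into
the host mm so that its notifier chain is also stabilized before
registration. mshare_vma_host_mm() here is a hypothetical helper that
returns the host mm backing an mshare VMA (or NULL for an ordinary VMA);
the error-handling and lockdep implications would still need thought.

```c
/*
 * Untested sketch, not real kernel code: extend mm_take_all_locks()
 * so an mshare VMA pulls in all the locks of its host mm as well.
 * mshare_vma_host_mm() is a made-up helper for this sketch.
 */
int mm_take_all_locks(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	/* ... existing locking of mm's own VMAs and notifiers ... */

	for_each_vma(vmi, vma) {
		struct mm_struct *host_mm = mshare_vma_host_mm(vma);

		/* Recurse into the mshare host mm, if there is one. */
		if (host_mm && mm_take_all_locks(host_mm))
			goto out_unlock;
	}

	/* ... */
	return 0;

out_unlock:
	mm_drop_all_locks(mm);
	return -EINTR;
}
```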