migration_entry_wait does not need VMA lock, therefore it can be
dropped before waiting.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/memory.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5caaa4c66ea2..bdf46fdc58d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-			migration_entry_wait(vma->vm_mm, vmf->pmd,
-					     vmf->address);
+			/* Save mm in case VMA lock is dropped */
+			struct mm_struct *mm = vma->vm_mm;
+
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/*
+				 * No need to hold VMA lock for migration.
+				 * WARNING: vma can't be used after this!
+				 */
+				vma_end_read(vma);
+				ret |= VM_FAULT_COMPLETED;
+			}
+			migration_entry_wait(mm, vmf->pmd, vmf->address);
 		} else if (is_device_exclusive_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
--
2.41.0.178.g377b9f9a00-goog
On Mon, Jun 26, 2023 at 09:23:20PM -0700, Suren Baghdasaryan wrote:
> migration_entry_wait does not need VMA lock, therefore it can be
> dropped before waiting.
Hmm, I'm not sure..
Note that we're still dereferencing *vmf->pmd when waiting, while *pmd is
on the page table and IIUC is only guaranteed to be valid if the vma is
still there. Without either the mmap lock or the vma lock, I don't see what
makes sure the pgtable is always there. E.g. IIUC a race can happen where
unmap() runs right after vma_end_read() below but before
pmdp_get_lockless() (inside migration_entry_wait()); then
pmdp_get_lockless() can read random data if the pgtable has been freed.
--
Peter Xu
On Tue, Jun 27, 2023 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Jun 26, 2023 at 09:23:20PM -0700, Suren Baghdasaryan wrote:
> > migration_entry_wait does not need VMA lock, therefore it can be
> > dropped before waiting.
>
> Hmm, I'm not sure..
>
> Note that we're still dereferencing *vmf->pmd when waiting, while *pmd is
> on the page table and IIUC is only guaranteed to be valid if the vma is
> still there. Without either the mmap lock or the vma lock, I don't see what
> makes sure the pgtable is always there. E.g. IIUC a race can happen where
> unmap() runs right after vma_end_read() below but before
> pmdp_get_lockless() (inside migration_entry_wait()); then
> pmdp_get_lockless() can read random data if the pgtable has been freed.
That sounds correct. I thought ptl would keep pmd stable but there is
time between vma_end_read() and spin_lock(ptl) when it can be freed
from under us. I think it would work if we do vma_end_read() after
spin_lock(ptl) but that requires code refactoring. I'll probably drop
this optimization from the patchset for now to keep things simple and
will get back to it later.
Suren Baghdasaryan <surenb@google.com> writes:
> On Tue, Jun 27, 2023 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Mon, Jun 26, 2023 at 09:23:20PM -0700, Suren Baghdasaryan wrote:
>> > migration_entry_wait does not need VMA lock, therefore it can be
>> > dropped before waiting.
>>
>> Hmm, I'm not sure..
>>
>> Note that we're still dereferencing *vmf->pmd when waiting, while *pmd is
>> on the page table and IIUC is only guaranteed to be valid if the vma is
>> still there. Without either the mmap lock or the vma lock, I don't see what
>> makes sure the pgtable is always there. E.g. IIUC a race can happen where
>> unmap() runs right after vma_end_read() below but before
>> pmdp_get_lockless() (inside migration_entry_wait()); then
>> pmdp_get_lockless() can read random data if the pgtable has been freed.
>
> That sounds correct. I thought ptl would keep pmd stable but there is
> time between vma_end_read() and spin_lock(ptl) when it can be freed
> from under us. I think it would work if we do vma_end_read() after
> spin_lock(ptl) but that requires code refactoring. I'll probably drop
> this optimization from the patchset for now to keep things simple and
> will get back to it later.
Oh, thanks Peter, that's a good point. It could be made to work, but I agree
it's probably not worth the code refactoring at this point, so I'm OK if
the optimisation is dropped for now.
Suren Baghdasaryan <surenb@google.com> writes:
> migration_entry_wait does not need VMA lock, therefore it can be
> dropped before waiting.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/memory.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5caaa4c66ea2..bdf46fdc58d6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> entry = pte_to_swp_entry(vmf->orig_pte);
> if (unlikely(non_swap_entry(entry))) {
> if (is_migration_entry(entry)) {
> - migration_entry_wait(vma->vm_mm, vmf->pmd,
> - vmf->address);
> + /* Save mm in case VMA lock is dropped */
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> + /*
> + * No need to hold VMA lock for migration.
> + * WARNING: vma can't be used after this!
> + */
> + vma_end_read(vma);
> + ret |= VM_FAULT_COMPLETED;
Doesn't this need to also set FAULT_FLAG_LOCK_DROPPED to ensure we don't
call vma_end_read() again in __handle_mm_fault()?
> + }
> + migration_entry_wait(mm, vmf->pmd, vmf->address);
> } else if (is_device_exclusive_entry(entry)) {
> vmf->page = pfn_swap_entry_to_page(entry);
> ret = remove_device_exclusive_entry(vmf);
On Tue, Jun 27, 2023 at 1:06 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> > migration_entry_wait does not need VMA lock, therefore it can be
> > dropped before waiting.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > mm/memory.c | 14 ++++++++++++--
> > 1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5caaa4c66ea2..bdf46fdc58d6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3715,8 +3715,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > entry = pte_to_swp_entry(vmf->orig_pte);
> > if (unlikely(non_swap_entry(entry))) {
> > if (is_migration_entry(entry)) {
> > - migration_entry_wait(vma->vm_mm, vmf->pmd,
> > - vmf->address);
> > + /* Save mm in case VMA lock is dropped */
> > + struct mm_struct *mm = vma->vm_mm;
> > +
> > + if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > + /*
> > + * No need to hold VMA lock for migration.
> > + * WARNING: vma can't be used after this!
> > + */
> > + vma_end_read(vma);
> > + ret |= VM_FAULT_COMPLETED;
>
> Doesn't this need to also set FAULT_FLAG_LOCK_DROPPED to ensure we don't
> call vma_end_read() again in __handle_mm_fault()?
Uh, right. Got lost during the last refactoring. Thanks for flagging!
>
> > + }
> > + migration_entry_wait(mm, vmf->pmd, vmf->address);
> > } else if (is_device_exclusive_entry(entry)) {
> > vmf->page = pfn_swap_entry_to_page(entry);
> > ret = remove_device_exclusive_entry(vmf);
>
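
The double-release concern raised above can be shown with a toy model. This
is a sketch under stated assumptions: the MODEL_* flags and model_*()
functions are hypothetical names that only mirror the ones in the thread,
and the caller's "release on exit unless already dropped" behaviour is taken
from the review comment, not checked against the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; names mirror the thread but are not kernel definitions. */
#define MODEL_FLAG_VMA_LOCK	0x1u	/* fault entered with the VMA lock held */
#define MODEL_FLAG_LOCK_DROPPED	0x2u	/* callee already released the lock */

struct model_fault {
	unsigned int flags;
	int lock_depth;	/* 1 while held, 0 once released, < 0 on double release */
};

static void model_end_read(struct model_fault *f)
{
	f->lock_depth--;
}

/* Callee: drops the lock early and, if asked, records that it did. */
static void model_do_swap_page(struct model_fault *f, bool set_dropped)
{
	if (f->flags & MODEL_FLAG_VMA_LOCK) {
		model_end_read(f);
		if (set_dropped)
			f->flags |= MODEL_FLAG_LOCK_DROPPED;
	}
}

/* Caller: releases the lock on exit unless the callee says it already did. */
static int model_handle_fault(bool set_dropped)
{
	struct model_fault f = { .flags = MODEL_FLAG_VMA_LOCK, .lock_depth = 1 };

	assert(f.lock_depth == 1);	/* lock is held on entry */
	model_do_swap_page(&f, set_dropped);
	if ((f.flags & MODEL_FLAG_VMA_LOCK) &&
	    !(f.flags & MODEL_FLAG_LOCK_DROPPED))
		model_end_read(&f);
	return f.lock_depth;
}
```

With the flag set by the callee, model_handle_fault() ends with a depth of
0. Without it, the caller releases a second time and the depth goes
negative, which is the unbalanced unlock the comment warns about.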