[v1] Migrate on fault for device pages

[RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Mika Penttilä 5 months, 4 weeks ago

When doing migration in combination with device fault handling,
detect the case in the interval notifier.

Without that, we would livelock with our own invalidations
while migrating and splitting pages during fault handling.

Note, pgmap_owner, used in some other code paths as owner for filtering,
is not readily available for split path, so use current for this use case.
Also, current and pgmap_owner, both being pointers to memory, can not be
mis-interpreted to each other.

Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>

Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
---
 lib/test_hmm.c   | 5 +++++
 mm/huge_memory.c | 6 +++---
 mm/rmap.c        | 4 ++--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 761725bc713c..cd5c139213be 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
 	    range->owner == dmirror->mdevice)
 		return true;
 
+	if (range->event == MMU_NOTIFY_CLEAR &&
+	    range->owner == current) {
+		return true;
+	}
+
 	if (mmu_notifier_range_blockable(range))
 		mutex_lock(&dmirror->mutex);
 	else if (!mutex_trylock(&dmirror->mutex))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..276e38dd8f68 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3069,9 +3069,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
-				address & HPAGE_PMD_MASK,
-				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+				      address & HPAGE_PMD_MASK,
+				      (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE, current);
 	mmu_notifier_invalidate_range_start(&range);
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	split_huge_pmd_locked(vma, range.start, pmd, freeze);
diff --git a/mm/rmap.c b/mm/rmap.c
index f93ce27132ab..e7829015a40b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2308,8 +2308,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	 * try_to_unmap() must hold a reference on the page.
 	 */
 	range.end = vma_address_end(&pvmw);
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
-				address, range.end);
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+				      address, range.end, current);
 	if (folio_test_hugetlb(folio)) {
 		/*
 		 * If sharing is possible, start and end will be adjusted
-- 
2.50.0

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Jason Gunthorpe 5 months, 4 weeks ago

On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
> When doing migration in combination with device fault handling,
> detect the case in the interval notifier.
> 
> Without that, we would livelock with our own invalidations
> while migrating and splitting pages during fault handling.
> 
> Note, pgmap_owner, used in some other code paths as owner for filtering,
> is not readily available for split path, so use current for this use case.
> Also, current and pgmap_owner, both being pointers to memory, can not be
> mis-interpreted to each other.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leonro@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> 
> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
> ---
>  lib/test_hmm.c   | 5 +++++
>  mm/huge_memory.c | 6 +++---
>  mm/rmap.c        | 4 ++--
>  3 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 761725bc713c..cd5c139213be 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>  	    range->owner == dmirror->mdevice)
>  		return true;
>  
> +	if (range->event == MMU_NOTIFY_CLEAR &&
> +	    range->owner == current) {
> +		return true;
> +	}

I don't understand this, there is nothing in hmm that says only
current can call hmm_range_fault, and indeed most applications won't
even gurantee that.

So if this plan relies on something like the above in drivers I don't
see how it can work.

If this is just some hack for tests, try instead to find a solution
that more accurately matches what a real driver should do.

But this also seems overall troublesome to your goal, if you do a
migrate inside hmm_range_fault() it will generate an invalidation call
back and that will increment the seqlock and we will loop
hmm_range_fault() again which rewalks.

Jason

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Mika Penttilä 5 months, 4 weeks ago

On 8/14/25 15:40, Jason Gunthorpe wrote:
> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
>> When doing migration in combination with device fault handling,
>> detect the case in the interval notifier.
>>
>> Without that, we would livelock with our own invalidations
>> while migrating and splitting pages during fault handling.
>>
>> Note, pgmap_owner, used in some other code paths as owner for filtering,
>> is not readily available for split path, so use current for this use case.
>> Also, current and pgmap_owner, both being pointers to memory, can not be
>> mis-interpreted to each other.
>>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: Leon Romanovsky <leonro@nvidia.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Balbir Singh <balbirs@nvidia.com>
>>
>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
>> ---
>>  lib/test_hmm.c   | 5 +++++
>>  mm/huge_memory.c | 6 +++---
>>  mm/rmap.c        | 4 ++--
>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 761725bc713c..cd5c139213be 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>>  	    range->owner == dmirror->mdevice)
>>  		return true;
>>  
>> +	if (range->event == MMU_NOTIFY_CLEAR &&
>> +	    range->owner == current) {
>> +		return true;
>> +	}
> I don't understand this, there is nothing in hmm that says only
> current can call hmm_range_fault, and indeed most applications won't
> even gurantee that.

No it's the opposite, if we are ourselves the invalidator, don't care.

> So if this plan relies on something like the above in drivers I don't
> see how it can work.
>
> If this is just some hack for tests, try instead to find a solution
> that more accurately matches what a real driver should do.
>
> But this also seems overall troublesome to your goal, if you do a
> migrate inside hmm_range_fault() it will generate an invalidation call
> back and that will increment the seqlock and we will loop
> hmm_range_fault() again which rewalks.

That's the problem this solves.
The semantics is "if we are the invalidator don't wait for invalidate end",
aka don't mmu_interval_set_seq() that would make hang in the next mmu_interval_read_begin(),
waiting the invalidate to end

>
> Jason
>
--Mika

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Jason Gunthorpe 5 months, 4 weeks ago

On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
> 
> On 8/14/25 15:40, Jason Gunthorpe wrote:
> > On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
> >> When doing migration in combination with device fault handling,
> >> detect the case in the interval notifier.
> >>
> >> Without that, we would livelock with our own invalidations
> >> while migrating and splitting pages during fault handling.
> >>
> >> Note, pgmap_owner, used in some other code paths as owner for filtering,
> >> is not readily available for split path, so use current for this use case.
> >> Also, current and pgmap_owner, both being pointers to memory, can not be
> >> mis-interpreted to each other.
> >>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >> Cc: Leon Romanovsky <leonro@nvidia.com>
> >> Cc: Alistair Popple <apopple@nvidia.com>
> >> Cc: Balbir Singh <balbirs@nvidia.com>
> >>
> >> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
> >> ---
> >>  lib/test_hmm.c   | 5 +++++
> >>  mm/huge_memory.c | 6 +++---
> >>  mm/rmap.c        | 4 ++--
> >>  3 files changed, 10 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> >> index 761725bc713c..cd5c139213be 100644
> >> --- a/lib/test_hmm.c
> >> +++ b/lib/test_hmm.c
> >> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
> >>  	    range->owner == dmirror->mdevice)
> >>  		return true;
> >>  
> >> +	if (range->event == MMU_NOTIFY_CLEAR &&
> >> +	    range->owner == current) {
> >> +		return true;
> >> +	}
> > I don't understand this, there is nothing in hmm that says only
> > current can call hmm_range_fault, and indeed most applications won't
> > even gurantee that.
> 
> No it's the opposite, if we are ourselves the invalidator, don't care.

I think you've missed the point, you cannot use range->owner in any
way to tell "we are ourselves the invalidator". It is simply not the
meaning of range->owner.

> > So if this plan relies on something like the above in drivers I don't
> > see how it can work.
> >
> > If this is just some hack for tests, try instead to find a solution
> > that more accurately matches what a real driver should do.
> >
> > But this also seems overall troublesome to your goal, if you do a
> > migrate inside hmm_range_fault() it will generate an invalidation call
> > back and that will increment the seqlock and we will loop
> > hmm_range_fault() again which rewalks.
> 
> That's the problem this solves.
> The semantics is "if we are the invalidator don't wait for invalidate end",
> aka don't mmu_interval_set_seq() that would make hang in the next mmu_interval_read_begin(),
> waiting the invalidate to end

I doubt we can skip mmu_interval_set_seq(), doing so can corrupt concurrent
hmm_range_fault in some other thread.

Nor, as I said, can we actually skip the invalidation toward HW
anyhow.

At the very least this commit message fails to explain the new locking
proposal, or justify why it can work.

Jason

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Mika Penttilä 5 months, 4 weeks ago

On 8/14/25 16:04, Jason Gunthorpe wrote:

> On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
>> On 8/14/25 15:40, Jason Gunthorpe wrote:
>>> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
>>>> When doing migration in combination with device fault handling,
>>>> detect the case in the interval notifier.
>>>>
>>>> Without that, we would livelock with our own invalidations
>>>> while migrating and splitting pages during fault handling.
>>>>
>>>> Note, pgmap_owner, used in some other code paths as owner for filtering,
>>>> is not readily available for split path, so use current for this use case.
>>>> Also, current and pgmap_owner, both being pointers to memory, can not be
>>>> mis-interpreted to each other.
>>>>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>>> Cc: Leon Romanovsky <leonro@nvidia.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Balbir Singh <balbirs@nvidia.com>
>>>>
>>>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
>>>> ---
>>>>  lib/test_hmm.c   | 5 +++++
>>>>  mm/huge_memory.c | 6 +++---
>>>>  mm/rmap.c        | 4 ++--
>>>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 761725bc713c..cd5c139213be 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>>>>  	    range->owner == dmirror->mdevice)
>>>>  		return true;
>>>>  
>>>> +	if (range->event == MMU_NOTIFY_CLEAR &&
>>>> +	    range->owner == current) {
>>>> +		return true;
>>>> +	}
>>> I don't understand this, there is nothing in hmm that says only
>>> current can call hmm_range_fault, and indeed most applications won't
>>> even gurantee that.
>> No it's the opposite, if we are ourselves the invalidator, don't care.
> I think you've missed the point, you cannot use range->owner in any
> way to tell "we are ourselves the invalidator". It is simply not the
> meaning of range->owner.

Usually it is the device but used similarly, look for instance nouveau:


static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
                                         const struct mmu_notifier_range *range,
                                         unsigned long cur_seq)
{
        struct svm_notifier *sn =
                container_of(mni, struct svm_notifier, notifier);

        if (range->event == MMU_NOTIFY_EXCLUSIVE &&
            range->owner == sn->svmm->vmm->cli->drm->dev)
                return true;

Where we return in case are the initiator of the make_device_exclusive. Otherwise,
it would also hang in next mmu_interval_read_begin().

owner is void * and admit used in a creative way here, but it can't be wrongly interpreted
as dev if current.

>
>>> So if this plan relies on something like the above in drivers I don't
>>> see how it can work.
>>>
>>> If this is just some hack for tests, try instead to find a solution
>>> that more accurately matches what a real driver should do.
>>>
>>> But this also seems overall troublesome to your goal, if you do a
>>> migrate inside hmm_range_fault() it will generate an invalidation call
>>> back and that will increment the seqlock and we will loop
>>> hmm_range_fault() again which rewalks.
>> That's the problem this solves.
>> The semantics is "if we are the invalidator don't wait for invalidate end",
>> aka don't mmu_interval_set_seq() that would make hang in the next mmu_interval_read_begin(),
>> waiting the invalidate to end
> I doubt we can skip mmu_interval_set_seq(), doing so can corrupt concurrent
> hmm_range_fault in some other thread.
>
> Nor, as I said, can we actually skip the invalidation toward HW
> anyhow.
>
> At the very least this commit message fails to explain the new locking
> proposal, or justify why it can work.

Yes the commit message could be better. 
But this is essentially the same as nouveau is doing with 
MMU_NOTIFY_EXCLUSIVE handling, just not using the dev as the qualifier,
because that is not easily available in the context.

--Mika

> Jason
>

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Jason Gunthorpe 5 months, 4 weeks ago

On Thu, Aug 14, 2025 at 04:20:42PM +0300, Mika Penttilä wrote:
> 
> On 8/14/25 16:04, Jason Gunthorpe wrote:
> 
> > On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
> >> On 8/14/25 15:40, Jason Gunthorpe wrote:
> >>> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
> >>>> When doing migration in combination with device fault handling,
> >>>> detect the case in the interval notifier.
> >>>>
> >>>> Without that, we would livelock with our own invalidations
> >>>> while migrating and splitting pages during fault handling.
> >>>>
> >>>> Note, pgmap_owner, used in some other code paths as owner for filtering,
> >>>> is not readily available for split path, so use current for this use case.
> >>>> Also, current and pgmap_owner, both being pointers to memory, can not be
> >>>> mis-interpreted to each other.
> >>>>
> >>>> Cc: David Hildenbrand <david@redhat.com>
> >>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Cc: Leon Romanovsky <leonro@nvidia.com>
> >>>> Cc: Alistair Popple <apopple@nvidia.com>
> >>>> Cc: Balbir Singh <balbirs@nvidia.com>
> >>>>
> >>>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
> >>>> ---
> >>>>  lib/test_hmm.c   | 5 +++++
> >>>>  mm/huge_memory.c | 6 +++---
> >>>>  mm/rmap.c        | 4 ++--
> >>>>  3 files changed, 10 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> >>>> index 761725bc713c..cd5c139213be 100644
> >>>> --- a/lib/test_hmm.c
> >>>> +++ b/lib/test_hmm.c
> >>>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
> >>>>  	    range->owner == dmirror->mdevice)
> >>>>  		return true;
> >>>>  
> >>>> +	if (range->event == MMU_NOTIFY_CLEAR &&
> >>>> +	    range->owner == current) {
> >>>> +		return true;
> >>>> +	}
> >>> I don't understand this, there is nothing in hmm that says only
> >>> current can call hmm_range_fault, and indeed most applications won't
> >>> even gurantee that.
> >> No it's the opposite, if we are ourselves the invalidator, don't care.
> > I think you've missed the point, you cannot use range->owner in any
> > way to tell "we are ourselves the invalidator". It is simply not the
> > meaning of range->owner.
> 
> Usually it is the device but used similarly, look for instance nouveau:

Yes, dev is fine, but current or struct task is not reasonable.

> Yes the commit message could be better. 
> But this is essentially the same as nouveau is doing with 
> MMU_NOTIFY_EXCLUSIVE handling, just not using the dev as the
> qualifier,

I wouldn't necesarily assume anything nouveau is correct, but this is
certainly not the same thing. nouveau is trying to eliminate an
unncessary invalidation for it's HW when it knows the memory is
already only private to this local device.

This is not a statement about callchain or recursion.

Jason

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Mika Penttilä 5 months, 4 weeks ago

On 8/14/25 17:11, Jason Gunthorpe wrote:

> On Thu, Aug 14, 2025 at 04:20:42PM +0300, Mika Penttilä wrote:
>> On 8/14/25 16:04, Jason Gunthorpe wrote:
>>
>>> On Thu, Aug 14, 2025 at 03:53:00PM +0300, Mika Penttilä wrote:
>>>> On 8/14/25 15:40, Jason Gunthorpe wrote:
>>>>> On Thu, Aug 14, 2025 at 10:19:26AM +0300, Mika Penttilä wrote:
>>>>>> When doing migration in combination with device fault handling,
>>>>>> detect the case in the interval notifier.
>>>>>>
>>>>>> Without that, we would livelock with our own invalidations
>>>>>> while migrating and splitting pages during fault handling.
>>>>>>
>>>>>> Note, pgmap_owner, used in some other code paths as owner for filtering,
>>>>>> is not readily available for split path, so use current for this use case.
>>>>>> Also, current and pgmap_owner, both being pointers to memory, can not be
>>>>>> mis-interpreted to each other.
>>>>>>
>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>>>>> Cc: Leon Romanovsky <leonro@nvidia.com>
>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>> Cc: Balbir Singh <balbirs@nvidia.com>
>>>>>>
>>>>>> Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
>>>>>> ---
>>>>>>  lib/test_hmm.c   | 5 +++++
>>>>>>  mm/huge_memory.c | 6 +++---
>>>>>>  mm/rmap.c        | 4 ++--
>>>>>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 761725bc713c..cd5c139213be 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -269,6 +269,11 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>>>>>>  	    range->owner == dmirror->mdevice)
>>>>>>  		return true;
>>>>>>  
>>>>>> +	if (range->event == MMU_NOTIFY_CLEAR &&
>>>>>> +	    range->owner == current) {
>>>>>> +		return true;
>>>>>> +	}
>>>>> I don't understand this, there is nothing in hmm that says only
>>>>> current can call hmm_range_fault, and indeed most applications won't
>>>>> even gurantee that.
>>>> No it's the opposite, if we are ourselves the invalidator, don't care.
>>> I think you've missed the point, you cannot use range->owner in any
>>> way to tell "we are ourselves the invalidator". It is simply not the
>>> meaning of range->owner.
>> Usually it is the device but used similarly, look for instance nouveau:
> Yes, dev is fine, but current or struct task is not reasonable.
>
>> Yes the commit message could be better. 
>> But this is essentially the same as nouveau is doing with 
>> MMU_NOTIFY_EXCLUSIVE handling, just not using the dev as the
>> qualifier,
> I wouldn't necesarily assume anything nouveau is correct, but this is
> certainly not the same thing. nouveau is trying to eliminate an
> unncessary invalidation for it's HW when it knows the memory is
> already only private to this local device.

Yes but it is omitting
  mmu_interval_set_seq()
in that case while returning early, which is ok.

as well as hmm test module with :

         * Ignore invalidation callbacks for device private pages since
         * the invalidation is handled as part of the migration process.
         */
        if (range->event == MMU_NOTIFY_MIGRATE &&
            range->owner == dmirror->mdevice)
                return true;


> This is not a statement about callchain or recursion.

In my case the unnecessary invalidation (self caused) causes hang,
so whatever the reason for unnecessary, the mechanism is the same
and used elsewhere in kernel also.

So technically I think that would work. But there are others ways to do that
without using owner, so will look that way in next version.

>
> Jason
>

--Mika

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Jason Gunthorpe 5 months, 4 weeks ago

On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
> as well as hmm test module with :
> 
>          * Ignore invalidation callbacks for device private pages since
>          * the invalidation is handled as part of the migration process.
>          */
>         if (range->event == MMU_NOTIFY_MIGRATE &&
>             range->owner == dmirror->mdevice)
>                 return true;

If I recall this was about a very specific case where migration does a
number of invalidations and some of the earlier ones are known to be
redundant in this specific case. Redundant means it can be ignored
without causing an inconsistency.

Alistair would know, but I assumed this works OK because the above
invalidation doesn't actually go on to free any pages but keeps them
around until a later invalidation?

This is nothing like what your case is talking about.

Jason

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Mika Penttilä 5 months, 4 weeks ago

On 8/14/25 20:20, Jason Gunthorpe wrote:

> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>> as well as hmm test module with :
>>
>>          * Ignore invalidation callbacks for device private pages since
>>          * the invalidation is handled as part of the migration process.
>>          */
>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>             range->owner == dmirror->mdevice)
>>                 return true;
> If I recall this was about a very specific case where migration does a
> number of invalidations and some of the earlier ones are known to be
> redundant in this specific case. Redundant means it can be ignored
> without causing an inconsistency.
>
> Alistair would know, but I assumed this works OK because the above
> invalidation doesn't actually go on to free any pages but keeps them
> around until a later invalidation?
>
> This is nothing like what your case is talking about.

This one is actually pretty similar, MMU_NOTIFY_CLEAR is also fired in migration process
(split case) and invalidation handled part of the migration process.

But I have already a working version without any of that.

>
> Jason
>
--Mika

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Alistair Popple 5 months, 4 weeks ago

On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
> 
> On 8/14/25 20:20, Jason Gunthorpe wrote:
> 
> > On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
> >> as well as hmm test module with :
> >>
> >>          * Ignore invalidation callbacks for device private pages since
> >>          * the invalidation is handled as part of the migration process.
> >>          */
> >>         if (range->event == MMU_NOTIFY_MIGRATE &&
> >>             range->owner == dmirror->mdevice)
> >>                 return true;
> > If I recall this was about a very specific case where migration does a
> > number of invalidations and some of the earlier ones are known to be
> > redundant in this specific case. Redundant means it can be ignored
> > without causing an inconsistency.
> >
> > Alistair would know, but I assumed this works OK because the above
> > invalidation doesn't actually go on to free any pages but keeps them
> > around until a later invalidation?

Right, the pages don't actually get freed because a reference is taken on them
during migrate_vma_setup(). However other device MMU's still need invalidating
because the driver will go on to copy the page after this step. It's just
assumed that the driver is able to be consistent with itself (ie. it will unmap/
invalidate it's own MMU prior to initiating the copy).

In practice I suspect what Mika is running into is that the page table
synchronisation for migration works slightly differently for migrate_vma_*().

Instead of using mmu_interval_notifier's which have a sequence number drivers
typically use normal mmu_notifier's and take a device specific lock to block
page table downgrades (eg. RW -> RO). This ensures it's safe to update the
device page tables with the PFNs/permissions collected in migrate_vma_setup()
(or the new PFN) by blocking other threads from updating the page table.

The ususal problem with this approach is that when migrate_vma_setup() calls
the mmu_notifier it deadlocks on the device specific lock in the notifier
callback because it already holds the lock, which it can't drop before calling
migrate_vma_setup().

I think one of the main benefits of a series which consolidates these two
page-table mirroring techniques into common code would also be to make the
mirroring/invalidation logic the same for migration as hmm_range_fault(). Ie. to
move to mmu_interval notifers with sequence numbers for migration, perhaps with
filtering if required/safe and retries.

 - Alistair

> > This is nothing like what your case is talking about.
> 
> This one is actually pretty similar, MMU_NOTIFY_CLEAR is also fired in migration process
> (split case) and invalidation handled part of the migration process.
> 
> But I have already a working version without any of that.
> 
> >
> > Jason
> >
> --Mika
> 
>

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Mika Penttilä 5 months, 3 weeks ago

On 8/15/25 08:23, Alistair Popple wrote:

> On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
>> On 8/14/25 20:20, Jason Gunthorpe wrote:
>>
>>> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>>>> as well as hmm test module with :
>>>>
>>>>          * Ignore invalidation callbacks for device private pages since
>>>>          * the invalidation is handled as part of the migration process.
>>>>          */
>>>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>>>             range->owner == dmirror->mdevice)
>>>>                 return true;
>>> If I recall this was about a very specific case where migration does a
>>> number of invalidations and some of the earlier ones are known to be
>>> redundant in this specific case. Redundant means it can be ignored
>>> without causing an inconsistency.
>>>
>>> Alistair would know, but I assumed this works OK because the above
>>> invalidation doesn't actually go on to free any pages but keeps them
>>> around until a later invalidation?

Thanks Alistair for your deep insights! 

> Right, the pages don't actually get freed because a reference is taken on them
> during migrate_vma_setup(). However other device MMU's still need invalidating
> because the driver will go on to copy the page after this step. It's just
> assumed that the driver is able to be consistent with itself (ie. it will unmap/
> invalidate it's own MMU prior to initiating the copy).

And reference is taken as well in migrate on fault during hmm_range_fault
if migrating.

>
> In practice I suspect what Mika is running into is that the page table
> synchronisation for migration works slightly differently for migrate_vma_*().
>
> Instead of using mmu_interval_notifier's which have a sequence number drivers
> typically use normal mmu_notifier's and take a device specific lock to block
> page table downgrades (eg. RW -> RO). This ensures it's safe to update the
> device page tables with the PFNs/permissions collected in migrate_vma_setup()
> (or the new PFN) by blocking other threads from updating the page table.
>
> The ususal problem with this approach is that when migrate_vma_setup() calls
> the mmu_notifier it deadlocks on the device specific lock in the notifier
> callback because it already holds the lock, which it can't drop before calling
> migrate_vma_setup().
>
> I think one of the main benefits of a series which consolidates these two
> page-table mirroring techniques into common code would also be to make the
> mirroring/invalidation logic the same for migration as hmm_range_fault(). Ie. to
> move to mmu_interval notifers with sequence numbers for migration, perhaps with
> filtering if required/safe and retries

Yes with the migrate_vma_setup() and collecting removed, the firing of mmu notifiers
and "collecting" are integral part of the hmm_range_fault() flow, so logical to use
interval notifiers for migrate also.

I have removed the commit with the owner games. I studied it more and seems it was added
to mitigate a bug in an early version, which led me to do wrong conclusion of the root cause
of the hang. That version had unbalanced mmu_notifier_invalidate_range_start()
after returning from hmm_range_fault() with EBUSY (after done a folio split).
With that fixed, driving the migrate on fault using the interval notifiers seems to work well, 
filtering MMU_NOTIFY_MIGRATE for device for retries.

>
>  - Alistair

--Mika

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Balbir Singh 5 months, 3 weeks ago

On 8/15/25 17:11, Mika Penttilä wrote:
> On 8/15/25 08:23, Alistair Popple wrote:
> 
>> On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
>>> On 8/14/25 20:20, Jason Gunthorpe wrote:
>>>
>>>> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>>>>> as well as hmm test module with :
>>>>>
>>>>>          * Ignore invalidation callbacks for device private pages since
>>>>>          * the invalidation is handled as part of the migration process.
>>>>>          */
>>>>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>>>>             range->owner == dmirror->mdevice)
>>>>>                 return true;
>>>> If I recall this was about a very specific case where migration does a
>>>> number of invalidations and some of the earlier ones are known to be
>>>> redundant in this specific case. Redundant means it can be ignored
>>>> without causing an inconsistency.
>>>>
>>>> Alistair would know, but I assumed this works OK because the above
>>>> invalidation doesn't actually go on to free any pages but keeps them
>>>> around until a later invalidation?
> 
> Thanks Alistair for your deep insights! 
> 
>> Right, the pages don't actually get freed because a reference is taken on them
>> during migrate_vma_setup(). However other device MMU's still need invalidating
>> because the driver will go on to copy the page after this step. It's just
>> assumed that the driver is able to be consistent with itself (ie. it will unmap/
>> invalidate it's own MMU prior to initiating the copy).
> 
> And reference is taken as well in migrate on fault during hmm_range_fault
> if migrating.
> 
>>
>> In practice I suspect what Mika is running into is that the page table
>> synchronisation for migration works slightly differently for migrate_vma_*().
>>
>> Instead of using mmu_interval_notifier's which have a sequence number drivers
>> typically use normal mmu_notifier's and take a device specific lock to block
>> page table downgrades (eg. RW -> RO). This ensures it's safe to update the
>> device page tables with the PFNs/permissions collected in migrate_vma_setup()
>> (or the new PFN) by blocking other threads from updating the page table.
>>
>> The ususal problem with this approach is that when migrate_vma_setup() calls
>> the mmu_notifier it deadlocks on the device specific lock in the notifier
>> callback because it already holds the lock, which it can't drop before calling
>> migrate_vma_setup().
>>
>> I think one of the main benefits of a series which consolidates these two
>> page-table mirroring techniques into common code would also be to make the
>> mirroring/invalidation logic the same for migration as hmm_range_fault(). Ie. to
>> move to mmu_interval notifers with sequence numbers for migration, perhaps with
>> filtering if required/safe and retries
> 
> Yes with the migrate_vma_setup() and collecting removed, the firing of mmu notifiers
> and "collecting" are integral part of the hmm_range_fault() flow, so logical to use
> interval notifiers for migrate also.
> 
> I have removed the commit with the owner games. I studied it more and seems it was added
> to mitigate a bug in an early version, which led me to do wrong conclusion of the root cause
> of the hang. That version had unbalanced mmu_notifier_invalidate_range_start()
> after returning from hmm_range_fault() with EBUSY (after done a folio split).
> With that fixed, driving the migrate on fault using the interval notifiers seems to work well, 
> filtering MMU_NOTIFY_MIGRATE for device for retries.
> 

So this patch can be ignored in the series?

Balbir

Re: [RFC PATCH 1/4] mm: use current as mmu notifier's owner

Posted by Mika Penttilä 5 months, 3 weeks ago

Hi,

On 8/19/25 07:27, Balbir Singh wrote:

> On 8/15/25 17:11, Mika Penttilä wrote:
>> On 8/15/25 08:23, Alistair Popple wrote:
>>
>>> On Thu, Aug 14, 2025 at 08:45:43PM +0300, Mika Penttilä wrote:
>>>> On 8/14/25 20:20, Jason Gunthorpe wrote:
>>>>
>>>>> On Thu, Aug 14, 2025 at 08:00:01PM +0300, Mika Penttilä wrote:
>>>>>> as well as hmm test module with :
>>>>>>
>>>>>>          * Ignore invalidation callbacks for device private pages since
>>>>>>          * the invalidation is handled as part of the migration process.
>>>>>>          */
>>>>>>         if (range->event == MMU_NOTIFY_MIGRATE &&
>>>>>>             range->owner == dmirror->mdevice)
>>>>>>                 return true;
>>>>> If I recall this was about a very specific case where migration does a
>>>>> number of invalidations and some of the earlier ones are known to be
>>>>> redundant in this specific case. Redundant means it can be ignored
>>>>> without causing an inconsistency.
>>>>>
>>>>> Alistair would know, but I assumed this works OK because the above
>>>>> invalidation doesn't actually go on to free any pages but keeps them
>>>>> around until a later invalidation?
>> Thanks Alistair for your deep insights! 
>>
>>> Right, the pages don't actually get freed because a reference is taken on them
>>> during migrate_vma_setup(). However other device MMU's still need invalidating
>>> because the driver will go on to copy the page after this step. It's just
>>> assumed that the driver is able to be consistent with itself (ie. it will unmap/
>>> invalidate it's own MMU prior to initiating the copy).
>> And reference is taken as well in migrate on fault during hmm_range_fault
>> if migrating.
>>
>>> In practice I suspect what Mika is running into is that the page table
>>> synchronisation for migration works slightly differently for migrate_vma_*().
>>>
>>> Instead of using mmu_interval_notifier's which have a sequence number drivers
>>> typically use normal mmu_notifier's and take a device specific lock to block
>>> page table downgrades (eg. RW -> RO). This ensures it's safe to update the
>>> device page tables with the PFNs/permissions collected in migrate_vma_setup()
>>> (or the new PFN) by blocking other threads from updating the page table.
>>>
>>> The ususal problem with this approach is that when migrate_vma_setup() calls
>>> the mmu_notifier it deadlocks on the device specific lock in the notifier
>>> callback because it already holds the lock, which it can't drop before calling
>>> migrate_vma_setup().
>>>
>>> I think one of the main benefits of a series which consolidates these two
>>> page-table mirroring techniques into common code would also be to make the
>>> mirroring/invalidation logic the same for migration as hmm_range_fault(). Ie. to
>>> move to mmu_interval notifers with sequence numbers for migration, perhaps with
>>> filtering if required/safe and retries
>> Yes with the migrate_vma_setup() and collecting removed, the firing of mmu notifiers
>> and "collecting" are integral part of the hmm_range_fault() flow, so logical to use
>> interval notifiers for migrate also.
>>
>> I have removed the commit with the owner games. I studied it more and seems it was added
>> to mitigate a bug in an early version, which led me to do wrong conclusion of the root cause
>> of the hang. That version had unbalanced mmu_notifier_invalidate_range_start()
>> after returning from hmm_range_fault() with EBUSY (after done a folio split).
>> With that fixed, driving the migrate on fault using the interval notifiers seems to work well, 
>> filtering MMU_NOTIFY_MIGRATE for device for retries.
>>
> So this patch can be ignored in the series?

Yes this can be ignored. I will do more testing and repost after a while with this removed and
possibly with other changes if needed.

>
> Balbir
>
--Mika