[PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio

Posted by Usama Arif 1 year, 5 months ago
lruvec->lru_lock is highly contended and is held when calling
isolate_lru_folios(). If the lru has a long run of consecutive CMA
folios and the allocation type requested is not MIGRATE_MOVABLE,
isolate_lru_folios() can hold the lock for a very long time while it
skips those folios. The vmscan_lru_isolate tracepoint showed that the
skipped count can go above 70k in production, and lockstat shows that
waittime-max is 1000x higher without this patch.
This can cause lockups [1] and high memory pressure for extended
periods of time [2]. Hence release the lock if it's contended when
skipping a folio, to give other tasks a chance to acquire it and not
stall.

[1] https://lore.kernel.org/all/CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevGk3Kg@mail.gmail.com/
[2] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reported-by: Bharata B Rao <bharata@amd.com>
Tested-by: Bharata B Rao <bharata@amd.com>
---
 mm/vmscan.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25e43bb3b574..bf8d39a1ad3e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1695,8 +1695,14 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
 		if (folio_zonenum(folio) > sc->reclaim_idx ||
 				skip_cma(folio, sc)) {
 			nr_skipped[folio_zonenum(folio)] += nr_pages;
-			move_to = &folios_skipped;
-			goto move;
+			list_move(&folio->lru, &folios_skipped);
+			if (!spin_is_contended(&lruvec->lru_lock))
+				continue;
+			if (!list_empty(dst))
+				break;
+			spin_unlock_irq(&lruvec->lru_lock);
+			cond_resched();
+			spin_lock_irq(&lruvec->lru_lock);
 		}
 
 		/*
-- 
2.43.5
Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio
Posted by Bharata B Rao 1 year, 5 months ago
On 20-Aug-24 12:16 AM, Usama Arif wrote:
> lruvec->lru_lock is highly contended and is held when calling
> isolate_lru_folios(). If the lru has a long run of consecutive CMA
> folios and the allocation type requested is not MIGRATE_MOVABLE,
> isolate_lru_folios() can hold the lock for a very long time while it
> skips those folios. The vmscan_lru_isolate tracepoint showed that the
> skipped count can go above 70k in production, and lockstat shows that
> waittime-max is 1000x higher without this patch.
> This can cause lockups [1] and high memory pressure for extended
> periods of time [2]. Hence release the lock if it's contended when
> skipping a folio, to give other tasks a chance to acquire it and not
> stall.
> 
> [1] https://lore.kernel.org/all/CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevGk3Kg@mail.gmail.com/
> [2] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/

Though the above link [2] mentions it, can you explicitly include in
the patch description the specific condition that we saw?

"isolate_lru_folios() can end up scanning through a huge number of 
folios with lruvec spinlock held. For FIO workload, ~150million order=0 
folios were skipped to isolate a few ZONE_DMA folios."

Regards,
Bharata.
Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio
Posted by Andrew Morton 1 year, 5 months ago
On Mon, 19 Aug 2024 19:46:48 +0100 Usama Arif <usamaarif642@gmail.com> wrote:

> lruvec->lru_lock is highly contended and is held when calling
> isolate_lru_folios(). If the lru has a long run of consecutive CMA
> folios and the allocation type requested is not MIGRATE_MOVABLE,
> isolate_lru_folios() can hold the lock for a very long time while it
> skips those folios. The vmscan_lru_isolate tracepoint showed that the
> skipped count can go above 70k in production, and lockstat shows that
> waittime-max is 1000x higher without this patch.
> This can cause lockups [1] and high memory pressure for extended
> periods of time [2]. Hence release the lock if it's contended when
> skipping a folio, to give other tasks a chance to acquire it and not
> stall.
> 
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1695,8 +1695,14 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>  		if (folio_zonenum(folio) > sc->reclaim_idx ||
>  				skip_cma(folio, sc)) {
>  			nr_skipped[folio_zonenum(folio)] += nr_pages;
> -			move_to = &folios_skipped;
> -			goto move;
> +			list_move(&folio->lru, &folios_skipped);
> +			if (!spin_is_contended(&lruvec->lru_lock))
> +				continue;
> +			if (!list_empty(dst))
> +				break;
> +			spin_unlock_irq(&lruvec->lru_lock);
> +			cond_resched();
> +			spin_lock_irq(&lruvec->lru_lock);
>  		}

Oh geeze ugly thing.  Must we do this?

The games that function plays with src, dst and move_to are a bit hard
to follow.  Some tasteful comments explaining what's going on would
help.

Also that test of !list_empty(dst).  It would be helpful to comment the
dynamics which are happening in this case - why we're testing dst here.
Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio
Posted by Usama Arif 1 year, 5 months ago

On 20/08/2024 02:17, Andrew Morton wrote:
> On Mon, 19 Aug 2024 19:46:48 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
> 
>> lruvec->lru_lock is highly contended and is held when calling
>> isolate_lru_folios(). If the lru has a long run of consecutive CMA
>> folios and the allocation type requested is not MIGRATE_MOVABLE,
>> isolate_lru_folios() can hold the lock for a very long time while it
>> skips those folios. The vmscan_lru_isolate tracepoint showed that the
>> skipped count can go above 70k in production, and lockstat shows that
>> waittime-max is 1000x higher without this patch.
>> This can cause lockups [1] and high memory pressure for extended
>> periods of time [2]. Hence release the lock if it's contended when
>> skipping a folio, to give other tasks a chance to acquire it and not
>> stall.
>>
>> ...
>>
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1695,8 +1695,14 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>  		if (folio_zonenum(folio) > sc->reclaim_idx ||
>>  				skip_cma(folio, sc)) {
>>  			nr_skipped[folio_zonenum(folio)] += nr_pages;
>> -			move_to = &folios_skipped;
>> -			goto move;
>> +			list_move(&folio->lru, &folios_skipped);
>> +			if (!spin_is_contended(&lruvec->lru_lock))
>> +				continue;
>> +			if (!list_empty(dst))
>> +				break;
>> +			spin_unlock_irq(&lruvec->lru_lock);
>> +			cond_resched();
>> +			spin_lock_irq(&lruvec->lru_lock);
>>  		}
> 
> Oh geeze ugly thing.  Must we do this?
> 
> The games that function plays with src, dst and move_to are a bit hard
> to follow.  Some tasteful comments explaining what's going on would
> help.
> 
> Also that test of !list_empty(dst).  It would be helpful to comment the
> dynamics which are happening in this case - why we're testing dst here.
> 
> 

So Johannes pointed out to me that this is not going to properly fix the problem of holding the lru_lock for a long time that was introduced in [1], for two reasons:
- The task doing the lock break hoards folios on folios_skipped, making the lru shorter. I didn't see this in the usecase I was trying, but yielding the lock to the other task may not be of much use: that task would walk a much shorter or even empty lru list and OOM, while the folios it is looking for sit on folios_skipped. We would be substituting one OOM problem for another with this patch.
- Compaction goes through pages by pfn rather than through the lru list, and since this patch does not clear the lru flag, compaction could claim a folio while it sits on folios_skipped. A rough sketch of this race follows below.
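
To illustrate the second point, here is a minimal sketch of the race
(compaction_can_steal() is a made-up helper for illustration; the real
logic lives in compaction's pfn walker, isolate_migratepages_block()):

#include <linux/mm.h>	/* struct folio, folio_test_lru() */

static bool compaction_can_steal(struct folio *folio)
{
        /*
         * Compaction found this folio by pfn, so it never sees our
         * private folios_skipped list; PG_lru is its only signal that
         * the folio still belongs to the LRU.
         */
        if (!folio_test_lru(folio))
                return false;   /* already isolated by someone else */

        /*
         * folios_skipped entries still have PG_lru set, so this
         * test-and-clear succeeds and compaction takes ownership of a
         * folio that is sitting on our local list.
         */
        return folio_test_clear_lru(folio);
}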

The patch in [1] is severely breaking production at Meta, and it's not a proper fix to the problem the commit was trying to solve. It results in holding the lru_lock for a very significant amount of time, stalling all other processes trying to reclaim memory, creating very high memory pressure for long periods of time and causing OOMs.

The way forward would be to revert it and then try to come up with a longer-term solution to the problem the original commit tried to solve. If no one is opposed to it, I will wait a couple of days for comments and send a revert patch.

[1] https://lore.kernel.org/all/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com/

Thanks,
Usama
Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio
Posted by Zhaoyang Huang 1 year, 5 months ago
On Tue, Aug 20, 2024 at 11:45 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 20/08/2024 02:17, Andrew Morton wrote:
> > On Mon, 19 Aug 2024 19:46:48 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
> >
> >> lruvec->lru_lock is highly contended and is held when calling
> >> isolate_lru_folios(). If the lru has a long run of consecutive CMA
> >> folios and the allocation type requested is not MIGRATE_MOVABLE,
> >> isolate_lru_folios() can hold the lock for a very long time while it
> >> skips those folios. The vmscan_lru_isolate tracepoint showed that the
> >> skipped count can go above 70k in production, and lockstat shows that
> >> waittime-max is 1000x higher without this patch.
> >> This can cause lockups [1] and high memory pressure for extended
> >> periods of time [2]. Hence release the lock if it's contended when
> >> skipping a folio, to give other tasks a chance to acquire it and not
> >> stall.
> >>
> >> ...
> >>
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1695,8 +1695,14 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> >>              if (folio_zonenum(folio) > sc->reclaim_idx ||
> >>                              skip_cma(folio, sc)) {
> >>                      nr_skipped[folio_zonenum(folio)] += nr_pages;
> >> -                    move_to = &folios_skipped;
> >> -                    goto move;
> >> +                    list_move(&folio->lru, &folios_skipped);
> >> +                    if (!spin_is_contended(&lruvec->lru_lock))
> >> +                            continue;
> >> +                    if (!list_empty(dst))
> >> +                            break;
> >> +                    spin_unlock_irq(&lruvec->lru_lock);
> >> +                    cond_resched();
> >> +                    spin_lock_irq(&lruvec->lru_lock);
> >>              }
> >
> > Oh geeze ugly thing.  Must we do this?
> >
> > The games that function plays with src, dst and move_to are a bit hard
> > to follow.  Some tasteful comments explaining what's going on would
> > help.
> >
> > Also that test of !list_empty(dst).  It would be helpful to comment the
> > dynamics which are happening in this case - why we're testing dst here.
> >
> >
>
> So Johannes pointed out to me that this is not going to properly fix the problem of holding the lru_lock for a long time that was introduced in [1], for two reasons:
> - The task doing the lock break hoards folios on folios_skipped, making the lru shorter. I didn't see this in the usecase I was trying, but yielding the lock to the other task may not be of much use: that task would walk a much shorter or even empty lru list and OOM, while the folios it is looking for sit on folios_skipped. We would be substituting one OOM problem for another with this patch.
Other tasks cannot get folios either if they cannot use CMA after the
reclaim, even without the original patch. I am ok with your fix patch,
except that the 'if (!list_empty(dst))' check puzzled me.
> - Compaction goes through pages by pfn rather than through the lru list, and since this patch does not clear the lru flag, compaction could claim a folio while it sits on folios_skipped.
>
> The patch in [1] is severely breaking production at Meta, and it's not a proper fix to the problem the commit was trying to solve. It results in holding the lru_lock for a very significant amount of time, stalling all other processes trying to reclaim memory, creating very high memory pressure for long periods of time and causing OOMs.
>
> The way forward would be to revert it and then try to come up with a longer-term solution to the problem the original commit tried to solve. If no one is opposed to it, I will wait a couple of days for comments and send a revert patch.
I mainly focus on Android systems, which do not hit the scenarios that
servers encountered. I agree with reverting it if this fix cannot be
accepted.
>
> [1] https://lore.kernel.org/all/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com/
>
> Thanks,
> Usama
Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio
Posted by Usama Arif 1 year, 5 months ago

On 21/08/2024 01:51, Zhaoyang Huang wrote:
> On Tue, Aug 20, 2024 at 11:45 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 20/08/2024 02:17, Andrew Morton wrote:
>>> On Mon, 19 Aug 2024 19:46:48 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
>>>
>>>> lruvec->lru_lock is highly contended and is held when calling
>>>> isolate_lru_folios(). If the lru has a long run of consecutive CMA
>>>> folios and the allocation type requested is not MIGRATE_MOVABLE,
>>>> isolate_lru_folios() can hold the lock for a very long time while it
>>>> skips those folios. The vmscan_lru_isolate tracepoint showed that the
>>>> skipped count can go above 70k in production, and lockstat shows that
>>>> waittime-max is 1000x higher without this patch.
>>>> This can cause lockups [1] and high memory pressure for extended
>>>> periods of time [2]. Hence release the lock if it's contended when
>>>> skipping a folio, to give other tasks a chance to acquire it and not
>>>> stall.
>>>>
>>>> ...
>>>>
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1695,8 +1695,14 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>>>              if (folio_zonenum(folio) > sc->reclaim_idx ||
>>>>                              skip_cma(folio, sc)) {
>>>>                      nr_skipped[folio_zonenum(folio)] += nr_pages;
>>>> -                    move_to = &folios_skipped;
>>>> -                    goto move;
>>>> +                    list_move(&folio->lru, &folios_skipped);
>>>> +                    if (!spin_is_contended(&lruvec->lru_lock))
>>>> +                            continue;
>>>> +                    if (!list_empty(dst))
>>>> +                            break;
>>>> +                    spin_unlock_irq(&lruvec->lru_lock);
>>>> +                    cond_resched();
>>>> +                    spin_lock_irq(&lruvec->lru_lock);
>>>>              }
>>>
>>> Oh geeze ugly thing.  Must we do this?
>>>
>>> The games that function plays with src, dst and move_to are a bit hard
>>> to follow.  Some tasteful comments explaining what's going on would
>>> help.
>>>
>>> Also that test of !list_empty(dst).  It would be helpful to comment the
>>> dynamics which are happening in this case - why we're testing dst here.
>>>
>>>
>>
>> So Johannes pointed out to me that this is not going to properly fix the problem of holding the lru_lock for a long time that was introduced in [1], for two reasons:
>> - The task doing the lock break hoards folios on folios_skipped, making the lru shorter. I didn't see this in the usecase I was trying, but yielding the lock to the other task may not be of much use: that task would walk a much shorter or even empty lru list and OOM, while the folios it is looking for sit on folios_skipped. We would be substituting one OOM problem for another with this patch.
> Other tasks cannot get folios either if they cannot use CMA after the
> reclaim, even without the original patch. I am ok with your fix patch,
> except that the 'if (!list_empty(dst))' check puzzled me.

The 'if (!list_empty(dst))' check was there so that, if the spinlock is contended and we are still skipping, shrink_active_list()/shrink_inactive_list() can still use the folios already isolated onto dst rather than continuing to skip.

Both the original patch and this fix substitute one OOM problem for another.
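
To spell out the intended control flow, here is the hunk again with
comments added (paraphrased from the patch above, not a new version):

        if (folio_zonenum(folio) > sc->reclaim_idx ||
                        skip_cma(folio, sc)) {
                nr_skipped[folio_zonenum(folio)] += nr_pages;
                /* Park the unusable folio on a local list. */
                list_move(&folio->lru, &folios_skipped);

                /* Nobody is waiting for the lock: keep scanning. */
                if (!spin_is_contended(&lruvec->lru_lock))
                        continue;

                /*
                 * Contended, but we have already isolated something:
                 * stop and let the caller work with what is on dst
                 * instead of burning more lock hold time on skips.
                 */
                if (!list_empty(dst))
                        break;

                /* Contended and nothing isolated yet: yield and retry. */
                spin_unlock_irq(&lruvec->lru_lock);
                cond_resched();
                spin_lock_irq(&lruvec->lru_lock);
        }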

>> - Compaction goes through pages by pfn rather than through the lru list, and since this patch does not clear the lru flag, compaction could claim a folio while it sits on folios_skipped.
>>
>> The patch in [1] is severely breaking production at Meta, and it's not a proper fix to the problem the commit was trying to solve. It results in holding the lru_lock for a very significant amount of time, stalling all other processes trying to reclaim memory, creating very high memory pressure for long periods of time and causing OOMs.
>>
>> The way forward would be to revert it and then try to come up with a longer-term solution to the problem the original commit tried to solve. If no one is opposed to it, I will wait a couple of days for comments and send a revert patch.
> I mainly focus on Android systems, which do not hit the scenarios that
> servers encountered. I agree with reverting it if this fix cannot be
> accepted.
>>
Yeah, I think the second point about compaction is the bigger issue. We could try fixing it by clearing the lru flag, but the first point would remain, so I will send a revert for the original patch.
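
For reference, an untested sketch of what clearing the lru flag could
look like (illustrative only, reusing the names from the hunk above):

        if (folio_zonenum(folio) > sc->reclaim_idx ||
                        skip_cma(folio, sc)) {
                nr_skipped[folio_zonenum(folio)] += nr_pages;
                /*
                 * Claim the folio first, mirroring folio_isolate_lru():
                 * with PG_lru cleared, pfn walkers such as compaction
                 * leave it alone while it sits on folios_skipped.
                 */
                if (folio_test_clear_lru(folio))
                        list_move(&folio->lru, &folios_skipped);
                /*
                 * PG_lru would then have to be set again, under the
                 * lru_lock, when folios_skipped is spliced back onto
                 * the lru. The hoarding problem (the first point above)
                 * remains either way, hence the revert.
                 */
        }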

Thanks

>> [1] https://lore.kernel.org/all/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com/
>>
>> Thanks,
>> Usama
Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio
Posted by Breno Leitao 1 year, 5 months ago
On Tue, Aug 20, 2024 at 11:45:11AM -0400, Usama Arif wrote:

> So Johannes pointed out to me that this is not going to properly fix
> the problem of holding the lru_lock for a long time that was
> introduced in [1], for two reasons:
> - The task doing the lock break hoards folios on folios_skipped,
> making the lru shorter. I didn't see this in the usecase I was trying,
> but yielding the lock to the other task may not be of much use: that
> task would walk a much shorter or even empty lru list and OOM, while
> the folios it is looking for sit on folios_skipped. We would be
> substituting one OOM problem for another with this patch.
> - Compaction goes through pages by pfn rather than through the lru
> list, and since this patch does not clear the lru flag, compaction
> could claim a folio while it sits on folios_skipped.
> 
> The patch in [1] is severely breaking production at Meta, and it's
> not a proper fix to the problem the commit was trying to solve. It
> results in holding the lru_lock for a very significant amount of
> time, stalling all other processes trying to reclaim memory, creating
> very high memory pressure for long periods of time and causing OOMs.
> 
> The way forward would be to revert it and then try to come up with a
> longer-term solution to the problem the original commit tried to
> solve. If no one is opposed to it, I will wait a couple of days for
> comments and send a revert patch.

I agree with the concern, but for a different reason. Commit
5da226dbfce3 ("mm: skip CMA pages when they are not available") was
intended as an optimization, but it changed the behavior of the
isolate_lru_folios() function in a way that had significant, unintended
consequences.

One such consequence was a notable increase in lock contention, as
described in [1]. Addressing this lock contention issue with a quick fix
seems like a suboptimal solution for such a core part of the system.

Instead, a better approach would be to rethink the original
optimization. Rather than applying a band-aid to the lock contention
problem, it would be more prudent to revisit the changes introduced by
commit 5da226dbfce3 and explore alternative optimization strategies that
do not have such far-reaching and difficult-to-diagnose effects.

[1] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/