[PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()

Posted by Qi Zheng 1 week, 1 day ago
From: Muchun Song <songmuchun@bytedance.com>

The maintenance of the folio->_deferred_list is intricate because it's
reused in a local list.

Here are some peculiarities:

   1) When a folio is removed from its split queue and added to a local
      on-stack list in deferred_split_scan(), the ->split_queue_len isn't
      updated, leading to an inconsistency between it and the actual
      number of folios in the split queue.

   2) When the folio is split via split_folio() later, it's removed from
      the local list while holding the split queue lock. At this time,
      this lock protects the local list, not the split queue.

   3) To handle the race condition with a third-party freeing or migrating
      the preceding folio, we must ensure there's always one safe folio
      (with a raised refcount) kept before it by delaying its folio_put().
      More details can be found in commit e66f3185fa04 ("mm/thp: fix
      deferred split queue not partially_mapped"). It's rather tricky.
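
To see where these come from, here is the old flow, condensed from the
code removed below (a sketch only; the stats handling and the split
itself are omitted):

	LIST_HEAD(list);
	struct folio *folio, *next, *prev = NULL;

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
				 _deferred_list) {
		if (folio_try_get(folio))
			/* note: ->split_queue_len is not updated: 1) */
			list_move(&folio->_deferred_list, &list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		/*
		 * split_folio() unlinks the folio under the split queue
		 * lock, which here actually guards "list": 2)
		 */
		...
		/* keep one pinned folio behind us at all times: 3) */
		swap(folio, prev);
		if (folio)
			folio_put(folio);
	}

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_splice_tail(&list, &ds_queue->split_queue);
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	if (prev)
		folio_put(prev);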

We can use the folio_batch infrastructure to handle this clearly. In this
case, ->split_queue_len will be consistent with the real number of folios
in the split queue. If list_empty(&folio->_deferred_list) returns false,
it's clear the folio must be in its split queue (not in a local list
anymore).
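
The new flow, condensed from the diff below (again a sketch; the retry
loop, sc->nr_to_scan and the partially-mapped accounting are omitted):

	struct folio_batch fbatch;
	struct folio *folio, *next;
	int i;

	folio_batch_init(&fbatch);

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
				 _deferred_list) {
		if (folio_try_get(folio))
			folio_batch_add(&fbatch, folio);
		/* really unqueued, and the count stays exact: 1) */
		list_del_init(&folio->_deferred_list);
		ds_queue->split_queue_len--;
		if (!folio_batch_space(&fbatch))
			break;
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		folio = fbatch.folios[i];
		/*
		 * The folio is on no list here, so list_empty() on its
		 * _deferred_list is meaningful again: 2)
		 */
		/* try to split it, or re-add it under its queue lock */
	}

	/* exactly one folio_put() per pinned folio, batched: 3) */
	folios_put(&fbatch);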

In the future, we will reparent LRU folios during memcg offline to
eliminate dying memory cgroups, which requires reparenting the split queue
to its parent first. So this patch prepares for that by using
folio_split_queue_lock_irqsave(), since the folio's memcg (and thus its
split queue) may change by then.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/huge_memory.c | 84 ++++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 46 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f41b8f0d4871..48b51e6230a67 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3781,21 +3781,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct lruvec *lruvec;
 		int expected_refs;
 
-		if (folio_order(folio) > 1 &&
-		    !list_empty(&folio->_deferred_list)) {
-			ds_queue->split_queue_len--;
+		if (folio_order(folio) > 1) {
+			if (!list_empty(&folio->_deferred_list)) {
+				ds_queue->split_queue_len--;
+				/*
+				 * Reinitialize page_deferred_list after removing the
+				 * page from the split_queue, otherwise a subsequent
+				 * split will see list corruption when checking the
+				 * page_deferred_list.
+				 */
+				list_del_init(&folio->_deferred_list);
+			}
 			if (folio_test_partially_mapped(folio)) {
 				folio_clear_partially_mapped(folio);
 				mod_mthp_stat(folio_order(folio),
 					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 			}
-			/*
-			 * Reinitialize page_deferred_list after removing the
-			 * page from the split_queue, otherwise a subsequent
-			 * split will see list corruption when checking the
-			 * page_deferred_list.
-			 */
-			list_del_init(&folio->_deferred_list);
 		}
 		split_queue_unlock(ds_queue);
 		if (mapping) {
@@ -4194,40 +4195,44 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
-	struct folio *folio, *next, *prev = NULL;
-	int split = 0, removed = 0;
+	struct folio *folio, *next;
+	int split = 0, i;
+	struct folio_batch fbatch;
 
 #ifdef CONFIG_MEMCG
 	if (sc->memcg)
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&fbatch);
+retry:
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
 		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
+			folio_batch_add(&fbatch, folio);
+		} else if (folio_test_partially_mapped(folio)) {
 			/* We lost race with folio_put() */
-			if (folio_test_partially_mapped(folio)) {
-				folio_clear_partially_mapped(folio);
-				mod_mthp_stat(folio_order(folio),
-					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
-			}
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+			folio_clear_partially_mapped(folio);
+			mod_mthp_stat(folio_order(folio),
+				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
+		list_del_init(&folio->_deferred_list);
+		ds_queue->split_queue_len--;
 		if (!--sc->nr_to_scan)
 			break;
+		if (!folio_batch_space(&fbatch))
+			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	for (i = 0; i < folio_batch_count(&fbatch); i++) {
 		bool did_split = false;
 		bool underused = false;
+		struct deferred_split *fqueue;
 
+		folio = fbatch.folios[i];
 		if (!folio_test_partially_mapped(folio)) {
 			/*
 			 * See try_to_map_unused_to_zeropage(): we cannot
@@ -4250,38 +4255,25 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		}
 		folio_unlock(folio);
 next:
+		if (did_split || !folio_test_partially_mapped(folio))
+			continue;
 		/*
-		 * split_folio() removes folio from list on success.
 		 * Only add back to the queue if folio is partially mapped.
 		 * If thp_underused returns false, or if split_folio fails
 		 * in the case it was underused, then consider it used and
 		 * don't add it back to split_queue.
 		 */
-		if (did_split) {
-			; /* folio already removed from list */
-		} else if (!folio_test_partially_mapped(folio)) {
-			list_del_init(&folio->_deferred_list);
-			removed++;
-		} else {
-			/*
-			 * That unlocked list_del_init() above would be unsafe,
-			 * unless its folio is separated from any earlier folios
-			 * left on the list (which may be concurrently unqueued)
-			 * by one safe folio with refcount still raised.
-			 */
-			swap(folio, prev);
+		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
+		if (list_empty(&folio->_deferred_list)) {
+			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
+			fqueue->split_queue_len++;
 		}
-		if (folio)
-			folio_put(folio);
+		split_queue_unlock_irqrestore(fqueue, flags);
 	}
+	folios_put(&fbatch);
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
-	ds_queue->split_queue_len -= removed;
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
-
-	if (prev)
-		folio_put(prev);
+	if (sc->nr_to_scan)
+		goto retry;
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
-- 
2.20.1
Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
Posted by David Hildenbrand 1 week ago
On 23.09.25 11:16, Qi Zheng wrote:
[...]

Nothing jumped at me

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb
Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
Posted by Zi Yan 1 week, 1 day ago
On 23 Sep 2025, at 5:16, Qi Zheng wrote:

> From: Muchun Song <songmuchun@bytedance.com>
>
> The maintenance of the folio->_deferred_list is intricate because it's
> reused in a local list.
>
> Here are some peculiarities:
>
>    1) When a folio is removed from its split queue and added to a local
>       on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>       updated, leading to an inconsistency between it and the actual
>       number of folios in the split queue.
>
>    2) When the folio is split via split_folio() later, it's removed from
>       the local list while holding the split queue lock. At this time,
>       this lock protects the local list, not the split queue.
>
>    3) To handle the race condition with a third-party freeing or migrating
>       the preceding folio, we must ensure there's always one safe folio
>       (with a raised refcount) kept before it by delaying its folio_put().
>       More details can be found in commit e66f3185fa04 ("mm/thp: fix
>       deferred split queue not partially_mapped"). It's rather tricky.
>
> We can use the folio_batch infrastructure to handle this clearly. In this

Can you add more details on how folio_batch handles the above three concerns
in the original code? That would guide me on what to look for during code review.

[...]
> @@ -3781,21 +3781,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		struct lruvec *lruvec;
>  		int expected_refs;
>
> -		if (folio_order(folio) > 1 &&
> -		    !list_empty(&folio->_deferred_list)) {
> -			ds_queue->split_queue_len--;
> +		if (folio_order(folio) > 1) {
> +			if (!list_empty(&folio->_deferred_list)) {
> +				ds_queue->split_queue_len--;
> +				/*
> +				 * Reinitialize page_deferred_list after removing the
> +				 * page from the split_queue, otherwise a subsequent
> +				 * split will see list corruption when checking the
> +				 * page_deferred_list.
> +				 */
> +				list_del_init(&folio->_deferred_list);
> +			}
>  			if (folio_test_partially_mapped(folio)) {
>  				folio_clear_partially_mapped(folio);
>  				mod_mthp_stat(folio_order(folio),
>  					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>  			}

folio_test_partially_mapped() is done regardless of whether the folio is on the
_deferred_list or not. Is it because the folio can be on a folio batch while its
_deferred_list is empty?

[...]
> +	folio_batch_init(&fbatch);
> +retry:
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	/* Take pin on all head pages to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
>  		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> +			folio_batch_add(&fbatch, folio);
> +		} else if (folio_test_partially_mapped(folio)) {
>  			/* We lost race with folio_put() */
> -			if (folio_test_partially_mapped(folio)) {
> -				folio_clear_partially_mapped(folio);
> -				mod_mthp_stat(folio_order(folio),
> -					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> -			}
> -			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +			folio_clear_partially_mapped(folio);
> +			mod_mthp_stat(folio_order(folio),
> +				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>  		}
> +		list_del_init(&folio->_deferred_list);
> +		ds_queue->split_queue_len--;

At this point, the folio can be in one of the following conditions:
1. deferred_split_scan() gets it,
2. it is freed by folio_put().

In both cases, it is removed from the deferred_split_queue, right?

[...]
> +		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
> +		if (list_empty(&folio->_deferred_list)) {
> +			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
> +			fqueue->split_queue_len++;

fqueue should be the same as ds_queue, right? Just want to make sure
I understand the code.

[...]


Best Regards,
Yan, Zi
Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
Posted by Qi Zheng 1 week ago
Hi Zi,

On 9/23/25 11:31 PM, Zi Yan wrote:
> On 23 Sep 2025, at 5:16, Qi Zheng wrote:
> 
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> The maintenance of the folio->_deferred_list is intricate because it's
>> reused in a local list.
>>
>> Here are some peculiarities:
>>
>>     1) When a folio is removed from its split queue and added to a local
>>        on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>>        updated, leading to an inconsistency between it and the actual
>>        number of folios in the split queue.
>>
>>     2) When the folio is split via split_folio() later, it's removed from
>>        the local list while holding the split queue lock. At this time,
>>        this lock protects the local list, not the split queue.
>>
>>     3) To handle the race condition with a third-party freeing or migrating
>>        the preceding folio, we must ensure there's always one safe folio
>>        (with a raised refcount) kept before it by delaying its folio_put().
>>        More details can be found in commit e66f3185fa04 ("mm/thp: fix
>>        deferred split queue not partially_mapped"). It's rather tricky.
>>
>> We can use the folio_batch infrastructure to handle this clearly. In this
> 
> Can you add more details on how folio_batch handles the above three concerns
>> in the original code? That would guide me on what to look for during code review.

Sure.

For 1), after adding folio to folio_batch, we immediately decrement the
ds_queue->split_queue_len, so there are no more inconsistencies.

For 2), after adding folio to folio_batch, we will see list_empty() in
__folio_split(), so there is no longer a situation where
split_queue_lock protects the local list.

For 3), after adding folio to folio_batch, we call folios_put() at the
end to decrement the refcount of folios, which looks more natural.
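
To make 3) concrete: folios_put(&fbatch) is conceptually equivalent to
the sketch below (the real implementation just batches the freeing), so
every folio pinned by folio_try_get() gets exactly one put, and no "keep
one safe folio ahead of us" trick is needed anymore:

	int i;

	for (i = 0; i < folio_batch_count(&fbatch); i++)
		folio_put(fbatch.folios[i]);	/* drop the folio_try_get() pin */
	folio_batch_reinit(&fbatch);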

[...]
>> -		if (folio_order(folio) > 1 &&
>> -		    !list_empty(&folio->_deferred_list)) {
>> -			ds_queue->split_queue_len--;
>> +		if (folio_order(folio) > 1) {
>> +			if (!list_empty(&folio->_deferred_list)) {
>> +				ds_queue->split_queue_len--;
>> +				/*
>> +				 * Reinitialize page_deferred_list after removing the
>> +				 * page from the split_queue, otherwise a subsequent
>> +				 * split will see list corruption when checking the
>> +				 * page_deferred_list.
>> +				 */
>> +				list_del_init(&folio->_deferred_list);
>> +			}
>>   			if (folio_test_partially_mapped(folio)) {
>>   				folio_clear_partially_mapped(folio);
>>   				mod_mthp_stat(folio_order(folio),
>>   					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>>   			}
> 
> folio_test_partially_mapped() is done regardless of whether the folio is on the
> _deferred_list or not. Is it because the folio can be on a folio batch while its
> _deferred_list is empty?

Yes.

[...]
>> +	folio_batch_init(&fbatch);
>> +retry:
>>   	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>   	/* Take pin on all head pages to avoid freeing them under us */
>>   	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>   							_deferred_list) {
>>   		if (folio_try_get(folio)) {
>> -			list_move(&folio->_deferred_list, &list);
>> -		} else {
>> +			folio_batch_add(&fbatch, folio);
>> +		} else if (folio_test_partially_mapped(folio)) {
>>   			/* We lost race with folio_put() */
>> -			if (folio_test_partially_mapped(folio)) {
>> -				folio_clear_partially_mapped(folio);
>> -				mod_mthp_stat(folio_order(folio),
>> -					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>> -			}
>> -			list_del_init(&folio->_deferred_list);
>> -			ds_queue->split_queue_len--;
>> +			folio_clear_partially_mapped(folio);
>> +			mod_mthp_stat(folio_order(folio),
>> +				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>>   		}
>> +		list_del_init(&folio->_deferred_list);
>> +		ds_queue->split_queue_len--;
> 
> At this point, the folio can be in one of the following conditions:
> 1. deferred_split_scan() gets it,
> 2. it is freed by folio_put().
>
> In both cases, it is removed from the deferred_split_queue, right?

Right. For case 1), we may add the folio back to the deferred_split_queue.

[...]
>> +		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
>> +		if (list_empty(&folio->_deferred_list)) {
>> +			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
>> +			fqueue->split_queue_len++;
> 
> fqueue should be the same as ds_queue, right? Just want to make sure
> I understand the code.

After patch #4, fqueue may be the deferred_split queue of the parent memcg.
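
Roughly speaking, and purely as a hypothetical sketch (the real helper is
introduced in patch 2/4 of this series, and folio_split_queue() below is
a made-up name for the queue lookup), it behaves like:

	static struct deferred_split *
	folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
	{
		struct deferred_split *queue;

		/* resolve the folio's current queue: its memcg's, or the node's */
		queue = folio_split_queue(folio);
		spin_lock_irqsave(&queue->split_queue_lock, *flags);
		return queue;
	}

The real version presumably also revalidates the memcg after taking the
lock, since reparenting can race with the lookup. That is why the requeue
path takes the lock through the folio instead of reusing ds_queue.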

Thanks,
Qi

Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
Posted by Zi Yan 1 week ago
On 24 Sep 2025, at 5:57, Qi Zheng wrote:

[...]
>> Can you add more details on how folio_batch handles the above three concerns
>> in the original code? That would guide me on what to look for during code review.
>
> Sure.
>
> For 1), after adding folio to folio_batch, we immediately decrement the
> ds_queue->split_queue_len, so there are no more inconsistencies.
>
> For 2), after adding folio to folio_batch, we will see list_empty() in
> __folio_split(), so there is no longer a situation where
> split_queue_lock protects the local list.
>
> For 3), after adding folio to folio_batch, we call folios_put() at the
> end to decrement the refcount of folios, which looks more natural.
[...]

Thank you for the explanation. The changes look good to me.

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi