[RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable

Minchan Kim posted 1 patch 4 years, 6 months ago
include/linux/swap.h | 14 ++------------
mm/cma.c             |  5 +++++
mm/swap.c            | 30 ++++++++++++++++++++++++++++--
3 files changed, 35 insertions(+), 14 deletions(-)
[RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Posted by Minchan Kim 4 years, 6 months ago
lru_cache_disable involves IPIs to drain pagevec of each core,
which sometimes takes quite long time to complete depending
on cpu's business, which makes allocation too slow up to
sveral hundredth milliseconds. Furthermore, the repeated draining
in the alloc_contig_range makes thing worse considering caller
of alloc_contig_range usually tries multiple times in the loop.

This patch makes the lru_cache_disable aware of the fact the
pagevec was already disabled. With that, user of alloc_contig_range
can disable the lru cache in advance in their context during the
repeated trial so they can avoid the multiple costly draining
in cma allocation.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 * from v1 - https://lore.kernel.org/lkml/20211206221006.946661-1-minchan@kernel.org/
   * fix lru_cache_disable race - akpm

 include/linux/swap.h | 14 ++------------
 mm/cma.c             |  5 +++++
 mm/swap.c            | 30 ++++++++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ba52f3a3478e..fe18e86a4f13 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -348,19 +348,9 @@ extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
 extern void mark_page_accessed(struct page *);
 
-extern atomic_t lru_disable_count;
-
-static inline bool lru_cache_disabled(void)
-{
-	return atomic_read(&lru_disable_count);
-}
-
-static inline void lru_cache_enable(void)
-{
-	atomic_dec(&lru_disable_count);
-}
-
+extern bool lru_cache_disabled(void);
 extern void lru_cache_disable(void);
+extern void lru_cache_enable(void);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
diff --git a/mm/cma.c b/mm/cma.c
index 995e15480937..60be555c5b95 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -30,6 +30,7 @@
 #include <linux/cma.h>
 #include <linux/highmem.h>
 #include <linux/io.h>
+#include <linux/swap.h>
 #include <linux/kmemleak.h>
 #include <trace/events/cma.h>
 
@@ -453,6 +454,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 	if (bitmap_count > bitmap_maxno)
 		goto out;
 
+	lru_cache_disable();
+
 	for (;;) {
 		spin_lock_irq(&cma->lock);
 		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
@@ -492,6 +495,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 		start = bitmap_no + mask + 1;
 	}
 
+	lru_cache_enable();
+
 	trace_cma_alloc_finish(cma->name, pfn, page, count, align);
 
 	/*
diff --git a/mm/swap.c b/mm/swap.c
index af3cad4e5378..5f89d7c9a54e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -847,7 +847,17 @@ void lru_add_drain_all(void)
 }
 #endif /* CONFIG_SMP */
 
-atomic_t lru_disable_count = ATOMIC_INIT(0);
+static atomic_t lru_disable_count = ATOMIC_INIT(0);
+
+bool lru_cache_disabled(void)
+{
+	return atomic_read(&lru_disable_count) != 0;
+}
+
+void lru_cache_enable(void)
+{
+	atomic_dec(&lru_disable_count);
+}
 
 /*
  * lru_cache_disable() needs to be called before we start compiling
@@ -859,7 +869,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
  */
 void lru_cache_disable(void)
 {
-	atomic_inc(&lru_disable_count);
+	static DEFINE_MUTEX(lock);
+
+	/*
+	 * The lock gaurantees lru_cache is drained when the function
+	 * returned.
+	 */
+	mutex_lock(&lock);
+	/*
+	 * If someone is already disabled lru_cache, just return with
+	 * increasing the lru_disable_count.
+	 */
+	if (atomic_inc_not_zero(&lru_disable_count)) {
+		mutex_unlock(&lock);
+		return;
+	}
 #ifdef CONFIG_SMP
 	/*
 	 * lru_add_drain_all in the force mode will schedule draining on
@@ -873,6 +897,8 @@ void lru_cache_disable(void)
 #else
 	lru_add_and_bh_lrus_drain();
 #endif
+	atomic_inc(&lru_disable_count);
+	mutex_unlock(&lock);
 }
 
 /**
-- 
2.34.1.448.ga2b2bfdf31-goog

Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Posted by Minchan Kim 4 years, 5 months ago
On Thu, Dec 30, 2021 at 11:36:27AM -0800, Minchan Kim wrote:
> lru_cache_disable involves IPIs to drain pagevec of each core,
> which sometimes takes quite long time to complete depending
> on cpu's business, which makes allocation too slow up to
> sveral hundredth milliseconds. Furthermore, the repeated draining
> in the alloc_contig_range makes thing worse considering caller
> of alloc_contig_range usually tries multiple times in the loop.
> 
> This patch makes the lru_cache_disable aware of the fact the
> pagevec was already disabled. With that, user of alloc_contig_range
> can disable the lru cache in advance in their context during the
> repeated trial so they can avoid the multiple costly draining
> in cma allocation.

Hi Folks,

Any comment to proceed the work?

> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  * from v1 - https://lore.kernel.org/lkml/20211206221006.946661-1-minchan@kernel.org/
>    * fix lru_cache_disable race - akpm
> 
>  include/linux/swap.h | 14 ++------------
>  mm/cma.c             |  5 +++++
>  mm/swap.c            | 30 ++++++++++++++++++++++++++++--
>  3 files changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ba52f3a3478e..fe18e86a4f13 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -348,19 +348,9 @@ extern void lru_note_cost_page(struct page *);
>  extern void lru_cache_add(struct page *);
>  extern void mark_page_accessed(struct page *);
>  
> -extern atomic_t lru_disable_count;
> -
> -static inline bool lru_cache_disabled(void)
> -{
> -	return atomic_read(&lru_disable_count);
> -}
> -
> -static inline void lru_cache_enable(void)
> -{
> -	atomic_dec(&lru_disable_count);
> -}
> -
> +extern bool lru_cache_disabled(void);
>  extern void lru_cache_disable(void);
> +extern void lru_cache_enable(void);
>  extern void lru_add_drain(void);
>  extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_cpu_zone(struct zone *zone);
> diff --git a/mm/cma.c b/mm/cma.c
> index 995e15480937..60be555c5b95 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -30,6 +30,7 @@
>  #include <linux/cma.h>
>  #include <linux/highmem.h>
>  #include <linux/io.h>
> +#include <linux/swap.h>
>  #include <linux/kmemleak.h>
>  #include <trace/events/cma.h>
>  
> @@ -453,6 +454,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  	if (bitmap_count > bitmap_maxno)
>  		goto out;
>  
> +	lru_cache_disable();
> +
>  	for (;;) {
>  		spin_lock_irq(&cma->lock);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> @@ -492,6 +495,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  		start = bitmap_no + mask + 1;
>  	}
>  
> +	lru_cache_enable();
> +
>  	trace_cma_alloc_finish(cma->name, pfn, page, count, align);
>  
>  	/*
> diff --git a/mm/swap.c b/mm/swap.c
> index af3cad4e5378..5f89d7c9a54e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -847,7 +847,17 @@ void lru_add_drain_all(void)
>  }
>  #endif /* CONFIG_SMP */
>  
> -atomic_t lru_disable_count = ATOMIC_INIT(0);
> +static atomic_t lru_disable_count = ATOMIC_INIT(0);
> +
> +bool lru_cache_disabled(void)
> +{
> +	return atomic_read(&lru_disable_count) != 0;
> +}
> +
> +void lru_cache_enable(void)
> +{
> +	atomic_dec(&lru_disable_count);
> +}
>  
>  /*
>   * lru_cache_disable() needs to be called before we start compiling
> @@ -859,7 +869,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	static DEFINE_MUTEX(lock);
> +
> +	/*
> +	 * The lock gaurantees lru_cache is drained when the function
> +	 * returned.
> +	 */
> +	mutex_lock(&lock);
> +	/*
> +	 * If someone is already disabled lru_cache, just return with
> +	 * increasing the lru_disable_count.
> +	 */
> +	if (atomic_inc_not_zero(&lru_disable_count)) {
> +		mutex_unlock(&lock);
> +		return;
> +	}
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -873,6 +897,8 @@ void lru_cache_disable(void)
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> +	atomic_inc(&lru_disable_count);
> +	mutex_unlock(&lock);
>  }
>  
>  /**
> -- 
> 2.34.1.448.ga2b2bfdf31-goog
> 
Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Posted by Michal Hocko 4 years, 5 months ago
On Thu 30-12-21 11:36:27, Minchan Kim wrote:
> lru_cache_disable involves IPIs to drain pagevec of each core,
> which sometimes takes quite long time to complete depending
> on cpu's business, which makes allocation too slow up to
> sveral hundredth milliseconds. Furthermore, the repeated draining
> in the alloc_contig_range makes thing worse considering caller
> of alloc_contig_range usually tries multiple times in the loop.
>
> This patch makes the lru_cache_disable aware of the fact the
> pagevec was already disabled. With that, user of alloc_contig_range
> can disable the lru cache in advance in their context during the
> repeated trial so they can avoid the multiple costly draining
> in cma allocation.

Do you have any numbers on any improvements?

Now to the change. I do not like this much to be honest. LRU cache
disabling is a complex synchronization scheme implemented in
__lru_add_drain_all now you are stacking another level on top of that.

More fundamentally though. I am not sure I understand the problem TBH.
What prevents you from calling lru_cache_disable at the cma level in the
first place?

> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  * from v1 - https://lore.kernel.org/lkml/20211206221006.946661-1-minchan@kernel.org/
>    * fix lru_cache_disable race - akpm
> 
>  include/linux/swap.h | 14 ++------------
>  mm/cma.c             |  5 +++++
>  mm/swap.c            | 30 ++++++++++++++++++++++++++++--
>  3 files changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ba52f3a3478e..fe18e86a4f13 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -348,19 +348,9 @@ extern void lru_note_cost_page(struct page *);
>  extern void lru_cache_add(struct page *);
>  extern void mark_page_accessed(struct page *);
>  
> -extern atomic_t lru_disable_count;
> -
> -static inline bool lru_cache_disabled(void)
> -{
> -	return atomic_read(&lru_disable_count);
> -}
> -
> -static inline void lru_cache_enable(void)
> -{
> -	atomic_dec(&lru_disable_count);
> -}
> -
> +extern bool lru_cache_disabled(void);
>  extern void lru_cache_disable(void);
> +extern void lru_cache_enable(void);
>  extern void lru_add_drain(void);
>  extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_cpu_zone(struct zone *zone);
> diff --git a/mm/cma.c b/mm/cma.c
> index 995e15480937..60be555c5b95 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -30,6 +30,7 @@
>  #include <linux/cma.h>
>  #include <linux/highmem.h>
>  #include <linux/io.h>
> +#include <linux/swap.h>
>  #include <linux/kmemleak.h>
>  #include <trace/events/cma.h>
>  
> @@ -453,6 +454,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  	if (bitmap_count > bitmap_maxno)
>  		goto out;
>  
> +	lru_cache_disable();
> +
>  	for (;;) {
>  		spin_lock_irq(&cma->lock);
>  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> @@ -492,6 +495,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>  		start = bitmap_no + mask + 1;
>  	}
>  
> +	lru_cache_enable();
> +
>  	trace_cma_alloc_finish(cma->name, pfn, page, count, align);
>  
>  	/*
> diff --git a/mm/swap.c b/mm/swap.c
> index af3cad4e5378..5f89d7c9a54e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -847,7 +847,17 @@ void lru_add_drain_all(void)
>  }
>  #endif /* CONFIG_SMP */
>  
> -atomic_t lru_disable_count = ATOMIC_INIT(0);
> +static atomic_t lru_disable_count = ATOMIC_INIT(0);
> +
> +bool lru_cache_disabled(void)
> +{
> +	return atomic_read(&lru_disable_count) != 0;
> +}
> +
> +void lru_cache_enable(void)
> +{
> +	atomic_dec(&lru_disable_count);
> +}
>  
>  /*
>   * lru_cache_disable() needs to be called before we start compiling
> @@ -859,7 +869,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	static DEFINE_MUTEX(lock);
> +
> +	/*
> +	 * The lock gaurantees lru_cache is drained when the function
> +	 * returned.
> +	 */
> +	mutex_lock(&lock);
> +	/*
> +	 * If someone is already disabled lru_cache, just return with
> +	 * increasing the lru_disable_count.
> +	 */
> +	if (atomic_inc_not_zero(&lru_disable_count)) {
> +		mutex_unlock(&lock);
> +		return;
> +	}
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -873,6 +897,8 @@ void lru_cache_disable(void)
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> +	atomic_inc(&lru_disable_count);
> +	mutex_unlock(&lock);
>  }
>  
>  /**
> -- 
> 2.34.1.448.ga2b2bfdf31-goog

-- 
Michal Hocko
SUSE Labs
Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Posted by Minchan Kim 4 years, 5 months ago
On Mon, Jan 17, 2022 at 02:47:06PM +0100, Michal Hocko wrote:
> On Thu 30-12-21 11:36:27, Minchan Kim wrote:
> > lru_cache_disable involves IPIs to drain pagevec of each core,
> > which sometimes takes quite long time to complete depending
> > on cpu's business, which makes allocation too slow up to
> > sveral hundredth milliseconds. Furthermore, the repeated draining
> > in the alloc_contig_range makes thing worse considering caller
> > of alloc_contig_range usually tries multiple times in the loop.
> >
> > This patch makes the lru_cache_disable aware of the fact the
> > pagevec was already disabled. With that, user of alloc_contig_range
> > can disable the lru cache in advance in their context during the
> > repeated trial so they can avoid the multiple costly draining
> > in cma allocation.
> 
> Do you have any numbers on any improvements?

The LRU draining consumed above 50% overhead for the 20M CMA alloc.

> 
> Now to the change. I do not like this much to be honest. LRU cache
> disabling is a complex synchronization scheme implemented in
> __lru_add_drain_all now you are stacking another level on top of that.
> 
> More fundamentally though. I am not sure I understand the problem TBH.

The problem is that kinds of IPI using normal prority workqueue to drain
takes much time depending on the system CPU business.

> What prevents you from calling lru_cache_disable at the cma level in the
> first place?

You meant moving the call from alloc_contig_range to caller layer?
So, virtio_mem_fake_online, too? It could and make sense from
performance perspective since upper layer usually calls the
alloc_contig_range multiple times on retrial loop.

Havid said, semantically, not good in that why upper layer should
know how alloc_contig_range works(LRU disable is too low level stuff)
internally but I chose the performance here.

There is an example why the stacking is needed.
cma_alloc also can be called from outside.
A usecase is try to call

    lru_cache_disable
    for (order = 10; order >= 0; order) {
        page = cma_alloc(1<<order)
        if (page)
            break;
    }
    lru_cacne_enable

Here, putting the disable lru outside of cma_alloc is
much better than inside. That's why I put it outside.

> 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  * from v1 - https://lore.kernel.org/lkml/20211206221006.946661-1-minchan@kernel.org/
> >    * fix lru_cache_disable race - akpm
> > 
> >  include/linux/swap.h | 14 ++------------
> >  mm/cma.c             |  5 +++++
> >  mm/swap.c            | 30 ++++++++++++++++++++++++++++--
> >  3 files changed, 35 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ba52f3a3478e..fe18e86a4f13 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -348,19 +348,9 @@ extern void lru_note_cost_page(struct page *);
> >  extern void lru_cache_add(struct page *);
> >  extern void mark_page_accessed(struct page *);
> >  
> > -extern atomic_t lru_disable_count;
> > -
> > -static inline bool lru_cache_disabled(void)
> > -{
> > -	return atomic_read(&lru_disable_count);
> > -}
> > -
> > -static inline void lru_cache_enable(void)
> > -{
> > -	atomic_dec(&lru_disable_count);
> > -}
> > -
> > +extern bool lru_cache_disabled(void);
> >  extern void lru_cache_disable(void);
> > +extern void lru_cache_enable(void);
> >  extern void lru_add_drain(void);
> >  extern void lru_add_drain_cpu(int cpu);
> >  extern void lru_add_drain_cpu_zone(struct zone *zone);
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 995e15480937..60be555c5b95 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/cma.h>
> >  #include <linux/highmem.h>
> >  #include <linux/io.h>
> > +#include <linux/swap.h>
> >  #include <linux/kmemleak.h>
> >  #include <trace/events/cma.h>
> >  
> > @@ -453,6 +454,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> >  	if (bitmap_count > bitmap_maxno)
> >  		goto out;
> >  
> > +	lru_cache_disable();
> > +
> >  	for (;;) {
> >  		spin_lock_irq(&cma->lock);
> >  		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> > @@ -492,6 +495,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> >  		start = bitmap_no + mask + 1;
> >  	}
> >  
> > +	lru_cache_enable();
> > +
> >  	trace_cma_alloc_finish(cma->name, pfn, page, count, align);
> >  
> >  	/*
> > diff --git a/mm/swap.c b/mm/swap.c
> > index af3cad4e5378..5f89d7c9a54e 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -847,7 +847,17 @@ void lru_add_drain_all(void)
> >  }
> >  #endif /* CONFIG_SMP */
> >  
> > -atomic_t lru_disable_count = ATOMIC_INIT(0);
> > +static atomic_t lru_disable_count = ATOMIC_INIT(0);
> > +
> > +bool lru_cache_disabled(void)
> > +{
> > +	return atomic_read(&lru_disable_count) != 0;
> > +}
> > +
> > +void lru_cache_enable(void)
> > +{
> > +	atomic_dec(&lru_disable_count);
> > +}
> >  
> >  /*
> >   * lru_cache_disable() needs to be called before we start compiling
> > @@ -859,7 +869,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> >   */
> >  void lru_cache_disable(void)
> >  {
> > -	atomic_inc(&lru_disable_count);
> > +	static DEFINE_MUTEX(lock);
> > +
> > +	/*
> > +	 * The lock gaurantees lru_cache is drained when the function
> > +	 * returned.
> > +	 */
> > +	mutex_lock(&lock);
> > +	/*
> > +	 * If someone is already disabled lru_cache, just return with
> > +	 * increasing the lru_disable_count.
> > +	 */
> > +	if (atomic_inc_not_zero(&lru_disable_count)) {
> > +		mutex_unlock(&lock);
> > +		return;
> > +	}
> >  #ifdef CONFIG_SMP
> >  	/*
> >  	 * lru_add_drain_all in the force mode will schedule draining on
> > @@ -873,6 +897,8 @@ void lru_cache_disable(void)
> >  #else
> >  	lru_add_and_bh_lrus_drain();
> >  #endif
> > +	atomic_inc(&lru_disable_count);
> > +	mutex_unlock(&lock);
> >  }
> >  
> >  /**
> > -- 
> > 2.34.1.448.ga2b2bfdf31-goog
> 
> -- 
> Michal Hocko
> SUSE Labs
> 
Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Posted by David Hildenbrand 4 years, 5 months ago
On 19.01.22 01:12, Minchan Kim wrote:
> On Mon, Jan 17, 2022 at 02:47:06PM +0100, Michal Hocko wrote:
>> On Thu 30-12-21 11:36:27, Minchan Kim wrote:
>>> lru_cache_disable involves IPIs to drain pagevec of each core,
>>> which sometimes takes quite long time to complete depending
>>> on cpu's business, which makes allocation too slow up to
>>> sveral hundredth milliseconds. Furthermore, the repeated draining
>>> in the alloc_contig_range makes thing worse considering caller
>>> of alloc_contig_range usually tries multiple times in the loop.
>>>
>>> This patch makes the lru_cache_disable aware of the fact the
>>> pagevec was already disabled. With that, user of alloc_contig_range
>>> can disable the lru cache in advance in their context during the
>>> repeated trial so they can avoid the multiple costly draining
>>> in cma allocation.
>>
>> Do you have any numbers on any improvements?
> 
> The LRU draining consumed above 50% overhead for the 20M CMA alloc.
> 
>>
>> Now to the change. I do not like this much to be honest. LRU cache
>> disabling is a complex synchronization scheme implemented in
>> __lru_add_drain_all now you are stacking another level on top of that.
>>
>> More fundamentally though. I am not sure I understand the problem TBH.
> 
> The problem is that kinds of IPI using normal prority workqueue to drain
> takes much time depending on the system CPU business.
> 
>> What prevents you from calling lru_cache_disable at the cma level in the
>> first place?
> 
> You meant moving the call from alloc_contig_range to caller layer?
> So, virtio_mem_fake_online, too? It could and make sense from
> performance perspective since upper layer usually calls the
> alloc_contig_range multiple times on retrial loop.
> 

^ I actually do have something like that on my TODO list.

The issue is that we have demanding requirements for
alloc_contig_range(), discussed in the past for CMA bulk allocations:

(1) Fast, unreliable allocations

Fail fast and let caller continue with next allocation instead of
retrying. Try to not degrade system performance.

(2) Slow, reliable allocations

Retry as good as possible. Degrading system performance (e.g., disabling
lru) is acceptable.


virtio-mem is usually (2), although there could be some use cases where
we first want to try (1) -- unplug as much memory as we can fast -- to
then fallback to (2) -- unplug what remains.

CMA bulk allocations are (1). "Ordinary" CMA is mostly (2) I'd assume.

-- 
Thanks,

David / dhildenb

Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Posted by Minchan Kim 4 years, 5 months ago
On Thu, Jan 20, 2022 at 09:42:23AM +0100, David Hildenbrand wrote:
> On 19.01.22 01:12, Minchan Kim wrote:
> > On Mon, Jan 17, 2022 at 02:47:06PM +0100, Michal Hocko wrote:
> >> On Thu 30-12-21 11:36:27, Minchan Kim wrote:
> >>> lru_cache_disable involves IPIs to drain pagevec of each core,
> >>> which sometimes takes quite long time to complete depending
> >>> on cpu's business, which makes allocation too slow up to
> >>> sveral hundredth milliseconds. Furthermore, the repeated draining
> >>> in the alloc_contig_range makes thing worse considering caller
> >>> of alloc_contig_range usually tries multiple times in the loop.
> >>>
> >>> This patch makes the lru_cache_disable aware of the fact the
> >>> pagevec was already disabled. With that, user of alloc_contig_range
> >>> can disable the lru cache in advance in their context during the
> >>> repeated trial so they can avoid the multiple costly draining
> >>> in cma allocation.
> >>
> >> Do you have any numbers on any improvements?
> > 
> > The LRU draining consumed above 50% overhead for the 20M CMA alloc.
> > 
> >>
> >> Now to the change. I do not like this much to be honest. LRU cache
> >> disabling is a complex synchronization scheme implemented in
> >> __lru_add_drain_all now you are stacking another level on top of that.
> >>
> >> More fundamentally though. I am not sure I understand the problem TBH.
> > 
> > The problem is that kinds of IPI using normal prority workqueue to drain
> > takes much time depending on the system CPU business.
> > 
> >> What prevents you from calling lru_cache_disable at the cma level in the
> >> first place?
> > 
> > You meant moving the call from alloc_contig_range to caller layer?
> > So, virtio_mem_fake_online, too? It could and make sense from
> > performance perspective since upper layer usually calls the
> > alloc_contig_range multiple times on retrial loop.
> > 
> 
> ^ I actually do have something like that on my TODO list.

I'm glad to hear you are also looking the direction.
If we move some information out of alloc_contig_range,
upper layer has higher chance to make it fast. A example,
"failure pfn and why it fails"

> 
> The issue is that we have demanding requirements for
> alloc_contig_range(), discussed in the past for CMA bulk allocations:
> 
> (1) Fast, unreliable allocations
> 
> Fail fast and let caller continue with next allocation instead of
> retrying. Try to not degrade system performance.
> 
> (2) Slow, reliable allocations
> 
> Retry as good as possible. Degrading system performance (e.g., disabling
> lru) is acceptable.

"Fast and unreliable" provides lots of flexibilty for us to design
since it's okay to fail. For "Slow, reliable", "the slow" should be not
too much and we need to keep effort to make it faster.