If the conditions for starting a fill are met, all cores that call fill()
afterwards are blocked until the first core completes the fill operation.
But obviously, a core that still has free nodes locally does not need to be
blocked (see below for why). This helps in stress situations.
1. In the case of no nesting, a core uses only one node at a time. As long
as there is a local node, there is no need to use a free node from
obj_pool.
2. When the nesting depth is one, nodes from obj_pool need to be used
only when there is only one local node.
#define ODEBUG_POOL_PERCPU_SIZE 64
#define ODEBUG_BATCH_SIZE 16
Assume that, when nesting occurs, each possible number of nodes in
percpu_obj_pool is equally likely. The probability of having only one
node is then less than 1/17 = 6%. Assuming the probability of nesting
is 5% (a fairly high estimate), the probability of using obj_pool is
6% * 5% = 0.3%. In other words, a 333-core environment produces on
average only one core competing for obj_pool.
#define ODEBUG_POOL_MIN_LEVEL 256
#define ODEBUG_BATCH_SIZE 16
But we can tolerate "256 / (16 + 1)" = 15 cores competing at the same
time.
3. When the nesting depth is more than one, the probability is even lower
and negligible.
Nesting Depth=2: "2/17 * 5% * 5%" = 0.03%
Nesting Depth=3: "3/17 * 5% * 5% * 5%" = 0.002%
However, to ensure sufficient reliability, the fill is skipped only when
there are at least two local nodes, reducing the probability of problems
to practically zero.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
lib/debugobjects.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index 7a8ccc94cb037ba..4f64b5d4329c27d 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -131,6 +131,16 @@ static void fill_pool(void)
struct debug_obj *obj;
unsigned long flags;
+ /*
+ * The upper-layer function uses only one node at a time. If there are
+ * at least two local nodes, then even if nesting occurs, it
+ * doesn't matter. The probability of nesting depth >= 2 is extremely
+ * low, and the number of global free nodes guarded by
+ * debug_objects_pool_min_level is adequate.
+ */
+ if (likely(obj_cache) && this_cpu_read(percpu_obj_pool.obj_free) >= 2)
+ return;
+
if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
return;
--
2.34.1
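For reference, the arithmetic in the changelog above can be reproduced with a
small standalone user-space program (illustrative only, not part of the patch;
the constants are copied from lib/debugobjects.c):

#include <stdio.h>

/* Pool constants as defined in lib/debugobjects.c */
#define ODEBUG_POOL_MIN_LEVEL   256
#define ODEBUG_BATCH_SIZE       16

int main(void)
{
        /*
         * Probability that the per-CPU pool holds exactly one node,
         * assuming (as the changelog does) a uniform distribution over
         * 0..16 nodes: 1/17, i.e. just under 6%.
         */
        double p_one_node = 1.0 / (ODEBUG_BATCH_SIZE + 1);
        double p_nesting  = 0.05;                       /* assumed nesting probability */
        double p_obj_pool = p_one_node * p_nesting;     /* roughly 0.3% */

        printf("P(only one local node)  = %.4f\n", p_one_node);
        /* 1/p_obj_pool is ~340; the changelog rounds 1/17 up to 6% and gets 333 */
        printf("P(needs obj_pool)       = %.4f (about 1 core in %.0f)\n",
               p_obj_pool, 1.0 / p_obj_pool);
        printf("cores tolerated at once = %d\n",
               ODEBUG_POOL_MIN_LEVEL / (ODEBUG_BATCH_SIZE + 1));
        return 0;
}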
On 2024/9/4 21:39, Zhen Lei wrote:
> If the conditions for starting fill are met, it means that all cores that
> call fill() later are blocked until the first core completes the fill
> operation. But obviously, for a core that has free nodes locally, it does
> not need to be blocked(see below for why). This is good in stress
> situations.
>
> 1. In the case of no nesting, a core uses only one node at a time. As long
> as there is a local node, there is no need to use the free node in
> obj_pool.
> 2. In the case of nesting depth is one, nodes in obj_pool need to be used
> only when there is only one local node.
> #define ODEBUG_POOL_PERCPU_SIZE 64
> #define ODEBUG_BATCH_SIZE 16
> Assume that when nested, the probability of percpu_obj_pool having each
> number of nodes is the same. The probability of only one node is less
> than 1/17=6%. Assuming the probability of nesting is 5%, that's a
> pretty high estimate. Then the probability of using obj_pool is
> 6% * 5% = 0.3%. In other words, a 333-core environment produces only
> one core to compete for obj_pool.
> #define ODEBUG_POOL_MIN_LEVEL 256
> #define ODEBUG_BATCH_SIZE 16
> But we can tolerate "256 / (16 + 1)" = 15 cores competing at the same
> time.

One detail is omitted. In function debug_objects_mem_init(), an extra batch
is reserved for each core.
extras = num_possible_cpus() * ODEBUG_BATCH_SIZE;
debug_objects_pool_min_level += extras;

In addition, above method of calculating probabilities is wrong. The correct
calculation method is as follows:
When the number of local nodes is 0, fill is performed. When the number of
local nodes is 1 and nested, 16 nodes are moved from obj_pool to obj_pool.
As a result, the obj_pool resource pool keeps decreasing. When this happens
continuously(The number of local nodes equal 0 is not met), the resource
pool will eventually be exhausted. The error probability is:
(1/2)^((256+16^ncpus)/17) * (5% + 5%^2 + ... + 5%^N) * 2/17 < 1e-7 (ncpus=1).
1/2 ==> denominator sequence: 0,1; numerator sequence: 1
(5% + 5%^2 + ... + 5%^N) < 5% + (5%^2) * 2 = 0.055
17 = ODEBUG_BATCH_SIZ + 1, amount moved from obj_pool when the number of local nodes is 0.
2/17 ==> denominator sequence: 0-16; numerator sequence: 0,1
The more cores, the lower the probability of exhaustion.

If obj_pool is not filled only when there are more than two local nodes,
the probability of exhaustion is:
(1/3)^((256+16^ncpus)/17) * (5% + 5%^2 + ... + 5%^N) * 3/17 < 2.3e-10
1/3 ==> denominator sequence: 0,1,2; numerator sequence: 2
3/17 ==> denominator sequence: 0-16; numerator sequence: 0,1,2

> 3. In the case of nesting depth more than one, the probability is lower
> and negligible.
> Nesting Depth=2: "2/17 * 5% * 5%" = 0.03%
> Nesting Depth=3: "3/17 * 5% * 5% * 5%" = 0.002%
>
> However, to ensure sufficient reliability, obj_pool is not filled only
> when there are more than two local nodes, reduce the probability of
> problems to the impossible.
>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
> lib/debugobjects.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/lib/debugobjects.c b/lib/debugobjects.c
> index 7a8ccc94cb037ba..4f64b5d4329c27d 100644
> --- a/lib/debugobjects.c
> +++ b/lib/debugobjects.c
> @@ -131,6 +131,16 @@ static void fill_pool(void)
> struct debug_obj *obj;
> unsigned long flags;
>
> + /*
> + * The upper-layer function uses only one node at a time. If there are
> + * more than two local nodes, it means that even if nesting occurs, it
> + * doesn't matter. The probability of nesting depth >= 2 is extremely
> + * low, and the number of global free nodes guarded by
> + * debug_objects_pool_min_level is adequate.
> + */
> + if (likely(obj_cache) && this_cpu_read(percpu_obj_pool.obj_free) >= 2)
> + return;
> +
> if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
> return;
>
>
--
Regards,
Zhen Lei
On 2024/9/5 11:11, Leizhen (ThunderTown) wrote:
>
>
> On 2024/9/4 21:39, Zhen Lei wrote:
>> If the conditions for starting fill are met, it means that all cores that
>> call fill() later are blocked until the first core completes the fill
>> operation. But obviously, for a core that has free nodes locally, it does
>> not need to be blocked(see below for why). This is good in stress
>> situations.
>>
>> 1. In the case of no nesting, a core uses only one node at a time. As long
>> as there is a local node, there is no need to use the free node in
>> obj_pool.
>> 2. In the case of nesting depth is one, nodes in obj_pool need to be used
>> only when there is only one local node.
>> #define ODEBUG_POOL_PERCPU_SIZE 64
>> #define ODEBUG_BATCH_SIZE 16
>> Assume that when nested, the probability of percpu_obj_pool having each
>> number of nodes is the same. The probability of only one node is less
>> than 1/17=6%. Assuming the probability of nesting is 5%, that's a
>> pretty high estimate. Then the probability of using obj_pool is
>> 6% * 5% = 0.3%. In other words, a 333-core environment produces only
>> one core to compete for obj_pool.
>> #define ODEBUG_POOL_MIN_LEVEL 256
>> #define ODEBUG_BATCH_SIZE 16
>> But we can tolerate "256 / (16 + 1)" = 15 cores competing at the same
>> time.
>
> One detail is omitted. In function debug_objects_mem_init(), an extra batch
> is reserved for each core.
> extras = num_possible_cpus() * ODEBUG_BATCH_SIZE;
> debug_objects_pool_min_level += extras;
>
> In addition, above method of calculating probabilities is wrong. The correct
> calculation method is as follows:
> When the number of local nodes is 0, fill is performed. When the number of
> local nodes is 1 and nested, 16 nodes are moved from obj_pool to obj_pool.
> As a result, the obj_pool resource pool keeps decreasing. When this happens
> continuously(The number of local nodes equal 0 is not met), the resource
> pool will eventually be exhausted. The error probability is:
> (1/2)^((256+16^ncpus)/17) * (5% + 5%^2 + ... + 5%^N) * 2/17 < 1e-7 (ncpus=1).
Should be:
==> (1/2)^((256+16^ncpus)/17) * 5% * 2/17 < 9e-8 (ncpus=1).
> 1/2 ==> denominator sequence: 0,1; numerator sequence: 1
> (5% + 5%^2 + ... + 5%^N) < 5% + (5%^2) * 2 = 0.055
> 17 = ODEBUG_BATCH_SIZ + 1, amount moved from obj_pool when the number of local nodes is 0.
> 2/17 ==> denominator sequence: 0-16; numerator sequence: 0,1
> The more cores, the lower the probability of exhaustion.
>
> If obj_pool is not filled only when there are more than two local nodes,
> the probability of exhaustion is:
> (1/3)^((256+16^ncpus)/17) * (5% + 5%^2 + ... + 5%^N) * 3/17 < < 2.3e-10
Should be:
==> (1/3)^((256+16^ncpus)/17) * (5%^2) * 3/17 < 1.03e-11 (ncpus=1).
> 1/3 ==> denominator sequence: 0,1,2; numerator sequence: 2
> 3/17 ==> denominator sequence: 0-16; numerator sequence: 0,1,2
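As a numerical cross-check (illustrative only), the two corrected bounds above
can be evaluated for ncpus=1, where the extra per-CPU reserve is 16 (compile
with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
        /* Number of batch-sized refill steps: (256 + 16) / 17 = 16 for ncpus=1 */
        double steps = (256.0 + 16.0) / 17.0;

        /* Bound when fill is skipped with >= 1 local node */
        double p1 = pow(1.0 / 2.0, steps) * 0.05 * (2.0 / 17.0);
        /* Bound when fill is skipped only with >= 2 local nodes */
        double p2 = pow(1.0 / 3.0, steps) * 0.05 * 0.05 * (3.0 / 17.0);

        printf("%.2e  (matches < 9e-8)\n", p1);         /* ~9.0e-08 */
        printf("%.2e  (matches < 1.03e-11)\n", p2);     /* ~1.0e-11 */
        return 0;
}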
Hi, Thomas Gleixner:
It seems an additional patch like the one below is needed to be foolproof.
I'll prepare it.
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index e175cc74f7b7899..d3f8cc7dc1c9291 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -245,6 +245,21 @@ alloc_object(void *addr, struct debug_bucket *b, const struct debug_obj_descr *d
raw_spin_lock(&pool_lock);
obj = __alloc_object(&obj_pool);
+ if (!obj) {
+ raw_spin_unlock(&pool_lock);
+ obj = kmem_cache_zalloc(obj_cache, __GFP_HIGH | GFP_NOWAIT);
+ if (!obj)
+ return NULL;
+
+ raw_spin_lock(&pool_lock);
+ debug_objects_allocated++;
+
+ /*
+ * It can be understood that obj is allocated immediately after
+ * being added to obj_pool.
+ */
+ obj_pool_used++;
+ }
if (obj) {
int cnt = 0;
>
>> 3. In the case of nesting depth more than one, the probability is lower
>> and negligible.
>> Nesting Depth=2: "2/17 * 5% * 5%" = 0.03%
>> Nesting Depth=3: "3/17 * 5% * 5% * 5%" = 0.002%
>>
>> However, to ensure sufficient reliability, obj_pool is not filled only
>> when there are more than two local nodes, reduce the probability of
>> problems to the impossible.
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>> lib/debugobjects.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/lib/debugobjects.c b/lib/debugobjects.c
>> index 7a8ccc94cb037ba..4f64b5d4329c27d 100644
>> --- a/lib/debugobjects.c
>> +++ b/lib/debugobjects.c
>> @@ -131,6 +131,16 @@ static void fill_pool(void)
>> struct debug_obj *obj;
>> unsigned long flags;
>>
>> + /*
>> + * The upper-layer function uses only one node at a time. If there are
>> + * more than two local nodes, it means that even if nesting occurs, it
>> + * doesn't matter. The probability of nesting depth >= 2 is extremely
>> + * low, and the number of global free nodes guarded by
>> + * debug_objects_pool_min_level is adequate.
>> + */
>> + if (likely(obj_cache) && this_cpu_read(percpu_obj_pool.obj_free) >= 2)
>> + return;
>> +
>> if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
>> return;
>>
>>
>
--
Regards,
Zhen Lei
On Thu, Sep 05 2024 at 11:45, Leizhen wrote:
> On 2024/9/5 11:11, Leizhen (ThunderTown) wrote:
> Seems to need to add an additional patch as follows to be foolproof.
> I'll prepare it.
>
> diff --git a/lib/debugobjects.c b/lib/debugobjects.c
> index e175cc74f7b7899..d3f8cc7dc1c9291 100644
> --- a/lib/debugobjects.c
> +++ b/lib/debugobjects.c
> @@ -245,6 +245,21 @@ alloc_object(void *addr, struct debug_bucket *b, const struct debug_obj_descr *d
>
> raw_spin_lock(&pool_lock);
> obj = __alloc_object(&obj_pool);
> + if (!obj) {
> + raw_spin_unlock(&pool_lock);
> + obj = kmem_cache_zalloc(obj_cache, __GFP_HIGH | GFP_NOWAIT);
> + if (!obj)
> + return NULL;
> +
> + raw_spin_lock(&pool_lock);
> + debug_objects_allocated++;
> +
> + /*
> + * It can be understood that obj is allocated immediately after
> + * being added to obj_pool.
> + */
> + obj_pool_used++;
> + }
> if (obj) {
> int cnt = 0;
No. That fails on RT. See debug_object_fill_pool().
Thanks,
tglx
On Mon, Sep 09 2024 at 11:22, Thomas Gleixner wrote:
> On Thu, Sep 05 2024 at 11:45, Leizhen wrote:
>> + obj = kmem_cache_zalloc(obj_cache, __GFP_HIGH | GFP_NOWAIT);
>> + if (!obj)
>> + return NULL;
>
> No. That fails on RT. See debug_object_fill_pool().
Some more thoughts on this. The goal is to reduce contention on pool
lock. At the same time, this needs to ensure that the opportunistic fill
mechanism actually works reliably.
debug_object_fill_pool() is invoked from
- debug_object_init()
- debug_object_assert_init()
- debug_object_activate()
debug_object_init() is usually invoked from preemptible context. It will
most of the time consume a tracking object from the per CPU or the
global pool because re-initialization of a tracked object is rather
rare.
debug_object_assert_init() and debug_object_activate() only consume a
tracking object, when the to be tracked object is statically
initialized or the call site failed to initialize the object. Both can
be called from arbitrary contexts even under PREEMPT_RT, where
preemptible context is a prerequisite for pool refill via allocations.
And of course any CPU which sits in fill_pool() can be preempted if the
calling context is preemptible. And no, we can't disable preemption
across the whole thing due to RT.
So something like the uncompiled below should reduce lock contention
significantly with a reasonable safety net.
Thanks,
tglx
---
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -125,14 +125,10 @@ static const char *obj_states[ODEBUG_STA
[ODEBUG_STATE_NOTAVAILABLE] = "not available",
};
-static void fill_pool(void)
+static void fill_pool_from_freelist(void)
{
- gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
+ static unsigned long state;
struct debug_obj *obj;
- unsigned long flags;
-
- if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
- return;
/*
* Reuse objs from the global obj_to_free list; they will be
@@ -141,25 +137,57 @@ static void fill_pool(void)
* obj_nr_tofree is checked locklessly; the READ_ONCE() pairs with
* the WRITE_ONCE() in pool_lock critical sections.
*/
- if (READ_ONCE(obj_nr_tofree)) {
- raw_spin_lock_irqsave(&pool_lock, flags);
- /*
- * Recheck with the lock held as the worker thread might have
- * won the race and freed the global free list already.
- */
- while (obj_nr_tofree && (obj_pool_free < debug_objects_pool_min_level)) {
- obj = hlist_entry(obj_to_free.first, typeof(*obj), node);
- hlist_del(&obj->node);
- WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);
- hlist_add_head(&obj->node, &obj_pool);
- WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
- }
- raw_spin_unlock_irqrestore(&pool_lock, flags);
+ if (!READ_ONCE(obj_nr_tofree))
+ return;
+
+ /*
+ * Prevent the context from being scheduled or interrupted after
+ * setting the state flag;
+ */
+ guard(irqsave)();
+
+ /*
+ * Avoid lock contention on &pool_lock and avoid making the cache
+ * line exclusive by testing the bit before attempting to set it.
+ */
+ if (test_bit(0, &state) || test_and_set_bit(0, &state))
+ return;
+
+ guard(raw_spinlock)(&pool_lock);
+ /*
+ * Recheck with the lock held as the worker thread might have
+ * won the race and freed the global free list already.
+ */
+ while (obj_nr_tofree && (obj_pool_free < debug_objects_pool_min_level)) {
+ obj = hlist_entry(obj_to_free.first, typeof(*obj), node);
+ hlist_del(&obj->node);
+ WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);
+ hlist_add_head(&obj->node, &obj_pool);
+ WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
}
+ clear_bit(0, &state);
+}
+
+static void fill_pool(void)
+{
+ gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
+ static atomic_t cpus_allocating;
if (unlikely(!obj_cache))
return;
+ /*
+ * Avoid allocation and lock contention when
+ *
+ * - the CPU local pool has at least 2 objects left
+ * - another CPU is already in the allocation path
+ * - the global pool has not reached the critical level yet
+ */
+ if (this_cpu_read(percpu_obj_pool.obj_free) > 1 && atomic_read(&cpus_allocating) &&
+ READ_ONCE(obj_pool_free) > (debug_objects_pool_min_level / 2))
+ return;
+
+ atomic_inc(&cpus_allocating);
while (READ_ONCE(obj_pool_free) < debug_objects_pool_min_level) {
struct debug_obj *new[ODEBUG_BATCH_SIZE];
int cnt;
@@ -172,14 +200,14 @@ static void fill_pool(void)
if (!cnt)
return;
- raw_spin_lock_irqsave(&pool_lock, flags);
+ guard(raw_spinlock_irqsave)(&pool_lock);
while (cnt) {
hlist_add_head(&new[--cnt]->node, &obj_pool);
debug_objects_allocated++;
WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
}
- raw_spin_unlock_irqrestore(&pool_lock, flags);
}
+ atomic_dec(&cpus_allocating);
}
/*
@@ -598,6 +626,15 @@ static struct debug_obj *lookup_object_o
static void debug_objects_fill_pool(void)
{
+ if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
+ return;
+
+ /* Try reusing objects from obj_to_free_list */
+ fill_pool_from_freelist();
+
+ if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
+ return;
+
/*
* On RT enabled kernels the pool refill must happen in preemptible
* context -- for !RT kernels we rely on the fact that spinlock_t and
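A side note on the locking idiom in fill_pool_from_freelist() above: the bit is
checked read-only first, and only then is test_and_set_bit() attempted, so CPUs
that lose the race never pull the cache line exclusive just to discover the flag
is already set. A rough user-space analogue with C11 atomics (the names and the
user-space framing are illustrative, not kernel API):

#include <stdatomic.h>
#include <stdbool.h>

/* One flag shared by all threads; only one refiller at a time. */
static atomic_uint refill_state;

static bool try_enter_refill(void)
{
        /*
         * Read-only check first: if somebody already holds the flag, give
         * up without a read-modify-write, so the cache line stays shared.
         */
        if (atomic_load_explicit(&refill_state, memory_order_relaxed))
                return false;

        /* Only now attempt the atomic RMW (the kernel uses test_and_set_bit()). */
        return atomic_exchange_explicit(&refill_state, 1,
                                        memory_order_acquire) == 0;
}

static void exit_refill(void)
{
        atomic_store_explicit(&refill_state, 0, memory_order_release);
}

In the kernel sketch, guard(irqsave)() additionally keeps the flag holder from
being interrupted or scheduled out while the bit is set, and clear_bit(0, &state)
releases it after the transfer loop.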
On 2024/9/9 20:10, Thomas Gleixner wrote:
> On Mon, Sep 09 2024 at 11:22, Thomas Gleixner wrote:
>> On Thu, Sep 05 2024 at 11:45, Leizhen wrote:
>>> + obj = kmem_cache_zalloc(obj_cache, __GFP_HIGH | GFP_NOWAIT);
>>> + if (!obj)
>>> + return NULL;
>>
>> No. That fails on RT. See debug_object_fill_pool().
>
> Some more thoughts on this. The goal is to reduce contention on pool
> lock. At the same time, this needs to ensure that the opportunistic fill
> mechanism actually works reliably.
>
> debug_object_fill_pool() is invoked from
>
> - debug_object_init()
> - debug_object_assert_init()
> - debug_object_activate()
>
> debug_object_init() is usually invoked from preemptible context. It will
> most of the time consume a tracking object from the per CPU or the
> global pool because re-initialization of a tracked object is rather
> rare.
>
> debug_object_assert_init() and debug_object_activate() only consume a
> tracking object, when the to be tracked object is statically
> initialized or the call site failed to initialize the object. Both can
> be called from arbitrary contexts even under PREEMPT_RT, where
> preemptible context is a prerequisite for pool refill via allocations.
>
> And of course any CPU which sits in fill_pool() can be preempted if the
> calling context is preemptible. And no, we can't disable preemption
> accross the whole thing due to RT.
>
> So something like the uncompiled below should reduce lock contention
> significantly with a reasonable safety net.
It looks very good, especially the flexible use of 'state' and 'cpus_allocating'.
This way there is almost no contention on 'pool_lock', and the more cores
there are, the less likely contention becomes.
Hi Thomas Gleixner:
Do you plan to post this patch? It will conflict with my patch 5/6, so if
you're going to merge my 5/6, hopefully yours will be applied after mine.
By the way, do you have time to review the patches in the link below?
https://lkml.org/lkml/2024/9/4/1094
>
> Thanks,
>
> tglx
> ---
> --- a/lib/debugobjects.c
> +++ b/lib/debugobjects.c
> @@ -125,14 +125,10 @@ static const char *obj_states[ODEBUG_STA
> [ODEBUG_STATE_NOTAVAILABLE] = "not available",
> };
>
> -static void fill_pool(void)
> +static void fill_pool_from_freelist(void)
> {
> - gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
> + static unsigned long state;
> struct debug_obj *obj;
> - unsigned long flags;
> -
> - if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
> - return;
>
> /*
> * Reuse objs from the global obj_to_free list; they will be
> @@ -141,25 +137,57 @@ static void fill_pool(void)
> * obj_nr_tofree is checked locklessly; the READ_ONCE() pairs with
> * the WRITE_ONCE() in pool_lock critical sections.
> */
> - if (READ_ONCE(obj_nr_tofree)) {
> - raw_spin_lock_irqsave(&pool_lock, flags);
> - /*
> - * Recheck with the lock held as the worker thread might have
> - * won the race and freed the global free list already.
> - */
> - while (obj_nr_tofree && (obj_pool_free < debug_objects_pool_min_level)) {
> - obj = hlist_entry(obj_to_free.first, typeof(*obj), node);
> - hlist_del(&obj->node);
> - WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);
> - hlist_add_head(&obj->node, &obj_pool);
> - WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
> - }
> - raw_spin_unlock_irqrestore(&pool_lock, flags);
> + if (!READ_ONCE(obj_nr_tofree))
> + return;
> +
> + /*
> + * Prevent the context from being scheduled or interrupted after
> + * setting the state flag;
> + */
> + guard(irqsave)();
> +
> + /*
> + * Avoid lock contention on &pool_lock and avoid making the cache
> + * line exclusive by testing the bit before attempting to set it.
> + */
> + if (test_bit(0, &state) || test_and_set_bit(0, &state))
> + return;
> +
> + guard(raw_spinlock)(&pool_lock);
> + /*
> + * Recheck with the lock held as the worker thread might have
> + * won the race and freed the global free list already.
> + */
> + while (obj_nr_tofree && (obj_pool_free < debug_objects_pool_min_level)) {
> + obj = hlist_entry(obj_to_free.first, typeof(*obj), node);
> + hlist_del(&obj->node);
> + WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);
> + hlist_add_head(&obj->node, &obj_pool);
> + WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
> }
> + clear_bit(0, &state);
> +}
> +
> +static void fill_pool(void)
> +{
> + gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
> + static atomic_t cpus_allocating;
>
> if (unlikely(!obj_cache))
> return;
>
> + /*
> + * Avoid allocation and lock contention when
> + *
> + * - the CPU local pool has at least 2 objects left
> + * - another CPU is already in the allocation path
> + * - the global pool has not reached the critical level yet
> + */
> + if (this_cpu_read(percpu_obj_pool.obj_free) > 1 && atomic_read(&cpus_allocating) &&
> + READ_ONCE(obj_pool_free) > (debug_objects_pool_min_level / 2))
> + return;
> +
> + atomic_inc(&cpus_allocating);
> while (READ_ONCE(obj_pool_free) < debug_objects_pool_min_level) {
> struct debug_obj *new[ODEBUG_BATCH_SIZE];
> int cnt;
> @@ -172,14 +200,14 @@ static void fill_pool(void)
> if (!cnt)
> return;
>
> - raw_spin_lock_irqsave(&pool_lock, flags);
> + guard(raw_spinlock_irqsave)(&pool_lock);
> while (cnt) {
> hlist_add_head(&new[--cnt]->node, &obj_pool);
> debug_objects_allocated++;
> WRITE_ONCE(obj_pool_free, obj_pool_free + 1);
> }
> - raw_spin_unlock_irqrestore(&pool_lock, flags);
> }
> + atomic_dec(&cpus_allocating);
> }
>
> /*
> @@ -598,6 +626,15 @@ static struct debug_obj *lookup_object_o
>
> static void debug_objects_fill_pool(void)
> {
> + if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
> + return;
> +
> + /* Try reusing objects from obj_to_free_list */
> + fill_pool_from_freelist();
> +
> + if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
> + return;
> +
> /*
> * On RT enabled kernels the pool refill must happen in preemptible
> * context -- for !RT kernels we rely on the fact that spinlock_t and
> .
>
--
Regards,
Zhen Lei
On 2024/9/9 21:51, Leizhen (ThunderTown) wrote:
> +static void fill_pool(void)
> +{
> + gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
> + static atomic_t cpus_allocating;
>
> if (unlikely(!obj_cache))
> return;
>
> + /*
> + * Avoid allocation and lock contention when
> + *
> + * - the CPU local pool has at least 2 objects left
> + * - another CPU is already in the allocation path
> + * - the global pool has not reached the critical level yet
> + */
> + if (this_cpu_read(percpu_obj_pool.obj_free) > 1 && atomic_read(&cpus_allocating) &&
I've been rethinking this: 'cpus_allocating' and 'percpu_obj_pool.obj_free > 1'
seem contradictory, and we can only choose one of them. If 'cpus_allocating'
works, there's no need for the other. Since I had no way to prove that
'cpus_allocating' would work, I removed it in V3.
> + READ_ONCE(obj_pool_free) > (debug_objects_pool_min_level / 2))
> + return;
> +
> + atomic_inc(&cpus_allocating);
> while (READ_ONCE(obj_pool_free) < debug_objects_pool_min_level) {
> struct debug_obj *new[ODEBUG_BATCH_SIZE];
> int cnt;
--
Regards,
Zhen Lei
On 2024/9/10 23:29, Leizhen (ThunderTown) wrote:
>
>
> On 2024/9/9 21:51, Leizhen (ThunderTown) wrote:
>> +static void fill_pool(void)
>> +{
>> + gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
>> + static atomic_t cpus_allocating;
>>
>> if (unlikely(!obj_cache))
>> return;
>>
>> + /*
>> + * Avoid allocation and lock contention when
>> + *
>> + * - the CPU local pool has at least 2 objects left
>> + * - another CPU is already in the allocation path
>> + * - the global pool has not reached the critical level yet
>> + */
>> + if (this_cpu_read(percpu_obj_pool.obj_free) > 1 && atomic_read(&cpus_allocating) &&
>
> I rethink that 'cpus_allocating' and 'percpu_obj_pool.obj_free > 1' should be
> contradictory. We can only choose one of them. If 'cpus_allocating' can work,
> there's no need for another. Since I had no way to prove that 'cpus_allocating'
Sorry, I mistyped in a hurry yesterday, and it's 'percpu_obj_pool.obj_free > 1'
that should be deleted.
> would work, I removed it in V3.
>
>> + READ_ONCE(obj_pool_free) > (debug_objects_pool_min_level / 2))
>> + return;
>> +
>> + atomic_inc(&cpus_allocating);
>> while (READ_ONCE(obj_pool_free) < debug_objects_pool_min_level) {
>> struct debug_obj *new[ODEBUG_BATCH_SIZE];
>> int cnt;
>
--
Regards,
Zhen Lei
On Mon, Sep 09 2024 at 21:51, Leizhen wrote:
> On 2024/9/9 20:10, Thomas Gleixner wrote:
>> So something like the uncompiled below should reduce lock contention
>> significantly with a reasonable safety net.
>
> It looks very good. Especially flexible use of 'state' and 'cpus_allocating'.
> In this way, there is almost no conflict of lock 'pool_lock', and the more
> cores, the less possible conflict.
>
> Hi Thomas Gleixner:
> Do you plan to post this patch? But this patch will conflict with my patch 5/6.
> If you're going to merge my patch 5/6, hopefully yours will be applied
> after mine.
I'm short of cycles, so the best way is when you pick it up and
integrate it instead of 4/6 and post a v3.
> By the way, Do you have time to review the patches in the link below?
> https://lkml.org/lkml/2024/9/4/1094
Please use lore.kernel.org links. I've seen it but did not come around
to it yet because I was staring at this one. I'm single threaded :)
Thanks,
tglx
On 2024/9/9 22:35, Thomas Gleixner wrote:
> On Mon, Sep 09 2024 at 21:51, Leizhen wrote:
>> On 2024/9/9 20:10, Thomas Gleixner wrote:
>>> So something like the uncompiled below should reduce lock contention
>>> significantly with a reasonable safety net.
>>
>> It looks very good. Especially flexible use of 'state' and 'cpus_allocating'.
>> In this way, there is almost no conflict of lock 'pool_lock', and the more
>> cores, the less possible conflict.
>>
>> Hi Thomas Gleixner:
>> Do you plan to post this patch? But this patch will conflict with my patch 5/6.
>> If you're going to merge my patch 5/6, hopefully yours will be applied
>> after mine.
>
> I'm short of cycles, so the best way is when you pick it up and
> integrate it instead of 4/6 and post a v3.

OK

>
>> By the way, Do you have time to review the patches in the link below?
>> https://lkml.org/lkml/2024/9/4/1094
>
> Please use lore.kernel.org links. I've seen it but did not come around

OK

> to it yet because I was staring at this one. I'm single threaded :)

Mostly you big shots are too busy.

>
> Thanks,
>
> tglx
>
>
>
> .
>
--
Regards,
Zhen Lei
On Mon, Sep 09 2024 at 16:35, Thomas Gleixner wrote:
> On Mon, Sep 09 2024 at 21:51, Leizhen wrote:
>> On 2024/9/9 20:10, Thomas Gleixner wrote:
>>> So something like the uncompiled below should reduce lock contention
>>> significantly with a reasonable safety net.
>>
>> It looks very good. Especially flexible use of 'state' and 'cpus_allocating'.
>> In this way, there is almost no conflict of lock 'pool_lock', and the more
>> cores, the less possible conflict.
>>
>> Hi Thomas Gleixner:
>> Do you plan to post this patch? But this patch will conflict with my patch 5/6.
>> If you're going to merge my patch 5/6, hopefully yours will be applied
>> after mine.
>
> I'm short of cycles, so the best way is when you pick it up and
> integrate it instead of 4/6 and post a v3.
I've picked up 1-3 from this series now. So you can drop them and work
against
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/debugobjects
Thanks,
tglx
On 2024/9/9 17:22, Thomas Gleixner wrote:
> On Thu, Sep 05 2024 at 11:45, Leizhen wrote:
>> On 2024/9/5 11:11, Leizhen (ThunderTown) wrote:
>> Seems to need to add an additional patch as follows to be foolproof.
>> I'll prepare it.
>>
>> diff --git a/lib/debugobjects.c b/lib/debugobjects.c
>> index e175cc74f7b7899..d3f8cc7dc1c9291 100644
>> --- a/lib/debugobjects.c
>> +++ b/lib/debugobjects.c
>> @@ -245,6 +245,21 @@ alloc_object(void *addr, struct debug_bucket *b, const struct debug_obj_descr *d
>>
>> raw_spin_lock(&pool_lock);
>> obj = __alloc_object(&obj_pool);
>> + if (!obj) {
>> + raw_spin_unlock(&pool_lock);
>> + obj = kmem_cache_zalloc(obj_cache, __GFP_HIGH | GFP_NOWAIT);
>> + if (!obj)
>> + return NULL;
>> +
>> + raw_spin_lock(&pool_lock);
>> + debug_objects_allocated++;
>> +
>> + /*
>> + * It can be understood that obj is allocated immediately after
>> + * being added to obj_pool.
>> + */
>> + obj_pool_used++;
>> + }
>> if (obj) {
>> int cnt = 0;
>
> No. That fails on RT. See debug_object_fill_pool().
OK, I got it, thanks.
>
> Thanks,
>
> tglx
> .
>
--
Regards,
Zhen Lei
On 2024/9/5 11:45, Leizhen (ThunderTown) wrote:
>
>
> On 2024/9/5 11:11, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2024/9/4 21:39, Zhen Lei wrote:
>>> If the conditions for starting fill are met, it means that all cores that
>>> call fill() later are blocked until the first core completes the fill
>>> operation. But obviously, for a core that has free nodes locally, it does
>>> not need to be blocked(see below for why). This is good in stress
>>> situations.
>>>
>>> 1. In the case of no nesting, a core uses only one node at a time. As long
>>> as there is a local node, there is no need to use the free node in
>>> obj_pool.
>>> 2. In the case of nesting depth is one, nodes in obj_pool need to be used
>>> only when there is only one local node.
>>> #define ODEBUG_POOL_PERCPU_SIZE 64
>>> #define ODEBUG_BATCH_SIZE 16
>>> Assume that when nested, the probability of percpu_obj_pool having each
>>> number of nodes is the same. The probability of only one node is less
>>> than 1/17=6%. Assuming the probability of nesting is 5%, that's a
>>> pretty high estimate. Then the probability of using obj_pool is
>>> 6% * 5% = 0.3%. In other words, a 333-core environment produces only
>>> one core to compete for obj_pool.
>>> #define ODEBUG_POOL_MIN_LEVEL 256
>>> #define ODEBUG_BATCH_SIZE 16
>>> But we can tolerate "256 / (16 + 1)" = 15 cores competing at the same
>>> time.
>>
>> One detail is omitted. In function debug_objects_mem_init(), an extra batch
>> is reserved for each core.
>> extras = num_possible_cpus() * ODEBUG_BATCH_SIZE;
>> debug_objects_pool_min_level += extras;
>>
>> In addition, above method of calculating probabilities is wrong. The correct
>> calculation method is as follows:
>> When the number of local nodes is 0, fill is performed. When the number of
>> local nodes is 1 and nested, 16 nodes are moved from obj_pool to obj_pool.
>> As a result, the obj_pool resource pool keeps decreasing. When this happens
>> continuously(The number of local nodes equal 0 is not met), the resource
>> pool will eventually be exhausted. The error probability is:
>> (1/2)^((256+16^ncpus)/17) * (5% + 5%^2 + ... + 5%^N) * 2/17 < 1e-7 (ncpus=1).
>
> Should be:
> ==> (1/2)^((256+16^ncpus)/17) * 5% * 2/17 < 9e-8 (ncpus=1).
>
>> 1/2 ==> denominator sequence: 0,1; numerator sequence: 1
>> (5% + 5%^2 + ... + 5%^N) < 5% + (5%^2) * 2 = 0.055
>> 17 = ODEBUG_BATCH_SIZ + 1, amount moved from obj_pool when the number of local nodes is 0.
>> 2/17 ==> denominator sequence: 0-16; numerator sequence: 0,1
>> The more cores, the lower the probability of exhaustion.
>>
>> If obj_pool is not filled only when there are more than two local nodes,
>> the probability of exhaustion is:
>> (1/3)^((256+16^ncpus)/17) * (5% + 5%^2 + ... + 5%^N) * 3/17 < < 2.3e-10
>
> Should be:
> ==> (1/3)^((256+16^ncpus)/17) * (5%^2) * 3/17 < 1.03e-11 (ncpus=1).
>
>> 1/3 ==> denominator sequence: 0,1,2; numerator sequence: 2
>> 3/17 ==> denominator sequence: 0-16; numerator sequence: 0,1,2
>
> Hi, Thomas Gleixner:
> Seems to need to add an additional patch as follows to be foolproof.
> I'll prepare it.
I've rethought this problem, and there's another remedy: when the number
of remaining free nodes is less than half of debug_objects_pool_min_level,
still do the fill. That way, "nr_cpus/2 + 256/(16+1)" cores would be required
to bypass the check before obj_pool_free is updated, which is almost impossible.
But then again, this is only theoretical and I don't have the data, so maybe
the best solution is to give up this patch and discuss it again in the future.
By the way, I've found a new concern.
static int debug_objects_pool_size = ODEBUG_POOL_SIZE; //1024
static int debug_objects_pool_min_level = ODEBUG_POOL_MIN_LEVEL; //256
extras = num_possible_cpus() * ODEBUG_BATCH_SIZE; //16
debug_objects_pool_size += extras;
debug_objects_pool_min_level += extras;
When there are that many cores, obj_pool_free can more easily swing back and
forth around debug_objects_pool_min_level. For example, with nr_cpus=128:
debug_objects_pool_min_level = 128 * 16 + 256 = 2304
debug_objects_pool_size - debug_objects_pool_min_level = 768 //fixed
Far more objects sit idle in reserve than are available to do work. As above,
it's better to discuss this in the future (see the sketch after the diff below).
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index 58de57090ac6389..2eb246901cf5367 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -135,6 +135,10 @@ static void fill_pool(void)
if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
return;
+ if (likely(obj_cache) &&
+ this_cpu_read(percpu_obj_pool.obj_free) > 0 &&
+ likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level / 2))
+ return;
/*
* Reuse objs from the global tofree list; they will be reinitialized
* when allocating.
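The sizing concern above can be made concrete with a quick calculation
(illustrative only; the constants are copied from lib/debugobjects.c, and the
half-way threshold shown is the one used in the diff above):

#include <stdio.h>

#define ODEBUG_POOL_SIZE        1024
#define ODEBUG_POOL_MIN_LEVEL   256
#define ODEBUG_BATCH_SIZE       16

int main(void)
{
        int nr_cpus[] = { 4, 32, 128, 256 };

        for (unsigned int i = 0; i < sizeof(nr_cpus) / sizeof(nr_cpus[0]); i++) {
                int extras    = nr_cpus[i] * ODEBUG_BATCH_SIZE;
                int pool_size = ODEBUG_POOL_SIZE + extras;
                int min_level = ODEBUG_POOL_MIN_LEVEL + extras;

                /* Headroom above min_level stays fixed at 768 no matter how many CPUs */
                printf("nr_cpus=%3d  min_level=%5d  half=%5d  headroom=%d\n",
                       nr_cpus[i], min_level, min_level / 2, pool_size - min_level);
        }
        return 0;
}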
>
> diff --git a/lib/debugobjects.c b/lib/debugobjects.c
> index e175cc74f7b7899..d3f8cc7dc1c9291 100644
> --- a/lib/debugobjects.c
> +++ b/lib/debugobjects.c
> @@ -245,6 +245,21 @@ alloc_object(void *addr, struct debug_bucket *b, const struct debug_obj_descr *d
>
> raw_spin_lock(&pool_lock);
> obj = __alloc_object(&obj_pool);
> + if (!obj) {
> + raw_spin_unlock(&pool_lock);
> + obj = kmem_cache_zalloc(obj_cache, __GFP_HIGH | GFP_NOWAIT);
> + if (!obj)
> + return NULL;
> +
> + raw_spin_lock(&pool_lock);
> + debug_objects_allocated++;
> +
> + /*
> + * It can be understood that obj is allocated immediately after
> + * being added to obj_pool.
> + */
> + obj_pool_used++;
> + }
> if (obj) {
> int cnt = 0;
>
>
>
>>
>>> 3. In the case of nesting depth more than one, the probability is lower
>>> and negligible.
>>> Nesting Depth=2: "2/17 * 5% * 5%" = 0.03%
>>> Nesting Depth=3: "3/17 * 5% * 5% * 5%" = 0.002%
>>>
>>> However, to ensure sufficient reliability, obj_pool is not filled only
>>> when there are more than two local nodes, reduce the probability of
>>> problems to the impossible.
>>>
>>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>>> ---
>>> lib/debugobjects.c | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>>>
>>> diff --git a/lib/debugobjects.c b/lib/debugobjects.c
>>> index 7a8ccc94cb037ba..4f64b5d4329c27d 100644
>>> --- a/lib/debugobjects.c
>>> +++ b/lib/debugobjects.c
>>> @@ -131,6 +131,16 @@ static void fill_pool(void)
>>> struct debug_obj *obj;
>>> unsigned long flags;
>>>
>>> + /*
>>> + * The upper-layer function uses only one node at a time. If there are
>>> + * more than two local nodes, it means that even if nesting occurs, it
>>> + * doesn't matter. The probability of nesting depth >= 2 is extremely
>>> + * low, and the number of global free nodes guarded by
>>> + * debug_objects_pool_min_level is adequate.
>>> + */
>>> + if (likely(obj_cache) && this_cpu_read(percpu_obj_pool.obj_free) >= 2)
>>> + return;
>>> +
>>> if (likely(READ_ONCE(obj_pool_free) >= debug_objects_pool_min_level))
>>> return;
>>>
>>>
>>
>
--
Regards,
Zhen Lei