[RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance

K Prateek Nayak posted 3 patches 2 years, 3 months ago
[RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance
Posted by K Prateek Nayak 2 years, 3 months ago
This patch brings the relevant optimizations from [1] into
newidle_balance(). Following is the breakdown:

- Check "rq->rd->overload" before jumping into newidle_balance, even
  with SHARED_RQ feat enabled.

- Call update_next_balance() for all the domains till the MC domain
  when the SHARED_RUNQ path is taken.

- Account cost from shared_runq_pick_next_task() and update
  curr_cost and sd->max_newidle_lb_cost accordingly.

- Move the initial rq_unpin_lock() logic around. Also, the caller of
  shared_runq_pick_next_task() is responsible for calling
  rq_repin_lock() if the return value is non-zero. (Needs to be verified
  that everything is right with LOCKDEP.)

- Includes a fix to skip to directly above the LLC domain when calling
  load_balance() in newidle_balance()

All other surgery from [1] has been removed.
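
Roughly, the resulting flow of newidle_balance() with SHARED_RUNQ
enabled looks as follows (a simplified sketch of the diff below, not
the literal code):

	sd = rcu_dereference(*this_cpu_ptr(&sd_llc));

	if (!READ_ONCE(this_rq->rd->overload)) {
		/* update_next_balance() for sd and all its children */
		goto out;
	}

	/* update_next_balance() for sd and all its children */
	pulled_task = shared_runq_pick_next_task(this_rq, rf);
	/* account the time spent above via update_newidle_cost() */
	if (pulled_task)
		goto out_swq;	/* repins the rq lock before returning */

	if (this_rq->ttwu_pending)
		return 0;

	/*
	 * Fall through to the regular load_balance() loop, skipping
	 * domains that have SD_SHARE_PKG_RESOURCES set (i.e. <= LLC).
	 */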

Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 67 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf844ffa79c2..446ffdad49e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 		rq_unpin_lock(rq, &src_rf);
 		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
 	}
-	rq_repin_lock(rq, rf);
 
 	return ret;
 }
@@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
-	if (sched_feat(SHARED_RUNQ)) {
-		pulled_task = shared_runq_pick_next_task(this_rq, rf);
-		if (pulled_task)
-			return pulled_task;
-	}
-
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
 	 */
 	this_rq->idle_stamp = rq_clock(this_rq);
 
-	/*
-	 * This is OK, because current is on_cpu, which avoids it being picked
-	 * for load-balance and preemption/IRQs are still disabled avoiding
-	 * further scheduler activity on it and we're being very careful to
-	 * re-start the picking loop.
-	 */
-	rq_unpin_lock(this_rq, rf);
-
 	rcu_read_lock();
-	sd = rcu_dereference_check_sched_domain(this_rq->sd);
-
-	/*
-	 * Skip <= LLC domains as they likely won't have any tasks if the
-	 * shared runq is empty.
-	 */
-	if (sched_feat(SHARED_RUNQ)) {
+	if (sched_feat(SHARED_RUNQ))
 		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-		if (likely(sd))
-			sd = sd->parent;
-	}
+	else
+		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
-	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
+	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
+	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd)
+		while (sd) {
 			update_next_balance(sd, &next_balance);
+			sd = sd->child;
+		}
+
 		rcu_read_unlock();
 
 		goto out;
 	}
+
+	if (sched_feat(SHARED_RUNQ)) {
+		struct sched_domain *tmp = sd;
+
+		t0 = sched_clock_cpu(this_cpu);
+
+		/* Do update_next_balance() for all domains within LLC */
+		while (tmp) {
+			update_next_balance(tmp, &next_balance);
+			tmp = tmp->child;
+		}
+
+		pulled_task = shared_runq_pick_next_task(this_rq, rf);
+		if (pulled_task) {
+			if (sd) {
+				curr_cost = sched_clock_cpu(this_cpu) - t0;
+				/*
+				 * Will help bail out of scans of higher domains
+				 * slightly earlier.
+				 */
+				update_newidle_cost(sd, curr_cost);
+			}
+
+			rcu_read_unlock();
+			goto out_swq;
+		}
+
+		if (sd) {
+			t1 = sched_clock_cpu(this_cpu);
+			curr_cost += t1 - t0;
+			update_newidle_cost(sd, curr_cost);
+		}
+
+		/*
+		 * Since shared_runq_pick_next_task() can take a while,
+		 * check if the CPU was targeted for a wakeup in the
+		 * meantime.
+		 */
+		if (this_rq->ttwu_pending) {
+			rcu_read_unlock();
+			return 0;
+		}
+	}
 	rcu_read_unlock();
 
+	/*
+	 * This is OK, because current is on_cpu, which avoids it being picked
+	 * for load-balance and preemption/IRQs are still disabled avoiding
+	 * further scheduler activity on it and we're being very careful to
+	 * re-start the picking loop.
+	 */
+	rq_unpin_lock(this_rq, rf);
 	raw_spin_rq_unlock(this_rq);
 
 	t0 = sched_clock_cpu(this_cpu);
@@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
 
+		/*
+		 * Skip <= LLC domains as they likely won't have any tasks if the
+		 * shared runq is empty.
+		 */
+		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
+			continue;
+
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
 			pulled_task = load_balance(this_cpu, this_rq,
@@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_rq_lock(this_rq);
 
+out_swq:
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
 
-- 
2.34.1
Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance
Posted by David Vernet 2 years, 3 months ago
On Thu, Aug 31, 2023 at 04:15:07PM +0530, K Prateek Nayak wrote:
> This patch brings the relevant optimizations from [1] into
> newidle_balance(). Following is the breakdown:

Thanks for working on this. I think the fix you added for skipping <=
LLC domains makes sense. The others possibly as well -- left some
comments below!

> 
> - Check "rq->rd->overload" before jumping into newidle_balance, even
>   with the SHARED_RUNQ feature enabled.

Out of curiosity -- did you observe this making a material difference in
your tests? After thinking about it some more, though I see the argument
for why it would be logical to check if we're overloaded, I'm still
thinking that it's more ideal to just always check the SHARED_RUNQ.
rd->overload is only set in find_busiest_group() when we load balance,
so I worry that having SHARED_RUNQ follow rd->overload may just end up
making it redundant with normal load balancing in many cases.

So yeah, while I certainly understand the idea (and would like to better
understand what kind of difference it made in your tests), I still feel
pretty strongly that SHARED_RUNQ makes the most sense as a feature when
it ignores all of these heuristics and just tries to maximize work
conservation.

What do you think?
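
To make the comparison concrete, the difference is roughly the
following (a sketch that elides the cost accounting and the exact goto
structure):

	/* this patch: consult the shared runq only when overloaded */
	if (READ_ONCE(this_rq->rd->overload))
		pulled_task = shared_runq_pick_next_task(this_rq, rf);

versus the current behavior of:

	/* always try the shared runq first */
	if (sched_feat(SHARED_RUNQ)) {
		pulled_task = shared_runq_pick_next_task(this_rq, rf);
		if (pulled_task)
			return pulled_task;
	}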

> - Call update_next_balance() for all the domains till the MC domain
>   when the SHARED_RUNQ path is taken.

I _think_ this makes sense. Though even in this case, I feel that it may
be slightly confusing and/or incorrect to push back the balance time
just because we didn't find a task in our current CCX's shared_runq.
Maybe we should avoid mucking with load balancing? Not sure, but I am
leaning towards what you're proposing here as a better approach.
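
For reference, update_next_balance() just clamps the caller's
next_balance to the earliest sd->last_balance + interval among the
domains it's called for; roughly:

	static inline void
	update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
	{
		unsigned long interval, next;

		/* used by idle balance, so cpu_busy = 0 */
		interval = get_sd_balance_interval(sd, 0);
		next = sd->last_balance + interval;

		if (time_after(*next_balance, next))
			*next_balance = next;
	}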

> - Account cost from shared_runq_pick_next_task() and update
>   curr_cost and sd->max_newidle_lb_cost accordingly.

Yep, I think this is the correct thing to do.

> 
> - Move the initial rq_unpin_lock() logic around. Also, the caller of
>   shared_runq_pick_next_task() is responsible for calling
>   rq_repin_lock() if the return value is non-zero. (Needs to be verified
>   that everything is right with LOCKDEP.)

Still need to think more about this, but it's purely tactical and can
easily be fixed if we need to.

> 
> - Includes a fix to skip to directly above the LLC domain when calling
>   load_balance() in newidle_balance()

Big fix, thanks again for noticing it.

> All other surgery from [1] has been removed.
> 
> Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 67 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf844ffa79c2..446ffdad49e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  		rq_unpin_lock(rq, &src_rf);
>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>  	}
> -	rq_repin_lock(rq, rf);
>  
>  	return ret;
>  }
> @@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> -	if (sched_feat(SHARED_RUNQ)) {
> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> -		if (pulled_task)
> -			return pulled_task;
> -	}
> -
>  	/*
>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>  	 * measure the duration of idle_balance() as idle time.
>  	 */
>  	this_rq->idle_stamp = rq_clock(this_rq);
>  
> -	/*
> -	 * This is OK, because current is on_cpu, which avoids it being picked
> -	 * for load-balance and preemption/IRQs are still disabled avoiding
> -	 * further scheduler activity on it and we're being very careful to
> -	 * re-start the picking loop.
> -	 */
> -	rq_unpin_lock(this_rq, rf);
> -
>  	rcu_read_lock();
> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> -
> -	/*
> -	 * Skip <= LLC domains as they likely won't have any tasks if the
> -	 * shared runq is empty.
> -	 */
> -	if (sched_feat(SHARED_RUNQ)) {
> +	if (sched_feat(SHARED_RUNQ))
>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -		if (likely(sd))
> -			sd = sd->parent;
> -	}
> +	else
> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> +	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
> +	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		while (sd) {
>  			update_next_balance(sd, &next_balance);
> +			sd = sd->child;
> +		}
> +
>  		rcu_read_unlock();
>  
>  		goto out;
>  	}
> +
> +	if (sched_feat(SHARED_RUNQ)) {
> +		struct sched_domain *tmp = sd;
> +
> +		t0 = sched_clock_cpu(this_cpu);
> +
> +		/* Do update_next_balance() for all domains within LLC */
> +		while (tmp) {
> +			update_next_balance(tmp, &next_balance);
> +			tmp = tmp->child;
> +		}
> +
> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> +		if (pulled_task) {
> +			if (sd) {
> +				curr_cost = sched_clock_cpu(this_cpu) - t0;
> +				/*
> +				 * Will help bail out of scans of higher domains
> +				 * slightly earlier.
> +				 */
> +				update_newidle_cost(sd, curr_cost);
> +			}
> +
> +			rcu_read_unlock();
> +			goto out_swq;
> +		}
> +
> +		if (sd) {
> +			t1 = sched_clock_cpu(this_cpu);
> +			curr_cost += t1 - t0;
> +			update_newidle_cost(sd, curr_cost);
> +		}
> +
> +		/*
> +		 * Since shared_runq_pick_next_task() can take a while,
> +		 * check if the CPU was targeted for a wakeup in the
> +		 * meantime.
> +		 */
> +		if (this_rq->ttwu_pending) {
> +			rcu_read_unlock();
> +			return 0;
> +		}

At first I was wondering whether we should do this above
update_newidle_cost(), but I think it makes sense to always call
update_newidle_cost() after we've failed to get a task from
shared_runq_pick_next_task().

> +	}
>  	rcu_read_unlock();
>  
> +	/*
> +	 * This is OK, because current is on_cpu, which avoids it being picked
> +	 * for load-balance and preemption/IRQs are still disabled avoiding
> +	 * further scheduler activity on it and we're being very careful to
> +	 * re-start the picking loop.
> +	 */
> +	rq_unpin_lock(this_rq, rf);

Don't you need to do this before you exit on the rq->ttwu_pending path?

>  	raw_spin_rq_unlock(this_rq);
>  
>  	t0 = sched_clock_cpu(this_cpu);
> @@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> +		/*
> +		 * Skip <= LLC domains as they likely won't have any tasks if the
> +		 * shared runq is empty.
> +		 */
> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
> +			continue;
> +
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>  
>  			pulled_task = load_balance(this_cpu, this_rq,
> @@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_rq_lock(this_rq);
>  
> +out_swq:
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
>  


Thanks,
David
Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance
Posted by K Prateek Nayak 2 years, 3 months ago
Hello David,

Thank you for taking a look at this despite being on vacation.

On 9/1/2023 12:15 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:07PM +0530, K Prateek Nayak wrote:
>> This patch brings the relevant optimizations from [1] into
>> newidle_balance(). Following is the breakdown:
> 
> Thanks for working on this. I think the fix you added for skipping <=
> LLC domains makes sense. The others possibly as well

I too am in doubt about some of them, but I left them in since I was
building on top of the cumulative diff.

> -- left some
> comments below!
> 
>>
>> - Check "rq->rd->overload" before jumping into newidle_balance, even
>>   with the SHARED_RUNQ feature enabled.
> 
> Out of curiosity -- did you observe this making a material difference in
> your tests? After thinking about it some more, though I see the argument
> for why it would be logical to check if we're overloaded, I'm still
> thinking that it's more ideal to just always check the SHARED_RUNQ.
> rd->overload is only set in find_busiest_group() when we load balance,
> so I worry that having SHARED_RUNQ follow rd->overload may just end up
> making it redundant with normal load balancing in many cases.
> 
> So yeah, while I certainly understand the idea (and would like to better
> understand what kind of difference it made in your tests), I still feel
> pretty strongly that SHARED_RUNQ makes the most sense as a feature when
> it ignores all of these heuristics and just tries to maximize work
> conservation.
> 
> What do you think?

Actually, as it turns out, it was probably a combination of the
rq->avg_idle check + updating of the cost that got the performance back
during experimentation. In Patch 3, I've given the results with this
patch alone and it makes no difference, for tbench 128-client at least.
There is the same rq lock contention I mentioned previously, hence the
per-shard "overload" flag.

Based on Anna-Maria's observation in [1], we have short idle durations
spread across the system with tbench. Now it is possible we are doing a
newidle_balance() when it would have been better to let the CPU idle
for that short duration and not cause contention for the rq lock.

[1] https://lore.kernel.org/lkml/80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de/

> 
>> - Call update_next_balance() for all the domains till the MC domain
>>   when the SHARED_RUNQ path is taken.
> 
> I _think_ this makes sense. Though even in this case, I feel that it may
> be slightly confusing and/or incorrect to push back the balance time
> just because we didn't find a task in our current CCX's shared_runq.
> Maybe we should avoid mucking with load balancing? Not sure, but I am
> leaning towards what you're proposing here as a better approach.

This requires a deeper look and more testing, yes.

> 
>> - Account cost from shared_runq_pick_next_task() and update
>>   curr_cost and sd->max_newidle_lb_cost accordingly.
> 
> Yep, I think this is the correct thing to do.
> 
>>
>> - Move the initial rq_unpin_lock() logic around. Also, the caller of
>>   shared_runq_pick_next_task() is responsible for calling
>>   rq_repin_lock() if the return value is non-zero. (Needs to be verified
>>   that everything is right with LOCKDEP.)
> 
> Still need to think more about this, but it's purely tactical and can
> easily be fixed if we need to.

I agree. I'll leave the full picture of this below in
[Locking code movement clarifications] since we seem to keep coming back
to this and it would be good to have more eyes on what is going on in my
mind :)

> 
>>
>> - Includes a fix to skip to directly above the LLC domain when calling
>>   load_balance() in newidle_balance()
> 
> Big fix, thanks again for noticing it.
> 
>> All other surgery from [1] has been removed.
>>
>> Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>>  kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 67 insertions(+), 27 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bf844ffa79c2..446ffdad49e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>  		rq_unpin_lock(rq, &src_rf);
>>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>  	}
>> -	rq_repin_lock(rq, rf);
>>  
>>  	return ret;
>>  }
>> @@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  	if (!cpu_active(this_cpu))
>>  		return 0;
>>  
>> -	if (sched_feat(SHARED_RUNQ)) {
>> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> -		if (pulled_task)
>> -			return pulled_task;
>> -	}
>> -
>>  	/*
>>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>>  	 * measure the duration of idle_balance() as idle time.
>>  	 */
>>  	this_rq->idle_stamp = rq_clock(this_rq);
>>  
>> -	/*
>> -	 * This is OK, because current is on_cpu, which avoids it being picked
>> -	 * for load-balance and preemption/IRQs are still disabled avoiding
>> -	 * further scheduler activity on it and we're being very careful to
>> -	 * re-start the picking loop.
>> -	 */
>> -	rq_unpin_lock(this_rq, rf);
>> -
>>  	rcu_read_lock();
>> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> -	/*
>> -	 * Skip <= LLC domains as they likely won't have any tasks if the
>> -	 * shared runq is empty.
>> -	 */
>> -	if (sched_feat(SHARED_RUNQ)) {
>> +	if (sched_feat(SHARED_RUNQ))
>>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> -		if (likely(sd))
>> -			sd = sd->parent;
>> -	}
>> +	else
>> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  
>>  	if (!READ_ONCE(this_rq->rd->overload) ||
>> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>> +	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
>> +	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>  
>> -		if (sd)
>> +		while (sd) {
>>  			update_next_balance(sd, &next_balance);
>> +			sd = sd->child;
>> +		}
>> +
>>  		rcu_read_unlock();
>>  
>>  		goto out;
>>  	}
>> +
>> +	if (sched_feat(SHARED_RUNQ)) {
>> +		struct sched_domain *tmp = sd;
>> +
>> +		t0 = sched_clock_cpu(this_cpu);
>> +
>> +		/* Do update_next_balance() for all domains within LLC */
>> +		while (tmp) {
>> +			update_next_balance(tmp, &next_balance);
>> +			tmp = tmp->child;
>> +		}
>> +
>> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> +		if (pulled_task) {
>> +			if (sd) {
>> +				curr_cost = sched_clock_cpu(this_cpu) - t0;
>> +				/*
>> +				 * Will help bail out of scans of higher domains
>> +				 * slightly earlier.
>> +				 */
>> +				update_newidle_cost(sd, curr_cost);
>> +			}
>> +
>> +			rcu_read_unlock();
>> +			goto out_swq;
>> +		}
>> +
>> +		if (sd) {
>> +			t1 = sched_clock_cpu(this_cpu);
>> +			curr_cost += t1 - t0;
>> +			update_newidle_cost(sd, curr_cost);
>> +		}
>> +
>> +		/*
>> +		 * Since shared_runq_pick_next_task() can take a while,
>> +		 * check if the CPU was targeted for a wakeup in the
>> +		 * meantime.
>> +		 */
>> +		if (this_rq->ttwu_pending) {
>> +			rcu_read_unlock();
>> +			return 0;
>> +		}
> 
> At first I was wondering whether we should do this above
> update_newidle_cost(), but I think it makes sense to always call
> update_newidle_cost() after we've failed to get a task from
> shared_runq_pick_next_task().

Indeed. I think it is useful to account for that cost.

> 
>> +	}
>>  	rcu_read_unlock();
>>  
>> +	/*
>> +	 * This is OK, because current is on_cpu, which avoids it being picked
>> +	 * for load-balance and preemption/IRQs are still disabled avoiding
>> +	 * further scheduler activity on it and we're being very careful to
>> +	 * re-start the picking loop.
>> +	 */
>> +	rq_unpin_lock(this_rq, rf);
> 
> Don't you need to do this before you exit on the rq->ttwu_pending path?

[Locking code movement clarifications]

Okay, this is where I'll put all the locking bits I have in my head:

o First, the removal of rq_repin_lock() in shared_runq_pick_next_task()

  Since this is only called from newidle_balance(), it is easy to
  isolate the changes. shared_runq_pick_next_task() can return either
  0, 1 or -1. The interpretation is the same as the return value of
  newidle_balance():

   0: Unsuccessful at pulling a task, but the rq lock was never released
      and reacquired - it was held the whole time.

   1: A task was pulled successfully. The rq lock was released and
      reacquired in the process, but now, with the above changes, it is
      not pinned.

  -1: Unsuccessful at pulling a task, but the rq lock was released and
      reacquired in the process, and now, with the above changes, it is
      not pinned.

  Now the following block:

	pulled_task = shared_runq_pick_next_task(this_rq, rf);
	if (pulled_task) {
		...
		goto out_swq;
	}

  takes care of the cases where the return value is -1 or 1. The "out_swq"
  label is almost at the end of newidle_balance(), and just before
  returning, newidle_balance() does:

	rq_repin_lock(this_rq, rf);

  So this path will repin the lock.

  Now for the case where shared_runq_pick_next_task() returns 0.

o Which brings us to the question you asked above

  newidle_balance() is called with the rq lock held and pinned, and the
  same is expected when newidle_balance() returns.
  check in newidle_balance() is:

	if (this_rq->ttwu_pending)
		return 0;

  so we return without any change to the state of the rq lock.

  Coming to the above changes, if we are to hit the ttwu_pending
  bailout you pointed at, shared_runq_pick_next_task() must have
  returned 0, signifying no modification to the state of the lock or
  the pinning. Then we update the cost and come to the ttwu_pending
  check. We still have the lock held, and it is pinned. Thus we do not
  need to unpin the lock, since newidle_balance() is expected to return
  with the lock held and pinned.

Please let me know if I've missed something.
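
To summarize my understanding of the lock/pin state (assuming the repin
responsibility is moved to the caller as in this patch):

	/*
	 * shared_runq_pick_next_task() ==  0: rq lock held, still pinned;
	 *                                     the ttwu_pending bailout can
	 *                                     return directly.
	 * shared_runq_pick_next_task() ==  1: rq lock reacquired, not pinned.
	 * shared_runq_pick_next_task() == -1: rq lock reacquired, not pinned.
	 *
	 * The non-zero cases jump to out_swq, which does rq_repin_lock()
	 * just before returning, so every path leaves newidle_balance()
	 * with the rq lock held and pinned.
	 */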

> 
>>  	raw_spin_rq_unlock(this_rq);
>>  
>>  	t0 = sched_clock_cpu(this_cpu);
>> @@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>>  			break;
>>  
>> +		/*
>> +		 * Skip <= LLC domains as they likely won't have any tasks if the
>> +		 * shared runq is empty.
>> +		 */
>> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> +			continue;
>> +
>>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>>  
>>  			pulled_task = load_balance(this_cpu, this_rq,
>> @@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  
>>  	raw_spin_rq_lock(this_rq);
>>  
>> +out_swq:
>>  	if (curr_cost > this_rq->max_idle_balance_cost)
>>  		this_rq->max_idle_balance_cost = curr_cost;
>>  
> 
> 
> Thanks,
> David
 
--
Thanks and Regards,
Prateek