[PATCH] x86: make "dom0_nodes=" work with credit2

[PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Jan Beulich 2 years ago
Apart from setting the affinities, vCPU-s / units also need to be
migrated off of CPUs which aren't part of the affinity. Otherwise, for a
reason I haven't been able to determine, credit2 won't actually schedule
them, resulting in the Dom0 boot process hanging (either right away or
when bringing up "secondary" vCPU-s).

Fixes: dafd936ddd ("Make credit2 the default scheduler")
Reported-by: Olaf Hering <ohering@suse.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
The Fixes: tag isn't very precise - it's rather the commit exposing the
issue by default. I haven't been able to identify the actual commit
which did introduce the problem; it may well be that it has always been
there since the introduction of credit2.

Credit2 moving the vCPU-s off of their initially assigned ones right
away of course renders sched_select_initial_cpu()'s use of cpu_cycle()
pretty useless. But I guess that's still useful for other schedulers.
I wonder though whether sched_init_vcpu() shouldn't use the CPU map
calculated by sched_select_initial_cpu() for its call to
sched_set_affinity() in the non-pinned case, rather than setting "all".
(I guess doing so might mask the issue at hand, but I think the change
here would still be applicable at least from an abstract pov.)

--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -3403,9 +3403,15 @@ void __init sched_setup_dom0_vcpus(struc
     {
         for_each_sched_unit ( d, unit )
         {
+            spinlock_t *lock = unit_schedule_lock_irq(unit);
+
             if ( !opt_dom0_vcpus_pin && !dom0_affinity_relaxed )
                 sched_set_affinity(unit, &dom0_cpus, NULL);
             sched_set_affinity(unit, NULL, &dom0_cpus);
+
+            sched_unit_migrate_start(unit);
+            unit_schedule_unlock_irq(lock, unit);
+            sched_unit_migrate_finish(unit);
         }
     }
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Thu, 2022-04-07 at 15:27 +0200, Jan Beulich wrote:
> ---
> The Fixes: tag isn't very precise - it's rather the commit exposing
> the
> issue by default. I haven't been able to identify the actual commit
> which did introduce the problem; it may well be that it has always
> been
> there since the introduction of credit2.
> 
> Credit2 moving the vCPU-s off of their initially assigned ones right
> away of course renders sched_select_initial_cpu()'s use of
> cpu_cycle()
> pretty useless.
>
Mmm... you mean that Credit2 is moving the vCPUs off their assigned
ones _right_now_, or that it will, with this patch?

If you mean the former, I'm not sure it is. In fact, when
sched_select_initial_cpu() is called for dom0, dom0's node affinity is
just all nodes, isn't it? So, we can pick any CPU in the cpupool,
and we use cycle to try to spread the vCPUs kind of evenly.

If you mean after this patch, well, sure, but that's normal. Again,
when we pick the initial CPU, we still don't know that the vCPUs have
an affinity. Or, actually, we know that in sched_setup_dom0_vcpus(),
but there's no direct way to tell it to sched_init_vcpu() (and hence
sched_select_initial_cpu()).

That's because, by default, affinity is just "all cpus", when we create
the vCPUs, and we change that later, if they have one already (due to
it being present in the config file, or in the dom0_nodes parameter).

Something that *maybe* we can try, since we're handling dom0 vCPUs
specially anyway, is to directly set dom0's node affinity to the nodes
of the CPUs we have in dom0_cpus, before calling vcpu_create() (in
sched_setup_dom0_vcpus(), of course).

This should make sched_select_initial_cpu() pick one of the "correct"
CPUs in the first place. But I don't know if it's worth it, nor whether
we'll still need this patch anyway (I have to check more thoroughly).

>  But I guess that's still useful for other schedulers.
> I wonder though whether sched_init_vcpu() shouldn't use the CPU map
> calculated by sched_select_initial_cpu() for its call to
> sched_set_affinity() in the non-pinned case, rather than setting
> "all".
>
If we do that, and there's no affinity configured for the guest, or no
"dom0_nodes=", when will we reset the affinity to all, which is what it
should be in such a case?

Also, if I'm right in my reasoning above, when we come from
sched_setup_dom0_vcpus(), the mask calculated by
sched_select_initial_cpu() is basically cpupool0's cpus_valid, so this
wouldn't really change anything for the problem we're trying to solve
here.

> (I guess doing so might mask the issue at hand, but I think the
> change
> here would still be applicable at least from an abstract pov.)
> 
I don't think it would mask it, but I do think that, yes, the change
you're making would still be applicable.

And, about it, one thing...

> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -3403,9 +3403,15 @@ void __init sched_setup_dom0_vcpus(struc
>      {
>          for_each_sched_unit ( d, unit )
>          {
> +            spinlock_t *lock = unit_schedule_lock_irq(unit);
> +
>              if ( !opt_dom0_vcpus_pin && !dom0_affinity_relaxed )
>                  sched_set_affinity(unit, &dom0_cpus, NULL);
>              sched_set_affinity(unit, NULL, &dom0_cpus);
> +
> +            sched_unit_migrate_start(unit);
> +            unit_schedule_unlock_irq(lock, unit);
> +            sched_unit_migrate_finish(unit);
>          }
>      }
>  
... The result of this (also considering the call to
domain_update_node_affinity()) ends up looking very similar to what
we'd get if, instead of calling sched_set_affinity(), we call
vcpu_set_affinity().

I'm therefore wondering if we should try to just do that... But I'm not
sure, mostly because that would mean calling
domain_update_node_affinity() for all dom0's vCPUs, which is clearly
less efficient than just calling it once at the end.

So I'm thinking that we can indeed do it like this, and add a comment.

Anyone else any thoughts?

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Fri, 2022-04-08 at 11:20 +0000, Dario Faggioli wrote:
> On Thu, 2022-04-07 at 15:27 +0200, Jan Beulich wrote:
> > 
> > Credit2 moving the vCPU-s off of their initially assigned ones
> > right
> > away of course renders sched_select_initial_cpu()'s use of
> > cpu_cycle()
> > pretty useless.
> > 
> Mmm... you mean that Credit2 is moving the vCPUs off they're assigned
> ones _right_now_, or that it will, with this patch?
> 
> If you mean the former, I'm not sure it is. In fact, when
> sched_select_initial_cpu() is called for dom0, dom0's node affinity
> is
> just all nodes, isn't it? 
>
Actually, it's what results from dom0_nodes. I was forgetting that we
set d->node_affinity while parsing the parameter.

That said, I've added some logging, to better understand what is going
on, and there are some not obvious (at least, not to me) things that
are happening, and it seems that what is causing problems is the call
to sched_insert_unit(), within sched_init_vcpu().

I have to leave now, but I will report what I've found ASAP (probably
tomorrow, though).

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Jan Beulich 2 years ago
On 08.04.2022 13:20, Dario Faggioli wrote:
> On Thu, 2022-04-07 at 15:27 +0200, Jan Beulich wrote:
>> ---
>> The Fixes: tag isn't very precise - it's rather the commit exposing
>> the
>> issue by default. I haven't been able to identify the actual commit
>> which did introduce the problem; it may well be that it has always
>> been
>> there since the introduction of credit2.
>>
>> Credit2 moving the vCPU-s off of their initially assigned ones right
>> away of course renders sched_select_initial_cpu()'s use of
>> cpu_cycle()
>> pretty useless.
>>
> Mmm... you mean that Credit2 is moving the vCPUs off they're assigned
> ones _right_now_, or that it will, with this patch?

Right now. On a 4-node (6 cores each) system with "dom0_nodes=0" I've
observed the 6 vCPU-s to reliably be put on CPUs 13, 14, etc. The
patch is actually undoing that questionable movement.

Since I have a large amount of other patches in place (none of which
I would assume to have such an effect) - Olaf has meanwhile confirmed
that the change helps for him as well.

> If you mean the former, I'm not sure it is. In fact, when
> sched_select_initial_cpu() is called for dom0, dom0's node affinity is
> just all nodes, isn't it? So, the we can pick any CPU in the cpupool,
> and we use cycle to try to spread the vCPUs kind of evenly.

The CPU mask used in the function is 0x3f for the example given above.
I did not check which of the constituent parts of the calculation have
this effect. But the result is that all vCPU-s will be put on CPU 0
first, as cpumask_cycle(..., 13) for a mask of 0x3f produces 0.
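
For illustration, here is a minimal standalone sketch of the wrap-around
lookup that cpumask_cycle() performs, using a plain uint64_t as a stand-in
for cpumask_t (this models the documented behaviour; it is not Xen's
implementation):

#include <stdint.h>
#include <stdio.h>

/* Next set bit in mask strictly after n, wrapping around (cf. cpumask_cycle()). */
static unsigned int mask_cycle(unsigned int n, uint64_t mask)
{
    unsigned int i;

    for ( i = 1; i <= 64; i++ )
    {
        unsigned int bit = (n + i) % 64;

        if ( mask & (1ULL << bit) )
            return bit;
    }

    return 64; /* empty mask */
}

int main(void)
{
    /* dom0_nodes=0 on the 4x6-core box above: dom0_cpus is 0x3f (CPUs 0-5). */
    printf("%u\n", mask_cycle(13, 0x3f)); /* prints 0 */

    return 0;
}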

> If you mean after this patch, well, sure, but that's normal. Again,
> when we pick the initial CPU, we still don't know that the vCPUs have
> an affinity. Or, actually, we know that in sched_setup_dom0_vcpus(),
> but there's no direct way to tell it to sched_init_vcpu() (and hence
> sched_select_initial_cpu()).
> 
> That's because, by default, affinity is just "all cpus", when we create
> the vCPUs, and we change that later, if they have one already (due to
> it being present in the config file, or in the dom0_nodes parameter).

But that's what I'm talking about a little further down, where you
reply that you don't think using the more narrow set would hide the
issue.

> Something that *maybe* we can try, since we're handling dom0 vCPUs
> specially anyway, is to directly set dom0's node affinity to the nodes
> of the CPUs we have in dom0_cpus, before calling vcpu_create() (in
> sched_setup_dom0_vcpus(), of course).
> 
> This should make sched_select_initial_cpu() pick one of the "correct"
> CPUs in the first place. But I don't know if it's worth, neither if
> we'll still need this patch anyway (I have to check more thoroughly).

As per above - sched_select_initial_cpu() behaves as I would expect
it. It's credit2 which subsequently overrides that decision.

>>  But I guess that's still useful for other schedulers.
>> I wonder though whether sched_init_vcpu() shouldn't use the CPU map
>> calculated by sched_select_initial_cpu() for its call to
>> sched_set_affinity() in the non-pinned case, rather than setting
>> "all".
>>
> If we do that, and there's no affinity configured for the guest, or no
> "dom0_nodes=", when will we reset the affinity to all, which what it
> should be in such a case?

Why "reset"? When no restrictions are in place, afaict
sched_select_initial_cpu() will calculate a mask of "all".

> Also, if I'm right in my reasoning above, when we come from
> sched_setup_dom0_vcpus(), the mast calculated by
> sched_select_initial_cpu() is basically cpupool0's cpus_valid, so this
> wouldn't really change anything for the problem we're trying to solve
> here.
> 
>> (I guess doing so might mask the issue at hand, but I think the
>> change
>> here would still be applicable at least from an abstract pov.)
>>
> I don't think it would mask it, but I do think that, yes, the change
> you're making would still be applicable.
> 
> And, about it, One thing...
> 
>> --- a/xen/common/sched/core.c
>> +++ b/xen/common/sched/core.c
>> @@ -3403,9 +3403,15 @@ void __init sched_setup_dom0_vcpus(struc
>>      {
>>          for_each_sched_unit ( d, unit )
>>          {
>> +            spinlock_t *lock = unit_schedule_lock_irq(unit);
>> +
>>              if ( !opt_dom0_vcpus_pin && !dom0_affinity_relaxed )
>>                  sched_set_affinity(unit, &dom0_cpus, NULL);
>>              sched_set_affinity(unit, NULL, &dom0_cpus);
>> +
>> +            sched_unit_migrate_start(unit);
>> +            unit_schedule_unlock_irq(lock, unit);
>> +            sched_unit_migrate_finish(unit);
>>          }
>>      }
>>  
> ... The result of this (also considering the call to
> domain_update_node_affinity()) ends up looking very similar to what
> we'd get if, instead of calling sched_set_affinity(), we call
> vcpu_set_affinity().

Funny you should say this - this is what I had done first. It works,
but it is less efficient than the approach above. First and foremost
when there are multiple vCPU-s per unit.

> I'm therefore wondering if we should try to just do that... But I'm not
> sure, mostly because that would mean calling
> domain_update_node_affinity() for all dom0's vCPUs, which is clearly
> less efficient than just calling it once at the end.

Indeed, that's the other reason why I did change to the approach
above.

> So I'm thinking that we can indeed do it like this, and add a comment.

A comment to what effect?

Jan
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Fri, 2022-04-08 at 14:36 +0200, Jan Beulich wrote:
> On 08.04.2022 13:20, Dario Faggioli wrote:
> > On Thu, 2022-04-07 at 15:27 +0200, Jan Beulich wrote:
> > > Credit2 moving the vCPU-s off of their initially assigned ones
> > > right
> > > away of course renders sched_select_initial_cpu()'s use of
> > > cpu_cycle()
> > > pretty useless.
> > > 
> > Mmm... you mean that Credit2 is moving the vCPUs off they're
> > assigned
> > ones _right_now_, or that it will, with this patch?
> 
> Right now. On a 4-node (6 cores each) system with "dom0_nodes=0" I've
> observed the 6 vCPU-s to reliably be put on CPUs 13, 14, etc. The
> patch is actually undoing that questionable movement.
> 
So, yes, as said, I was wrong here, because I was forgetting that we
update dom0's node-affinity according to what we find in dom0_nodes,
while we parse dom0_nodes itself (and it was stupid of me to forget
that, because of course we do it! :-/).

And in fact, since node-affinity is already set to something different
than "all", sched_select_initial_cpu() picks up a CPU from it.

And it is also the case that we could, basically, use that later, in
sched_init_vcpu(), which is kind of what I'm proposing we do.

> Since I have a large amount of other patches in place (none of which
> I would assume to have such an effect) - Olaf has meanwhile confirmed
> that the change helps for him as well.
> 
> > If you mean the former, I'm not sure it is. In fact, when
> > sched_select_initial_cpu() is called for dom0, dom0's node affinity
> > is
> > just all nodes, isn't it? So, the we can pick any CPU in the
> > cpupool,
> > and we use cycle to try to spread the vCPUs kind of evenly.
> 
> The CPU mask used in the function is 0x3f for the example given
> above.
> I did not check which of the constituent parts of the calculation
> have
> this effect. But the result is that all CPUs will be put on CPU 0
> first, as cpumask_cycle(..., 13) for a mask of 0x3f produces 0.
> 
Right. For instance, on my 16 CPUs, 2 nodes test box, where I tried
using dom0_nodes=0 (mask {0-7}), I saw the following:
- for d0v0, sched_select_initial_cpu() selects CPU _0_
- later in sched_init_vcpu() affinity is set to all
- later^2 in sched_init_vcpu(), Credit2's implementation of 
 sched_insert_unit() does a "CPU pick", assuming it can use any CPU, 
 as the affinity is "all". It ends up picking CPU _9_

We move on in the boot process, and we get to create the other dom0
vcpus, e.g., d0v1:
- d0v0->processor is _9_, so the cpumask_cycle() in 
 sched_select_initial_cpu(), done from 9, with a mask of {0-7}, 
 brings us to _0_
- that does not matter much, as the CPU pick inside of the specific 
 scheduler has its own logic, which also involves a cpumask_cycle().
 And since we picked _9_ before, and the affinity has again been set 
 to "all", we pick _10_ and, for what we know, that is fine.

This goes on and happens for all vCPUs, with the result that (in the
worst case) all of them are placed on CPUs where they can't run.

One can think that this comes from the fact that we do CPU selection
twice (once in sched_select_initial_cpu() and another one, inside the
specific scheduler). However, this is necessary, for two reasons:
- a scheduler may not have its own "CPU pick logic", as that is not 
  required
- even for the schedulers that do, we need to have something in
  v->processor, when it runs. And sched_select_initial_cpu() is   
  something that is (should be?) simple enough, but at the same time
  it puts something sensible there.

> > That's because, by default, affinity is just "all cpus", when we
> > create
> > the vCPUs, and we change that later, if they have one already (due
> > to
> > it being present in the config file, or in the dom0_nodes
> > parameter).
> 
> But that's what I'm talking about a little further down, where you
> reply that you don't think using the more narrow set would hide the
> issue.
> 
Sure, as said, I was missing a piece when I replied.

> > This should make sched_select_initial_cpu() pick one of the
> > "correct"
> > CPUs in the first place. But I don't know if it's worth, neither if
> > we'll still need this patch anyway (I have to check more
> > thoroughly).
> 
> As per above - sched_select_initial_cpu() behaves as I would expect
> it. It's credit2 which subsequently overrides that decision.
> 
Right, and it overrides it, because we set the affinity to "all", so
it's within its rights to do so. And the point here is that it is indeed
not right to do that.

So, as you said yourself, we should really change the fact that we're
setting "all" as an affinity for dom0 vCPUs, there in
sched_init_vcpu(), while we know it's not the case.

And while doing that, I think we should consolidate touching the
affinity only there, avoiding altering it twice. After all, we already
know what it should look like, so let's go for it.

I'll send a patch to that effect, to show what I mean with this. And...

> > >  But I guess that's still useful for other schedulers.
>
...Yes, I'll rebase your patch on top of mine. In fact, although
technically no longer necessary, for _this_specific_ issue (I tested
without it, and things work), I think it still makes sense.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Tue, 2022-04-12 at 15:48 +0000, Dario Faggioli wrote:
> On Fri, 2022-04-08 at 14:36 +0200, Jan Beulich wrote:
> 
> 
> And while doing that, I think we should consolidate touching the
> affinity only there, avoiding altering it twice. After all, we
> already
> know how it should look like, so let's go for it.
> 
> I'll send a patch to that effect, to show what I mean with this. 
> 
Here it is.

It's tested, with a few combinations of dom0_nodes and dom0_vcpus_pin
being there or not, and it survived (and behaved as I would expect it
to) all of them :-)

I haven't tested yet (and can't test easily) the pv_shim case. I think
it's fine, but I'm adding Roger, to see if he can confirm that...
---
From: Dario Faggioli <dfaggioli@suse.com>
Subject: [PATCH 1/2] xen/sched: setup dom0 vCPUs affinity only once

Right now, affinity for dom0 vCPUs is setup in two steps. This is a
problem as, at least in Credit2, unit_insert() sees and uses the
"intermediate" affinity, and place the vCPUs on CPUs where they cannot
be run, resulting in the boot to hang, if the "dom0_nodes" parameter
is used.

Fix this by setting up the affinity properly once and for all, in
sched_init_vcpu(), called by create_vcpu().

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
* Changelog is RFC!
---
 xen/common/sched/core.c    | 59 +++++++++++++++++++++++++-------------
 xen/common/sched/credit2.c |  8 ++++--
 2 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 19ab678181..dc2ed890e0 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -572,11 +572,41 @@ int sched_init_vcpu(struct vcpu *v)
     }
 
     /*
-     * Initialize affinity settings. The idler, and potentially
-     * domain-0 VCPUs, are pinned onto their respective physical CPUs.
+     * Initialize affinity settings. By doing this before the unit is
+     * inserted in the scheduler runqueues (by the call to sched_insert_unit(),
+     * at the end of the function), we are sure that it will be put on an
+     * appropriate CPU.
      */
-    if ( is_idle_domain(d) || (is_hardware_domain(d) && opt_dom0_vcpus_pin) )
+    if ( pv_shim && v->vcpu_id == 0 )
+    {
+        /*
+         * PV-shim: vcpus are pinned 1:1. Initially only 1 cpu is online,
+         * others will be dealt with when onlining them. This avoids pinning
+         * a vcpu to a not yet online cpu here.
+         */
+        sched_set_affinity(unit, cpumask_of(0), cpumask_of(0));
+    }
+    else if ( is_idle_domain(d) || (is_hardware_domain(d) && opt_dom0_vcpus_pin) )
+    {
+        /*
+         * The idler, and potentially domain-0 VCPUs, are pinned onto their
+         * respective physical CPUs.
+         */
         sched_set_affinity(unit, cpumask_of(processor), &cpumask_all);
+    }
+    else if ( is_hardware_domain(d) )
+    {
+        /*
+         * In absence of dom0_vcpus_pin, the hard and soft affinity of
+         * domain-0 is controlled by the dom0_nodes parameter. At this point
+         * it has been parsed and decoded, and we have the result of that
+         * in the dom0_cpus mask.
+         */
+        if ( !dom0_affinity_relaxed )
+            sched_set_affinity(unit, &dom0_cpus, &cpumask_all);
+        else
+            sched_set_affinity(unit, &cpumask_all, &dom0_cpus);
+    }
     else
         sched_set_affinity(unit, &cpumask_all, &cpumask_all);
 
@@ -3386,29 +3416,18 @@ void wait(void)
 void __init sched_setup_dom0_vcpus(struct domain *d)
 {
     unsigned int i;
-    struct sched_unit *unit;
 
     for ( i = 1; i < d->max_vcpus; i++ )
         vcpu_create(d, i);
 
     /*
-     * PV-shim: vcpus are pinned 1:1.
-     * Initially only 1 cpu is online, others will be dealt with when
-     * onlining them. This avoids pinning a vcpu to a not yet online cpu here.
+     * sched_init_vcpu(), called by vcpu_create(), will setup the hard and
+     * soft affinity of all the vCPUs, by calling sched_set_affinity() on each
+     * one of them. We can now make sure that the domain's node affinity is
+     * also updated accordingly, and we can do that here, once and for all
+     * (which is more efficient than calling domain_update_node_affinity()
+     * on all the vCPUs).
      */
-    if ( pv_shim )
-        sched_set_affinity(d->vcpu[0]->sched_unit,
-                           cpumask_of(0), cpumask_of(0));
-    else
-    {
-        for_each_sched_unit ( d, unit )
-        {
-            if ( !opt_dom0_vcpus_pin && !dom0_affinity_relaxed )
-                sched_set_affinity(unit, &dom0_cpus, NULL);
-            sched_set_affinity(unit, NULL, &dom0_cpus);
-        }
-    }
-
     domain_update_node_affinity(d);
 }
 #endif
diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
index 0e3f89e537..ac5f8b8820 100644
--- a/xen/common/sched/credit2.c
+++ b/xen/common/sched/credit2.c
@@ -749,10 +749,12 @@ static int get_fallback_cpu(struct csched2_unit *svc)
 
         /*
          * This is cases 2 or 4 (depending on bs): v->processor isn't there
-         * any longer, check if we at least can stay in our current runq.
+         * any longer, check if we at least can stay in our current runq,
+	 * if we have any (e.g., we don't yet, if we get here when a unit
+	 * is inserted for the very first time).
          */
-        if ( likely(cpumask_intersects(cpumask_scratch_cpu(cpu),
-                                       &svc->rqd->active)) )
+        if ( likely(svc->rqd && cpumask_intersects(cpumask_scratch_cpu(cpu),
+                                                   &svc->rqd->active)) )
         {
             cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                         &svc->rqd->active);
-- 
2.35.1
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Jan Beulich 2 years ago
On 12.04.2022 18:11, Dario Faggioli wrote:
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -572,11 +572,41 @@ int sched_init_vcpu(struct vcpu *v)
>      }
>  
>      /*
> -     * Initialize affinity settings. The idler, and potentially
> -     * domain-0 VCPUs, are pinned onto their respective physical CPUs.
> +     * Initialize affinity settings. By doing this before the unit is
> +     * inserted in the scheduler runqueues (by the call to sched_insert_unit(),
> +     * at the end of the function, we are sure that it will be put on an
> +     * appropriate CPU.
>       */
> -    if ( is_idle_domain(d) || (is_hardware_domain(d) && opt_dom0_vcpus_pin) )
> +    if ( pv_shim && v->vcpu_id == 0 )

I don't think you can handle the shim case first, as then you'd also have
its CPU0 idle vCPU take this path. The difference _may_ only be cosmetic,
but I think it would be odd for CPU0's idle vCPU to have a soft affinity
of just CPU0, while all others use cpumask_all.

> +    {
> +        /*
> +         * PV-shim: vcpus are pinned 1:1. Initially only 1 cpu is online,
> +         * others will be dealt with when onlining them. This avoids pinning
> +         * a vcpu to a not yet online cpu here.
> +         */
> +        sched_set_affinity(unit, cpumask_of(0), cpumask_of(0));
> +    }
> +    else if ( is_idle_domain(d) || (is_hardware_domain(d) && opt_dom0_vcpus_pin) )

Here (pre-existing) as well as ...

> +    {
> +        /*
> +         * The idler, and potentially domain-0 VCPUs, are pinned onto their
> +         * respective physical CPUs.
> +         */
>          sched_set_affinity(unit, cpumask_of(processor), &cpumask_all);
> +    }
> +    else if ( is_hardware_domain(d) )

... here I wonder: Shouldn't this be limited to Dom0 (for the purposes here
!= hwdom)? Any special affinity for a late hwdom ought to be specified by
the logic creating that domain imo, not by command line options concerning
Dom0 only.

This then also determines where the involved variables (opt_dom0_vcpus_pin,
dom0_affinity_relaxed, and dom0_cpus) are to be placed (answering your
question in a subsequent reply): If it's strictly Dom0 only (and knowing
there's no way to add vCPU-s to a domain post-construction), then
__initdata is fine for all of them. If late hwdom was to also be covered,
__hwdom_initdata would need to be used.

> +    {
> +        /*
> +         * In absence of dom0_vcpus_pin, the hard and soft affinity of
> +         * domain-0 is controlled by the dom0_nodes parameter. At this point
> +         * it has been parsed and decoded, and we have the result of that
> +         * in the dom0_cpus mask.
> +         */
> +        if ( !dom0_affinity_relaxed )
> +            sched_set_affinity(unit, &dom0_cpus, &cpumask_all);
> +        else
> +            sched_set_affinity(unit, &cpumask_all, &dom0_cpus);

I guess by referencing dom0_affinity_relaxed and dom0_cpus outside of
CONFIG_X86 section you're breaking the Arm build.

I also have a more general question here: sched.h says "Bitmask of CPUs
on which this VCPU may run" for hard affinity and "Bitmask of CPUs on
which this VCPU prefers to run" for soft affinity. Additionally there's
soft_aff_effective. Does it make sense in the first place for one to be
a proper subset of the of the other in _both_ directions? Is that mainly
to have a way to record preferences even when all preferred CPUs are
offline, to be able to go back to the preferences once CPUs come back
online?

Then a follow-on question is: Why do you use cpumask_all for soft
affinity in the first of the two calls above? Is this to cover for the
case where all CPUs in dom0_cpus would go offline?

> +    }
>      else
>          sched_set_affinity(unit, &cpumask_all, &cpumask_all);

Hmm, you leave this alone. Wouldn't it be better to further generalize
things, in case domain affinity was set already? I was referring to
the mask calculated by sched_select_initial_cpu() also in this regard.
And when I did suggest to re-use the result, I did mean this literally.
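
A hypothetical sketch of that idea (not part of either patch; "initial_cpus"
is an assumed local variable holding the mask that sched_select_initial_cpu()
computed):

    if ( is_idle_domain(d) || (is_hardware_domain(d) && opt_dom0_vcpus_pin) )
        sched_set_affinity(unit, cpumask_of(processor), &cpumask_all);
    else
        sched_set_affinity(unit, initial_cpus, &cpumask_all);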

> @@ -3386,29 +3416,18 @@ void wait(void)
>  void __init sched_setup_dom0_vcpus(struct domain *d)
>  {
>      unsigned int i;
> -    struct sched_unit *unit;
>  
>      for ( i = 1; i < d->max_vcpus; i++ )
>          vcpu_create(d, i);
>  
>      /*
> -     * PV-shim: vcpus are pinned 1:1.
> -     * Initially only 1 cpu is online, others will be dealt with when
> -     * onlining them. This avoids pinning a vcpu to a not yet online cpu here.
> +     * sched_vcpu_init(), called by vcpu_create(), will setup the hard and
> +     * soft affinity of all the vCPUs, by calling sched_set_affinity() on each
> +     * one of them. We can now make sure that the domain's node affinity is
> +     * also updated accordingly, and we can do that here, once and for all
> +     * (which is more efficient than calling domain_update_node_affinity()
> +     * on all the vCPUs).
>       */
> -    if ( pv_shim )
> -        sched_set_affinity(d->vcpu[0]->sched_unit,
> -                           cpumask_of(0), cpumask_of(0));
> -    else
> -    {
> -        for_each_sched_unit ( d, unit )
> -        {
> -            if ( !opt_dom0_vcpus_pin && !dom0_affinity_relaxed )
> -                sched_set_affinity(unit, &dom0_cpus, NULL);
> -            sched_set_affinity(unit, NULL, &dom0_cpus);
> -        }
> -    }
> -
>      domain_update_node_affinity(d);
>  }

I consider the comment somewhat misleading, and hence I wonder if a comment
is needed here in the first place. domain_update_node_affinity() acts on a
domain, not on individual vCPU-s. Hence it's not clear what "calling
domain_update_node_affinity() on all the vCPUs" would be referring to. I
don't think anyone would consider calling vcpu_set_affinity() here just for
the purpose of updating domain affinity, with all other (vCPU) affinity
setting now happening elsewhere.

> --- a/xen/common/sched/credit2.c
> +++ b/xen/common/sched/credit2.c
> @@ -749,10 +749,12 @@ static int get_fallback_cpu(struct csched2_unit *svc)
>  
>          /*
>           * This is cases 2 or 4 (depending on bs): v->processor isn't there
> -         * any longer, check if we at least can stay in our current runq.
> +         * any longer, check if we at least can stay in our current runq,
> +	 * if we have any (e.g., we don't yet, if we get here when a unit
> +	 * is inserted for the very first time).
>           */
> -        if ( likely(cpumask_intersects(cpumask_scratch_cpu(cpu),
> -                                       &svc->rqd->active)) )
> +        if ( likely(svc->rqd && cpumask_intersects(cpumask_scratch_cpu(cpu),
> +                                                   &svc->rqd->active)) )
>          {
>              cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
>                          &svc->rqd->active);

This change is not covered by anything in the description, and I wonder why
you're making the code adjustment. If svc->rqd was NULL, wouldn't Xen have
crashed prior to the adjustment? I can't spot how it being NULL here could
be the effect of any of the other changes you're making.

If the comment adjustment is to be retained, please take care of the hard
tabs which have slipped in.

Jan
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Wed, 2022-04-13 at 12:00 +0200, Jan Beulich wrote:
> On 12.04.2022 18:11, Dario Faggioli wrote:
> > --- a/xen/common/sched/core.c
> > +++ b/xen/common/sched/core.c
> > @@ -572,11 +572,41 @@ int sched_init_vcpu(struct vcpu *v)
> >      }
> >  
> >      /*
> > -     * Initialize affinity settings. The idler, and potentially
> > -     * domain-0 VCPUs, are pinned onto their respective physical
> > CPUs.
> > +     * Initialize affinity settings. By doing this before the unit
> > is
> > +     * inserted in the scheduler runqueues (by the call to
> > sched_insert_unit(),
> > +     * at the end of the function, we are sure that it will be put
> > on an
> > +     * appropriate CPU.
> >       */
> > -    if ( is_idle_domain(d) || (is_hardware_domain(d) &&
> > opt_dom0_vcpus_pin) )
> > +    if ( pv_shim && v->vcpu_id == 0 )
> 
> I don't think you can handle the shim case first, as then you'd also
> have
> its CPU0 idle vCPU take this path. The difference _may_ only be
> cosmetic,
> but I think it would be odd for CPU0's idle vCPU to have a soft
> affinity
> of just CPU0, while all others use cpumask_all.
> 
Ok, yes, I didn't think of that. I'll reshuffle the if-s order.

> 
> > +    {
> > +        /*
> > +         * The idler, and potentially domain-0 VCPUs, are pinned
> > onto their
> > +         * respective physical CPUs.
> > +         */
> >          sched_set_affinity(unit, cpumask_of(processor),
> > &cpumask_all);
> > +    }
> > +    else if ( is_hardware_domain(d) )
> 
> ... here I wonder: Shouldn't this be limited to Dom0 (for the
> purposes here
> != hwdom)? Any special affinity for a late hwdom ought to be
> specified by
> the logic creating that domain imo, not by command line options
> concerning
> Dom0 only.
> 
I think this makes sense. I'll add a patch for changing it.

> I also have a more general question here: sched.h says "Bitmask of
> CPUs
> on which this VCPU may run" for hard affinity and "Bitmask of CPUs on
> which this VCPU prefers to run" for soft affinity. Additionally
> there's
> soft_aff_effective. Does it make sense in the first place for one to
> be
> a proper subset of the of the other in _both_ directions? 
>
I'm not sure I'm 100% getting what you're asking. In particular, I'm
not sure what you mean by "for one to be a proper subset of the
other in both directions"?

Anyway, soft and hard affinity are under the complete control of the
user (I guess we can say that they're policy), so we tend to accept
pretty much everything that comes from the user.

That is, the user can set a hard affinity of 1-6 and a soft affinity
of (a) 2-3, (b) 0-2, (c) 7-12, etc.

Case (a), i.e., soft is a strict subset of hard, is the one that makes
the most sense, of course. With this configuration, the vCPU(s) can run
on CPUs 1, 2, 3, 4, 5 and 6, but the scheduler will prefer to run it
(them) on 2 and/or 3.

Case (b), i.e., no strict subset, but there's some overlap, also means
that soft-affinity is going to be considered and have an effect. In
fact, vCPU(s) will prefer to run on CPUs 1 and/or 2, but of course it
(they) will never run on CPU 0. Of course, the user can, at a later
point in time, change the hard affinity so that it includes CPU 0, and
we'll be back to the strict-subset case. So that's why we want to keep
0 in the mask, even if it causes soft to not be a strict subset of
hard.

In case (c), soft affinity is totally useless. However, again, the user
can later change hard to include some or all CPUs 7-12, so we keep it.
We do, however, print a warning. And we also use the soft_aff_effective
flag to avoid going through the soft-affinity balancing step in the
scheduler code. This is, in fact, why we also check whether hard is not
a strict subset of soft. As, if it is, there's no need to do anything
about soft, as honoring hard will automatically take care of that as
well.
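
To make the three cases concrete, here is a small standalone sketch of the
condition just described (plain C with uint64_t bitmasks instead of
cpumask_t; the soft_aff_effective name is borrowed from above, and this is
not the actual Xen code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool subset(uint64_t a, uint64_t b)     { return !(a & ~b); }
static bool intersects(uint64_t a, uint64_t b) { return a & b; }

/*
 * Soft affinity only needs balancing if honoring hard affinity does not
 * already imply it, and the two overlap at all.
 */
static bool soft_aff_effective(uint64_t hard, uint64_t soft)
{
    return !subset(hard, soft) && intersects(hard, soft);
}

int main(void)
{
    uint64_t hard = 0x7e;                                      /* CPUs 1-6  */

    printf("(a) soft 2-3:  %d\n", soft_aff_effective(hard, 0xc));    /* 1 */
    printf("(b) soft 0-2:  %d\n", soft_aff_effective(hard, 0x7));    /* 1 */
    printf("(c) soft 7-12: %d\n", soft_aff_effective(hard, 0x1f80)); /* 0 */

    return 0;
}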

> Is that mainly
> to have a way to record preferences even when all preferred CPUs are
> offline, to be able to go back to the preferences once CPUs come back
> online?
> 
That's another example/use case, yes. We want to record the user's
preference, whatever the status of the system (and of other aspects of
the configuration) is.

But I'm not really sure I've answered... Have I?

> Then a follow-on question is: Why do you use cpumask_all for soft
> affinity in the first of the two calls above? Is this to cover for
> the
> case where all CPUs in dom0_cpus would go offline?
> 
Mmm... what else should I be using? If dom0_nodes is in "strict" mode,
we want to control hard affinity only. So we set soft to the default,
which is "all". During operations, since hard is a subset of "all",
soft-affinity will be just ignored.

So I'm using "all" because soft-affinity is just "all", unless someone
sets it differently.

But I am again not sure that I fully understood and properly addressed
your question. :-(


> > +    }
> >      else
> >          sched_set_affinity(unit, &cpumask_all, &cpumask_all);
> 
> Hmm, you leave this alone. Wouldn't it be better to further
> generalize
> things, in case domain affinity was set already? I was referring to
> the mask calculated by sched_select_initial_cpu() also in this
> regard.
> And when I did suggest to re-use the result, I did mean this
> literally.
> 
Technically, I think we can do that. Although, it's probably cumbersome
to do, without adding at least one cpumask on the stack, or reshuffling
the locking between sched_select_initial_cpu() and sched_init_vcpu(),
in a way that I (personally) don't find particularly pretty.

Also, I don't think we gain much from doing that, as we probably still
need to have some special casing of dom0, for handling dom0_vcpus_pin.

And again, soft and hard affinity should be set to what the user wants
and asks for. And if, for instance, he/she passes
dom0_nodes="1,strict", soft-affinity should just be all. If, e.g., we
set both hard and soft affinity to the CPUs of node 1, and if later
hard affinity is manually changed to "all", soft affinity will remain
to node 1, even if it was never asked for it to be that way, and the
user will need to change that explicitly as well. (Of course, it's not
particularly clever to boot with dom0_nodes="1,strict" and then change
dom0's vCPUs' hard affinity to node 0... but the user is free to do
that.)

> > --- a/xen/common/sched/credit2.c
> > +++ b/xen/common/sched/credit2.c
> > @@ -749,10 +749,12 @@ static int get_fallback_cpu(struct
> > csched2_unit *svc)
> >  
> >          /*
> >           * This is cases 2 or 4 (depending on bs): v->processor
> > isn't there
> > -         * any longer, check if we at least can stay in our
> > current runq.
> > +         * any longer, check if we at least can stay in our
> > current runq,
> > +        * if we have any (e.g., we don't yet, if we get here when
> > a unit
> > +        * is inserted for the very first time).
> >           */
> > -        if ( likely(cpumask_intersects(cpumask_scratch_cpu(cpu),
> > -                                       &svc->rqd->active)) )
> > +        if ( likely(svc->rqd &&
> > cpumask_intersects(cpumask_scratch_cpu(cpu),
> > +                                                   &svc->rqd-
> > >active)) )
> >          {
> >              cpumask_and(cpumask_scratch_cpu(cpu),
> > cpumask_scratch_cpu(cpu),
> >                          &svc->rqd->active);
> 
> This change is not covered by anything in the description, and I
> wonder why
> you're making the code adjustment. If svc->rqd was NULL, wouldn't Xen
> have
> crashed prior to the adjustment? I can't spot how it being NULL here
> could
> be the effect of any of the other changes you're making.
> 
It was crashing on me, with the above changes applied, but it's not any
longer. It's certainly something that, whether it's related or not,
should be addressed in its own patch, so I'll get rid of it and keep
looking.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Jan Beulich 2 years ago
On 29.04.2022 12:52, Dario Faggioli wrote:
> On Wed, 2022-04-13 at 12:00 +0200, Jan Beulich wrote:
>> I also have a more general question here: sched.h says "Bitmask of
>> CPUs
>> on which this VCPU may run" for hard affinity and "Bitmask of CPUs on
>> which this VCPU prefers to run" for soft affinity. Additionally
>> there's
>> soft_aff_effective. Does it make sense in the first place for one to
>> be
>> a proper subset of the of the other in _both_ directions? 
>>
> I'm not sure I'm 100% getting what you're asking. In particular, I'm
> not sure what you mean with "for one to be a propper subset of the
> other in both directions"?
> 
> Anyway, soft and hard affinity are under the complete control of the
> user (I guess we can say that they're policy), so we tend to accept
> pretty much everything that comes from the user.
> 
> That is, the user can set an hard affinity to 1-6 and a soft affinity
> of (a) 2-3, (b) 0-2, (c) 7-12, etc.
> 
> Case (a), i.e., soft is a strict subset of hard, is the one that makes
> the most sense, of course. With this configuration, the vCPU(s) can run
> on CPUs 1, 2, 3, 4, 5 and 6, but the scheduler will prefer to run it
> (them) on 2 and/or 3.
> 
> Case (b), i.e., no strict subset, but there's some overlap, also means
> that soft-affinity is going to be considered and have an effect. In
> fact, vCPU(s) will prefer to run on CPUs 1 and/or 2, but of course it
> (they) will never run on CPU 0. Of course, the user can, at a later
> point in time, change the hard affinity so that it includes CPU 0, and
> we'll be back to the strict-subset case. So that's way we want to keep
> 0 in the mast, even if it causes soft to not be a strict subset of
> hard.
> 
> In case (c), soft affinity is totally useless. However, again, the user
> can later change hard to include some or all CPUs 7-12, so we keep it.
> We do, however, print a warning. And we also use the soft_aff_effective
> flag to avoid going through the soft-affinity balancing step in the
> scheduler code. This is, in fact, why we also check whether hard is not
> a strict subset of soft. As, if it is, there's no need to do anything
> about soft, as honoring hard will automatically take care of that as
> well.
> 
>> Is that mainly
>> to have a way to record preferences even when all preferred CPUs are
>> offline, to be able to go back to the preferences once CPUs come back
>> online?
>>
> That's another example/use case, yes. We want to record the user's
> preference, whatever the status of the system (and of other aspects of
> the configuration) is.
> 
> But I'm not really sure I've answered... Have I?

You did. My question really only was whether there are useful scenarios
for proper-subset cases in both possible directions.

>> Then a follow-on question is: Why do you use cpumask_all for soft
>> affinity in the first of the two calls above? Is this to cover for
>> the
>> case where all CPUs in dom0_cpus would go offline?
>>
> Mmm... what else should I be using?

I was thinking of dom0_cpus.

> If dom0_nodes is in "strict" mode,
> we want to control hard affinity only. So we set soft to the default,
> which is "all". During operations, since hard is a subset of "all",
> soft-affinity will be just ignored.

Right - until such point that all (original) Dom0 CPUs have gone
offline. Hence my 2nd question.

> So I'm using "all" because soft-affinity is just "all", unless someone
> sets it differently.

How would "someone set it differently"? Aiui you can't control both
affinities at the same time.

> But I am again not sure that I fully understood and properly addressed
> your question. :-(
> 
> 
>>> +    }
>>>      else
>>>          sched_set_affinity(unit, &cpumask_all, &cpumask_all);
>>
>> Hmm, you leave this alone. Wouldn't it be better to further
>> generalize
>> things, in case domain affinity was set already? I was referring to
>> the mask calculated by sched_select_initial_cpu() also in this
>> regard.
>> And when I did suggest to re-use the result, I did mean this
>> literally.
>>
> Technically, I think we can do that. Although, it's probably cumbersome
> to do, without adding at least one cpumask on the stack, or reshuffle
> the locking between sched_select_initial_cpu() and sched_init_vcpu(),
> in a way that I (personally) don't find particularly pretty.

Locking? sched_select_initial_cpu() calculates into a per-CPU variable,
which I sincerely hope cannot be corrupted by another CPU.

> Also, I don't think we gain much from doing that, as we probably still
> need to have some special casing of dom0, for handling dom0_vcpus_pin.

dom0_vcpus_pin is likely always going to require special casing, until
such point where we drop support for it.

> And again, soft and hard affinity should be set to what the user wants
> and asks for. And if, for instance, he/she passes
> dom0_nodes="1,strict", soft-affinity should just be all. If, e.g., we
> set both hard and soft affinity to the CPUs of node 1, and if later
> hard affinity is manually changed to "all", soft affinity will remain
> to node 1, even if it was never asked for it to be that way, and the
> user will need to change that explicitly as well. (Of course, it's not
> particularly clever to boot with dom0_nodes="1,strict" and then change
> dom0's vCPUs' hard affinity to node 0... but the user is free to do
> that.)

I can certainly accept this as justification for using "all" further up.

Jan
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Fri, 2022-04-29 at 14:16 +0200, Jan Beulich wrote:
> On 29.04.2022 12:52, Dario Faggioli wrote:
> > > Is that mainly
> > > to have a way to record preferences even when all preferred CPUs
> > > are
> > > offline, to be able to go back to the preferences once CPUs come
> > > back
> > > online?
> > > 
> > That's another example/use case, yes. We want to record the user's
> > preference, whatever the status of the system (and of other aspects
> > of
> > the configuration) is.
> > 
> > But I'm not really sure I've answered... Have I?
> 
> You did. 
>
Ok, great! :-)

> > 
> > If dom0_nodes is in "strict" mode,
> > we want to control hard affinity only. So we set soft to the
> > default,
> > which is "all". During operations, since hard is a subset of "all",
> > soft-affinity will be just ignored.
> 
> Right - until such point that all (original) Dom0 CPUs have gone
> offline. Hence my 2nd question.
> 
> > So I'm using "all" because soft-affinity is just "all", unless
> > someone
> > sets it differently.
> 
> How would "someone set it differently"? Aiui you can't control both
> affinities at the same time.
> 
Yeah, the argument here is basically the one that I put below, and that
you say you understand. I guess I could have put it a bit more upfront,
sorry about that. :-)

> > > 
> > > Hmm, you leave this alone. Wouldn't it be better to further
> > > generalize
> > > things, in case domain affinity was set already? I was referring
> > > to
> > > the mask calculated by sched_select_initial_cpu() also in this
> > > regard.
> > > And when I did suggest to re-use the result, I did mean this
> > > literally.
> > > 
> > Technically, I think we can do that. Although, it's probably
> > cumbersome
> > to do, without adding at least one cpumask on the stack, or
> > reshuffle
> > the locking between sched_select_initial_cpu() and
> > sched_init_vcpu(),
> > in a way that I (personally) don't find particularly pretty.
> 
> Locking? sched_select_initial_cpu() calculates into a per-CPU
> variable,
> which I sincerely hope cannot be corrupted by another CPU.
> 
No, not by another CPU, hopefully.

And this is probably fine, during boot, when there should be no (other)
scheduling activity. However, during normal operation, a vCPU being
scheduled on CPU X, or in general having X in v->processor, could be
using the scratch cpumask of X already. So, if we use it without
locking, we'd risk using the wrong mask.

Therefore, we require the scheduler lock to be held, for playing with
the scratch cpumasks:

/*
 * Scratch space, for avoiding having too many cpumask_t on the stack.
 * Within each scheduler, when using the scratch mask of one pCPU:
 * - the pCPU must belong to the scheduler,
 * - the caller must own the per-pCPU scheduler lock (a.k.a. runqueue
 *   lock).
 */
DECLARE_PER_CPU(cpumask_t, cpumask_scratch);
#define cpumask_scratch        (&this_cpu(cpumask_scratch))
#define cpumask_scratch_cpu(c) (&per_cpu(cpumask_scratch, c))

And sched_init_vcpu() (and hence sched_select_initial_cpu()) can be
called during normal operation.

In fact, sched_select_initial_cpu() does pcpu_schedule_lock_irqsave()
before starting to use it.
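
A minimal sketch of that pattern follows (illustration only, not code copied
from Xen; pick_one_cpu() is a made-up helper, and the exact lock/unlock
signatures are assumptions based on the names mentioned above):

static unsigned int pick_one_cpu(unsigned int cpu, const cpumask_t *hard,
                                 const cpumask_t *pool)
{
    unsigned long flags;
    spinlock_t *lock = pcpu_schedule_lock_irqsave(cpu, &flags);
    cpumask_t *scratch = cpumask_scratch_cpu(cpu);
    unsigned int ret;

    /* The scratch mask is only touched while cpu's scheduler lock is held. */
    cpumask_and(scratch, hard, pool);
    ret = cpumask_first(scratch);

    pcpu_schedule_unlock_irqrestore(lock, flags, cpu);

    return ret;
}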

> 
> > And again, soft and hard affinity should be set to what the user
> > wants
> > and asks for. And if, for instance, he/she passes
> > dom0_nodes="1,strict", soft-affinity should just be all. If, e.g.,
> > we
> > set both hard and soft affinity to the CPUs of node 1, and if later
> > hard affinity is manually changed to "all", soft affinity will
> > remain
> > to node 1, even if it was never asked for it to be that way, and
> > the
> > user will need to change that explicitly as well. (Of course, it's
> > not
> > particularly clever to boot with dom0_nodes="1,strict" and then
> > change
> > dom0's vCPUs' hard affinity to node 0... but the user is free to do
> > that.)
> 
> I can certainly accept this as justification for using "all" further
> up.
> 
Good then.

Do you think some of this exchange we had should end somewhere
(comments? changelogs?), to make it clearer to both future us and new
contributors why things are done this way?

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Jan Beulich 1 year, 9 months ago
On 29.04.2022 16:09, Dario Faggioli wrote:
> On Fri, 2022-04-29 at 14:16 +0200, Jan Beulich wrote:
>> On 29.04.2022 12:52, Dario Faggioli wrote:
>>>> Is that mainly
>>>> to have a way to record preferences even when all preferred CPUs
>>>> are
>>>> offline, to be able to go back to the preferences once CPUs come
>>>> back
>>>> online?
>>>>
>>> That's another example/use case, yes. We want to record the user's
>>> preference, whatever the status of the system (and of other aspects
>>> of
>>> the configuration) is.
>>>
>>> But I'm not really sure I've answered... Have I?
>>
>> You did. 
>>
> Ok, great! :-)
> 
>>>
>>> If dom0_nodes is in "strict" mode,
>>> we want to control hard affinity only. So we set soft to the
>>> default,
>>> which is "all". During operations, since hard is a subset of "all",
>>> soft-affinity will be just ignored.
>>
>> Right - until such point that all (original) Dom0 CPUs have gone
>> offline. Hence my 2nd question.
>>
>>> So I'm using "all" because soft-affinity is just "all", unless
>>> someone
>>> sets it differently.
>>
>> How would "someone set it differently"? Aiui you can't control both
>> affinities at the same time.
>>
> Yeah, the argument here is basically the one that I put below, and that
> you say you understand. I guess I could have put it a bit more upfront,
> sorry about that. :-)
> 
>>>>
>>>> Hmm, you leave this alone. Wouldn't it be better to further
>>>> generalize
>>>> things, in case domain affinity was set already? I was referring
>>>> to
>>>> the mask calculated by sched_select_initial_cpu() also in this
>>>> regard.
>>>> And when I did suggest to re-use the result, I did mean this
>>>> literally.
>>>>
>>> Technically, I think we can do that. Although, it's probably
>>> cumbersome
>>> to do, without adding at least one cpumask on the stack, or
>>> reshuffle
>>> the locking between sched_select_initial_cpu() and
>>> sched_init_vcpu(),
>>> in a way that I (personally) don't find particularly pretty.
>>
>> Locking? sched_select_initial_cpu() calculates into a per-CPU
>> variable,
>> which I sincerely hope cannot be corrupted by another CPU.
>>
> No, not by another CPU, hopefully.
> 
> And this is probably fine, during boot, when there should be no (other)
> scheduling activity. However, during normal operation, a vCPU being
> scheduled on CPU X, or in general having X in v->processor, could be
> using the scratch cpumask of X already. So, if we use it without
> locking, we'd risk using the wrong mask.
> 
> Therefore, we require the scheduler lock to be held, for playing with
> the scratch cpumasks:
> 
> /*
>  * Scratch space, for avoiding having too many cpumask_t on the stack.
>  * Within each scheduler, when using the scratch mask of one pCPU:
>  * - the pCPU must belong to the scheduler,
>  * - the caller must own the per-pCPU scheduler lock (a.k.a. runqueue
>  *   lock).
>  */
> DECLARE_PER_CPU(cpumask_t, cpumask_scratch);
> #define cpumask_scratch        (&this_cpu(cpumask_scratch))
> #define cpumask_scratch_cpu(c) (&per_cpu(cpumask_scratch, c))
> 
> And sched_init_vcpu() (and hence sched_select_initial_cpu()) can be
> called during normal operation.
> 
> In fact, sched_select_initial_cpu() does pcpu_schedule_lock_irqsave()
> before starting using it.
> 
>>
>>> And again, soft and hard affinity should be set to what the user
>>> wants
>>> and asks for. And if, for instance, he/she passes
>>> dom0_nodes="1,strict", soft-affinity should just be all. If, e.g.,
>>> we
>>> set both hard and soft affinity to the CPUs of node 1, and if later
>>> hard affinity is manually changed to "all", soft affinity will
>>> remain
>>> to node 1, even if it was never asked for it to be that way, and
>>> the
>>> user will need to change that explicitly as well. (Of course, it's
>>> not
>>> particularly clever to boot with dom0_nodes="1,strict" and then
>>> change
>>> dom0's vCPUs' hard affinity to node 0... but the user is free to do
>>> that.)
>>
>> I can certainly accept this as justification for using "all" further
>> up.
>>
> Good then.

I notice this issue is still open. May I ask what the plans are here?

Jan

Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 1 year, 9 months ago
On Wed, 2022-07-27 at 08:19 +0200, Jan Beulich wrote:
> > I notice this issue is still open. May I ask what the plans are
> > here?
> > 
Yes, you're right.

I have the patch that I sent in this thread and that we discussed a bit
in a local branch. Let me see if I can rebase it, re-test it and send
it before disappearing for a couple of weeks for vacations...

Sorry it's taking so long.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Jan Beulich 2 years ago
On 29.04.2022 16:09, Dario Faggioli wrote:
> Do you think some of this exchange we had should end somewhere
> (comments? changelogs?), to make it clearer to both future us and new
> contributors why things are done this way?

It might be helpful, but beyond parts immediately relevant for the
patch here (where it can go in the commit message) I'm not sure
where to best put it. Of course if you'd be willing to make a patch
just to add/extend a (few) comment(s) ...

Jan
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Tue, 2022-04-12 at 16:11 +0000, Dario Faggioli wrote:
> +    else if ( is_hardware_domain(d) )
> +    {
> +        /*
> +         * In absence of dom0_vcpus_pin, the hard and soft affinity
> of
> +         * domain-0 is controlled by the dom0_nodes parameter. At
> this point
> +         * it has been parsed and decoded, and we have the result of
> that
> +         * in the dom0_cpus mask.
> +         */
> +        if ( !dom0_affinity_relaxed )
> +            sched_set_affinity(unit, &dom0_cpus, &cpumask_all);
> +        else
> +            sched_set_affinity(unit, &cpumask_all, &dom0_cpus);
> +    }
>
Oh, one thing: AFAICT, it's fine to use, e.g., dom0_affinity_relaxed
(which is __initdata) here.

But I'm never quite sure whenever I touch stuff like that... Is it
actually fine?

If not, I think we can either drop the annotation or add non-__initdata
variables.
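
[For reference, a minimal sketch of the pattern being asked about, using
made-up names (relaxed_example, node_cpus_example, affinity_example); only
sched_set_affinity(), cpumask_all and the annotations are real. The point
is an __initdata object that is only ever read from __init code:]

    static bool __initdata relaxed_example;
    static cpumask_t __initdata node_cpus_example;

    static void __init affinity_example(struct sched_unit *unit)
    {
        /*
         * Reading an __initdata object from an __init function is fine:
         * both the code and the data live in the init sections and are
         * discarded together once boot completes.  Trouble would only
         * arise if a non-__init code path kept referencing the variable
         * afterwards.
         */
        if ( relaxed_example )
            sched_set_affinity(unit, &cpumask_all, &node_cpus_example);
        else
            sched_set_affinity(unit, &node_cpus_example, &cpumask_all);
    }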

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Dario Faggioli 2 years ago
On Tue, 2022-04-12 at 16:11 +0000, Dario Faggioli wrote:
> On Tue, 2022-04-12 at 15:48 +0000, Dario Faggioli wrote:
> > On Fri, 2022-04-08 at 14:36 +0200, Jan Beulich wrote:
> > 
> > 
> > And while doing that, I think we should consolidate touching the
> > affinity only there, avoiding altering it twice. After all, we
> > already know what it should look like, so let's go for it.
> > 
> > I'll send a patch to that effect, to show what I mean with this. 
> > 
> Here it is.
> 
And here's Jan's patch, ported on top of it.

As for the one before, let me know what you think.
---
From: Dario Faggioli <dfaggioli@suse.com>
Subject: [PATCH 2/2] (Kind of) rebase of "x86: make "dom0_nodes=" work with credit2"

i.e., Jan's patch, on top of the commit that unifies the affinity
handling for dom0 vCPUs.

Although no longer technically necessary for fixing the issue at hand,
I think it still makes sense to have it in the code.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
* The changelog is so much RFC that it's not even a changelog... Jan, if
  we go ahead with this approach, let me know how you prefer to handle
  the authorship, the S-o-b, etc., of this patch.

  I believe it should be From: you, with my S-o-b added after yours, but
  I'm fine being the author, if you don't want to.
---
 xen/common/sched/core.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index dc2ed890e0..e11acd7b88 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -3416,6 +3416,7 @@ void wait(void)
 void __init sched_setup_dom0_vcpus(struct domain *d)
 {
     unsigned int i;
+    struct sched_unit *unit;
 
     for ( i = 1; i < d->max_vcpus; i++ )
         vcpu_create(d, i);
@@ -3423,11 +3424,20 @@ void __init sched_setup_dom0_vcpus(struct domain *d)
     /*
      * sched_vcpu_init(), called by vcpu_create(), will setup the hard and
      * soft affinity of all the vCPUs, by calling sched_set_affinity() on each
-     * one of them. We can now make sure that the domain's node affinity is
-     * also updated accordingly, and we can do that here, once and for all
-     * (which is more efficient than calling domain_update_node_affinity()
-     * on all the vCPUs).
+     * one of them. What's remaining for us to do here is:
+     * - make sure that the vCPUs are actually migrated to suitable CPUs
+     * - update the domain's node affinity (and we can do that here, once and
+     *   for all, as it's more efficient than calling domain_update_node_affinity()
+     *   on all the vCPUs).
      */
+    for_each_sched_unit ( d, unit )
+    {
+        spinlock_t *lock = unit_schedule_lock_irq(unit);
+        sched_unit_migrate_start(unit);
+        unit_schedule_unlock_irq(lock, unit);
+        sched_unit_migrate_finish(unit);
+    }
+
     domain_update_node_affinity(d);
 }
 #endif
-- 
2.35.1
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH] x86: make "dom0_nodes=" work with credit2
Posted by Jan Beulich 2 years ago
On 12.04.2022 18:14, Dario Faggioli wrote:
> On Tue, 2022-04-12 at 16:11 +0000, Dario Faggioli wrote:
>> On Tue, 2022-04-12 at 15:48 +0000, Dario Faggioli wrote:
>>> On Fri, 2022-04-08 at 14:36 +0200, Jan Beulich wrote:
>>>
>>>
>>> And while doing that, I think we should consolidate touching the
>>> affinity only there, avoiding altering it twice. After all, we
>>> already know what it should look like, so let's go for it.
>>>
>>> I'll send a patch to that effect, to show what I mean with this. 
>>>
>> Here it is.
>>
> And here's Jan's patch, ported on top of it.
> 
> As for the one before, let me know what you think.
> ---
> From: Dario Faggioli <dfaggioli@suse.com>
> Subject: [PATCH 2/2] (Kind of) rebase of "x86: make "dom0_nodes=" work with credit2"
> 
> i.e., Jan's patch, on top of the commit that unifies the affinity
> handling for dom0 vCPUs.
> 
> Although no longer technically necessary for fixing the issue at hand,
> I think it still makes sense to have it in the code.

I don't think so. My suggestion in this regard was made only because,
with my patch, adjustments to affinity remained in
sched_setup_dom0_vcpus(). With them all gone, I don't see why vCPU-s
might still need migrating. In fact, I would consider making such an
"adjustment" here to be papering over issues elsewhere.

As a result, if you think it's still wanted, ...

> Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
> ---
> * The changelog is so much RFC that it's not even a changelog... Jan, if
>   we go ahead with this approach, let me know how you prefer to handle
>   the authorship, the S-o-b, etc., of this patch.
> 
>   I believe it should be From: you, with my S-o-b added after yours, but
>   I'm fine being the author, if you don't want to.

... I think you should be considered the author. It would make sense
for me to be the author only if affinity adjustments remained in the
function after your earlier patch.

Jan

> ---
>  xen/common/sched/core.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index dc2ed890e0..e11acd7b88 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -3416,6 +3416,7 @@ void wait(void)
>  void __init sched_setup_dom0_vcpus(struct domain *d)
>  {
>      unsigned int i;
> +    struct sched_unit *unit;
>  
>      for ( i = 1; i < d->max_vcpus; i++ )
>          vcpu_create(d, i);
> @@ -3423,11 +3424,20 @@ void __init sched_setup_dom0_vcpus(struct domain *d)
>      /*
>       * sched_vcpu_init(), called by vcpu_create(), will setup the hard and
>       * soft affinity of all the vCPUs, by calling sched_set_affinity() on each
> -     * one of them. We can now make sure that the domain's node affinity is
> -     * also updated accordingly, and we can do that here, once and for all
> -     * (which is more efficient than calling domain_update_node_affinity()
> -     * on all the vCPUs).
> +     * one of them. What's remaining for us to do here is:
> +     * - make sure that the vCPUs are actually migrated to suitable CPUs
> +     * - update the domain's node affinity (and we can do that here, once and
> +     *   for all, as it's more efficient than calling domain_update_node_affinity()
> +     *   on all the vCPUs).
>       */
> +    for_each_sched_unit ( d, unit )
> +    {
> +        spinlock_t *lock = unit_schedule_lock_irq(unit);
> +        sched_unit_migrate_start(unit);
> +        unit_schedule_unlock_irq(lock, unit);
> +        sched_unit_migrate_finish(unit);
> +    }
> +
>      domain_update_node_affinity(d);
>  }
>  #endif