[PATCH v2] accel/tcg/tcg-accel-ops-rr: ensure fairness with icount

Jamie Iles posted 1 patch 1 year ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20230424112907.26832-1-quic._5Fjiles@quicinc.com
Maintainers: Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>
There is a newer version of this series
[PATCH v2] accel/tcg/tcg-accel-ops-rr: ensure fairness with icount
Posted by Jamie Iles 1 year ago
The round-robin scheduler will iterate over the CPU list with an
assigned budget until the next timer expiry and may exit early because
of a TB exit.  This is fine under normal operation, but with icount
enabled and SMP it is possible for a CPU to be starved of run time and
for the system to livelock.

For example, booting a riscv64 platform with '-icount
shift=0,align=off,sleep=on -smp 2' we observe a livelock once the kernel
has timers enabled and starts performing TLB shootdowns.  In this case
CPU 0 is in M-mode with interrupts disabled, sending an IPI to CPU 1.
As we enter the TCG loop, we assign the entire icount budget up to the
next timer interrupt to CPU 0 and begin executing.  The guest sits in a
busy loop, exhausting all of the budget before we try to execute CPU 1,
the target of the IPI, which is left with no budget with which to
execute, and the process repeats.

We address this by splitting the budget evenly across all of the CPUs
on the thread before entering each one.  The CPU count is cached
against the CPU list generation ID to avoid iterating the list on each
loop iteration.  With this change it is possible to boot an SMP rv64
guest with icount enabled and no hangs.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Jamie Iles <quic_jiles@quicinc.com>
---

Changes in v2:
 - Rename icount_cpu_timeslice to icount_percpu_budget
 - Add a clarifying comment about caching to rr_cpu_count()

 accel/tcg/tcg-accel-ops-icount.c | 17 ++++++++++++++--
 accel/tcg/tcg-accel-ops-icount.h |  3 ++-
 accel/tcg/tcg-accel-ops-rr.c     | 34 +++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/accel/tcg/tcg-accel-ops-icount.c b/accel/tcg/tcg-accel-ops-icount.c
index 84cc7421be88..e1e8afaf2f99 100644
--- a/accel/tcg/tcg-accel-ops-icount.c
+++ b/accel/tcg/tcg-accel-ops-icount.c
@@ -89,7 +89,20 @@ void icount_handle_deadline(void)
     }
 }
 
-void icount_prepare_for_run(CPUState *cpu)
+/* Distribute the budget evenly across all CPUs */
+int64_t icount_percpu_budget(int cpu_count)
+{
+    int64_t limit = icount_get_limit();
+    int64_t timeslice = limit / cpu_count;
+
+    if (timeslice == 0) {
+        timeslice = limit;
+    }
+
+    return timeslice;
+}
+
+void icount_prepare_for_run(CPUState *cpu, int64_t cpu_budget)
 {
     int insns_left;
 
@@ -101,7 +114,7 @@ void icount_prepare_for_run(CPUState *cpu)
     g_assert(cpu_neg(cpu)->icount_decr.u16.low == 0);
     g_assert(cpu->icount_extra == 0);
 
-    cpu->icount_budget = icount_get_limit();
+    cpu->icount_budget = MIN(icount_get_limit(), cpu_budget);
     insns_left = MIN(0xffff, cpu->icount_budget);
     cpu_neg(cpu)->icount_decr.u16.low = insns_left;
     cpu->icount_extra = cpu->icount_budget - insns_left;
diff --git a/accel/tcg/tcg-accel-ops-icount.h b/accel/tcg/tcg-accel-ops-icount.h
index 1b6fd9c60751..16a301b6dc0b 100644
--- a/accel/tcg/tcg-accel-ops-icount.h
+++ b/accel/tcg/tcg-accel-ops-icount.h
@@ -11,7 +11,8 @@
 #define TCG_ACCEL_OPS_ICOUNT_H
 
 void icount_handle_deadline(void);
-void icount_prepare_for_run(CPUState *cpu);
+void icount_prepare_for_run(CPUState *cpu, int64_t cpu_budget);
+int64_t icount_percpu_budget(int cpu_count);
 void icount_process_data(CPUState *cpu);
 
 void icount_handle_interrupt(CPUState *cpu, int mask);
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
index 290833a37fb2..7114210173df 100644
--- a/accel/tcg/tcg-accel-ops-rr.c
+++ b/accel/tcg/tcg-accel-ops-rr.c
@@ -139,6 +139,33 @@ static void rr_force_rcu(Notifier *notify, void *data)
     rr_kick_next_cpu();
 }
 
+/*
+ * Calculate the number of CPUs that we will process in a single iteration of
+ * the main CPU thread loop so that we can fairly distribute the instruction
+ * count across CPUs.
+ *
+ * The CPU count is cached based on the CPU list generation ID to avoid
+ * iterating the list every time.
+ */
+static int rr_cpu_count(void)
+{
+    static unsigned int last_gen_id = ~0;
+    static int cpu_count;
+    CPUState *cpu;
+
+    cpu_list_lock();
+    if (cpu_list_generation_id_get() != last_gen_id) {
+        cpu_count = 0;
+        CPU_FOREACH(cpu) {
+            ++cpu_count;
+        }
+        last_gen_id = cpu_list_generation_id_get();
+    }
+    cpu_list_unlock();
+
+    return cpu_count;
+}
+
 /*
  * In the single-threaded case each vCPU is simulated in turn. If
  * there is more than a single vCPU we create a simple timer to kick
@@ -185,6 +212,9 @@ static void *rr_cpu_thread_fn(void *arg)
     cpu->exit_request = 1;
 
     while (1) {
+        int cpu_count = rr_cpu_count();
+        int64_t cpu_budget = INT64_MAX;
+
         qemu_mutex_unlock_iothread();
         replay_mutex_lock();
         qemu_mutex_lock_iothread();
@@ -197,6 +227,8 @@ static void *rr_cpu_thread_fn(void *arg)
              * waking up the I/O thread and waiting for completion.
              */
             icount_handle_deadline();
+
+            cpu_budget = icount_percpu_budget(cpu_count);
         }
 
         replay_mutex_unlock();
@@ -218,7 +250,7 @@ static void *rr_cpu_thread_fn(void *arg)
 
                 qemu_mutex_unlock_iothread();
                 if (icount_enabled()) {
-                    icount_prepare_for_run(cpu);
+                    icount_prepare_for_run(cpu, cpu_budget);
                 }
                 r = tcg_cpus_exec(cpu);
                 if (icount_enabled()) {
-- 
2.25.1
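For reference, the budget-splitting arithmetic in icount_percpu_budget() can be exercised standalone. This is a minimal sketch, not the QEMU code itself: icount_get_limit() is replaced here by an explicit limit parameter so the function is self-contained.

```c
#include <stdint.h>

/* Sketch of the per-CPU budget split: divide the instruction budget
 * evenly across CPUs, but fall back to the full limit when the
 * per-CPU share floors to zero (i.e. more CPUs than remaining
 * instructions in the budget). */
static int64_t percpu_budget(int64_t limit, int cpu_count)
{
    int64_t timeslice = limit / cpu_count;

    if (timeslice == 0) {
        timeslice = limit;
    }
    return timeslice;
}
```

With a 1000-instruction limit and 2 CPUs, each CPU receives 500 instructions; with a limit of 1 and 2 CPUs, the division floors to 0, so the whole limit is handed out rather than a zero budget that would stall every CPU.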


Re: [PATCH v2] accel/tcg/tcg-accel-ops-rr: ensure fairness with icount
Posted by Richard Henderson 1 year ago
On 4/24/23 12:29, Jamie Iles wrote:
> +/*
> + * Calculate the number of CPUs that we will process in a single iteration of
> + * the main CPU thread loop so that we can fairly distribute the instruction
> + * count across CPUs.
> + *
> + * The CPU count is cached based on the CPU list generation ID to avoid
> + * iterating the list every time.
> + */
> +static int rr_cpu_count(void)
> +{
> +    static unsigned int last_gen_id = ~0;
> +    static int cpu_count;
> +    CPUState *cpu;
> +
> +    cpu_list_lock();
> +    if (cpu_list_generation_id_get() != last_gen_id) {
> +        cpu_count = 0;
> +        CPU_FOREACH(cpu) {
> +            ++cpu_count;
> +        }
> +        last_gen_id = cpu_list_generation_id_get();
> +    }
> +    cpu_list_unlock();
> +
> +    return cpu_count;
> +}

The read of cpu_count should be in the lock.

(Ideally we'd expose QEMU_LOCK_GUARD(&qemu_cpu_list_lock) which would make the read+return 
trivial.)


>   /*
>    * In the single-threaded case each vCPU is simulated in turn. If
>    * there is more than a single vCPU we create a simple timer to kick
> @@ -185,6 +212,9 @@ static void *rr_cpu_thread_fn(void *arg)
>       cpu->exit_request = 1;
>   
>       while (1) {
> +        int cpu_count = rr_cpu_count();
> +        int64_t cpu_budget = INT64_MAX;
> +
>           qemu_mutex_unlock_iothread();
>           replay_mutex_lock();
>           qemu_mutex_lock_iothread();
> @@ -197,6 +227,8 @@ static void *rr_cpu_thread_fn(void *arg)
>                * waking up the I/O thread and waiting for completion.
>                */
>               icount_handle_deadline();
> +
> +            cpu_budget = icount_percpu_budget(cpu_count);

I think you can move the call to rr_cpu_count() here, so that it only happens if icount is 
in use.

Why cpu_budget = INT64_MAX as opposed to 0 or 0xdeadbeef?  It's not set or used except for 
icount_enabled.


r~
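QEMU's QEMU_LOCK_GUARD is built on the compiler cleanup attribute, releasing the lock automatically at scope exit. A self-contained sketch of the same scope-exit pattern, addressing the point above that the read of the cached count must happen under the lock (this uses a plain pthread mutex and hypothetical names, not qemu_cpu_list_lock or QEMU's actual macro):

```c
#include <pthread.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static int cpu_count_cache;

static void unlock_cleanup(pthread_mutex_t **m)
{
    pthread_mutex_unlock(*m);
}

/* Scope-based guard: the mutex is unlocked automatically when the
 * guard variable goes out of scope, so the read *and* the return of
 * the cached value both sit inside the critical section. */
#define LOCK_GUARD(m) \
    pthread_mutex_t *guard_ __attribute__((cleanup(unlock_cleanup))) = \
        (pthread_mutex_lock(m), (m))

static int read_cpu_count(void)
{
    LOCK_GUARD(&list_lock);
    return cpu_count_cache;   /* read under the lock; guard releases it */
}
```

With a guard like this, the read+return really is trivial: there is no way to return with the lock held or to read the cache after dropping it.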
Re: [PATCH v2] accel/tcg/tcg-accel-ops-rr: ensure fairness with icount
Posted by Jamie Iles 1 year ago
Hi Richard,

On Mon, Apr 24, 2023 at 02:06:18PM +0100, Richard Henderson wrote:
> 
> 
> On 4/24/23 12:29, Jamie Iles wrote:
> > +/*
> > + * Calculate the number of CPUs that we will process in a single iteration of
> > + * the main CPU thread loop so that we can fairly distribute the instruction
> > + * count across CPUs.
> > + *
> > + * The CPU count is cached based on the CPU list generation ID to avoid
> > + * iterating the list every time.
> > + */
> > +static int rr_cpu_count(void)
> > +{
> > +    static unsigned int last_gen_id = ~0;
> > +    static int cpu_count;
> > +    CPUState *cpu;
> > +
> > +    cpu_list_lock();
> > +    if (cpu_list_generation_id_get() != last_gen_id) {
> > +        cpu_count = 0;
> > +        CPU_FOREACH(cpu) {
> > +            ++cpu_count;
> > +        }
> > +        last_gen_id = cpu_list_generation_id_get();
> > +    }
> > +    cpu_list_unlock();
> > +
> > +    return cpu_count;
> > +}
> 
> The read of cpu_count should be in the lock.
> 
> (Ideally we'd expose QEMU_LOCK_GUARD(&qemu_cpu_list_lock) which would make the read+return
> trivial.)

Sure, can do that, I can add that as a first patch and update other 
users.

> 
> >   /*
> >    * In the single-threaded case each vCPU is simulated in turn. If
> >    * there is more than a single vCPU we create a simple timer to kick
> > @@ -185,6 +212,9 @@ static void *rr_cpu_thread_fn(void *arg)
> >       cpu->exit_request = 1;
> > 
> >       while (1) {
> > +        int cpu_count = rr_cpu_count();
> > +        int64_t cpu_budget = INT64_MAX;
> > +
> >           qemu_mutex_unlock_iothread();
> >           replay_mutex_lock();
> >           qemu_mutex_lock_iothread();
> > @@ -197,6 +227,8 @@ static void *rr_cpu_thread_fn(void *arg)
> >                * waking up the I/O thread and waiting for completion.
> >                */
> >               icount_handle_deadline();
> > +
> > +            cpu_budget = icount_percpu_budget(cpu_count);
> 
> I think you can move the call to rr_cpu_count() here, so that it only happens if icount is
> in use.
> 
> Why cpu_budget = INT64_MAX as opposed to 0 or 0xdeadbeef?  It's not set or used except for
> icount_enabled.

That's left over from an earlier version, I can leave it uninitialized.

Thanks,

Jamie
Re: [PATCH v2] accel/tcg/tcg-accel-ops-rr: ensure fairness with icount
Posted by Richard Henderson 1 year ago
On 4/24/23 14:21, Jamie Iles wrote:
>> Why cpu_budget = INT64_MAX as opposed to 0 or 0xdeadbeef?  It's not set or used except for
>> icount_enabled.
> 
> That's left over from an earlier version, I can leave it uninitialized.

No, you can't, as you'll get a compiler warning, because the compiler won't see that the 
conditional init matches the conditional use.

A zero init value would only need comment for the conditional init/use.  A non-zero init 
value without comment is magic and bad for maintenance.


r~
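The conditional-init/conditional-use shape being discussed can be reduced to a small sketch (hypothetical names; the real variables live in rr_cpu_thread_fn). The assignment and the use are guarded by the same flag, so the code is correct at runtime, but a compiler that cannot prove the two conditions match will warn if the initializer is dropped:

```c
#include <stdint.h>

/* cpu_budget is only assigned when the flag is set and only consumed
 * when the same flag is set.  Without the explicit initializer this
 * typically triggers -Wmaybe-uninitialized, since the compiler does
 * not track that the conditions are identical. */
static int64_t run_slice(int icount_on, int64_t limit, int cpu_count)
{
    int64_t cpu_budget = 0;   /* zero init: unused unless icount_on */

    if (icount_on) {
        cpu_budget = limit / cpu_count;   /* conditional init */
    }
    return icount_on ? cpu_budget : -1;   /* conditional use */
}
```

This is the case made above: a zero initializer plus a one-line comment documents the conditional init/use, whereas a magic non-zero value would need the same comment anyway and adds nothing.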