[RFC v2 PATCH] record-replay: support SMP target machine

Nicholas Piggin posted 1 patch 9 months ago
Failed in applying to current master (apply log)
accel/tcg/tcg-accel-ops-icount.c |  9 +++-
accel/tcg/tcg-accel-ops-rr.c     | 73 +++++++++++++++++++++++++++++---
include/exec/replay-core.h       |  3 ++
replay/replay-internal.h         |  1 +
replay/replay.c                  | 34 ++++++++++++++-
scripts/replay-dump.py           |  5 +++
softmmu/vl.c                     |  4 --
7 files changed, 115 insertions(+), 14 deletions(-)
[RFC v2 PATCH] record-replay: support SMP target machine
Posted by Nicholas Piggin 9 months ago
RR CPU switching is driven by timers and events so it is deterministic
like everything else. Record a CPU switch event and use that to drive
the CPU switch on replay.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
This is still in RFC phase because so far I've only really testd ppc
pseries, and only with patches that are not yet upstream (but posted
to list).

It works with smp 2, can step, reverse-step, reverse-continue, etc.
throughout a Linux boot.

One issue is reverse-step on one gdb thread (vCPU) only steps back one
icount, so if another thread is currently running then it is that one
which goes back one instruction and the selected thread doesn't move. I
would call this a separate issue from the record-replay mechanism, which
is in the replay-debugging policy. I think we could record in each vCPU
an icount of the last instruction it executed before switching, then
reverse step for that vCPU could replay to there. I think that's not so
important yet until this mechanism is solid. But if you test and rsi is
not going backwards, then check your other threads.

Thanks,
Nick


 accel/tcg/tcg-accel-ops-icount.c |  9 +++-
 accel/tcg/tcg-accel-ops-rr.c     | 73 +++++++++++++++++++++++++++++---
 include/exec/replay-core.h       |  3 ++
 replay/replay-internal.h         |  1 +
 replay/replay.c                  | 34 ++++++++++++++-
 scripts/replay-dump.py           |  5 +++
 softmmu/vl.c                     |  4 --
 7 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/accel/tcg/tcg-accel-ops-icount.c b/accel/tcg/tcg-accel-ops-icount.c
index 3d2cfbbc97..c26782a56a 100644
--- a/accel/tcg/tcg-accel-ops-icount.c
+++ b/accel/tcg/tcg-accel-ops-icount.c
@@ -93,10 +93,15 @@ void icount_handle_deadline(void)
 int64_t icount_percpu_budget(int cpu_count)
 {
     int64_t limit = icount_get_limit();
-    int64_t timeslice = limit / cpu_count;
+    int64_t timeslice;
 
-    if (timeslice == 0) {
+    if (replay_mode == REPLAY_MODE_PLAY) {
         timeslice = limit;
+    } else {
+        timeslice = limit / cpu_count;
+        if (timeslice == 0) {
+            timeslice = limit;
+        }
     }
 
     return timeslice;
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
index 212d6f8df4..ce040a687e 100644
--- a/accel/tcg/tcg-accel-ops-rr.c
+++ b/accel/tcg/tcg-accel-ops-rr.c
@@ -27,6 +27,7 @@
 #include "qemu/lockable.h"
 #include "sysemu/tcg.h"
 #include "sysemu/replay.h"
+#include "sysemu/reset.h"
 #include "sysemu/cpu-timers.h"
 #include "qemu/main-loop.h"
 #include "qemu/notify.h"
@@ -61,6 +62,22 @@ void rr_kick_vcpu_thread(CPUState *unused)
 
 static QEMUTimer *rr_kick_vcpu_timer;
 static CPUState *rr_current_cpu;
+static CPUState *rr_next_cpu;
+static CPUState *rr_last_cpu;
+
+/*
+ * Reset the vCPU scheduler to the initial state.
+ */
+static void record_replay_reset(void *param)
+{
+    if (rr_kick_vcpu_timer) {
+        timer_del(rr_kick_vcpu_timer);
+    }
+    g_assert(!rr_current_cpu);
+    rr_next_cpu = NULL;
+    rr_last_cpu = first_cpu;
+    current_cpu = NULL;
+}
 
 static inline int64_t rr_next_kick_time(void)
 {
@@ -184,6 +201,8 @@ static void *rr_cpu_thread_fn(void *arg)
     Notifier force_rcu;
     CPUState *cpu = arg;
 
+    qemu_register_reset(record_replay_reset, NULL);
+
     assert(tcg_enabled());
     rcu_register_thread();
     force_rcu.notify = rr_force_rcu;
@@ -238,14 +257,20 @@ static void *rr_cpu_thread_fn(void *arg)
             cpu_budget = icount_percpu_budget(cpu_count);
         }
 
+        if (!rr_next_cpu) {
+            qatomic_set_mb(&rr_next_cpu, first_cpu);
+        }
+        cpu = rr_next_cpu;
+
+        if (cpu != rr_last_cpu) {
+            replay_switch_cpu();
+            qatomic_set_mb(&rr_last_cpu, cpu);
+        }
+
         rr_start_kick_timer();
 
         replay_mutex_unlock();
 
-        if (!cpu) {
-            cpu = first_cpu;
-        }
-
         while (cpu && cpu_work_list_empty(cpu) && !cpu->exit_request) {
             /* Store rr_current_cpu before evaluating cpu_can_run().  */
             qatomic_set_mb(&rr_current_cpu, cpu);
@@ -284,7 +309,34 @@ static void *rr_cpu_thread_fn(void *arg)
                 break;
             }
 
-            cpu = CPU_NEXT(cpu);
+            if (replay_mode == REPLAY_MODE_NONE) {
+                cpu = CPU_NEXT(cpu);
+            } else if (replay_mode == REPLAY_MODE_RECORD) {
+                /*
+                 * Exit the loop immediately so CPU switch events can be
+                 * recorded. This may be able to be improved to record
+                 * switch events here.
+                 */
+                cpu = CPU_NEXT(cpu);
+                break;
+            } else if (replay_mode == REPLAY_MODE_PLAY) {
+                /*
+                 * Play can exit from tcg_cpus_exec at different times than
+                 * record, because icount budget is set to next non-insn
+                 * event which could be an exception or something that
+                 * tcg_cpus_exec can process internally if icount limit was
+                 * not reached. So we don't necessarily switch CPU here,
+		 * only if there is a switch in the trace.
+                 */
+                qemu_mutex_unlock_iothread();
+                replay_mutex_lock();
+                qemu_mutex_lock_iothread();
+                if (replay_has_switch_cpu()) {
+                    cpu = CPU_NEXT(cpu);
+                }
+                replay_mutex_unlock();
+                break;
+            }
         } /* while (cpu && !cpu->exit_request).. */
 
         /* Does not need a memory barrier because a spurious wakeup is okay.  */
@@ -294,9 +346,9 @@ static void *rr_cpu_thread_fn(void *arg)
             qatomic_set_mb(&cpu->exit_request, 0);
         }
 
-        if (all_cpu_threads_idle()) {
-            rr_stop_kick_timer();
+        qatomic_set(&rr_next_cpu, cpu);
 
+        if (all_cpu_threads_idle()) {
             if (icount_enabled()) {
                 /*
 		 * When all cpus are sleeping (e.g in WFI), to avoid a deadlock
@@ -304,6 +356,13 @@ static void *rr_cpu_thread_fn(void *arg)
                  * timer.
                  */
                 qemu_notify_event();
+            } else {
+                /*
+                 * icount timer won't expire when not executing instructions
+                 * so no need to stop it. Avoiding restarts is also important
+                 * for record-replay to avoid clock events.
+                 */
+                rr_stop_kick_timer();
             }
         }
 
diff --git a/include/exec/replay-core.h b/include/exec/replay-core.h
index 244c77acce..543c129a1d 100644
--- a/include/exec/replay-core.h
+++ b/include/exec/replay-core.h
@@ -52,6 +52,9 @@ void replay_gdb_attached(void);
 
 /* Interrupts and exceptions */
 
+bool replay_switch_cpu(void);
+bool replay_has_switch_cpu(void);
+
 /* Called by exception handler to write or read exception processing events */
 bool replay_exception(void);
 /*
diff --git a/replay/replay-internal.h b/replay/replay-internal.h
index b6836354ac..95849e7461 100644
--- a/replay/replay-internal.h
+++ b/replay/replay-internal.h
@@ -58,6 +58,7 @@ enum ReplayEvents {
     /* some of greater codes are reserved for checkpoints */
     EVENT_CHECKPOINT,
     EVENT_CHECKPOINT_LAST = EVENT_CHECKPOINT + CHECKPOINT_COUNT - 1,
+    EVENT_SWITCH_CPU,
     /* end of log event */
     EVENT_END,
     EVENT_COUNT
diff --git a/replay/replay.c b/replay/replay.c
index 0f7d766efe..e2d77ab644 100644
--- a/replay/replay.c
+++ b/replay/replay.c
@@ -98,9 +98,41 @@ void replay_account_executed_instructions(void)
     }
 }
 
-bool replay_exception(void)
+bool replay_switch_cpu(void)
+{
+    if (replay_mode == REPLAY_MODE_RECORD) {
+        g_assert(replay_mutex_locked());
+        replay_save_instructions();
+        replay_put_event(EVENT_SWITCH_CPU);
+        return true;
+    } else if (replay_mode == REPLAY_MODE_PLAY) {
+        bool res = replay_has_switch_cpu();
+        if (res) {
+            replay_finish_event();
+        } else {
+            g_assert_not_reached();
+        }
+        return res;
+    }
+
+    return true;
+}
+
+bool replay_has_switch_cpu(void)
 {
+    bool res = false;
+    if (replay_mode == REPLAY_MODE_PLAY) {
+        g_assert(replay_mutex_locked());
+        replay_account_executed_instructions();
+        res = replay_next_event_is(EVENT_SWITCH_CPU);
+    }
+
+    return res;
+}
 
+
+bool replay_exception(void)
+{
     if (replay_mode == REPLAY_MODE_RECORD) {
         g_assert(replay_mutex_locked());
         replay_save_instructions();
diff --git a/softmmu/vl.c b/softmmu/vl.c
index b0b96f67fa..946b22d1de 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -1869,10 +1869,6 @@ static void qemu_apply_machine_options(QDict *qdict)
         /* fall back to the -kernel/-append */
         semihosting_arg_fallback(current_machine->kernel_filename, current_machine->kernel_cmdline);
     }
-
-    if (current_machine->smp.cpus > 1) {
-        replay_add_blocker("smp");
-    }
 }
 
 static void qemu_create_early_backends(void)
-- 
2.40.1
Re: [RFC v2 PATCH] record-replay: support SMP target machine
Posted by Pavel Dovgalyuk 8 months, 2 weeks ago
On 11.08.2023 04:47, Nicholas Piggin wrote:
> RR CPU switching is driven by timers and events so it is deterministic
> like everything else. Record a CPU switch event and use that to drive
> the CPU switch on replay.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> This is still in RFC phase because so far I've only really testd ppc
> pseries, and only with patches that are not yet upstream (but posted
> to list).
> 
> It works with smp 2, can step, reverse-step, reverse-continue, etc.
> throughout a Linux boot.

I still didn't have time to test it, but here are some comments.

> 
> One issue is reverse-step on one gdb thread (vCPU) only steps back one
> icount, so if another thread is currently running then it is that one
> which goes back one instruction and the selected thread doesn't move. I
> would call this a separate issue from the record-replay mechanism, which
> is in the replay-debugging policy. I think we could record in each vCPU
> an icount of the last instruction it executed before switching, then
> reverse step for that vCPU could replay to there. I think that's not so
> important yet until this mechanism is solid. But if you test and rsi is
> not going backwards, then check your other threads.
> 
> Thanks,
> Nick
> 
> 
>   accel/tcg/tcg-accel-ops-icount.c |  9 +++-
>   accel/tcg/tcg-accel-ops-rr.c     | 73 +++++++++++++++++++++++++++++---
>   include/exec/replay-core.h       |  3 ++
>   replay/replay-internal.h         |  1 +
>   replay/replay.c                  | 34 ++++++++++++++-
>   scripts/replay-dump.py           |  5 +++
>   softmmu/vl.c                     |  4 --
>   7 files changed, 115 insertions(+), 14 deletions(-)
> 
> diff --git a/accel/tcg/tcg-accel-ops-icount.c b/accel/tcg/tcg-accel-ops-icount.c
> index 3d2cfbbc97..c26782a56a 100644
> --- a/accel/tcg/tcg-accel-ops-icount.c
> +++ b/accel/tcg/tcg-accel-ops-icount.c
> @@ -93,10 +93,15 @@ void icount_handle_deadline(void)
>   int64_t icount_percpu_budget(int cpu_count)
>   {
>       int64_t limit = icount_get_limit();
> -    int64_t timeslice = limit / cpu_count;
> +    int64_t timeslice;
>   
> -    if (timeslice == 0) {
> +    if (replay_mode == REPLAY_MODE_PLAY) {
>           timeslice = limit;
> +    } else {
> +        timeslice = limit / cpu_count;
> +        if (timeslice == 0) {
> +            timeslice = limit;
> +        }
>       }
>   
>       return timeslice;
> diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
> index 212d6f8df4..ce040a687e 100644
> --- a/accel/tcg/tcg-accel-ops-rr.c
> +++ b/accel/tcg/tcg-accel-ops-rr.c
> @@ -27,6 +27,7 @@
>   #include "qemu/lockable.h"
>   #include "sysemu/tcg.h"
>   #include "sysemu/replay.h"
> +#include "sysemu/reset.h"
>   #include "sysemu/cpu-timers.h"
>   #include "qemu/main-loop.h"
>   #include "qemu/notify.h"
> @@ -61,6 +62,22 @@ void rr_kick_vcpu_thread(CPUState *unused)
>   
>   static QEMUTimer *rr_kick_vcpu_timer;
>   static CPUState *rr_current_cpu;
> +static CPUState *rr_next_cpu;
> +static CPUState *rr_last_cpu;
> +
> +/*
> + * Reset the vCPU scheduler to the initial state.
> + */
> +static void record_replay_reset(void *param)
> +{
> +    if (rr_kick_vcpu_timer) {
> +        timer_del(rr_kick_vcpu_timer);
> +    }
> +    g_assert(!rr_current_cpu);
> +    rr_next_cpu = NULL;
> +    rr_last_cpu = first_cpu;
> +    current_cpu = NULL;
> +}
>   
>   static inline int64_t rr_next_kick_time(void)
>   {
> @@ -184,6 +201,8 @@ static void *rr_cpu_thread_fn(void *arg)
>       Notifier force_rcu;
>       CPUState *cpu = arg;
>   
> +    qemu_register_reset(record_replay_reset, NULL);
> +
>       assert(tcg_enabled());
>       rcu_register_thread();
>       force_rcu.notify = rr_force_rcu;
> @@ -238,14 +257,20 @@ static void *rr_cpu_thread_fn(void *arg)
>               cpu_budget = icount_percpu_budget(cpu_count);
>           }
>   
> +        if (!rr_next_cpu) {
> +            qatomic_set_mb(&rr_next_cpu, first_cpu);
> +        }
> +        cpu = rr_next_cpu;
> +
> +        if (cpu != rr_last_cpu) {
> +            replay_switch_cpu();
> +            qatomic_set_mb(&rr_last_cpu, cpu);
> +        }
> +
>           rr_start_kick_timer();
>   
>           replay_mutex_unlock();
>   
> -        if (!cpu) {
> -            cpu = first_cpu;
> -        }
> -
>           while (cpu && cpu_work_list_empty(cpu) && !cpu->exit_request) {
>               /* Store rr_current_cpu before evaluating cpu_can_run().  */
>               qatomic_set_mb(&rr_current_cpu, cpu);
> @@ -284,7 +309,34 @@ static void *rr_cpu_thread_fn(void *arg)
>                   break;
>               }
>   
> -            cpu = CPU_NEXT(cpu);
> +            if (replay_mode == REPLAY_MODE_NONE) {
> +                cpu = CPU_NEXT(cpu);
> +            } else if (replay_mode == REPLAY_MODE_RECORD) {
> +                /*
> +                 * Exit the loop immediately so CPU switch events can be
> +                 * recorded. This may be able to be improved to record
> +                 * switch events here.
> +                 */
> +                cpu = CPU_NEXT(cpu);
> +                break;
> +            } else if (replay_mode == REPLAY_MODE_PLAY) {
> +                /*
> +                 * Play can exit from tcg_cpus_exec at different times than
> +                 * record, because icount budget is set to next non-insn
> +                 * event which could be an exception or something that
> +                 * tcg_cpus_exec can process internally if icount limit was
> +                 * not reached. So we don't necessarily switch CPU here,
> +		 * only if there is a switch in the trace.
> +                 */
> +                qemu_mutex_unlock_iothread();
> +                replay_mutex_lock();
> +                qemu_mutex_lock_iothread();
> +                if (replay_has_switch_cpu()) {
> +                    cpu = CPU_NEXT(cpu);
> +                }
> +                replay_mutex_unlock();
> +                break;
> +            }
>           } /* while (cpu && !cpu->exit_request).. */
>   
>           /* Does not need a memory barrier because a spurious wakeup is okay.  */
> @@ -294,9 +346,9 @@ static void *rr_cpu_thread_fn(void *arg)
>               qatomic_set_mb(&cpu->exit_request, 0);
>           }
>   
> -        if (all_cpu_threads_idle()) {
> -            rr_stop_kick_timer();
> +        qatomic_set(&rr_next_cpu, cpu);

This does not seem to be in the mainline.

>   
> +        if (all_cpu_threads_idle()) {
>               if (icount_enabled()) {
>                   /*
>   		 * When all cpus are sleeping (e.g in WFI), to avoid a deadlock
> @@ -304,6 +356,13 @@ static void *rr_cpu_thread_fn(void *arg)
>                    * timer.
>                    */
>                   qemu_notify_event();
> +            } else {
> +                /*
> +                 * icount timer won't expire when not executing instructions
> +                 * so no need to stop it. Avoiding restarts is also important
> +                 * for record-replay to avoid clock events.
> +                 */
> +                rr_stop_kick_timer();
>               }
>           }
>   
> diff --git a/include/exec/replay-core.h b/include/exec/replay-core.h
> index 244c77acce..543c129a1d 100644
> --- a/include/exec/replay-core.h
> +++ b/include/exec/replay-core.h
> @@ -52,6 +52,9 @@ void replay_gdb_attached(void);
>   
>   /* Interrupts and exceptions */
>   
> +bool replay_switch_cpu(void);
> +bool replay_has_switch_cpu(void);
> +
>   /* Called by exception handler to write or read exception processing events */
>   bool replay_exception(void);
>   /*
> diff --git a/replay/replay-internal.h b/replay/replay-internal.h
> index b6836354ac..95849e7461 100644
> --- a/replay/replay-internal.h
> +++ b/replay/replay-internal.h
> @@ -58,6 +58,7 @@ enum ReplayEvents {
>       /* some of greater codes are reserved for checkpoints */
>       EVENT_CHECKPOINT,
>       EVENT_CHECKPOINT_LAST = EVENT_CHECKPOINT + CHECKPOINT_COUNT - 1,
> +    EVENT_SWITCH_CPU,
>       /* end of log event */
>       EVENT_END,
>       EVENT_COUNT
> diff --git a/replay/replay.c b/replay/replay.c
> index 0f7d766efe..e2d77ab644 100644
> --- a/replay/replay.c
> +++ b/replay/replay.c
> @@ -98,9 +98,41 @@ void replay_account_executed_instructions(void)
>       }
>   }
>   
> -bool replay_exception(void)
> +bool replay_switch_cpu(void)
> +{
> +    if (replay_mode == REPLAY_MODE_RECORD) {
> +        g_assert(replay_mutex_locked());
> +        replay_save_instructions();
> +        replay_put_event(EVENT_SWITCH_CPU);
> +        return true;
> +    } else if (replay_mode == REPLAY_MODE_PLAY) {
> +        bool res = replay_has_switch_cpu();
> +        if (res) {
> +            replay_finish_event();
> +        } else {
> +            g_assert_not_reached();
> +        }
> +        return res;
> +    }
> +
> +    return true;
> +}
> +
> +bool replay_has_switch_cpu(void)

Is this function really needed?

>   {
> +    bool res = false;
> +    if (replay_mode == REPLAY_MODE_PLAY) {
> +        g_assert(replay_mutex_locked());
> +        replay_account_executed_instructions();
> +        res = replay_next_event_is(EVENT_SWITCH_CPU);
> +    }
> +
> +    return res;
> +}
>   
> +
> +bool replay_exception(void)
> +{
>       if (replay_mode == REPLAY_MODE_RECORD) {
>           g_assert(replay_mutex_locked());
>           replay_save_instructions();
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index b0b96f67fa..946b22d1de 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -1869,10 +1869,6 @@ static void qemu_apply_machine_options(QDict *qdict)
>           /* fall back to the -kernel/-append */
>           semihosting_arg_fallback(current_machine->kernel_filename, current_machine->kernel_cmdline);
>       }
> -
> -    if (current_machine->smp.cpus > 1) {
> -        replay_add_blocker("smp");
> -    }
>   }
>   
>   static void qemu_create_early_backends(void)
Re: [RFC v2 PATCH] record-replay: support SMP target machine
Posted by Nicholas Piggin 8 months, 2 weeks ago
On Tue Aug 22, 2023 at 2:44 PM AEST, Pavel Dovgalyuk wrote:
> On 11.08.2023 04:47, Nicholas Piggin wrote:
> > RR CPU switching is driven by timers and events so it is deterministic
> > like everything else. Record a CPU switch event and use that to drive
> > the CPU switch on replay.
> > 
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> > This is still in RFC phase because so far I've only really testd ppc
> > pseries, and only with patches that are not yet upstream (but posted
> > to list).
> > 
> > It works with smp 2, can step, reverse-step, reverse-continue, etc.
> > throughout a Linux boot.
>
> I still didn't have time to test it, but here are some comments.

That's okay, I got a little further, mainly adding vmstate to
migrate it (otherwise we can only use the initial snapshot).

Unless there is more interest, I will focus on getting ppc fixes
upstream first. Let me know if you have more time to look, I can
send you the latest.

[snip]

> > @@ -294,9 +346,9 @@ static void *rr_cpu_thread_fn(void *arg)
> >               qatomic_set_mb(&cpu->exit_request, 0);
> >           }
> >   
> > -        if (all_cpu_threads_idle()) {
> > -            rr_stop_kick_timer();
> > +        qatomic_set(&rr_next_cpu, cpu);
>
> This does not seem to be in the mainline.

Sorry I meant to sqush that in or send it out. The kick timer
init vs start needed to be moved to make it work.

[snip]

> > -bool replay_exception(void)
> > +bool replay_switch_cpu(void)
> > +{
> > +    if (replay_mode == REPLAY_MODE_RECORD) {
> > +        g_assert(replay_mutex_locked());
> > +        replay_save_instructions();
> > +        replay_put_event(EVENT_SWITCH_CPU);
> > +        return true;
> > +    } else if (replay_mode == REPLAY_MODE_PLAY) {
> > +        bool res = replay_has_switch_cpu();
> > +        if (res) {
> > +            replay_finish_event();
> > +        } else {
> > +            g_assert_not_reached();
> > +        }
> > +        return res;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +bool replay_has_switch_cpu(void)
>
> Is this function really needed?

I found it was easier to fit in the way the CPU scheduling is done
in rr.

I think that main scheduling loop could be refactored a bit that
could then avoid the need for this (e.g., a helper function to
return the next CPU and all the selection code including rr is
in there). But that became non-trivial and looks like the code is
a bit delicate. I might try to tackle that afterwards.

Thanks,
Nick