Despite the fact that the responsive CPU throttle is enabled,
the dirty sync count may not always increase because this is
an optimization that might not happen in any situation.
This test case just making sure it doesn't interfere with any
current functionality.
Signed-off-by: Hyman Huang <yong.huang@smartx.com>
---
tests/qtest/migration-test.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 4626301435..cf0b1dcb50 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -718,6 +718,7 @@ typedef struct {
typedef struct {
/* CPU throttle parameters */
bool periodic;
+ bool responsive;
} AutoConvergeArgs;
static int test_migrate_start(QTestState **from, QTestState **to,
@@ -2795,6 +2796,7 @@ static void test_migrate_auto_converge_args(AutoConvergeArgs *input_args)
QTestState *from, *to;
int64_t percentage;
bool periodic = (input_args && input_args->periodic);
+ bool responsive = (input_args && input_args->responsive);
/*
* We want the test to be stable and as fast as possible.
@@ -2820,6 +2822,16 @@ static void test_migrate_auto_converge_args(AutoConvergeArgs *input_args)
periodic_throttle_interval);
}
+ if (responsive) {
+ /*
+ * The dirty-sync-count may not always go down while using responsive
+ * throttle because it is an optimization and may not take effect in
+ * any scenario. Just making sure this feature doesn't break any
+ * existing functionality by turning it on.
+ */
+ migrate_set_parameter_bool(from, "cpu-responsive-throttle", true);
+ }
+
/*
* Set the initial parameters so that the migration could not converge
* without throttling.
@@ -2902,6 +2914,12 @@ static void test_migrate_auto_converge_periodic_throttle(void)
test_migrate_auto_converge_args(&args);
}
+static void test_migrate_auto_converge_responsive_throttle(void)
+{
+ AutoConvergeArgs args = {.responsive = true};
+ test_migrate_auto_converge_args(&args);
+}
+
static void *
test_migrate_precopy_tcp_multifd_start_common(QTestState *from,
QTestState *to,
@@ -3955,6 +3973,8 @@ int main(int argc, char **argv)
test_migrate_auto_converge);
migration_test_add("/migration/auto_converge_periodic_throttle",
test_migrate_auto_converge_periodic_throttle);
+ migration_test_add("/migration/auto_converge_responsive_throttle",
+ test_migrate_auto_converge_responsive_throttle);
if (g_str_equal(arch, "x86_64") &&
has_kvm && kvm_dirty_ring_supported()) {
migration_test_add("/migration/dirty_limit",
--
2.39.1
On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: > > Despite the fact that the responsive CPU throttle is enabled, > the dirty sync count may not always increase because this is > an optimization that might not happen in any situation. > > This test case just making sure it doesn't interfere with any > current functionality. > > Signed-off-by: Hyman Huang <yong.huang@smartx.com> tests/qtest/migration-test already runs 75 different subtests, takes up a massive chunk of our "make check" time, and is very commonly a "times out" test on some of our CI jobs. It runs on five different guest CPU architectures, each one of which takes between 2 and 5 minutes to complete the full migration-test. Do we really need to make it even bigger? thanks -- PMM
On Mon, Sep 9, 2024 at 10:03 PM Peter Maydell <peter.maydell@linaro.org> wrote: > On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: > > > > Despite the fact that the responsive CPU throttle is enabled, > > the dirty sync count may not always increase because this is > > an optimization that might not happen in any situation. > > > > This test case just making sure it doesn't interfere with any > > current functionality. > > > > Signed-off-by: Hyman Huang <yong.huang@smartx.com> > > tests/qtest/migration-test already runs 75 different > subtests, takes up a massive chunk of our "make check" > time, and is very commonly a "times out" test on some > of our CI jobs. It runs on five different guest CPU > architectures, each one of which takes between 2 and > 5 minutes to complete the full migration-test. > > Do we really need to make it even bigger? > No, I don't insist on that; the cpu-responsive-throttle parameter may also be enabled on the existing migrate_auto_converge test case by default. Thank for the comment. Yong. > > thanks > -- PMM > -- Best regards
On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: > On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: > > > > Despite the fact that the responsive CPU throttle is enabled, > > the dirty sync count may not always increase because this is > > an optimization that might not happen in any situation. > > > > This test case just making sure it doesn't interfere with any > > current functionality. > > > > Signed-off-by: Hyman Huang <yong.huang@smartx.com> > > tests/qtest/migration-test already runs 75 different > subtests, takes up a massive chunk of our "make check" > time, and is very commonly a "times out" test on some > of our CI jobs. It runs on five different guest CPU > architectures, each one of which takes between 2 and > 5 minutes to complete the full migration-test. > > Do we really need to make it even bigger? I'll try to find some time in the next few weeks looking into this to see whether we can further shrink migration test times after previous attemps from Dan. At least a low hanging fruit is we should indeed put some more tests into g_test_slow(), and this new test could also be a candidate (then we can run "-m slow" for migration PRs only). Thanks, -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: >> > >> > Despite the fact that the responsive CPU throttle is enabled, >> > the dirty sync count may not always increase because this is >> > an optimization that might not happen in any situation. >> > >> > This test case just making sure it doesn't interfere with any >> > current functionality. >> > >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com> >> >> tests/qtest/migration-test already runs 75 different >> subtests, takes up a massive chunk of our "make check" >> time, and is very commonly a "times out" test on some >> of our CI jobs. It runs on five different guest CPU >> architectures, each one of which takes between 2 and >> 5 minutes to complete the full migration-test. >> >> Do we really need to make it even bigger? > > I'll try to find some time in the next few weeks looking into this to see > whether we can further shrink migration test times after previous attemps > from Dan. At least a low hanging fruit is we should indeed put some more > tests into g_test_slow(), and this new test could also be a candidate (then > we can run "-m slow" for migration PRs only). I think we could (using -m slow or any other method) separate tests that are generic enough that every CI run should benefit from them vs. tests that are only useful once someone starts touching migration code. I'd say very few in the former category and most of them in the latter. For an idea of where migration bugs lie, I took a look at what was fixed since 2022: # bugs | device/subsystem/arch ---------------------------------- 54 | migration 10 | vfio 6 | ppc 3 | virtio-gpu 2 | pcie_sriov, tpm_emulator, vdpa, virtio-rng-pci 1 | arm, block, gpio, lasi, pci, s390, scsi-disk, virtio-mem, TCG From these, ignoring the migration bugs, the migration-tests cover some of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but once it is, then virtio-gpu would be covered and we could investigate adding some of the others. For actual migration code issues: # bugs | (sub)subsystem | kind ---------------------------------------------- 13 | multifd | correctness/races 8 | ram | correctness 8 | rdma: | general programming 7 | qmp | new api bugs 5 | postcopy | races 4 | file: | leaks 3 | return path | races 3 | fd_cleanup | races 2 | savevm, aio/coroutines 1 | xbzrle, colo, dirtyrate, exec:, windows, iochannel, qemufile, arch (ppc64le) Here, the migration-tests cover well: multifd, ram, qmp, postcopy, file, rp, fd_cleanup, iochannel, qemufile, xbzrle. My suggestion is we run per arch: "/precopy/tcp/plain" "/precopy/tcp/tls/psk/match", "/postcopy/plain" "/postcopy/preempt/plain" "/postcopy/preempt/recovery/plain" "/multifd/tcp/plain/cancel" "/multifd/tcp/uri/plain/none" and x86 gets extra: "/precopy/unix/suspend/live" "/precopy/unix/suspend/notlive" "/dirty_ring" (the other dirty_* tests are too slow) All the rest go behind a knob that people touching migration code will enable. wdyt? 1- allows adding devices to QEMU cmdline for migration-test https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: > >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: > >> > > >> > Despite the fact that the responsive CPU throttle is enabled, > >> > the dirty sync count may not always increase because this is > >> > an optimization that might not happen in any situation. > >> > > >> > This test case just making sure it doesn't interfere with any > >> > current functionality. > >> > > >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com> > >> > >> tests/qtest/migration-test already runs 75 different > >> subtests, takes up a massive chunk of our "make check" > >> time, and is very commonly a "times out" test on some > >> of our CI jobs. It runs on five different guest CPU > >> architectures, each one of which takes between 2 and > >> 5 minutes to complete the full migration-test. > >> > >> Do we really need to make it even bigger? > > > > I'll try to find some time in the next few weeks looking into this to see > > whether we can further shrink migration test times after previous attemps > > from Dan. At least a low hanging fruit is we should indeed put some more > > tests into g_test_slow(), and this new test could also be a candidate (then > > we can run "-m slow" for migration PRs only). > > I think we could (using -m slow or any other method) separate tests > that are generic enough that every CI run should benefit from them > vs. tests that are only useful once someone starts touching migration > code. I'd say very few in the former category and most of them in the > latter. > > For an idea of where migration bugs lie, I took a look at what was > fixed since 2022: > > # bugs | device/subsystem/arch > ---------------------------------- > 54 | migration > 10 | vfio > 6 | ppc > 3 | virtio-gpu > 2 | pcie_sriov, tpm_emulator, > vdpa, virtio-rng-pci > 1 | arm, block, gpio, lasi, > pci, s390, scsi-disk, > virtio-mem, TCG Just curious; how did you collect these? > > From these, ignoring the migration bugs, the migration-tests cover some > of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but > once it is, then virtio-gpu would be covered and we could investigate > adding some of the others. > > For actual migration code issues: > > # bugs | (sub)subsystem | kind > ---------------------------------------------- > 13 | multifd | correctness/races > 8 | ram | correctness > 8 | rdma: | general programming 8 rdma bugs??? ouch.. > 7 | qmp | new api bugs > 5 | postcopy | races > 4 | file: | leaks > 3 | return path | races > 3 | fd_cleanup | races > 2 | savevm, aio/coroutines > 1 | xbzrle, colo, dirtyrate, exec:, > windows, iochannel, qemufile, > arch (ppc64le) > > Here, the migration-tests cover well: multifd, ram, qmp, postcopy, > file, rp, fd_cleanup, iochannel, qemufile, xbzrle. > > My suggestion is we run per arch: > > "/precopy/tcp/plain" > "/precopy/tcp/tls/psk/match", > "/postcopy/plain" > "/postcopy/preempt/plain" > "/postcopy/preempt/recovery/plain" > "/multifd/tcp/plain/cancel" > "/multifd/tcp/uri/plain/none" Don't you want to still keep a few multifd / file tests? IIUC some file ops can still be relevant to archs. Multifd still has one bug that can only reproduce on arm64.. but not x86_64. I remember it's a race condition when migration finishes, and the issue could be memory ordering relevant, but maybe not. > > and x86 gets extra: > > "/precopy/unix/suspend/live" > "/precopy/unix/suspend/notlive" > "/dirty_ring" dirty ring will be disabled anyway when !x86, so probably not a major concern. > > (the other dirty_* tests are too slow) These are the 10 slowest tests when I run locally: /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41 /x86_64/migration/postcopy/recovery/plain 2.43 /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66 /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86 /x86_64/migration/postcopy/tls/psk 2.91 /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08 /x86_64/migration/postcopy/preempt/tls/psk 3.30 /x86_64/migration/postcopy/recovery/tls/psk 3.81 /x86_64/migration/vcpu_dirty_limit 13.29 /x86_64/migration/precopy/unix/xbzrle 27.55 Are you aware of people using xbzrle at all? > > All the rest go behind a knob that people touching migration code will > enable. > > wdyt? Agree with the general idea, but I worry above exact list can be too small. IMHO we can definitely, at least, move the last two into slow list (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run.. > > 1- allows adding devices to QEMU cmdline for migration-test > https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de > -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote: >> Peter Xu <peterx@redhat.com> writes: >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: >> >> > >> >> > Despite the fact that the responsive CPU throttle is enabled, >> >> > the dirty sync count may not always increase because this is >> >> > an optimization that might not happen in any situation. >> >> > >> >> > This test case just making sure it doesn't interfere with any >> >> > current functionality. >> >> > >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com> >> >> >> >> tests/qtest/migration-test already runs 75 different >> >> subtests, takes up a massive chunk of our "make check" >> >> time, and is very commonly a "times out" test on some >> >> of our CI jobs. It runs on five different guest CPU >> >> architectures, each one of which takes between 2 and >> >> 5 minutes to complete the full migration-test. >> >> >> >> Do we really need to make it even bigger? >> > >> > I'll try to find some time in the next few weeks looking into this to see >> > whether we can further shrink migration test times after previous attemps >> > from Dan. At least a low hanging fruit is we should indeed put some more >> > tests into g_test_slow(), and this new test could also be a candidate (then >> > we can run "-m slow" for migration PRs only). >> >> I think we could (using -m slow or any other method) separate tests >> that are generic enough that every CI run should benefit from them >> vs. tests that are only useful once someone starts touching migration >> code. I'd say very few in the former category and most of them in the >> latter. >> >> For an idea of where migration bugs lie, I took a look at what was >> fixed since 2022: >> >> # bugs | device/subsystem/arch >> ---------------------------------- >> 54 | migration >> 10 | vfio >> 6 | ppc >> 3 | virtio-gpu >> 2 | pcie_sriov, tpm_emulator, >> vdpa, virtio-rng-pci >> 1 | arm, block, gpio, lasi, >> pci, s390, scsi-disk, >> virtio-mem, TCG > > Just curious; how did you collect these? git log --since=2022 and then squinted at it. I wrote a warning to take this with a grain of salt, but it missed the final version. > >> >> From these, ignoring the migration bugs, the migration-tests cover some >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but >> once it is, then virtio-gpu would be covered and we could investigate >> adding some of the others. >> >> For actual migration code issues: >> >> # bugs | (sub)subsystem | kind >> ---------------------------------------------- >> 13 | multifd | correctness/races >> 8 | ram | correctness >> 8 | rdma: | general programming > > 8 rdma bugs??? ouch.. Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed integer comparisons and bugs in error handling. I don't even want to look too much at it. ...hopefully this release we'll manage to resolve that situation. > >> 7 | qmp | new api bugs >> 5 | postcopy | races >> 4 | file: | leaks >> 3 | return path | races >> 3 | fd_cleanup | races >> 2 | savevm, aio/coroutines >> 1 | xbzrle, colo, dirtyrate, exec:, >> windows, iochannel, qemufile, >> arch (ppc64le) >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy, >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle. >> >> My suggestion is we run per arch: >> >> "/precopy/tcp/plain" >> "/precopy/tcp/tls/psk/match", >> "/postcopy/plain" >> "/postcopy/preempt/plain" >> "/postcopy/preempt/recovery/plain" >> "/multifd/tcp/plain/cancel" >> "/multifd/tcp/uri/plain/none" > > Don't you want to still keep a few multifd / file tests? Not really, but I won't object if you want to add some more in there. To be honest, I want to get out of people's way as much as I can because having to revisit this every couple of months is stressful to me. My rationale for those is: "/precopy/tcp/plain": Smoke test, the most common migration "/precopy/tcp/tls/psk/match": Something might change in the distro regarding tls. Such as: https://gitlab.com/qemu-project/qemu/-/issues/1937 "/postcopy/plain": Smoke test for postcopy "/postcopy/preempt/plain": Just to exercise the preempt thread "/postcopy/preempt/recovery/plain": Recovery has had some issues with races in the past "/multifd/tcp/plain/cancel": The MVP of catching concurrency issues "/multifd/tcp/uri/plain/none": Smoke test for multifd All in all, these will test basic funcionality and run very often. The more tests we add to this set, the less return we get in relation to the time they take. > > IIUC some file ops can still be relevant to archs. Multifd still has one > bug that can only reproduce on arm64.. but not x86_64. I remember it's a > race condition when migration finishes, and the issue could be memory > ordering relevant, but maybe not. I'm not aware of anything. I believe the last arm64 bug we had was the threadinfo stuff[1]. If you remember what it's about, let me know. 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads"). > >> >> and x86 gets extra: >> >> "/precopy/unix/suspend/live" >> "/precopy/unix/suspend/notlive" >> "/dirty_ring" > > dirty ring will be disabled anyway when !x86, so probably not a major > concern. > >> >> (the other dirty_* tests are too slow) > > These are the 10 slowest tests when I run locally: > > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41 > /x86_64/migration/postcopy/recovery/plain 2.43 > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66 > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86 > /x86_64/migration/postcopy/tls/psk 2.91 > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08 > /x86_64/migration/postcopy/preempt/tls/psk 3.30 > /x86_64/migration/postcopy/recovery/tls/psk 3.81 > /x86_64/migration/vcpu_dirty_limit 13.29 > /x86_64/migration/precopy/unix/xbzrle 27.55 > > Are you aware of people using xbzrle at all? Nope. > >> >> All the rest go behind a knob that people touching migration code will >> enable. >> >> wdyt? > > Agree with the general idea, but I worry above exact list can be too small. We won't stop running the full set of tests. We can run them in our forks' CI as much as we want. There are no cases of people reporting a migration bug because their 'make check' caught something that ours didn't. Besides, the main strength of CI is to catch bugs when someone makes a code change. If people touch migration code, then we'll run it in our CI anyway. If they touch device code and that device is migrated by default then any one of the simple tests will catch the issue when it runs via the migration-compat job. If the device is not enabled by default, then no tests will catch it. The worst case scenario is they touch some code completely unrelated and their 'make check' or CI run breaks because of some race in the migration code. That's not what CI is for, that's just an annoyance for everyone. I'd rather it breaks in our hands and then we'll go fix it. > > IMHO we can definitely, at least, move the last two into slow list > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run.. Agreed. I'll send a patch once I get out from under downstream stuff. > >> >> 1- allows adding devices to QEMU cmdline for migration-test >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de >>
On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote: > >> Peter Xu <peterx@redhat.com> writes: > >> > >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: > >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: > >> >> > > >> >> > Despite the fact that the responsive CPU throttle is enabled, > >> >> > the dirty sync count may not always increase because this is > >> >> > an optimization that might not happen in any situation. > >> >> > > >> >> > This test case just making sure it doesn't interfere with any > >> >> > current functionality. > >> >> > > >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com> > >> >> > >> >> tests/qtest/migration-test already runs 75 different > >> >> subtests, takes up a massive chunk of our "make check" > >> >> time, and is very commonly a "times out" test on some > >> >> of our CI jobs. It runs on five different guest CPU > >> >> architectures, each one of which takes between 2 and > >> >> 5 minutes to complete the full migration-test. > >> >> > >> >> Do we really need to make it even bigger? > >> > > >> > I'll try to find some time in the next few weeks looking into this to see > >> > whether we can further shrink migration test times after previous attemps > >> > from Dan. At least a low hanging fruit is we should indeed put some more > >> > tests into g_test_slow(), and this new test could also be a candidate (then > >> > we can run "-m slow" for migration PRs only). > >> > >> I think we could (using -m slow or any other method) separate tests > >> that are generic enough that every CI run should benefit from them > >> vs. tests that are only useful once someone starts touching migration > >> code. I'd say very few in the former category and most of them in the > >> latter. > >> > >> For an idea of where migration bugs lie, I took a look at what was > >> fixed since 2022: > >> > >> # bugs | device/subsystem/arch > >> ---------------------------------- > >> 54 | migration > >> 10 | vfio > >> 6 | ppc > >> 3 | virtio-gpu > >> 2 | pcie_sriov, tpm_emulator, > >> vdpa, virtio-rng-pci > >> 1 | arm, block, gpio, lasi, > >> pci, s390, scsi-disk, > >> virtio-mem, TCG > > > > Just curious; how did you collect these? > > git log --since=2022 and then squinted at it. I wrote a warning to take > this with a grain of salt, but it missed the final version. > > > > >> > >> From these, ignoring the migration bugs, the migration-tests cover some > >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but > >> once it is, then virtio-gpu would be covered and we could investigate > >> adding some of the others. > >> > >> For actual migration code issues: > >> > >> # bugs | (sub)subsystem | kind > >> ---------------------------------------------- > >> 13 | multifd | correctness/races > >> 8 | ram | correctness > >> 8 | rdma: | general programming > > > > 8 rdma bugs??? ouch.. > > Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed > integer comparisons and bugs in error handling. I don't even want to > look too much at it. > > ...hopefully this release we'll manage to resolve that situation. > > > > >> 7 | qmp | new api bugs > >> 5 | postcopy | races > >> 4 | file: | leaks > >> 3 | return path | races > >> 3 | fd_cleanup | races > >> 2 | savevm, aio/coroutines > >> 1 | xbzrle, colo, dirtyrate, exec:, > >> windows, iochannel, qemufile, > >> arch (ppc64le) > >> > >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy, > >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle. > >> > >> My suggestion is we run per arch: > >> > >> "/precopy/tcp/plain" > >> "/precopy/tcp/tls/psk/match", > >> "/postcopy/plain" > >> "/postcopy/preempt/plain" > >> "/postcopy/preempt/recovery/plain" > >> "/multifd/tcp/plain/cancel" > >> "/multifd/tcp/uri/plain/none" > > > > Don't you want to still keep a few multifd / file tests? > > Not really, but I won't object if you want to add some more in there. To > be honest, I want to get out of people's way as much as I can because > having to revisit this every couple of months is stressful to me. I hope migration tests are not too obstructive yet so far :) - this discussion is still within a context where we might add one more slow test in CI, and we probably simply should make it a -m slow test. > > My rationale for those is: > > "/precopy/tcp/plain": > Smoke test, the most common migration E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt. And note that even if unix shares the socket iochannel with tcp, it may not work the same. For example, if you remember I mentioned I was looking at threadify the dest qemu receive side, i have a draft there but currently it has a bug to hang a unix migration, not tcp.. So IMHO it's not easy to justify which we can simply drop, if we still want to provide some good gating in CI. And I won't be 100% surprised if some other non-migration PR (e.g. io/) changed some slight bit that unix is broken and tcp keeps working, for example, then we loose some CI benefits. > > "/precopy/tcp/tls/psk/match": > Something might change in the distro regarding tls. Such as: > https://gitlab.com/qemu-project/qemu/-/issues/1937 > > "/postcopy/plain": > Smoke test for postcopy > > "/postcopy/preempt/plain": > Just to exercise the preempt thread > > "/postcopy/preempt/recovery/plain": > Recovery has had some issues with races in the past > > "/multifd/tcp/plain/cancel": > The MVP of catching concurrency issues > > "/multifd/tcp/uri/plain/none": > Smoke test for multifd > > All in all, these will test basic funcionality and run very often. The > more tests we add to this set, the less return we get in relation to the > time they take. This is true. We can try to discuss more on which is more important; I still think something might be good to be added on top of above. There's also the other way - at some point, I still want to improve migration-test run speed, and please have a look if you like too at some point: so there's still chance (average is ~2sec as of now), IMHO, we don't lose anything in CI but runs simply faster. > > > > > IIUC some file ops can still be relevant to archs. Multifd still has one > > bug that can only reproduce on arm64.. but not x86_64. I remember it's a > > race condition when migration finishes, and the issue could be memory > > ordering relevant, but maybe not. > > I'm not aware of anything. I believe the last arm64 bug we had was the > threadinfo stuff[1]. If you remember what it's about, let me know. > > 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads"). https://issues.redhat.com/browse/RHEL-45628 On RH side Bandan is looking at it, but he's during a long holidays recently. > > > > >> > >> and x86 gets extra: > >> > >> "/precopy/unix/suspend/live" > >> "/precopy/unix/suspend/notlive" > >> "/dirty_ring" > > > > dirty ring will be disabled anyway when !x86, so probably not a major > > concern. > > > >> > >> (the other dirty_* tests are too slow) > > > > These are the 10 slowest tests when I run locally: > > > > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41 > > /x86_64/migration/postcopy/recovery/plain 2.43 > > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66 > > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86 > > /x86_64/migration/postcopy/tls/psk 2.91 > > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08 > > /x86_64/migration/postcopy/preempt/tls/psk 3.30 > > /x86_64/migration/postcopy/recovery/tls/psk 3.81 > > /x86_64/migration/vcpu_dirty_limit 13.29 > > /x86_64/migration/precopy/unix/xbzrle 27.55 > > > > Are you aware of people using xbzrle at all? > > Nope. > > > > >> > >> All the rest go behind a knob that people touching migration code will > >> enable. > >> > >> wdyt? > > > > Agree with the general idea, but I worry above exact list can be too small. > > We won't stop running the full set of tests. We can run them in our > forks' CI as much as we want. There are no cases of people reporting a > migration bug because their 'make check' caught something that ours > didn't. IIUC it's hard to say - when the test is in CI maintainers can catch them already before sending a formal PR. If the test is run by default in make check, a developer can trigger a migration issue (with his/her patch applied) then one can notice it introduced a bug, fix it, then post the patches. We won't know whether that happened. So one thing we can do (if you think worthwhile to do it now) is we shrink the default test case a lot as you proposed, then we wait and see what breaks, and then we gradually add tests back when it can be used to find breakages. But that'll also take time if it really can find such tests, because then we'll need to debug them one by one (instead of developer / maintainer looking into them with their expertise knowledge..). I'm not sure whether it's worthwhile to do it now, but I don't feel strongly if we can still have a reliable set of default test cases. > > Besides, the main strength of CI is to catch bugs when someone makes a > code change. If people touch migration code, then we'll run it in our CI > anyway. If they touch device code and that device is migrated by default > then any one of the simple tests will catch the issue when it runs via > the migration-compat job. If the device is not enabled by default, then > no tests will catch it. > > The worst case scenario is they touch some code completely unrelated and > their 'make check' or CI run breaks because of some race in the > migration code. That's not what CI is for, that's just an annoyance for > everyone. I'd rather it breaks in our hands and then we'll go fix it. > > > > > IMHO we can definitely, at least, move the last two into slow list > > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run.. > > Agreed. I'll send a patch once I get out from under downstream stuff. > > > > >> > >> 1- allows adding devices to QEMU cmdline for migration-test > >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de > >> > -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote: >> Peter Xu <peterx@redhat.com> writes: >> >> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote: >> >> Peter Xu <peterx@redhat.com> writes: >> >> >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: >> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: >> >> >> > >> >> >> > Despite the fact that the responsive CPU throttle is enabled, >> >> >> > the dirty sync count may not always increase because this is >> >> >> > an optimization that might not happen in any situation. >> >> >> > >> >> >> > This test case just making sure it doesn't interfere with any >> >> >> > current functionality. >> >> >> > >> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com> >> >> >> >> >> >> tests/qtest/migration-test already runs 75 different >> >> >> subtests, takes up a massive chunk of our "make check" >> >> >> time, and is very commonly a "times out" test on some >> >> >> of our CI jobs. It runs on five different guest CPU >> >> >> architectures, each one of which takes between 2 and >> >> >> 5 minutes to complete the full migration-test. >> >> >> >> >> >> Do we really need to make it even bigger? >> >> > >> >> > I'll try to find some time in the next few weeks looking into this to see >> >> > whether we can further shrink migration test times after previous attemps >> >> > from Dan. At least a low hanging fruit is we should indeed put some more >> >> > tests into g_test_slow(), and this new test could also be a candidate (then >> >> > we can run "-m slow" for migration PRs only). >> >> >> >> I think we could (using -m slow or any other method) separate tests >> >> that are generic enough that every CI run should benefit from them >> >> vs. tests that are only useful once someone starts touching migration >> >> code. I'd say very few in the former category and most of them in the >> >> latter. >> >> >> >> For an idea of where migration bugs lie, I took a look at what was >> >> fixed since 2022: >> >> >> >> # bugs | device/subsystem/arch >> >> ---------------------------------- >> >> 54 | migration >> >> 10 | vfio >> >> 6 | ppc >> >> 3 | virtio-gpu >> >> 2 | pcie_sriov, tpm_emulator, >> >> vdpa, virtio-rng-pci >> >> 1 | arm, block, gpio, lasi, >> >> pci, s390, scsi-disk, >> >> virtio-mem, TCG >> > >> > Just curious; how did you collect these? >> >> git log --since=2022 and then squinted at it. I wrote a warning to take >> this with a grain of salt, but it missed the final version. >> >> > >> >> >> >> From these, ignoring the migration bugs, the migration-tests cover some >> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but >> >> once it is, then virtio-gpu would be covered and we could investigate >> >> adding some of the others. >> >> >> >> For actual migration code issues: >> >> >> >> # bugs | (sub)subsystem | kind >> >> ---------------------------------------------- >> >> 13 | multifd | correctness/races >> >> 8 | ram | correctness >> >> 8 | rdma: | general programming >> > >> > 8 rdma bugs??? ouch.. >> >> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed >> integer comparisons and bugs in error handling. I don't even want to >> look too much at it. >> >> ...hopefully this release we'll manage to resolve that situation. >> >> > >> >> 7 | qmp | new api bugs >> >> 5 | postcopy | races >> >> 4 | file: | leaks >> >> 3 | return path | races >> >> 3 | fd_cleanup | races >> >> 2 | savevm, aio/coroutines >> >> 1 | xbzrle, colo, dirtyrate, exec:, >> >> windows, iochannel, qemufile, >> >> arch (ppc64le) >> >> >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy, >> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle. >> >> >> >> My suggestion is we run per arch: >> >> >> >> "/precopy/tcp/plain" >> >> "/precopy/tcp/tls/psk/match", >> >> "/postcopy/plain" >> >> "/postcopy/preempt/plain" >> >> "/postcopy/preempt/recovery/plain" >> >> "/multifd/tcp/plain/cancel" >> >> "/multifd/tcp/uri/plain/none" >> > >> > Don't you want to still keep a few multifd / file tests? >> >> Not really, but I won't object if you want to add some more in there. To >> be honest, I want to get out of people's way as much as I can because >> having to revisit this every couple of months is stressful to me. > > I hope migration tests are not too obstructive yet so far :) - this > discussion is still within a context where we might add one more slow test > in CI, and we probably simply should make it a -m slow test. > >> >> My rationale for those is: >> >> "/precopy/tcp/plain": >> Smoke test, the most common migration > > E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt. > > And note that even if unix shares the socket iochannel with tcp, it may not > work the same. For example, if you remember I mentioned I was looking at > threadify the dest qemu receive side, i have a draft there but currently it > has a bug to hang a unix migration, not tcp.. Ok, let's add a unix one, no problem. > > So IMHO it's not easy to justify which we can simply drop, if we still want > to provide some good gating in CI. It's not exactly about dropping, it's about which ones need to be run at *every* invocation of make check (x5 because of the multiple archs) and at every git branch push in CI (unless -o ci.skip). For gating we don't need all the tests. Many of them are testing the same core code with just a few differences at the margins. > And I won't be 100% surprised if some > other non-migration PR (e.g. io/) changed some slight bit that unix is > broken and tcp keeps working, for example, then we loose some CI benefits. IMO, these non-migration PRs are exactly what we have to worry about. Because migration PRs would run with -m slow and we'd catch the issue there. > >> >> "/precopy/tcp/tls/psk/match": >> Something might change in the distro regarding tls. Such as: >> https://gitlab.com/qemu-project/qemu/-/issues/1937 >> >> "/postcopy/plain": >> Smoke test for postcopy >> >> "/postcopy/preempt/plain": >> Just to exercise the preempt thread >> >> "/postcopy/preempt/recovery/plain": >> Recovery has had some issues with races in the past >> >> "/multifd/tcp/plain/cancel": >> The MVP of catching concurrency issues >> >> "/multifd/tcp/uri/plain/none": >> Smoke test for multifd >> >> All in all, these will test basic funcionality and run very often. The >> more tests we add to this set, the less return we get in relation to the >> time they take. > > This is true. We can try to discuss more on which is more important; I > still think something might be good to be added on top of above. Sure, just add what you think we need. > > There's also the other way - at some point, I still want to improve > migration-test run speed, and please have a look if you like too at some > point: so there's still chance (average is ~2sec as of now), IMHO, we don't > lose anything in CI but runs simply faster. My only idea, but it requires a bit of work, is to do unit testing on the interfaces. Anything before migration_fd_connect(). Then we could stop doing a full migration for those. > >> >> > >> > IIUC some file ops can still be relevant to archs. Multifd still has one >> > bug that can only reproduce on arm64.. but not x86_64. I remember it's a >> > race condition when migration finishes, and the issue could be memory >> > ordering relevant, but maybe not. >> >> I'm not aware of anything. I believe the last arm64 bug we had was the >> threadinfo stuff[1]. If you remember what it's about, let me know. >> >> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads"). > > https://issues.redhat.com/browse/RHEL-45628 Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't think other tests will. Also, we don't have tests for zerocopy AFAIK. > > On RH side Bandan is looking at it, but he's during a long holidays > recently. Good luck to him. > >> >> > >> >> >> >> and x86 gets extra: >> >> >> >> "/precopy/unix/suspend/live" >> >> "/precopy/unix/suspend/notlive" >> >> "/dirty_ring" >> > >> > dirty ring will be disabled anyway when !x86, so probably not a major >> > concern. >> > >> >> >> >> (the other dirty_* tests are too slow) >> > >> > These are the 10 slowest tests when I run locally: >> > >> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41 >> > /x86_64/migration/postcopy/recovery/plain 2.43 >> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66 >> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86 >> > /x86_64/migration/postcopy/tls/psk 2.91 >> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08 >> > /x86_64/migration/postcopy/preempt/tls/psk 3.30 >> > /x86_64/migration/postcopy/recovery/tls/psk 3.81 >> > /x86_64/migration/vcpu_dirty_limit 13.29 >> > /x86_64/migration/precopy/unix/xbzrle 27.55 >> > >> > Are you aware of people using xbzrle at all? >> >> Nope. >> >> > >> >> >> >> All the rest go behind a knob that people touching migration code will >> >> enable. >> >> >> >> wdyt? >> > >> > Agree with the general idea, but I worry above exact list can be too small. >> >> We won't stop running the full set of tests. We can run them in our >> forks' CI as much as we want. There are no cases of people reporting a >> migration bug because their 'make check' caught something that ours >> didn't. > > IIUC it's hard to say - when the test is in CI maintainers can catch them > already before sending a formal PR. > > If the test is run by default in make check, a developer can trigger a > migration issue (with his/her patch applied) then one can notice it > introduced a bug, fix it, then post the patches. We won't know whether > that happened. Good point. > > So one thing we can do (if you think worthwhile to do it now) is we shrink > the default test case a lot as you proposed, then we wait and see what > breaks, and then we gradually add tests back when it can be used to find > breakages. But that'll also take time if it really can find such tests, > because then we'll need to debug them one by one (instead of developer / > maintainer looking into them with their expertise knowledge..). I'm not > sure whether it's worthwhile to do it now, but I don't feel strongly if we > can still have a reliable set of default test cases. We first need a way to enable them for the migration CI. Do we have a variable that CI understands that can be used to enable slow tests? > >> >> Besides, the main strength of CI is to catch bugs when someone makes a >> code change. If people touch migration code, then we'll run it in our CI >> anyway. If they touch device code and that device is migrated by default >> then any one of the simple tests will catch the issue when it runs via >> the migration-compat job. If the device is not enabled by default, then >> no tests will catch it. >> >> The worst case scenario is they touch some code completely unrelated and >> their 'make check' or CI run breaks because of some race in the >> migration code. That's not what CI is for, that's just an annoyance for >> everyone. I'd rather it breaks in our hands and then we'll go fix it. >> >> > >> > IMHO we can definitely, at least, move the last two into slow list >> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run.. >> >> Agreed. I'll send a patch once I get out from under downstream stuff. >> >> > >> >> >> >> 1- allows adding devices to QEMU cmdline for migration-test >> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de >> >> >>
On Wed, Sep 11, 2024 at 04:48:21PM -0300, Fabiano Rosas wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote: > >> Peter Xu <peterx@redhat.com> writes: > >> > >> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote: > >> >> Peter Xu <peterx@redhat.com> writes: > >> >> > >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: > >> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: > >> >> >> > > >> >> >> > Despite the fact that the responsive CPU throttle is enabled, > >> >> >> > the dirty sync count may not always increase because this is > >> >> >> > an optimization that might not happen in any situation. > >> >> >> > > >> >> >> > This test case just making sure it doesn't interfere with any > >> >> >> > current functionality. > >> >> >> > > >> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com> > >> >> >> > >> >> >> tests/qtest/migration-test already runs 75 different > >> >> >> subtests, takes up a massive chunk of our "make check" > >> >> >> time, and is very commonly a "times out" test on some > >> >> >> of our CI jobs. It runs on five different guest CPU > >> >> >> architectures, each one of which takes between 2 and > >> >> >> 5 minutes to complete the full migration-test. > >> >> >> > >> >> >> Do we really need to make it even bigger? > >> >> > > >> >> > I'll try to find some time in the next few weeks looking into this to see > >> >> > whether we can further shrink migration test times after previous attemps > >> >> > from Dan. At least a low hanging fruit is we should indeed put some more > >> >> > tests into g_test_slow(), and this new test could also be a candidate (then > >> >> > we can run "-m slow" for migration PRs only). > >> >> > >> >> I think we could (using -m slow or any other method) separate tests > >> >> that are generic enough that every CI run should benefit from them > >> >> vs. tests that are only useful once someone starts touching migration > >> >> code. I'd say very few in the former category and most of them in the > >> >> latter. > >> >> > >> >> For an idea of where migration bugs lie, I took a look at what was > >> >> fixed since 2022: > >> >> > >> >> # bugs | device/subsystem/arch > >> >> ---------------------------------- > >> >> 54 | migration > >> >> 10 | vfio > >> >> 6 | ppc > >> >> 3 | virtio-gpu > >> >> 2 | pcie_sriov, tpm_emulator, > >> >> vdpa, virtio-rng-pci > >> >> 1 | arm, block, gpio, lasi, > >> >> pci, s390, scsi-disk, > >> >> virtio-mem, TCG > >> > > >> > Just curious; how did you collect these? > >> > >> git log --since=2022 and then squinted at it. I wrote a warning to take > >> this with a grain of salt, but it missed the final version. > >> > >> > > >> >> > >> >> From these, ignoring the migration bugs, the migration-tests cover some > >> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but > >> >> once it is, then virtio-gpu would be covered and we could investigate > >> >> adding some of the others. > >> >> > >> >> For actual migration code issues: > >> >> > >> >> # bugs | (sub)subsystem | kind > >> >> ---------------------------------------------- > >> >> 13 | multifd | correctness/races > >> >> 8 | ram | correctness > >> >> 8 | rdma: | general programming > >> > > >> > 8 rdma bugs??? ouch.. > >> > >> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed > >> integer comparisons and bugs in error handling. I don't even want to > >> look too much at it. > >> > >> ...hopefully this release we'll manage to resolve that situation. > >> > >> > > >> >> 7 | qmp | new api bugs > >> >> 5 | postcopy | races > >> >> 4 | file: | leaks > >> >> 3 | return path | races > >> >> 3 | fd_cleanup | races > >> >> 2 | savevm, aio/coroutines > >> >> 1 | xbzrle, colo, dirtyrate, exec:, > >> >> windows, iochannel, qemufile, > >> >> arch (ppc64le) > >> >> > >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy, > >> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle. > >> >> > >> >> My suggestion is we run per arch: > >> >> > >> >> "/precopy/tcp/plain" > >> >> "/precopy/tcp/tls/psk/match", > >> >> "/postcopy/plain" > >> >> "/postcopy/preempt/plain" > >> >> "/postcopy/preempt/recovery/plain" > >> >> "/multifd/tcp/plain/cancel" > >> >> "/multifd/tcp/uri/plain/none" > >> > > >> > Don't you want to still keep a few multifd / file tests? > >> > >> Not really, but I won't object if you want to add some more in there. To > >> be honest, I want to get out of people's way as much as I can because > >> having to revisit this every couple of months is stressful to me. > > > > I hope migration tests are not too obstructive yet so far :) - this > > discussion is still within a context where we might add one more slow test > > in CI, and we probably simply should make it a -m slow test. > > > >> > >> My rationale for those is: > >> > >> "/precopy/tcp/plain": > >> Smoke test, the most common migration > > > > E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt. > > > > And note that even if unix shares the socket iochannel with tcp, it may not > > work the same. For example, if you remember I mentioned I was looking at > > threadify the dest qemu receive side, i have a draft there but currently it > > has a bug to hang a unix migration, not tcp.. > > Ok, let's add a unix one, no problem. > > > > > So IMHO it's not easy to justify which we can simply drop, if we still want > > to provide some good gating in CI. > > It's not exactly about dropping, it's about which ones need to be run at > *every* invocation of make check (x5 because of the multiple archs) and Yeah, indeed we could consider reducing the number of runs, maybe. However I do remember we used to have migration-test bugs only reproduced with some specific distro, well.. Now I'm looking at the pipelines.. Why 5x, btw? I saw alphine, centos, debian, fedora, opensuse, ubuntu, cfi-x86_64, then compat-x86_64. So that's 8x? They're for the same arch (amd64) so far. Maybe we can skip some indeed. > at every git branch push in CI (unless -o ci.skip). For gating we don't > need all the tests. Many of them are testing the same core code with > just a few differences at the margins. > > > And I won't be 100% surprised if some > > other non-migration PR (e.g. io/) changed some slight bit that unix is > > broken and tcp keeps working, for example, then we loose some CI benefits. > > IMO, these non-migration PRs are exactly what we have to worry > about. Because migration PRs would run with -m slow and we'd catch the > issue there. > > > > >> > >> "/precopy/tcp/tls/psk/match": > >> Something might change in the distro regarding tls. Such as: > >> https://gitlab.com/qemu-project/qemu/-/issues/1937 > >> > >> "/postcopy/plain": > >> Smoke test for postcopy > >> > >> "/postcopy/preempt/plain": > >> Just to exercise the preempt thread > >> > >> "/postcopy/preempt/recovery/plain": > >> Recovery has had some issues with races in the past > >> > >> "/multifd/tcp/plain/cancel": > >> The MVP of catching concurrency issues > >> > >> "/multifd/tcp/uri/plain/none": > >> Smoke test for multifd > >> > >> All in all, these will test basic funcionality and run very often. The > >> more tests we add to this set, the less return we get in relation to the > >> time they take. > > > > This is true. We can try to discuss more on which is more important; I > > still think something might be good to be added on top of above. > > Sure, just add what you think we need. Let's leave the detailed discussion on what to choose until the patch phase. IIUC this can be the last we do. We've just queued your other patch to "slow down" the two time consuming tests, we save 40s for each CI kick (total ~90s now locally, so that's 30% cut already all acrosss!), not so bad as a start. Then if we can remove some out of 8 (e.g., we can choose 2-3), then if it's 3 system runs, it'll be another 62% cut, a total of: 1.0 - 9/13 * 3/8 = ~75% So if we keep removing 5 system running it we can reduce it to only a quarter time comparing to before for each CI run. Would that be good enough already for us to live for quite a few more releases? They're so far all low hanging fruits. If we want to move further (and you think we should then start with reducing test case, rather than profiling the test cases), then feel free to go ahead and send a patch when you're ready. But maybe you don't want to bother after you notice that we can already shrink 75% of runtime even without dropping anything. Your call! > > > > > There's also the other way - at some point, I still want to improve > > migration-test run speed, and please have a look if you like too at some > > point: so there's still chance (average is ~2sec as of now), IMHO, we don't > > lose anything in CI but runs simply faster. > > My only idea, but it requires a bit of work, is to do unit testing on > the interfaces. Anything before migration_fd_connect(). Then we could > stop doing a full migration for those. What I'm thinking is some further profiling, like what Dan used to do. I feel like we still have chance, considering what we run as guest image is simply a loop.. so basically zero boot time logically. > > > > >> > >> > > >> > IIUC some file ops can still be relevant to archs. Multifd still has one > >> > bug that can only reproduce on arm64.. but not x86_64. I remember it's a > >> > race condition when migration finishes, and the issue could be memory > >> > ordering relevant, but maybe not. > >> > >> I'm not aware of anything. I believe the last arm64 bug we had was the > >> threadinfo stuff[1]. If you remember what it's about, let me know. > >> > >> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads"). > > > > https://issues.redhat.com/browse/RHEL-45628 > > Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't > think other tests will. Also, we don't have tests for zerocopy AFAIK. That'll be challenging as it requires locked_vm limit. > > > > > On RH side Bandan is looking at it, but he's during a long holidays > > recently. > > Good luck to him. > > > > >> > >> > > >> >> > >> >> and x86 gets extra: > >> >> > >> >> "/precopy/unix/suspend/live" > >> >> "/precopy/unix/suspend/notlive" > >> >> "/dirty_ring" > >> > > >> > dirty ring will be disabled anyway when !x86, so probably not a major > >> > concern. > >> > > >> >> > >> >> (the other dirty_* tests are too slow) > >> > > >> > These are the 10 slowest tests when I run locally: > >> > > >> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41 > >> > /x86_64/migration/postcopy/recovery/plain 2.43 > >> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66 > >> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86 > >> > /x86_64/migration/postcopy/tls/psk 2.91 > >> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08 > >> > /x86_64/migration/postcopy/preempt/tls/psk 3.30 > >> > /x86_64/migration/postcopy/recovery/tls/psk 3.81 > >> > /x86_64/migration/vcpu_dirty_limit 13.29 > >> > /x86_64/migration/precopy/unix/xbzrle 27.55 > >> > > >> > Are you aware of people using xbzrle at all? > >> > >> Nope. > >> > >> > > >> >> > >> >> All the rest go behind a knob that people touching migration code will > >> >> enable. > >> >> > >> >> wdyt? > >> > > >> > Agree with the general idea, but I worry above exact list can be too small. > >> > >> We won't stop running the full set of tests. We can run them in our > >> forks' CI as much as we want. There are no cases of people reporting a > >> migration bug because their 'make check' caught something that ours > >> didn't. > > > > IIUC it's hard to say - when the test is in CI maintainers can catch them > > already before sending a formal PR. > > > > If the test is run by default in make check, a developer can trigger a > > migration issue (with his/her patch applied) then one can notice it > > introduced a bug, fix it, then post the patches. We won't know whether > > that happened. > > Good point. > > > > > So one thing we can do (if you think worthwhile to do it now) is we shrink > > the default test case a lot as you proposed, then we wait and see what > > breaks, and then we gradually add tests back when it can be used to find > > breakages. But that'll also take time if it really can find such tests, > > because then we'll need to debug them one by one (instead of developer / > > maintainer looking into them with their expertise knowledge..). I'm not > > sure whether it's worthwhile to do it now, but I don't feel strongly if we > > can still have a reliable set of default test cases. > > We first need a way to enable them for the migration CI. Do we have a > variable that CI understands that can be used to enable slow tests? I'm not aware of. Yes sounds like we should have one if it doesn't exist. > > > > >> > >> Besides, the main strength of CI is to catch bugs when someone makes a > >> code change. If people touch migration code, then we'll run it in our CI > >> anyway. If they touch device code and that device is migrated by default > >> then any one of the simple tests will catch the issue when it runs via > >> the migration-compat job. If the device is not enabled by default, then > >> no tests will catch it. > >> > >> The worst case scenario is they touch some code completely unrelated and > >> their 'make check' or CI run breaks because of some race in the > >> migration code. That's not what CI is for, that's just an annoyance for > >> everyone. I'd rather it breaks in our hands and then we'll go fix it. > >> > >> > > >> > IMHO we can definitely, at least, move the last two into slow list > >> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run.. > >> > >> Agreed. I'll send a patch once I get out from under downstream stuff. > >> > >> > > >> >> > >> >> 1- allows adding devices to QEMU cmdline for migration-test > >> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de > >> >> > >> > -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Wed, Sep 11, 2024 at 04:48:21PM -0300, Fabiano Rosas wrote: >> Peter Xu <peterx@redhat.com> writes: >> >> > On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote: >> >> Peter Xu <peterx@redhat.com> writes: >> >> >> >> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote: >> >> >> Peter Xu <peterx@redhat.com> writes: >> >> >> >> >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: >> >> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote: >> >> >> >> > >> >> >> >> > Despite the fact that the responsive CPU throttle is enabled, >> >> >> >> > the dirty sync count may not always increase because this is >> >> >> >> > an optimization that might not happen in any situation. >> >> >> >> > >> >> >> >> > This test case just making sure it doesn't interfere with any >> >> >> >> > current functionality. >> >> >> >> > >> >> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com> >> >> >> >> >> >> >> >> tests/qtest/migration-test already runs 75 different >> >> >> >> subtests, takes up a massive chunk of our "make check" >> >> >> >> time, and is very commonly a "times out" test on some >> >> >> >> of our CI jobs. It runs on five different guest CPU >> >> >> >> architectures, each one of which takes between 2 and >> >> >> >> 5 minutes to complete the full migration-test. >> >> >> >> >> >> >> >> Do we really need to make it even bigger? >> >> >> > >> >> >> > I'll try to find some time in the next few weeks looking into this to see >> >> >> > whether we can further shrink migration test times after previous attemps >> >> >> > from Dan. At least a low hanging fruit is we should indeed put some more >> >> >> > tests into g_test_slow(), and this new test could also be a candidate (then >> >> >> > we can run "-m slow" for migration PRs only). >> >> >> >> >> >> I think we could (using -m slow or any other method) separate tests >> >> >> that are generic enough that every CI run should benefit from them >> >> >> vs. tests that are only useful once someone starts touching migration >> >> >> code. I'd say very few in the former category and most of them in the >> >> >> latter. >> >> >> >> >> >> For an idea of where migration bugs lie, I took a look at what was >> >> >> fixed since 2022: >> >> >> >> >> >> # bugs | device/subsystem/arch >> >> >> ---------------------------------- >> >> >> 54 | migration >> >> >> 10 | vfio >> >> >> 6 | ppc >> >> >> 3 | virtio-gpu >> >> >> 2 | pcie_sriov, tpm_emulator, >> >> >> vdpa, virtio-rng-pci >> >> >> 1 | arm, block, gpio, lasi, >> >> >> pci, s390, scsi-disk, >> >> >> virtio-mem, TCG >> >> > >> >> > Just curious; how did you collect these? >> >> >> >> git log --since=2022 and then squinted at it. I wrote a warning to take >> >> this with a grain of salt, but it missed the final version. >> >> >> >> > >> >> >> >> >> >> From these, ignoring the migration bugs, the migration-tests cover some >> >> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but >> >> >> once it is, then virtio-gpu would be covered and we could investigate >> >> >> adding some of the others. >> >> >> >> >> >> For actual migration code issues: >> >> >> >> >> >> # bugs | (sub)subsystem | kind >> >> >> ---------------------------------------------- >> >> >> 13 | multifd | correctness/races >> >> >> 8 | ram | correctness >> >> >> 8 | rdma: | general programming >> >> > >> >> > 8 rdma bugs??? ouch.. >> >> >> >> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed >> >> integer comparisons and bugs in error handling. I don't even want to >> >> look too much at it. >> >> >> >> ...hopefully this release we'll manage to resolve that situation. >> >> >> >> > >> >> >> 7 | qmp | new api bugs >> >> >> 5 | postcopy | races >> >> >> 4 | file: | leaks >> >> >> 3 | return path | races >> >> >> 3 | fd_cleanup | races >> >> >> 2 | savevm, aio/coroutines >> >> >> 1 | xbzrle, colo, dirtyrate, exec:, >> >> >> windows, iochannel, qemufile, >> >> >> arch (ppc64le) >> >> >> >> >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy, >> >> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle. >> >> >> >> >> >> My suggestion is we run per arch: >> >> >> >> >> >> "/precopy/tcp/plain" >> >> >> "/precopy/tcp/tls/psk/match", >> >> >> "/postcopy/plain" >> >> >> "/postcopy/preempt/plain" >> >> >> "/postcopy/preempt/recovery/plain" >> >> >> "/multifd/tcp/plain/cancel" >> >> >> "/multifd/tcp/uri/plain/none" >> >> > >> >> > Don't you want to still keep a few multifd / file tests? >> >> >> >> Not really, but I won't object if you want to add some more in there. To >> >> be honest, I want to get out of people's way as much as I can because >> >> having to revisit this every couple of months is stressful to me. >> > >> > I hope migration tests are not too obstructive yet so far :) - this >> > discussion is still within a context where we might add one more slow test >> > in CI, and we probably simply should make it a -m slow test. >> > >> >> >> >> My rationale for those is: >> >> >> >> "/precopy/tcp/plain": >> >> Smoke test, the most common migration >> > >> > E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt. >> > >> > And note that even if unix shares the socket iochannel with tcp, it may not >> > work the same. For example, if you remember I mentioned I was looking at >> > threadify the dest qemu receive side, i have a draft there but currently it >> > has a bug to hang a unix migration, not tcp.. >> >> Ok, let's add a unix one, no problem. >> >> > >> > So IMHO it's not easy to justify which we can simply drop, if we still want >> > to provide some good gating in CI. >> >> It's not exactly about dropping, it's about which ones need to be run at >> *every* invocation of make check (x5 because of the multiple archs) and > > Yeah, indeed we could consider reducing the number of runs, maybe. However > I do remember we used to have migration-test bugs only reproduced with some > specific distro, well.. > > Now I'm looking at the pipelines.. > > Why 5x, btw? 5x because migration-test supports 5 architectures (i386, x86_64, ppc64, aarch64, s390x), so if your build includes all of those, make check will run migration-test five times. > I saw alphine, centos, debian, fedora, opensuse, ubuntu, > cfi-x86_64, then compat-x86_64. So that's 8x? They're for the same arch > (amd64) so far. I agree we should fine tune this in CI, but this is a different discussion. We can't currently skip migration-test without skipping make check-qtest. I proposed a while ago to have a separate make check-migration, but people didn't like that, maybe we should revisit the idea. > > Maybe we can skip some indeed. > >> at every git branch push in CI (unless -o ci.skip). For gating we don't >> need all the tests. Many of them are testing the same core code with >> just a few differences at the margins. >> >> > And I won't be 100% surprised if some >> > other non-migration PR (e.g. io/) changed some slight bit that unix is >> > broken and tcp keeps working, for example, then we loose some CI benefits. >> >> IMO, these non-migration PRs are exactly what we have to worry >> about. Because migration PRs would run with -m slow and we'd catch the >> issue there. >> >> > >> >> >> >> "/precopy/tcp/tls/psk/match": >> >> Something might change in the distro regarding tls. Such as: >> >> https://gitlab.com/qemu-project/qemu/-/issues/1937 >> >> >> >> "/postcopy/plain": >> >> Smoke test for postcopy >> >> >> >> "/postcopy/preempt/plain": >> >> Just to exercise the preempt thread >> >> >> >> "/postcopy/preempt/recovery/plain": >> >> Recovery has had some issues with races in the past >> >> >> >> "/multifd/tcp/plain/cancel": >> >> The MVP of catching concurrency issues >> >> >> >> "/multifd/tcp/uri/plain/none": >> >> Smoke test for multifd >> >> >> >> All in all, these will test basic funcionality and run very often. The >> >> more tests we add to this set, the less return we get in relation to the >> >> time they take. >> > >> > This is true. We can try to discuss more on which is more important; I >> > still think something might be good to be added on top of above. >> >> Sure, just add what you think we need. > > Let's leave the detailed discussion on what to choose until the patch > phase. IIUC this can be the last we do. ok > > We've just queued your other patch to "slow down" the two time consuming > tests, we save 40s for each CI kick (total ~90s now locally, so that's 30% > cut already all acrosss!), not so bad as a start. > > Then if we can remove some out of 8 (e.g., we can choose 2-3), then if it's > 3 system runs, it'll be another 62% cut, a total of: > > 1.0 - 9/13 * 3/8 = ~75% > > So if we keep removing 5 system running it we can reduce it to only a > quarter time comparing to before for each CI run. > > Would that be good enough already for us to live for quite a few more > releases? They're so far all low hanging fruits. Reducing the runs in CI is not a low hanging fruit. We'd need to first split migration-test from check-qtest somehow. But that would mean not running migration-test at all during make check, which is worse than just reducing migration-test while keeping it in make check. > > If we want to move further (and you think we should then start with > reducing test case, rather than profiling the test cases), then feel free > to go ahead and send a patch when you're ready. But maybe you don't want > to bother after you notice that we can already shrink 75% of runtime even > without dropping anything. Your call! I don't think we're discussing total CI time at this point, so the math doesn't really add up. We're not looking into making the CI finish faster. We're looking into making migration-test finish faster. That would reduce timeouts in CI, speed-up make check and reduce the chance of random race conditions* affecting other people/staging runs. *- like the one I'm debugging at this very moment in multifd+ppc, are you aware of that btw? > >> >> > >> > There's also the other way - at some point, I still want to improve >> > migration-test run speed, and please have a look if you like too at some >> > point: so there's still chance (average is ~2sec as of now), IMHO, we don't >> > lose anything in CI but runs simply faster. >> >> My only idea, but it requires a bit of work, is to do unit testing on >> the interfaces. Anything before migration_fd_connect(). Then we could >> stop doing a full migration for those. > > What I'm thinking is some further profiling, like what Dan used to do. I > feel like we still have chance, considering what we run as guest image is > simply a loop.. so basically zero boot time logically. I started looking again today into code coverage tools to see if we can trim the tests that are redundant. But no promises, I already have a queue of stuff that's pending. > >> >> > >> >> >> >> > >> >> > IIUC some file ops can still be relevant to archs. Multifd still has one >> >> > bug that can only reproduce on arm64.. but not x86_64. I remember it's a >> >> > race condition when migration finishes, and the issue could be memory >> >> > ordering relevant, but maybe not. >> >> >> >> I'm not aware of anything. I believe the last arm64 bug we had was the >> >> threadinfo stuff[1]. If you remember what it's about, let me know. >> >> >> >> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads"). >> > >> > https://issues.redhat.com/browse/RHEL-45628 >> >> Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't >> think other tests will. Also, we don't have tests for zerocopy AFAIK. > > That'll be challenging as it requires locked_vm limit. > >> >> > >> > On RH side Bandan is looking at it, but he's during a long holidays >> > recently. >> >> Good luck to him. >> >> > >> >> >> >> > >> >> >> >> >> >> and x86 gets extra: >> >> >> >> >> >> "/precopy/unix/suspend/live" >> >> >> "/precopy/unix/suspend/notlive" >> >> >> "/dirty_ring" >> >> > >> >> > dirty ring will be disabled anyway when !x86, so probably not a major >> >> > concern. >> >> > >> >> >> >> >> >> (the other dirty_* tests are too slow) >> >> > >> >> > These are the 10 slowest tests when I run locally: >> >> > >> >> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41 >> >> > /x86_64/migration/postcopy/recovery/plain 2.43 >> >> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66 >> >> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86 >> >> > /x86_64/migration/postcopy/tls/psk 2.91 >> >> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08 >> >> > /x86_64/migration/postcopy/preempt/tls/psk 3.30 >> >> > /x86_64/migration/postcopy/recovery/tls/psk 3.81 >> >> > /x86_64/migration/vcpu_dirty_limit 13.29 >> >> > /x86_64/migration/precopy/unix/xbzrle 27.55 >> >> > >> >> > Are you aware of people using xbzrle at all? >> >> >> >> Nope. >> >> >> >> > >> >> >> >> >> >> All the rest go behind a knob that people touching migration code will >> >> >> enable. >> >> >> >> >> >> wdyt? >> >> > >> >> > Agree with the general idea, but I worry above exact list can be too small. >> >> >> >> We won't stop running the full set of tests. We can run them in our >> >> forks' CI as much as we want. There are no cases of people reporting a >> >> migration bug because their 'make check' caught something that ours >> >> didn't. >> > >> > IIUC it's hard to say - when the test is in CI maintainers can catch them >> > already before sending a formal PR. >> > >> > If the test is run by default in make check, a developer can trigger a >> > migration issue (with his/her patch applied) then one can notice it >> > introduced a bug, fix it, then post the patches. We won't know whether >> > that happened. >> >> Good point. >> >> > >> > So one thing we can do (if you think worthwhile to do it now) is we shrink >> > the default test case a lot as you proposed, then we wait and see what >> > breaks, and then we gradually add tests back when it can be used to find >> > breakages. But that'll also take time if it really can find such tests, >> > because then we'll need to debug them one by one (instead of developer / >> > maintainer looking into them with their expertise knowledge..). I'm not >> > sure whether it's worthwhile to do it now, but I don't feel strongly if we >> > can still have a reliable set of default test cases. >> >> We first need a way to enable them for the migration CI. Do we have a >> variable that CI understands that can be used to enable slow tests? > > I'm not aware of. Yes sounds like we should have one if it doesn't exist. > >> >> > >> >> >> >> Besides, the main strength of CI is to catch bugs when someone makes a >> >> code change. If people touch migration code, then we'll run it in our CI >> >> anyway. If they touch device code and that device is migrated by default >> >> then any one of the simple tests will catch the issue when it runs via >> >> the migration-compat job. If the device is not enabled by default, then >> >> no tests will catch it. >> >> >> >> The worst case scenario is they touch some code completely unrelated and >> >> their 'make check' or CI run breaks because of some race in the >> >> migration code. That's not what CI is for, that's just an annoyance for >> >> everyone. I'd rather it breaks in our hands and then we'll go fix it. >> >> >> >> > >> >> > IMHO we can definitely, at least, move the last two into slow list >> >> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run.. >> >> >> >> Agreed. I'll send a patch once I get out from under downstream stuff. >> >> >> >> > >> >> >> >> >> >> 1- allows adding devices to QEMU cmdline for migration-test >> >> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de >> >> >> >> >> >>
On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: > I don't think we're discussing total CI time at this point, so the math > doesn't really add up. We're not looking into making the CI finish > faster. We're looking into making migration-test finish faster. That > would reduce timeouts in CI, speed-up make check and reduce the chance > of random race conditions* affecting other people/staging runs. Right. The reason migration-test appears on my radar is because it is very frequently the thing that shows up as "this sometimes just fails or just times out and if you hit retry it goes away again". That might not be migration-test's fault specifically, because those retries tend to be certain CI configs (s390, the i686-tci one), and I have some theories about what might be causing it (e.g. build system runs 4 migration-tests in parallel, which means 8 QEMU processes which is too many for the number of host CPUs). But right now I look at CI job failures and my reaction is "oh, it's the migration-test failing yet again" :-( For some examples from this week: https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 -- PMM
On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: > > I don't think we're discussing total CI time at this point, so the math > > doesn't really add up. We're not looking into making the CI finish > > faster. We're looking into making migration-test finish faster. That > > would reduce timeouts in CI, speed-up make check and reduce the chance > > of random race conditions* affecting other people/staging runs. > > Right. The reason migration-test appears on my radar is because > it is very frequently the thing that shows up as "this sometimes > just fails or just times out and if you hit retry it goes away > again". That might not be migration-test's fault specifically, > because those retries tend to be certain CI configs (s390, > the i686-tci one), and I have some theories about what might be > causing it (e.g. build system runs 4 migration-tests in parallel, > which means 8 QEMU processes which is too many for the number > of host CPUs). But right now I look at CI job failures and my reaction > is "oh, it's the migration-test failing yet again" :-( > > For some examples from this week: > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 Ah right, the TIMEOUT is unfortunate, especially if tests can be run in parallel. It indeed sounds like no good way to finally solve.. I don't also see how speeding up / reducing tests in migration test would help, as that's (from some degree..) is the same as tuning the timeout value bigger. When the tests are less it'll fit into 480s window, but maybe it's too quick now we wonder whether we should shrink it to e.g. 90s, but then it can timeout again when on a busy host with less capability of concurrency. But indeed there're two ERRORs ([1,2] above).. I collected some more info here before the log expires: =================================8<================================ *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT >>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― stderr: warning: fd: migration to a file is deprecated. Use file: instead. warning: fd: migration to a file is deprecated. Use file: instead. ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) (test program exited with status code -6) TAP parsing error: Too few tests run (expected 53, got 39) ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― # Start of plain tests # Running /i386/migration/multifd/tcp/plain/cancel # Using machine type: pc-i440fx-9.2 # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest ----------------------------------- stderr ----------------------------------- warning: fd: migration to a file is deprecated. Use file: instead. warning: fd: migration to a file is deprecated. Use file: instead. ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT >>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― stderr: qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. warning: fd: migration to a file is deprecated. Use file: instead. warning: fd: migration to a file is deprecated. Use file: instead. ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) (test program exited with status code -6) TAP parsing error: Too few tests run (expected 61, got 47) ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― # Start of plain tests # Running /ppc64/migration/multifd/tcp/plain/cancel # Using machine type: pseries-9.2 # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest ----------------------------------- stderr ----------------------------------- qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. warning: fd: migration to a file is deprecated. Use file: instead. warning: fd: migration to a file is deprecated. Use file: instead. ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) (test program exited with status code -6) =================================8<================================ So.. it's the same test (multifd/tcp/plain/cancel) that is failing on different host / arch being tested. What is more weird is the two failures are different, the 2nd failure throw out a TLS error even though the test doesn't yet have tls involved. Fabiano, is this the issue you're looking at? Peter, do you think it'll be helpful if we temporarily mark this test as "slow" too so it's not run in CI (so we still run it ourselves when prepare migration PR, with the hope that it can reproduce)? Thanks, -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: >> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: >> > I don't think we're discussing total CI time at this point, so the math >> > doesn't really add up. We're not looking into making the CI finish >> > faster. We're looking into making migration-test finish faster. That >> > would reduce timeouts in CI, speed-up make check and reduce the chance >> > of random race conditions* affecting other people/staging runs. >> >> Right. The reason migration-test appears on my radar is because >> it is very frequently the thing that shows up as "this sometimes >> just fails or just times out and if you hit retry it goes away >> again". That might not be migration-test's fault specifically, >> because those retries tend to be certain CI configs (s390, >> the i686-tci one), and I have some theories about what might be >> causing it (e.g. build system runs 4 migration-tests in parallel, >> which means 8 QEMU processes which is too many for the number >> of host CPUs). But right now I look at CI job failures and my reaction >> is "oh, it's the migration-test failing yet again" :-( >> >> For some examples from this week: >> >> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 > > Ah right, the TIMEOUT is unfortunate, especially if tests can be run in > parallel. It indeed sounds like no good way to finally solve.. I don't > also see how speeding up / reducing tests in migration test would help, as > that's (from some degree..) is the same as tuning the timeout value bigger. > When the tests are less it'll fit into 480s window, but maybe it's too > quick now we wonder whether we should shrink it to e.g. 90s, but then it > can timeout again when on a busy host with less capability of concurrency. > > But indeed there're two ERRORs ([1,2] above).. I collected some more info > here before the log expires: > > =================================8<================================ > > *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host > > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 > > 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT >>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― > stderr: > warning: fd: migration to a file is deprecated. Use file: instead. > warning: fd: migration to a file is deprecated. Use file: instead. > ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > (test program exited with status code -6) > TAP parsing error: Too few tests run (expected 53, got 39) > ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > > # Start of plain tests > # Running /i386/migration/multifd/tcp/plain/cancel > # Using machine type: pc-i440fx-9.2 > # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest > # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest > ----------------------------------- stderr ----------------------------------- > warning: fd: migration to a file is deprecated. Use file: instead. > warning: fd: migration to a file is deprecated. Use file: instead. > ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > > *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 > > 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT >>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― > stderr: > qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. > warning: fd: migration to a file is deprecated. Use file: instead. > warning: fd: migration to a file is deprecated. Use file: instead. > ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > (test program exited with status code -6) > TAP parsing error: Too few tests run (expected 61, got 47) > ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > > # Start of plain tests > # Running /ppc64/migration/multifd/tcp/plain/cancel > # Using machine type: pseries-9.2 > # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > ----------------------------------- stderr ----------------------------------- > qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. > warning: fd: migration to a file is deprecated. Use file: instead. > warning: fd: migration to a file is deprecated. Use file: instead. > ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > > (test program exited with status code -6) > =================================8<================================ > > So.. it's the same test (multifd/tcp/plain/cancel) that is failing on > different host / arch being tested. What is more weird is the two failures > are different, the 2nd failure throw out a TLS error even though the test > doesn't yet have tls involved. I think that's just a parallel test being cancelled prematurely, either due to the crash or due to the timeout. > > Fabiano, is this the issue you're looking at? Yes. I can reproduce locally by running 2 processes in parallel: 1 loop with make -j$(nproc) check and another loop with tcp/plain/cancel. It takes ~1h to hit. I've seen crashes with ppc64, s390 and aarch64. > Peter, do you think it'll be helpful if we temporarily mark this test as > "slow" too so it's not run in CI (so we still run it ourselves when prepare > migration PR, with the hope that it can reproduce)? > > Thanks,
Fabiano Rosas <farosas@suse.de> writes: > Peter Xu <peterx@redhat.com> writes: > >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: >>> > I don't think we're discussing total CI time at this point, so the math >>> > doesn't really add up. We're not looking into making the CI finish >>> > faster. We're looking into making migration-test finish faster. That >>> > would reduce timeouts in CI, speed-up make check and reduce the chance >>> > of random race conditions* affecting other people/staging runs. >>> >>> Right. The reason migration-test appears on my radar is because >>> it is very frequently the thing that shows up as "this sometimes >>> just fails or just times out and if you hit retry it goes away >>> again". That might not be migration-test's fault specifically, >>> because those retries tend to be certain CI configs (s390, >>> the i686-tci one), and I have some theories about what might be >>> causing it (e.g. build system runs 4 migration-tests in parallel, >>> which means 8 QEMU processes which is too many for the number >>> of host CPUs). But right now I look at CI job failures and my reaction >>> is "oh, it's the migration-test failing yet again" :-( >>> >>> For some examples from this week: >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 >> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in >> parallel. It indeed sounds like no good way to finally solve.. I don't >> also see how speeding up / reducing tests in migration test would help, as >> that's (from some degree..) is the same as tuning the timeout value bigger. >> When the tests are less it'll fit into 480s window, but maybe it's too >> quick now we wonder whether we should shrink it to e.g. 90s, but then it >> can timeout again when on a busy host with less capability of concurrency. >> >> But indeed there're two ERRORs ([1,2] above).. I collected some more info >> here before the log expires: >> >> =================================8<================================ >> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host >> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 >> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >> stderr: >> warning: fd: migration to a file is deprecated. Use file: instead. >> warning: fd: migration to a file is deprecated. Use file: instead. >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> (test program exited with status code -6) >> TAP parsing error: Too few tests run (expected 53, got 39) >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >> >> # Start of plain tests >> # Running /i386/migration/multifd/tcp/plain/cancel >> # Using machine type: pc-i440fx-9.2 >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >> ----------------------------------- stderr ----------------------------------- >> warning: fd: migration to a file is deprecated. Use file: instead. >> warning: fd: migration to a file is deprecated. Use file: instead. >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host >> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 >> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >> stderr: >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >> warning: fd: migration to a file is deprecated. Use file: instead. >> warning: fd: migration to a file is deprecated. Use file: instead. >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> (test program exited with status code -6) >> TAP parsing error: Too few tests run (expected 61, got 47) >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >> >> # Start of plain tests >> # Running /ppc64/migration/multifd/tcp/plain/cancel >> # Using machine type: pseries-9.2 >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >> ----------------------------------- stderr ----------------------------------- >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >> warning: fd: migration to a file is deprecated. Use file: instead. >> warning: fd: migration to a file is deprecated. Use file: instead. >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >> (test program exited with status code -6) >> =================================8<================================ >> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on >> different host / arch being tested. What is more weird is the two failures >> are different, the 2nd failure throw out a TLS error even though the test >> doesn't yet have tls involved. > > I think that's just a parallel test being cancelled prematurely, either > due to the crash or due to the timeout. > >> >> Fabiano, is this the issue you're looking at? > > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop > with make -j$(nproc) check and another loop with tcp/plain/cancel. It > takes ~1h to hit. I've seen crashes with ppc64, s390 and > aarch64. > Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults"), the multifd code started using the rb->receivedmap bitmap, which belongs to the ram code and is initialized and *freed* from the ram SaveVMHandlers. process_incoming_migration_co() ... qemu_loadvm_state() multifd_nocomp_recv() qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset() rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap) ... migration_incoming_state_destroy() multifd_recv_cleanup() multifd_recv_terminate_threads(NULL) Multifd threads are live until migration_incoming_state_destroy(), which is called some time later. >> Peter, do you think it'll be helpful if we temporarily mark this test as >> "slow" too so it's not run in CI (so we still run it ourselves when prepare >> migration PR, with the hope that it can reproduce)? >> >> Thanks,
On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote: > Fabiano Rosas <farosas@suse.de> writes: > > > Peter Xu <peterx@redhat.com> writes: > > > >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: > >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: > >>> > I don't think we're discussing total CI time at this point, so the math > >>> > doesn't really add up. We're not looking into making the CI finish > >>> > faster. We're looking into making migration-test finish faster. That > >>> > would reduce timeouts in CI, speed-up make check and reduce the chance > >>> > of random race conditions* affecting other people/staging runs. > >>> > >>> Right. The reason migration-test appears on my radar is because > >>> it is very frequently the thing that shows up as "this sometimes > >>> just fails or just times out and if you hit retry it goes away > >>> again". That might not be migration-test's fault specifically, > >>> because those retries tend to be certain CI configs (s390, > >>> the i686-tci one), and I have some theories about what might be > >>> causing it (e.g. build system runs 4 migration-tests in parallel, > >>> which means 8 QEMU processes which is too many for the number > >>> of host CPUs). But right now I look at CI job failures and my reaction > >>> is "oh, it's the migration-test failing yet again" :-( > >>> > >>> For some examples from this week: > >>> > >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] > >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] > >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 > >> > >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in > >> parallel. It indeed sounds like no good way to finally solve.. I don't > >> also see how speeding up / reducing tests in migration test would help, as > >> that's (from some degree..) is the same as tuning the timeout value bigger. > >> When the tests are less it'll fit into 480s window, but maybe it's too > >> quick now we wonder whether we should shrink it to e.g. 90s, but then it > >> can timeout again when on a busy host with less capability of concurrency. > >> > >> But indeed there're two ERRORs ([1,2] above).. I collected some more info > >> here before the log expires: > >> > >> =================================8<================================ > >> > >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host > >> > >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 > >> > >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT > >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― > >> stderr: > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >> (test program exited with status code -6) > >> TAP parsing error: Too few tests run (expected 53, got 39) > >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > >> > >> # Start of plain tests > >> # Running /i386/migration/multifd/tcp/plain/cancel > >> # Using machine type: pc-i440fx-9.2 > >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest > >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest > >> ----------------------------------- stderr ----------------------------------- > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >> > >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host > >> > >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 > >> > >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT > >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― > >> stderr: > >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >> (test program exited with status code -6) > >> TAP parsing error: Too few tests run (expected 61, got 47) > >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > >> > >> # Start of plain tests > >> # Running /ppc64/migration/multifd/tcp/plain/cancel > >> # Using machine type: pseries-9.2 > >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > >> ----------------------------------- stderr ----------------------------------- > >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> warning: fd: migration to a file is deprecated. Use file: instead. > >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >> > >> (test program exited with status code -6) > >> =================================8<================================ > >> > >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on > >> different host / arch being tested. What is more weird is the two failures > >> are different, the 2nd failure throw out a TLS error even though the test > >> doesn't yet have tls involved. > > > > I think that's just a parallel test being cancelled prematurely, either > > due to the crash or due to the timeout. > > > >> > >> Fabiano, is this the issue you're looking at? > > > > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop > > with make -j$(nproc) check and another loop with tcp/plain/cancel. It > > takes ~1h to hit. I've seen crashes with ppc64, s390 and > > aarch64. > > > > Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve > zero page causing multiple page faults"), the multifd code started using > the rb->receivedmap bitmap, which belongs to the ram code and is > initialized and *freed* from the ram SaveVMHandlers. > > process_incoming_migration_co() ... > qemu_loadvm_state() multifd_nocomp_recv() > qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset() > rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap) > ... > migration_incoming_state_destroy() > multifd_recv_cleanup() > multifd_recv_terminate_threads(NULL) > > Multifd threads are live until migration_incoming_state_destroy(), which > is called some time later. Thanks for the debugging. Hmm I would expect loadvm should wait until all ram is received somehow.. And when looking, I also found we're ambiguous on the cleanups now.. probably a separate issue that not yet cause crashes. I meant we even don't have a consistent order to clean these in precopy / postcopy, as: We have this: qemu_loadvm_state qemu_loadvm_state_cleanup <------------- [1] migration_bh_schedule(process_incoming_migration_bh, mis); (then in bh...) process_incoming_migration_bh migration_incoming_state_destroy <------------- [2] But then: postcopy_ram_listen_thread migration_incoming_state_destroy <------------- [2] qemu_loadvm_state_cleanup <------------- [1] I think it makes more sense to do [2] after [1], so maybe postcopy needs changing too.. > > >> Peter, do you think it'll be helpful if we temporarily mark this test as > >> "slow" too so it's not run in CI (so we still run it ourselves when prepare > >> migration PR, with the hope that it can reproduce)? > >> > >> Thanks, > -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote: >> Fabiano Rosas <farosas@suse.de> writes: >> >> > Peter Xu <peterx@redhat.com> writes: >> > >> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: >> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: >> >>> > I don't think we're discussing total CI time at this point, so the math >> >>> > doesn't really add up. We're not looking into making the CI finish >> >>> > faster. We're looking into making migration-test finish faster. That >> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance >> >>> > of random race conditions* affecting other people/staging runs. >> >>> >> >>> Right. The reason migration-test appears on my radar is because >> >>> it is very frequently the thing that shows up as "this sometimes >> >>> just fails or just times out and if you hit retry it goes away >> >>> again". That might not be migration-test's fault specifically, >> >>> because those retries tend to be certain CI configs (s390, >> >>> the i686-tci one), and I have some theories about what might be >> >>> causing it (e.g. build system runs 4 migration-tests in parallel, >> >>> which means 8 QEMU processes which is too many for the number >> >>> of host CPUs). But right now I look at CI job failures and my reaction >> >>> is "oh, it's the migration-test failing yet again" :-( >> >>> >> >>> For some examples from this week: >> >>> >> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 >> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] >> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] >> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 >> >> >> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in >> >> parallel. It indeed sounds like no good way to finally solve.. I don't >> >> also see how speeding up / reducing tests in migration test would help, as >> >> that's (from some degree..) is the same as tuning the timeout value bigger. >> >> When the tests are less it'll fit into 480s window, but maybe it's too >> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it >> >> can timeout again when on a busy host with less capability of concurrency. >> >> >> >> But indeed there're two ERRORs ([1,2] above).. I collected some more info >> >> here before the log expires: >> >> >> >> =================================8<================================ >> >> >> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host >> >> >> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 >> >> >> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT >> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >> >> stderr: >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >> (test program exited with status code -6) >> >> TAP parsing error: Too few tests run (expected 53, got 39) >> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >> >> >> >> # Start of plain tests >> >> # Running /i386/migration/multifd/tcp/plain/cancel >> >> # Using machine type: pc-i440fx-9.2 >> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >> >> ----------------------------------- stderr ----------------------------------- >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >> >> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host >> >> >> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 >> >> >> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT >> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >> >> stderr: >> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >> (test program exited with status code -6) >> >> TAP parsing error: Too few tests run (expected 61, got 47) >> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >> >> >> >> # Start of plain tests >> >> # Running /ppc64/migration/multifd/tcp/plain/cancel >> >> # Using machine type: pseries-9.2 >> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >> >> ----------------------------------- stderr ----------------------------------- >> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >> >> >> (test program exited with status code -6) >> >> =================================8<================================ >> >> >> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on >> >> different host / arch being tested. What is more weird is the two failures >> >> are different, the 2nd failure throw out a TLS error even though the test >> >> doesn't yet have tls involved. >> > >> > I think that's just a parallel test being cancelled prematurely, either >> > due to the crash or due to the timeout. >> > >> >> >> >> Fabiano, is this the issue you're looking at? >> > >> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop >> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It >> > takes ~1h to hit. I've seen crashes with ppc64, s390 and >> > aarch64. >> > >> >> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve >> zero page causing multiple page faults"), the multifd code started using >> the rb->receivedmap bitmap, which belongs to the ram code and is >> initialized and *freed* from the ram SaveVMHandlers. >> >> process_incoming_migration_co() ... >> qemu_loadvm_state() multifd_nocomp_recv() >> qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset() >> rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap) >> ... >> migration_incoming_state_destroy() >> multifd_recv_cleanup() >> multifd_recv_terminate_threads(NULL) >> >> Multifd threads are live until migration_incoming_state_destroy(), which >> is called some time later. > > Thanks for the debugging. Hmm I would expect loadvm should wait until all > ram is received somehow.. Looks like a similar issue as when we didn't have the multifd_send sync working correctly and ram code would run and do cleanup. > > And when looking, I also found we're ambiguous on the cleanups > now.. probably a separate issue that not yet cause crashes. I meant we > even don't have a consistent order to clean these in precopy / postcopy, > as: > > We have this: > > qemu_loadvm_state > qemu_loadvm_state_cleanup <------------- [1] > migration_bh_schedule(process_incoming_migration_bh, mis); > (then in bh...) > process_incoming_migration_bh > migration_incoming_state_destroy <------------- [2] > > But then: > > postcopy_ram_listen_thread > migration_incoming_state_destroy <------------- [2] > qemu_loadvm_state_cleanup <------------- [1] > > I think it makes more sense to do [2] after [1], so maybe postcopy needs > changing too.. Yes, ram_load_cleanup doesn't do much: static int ram_load_cleanup(void *opaque) { RAMBlock *rb; RAMBLOCK_FOREACH_NOT_IGNORED(rb) { qemu_ram_block_writeback(rb); } xbzrle_load_cleanup(); return 0; } We can probably change the order just fine. For the bug I'm now trying to add a helper to init/clean the receivedmap and only call it from code that uses it (postcopy, multifd_recv), instead of always init/clean along with ram_load_setup/cleanup. > >> >> >> Peter, do you think it'll be helpful if we temporarily mark this test as >> >> "slow" too so it's not run in CI (so we still run it ourselves when prepare >> >> migration PR, with the hope that it can reproduce)? >> >> >> >> Thanks, >>
Fabiano Rosas <farosas@suse.de> writes: > Peter Xu <peterx@redhat.com> writes: > >> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote: >>> Fabiano Rosas <farosas@suse.de> writes: >>> >>> > Peter Xu <peterx@redhat.com> writes: >>> > >>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: >>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: >>> >>> > I don't think we're discussing total CI time at this point, so the math >>> >>> > doesn't really add up. We're not looking into making the CI finish >>> >>> > faster. We're looking into making migration-test finish faster. That >>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance >>> >>> > of random race conditions* affecting other people/staging runs. >>> >>> >>> >>> Right. The reason migration-test appears on my radar is because >>> >>> it is very frequently the thing that shows up as "this sometimes >>> >>> just fails or just times out and if you hit retry it goes away >>> >>> again". That might not be migration-test's fault specifically, >>> >>> because those retries tend to be certain CI configs (s390, >>> >>> the i686-tci one), and I have some theories about what might be >>> >>> causing it (e.g. build system runs 4 migration-tests in parallel, >>> >>> which means 8 QEMU processes which is too many for the number >>> >>> of host CPUs). But right now I look at CI job failures and my reaction >>> >>> is "oh, it's the migration-test failing yet again" :-( >>> >>> >>> >>> For some examples from this week: >>> >>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 >>> >> >>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in >>> >> parallel. It indeed sounds like no good way to finally solve.. I don't >>> >> also see how speeding up / reducing tests in migration test would help, as >>> >> that's (from some degree..) is the same as tuning the timeout value bigger. >>> >> When the tests are less it'll fit into 480s window, but maybe it's too >>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it >>> >> can timeout again when on a busy host with less capability of concurrency. >>> >> >>> >> But indeed there're two ERRORs ([1,2] above).. I collected some more info >>> >> here before the log expires: >>> >> >>> >> =================================8<================================ >>> >> >>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host >>> >> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 >>> >> >>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >>> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >>> >> stderr: >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >>> >> (test program exited with status code -6) >>> >> TAP parsing error: Too few tests run (expected 53, got 39) >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >>> >> >>> >> # Start of plain tests >>> >> # Running /i386/migration/multifd/tcp/plain/cancel >>> >> # Using machine type: pc-i440fx-9.2 >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >>> >> ----------------------------------- stderr ----------------------------------- >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >>> >> >>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host >>> >> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 >>> >> >>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT >>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >>> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >>> >> stderr: >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >>> >> (test program exited with status code -6) >>> >> TAP parsing error: Too few tests run (expected 61, got 47) >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >>> >> >>> >> # Start of plain tests >>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel >>> >> # Using machine type: pseries-9.2 >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >>> >> ----------------------------------- stderr ----------------------------------- >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >>> >> >>> >> (test program exited with status code -6) >>> >> =================================8<================================ >>> >> >>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on >>> >> different host / arch being tested. What is more weird is the two failures >>> >> are different, the 2nd failure throw out a TLS error even though the test >>> >> doesn't yet have tls involved. >>> > >>> > I think that's just a parallel test being cancelled prematurely, either >>> > due to the crash or due to the timeout. >>> > >>> >> >>> >> Fabiano, is this the issue you're looking at? >>> > >>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop >>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It >>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and >>> > aarch64. >>> > >>> >>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve >>> zero page causing multiple page faults"), the multifd code started using >>> the rb->receivedmap bitmap, which belongs to the ram code and is >>> initialized and *freed* from the ram SaveVMHandlers. >>> >>> process_incoming_migration_co() ... >>> qemu_loadvm_state() multifd_nocomp_recv() >>> qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset() >>> rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap) >>> ... >>> migration_incoming_state_destroy() >>> multifd_recv_cleanup() >>> multifd_recv_terminate_threads(NULL) >>> >>> Multifd threads are live until migration_incoming_state_destroy(), which >>> is called some time later. >> >> Thanks for the debugging. Hmm I would expect loadvm should wait until all >> ram is received somehow.. > > Looks like a similar issue as when we didn't have the multifd_send sync > working correctly and ram code would run and do cleanup. Btw, this is hard to debug, but I bet what's happening is that the ram_load code itself is exiting due to qemufile error. So there wouldn't be a way to make it wait for multifd.
On Fri, Sep 13, 2024 at 12:17:40PM -0300, Fabiano Rosas wrote: > Fabiano Rosas <farosas@suse.de> writes: > > > Peter Xu <peterx@redhat.com> writes: > > > >> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote: > >>> Fabiano Rosas <farosas@suse.de> writes: > >>> > >>> > Peter Xu <peterx@redhat.com> writes: > >>> > > >>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: > >>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: > >>> >>> > I don't think we're discussing total CI time at this point, so the math > >>> >>> > doesn't really add up. We're not looking into making the CI finish > >>> >>> > faster. We're looking into making migration-test finish faster. That > >>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance > >>> >>> > of random race conditions* affecting other people/staging runs. > >>> >>> > >>> >>> Right. The reason migration-test appears on my radar is because > >>> >>> it is very frequently the thing that shows up as "this sometimes > >>> >>> just fails or just times out and if you hit retry it goes away > >>> >>> again". That might not be migration-test's fault specifically, > >>> >>> because those retries tend to be certain CI configs (s390, > >>> >>> the i686-tci one), and I have some theories about what might be > >>> >>> causing it (e.g. build system runs 4 migration-tests in parallel, > >>> >>> which means 8 QEMU processes which is too many for the number > >>> >>> of host CPUs). But right now I look at CI job failures and my reaction > >>> >>> is "oh, it's the migration-test failing yet again" :-( > >>> >>> > >>> >>> For some examples from this week: > >>> >>> > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 > >>> >> > >>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in > >>> >> parallel. It indeed sounds like no good way to finally solve.. I don't > >>> >> also see how speeding up / reducing tests in migration test would help, as > >>> >> that's (from some degree..) is the same as tuning the timeout value bigger. > >>> >> When the tests are less it'll fit into 480s window, but maybe it's too > >>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it > >>> >> can timeout again when on a busy host with less capability of concurrency. > >>> >> > >>> >> But indeed there're two ERRORs ([1,2] above).. I collected some more info > >>> >> here before the log expires: > >>> >> > >>> >> =================================8<================================ > >>> >> > >>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host > >>> >> > >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 > >>> >> > >>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT > >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > >>> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― > >>> >> stderr: > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >>> >> (test program exited with status code -6) > >>> >> TAP parsing error: Too few tests run (expected 53, got 39) > >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > >>> >> > >>> >> # Start of plain tests > >>> >> # Running /i386/migration/multifd/tcp/plain/cancel > >>> >> # Using machine type: pc-i440fx-9.2 > >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest > >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest > >>> >> ----------------------------------- stderr ----------------------------------- > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >>> >> > >>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host > >>> >> > >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 > >>> >> > >>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT > >>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > >>> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― > >>> >> stderr: > >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >>> >> (test program exited with status code -6) > >>> >> TAP parsing error: Too few tests run (expected 61, got 47) > >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > >>> >> > >>> >> # Start of plain tests > >>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel > >>> >> # Using machine type: pseries-9.2 > >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > >>> >> ----------------------------------- stderr ----------------------------------- > >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >>> >> > >>> >> (test program exited with status code -6) > >>> >> =================================8<================================ > >>> >> > >>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on > >>> >> different host / arch being tested. What is more weird is the two failures > >>> >> are different, the 2nd failure throw out a TLS error even though the test > >>> >> doesn't yet have tls involved. > >>> > > >>> > I think that's just a parallel test being cancelled prematurely, either > >>> > due to the crash or due to the timeout. > >>> > > >>> >> > >>> >> Fabiano, is this the issue you're looking at? > >>> > > >>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop > >>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It > >>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and > >>> > aarch64. > >>> > > >>> > >>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve > >>> zero page causing multiple page faults"), the multifd code started using > >>> the rb->receivedmap bitmap, which belongs to the ram code and is > >>> initialized and *freed* from the ram SaveVMHandlers. > >>> > >>> process_incoming_migration_co() ... > >>> qemu_loadvm_state() multifd_nocomp_recv() > >>> qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset() > >>> rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap) > >>> ... > >>> migration_incoming_state_destroy() > >>> multifd_recv_cleanup() > >>> multifd_recv_terminate_threads(NULL) > >>> > >>> Multifd threads are live until migration_incoming_state_destroy(), which > >>> is called some time later. > >> > >> Thanks for the debugging. Hmm I would expect loadvm should wait until all > >> ram is received somehow.. > > > > Looks like a similar issue as when we didn't have the multifd_send sync > > working correctly and ram code would run and do cleanup. > > Btw, this is hard to debug, but I bet what's happening is that the > ram_load code itself is exiting due to qemufile error. So there wouldn't > be a way to make it wait for multifd. One more thing is I remember one of the error is not the crash but some TLS disconnection error. I wonder which one you can reproduce and why that TLS code can got kicked off in the multifd cancel test. Perhaps the memory was simply corrupted around, so bitmap ops can write to some other memory? -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Fri, Sep 13, 2024 at 12:17:40PM -0300, Fabiano Rosas wrote: >> Fabiano Rosas <farosas@suse.de> writes: >> >> > Peter Xu <peterx@redhat.com> writes: >> > >> >> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote: >> >>> Fabiano Rosas <farosas@suse.de> writes: >> >>> >> >>> > Peter Xu <peterx@redhat.com> writes: >> >>> > >> >>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: >> >>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: >> >>> >>> > I don't think we're discussing total CI time at this point, so the math >> >>> >>> > doesn't really add up. We're not looking into making the CI finish >> >>> >>> > faster. We're looking into making migration-test finish faster. That >> >>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance >> >>> >>> > of random race conditions* affecting other people/staging runs. >> >>> >>> >> >>> >>> Right. The reason migration-test appears on my radar is because >> >>> >>> it is very frequently the thing that shows up as "this sometimes >> >>> >>> just fails or just times out and if you hit retry it goes away >> >>> >>> again". That might not be migration-test's fault specifically, >> >>> >>> because those retries tend to be certain CI configs (s390, >> >>> >>> the i686-tci one), and I have some theories about what might be >> >>> >>> causing it (e.g. build system runs 4 migration-tests in parallel, >> >>> >>> which means 8 QEMU processes which is too many for the number >> >>> >>> of host CPUs). But right now I look at CI job failures and my reaction >> >>> >>> is "oh, it's the migration-test failing yet again" :-( >> >>> >>> >> >>> >>> For some examples from this week: >> >>> >>> >> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 >> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] >> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] >> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 >> >>> >> >> >>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in >> >>> >> parallel. It indeed sounds like no good way to finally solve.. I don't >> >>> >> also see how speeding up / reducing tests in migration test would help, as >> >>> >> that's (from some degree..) is the same as tuning the timeout value bigger. >> >>> >> When the tests are less it'll fit into 480s window, but maybe it's too >> >>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it >> >>> >> can timeout again when on a busy host with less capability of concurrency. >> >>> >> >> >>> >> But indeed there're two ERRORs ([1,2] above).. I collected some more info >> >>> >> here before the log expires: >> >>> >> >> >>> >> =================================8<================================ >> >>> >> >> >>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host >> >>> >> >> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 >> >>> >> >> >>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 144.32s killed by signal 6 SIGABRT >> >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >> >>> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >> >>> >> stderr: >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >>> >> (test program exited with status code -6) >> >>> >> TAP parsing error: Too few tests run (expected 53, got 39) >> >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >> >>> >> >> >>> >> # Start of plain tests >> >>> >> # Running /i386/migration/multifd/tcp/plain/cancel >> >>> >> # Using machine type: pc-i440fx-9.2 >> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel qtest >> >>> >> ----------------------------------- stderr ----------------------------------- >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >>> >> >> >>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host >> >>> >> >> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 >> >>> >> >> >>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test ERROR 381.00s killed by signal 6 SIGABRT >> >>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k >> >>> >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >> >>> >> stderr: >> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >>> >> (test program exited with status code -6) >> >>> >> TAP parsing error: Too few tests run (expected 61, got 47) >> >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >> >>> >> >> >>> >> # Start of plain tests >> >>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel >> >>> >> # Using machine type: pseries-9.2 >> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest >> >>> >> ----------------------------------- stderr ----------------------------------- >> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated. >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> warning: fd: migration to a file is deprecated. Use file: instead. >> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >>> >> >> >>> >> (test program exited with status code -6) >> >>> >> =================================8<================================ >> >>> >> >> >>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on >> >>> >> different host / arch being tested. What is more weird is the two failures >> >>> >> are different, the 2nd failure throw out a TLS error even though the test >> >>> >> doesn't yet have tls involved. >> >>> > >> >>> > I think that's just a parallel test being cancelled prematurely, either >> >>> > due to the crash or due to the timeout. >> >>> > >> >>> >> >> >>> >> Fabiano, is this the issue you're looking at? >> >>> > >> >>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop >> >>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It >> >>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and >> >>> > aarch64. >> >>> > >> >>> >> >>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve >> >>> zero page causing multiple page faults"), the multifd code started using >> >>> the rb->receivedmap bitmap, which belongs to the ram code and is >> >>> initialized and *freed* from the ram SaveVMHandlers. >> >>> >> >>> process_incoming_migration_co() ... >> >>> qemu_loadvm_state() multifd_nocomp_recv() >> >>> qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset() >> >>> rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap) >> >>> ... >> >>> migration_incoming_state_destroy() >> >>> multifd_recv_cleanup() >> >>> multifd_recv_terminate_threads(NULL) >> >>> >> >>> Multifd threads are live until migration_incoming_state_destroy(), which >> >>> is called some time later. >> >> >> >> Thanks for the debugging. Hmm I would expect loadvm should wait until all >> >> ram is received somehow.. >> > >> > Looks like a similar issue as when we didn't have the multifd_send sync >> > working correctly and ram code would run and do cleanup. >> >> Btw, this is hard to debug, but I bet what's happening is that the >> ram_load code itself is exiting due to qemufile error. So there wouldn't >> be a way to make it wait for multifd. > > One more thing is I remember one of the error is not the crash but some TLS > disconnection error. I wonder which one you can reproduce and why that TLS > code can got kicked off in the multifd cancel test. Perhaps the memory was > simply corrupted around, so bitmap ops can write to some other memory? I haven't seen that error. I said in another email that I think that is due to the timeout cancelling the tests forcefully.
On Thu, 12 Sept 2024 at 16:09, Peter Xu <peterx@redhat.com> wrote: > > On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: > > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: > > > I don't think we're discussing total CI time at this point, so the math > > > doesn't really add up. We're not looking into making the CI finish > > > faster. We're looking into making migration-test finish faster. That > > > would reduce timeouts in CI, speed-up make check and reduce the chance > > > of random race conditions* affecting other people/staging runs. > > > > Right. The reason migration-test appears on my radar is because > > it is very frequently the thing that shows up as "this sometimes > > just fails or just times out and if you hit retry it goes away > > again". That might not be migration-test's fault specifically, > > because those retries tend to be certain CI configs (s390, > > the i686-tci one), and I have some theories about what might be > > causing it (e.g. build system runs 4 migration-tests in parallel, > > which means 8 QEMU processes which is too many for the number > > of host CPUs). But right now I look at CI job failures and my reaction > > is "oh, it's the migration-test failing yet again" :-( > > > > For some examples from this week: > > > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 > > Ah right, the TIMEOUT is unfortunate, especially if tests can be run in > parallel. It indeed sounds like no good way to finally solve.. I don't > also see how speeding up / reducing tests in migration test would help, as > that's (from some degree..) is the same as tuning the timeout value bigger. > When the tests are less it'll fit into 480s window, but maybe it's too > quick now we wonder whether we should shrink it to e.g. 90s, but then it > can timeout again when on a busy host with less capability of concurrency. For the TIMEOUT part on cross-i686-tci I plan to try this patch: https://patchew.org/QEMU/20240912151003.2045031-1-peter.maydell@linaro.org/ which makes 'make check' single-threaded; that will help to see if the parallelism is a problem. (If it is then we might want to do a more generalised approach rather than just for that one CI job.) > But indeed there're two ERRORs ([1,2] above).. I collected some more info > here before the log expires: > So.. it's the same test (multifd/tcp/plain/cancel) that is failing on > different host / arch being tested. What is more weird is the two failures > are different, the 2nd failure throw out a TLS error even though the test > doesn't yet have tls involved. > > Fabiano, is this the issue you're looking at? > > Peter, do you think it'll be helpful if we temporarily mark this test as > "slow" too so it's not run in CI (so we still run it ourselves when prepare > migration PR, with the hope that it can reproduce)? If you think that specific test is flaky then I think that's probably a good idea. As usual with this kind of thing, probably best to have a comment next to the test noting why and with a URL to a gitlab issue for it, so we don't forget why we disabled the test. -- PMM
On Thu, Sep 12, 2024 at 04:14:20PM +0100, Peter Maydell wrote: > On Thu, 12 Sept 2024 at 16:09, Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: > > > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: > > > > I don't think we're discussing total CI time at this point, so the math > > > > doesn't really add up. We're not looking into making the CI finish > > > > faster. We're looking into making migration-test finish faster. That > > > > would reduce timeouts in CI, speed-up make check and reduce the chance > > > > of random race conditions* affecting other people/staging runs. > > > > > > Right. The reason migration-test appears on my radar is because > > > it is very frequently the thing that shows up as "this sometimes > > > just fails or just times out and if you hit retry it goes away > > > again". That might not be migration-test's fault specifically, > > > because those retries tend to be certain CI configs (s390, > > > the i686-tci one), and I have some theories about what might be > > > causing it (e.g. build system runs 4 migration-tests in parallel, > > > which means 8 QEMU processes which is too many for the number > > > of host CPUs). But right now I look at CI job failures and my reaction > > > is "oh, it's the migration-test failing yet again" :-( > > > > > > For some examples from this week: > > > > > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > > > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] > > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] > > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 > > > > Ah right, the TIMEOUT is unfortunate, especially if tests can be run in > > parallel. It indeed sounds like no good way to finally solve.. I don't > > also see how speeding up / reducing tests in migration test would help, as > > that's (from some degree..) is the same as tuning the timeout value bigger. > > When the tests are less it'll fit into 480s window, but maybe it's too > > quick now we wonder whether we should shrink it to e.g. 90s, but then it > > can timeout again when on a busy host with less capability of concurrency. > > For the TIMEOUT part on cross-i686-tci I plan to try this patch: > https://patchew.org/QEMU/20240912151003.2045031-1-peter.maydell@linaro.org/ > which makes 'make check' single-threaded; that will help to see > if the parallelism is a problem. (If it is then we might want > to do a more generalised approach rather than just for that one > CI job.) Sounds good. > > > But indeed there're two ERRORs ([1,2] above).. I collected some more info > > here before the log expires: > > > So.. it's the same test (multifd/tcp/plain/cancel) that is failing on > > different host / arch being tested. What is more weird is the two failures > > are different, the 2nd failure throw out a TLS error even though the test > > doesn't yet have tls involved. > > > > Fabiano, is this the issue you're looking at? > > > > Peter, do you think it'll be helpful if we temporarily mark this test as > > "slow" too so it's not run in CI (so we still run it ourselves when prepare > > migration PR, with the hope that it can reproduce)? > > If you think that specific test is flaky then I think that's > probably a good idea. As usual with this kind of thing, > probably best to have a comment next to the test noting > why and with a URL to a gitlab issue for it, so we don't > forget why we disabled the test. Looks like Fabiano root-caused the issue. We'll see how that goes, or we can prepare a patch to make it optional with the comments in place. Thanks, -- Peter Xu
Peter Maydell <peter.maydell@linaro.org> writes: > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote: >> I don't think we're discussing total CI time at this point, so the math >> doesn't really add up. We're not looking into making the CI finish >> faster. We're looking into making migration-test finish faster. That >> would reduce timeouts in CI, speed-up make check and reduce the chance >> of random race conditions* affecting other people/staging runs. > > Right. The reason migration-test appears on my radar is because > it is very frequently the thing that shows up as "this sometimes > just fails or just times out and if you hit retry it goes away > again". That might not be migration-test's fault specifically, > because those retries tend to be certain CI configs (s390, > the i686-tci one), and I have some theories about what might be > causing it (e.g. build system runs 4 migration-tests in parallel, > which means 8 QEMU processes which is too many for the number > of host CPUs). But right now I look at CI job failures and my reaction > is "oh, it's the migration-test failing yet again" :-( And then I go: "oh, people complaining about migration-test again, I thought we had fixed all the issues this time". It's frustrating for everyone, as I said previously. > > For some examples from this week: > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 About these: There are 2 instances of plain-old-SIGSEGV here. Both happen in non-x86_64 runs and on the /multifd/tcp/plain/cancel test, which means they're either races or memory ordering issues. Having i386 crashing points to the former. So having the CI loaded and causing timeouts is probably what exposed the issue. The thread is mig/dst/recv_7 and grepping the objdump output shows: <set_bit_atomic> 55 48 89 e5 48 89 7d e8 48 89 75 e0 48 8b 45 e8 83 e0 3f ba 01 00 00 00 89 c1 48 d3 e2 48 89 d0 48 89 45 f0 48 8b 45 e8 48 c1 e8 06 48 8d 14 c5 00 00 00 00 48 8b 45 e0 48 01 d0 48 89 45 f8 48 8b 45 f8 48 8b 55 f0 <f0> 48 09 10 90 5d c3 I tried a bisect overnight, but it seems the issue has been there since before 9.0. I'll try to repro with gdb attached or get a core dump.
On Thu, 12 Sept 2024 at 14:48, Fabiano Rosas <farosas@suse.de> wrote: > Peter Maydell <peter.maydell@linaro.org> writes: > > For some examples from this week: > > > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 > > About these: > > There are 2 instances of plain-old-SIGSEGV here. Both happen in > non-x86_64 runs and on the /multifd/tcp/plain/cancel test, which means > they're either races or memory ordering issues. Having i386 crashing > points to the former. So having the CI loaded and causing timeouts is > probably what exposed the issue. They're also both TCI. Would these tests be relying on specific atomic-access behaviour in the guest code that's running, or is all the avoidance-of-races in the migration code in QEMU itself? (I don't know of any particular problems with TCI's implementation of atomic accesses, so this is just a stab in the dark.) thanks -- PMM
Peter Maydell <peter.maydell@linaro.org> writes: > On Thu, 12 Sept 2024 at 14:48, Fabiano Rosas <farosas@suse.de> wrote: >> Peter Maydell <peter.maydell@linaro.org> writes: >> > For some examples from this week: >> > >> > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 >> > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 >> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 >> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 >> >> About these: >> >> There are 2 instances of plain-old-SIGSEGV here. Both happen in >> non-x86_64 runs and on the /multifd/tcp/plain/cancel test, which means >> they're either races or memory ordering issues. Having i386 crashing >> points to the former. So having the CI loaded and causing timeouts is >> probably what exposed the issue. > > They're also both TCI. Would these tests be relying on > specific atomic-access behaviour in the guest code that's > running, or is all the avoidance-of-races in the migration > code in QEMU itself? I misspoke about memory ordering, this is all just the x86 host and the multifd threads in QEMU having synchronization issues. > > (I don't know of any particular problems with TCI's > implementation of atomic accesses, so this is just a stab > in the dark.) > > thanks > -- PMM
© 2016 - 2024 Red Hat, Inc.