migration/migration.c | 90 +++++++++++++++++++++++--------------- migration/multifd-nocomp.c | 3 +- migration/multifd.c | 5 --- migration/multifd.h | 5 +++ migration/options.c | 5 --- migration/ram.c | 73 +++++++++---------------------- 6 files changed, 82 insertions(+), 99 deletions(-)
From: Prasad Pandit <pjp@fedoraproject.org> Hello, * Currently Multifd and Postcopy migration can not be used together. QEMU shows "Postcopy is not yet compatible with multifd" message. When migrating guests with large (100's GB) RAM, Multifd threads help to accelerate migration, but inability to use it with the Postcopy mode delays guest start up on the destination side. * This patch series allows to enable both Multifd and Postcopy migration together. Precopy and Multifd threads work during the initial guest (RAM) transfer. When migration moves to the Postcopy phase, Multifd threads are restrained and the Postcopy threads start to request pages from the source side. * This series removes magic value (4-bytes) introduced in the previous series for the Postcopy channel. And refactoring of the 'ram_save_target_page' function is made independent of the multifd & postcopy change. v1: https://lore.kernel.org/qemu-devel/20241126115748.118683-1-ppandit@redhat.com/T/#u v0: https://lore.kernel.org/qemu-devel/20241029150908.1136894-1-ppandit@redhat.com/T/#u Thank you. --- Prasad Pandit (3): migration/multifd: move macros to multifd header migration: refactor ram_save_target_page functions migration: enable multifd and postcopy together migration/migration.c | 90 +++++++++++++++++++++++--------------- migration/multifd-nocomp.c | 3 +- migration/multifd.c | 5 --- migration/multifd.h | 5 +++ migration/options.c | 5 --- migration/ram.c | 73 +++++++++---------------------- 6 files changed, 82 insertions(+), 99 deletions(-) -- 2.47.1
On Fri, Nov 29, 2024 at 05:52:53PM +0530, Prasad Pandit wrote: > From: Prasad Pandit <pjp@fedoraproject.org> > > Hello, > > * Currently Multifd and Postcopy migration can not be used together. > QEMU shows "Postcopy is not yet compatible with multifd" message. > > When migrating guests with large (100's GB) RAM, Multifd threads > help to accelerate migration, but inability to use it with the > Postcopy mode delays guest start up on the destination side. > > * This patch series allows to enable both Multifd and Postcopy > migration together. Precopy and Multifd threads work during > the initial guest (RAM) transfer. When migration moves to the > Postcopy phase, Multifd threads are restrained and the Postcopy > threads start to request pages from the source side. > > * This series removes magic value (4-bytes) introduced in the > previous series for the Postcopy channel. And refactoring of > the 'ram_save_target_page' function is made independent of > the multifd & postcopy change. > > v1: https://lore.kernel.org/qemu-devel/20241126115748.118683-1-ppandit@redhat.com/T/#u > v0: https://lore.kernel.org/qemu-devel/20241029150908.1136894-1-ppandit@redhat.com/T/#u > > > Thank you. > --- > Prasad Pandit (3): > migration/multifd: move macros to multifd header > migration: refactor ram_save_target_page functions > migration: enable multifd and postcopy together Prasad, I saw that there's still discussion in the previous version, while this cover letter doesn't mention why it was ignored. Especially, at least to me, what Fabiano commented on simplifying the flush condition check makes senes to me to be addressed. Please cherish reviewer's feedback and time contributed, and let's finish the disucssion first before rushing for a new version. I'll try to join the discussion later too there. Meanwhile, before I read into any details, I found that all the tests I requested are still missing. Would you please consider adding them? My previous comment regarding to test is here: https://lore.kernel.org/qemu-devel/ZykJBq7ME5jgSzCA@x1n/ I listed exactly the minimum set of tests that I think we should have. In general, any migration new feature must have both doc updates and test coverage as long as applicable. Multifd still has its doc missing, which is unfortunate. We could have one doc prior to this work, but it can also be done later. OTOH on testing: this is not a new feature alone, but it's more dangerous than a new feature due to what I mentioned before, that postcopy needs atomicity on page movements, and multifd is completely against that from design POV due to NIC drivers being able to modify guest pages directly. It means multifd+postcopy bugs will be extremely hard to debug if we don't put it right first. So please be serious on the test coverage on this work. Thanks, -- Peter Xu
Hello Peter, On Fri, 29 Nov 2024 at 22:16, Peter Xu <peterx@redhat.com> wrote: > I saw that there's still discussion in the previous version, while this > cover letter doesn't mention why it was ignored. Especially, at least to > me, what Fabiano commented on simplifying the flush condition check makes > senes to me to be addressed. * It is not ignored. Simplifying flush conditionals makes sense to me too, that is why in the 'v0' version of this series I had added the !migration_in_postcopy() check to the migrate_multifd() function, right? I tried to discuss in the 'v1' thread if there's another way to simplify conditionals. Not sure if you've followed all comments in the thread. * Secondly, as you mention above, I also thought Fabiano is pointing at the complexity of the 'if' conditionals and thus I replied that his proposed patch does not seem to solve for that complexity. But in his subsequent reply Fabiano mentions that it is not just about conditionals, but larger complexity of how and when multifd threads do flush and sync amongst them. * IMHO, simplifying that larger complexity of how multifd threads do flush and sync can be done independently, outside the scope of this patch series, which is about enabling multifd and postcopy together. > Meanwhile, before I read into any details, I found that all the tests I > requested are still missing. Would you please consider adding them? > > My previous comment regarding to test is here: > https://lore.kernel.org/qemu-devel/ZykJBq7ME5jgSzCA@x1n/ > > I listed exactly the minimum set of tests that I think we should have. ... > In general, any migration new feature must have both doc updates and test > coverage as long as applicable. > > Multifd still has its doc missing, which is unfortunate. We could have one > doc prior to this work, but it can also be done later. > > OTOH on testing: this is not a new feature alone, but it's more dangerous > than a new feature due to what I mentioned before, that postcopy needs > atomicity on page movements, and multifd is completely against that from > design POV due to NIC drivers being able to modify guest pages directly. > > It means multifd+postcopy bugs will be extremely hard to debug if we don't > put it right first. So please be serious on the test coverage on this > work. * I'm yet to get to the test cases. The revised series(v1 and v2) are posted to share patch updates which were suggested in the previous reviews. Test cases are a separate/different effort from source patches. If we want to hold on this patch series till we get the test cases and documentation in place, that is okay, I'll work on that next. Thank you. --- - Prasad
On Mon, Dec 02, 2024 at 12:37:19PM +0530, Prasad Pandit wrote: > Hello Peter, > > On Fri, 29 Nov 2024 at 22:16, Peter Xu <peterx@redhat.com> wrote: > > I saw that there's still discussion in the previous version, while this > > cover letter doesn't mention why it was ignored. Especially, at least to > > me, what Fabiano commented on simplifying the flush condition check makes > > senes to me to be addressed. > > * It is not ignored. Simplifying flush conditionals makes sense to me > too, that is why in the 'v0' version of this series I had added the > !migration_in_postcopy() check to the migrate_multifd() function, > right? As explained, that addition was wrong, because migrate_multifd() should always return the user option only. Again, you can add another helper. > I tried to discuss in the 'v1' thread if there's another way to > simplify conditionals. Not sure if you've followed all comments in the > thread. I'll post a version to clean it up, either we go with Fabiano's, or mine, or a 3rd option. We shouldn't pile up more condition check there. It's growing into something not maintainable. > > * Secondly, as you mention above, I also thought Fabiano is pointing > at the complexity of the 'if' conditionals and thus I replied that his > proposed patch does not seem to solve for that complexity. But in his > subsequent reply Fabiano mentions that it is not just about > conditionals, but larger complexity of how and when multifd threads do > flush and sync amongst them. Yes they're relevant, but I think we can cleanup the whole thing and it's not that complicated, IMHO. We'll see. > > * IMHO, simplifying that larger complexity of how multifd threads do > flush and sync can be done independently, outside the scope of this > patch series, which is about enabling multifd and postcopy together. I assume you're working on the test cases, I hope this won't block you from continuing your work on this series. As mentioned above, I think we need to clean this up before moving on, unfortunately. And I hope things settle already before you have the test cases ready. I appreciate you add the test cases for multifd+postcopy. That's very important. Before that you can keep your patch as-is, and leave that part for us to figure out. Feel free to chime in anytime as well. > > > Meanwhile, before I read into any details, I found that all the tests I > > requested are still missing. Would you please consider adding them? > > > > My previous comment regarding to test is here: > > https://lore.kernel.org/qemu-devel/ZykJBq7ME5jgSzCA@x1n/ > > > > I listed exactly the minimum set of tests that I think we should have. > ... > > In general, any migration new feature must have both doc updates and test > > coverage as long as applicable. > > > > Multifd still has its doc missing, which is unfortunate. We could have one > > doc prior to this work, but it can also be done later. > > > > OTOH on testing: this is not a new feature alone, but it's more dangerous > > than a new feature due to what I mentioned before, that postcopy needs > > atomicity on page movements, and multifd is completely against that from > > design POV due to NIC drivers being able to modify guest pages directly. > > > > It means multifd+postcopy bugs will be extremely hard to debug if we don't > > put it right first. So please be serious on the test coverage on this > > work. > > * I'm yet to get to the test cases. The revised series(v1 and v2) are > posted to share patch updates which were suggested in the previous > reviews. Test cases are a separate/different effort from source > patches. If we want to hold on this patch series till we get the test > cases and documentation in place, that is okay, I'll work on that > next. So we talked about this in our meeting, but still just to keep it a record so whoever work on migration can reference: we do require test cases and it's not separate effort. We request both test cases and docs to present before mering a feature, unless there's good reason to not to. E.g. multifd doesn't yet have doc, so doc is not required for this work yet, however test cases are. Another outlier is VFIO+multifd cannot easily add test case because CI normally doesn't have available hardware environment. However does should apply there to be required at least from migration POV. Thanks, -- Peter Xu
© 2016 - 2024 Red Hat, Inc.