migration/migration.c | 33 ++++++++++++++++++--------------- 1 file changed, 18 insertions(+), 15 deletions(-)
Hello, Today, close_return_path_on_source() can perform a shutdown to exit the return-path thread if an error occured. However, migrate_fd_cleanup() does cleanups too early and the shutdown in close_return_path_on_source() fails, leaving the source and destination waiting for an event to occur. This little series tries to fix that. Comments welcome ! Thanks, C. Cédric Le Goater (2): migration: Add a file_error argument to close_return_path_on_source() migration: Fix return-path thread exit migration/migration.c | 33 ++++++++++++++++++--------------- 1 file changed, 18 insertions(+), 15 deletions(-) -- 2.43.0
On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote: > Hello, Hi, Cédric, Thanks for the patches. > > Today, close_return_path_on_source() can perform a shutdown to exit > the return-path thread if an error occured. However, migrate_fd_cleanup() > does cleanups too early and the shutdown in close_return_path_on_source() > fails, leaving the source and destination waiting for an event to occur. > > This little series tries to fix that. Comments welcome ! One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in close_return_path_on_source() is weird: IMHO we have better way to detect "whether the migration has error" now, which is migrate_has_error(). For this specific issue, I think one long standing issue that might be relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share the same QIOChannel now. Logically the two QEMUFile should be able to be managed separately, say, close() of to_dst_file shouldn't affect the other. However I don't think it's the case now, as qemu_fclose(to_dst_file) will do qio_channel_close() already, which means there will be a side effect to the other QEMUFile that its backing IOC is already closed. Is this the issue we're facing? IOW, the close() of to_dst_file will not properly kick the other thread who is blocked at reading from_dst_file, while the shutdown() will kick it out? If so, not sure whether we can somehow relay the real qio_channel_close() to until the last user releases it? IOW, conditionally close() the channel in qio_channel_finalize(), if the channel is still open? Would that make sense? Copy Dan too. Thanks, -- Peter Xu
On Fri, Feb 02, 2024 at 05:53:39PM +0800, Peter Xu wrote: > On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote: > > Hello, > > Hi, Cédric, > > Thanks for the patches. > > > > > Today, close_return_path_on_source() can perform a shutdown to exit > > the return-path thread if an error occured. However, migrate_fd_cleanup() > > does cleanups too early and the shutdown in close_return_path_on_source() > > fails, leaving the source and destination waiting for an event to occur. > > > > This little series tries to fix that. Comments welcome ! > > One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in > close_return_path_on_source() is weird: IMHO we have better way to detect > "whether the migration has error" now, which is migrate_has_error(). > > For this specific issue, I think one long standing issue that might be > relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share > the same QIOChannel now. Logically the two QEMUFile should be able to be > managed separately, say, close() of to_dst_file shouldn't affect the other. > > However I don't think it's the case now, as qemu_fclose(to_dst_file) will > do qio_channel_close() already, which means there will be a side effect to > the other QEMUFile that its backing IOC is already closed. > > Is this the issue we're facing? IOW, the close() of to_dst_file will not > properly kick the other thread who is blocked at reading from_dst_file, > while the shutdown() will kick it out? > > If so, not sure whether we can somehow relay the real qio_channel_close() > to until the last user releases it? IOW, conditionally close() the channel > in qio_channel_finalize(), if the channel is still open? Would that make > sense? IMHO the problem described above is a result of the design mistake of having 2 separate QEMUFile instances for what is ultimately the same channel. This was a convenient approach to take originally, but it has likely outlived its purpose. In the ideal world IMHO, QEMUFile would not exist at all, and we would have a QIOChannelCached that adds the read/write buffering above the base QIOChannel. That's doable, but bigger than a quick fix. A natural stepping stone to get there though is to move from 2 QEMUFile objs down to 1 QEMUFile, which might be more practical as a quick fix. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Mon, Feb 05, 2024 at 10:32:33AM +0000, Daniel P. Berrangé wrote: > On Fri, Feb 02, 2024 at 05:53:39PM +0800, Peter Xu wrote: > > On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote: > > > Hello, > > > > Hi, Cédric, > > > > Thanks for the patches. > > > > > > > > Today, close_return_path_on_source() can perform a shutdown to exit > > > the return-path thread if an error occured. However, migrate_fd_cleanup() > > > does cleanups too early and the shutdown in close_return_path_on_source() > > > fails, leaving the source and destination waiting for an event to occur. > > > > > > This little series tries to fix that. Comments welcome ! > > > > One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in > > close_return_path_on_source() is weird: IMHO we have better way to detect > > "whether the migration has error" now, which is migrate_has_error(). > > > > For this specific issue, I think one long standing issue that might be > > relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share > > the same QIOChannel now. Logically the two QEMUFile should be able to be > > managed separately, say, close() of to_dst_file shouldn't affect the other. > > > > However I don't think it's the case now, as qemu_fclose(to_dst_file) will > > do qio_channel_close() already, which means there will be a side effect to > > the other QEMUFile that its backing IOC is already closed. > > > > Is this the issue we're facing? IOW, the close() of to_dst_file will not > > properly kick the other thread who is blocked at reading from_dst_file, > > while the shutdown() will kick it out? > > > > If so, not sure whether we can somehow relay the real qio_channel_close() > > to until the last user releases it? IOW, conditionally close() the channel > > in qio_channel_finalize(), if the channel is still open? Would that make > > sense? > > IMHO the problem described above is a result of the design mistake of > having 2 separate QEMUFile instances for what is ultimately the same > channel. This was a convenient approach to take originally, but it has > likely outlived its purpose. > > In the ideal world IMHO, QEMUFile would not exist at all, and we would > have a QIOChannelCached that adds the read/write buffering above the > base QIOChannel. We have that in the TODO wiki page for a long time, I'll update it slightly. https://wiki.qemu.org/ToDo/LiveMigration#Rewrite_QEMUFile_for_migration But yeah that might be too big a hammer to solve this specific issue. AFAIU Fabiano is looking into that direction, but I assume it should still be a long term thing. > > That's doable, but bigger than a quick fix. A natural stepping stone > to get there though is to move from 2 QEMUFile objs down to 1 QEMUFile, > which might be more practical as a quick fix. Agree. However would this still be quite some change? We still have a lot of references on the four qemufiles (to/from_dst_file, to/from_src_file), at least that'll need a replacement; I didn't yet further check whether all places can be done with a direct replacement of such change, some tweaks may be needed here and there, but shouldn't be major. Meanwhile IIUC it'll also need a major rework on QEMUFile, allowing it to be bi-directional? We may need to duplicate the cache layer, IIUC, one for each direction IOs. -- Peter Xu
Hello Peter, >> Today, close_return_path_on_source() can perform a shutdown to exit >> the return-path thread if an error occured. However, migrate_fd_cleanup() >> does cleanups too early and the shutdown in close_return_path_on_source() >> fails, leaving the source and destination waiting for an event to occur. >> >> This little series tries to fix that. Comments welcome ! > > One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in > close_return_path_on_source() is weird: IMHO we have better way to detect > "whether the migration has error" now, which is migrate_has_error(). ok. migrate_has_error() looks safe to use in that case. It works fine with all the prereq VFIO cleanups (that I didn't send yet) and errors in the setup of dirty tracking are reported correctly to the migration core subsystem. > For this specific issue, I think one long standing issue that might be > relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share > the same QIOChannel now. Logically the two QEMUFile should be able to be > managed separately, say, close() of to_dst_file shouldn't affect the other. > > However I don't think it's the case now, as qemu_fclose(to_dst_file) will > do qio_channel_close() already, which means there will be a side effect to > the other QEMUFile that its backing IOC is already closed. > > Is this the issue we're facing? Yes. The socket is closed before calling close_return_path_on_source() and ms->rp_state.from_dst_file becomes invalid, the shutdown silently fails (we should maybe report error in qemu_file_shutdown()) and the return-path thread does not exits. > IOW, the close() of to_dst_file will not > properly kick the other thread who is blocked at reading from_dst_file, > while the shutdown() will kick it out? Yes, that's how I understand the comment : /* * If this is a normal exit then the destination will send a SHUT * and the rp_thread will exit, however if there's an error we * need to cause it to exit. shutdown(2), if we have it, will * cause it to unblock if it's stuck waiting for the destination. */ > If so, not sure whether we can somehow relay the real qio_channel_close() > to until the last user releases it? IOW, conditionally close() the channel> in qio_channel_finalize(), if the channel is still open? Would that make > sense? It's the first time that I look at this code :/ I can't tell. Here is the closing section : qemu_mutex_unlock(&s->qemu_file_lock); /* * Close the file handle without the lock to make sure the * critical section won't block for long. */ migration_ioc_unregister_yank_from_file(tmp); qemu_fclose(tmp); } Thanks, C.
© 2016 - 2024 Red Hat, Inc.