v3:
- Patch 1
  - Update qcrypto_tls_session_read() doc to reflect the new retval [Dan]
  - Update commit message to explain the qatomic_read() change [Dan]
- Patch 2 (old)
  - Dropped for now, more at the end

This is v3 of the series.

Fabiano fixed graceful shutdowns for multifd channels previously:

https://lore.kernel.org/qemu-devel/20250206175824.22664-1-farosas@suse.de/

However we can still see a warning when running preempt unit test on TLS,
even though migration functionality will not be affected:

  QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test --full -r /x86_64/migration/postcopy/preempt/tls/psk
  ...
  qemu-kvm: Cannot read from TLS channel: The TLS connection was non-properly terminated.
  ...

It turns out this is because the crypto code only passes the ->shutdown
field into the read function, however that value can change concurrently in
another thread by a concurrent qio_channel_shutdown().

Patch 1 should fix this issue.

Patch 2 is something I found when looking at this problem, there's no known
issues I am aware of with them, however I still think they're logically
flawed, so I post them together here.

Please review, thanks.

============= ABOUT OLD PATCH 2 ===================

I dropped it for now to unblock almost patch 1, because patch 1 will fix a
real warning that can be triggered for not only qtest but also normal tls
postcopy migration.

While I was looking at temporary settings for multifd send iochannels to be
blocking always, I found I cannot explain how migration_tls_channel_end()
currently works, because it writes to the multifd iochannels while the
channels should still be owned (and can be written at the same time?) by
the sender threads.  It sounds like a thread-safety issue, or is it not?

Peter Xu (2):
  io/crypto: Move tls premature termination handling into QIO layer
  migration: Make migration_has_failed() work even for CANCELLING

 include/crypto/tlssession.h | 10 +++-------
 crypto/tlssession.c         |  7 ++-----
 io/channel-tls.c            | 21 +++++++++++++++++++--
 migration/migration.c       |  3 ++-
 4 files changed, 26 insertions(+), 15 deletions(-)

-- 
2.50.1
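
For readers following along, here is a minimal sketch of the idea behind
patch 1 as described above: the crypto layer only reports that the peer
terminated prematurely, and the channel layer decides afterwards, reading
the shutdown flag atomically at that point.  Every name below (the Sketch*
types, the sketch_* functions, the return codes) is hypothetical and uses
C11 atomics as a stand-in for qatomic_read(); it is not the actual QEMU
code and the real patch may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical return codes; the real QEMU values may differ. */
enum {
    SKETCH_READ_ERROR = -1,
    SKETCH_READ_PREMATURE_EOF = -2, /* peer went away without a TLS bye */
};

/* Stand-in for the TLS channel; only the shutdown flag matters here. */
typedef struct SketchChannel {
    atomic_bool shutdown_read; /* may be set by another thread */
} SketchChannel;

/*
 * Stand-in for the crypto layer.  It reports what happened and applies
 * no policy.  'simulated' lets the caller choose the outcome.
 */
static long sketch_session_read(long simulated)
{
    return simulated;
}

/* Plays the role of a concurrent shutdown issued from another thread. */
static void sketch_channel_shutdown(SketchChannel *c)
{
    atomic_store(&c->shutdown_read, true);
}

/*
 * Channel layer: the flag is sampled after the read returned, so a
 * shutdown that raced with the read is still observed.
 */
static long sketch_channel_read(SketchChannel *c, long simulated)
{
    long ret = sketch_session_read(simulated);

    if (ret == SKETCH_READ_PREMATURE_EOF) {
        if (atomic_load(&c->shutdown_read)) {
            return 0; /* expected: treat as a normal end of stream */
        }
        return SKETCH_READ_ERROR; /* unexpected: report an error */
    }
    return ret;
}
```

The point of the sketch is the ordering: passing the flag by value into
the read would freeze its state before the read blocks, while checking it
after the read returns sees the latest value.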
Peter Xu <peterx@redhat.com> writes:

> v3:
> - Patch 1
>   - Update qcrypto_tls_session_read() doc to reflect the new retval [Dan]
>   - Update commit message to explain the qatomic_read() change [Dan]
> - Patch 2 (old)
>   - Dropped for now, more at the end
>
> This is v3 of the series.
>
> Fabiano fixed graceful shutdowns for multifd channels previously:
>
> https://lore.kernel.org/qemu-devel/20250206175824.22664-1-farosas@suse.de/
>
> However we can still see a warning when running preempt unit test on TLS,
> even though migration functionality will not be affected:
>
>   QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test --full -r /x86_64/migration/postcopy/preempt/tls/psk
>   ...
>   qemu-kvm: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>   ...
>
> It turns out this is because the crypto code only passes the ->shutdown
> field into the read function, however that value can change concurrently in
> another thread by a concurrent qio_channel_shutdown().
>
> Patch 1 should fix this issue.
>
> Patch 2 is something I found when looking at this problem, there's no known
> issues I am aware of with them, however I still think they're logically
> flawed, so I post them together here.
>
> Please review, thanks.
>
> ============= ABOUT OLD PATCH 2 ===================
>
> I dropped it for now to unblock almost patch 1, because patch 1 will fix a
> real warning that can be triggered for not only qtest but also normal tls
> postcopy migration.
>
> While I was looking at temporary settings for multifd send iochannels to be
> blocking always, I found I cannot explain how migration_tls_channel_end()
> currently works, because it writes to the multifd iochannels while the
> channels should still be owned (and can be written at the same time?) by
> the sender threads.  It sounds like a thread-safety issue, or is it not?
>

IIUC, the multifd channels will be stuck at p->sem because this is the
success path so migration will have already finished when we reach
migration_cleanup(). The ram/device state migration will hold the main
thread until the multifd channels finish transferring.

> Peter Xu (2):
>   io/crypto: Move tls premature termination handling into QIO layer
>   migration: Make migration_has_failed() work even for CANCELLING
>
>  include/crypto/tlssession.h | 10 +++-------
>  crypto/tlssession.c         |  7 ++-----
>  io/channel-tls.c            | 21 +++++++++++++++++++--
>  migration/migration.c       |  3 ++-
>  4 files changed, 26 insertions(+), 15 deletions(-)
On Thu, Sep 18, 2025 at 06:17:37PM -0300, Fabiano Rosas wrote:
> > ============= ABOUT OLD PATCH 2 ===================
> >
> > I dropped it for now to unblock almost patch 1, because patch 1 will fix a
> > real warning that can be triggered for not only qtest but also normal tls
> > postcopy migration.
> >
> > While I was looking at temporary settings for multifd send iochannels to be
> > blocking always, I found I cannot explain how migration_tls_channel_end()
> > currently works, because it writes to the multifd iochannels while the
> > channels should still be owned (and can be written at the same time?) by
> > the sender threads.  It sounds like a thread-safety issue, or is it not?
> >
>
> IIUC, the multifd channels will be stuck at p->sem because this is the
> success path so migration will have already finished when we reach
> migration_cleanup(). The ram/device state migration will hold the main
> thread until the multifd channels finish transferring.

For success cases, indeed.  However this is not the success path?  After
all, we check migration_has_failed().

Should I then send a patch to only send bye() when succeeded?  Then I can
also add some comment.  I wished we could assert.  Then the "temporarily
changing nonblock mode" will also rely on this one, because ideally we
shouldn't touch the fd nonblocking mode if some other thread is operating
on it.

The other thing is I also think we shouldn't rely on checking
"p->tls_thread_created && p->thread_created" but only rely on channel type,
which might be more straightforward (I almost did it in v1, but v2 rewrote
things so it was lost).

-- 
Peter Xu
Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 18, 2025 at 06:17:37PM -0300, Fabiano Rosas wrote:
>> > ============= ABOUT OLD PATCH 2 ===================
>> >
>> > I dropped it for now to unblock almost patch 1, because patch 1 will fix a
>> > real warning that can be triggered for not only qtest but also normal tls
>> > postcopy migration.
>> >
>> > While I was looking at temporary settings for multifd send iochannels to be
>> > blocking always, I found I cannot explain how migration_tls_channel_end()
>> > currently works, because it writes to the multifd iochannels while the
>> > channels should still be owned (and can be written at the same time?) by
>> > the sender threads.  It sounds like a thread-safety issue, or is it not?
>> >
>>
>> IIUC, the multifd channels will be stuck at p->sem because this is the
>> success path so migration will have already finished when we reach
>> migration_cleanup(). The ram/device state migration will hold the main
>> thread until the multifd channels finish transferring.
>
> For success cases, indeed.  However this is not the success path?  After
> all, we check migration_has_failed().
>

My point is that when we reach here, if migration has succeeded, then it
should be ok. If not, then thread-safety doesn't matter because things
have already gone bad, we'll lose the destination anyway.

> Should I then send a patch to only send bye() when succeeded?  Then I can
> also add some comment.  I wished we could assert.  Then the "temporarily
> changing nonblock mode" will also rely on this one, because ideally we
> shouldn't touch the fd nonblocking mode if some other thread is operating
> on it.
>

I don't know if it changes much. Currently we basically always ignore
the error from bye().

> The other thing is I also think we shouldn't rely on checking
> "p->tls_thread_created && p->thread_created" but only rely on channel type,
> which might be more straightforward (I almost did it in v1, but v2 rewrote
> things so it was lost).

Ok, but we may need to ensure bye() is not called before the session is
initiated. So thread_created may still be needed?
On Fri, Sep 19, 2025 at 10:50:56AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Sep 18, 2025 at 06:17:37PM -0300, Fabiano Rosas wrote:
> >> > ============= ABOUT OLD PATCH 2 ===================
> >> >
> >> > I dropped it for now to unblock almost patch 1, because patch 1 will fix a
> >> > real warning that can be triggered for not only qtest but also normal tls
> >> > postcopy migration.
> >> >
> >> > While I was looking at temporary settings for multifd send iochannels to be
> >> > blocking always, I found I cannot explain how migration_tls_channel_end()
> >> > currently works, because it writes to the multifd iochannels while the
> >> > channels should still be owned (and can be written at the same time?) by
> >> > the sender threads.  It sounds like a thread-safety issue, or is it not?
> >> >
> >>
> >> IIUC, the multifd channels will be stuck at p->sem because this is the
> >> success path so migration will have already finished when we reach
> >> migration_cleanup(). The ram/device state migration will hold the main
> >> thread until the multifd channels finish transferring.
> >
> > For success cases, indeed.  However this is not the success path?  After
> > all, we check migration_has_failed().
> >
>
> My point is that when we reach here, if migration has succeeded, then it
> should be ok. If not, then thread-safety doesn't matter because things
> have already gone bad, we'll lose the destination anyway.

I'm not sure if it matters or not, maybe it depends on how bad it is when a
race happened.

If it's a tcp channel, it might be easier; the worst case is we write()
concurrently in two threads and the output stream, IIUC, can be interleaved
with the two buffers we write.  Not an issue if migration failed anyway.

However this is only needed for TLS, hence I have no idea what happens if
gnutls writes concurrently.  I don't think GnuTLS supports concurrent
writers.  I'm not sure if it means there's still a chance src QEMU (when
having a failed live migration) can crash.

So.. I still think it might be wise we only bye() after knowing it is a
success, not only because that looks like the only way to make sure it's
thread-safe, but also because a bye() is only needed if it didn't fail.
Sending it ignoring error is another way of doing so, but it doesn't avoid
the possible result of a race (even if I totally agree it is unlikely..).

>
> > Should I then send a patch to only send bye() when succeeded?  Then I can
> > also add some comment.  I wished we could assert.  Then the "temporarily
> > changing nonblock mode" will also rely on this one, because ideally we
> > shouldn't touch the fd nonblocking mode if some other thread is operating
> > on it.
> >
>
> I don't know if it changes much. Currently we basically always ignore
> the error from bye().
>
> > The other thing is I also think we shouldn't rely on checking
> > "p->tls_thread_created && p->thread_created" but only rely on channel type,
> > which might be more straightforward (I almost did it in v1, but v2 rewrote
> > things so it was lost).
>
> Ok, but we may need to ensure bye() is not called before the session is
> initiated. So thread_created may still be needed?

In v1, I was using "object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)":

https://lore.kernel.org/all/20250910160144.1762894-4-peterx@redhat.com/

Would that work the same, but without relying on "thread_created" vars?

-- 
Peter Xu
Peter Xu <peterx@redhat.com> writes:

> On Fri, Sep 19, 2025 at 10:50:56AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Sep 18, 2025 at 06:17:37PM -0300, Fabiano Rosas wrote:
>> >> > ============= ABOUT OLD PATCH 2 ===================
>> >> >
>> >> > I dropped it for now to unblock almost patch 1, because patch 1 will fix a
>> >> > real warning that can be triggered for not only qtest but also normal tls
>> >> > postcopy migration.
>> >> >
>> >> > While I was looking at temporary settings for multifd send iochannels to be
>> >> > blocking always, I found I cannot explain how migration_tls_channel_end()
>> >> > currently works, because it writes to the multifd iochannels while the
>> >> > channels should still be owned (and can be written at the same time?) by
>> >> > the sender threads.  It sounds like a thread-safety issue, or is it not?
>> >> >
>> >>
>> >> IIUC, the multifd channels will be stuck at p->sem because this is the
>> >> success path so migration will have already finished when we reach
>> >> migration_cleanup(). The ram/device state migration will hold the main
>> >> thread until the multifd channels finish transferring.
>> >
>> > For success cases, indeed.  However this is not the success path?  After
>> > all, we check migration_has_failed().
>> >
>>
>> My point is that when we reach here, if migration has succeeded, then it
>> should be ok. If not, then thread-safety doesn't matter because things
>> have already gone bad, we'll lose the destination anyway.
>
> I'm not sure if it matters or not, maybe it depends on how bad it is when a
> race happened.
>
> If it's a tcp channel, it might be easier; the worst case is we write()
> concurrently in two threads and the output stream, IIUC, can be interleaved
> with the two buffers we write.  Not an issue if migration failed anyway.
>
> However this is only needed for TLS, hence I have no idea what happens if
> gnutls writes concurrently.  I don't think GnuTLS supports concurrent
> writers.  I'm not sure if it means there's still a chance src QEMU (when
> having a failed live migration) can crash.
>
> So.. I still think it might be wise we only bye() after knowing it is a
> success, not only because that looks like the only way to make sure it's
> thread-safe, but also because a bye() is only needed if it didn't fail.
> Sending it ignoring error is another way of doing so, but it doesn't avoid
> the possible result of a race (even if I totally agree it is unlikely..).
>

ok

>>
>> > Should I then send a patch to only send bye() when succeeded?  Then I can
>> > also add some comment.  I wished we could assert.  Then the "temporarily
>> > changing nonblock mode" will also rely on this one, because ideally we
>> > shouldn't touch the fd nonblocking mode if some other thread is operating
>> > on it.
>> >
>>
>> I don't know if it changes much. Currently we basically always ignore
>> the error from bye().
>>
>> > The other thing is I also think we shouldn't rely on checking
>> > "p->tls_thread_created && p->thread_created" but only rely on channel type,
>> > which might be more straightforward (I almost did it in v1, but v2 rewrote
>> > things so it was lost).
>>
>> Ok, but we may need to ensure bye() is not called before the session is
>> initiated. So thread_created may still be needed?
>
> In v1, I was using "object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)":
>
> https://lore.kernel.org/all/20250910160144.1762894-4-peterx@redhat.com/
>
> Would that work the same, but without relying on "thread_created"
> vars?

Ok, I'm convinced. migration_cleanup() -> multifd_send_shutdown() ->
bye() cannot happen before thread_created=true because
multifd_send_setup() blocks the migration_thread until the channels have
been fully created.

Go ahead then!
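
As a rough illustration of where the discussion landed (send bye() only
after a successful migration, and decide by channel type instead of the
thread_created flags), a self-contained sketch might look like the
following.  The Sketch* types and sketch_* functions are hypothetical
stand-ins, the enum replaces the object_dynamic_cast() type check, and
this is not the code of any posted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical channel kinds, standing in for the QOM type check. */
typedef enum {
    SKETCH_CHANNEL_SOCKET,
    SKETCH_CHANNEL_TLS,
} SketchChannelType;

typedef struct SketchChannel {
    SketchChannelType type;
    int bye_sent; /* counts how many times a bye was written */
} SketchChannel;

/* Stand-in for the TLS bye; a real one writes to the channel. */
static void sketch_send_bye(SketchChannel *c)
{
    c->bye_sent++;
}

/*
 * Cleanup-time decision.  On failure the sender threads may still be
 * writing to the channel, so nothing is written here.  On success they
 * are idle, so a single writer remains and the bye is safe to send.
 * Returns true if a bye was sent.
 */
static bool sketch_channel_end(SketchChannel *c, bool migration_failed)
{
    if (migration_failed) {
        return false; /* skip: avoids racing with sender threads */
    }
    if (c->type != SKETCH_CHANNEL_TLS) {
        return false; /* plain sockets have no TLS session to end */
    }
    sketch_send_bye(c);
    return true;
}
```

The sketch leans on the ordering argued above: cleanup cannot run before
the channels are fully created, so the type check alone is enough to tell
whether a TLS session exists.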