* [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional @ 2025-08-01 17:02 Daniel P. Berrangé 2025-08-01 17:02 ` [PATCH 1/2] migration: simplify error reporting after channel read Daniel P. Berrangé ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Daniel P. Berrangé @ 2025-08-01 17:02 UTC (permalink / raw) To: qemu-devel; +Cc: Juraj Marcin, Peter Xu, Fabiano Rosas, Daniel P. Berrangé This is a followup to previously merged patches that claimed to workaround the gnutls bug impacting migration, but in fact were essentially non-functional. Juraj Marcin pointed this out, and this new patch tweaks the workaround to make it actually do something useful. Daniel P. Berrangé (2): migration: simplify error reporting after channel read migration: fix workaround for gnutls thread safety crypto/tlssession.c | 16 ---------------- migration/qemu-file.c | 22 +++++++++++++++++----- 2 files changed, 17 insertions(+), 21 deletions(-) -- 2.50.1 ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] migration: simplify error reporting after channel read 2025-08-01 17:02 [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Daniel P. Berrangé @ 2025-08-01 17:02 ` Daniel P. Berrangé 2025-08-04 10:18 ` Prasad Pandit 2025-08-06 0:41 ` Peter Xu 2025-08-01 17:02 ` [PATCH 2/2] migration: fix workaround for gnutls thread safety Daniel P. Berrangé 2025-08-04 17:53 ` [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Juraj Marcin 2 siblings, 2 replies; 17+ messages in thread From: Daniel P. Berrangé @ 2025-08-01 17:02 UTC (permalink / raw) To: qemu-devel; +Cc: Juraj Marcin, Peter Xu, Fabiano Rosas, Daniel P. Berrangé The code handling the return value of qio_channel_read proceses len == 0 (EOF) separately from len < 1 (error), but in both cases ends up calling qemu_file_set_error_obj() with -EIO as the errno. This logic can be merged into one codepath to simplify it. Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> --- migration/qemu-file.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/migration/qemu-file.c b/migration/qemu-file.c index b6ac190034..8ee44c5ac9 100644 --- a/migration/qemu-file.c +++ b/migration/qemu-file.c @@ -348,17 +348,13 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f) } else { qio_channel_wait(f->ioc, G_IO_IN); } - } else if (len < 0) { - len = -EIO; } } while (len == QIO_CHANNEL_ERR_BLOCK); if (len > 0) { f->buf_size += len; - } else if (len == 0) { - qemu_file_set_error_obj(f, -EIO, local_error); } else { - qemu_file_set_error_obj(f, len, local_error); + qemu_file_set_error_obj(f, -EIO, local_error); } for (int i = 0; i < nfd; i++) { -- 2.50.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] migration: simplify error reporting after channel read 2025-08-01 17:02 ` [PATCH 1/2] migration: simplify error reporting after channel read Daniel P. Berrangé @ 2025-08-04 10:18 ` Prasad Pandit 2025-08-04 10:22 ` Daniel P. Berrangé 2025-08-06 0:41 ` Peter Xu 1 sibling, 1 reply; 17+ messages in thread From: Prasad Pandit @ 2025-08-04 10:18 UTC (permalink / raw) To: Daniel P. Berrangé; +Cc: qemu-devel, Juraj Marcin, Peter Xu, Fabiano Rosas On Sat, 2 Aug 2025 at 00:06, Daniel P. Berrangé <berrange@redhat.com> wrote: > The code handling the return value of qio_channel_read proceses > len == 0 (EOF) separately from len < 1 (error), but in both > cases ends up calling qemu_file_set_error_obj() with -EIO as the > errno. This logic can be merged into one codepath to simplify it. > > } else { > qio_channel_wait(f->ioc, G_IO_IN); > } > - } else if (len < 0) { > - len = -EIO; > } > } while (len == QIO_CHANNEL_ERR_BLOCK); > > if (len > 0) { > f->buf_size += len; > - } else if (len == 0) { > - qemu_file_set_error_obj(f, -EIO, local_error); > } else { > - qemu_file_set_error_obj(f, len, local_error); > + qemu_file_set_error_obj(f, -EIO, local_error); > } * But should _file_set_error_obj(... -EIO) be called for len == 0 (EOF) case? ie. function is trying to read from a file, at some point it is bound to reach EOF. '-EIO' indicates an I/O error, reaching EOF could not be an error. Maybe we could just return zero(0) ? (just checking) Thank you. --- - Prasad ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] migration: simplify error reporting after channel read 2025-08-04 10:18 ` Prasad Pandit @ 2025-08-04 10:22 ` Daniel P. Berrangé 2025-08-04 11:03 ` Prasad Pandit 0 siblings, 1 reply; 17+ messages in thread From: Daniel P. Berrangé @ 2025-08-04 10:22 UTC (permalink / raw) To: Prasad Pandit; +Cc: qemu-devel, Juraj Marcin, Peter Xu, Fabiano Rosas On Mon, Aug 04, 2025 at 03:48:59PM +0530, Prasad Pandit wrote: > On Sat, 2 Aug 2025 at 00:06, Daniel P. Berrangé <berrange@redhat.com> wrote: > > The code handling the return value of qio_channel_read proceses > > len == 0 (EOF) separately from len < 1 (error), but in both > > cases ends up calling qemu_file_set_error_obj() with -EIO as the > > errno. This logic can be merged into one codepath to simplify it. > > > > } else { > > qio_channel_wait(f->ioc, G_IO_IN); > > } > > - } else if (len < 0) { > > - len = -EIO; > > } > > } while (len == QIO_CHANNEL_ERR_BLOCK); > > > > if (len > 0) { > > f->buf_size += len; > > - } else if (len == 0) { > > - qemu_file_set_error_obj(f, -EIO, local_error); > > } else { > > - qemu_file_set_error_obj(f, len, local_error); > > + qemu_file_set_error_obj(f, -EIO, local_error); > > } > > * But should _file_set_error_obj(... -EIO) be called for len == 0 > (EOF) case? ie. function is trying to read from a file, at some point > it is bound to reach EOF. '-EIO' indicates an I/O error, reaching EOF > could not be an error. Maybe we could just return zero(0) ? (just > checking) The migration protocol knows whether it is expecting more data or not. If we want more data, then a call to qemu_fill_buffer must successfully read at least 1 byte. If we don't want more data, then we would not have triggered any call to qemu_fill_buffer. Thus, a call to qemu_fill_buffer which gets EOF is an error scenario. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] migration: simplify error reporting after channel read 2025-08-04 10:22 ` Daniel P. Berrangé @ 2025-08-04 11:03 ` Prasad Pandit 0 siblings, 0 replies; 17+ messages in thread From: Prasad Pandit @ 2025-08-04 11:03 UTC (permalink / raw) To: Daniel P. Berrangé; +Cc: qemu-devel, Juraj Marcin, Peter Xu, Fabiano Rosas On Mon, 4 Aug 2025 at 15:52, Daniel P. Berrangé <berrange@redhat.com> wrote: > The migration protocol knows whether it is expecting more data or not. > If we want more data, then a call to qemu_fill_buffer must successfully > read at least 1 byte. > If we don't want more data, then we would not have triggered any call > to qemu_fill_buffer. > Thus, a call to qemu_fill_buffer which gets EOF is an error scenario. * I see. In that case the change looks fine. Reviewed-by: Prasad Pandit <pjp@fedoraproject.org> Thank you. --- - Prasad ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] migration: simplify error reporting after channel read 2025-08-01 17:02 ` [PATCH 1/2] migration: simplify error reporting after channel read Daniel P. Berrangé 2025-08-04 10:18 ` Prasad Pandit @ 2025-08-06 0:41 ` Peter Xu 1 sibling, 0 replies; 17+ messages in thread From: Peter Xu @ 2025-08-06 0:41 UTC (permalink / raw) To: Daniel P. Berrangé; +Cc: qemu-devel, Juraj Marcin, Fabiano Rosas On Fri, Aug 01, 2025 at 06:02:11PM +0100, Daniel P. Berrangé wrote: > The code handling the return value of qio_channel_read proceses > len == 0 (EOF) separately from len < 1 (error), but in both > cases ends up calling qemu_file_set_error_obj() with -EIO as the > errno. This logic can be merged into one codepath to simplify it. > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> -- Peter Xu ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/2] migration: fix workaround for gnutls thread safety 2025-08-01 17:02 [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Daniel P. Berrangé 2025-08-01 17:02 ` [PATCH 1/2] migration: simplify error reporting after channel read Daniel P. Berrangé @ 2025-08-01 17:02 ` Daniel P. Berrangé 2025-08-04 10:29 ` Prasad Pandit 2025-08-04 18:13 ` Fabiano Rosas 2025-08-04 17:53 ` [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Juraj Marcin 2 siblings, 2 replies; 17+ messages in thread From: Daniel P. Berrangé @ 2025-08-01 17:02 UTC (permalink / raw) To: qemu-devel; +Cc: Juraj Marcin, Peter Xu, Fabiano Rosas, Daniel P. Berrangé In previous commits eb3618e9 migration: activate TLS thread safety workaround edea8183 io: add support for activating TLS thread safety workaround 24ad5e19 crypto: implement workaround for GNUTLS thread safety problems an attempt was made to workaround broken gnutls thread safety when TLS 1.3 rekeying is performed. Those patches acquired locks before calling gnutls_record_{send|recv} but temporarily dropped the locks in the push/pull functions, in the mistaken belief that there was a race inside gnutls that did not cross execution of the push/pull functions. A non-deterministic reproducer mislead into thinking the workaround was operating as expected, but this was wrong. Juraj demonstrated that QEMU would still see errors from GNUTLS as well as crashes. The issue is that a pointer to internal state is saved before the the push/pull functions are called, and after they return this saved pointer is potentially invalid. IOW, it is never safe to temporarily drop the mutexes inside the push/pull functions. The lock must be held throughout execution of gnutls_record_{send|recv}. This would be possible with QEMU migration, except that the return path thread sits in a blocking read waiting for data that very rarely arrives from the destination QEMU. This blocks ability to send any migration data in the other thread. It is possible to workaround this issue, however, by proactively calling poll() to check for available incoming data before trying the qio_channel_read() call. Reported-by: Juraj Marcin <jmarcin@redhat.com> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> --- crypto/tlssession.c | 16 ---------------- migration/qemu-file.c | 16 ++++++++++++++++ 2 files changed, 16 insertions(+), 16 deletions(-) diff --git a/crypto/tlssession.c b/crypto/tlssession.c index 86d407a142..7e11317528 100644 --- a/crypto/tlssession.c +++ b/crypto/tlssession.c @@ -95,19 +95,11 @@ qcrypto_tls_session_push(void *opaque, const void *buf, size_t len) return -1; }; - if (session->lockEnabled) { - qemu_mutex_unlock(&session->lock); - } - error_free(session->werr); session->werr = NULL; ret = session->writeFunc(buf, len, session->opaque, &session->werr); - if (session->lockEnabled) { - qemu_mutex_lock(&session->lock); - } - if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) { errno = EAGAIN; return -1; @@ -134,16 +126,8 @@ qcrypto_tls_session_pull(void *opaque, void *buf, size_t len) error_free(session->rerr); session->rerr = NULL; - if (session->lockEnabled) { - qemu_mutex_unlock(&session->lock); - } - ret = session->readFunc(buf, len, session->opaque, &session->rerr); - if (session->lockEnabled) { - qemu_mutex_lock(&session->lock); - } - if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) { errno = EAGAIN; return -1; diff --git a/migration/qemu-file.c b/migration/qemu-file.c index 8ee44c5ac9..cf6115e699 100644 --- a/migration/qemu-file.c +++ b/migration/qemu-file.c @@ -338,6 +338,22 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f) return 0; } + /* + * This feature triggers acquisition of mutexes around every + * read and write. Thus we must not sit in a blocking read + * if this is set, but must instead poll proactively. This + * does not work with some channel types, however, so must + * only pre-poll when the featre is set. + */ + if (qio_channel_has_feature(f->ioc, + QIO_CHANNEL_FEATURE_CONCURRENT_IO)) { + if (qemu_in_coroutine()) { + qio_channel_yield(f->ioc, G_IO_IN); + } else { + qio_channel_wait(f->ioc, G_IO_IN); + } + } + do { struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending }; len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0, -- 2.50.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] migration: fix workaround for gnutls thread safety 2025-08-01 17:02 ` [PATCH 2/2] migration: fix workaround for gnutls thread safety Daniel P. Berrangé @ 2025-08-04 10:29 ` Prasad Pandit 2025-08-04 18:13 ` Fabiano Rosas 1 sibling, 0 replies; 17+ messages in thread From: Prasad Pandit @ 2025-08-04 10:29 UTC (permalink / raw) To: Daniel P. Berrangé; +Cc: qemu-devel, Juraj Marcin, Peter Xu, Fabiano Rosas On Sat, 2 Aug 2025 at 00:09, Daniel P. Berrangé <berrange@redhat.com> wrote: > eb3618e9 migration: activate TLS thread safety workaround > edea8183 io: add support for activating TLS thread safety workaround > 24ad5e19 crypto: implement workaround for GNUTLS thread safety problems > > an attempt was made to workaround broken gnutls thread safety when > TLS 1.3 rekeying is performed. > > Those patches acquired locks before calling gnutls_record_{send|recv} > but temporarily dropped the locks in the push/pull functions, in the > mistaken belief that there was a race inside gnutls that did not cross > execution of the push/pull functions. > > A non-deterministic reproducer mislead into thinking the workaround > was operating as expected, but this was wrong. Juraj demonstrated > that QEMU would still see errors from GNUTLS as well as crashes. > > The issue is that a pointer to internal state is saved before the > the push/pull functions are called, and after they return this > saved pointer is potentially invalid. IOW, it is never safe to > temporarily drop the mutexes inside the push/pull functions. The > lock must be held throughout execution of gnutls_record_{send|recv}. > > This would be possible with QEMU migration, except that the return > path thread sits in a blocking read waiting for data that very > rarely arrives from the destination QEMU. This blocks ability to > send any migration data in the other thread. > > It is possible to workaround this issue, however, by proactively > calling poll() to check for available incoming data before trying > the qio_channel_read() call. > > Reported-by: Juraj Marcin <jmarcin@redhat.com> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> > --- > crypto/tlssession.c | 16 ---------------- > migration/qemu-file.c | 16 ++++++++++++++++ > 2 files changed, 16 insertions(+), 16 deletions(-) > > diff --git a/crypto/tlssession.c b/crypto/tlssession.c > index 86d407a142..7e11317528 100644 > --- a/crypto/tlssession.c > +++ b/crypto/tlssession.c > @@ -95,19 +95,11 @@ qcrypto_tls_session_push(void *opaque, const void *buf, size_t len) > return -1; > }; > > - if (session->lockEnabled) { > - qemu_mutex_unlock(&session->lock); > - } > - > error_free(session->werr); > session->werr = NULL; > > ret = session->writeFunc(buf, len, session->opaque, &session->werr); > > - if (session->lockEnabled) { > - qemu_mutex_lock(&session->lock); > - } > - > if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) { > errno = EAGAIN; > return -1; > @@ -134,16 +126,8 @@ qcrypto_tls_session_pull(void *opaque, void *buf, size_t len) > error_free(session->rerr); > session->rerr = NULL; > > - if (session->lockEnabled) { > - qemu_mutex_unlock(&session->lock); > - } > - > ret = session->readFunc(buf, len, session->opaque, &session->rerr); > > - if (session->lockEnabled) { > - qemu_mutex_lock(&session->lock); > - } > - > if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) { > errno = EAGAIN; > return -1; > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > index 8ee44c5ac9..cf6115e699 100644 > --- a/migration/qemu-file.c > +++ b/migration/qemu-file.c > @@ -338,6 +338,22 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f) > return 0; > } > > + /* > + * This feature triggers acquisition of mutexes around every > + * read and write. Thus we must not sit in a blocking read > + * if this is set, but must instead poll proactively. This > + * does not work with some channel types, however, so must > + * only pre-poll when the featre is set. > + */ > + if (qio_channel_has_feature(f->ioc, > + QIO_CHANNEL_FEATURE_CONCURRENT_IO)) { > + if (qemu_in_coroutine()) { > + qio_channel_yield(f->ioc, G_IO_IN); > + } else { > + qio_channel_wait(f->ioc, G_IO_IN); > + } > + } > + > do { > struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending }; > len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0, > -- * Looks okay. Reviewed-by: Prasad Pandit <pjp@fedoraproject.org> Thank you. --- - Prasad ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] migration: fix workaround for gnutls thread safety 2025-08-01 17:02 ` [PATCH 2/2] migration: fix workaround for gnutls thread safety Daniel P. Berrangé 2025-08-04 10:29 ` Prasad Pandit @ 2025-08-04 18:13 ` Fabiano Rosas 1 sibling, 0 replies; 17+ messages in thread From: Fabiano Rosas @ 2025-08-04 18:13 UTC (permalink / raw) To: Daniel P. Berrangé, qemu-devel Cc: Juraj Marcin, Peter Xu, Daniel P. Berrangé Daniel P. Berrangé <berrange@redhat.com> writes: > In previous commits > > eb3618e9 migration: activate TLS thread safety workaround > edea8183 io: add support for activating TLS thread safety workaround > 24ad5e19 crypto: implement workaround for GNUTLS thread safety problems > > an attempt was made to workaround broken gnutls thread safety when > TLS 1.3 rekeying is performed. > > Those patches acquired locks before calling gnutls_record_{send|recv} > but temporarily dropped the locks in the push/pull functions, in the > mistaken belief that there was a race inside gnutls that did not cross > execution of the push/pull functions. > > A non-deterministic reproducer mislead into thinking the workaround > was operating as expected, but this was wrong. Juraj demonstrated > that QEMU would still see errors from GNUTLS as well as crashes. > > The issue is that a pointer to internal state is saved before the > the push/pull functions are called, and after they return this > saved pointer is potentially invalid. IOW, it is never safe to > temporarily drop the mutexes inside the push/pull functions. The > lock must be held throughout execution of gnutls_record_{send|recv}. > > This would be possible with QEMU migration, except that the return > path thread sits in a blocking read waiting for data that very > rarely arrives from the destination QEMU. This blocks ability to > send any migration data in the other thread. > > It is possible to workaround this issue, however, by proactively > calling poll() to check for available incoming data before trying > the qio_channel_read() call. > > Reported-by: Juraj Marcin <jmarcin@redhat.com> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-01 17:02 [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Daniel P. Berrangé 2025-08-01 17:02 ` [PATCH 1/2] migration: simplify error reporting after channel read Daniel P. Berrangé 2025-08-01 17:02 ` [PATCH 2/2] migration: fix workaround for gnutls thread safety Daniel P. Berrangé @ 2025-08-04 17:53 ` Juraj Marcin 2025-08-04 19:27 ` Fabiano Rosas 2 siblings, 1 reply; 17+ messages in thread From: Juraj Marcin @ 2025-08-04 17:53 UTC (permalink / raw) To: Daniel P. Berrangé Cc: qemu-devel, Peter Xu, Fabiano Rosas, Prasad Pandit Hi Daniel, On 2025-08-01 18:02, Daniel P. Berrangé wrote: > This is a followup to previously merged patches that claimed to > workaround the gnutls bug impacting migration, but in fact were > essentially non-functional. Juraj Marcin pointed this out, and > this new patch tweaks the workaround to make it actually do > something useful. > > Daniel P. Berrangé (2): > migration: simplify error reporting after channel read > migration: fix workaround for gnutls thread safety > > crypto/tlssession.c | 16 ---------------- > migration/qemu-file.c | 22 +++++++++++++++++----- > 2 files changed, 17 insertions(+), 21 deletions(-) > thanks for finding a fix for the workaround. I have tested it and it resolves the issue. However, it significantly slows down migration, even with the workaround disabled (and thus no locking). When benchmarking, I used the fixed version of GNUTLS, VM with 20GB of RAM which were fully written to before starting a normal migration with no workload during the migration. Test cases: [1]: before this patchset [2]: with this patchset applied and GNUTLS workaround enabled [2]: with this patchset applied and GNUTLS workaround disabled | Total time | Throughput | Transfered bytes | --+------------+------------+------------------+ 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | (data reported by query-migrate QMP command after migration completion) Therefore, this workaround can be used in environments where we cannot use the fixed version of GNUTLS library, but I think it would be better use polling only if locking is actually used, so the performance with the proper GNUTLS fix is not hindered. Maybe yielding/waiting inside tlssession.c before actually locking? Best regards, Juraj Marcin ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-04 17:53 ` [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Juraj Marcin @ 2025-08-04 19:27 ` Fabiano Rosas 2025-08-05 10:09 ` Daniel P. Berrangé 0 siblings, 1 reply; 17+ messages in thread From: Fabiano Rosas @ 2025-08-04 19:27 UTC (permalink / raw) To: Juraj Marcin, Daniel P. Berrangé; +Cc: qemu-devel, Peter Xu, Prasad Pandit Juraj Marcin <jmarcin@redhat.com> writes: > Hi Daniel, > > On 2025-08-01 18:02, Daniel P. Berrangé wrote: >> This is a followup to previously merged patches that claimed to >> workaround the gnutls bug impacting migration, but in fact were >> essentially non-functional. Juraj Marcin pointed this out, and >> this new patch tweaks the workaround to make it actually do >> something useful. >> >> Daniel P. Berrangé (2): >> migration: simplify error reporting after channel read >> migration: fix workaround for gnutls thread safety >> >> crypto/tlssession.c | 16 ---------------- >> migration/qemu-file.c | 22 +++++++++++++++++----- >> 2 files changed, 17 insertions(+), 21 deletions(-) >> > > thanks for finding a fix for the workaround. I have tested it and it > resolves the issue. > > However, it significantly slows down migration, even with the workaround > disabled (and thus no locking). When benchmarking, I used the fixed > version of GNUTLS, VM with 20GB of RAM which were fully written to > before starting a normal migration with no workload during the > migration. > > Test cases: > [1]: before this patchset > [2]: with this patchset applied and GNUTLS workaround enabled > [2]: with this patchset applied and GNUTLS workaround disabled > > | Total time | Throughput | Transfered bytes | > --+------------+------------+------------------+ > 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | > 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | > 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | Thanks testing this. I had just managed to convince myself that there wouldn't be any performance issues. The yield at every buffer fill on the incoming side is probably way more impactful than the poll on the RP. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-04 19:27 ` Fabiano Rosas @ 2025-08-05 10:09 ` Daniel P. Berrangé 2025-08-05 13:44 ` Fabiano Rosas 0 siblings, 1 reply; 17+ messages in thread From: Daniel P. Berrangé @ 2025-08-05 10:09 UTC (permalink / raw) To: Fabiano Rosas; +Cc: Juraj Marcin, qemu-devel, Peter Xu, Prasad Pandit On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote: > Juraj Marcin <jmarcin@redhat.com> writes: > > > Hi Daniel, > > > > On 2025-08-01 18:02, Daniel P. Berrangé wrote: > >> This is a followup to previously merged patches that claimed to > >> workaround the gnutls bug impacting migration, but in fact were > >> essentially non-functional. Juraj Marcin pointed this out, and > >> this new patch tweaks the workaround to make it actually do > >> something useful. > >> > >> Daniel P. Berrangé (2): > >> migration: simplify error reporting after channel read > >> migration: fix workaround for gnutls thread safety > >> > >> crypto/tlssession.c | 16 ---------------- > >> migration/qemu-file.c | 22 +++++++++++++++++----- > >> 2 files changed, 17 insertions(+), 21 deletions(-) > >> > > > > thanks for finding a fix for the workaround. I have tested it and it > > resolves the issue. > > > > However, it significantly slows down migration, even with the workaround > > disabled (and thus no locking). When benchmarking, I used the fixed > > version of GNUTLS, VM with 20GB of RAM which were fully written to > > before starting a normal migration with no workload during the > > migration. > > > > Test cases: > > [1]: before this patchset > > [2]: with this patchset applied and GNUTLS workaround enabled > > [2]: with this patchset applied and GNUTLS workaround disabled > > > > | Total time | Throughput | Transfered bytes | > > --+------------+------------+------------------+ > > 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | > > 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | > > 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | > > Thanks testing this. I had just managed to convince myself that there > wouldn't be any performance issues. > > The yield at every buffer fill on the incoming side is probably way more > impactful than the poll on the RP. Yeah, that's an unacceptable penalty on the incoming side for sure. How about we simply change the outbound migration channel to be in non-blocking mode ? I originally put it in blocking mode way back in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the QEMUFile impl of qemu_fill_buffer and qemu_fflush, but should be coping with a non-blocking socket. qemu_fill_buffer has explicit code to wait, and qemu_fflush uses the _all() variant whcih has built-in support for waiting. So I'm not seeing an obvious need to run the channel in blocking mode. Using non-blocking will prevent the return path thuread setting in a read() call, so we won't have mutual exclusion between read and write which this patch was trying to avoid ie, this delta on top of this patch: diff --git a/migration/migration.c b/migration/migration.c index 10c216d25d..1eaabc1f19 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in) } migration_rate_set(rate_limit); - qemu_file_set_blocking(s->to_dst_file, true); + qemu_file_set_blocking(s->to_dst_file, false); /* * Open the return path. For postcopy, it is used exclusively. For diff --git a/migration/qemu-file.c b/migration/qemu-file.c index cf6115e699..8ee44c5ac9 100644 --- a/migration/qemu-file.c +++ b/migration/qemu-file.c @@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f) return 0; } - /* - * This feature triggers acquisition of mutexes around every - * read and write. Thus we must not sit in a blocking read - * if this is set, but must instead poll proactively. This - * does not work with some channel types, however, so must - * only pre-poll when the featre is set. - */ - if (qio_channel_has_feature(f->ioc, - QIO_CHANNEL_FEATURE_CONCURRENT_IO)) { - if (qemu_in_coroutine()) { - qio_channel_yield(f->ioc, G_IO_IN); - } else { - qio_channel_wait(f->ioc, G_IO_IN); - } - } - do { struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending }; len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0, With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-05 10:09 ` Daniel P. Berrangé @ 2025-08-05 13:44 ` Fabiano Rosas 2025-08-05 14:18 ` Daniel P. Berrangé 2025-08-05 14:52 ` Juraj Marcin 0 siblings, 2 replies; 17+ messages in thread From: Fabiano Rosas @ 2025-08-05 13:44 UTC (permalink / raw) To: Daniel P. Berrangé; +Cc: Juraj Marcin, qemu-devel, Peter Xu, Prasad Pandit Daniel P. Berrangé <berrange@redhat.com> writes: > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote: >> Juraj Marcin <jmarcin@redhat.com> writes: >> >> > Hi Daniel, >> > >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote: >> >> This is a followup to previously merged patches that claimed to >> >> workaround the gnutls bug impacting migration, but in fact were >> >> essentially non-functional. Juraj Marcin pointed this out, and >> >> this new patch tweaks the workaround to make it actually do >> >> something useful. >> >> >> >> Daniel P. Berrangé (2): >> >> migration: simplify error reporting after channel read >> >> migration: fix workaround for gnutls thread safety >> >> >> >> crypto/tlssession.c | 16 ---------------- >> >> migration/qemu-file.c | 22 +++++++++++++++++----- >> >> 2 files changed, 17 insertions(+), 21 deletions(-) >> >> >> > >> > thanks for finding a fix for the workaround. I have tested it and it >> > resolves the issue. >> > >> > However, it significantly slows down migration, even with the workaround >> > disabled (and thus no locking). When benchmarking, I used the fixed >> > version of GNUTLS, VM with 20GB of RAM which were fully written to >> > before starting a normal migration with no workload during the >> > migration. >> > >> > Test cases: >> > [1]: before this patchset >> > [2]: with this patchset applied and GNUTLS workaround enabled >> > [2]: with this patchset applied and GNUTLS workaround disabled >> > >> > | Total time | Throughput | Transfered bytes | >> > --+------------+------------+------------------+ >> > 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | >> > 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | >> > 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | >> >> Thanks testing this. I had just managed to convince myself that there >> wouldn't be any performance issues. >> >> The yield at every buffer fill on the incoming side is probably way more >> impactful than the poll on the RP. > > Yeah, that's an unacceptable penalty on the incoming side for sure. > > How about we simply change the outbound migration channel to be in > non-blocking mode ? I originally put it in blocking mode way back > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the > QEMUFile impl of qemu_fill_buffer and qemu_fflush, but should be > coping with a non-blocking socket. qemu_fill_buffer has explicit > code to wait, and qemu_fflush uses the _all() variant whcih has > built-in support for waiting. So I'm not seeing an obvious need > to run the channel in blocking mode. > It's definitely simpler and I think it works. It's uncomfortably late though to add a bunch of glib event loop code to the migration thread. Is the suggestion of moving the yield to tlssession.c even feasible? Juraj, are you able to get some numbers with this? > Using non-blocking will prevent the return path thuread setting > in a read() call, so we won't have mutual exclusion between read > and write which this patch was trying to avoid > > ie, this delta on top of this patch: > > diff --git a/migration/migration.c b/migration/migration.c > index 10c216d25d..1eaabc1f19 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in) > } > > migration_rate_set(rate_limit); > - qemu_file_set_blocking(s->to_dst_file, true); > + qemu_file_set_blocking(s->to_dst_file, false); > > /* > * Open the return path. For postcopy, it is used exclusively. For > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > index cf6115e699..8ee44c5ac9 100644 > --- a/migration/qemu-file.c > +++ b/migration/qemu-file.c > @@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f) > return 0; > } > > - /* > - * This feature triggers acquisition of mutexes around every > - * read and write. Thus we must not sit in a blocking read > - * if this is set, but must instead poll proactively. This > - * does not work with some channel types, however, so must > - * only pre-poll when the featre is set. > - */ > - if (qio_channel_has_feature(f->ioc, > - QIO_CHANNEL_FEATURE_CONCURRENT_IO)) { > - if (qemu_in_coroutine()) { > - qio_channel_yield(f->ioc, G_IO_IN); > - } else { > - qio_channel_wait(f->ioc, G_IO_IN); > - } > - } > - > do { > struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending }; > len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0, > > > With regards, > Daniel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-05 13:44 ` Fabiano Rosas @ 2025-08-05 14:18 ` Daniel P. Berrangé 2025-08-05 15:28 ` Fabiano Rosas 2025-08-05 14:52 ` Juraj Marcin 1 sibling, 1 reply; 17+ messages in thread From: Daniel P. Berrangé @ 2025-08-05 14:18 UTC (permalink / raw) To: Fabiano Rosas; +Cc: Juraj Marcin, qemu-devel, Peter Xu, Prasad Pandit On Tue, Aug 05, 2025 at 10:44:41AM -0300, Fabiano Rosas wrote: > Daniel P. Berrangé <berrange@redhat.com> writes: > > > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote: > >> Juraj Marcin <jmarcin@redhat.com> writes: > >> > >> > Hi Daniel, > >> > > >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote: > >> >> This is a followup to previously merged patches that claimed to > >> >> workaround the gnutls bug impacting migration, but in fact were > >> >> essentially non-functional. Juraj Marcin pointed this out, and > >> >> this new patch tweaks the workaround to make it actually do > >> >> something useful. > >> >> > >> >> Daniel P. Berrangé (2): > >> >> migration: simplify error reporting after channel read > >> >> migration: fix workaround for gnutls thread safety > >> >> > >> >> crypto/tlssession.c | 16 ---------------- > >> >> migration/qemu-file.c | 22 +++++++++++++++++----- > >> >> 2 files changed, 17 insertions(+), 21 deletions(-) > >> >> > >> > > >> > thanks for finding a fix for the workaround. I have tested it and it > >> > resolves the issue. > >> > > >> > However, it significantly slows down migration, even with the workaround > >> > disabled (and thus no locking). When benchmarking, I used the fixed > >> > version of GNUTLS, VM with 20GB of RAM which were fully written to > >> > before starting a normal migration with no workload during the > >> > migration. > >> > > >> > Test cases: > >> > [1]: before this patchset > >> > [2]: with this patchset applied and GNUTLS workaround enabled > >> > [2]: with this patchset applied and GNUTLS workaround disabled > >> > > >> > | Total time | Throughput | Transfered bytes | > >> > --+------------+------------+------------------+ > >> > 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | > >> > 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | > >> > 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | > >> > >> Thanks testing this. I had just managed to convince myself that there > >> wouldn't be any performance issues. > >> > >> The yield at every buffer fill on the incoming side is probably way more > >> impactful than the poll on the RP. > > > > Yeah, that's an unacceptable penalty on the incoming side for sure. > > > > How about we simply change the outbound migration channel to be in > > non-blocking mode ? I originally put it in blocking mode way back > > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the > > QEMUFile impl of qemu_fill_buffer and qemu_fflush, but should be > > coping with a non-blocking socket. qemu_fill_buffer has explicit > > code to wait, and qemu_fflush uses the _all() variant whcih has > > built-in support for waiting. So I'm not seeing an obvious need > > to run the channel in blocking mode. > > > > It's definitely simpler and I think it works. It's uncomfortably late > though to add a bunch of glib event loop code to the migration > thread. Is the suggestion of moving the yield to tlssession.c even > feasible? Well that'll remove the burden for the non-TLS incoming migration, but the incoming TLS migration will still have the redundant yields and so still suffer a hit. Given where we are in freeze, I'm thinking we should just hard disable the workaround for this release, and re-attempt it in next cycle and then we can bring it back to stable afterwards. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-05 14:18 ` Daniel P. Berrangé @ 2025-08-05 15:28 ` Fabiano Rosas 0 siblings, 0 replies; 17+ messages in thread From: Fabiano Rosas @ 2025-08-05 15:28 UTC (permalink / raw) To: Daniel P. Berrangé; +Cc: Juraj Marcin, qemu-devel, Peter Xu, Prasad Pandit Daniel P. Berrangé <berrange@redhat.com> writes: > On Tue, Aug 05, 2025 at 10:44:41AM -0300, Fabiano Rosas wrote: >> Daniel P. Berrangé <berrange@redhat.com> writes: >> >> > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote: >> >> Juraj Marcin <jmarcin@redhat.com> writes: >> >> >> >> > Hi Daniel, >> >> > >> >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote: >> >> >> This is a followup to previously merged patches that claimed to >> >> >> workaround the gnutls bug impacting migration, but in fact were >> >> >> essentially non-functional. Juraj Marcin pointed this out, and >> >> >> this new patch tweaks the workaround to make it actually do >> >> >> something useful. >> >> >> >> >> >> Daniel P. Berrangé (2): >> >> >> migration: simplify error reporting after channel read >> >> >> migration: fix workaround for gnutls thread safety >> >> >> >> >> >> crypto/tlssession.c | 16 ---------------- >> >> >> migration/qemu-file.c | 22 +++++++++++++++++----- >> >> >> 2 files changed, 17 insertions(+), 21 deletions(-) >> >> >> >> >> > >> >> > thanks for finding a fix for the workaround. I have tested it and it >> >> > resolves the issue. >> >> > >> >> > However, it significantly slows down migration, even with the workaround >> >> > disabled (and thus no locking). When benchmarking, I used the fixed >> >> > version of GNUTLS, VM with 20GB of RAM which were fully written to >> >> > before starting a normal migration with no workload during the >> >> > migration. >> >> > >> >> > Test cases: >> >> > [1]: before this patchset >> >> > [2]: with this patchset applied and GNUTLS workaround enabled >> >> > [2]: with this patchset applied and GNUTLS workaround disabled >> >> > >> >> > | Total time | Throughput | Transfered bytes | >> >> > --+------------+------------+------------------+ >> >> > 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | >> >> > 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | >> >> > 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | >> >> >> >> Thanks testing this. I had just managed to convince myself that there >> >> wouldn't be any performance issues. >> >> >> >> The yield at every buffer fill on the incoming side is probably way more >> >> impactful than the poll on the RP. >> > >> > Yeah, that's an unacceptable penalty on the incoming side for sure. >> > >> > How about we simply change the outbound migration channel to be in >> > non-blocking mode ? I originally put it in blocking mode way back >> > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the >> > QEMUFile impl of qemu_fill_buffer and qemu_fflush, but should be >> > coping with a non-blocking socket. qemu_fill_buffer has explicit >> > code to wait, and qemu_fflush uses the _all() variant whcih has >> > built-in support for waiting. So I'm not seeing an obvious need >> > to run the channel in blocking mode. >> > >> >> It's definitely simpler and I think it works. It's uncomfortably late >> though to add a bunch of glib event loop code to the migration >> thread. Is the suggestion of moving the yield to tlssession.c even >> feasible? > > Well that'll remove the burden for the non-TLS incoming migration, > but the incoming TLS migration will still have the redundant > yields and so still suffer a hit. > > Given where we are in freeze, I'm thinking we should just hard > disable the workaround for this release, and re-attempt it in > next cycle and then we can bring it back to stable afterwards. > Yes, I agree. Will you send a patch? > With regards, > Daniel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-05 13:44 ` Fabiano Rosas 2025-08-05 14:18 ` Daniel P. Berrangé @ 2025-08-05 14:52 ` Juraj Marcin 2025-08-06 14:54 ` Peter Xu 1 sibling, 1 reply; 17+ messages in thread From: Juraj Marcin @ 2025-08-05 14:52 UTC (permalink / raw) To: Fabiano Rosas, Daniel P. Berrangé Cc: qemu-devel, Peter Xu, Prasad Pandit Hi all! On 2025-08-05 10:44, Fabiano Rosas wrote: > Daniel P. Berrangé <berrange@redhat.com> writes: > > > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote: > >> Juraj Marcin <jmarcin@redhat.com> writes: > >> > >> > Hi Daniel, > >> > > >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote: > >> >> This is a followup to previously merged patches that claimed to > >> >> workaround the gnutls bug impacting migration, but in fact were > >> >> essentially non-functional. Juraj Marcin pointed this out, and > >> >> this new patch tweaks the workaround to make it actually do > >> >> something useful. > >> >> > >> >> Daniel P. Berrangé (2): > >> >> migration: simplify error reporting after channel read > >> >> migration: fix workaround for gnutls thread safety > >> >> > >> >> crypto/tlssession.c | 16 ---------------- > >> >> migration/qemu-file.c | 22 +++++++++++++++++----- > >> >> 2 files changed, 17 insertions(+), 21 deletions(-) > >> >> > >> > > >> > thanks for finding a fix for the workaround. I have tested it and it > >> > resolves the issue. > >> > > >> > However, it significantly slows down migration, even with the workaround > >> > disabled (and thus no locking). When benchmarking, I used the fixed > >> > version of GNUTLS, VM with 20GB of RAM which were fully written to > >> > before starting a normal migration with no workload during the > >> > migration. > >> > > >> > Test cases: > >> > [1]: before this patchset > >> > [2]: with this patchset applied and GNUTLS workaround enabled > >> > [2]: with this patchset applied and GNUTLS workaround disabled > >> > > >> > | Total time | Throughput | Transfered bytes | > >> > --+------------+------------+------------------+ > >> > 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | > >> > 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | > >> > 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | > >> > >> Thanks testing this. I had just managed to convince myself that there > >> wouldn't be any performance issues. > >> > >> The yield at every buffer fill on the incoming side is probably way more > >> impactful than the poll on the RP. > > > > Yeah, that's an unacceptable penalty on the incoming side for sure. > > > > How about we simply change the outbound migration channel to be in > > non-blocking mode ? I originally put it in blocking mode way back > > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the > > QEMUFile impl of qemu_fill_buffer and qemu_fflush, but should be > > coping with a non-blocking socket. qemu_fill_buffer has explicit > > code to wait, and qemu_fflush uses the _all() variant whcih has > > built-in support for waiting. So I'm not seeing an obvious need > > to run the channel in blocking mode. > > > > It's definitely simpler and I think it works. It's uncomfortably late > though to add a bunch of glib event loop code to the migration > thread. Is the suggestion of moving the yield to tlssession.c even > feasible? > > Juraj, are you able to get some numbers with this? Yes, I have tested it with this patch and it still works, now without any noticeable performance regressions. Test cases: [0]: before this patchset with GNUTLS workaround disabled [1]: before this patchset with GNUTLS workaround enabled [2]: with this patchset applied and GNUTLS workaround enabled [2]: with this patchset applied and GNUTLS workaround disabled | Total time | Throughput | Transfered bytes | --+------------+-------------+------------------+ 0 | 32 464 ms | 5 228 mbps | 21 213 283 021 | 1 | 32 429 ms | 5 236 mpbs | 21 224 062 348 | 2 | 32 934 ms | 5 150 mbps | 21 200 558 083 | 3 | 31 547 ms | 5 384 mbps | 21 229 225 792 | Tested-by: Juraj Marcin <jmarcin@redhat.com> Best regards Juraj Marcin > > > Using non-blocking will prevent the return path thuread setting > > in a read() call, so we won't have mutual exclusion between read > > and write which this patch was trying to avoid > > > > ie, this delta on top of this patch: > > > > diff --git a/migration/migration.c b/migration/migration.c > > index 10c216d25d..1eaabc1f19 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in) > > } > > > > migration_rate_set(rate_limit); > > - qemu_file_set_blocking(s->to_dst_file, true); > > + qemu_file_set_blocking(s->to_dst_file, false); > > > > /* > > * Open the return path. For postcopy, it is used exclusively. For > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > > index cf6115e699..8ee44c5ac9 100644 > > --- a/migration/qemu-file.c > > +++ b/migration/qemu-file.c > > @@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f) > > return 0; > > } > > > > - /* > > - * This feature triggers acquisition of mutexes around every > > - * read and write. Thus we must not sit in a blocking read > > - * if this is set, but must instead poll proactively. This > > - * does not work with some channel types, however, so must > > - * only pre-poll when the featre is set. > > - */ > > - if (qio_channel_has_feature(f->ioc, > > - QIO_CHANNEL_FEATURE_CONCURRENT_IO)) { > > - if (qemu_in_coroutine()) { > > - qio_channel_yield(f->ioc, G_IO_IN); > > - } else { > > - qio_channel_wait(f->ioc, G_IO_IN); > > - } > > - } > > - > > do { > > struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending }; > > len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0, > > > > > > With regards, > > Daniel > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional 2025-08-05 14:52 ` Juraj Marcin @ 2025-08-06 14:54 ` Peter Xu 0 siblings, 0 replies; 17+ messages in thread From: Peter Xu @ 2025-08-06 14:54 UTC (permalink / raw) To: Juraj Marcin Cc: Fabiano Rosas, Daniel P. Berrangé, qemu-devel, Prasad Pandit On Tue, Aug 05, 2025 at 04:52:58PM +0200, Juraj Marcin wrote: > Hi all! > > On 2025-08-05 10:44, Fabiano Rosas wrote: > > Daniel P. Berrangé <berrange@redhat.com> writes: > > > > > On Mon, Aug 04, 2025 at 04:27:45PM -0300, Fabiano Rosas wrote: > > >> Juraj Marcin <jmarcin@redhat.com> writes: > > >> > > >> > Hi Daniel, > > >> > > > >> > On 2025-08-01 18:02, Daniel P. Berrangé wrote: > > >> >> This is a followup to previously merged patches that claimed to > > >> >> workaround the gnutls bug impacting migration, but in fact were > > >> >> essentially non-functional. Juraj Marcin pointed this out, and > > >> >> this new patch tweaks the workaround to make it actually do > > >> >> something useful. > > >> >> > > >> >> Daniel P. Berrangé (2): > > >> >> migration: simplify error reporting after channel read > > >> >> migration: fix workaround for gnutls thread safety > > >> >> > > >> >> crypto/tlssession.c | 16 ---------------- > > >> >> migration/qemu-file.c | 22 +++++++++++++++++----- > > >> >> 2 files changed, 17 insertions(+), 21 deletions(-) > > >> >> > > >> > > > >> > thanks for finding a fix for the workaround. I have tested it and it > > >> > resolves the issue. > > >> > > > >> > However, it significantly slows down migration, even with the workaround > > >> > disabled (and thus no locking). When benchmarking, I used the fixed > > >> > version of GNUTLS, VM with 20GB of RAM which were fully written to > > >> > before starting a normal migration with no workload during the > > >> > migration. > > >> > > > >> > Test cases: > > >> > [1]: before this patchset > > >> > [2]: with this patchset applied and GNUTLS workaround enabled > > >> > [2]: with this patchset applied and GNUTLS workaround disabled > > >> > > > >> > | Total time | Throughput | Transfered bytes | > > >> > --+------------+------------+------------------+ > > >> > 1 | 31 192 ms | 5450 mpbs | 21 230 973 763 | > > >> > 2 | 74 147 ms | 2291 mbps | 21 232 849 066 | > > >> > 3 | 72 426 ms | 2343 mbps | 21 215 009 392 | > > >> > > >> Thanks testing this. I had just managed to convince myself that there > > >> wouldn't be any performance issues. > > >> > > >> The yield at every buffer fill on the incoming side is probably way more > > >> impactful than the poll on the RP. > > > > > > Yeah, that's an unacceptable penalty on the incoming side for sure. > > > > > > How about we simply change the outbound migration channel to be in > > > non-blocking mode ? I originally put it in blocking mode way back > > > in 9e4d2b98ee98f4cee50d671e500eceeefa751ee0, but if I look at the > > > QEMUFile impl of qemu_fill_buffer and qemu_fflush, but should be > > > coping with a non-blocking socket. qemu_fill_buffer has explicit > > > code to wait, and qemu_fflush uses the _all() variant whcih has > > > built-in support for waiting. So I'm not seeing an obvious need > > > to run the channel in blocking mode. Sync IOs actually made some sense to me here, since when there's no lock to releaes, it sounds more efficient to wait in the syscall until all the buffers are flushed read/write. That is compared to async where we need to exit to userspace, qio_channel_wait(), retry syscall. Is there any concern where we could in some cases get frequent G_IO_* events, but syscall read/write may be a large buffer so the overhead of async can be measurable (one syscall may trigger multiple loops of wait() and retry)? From Juraj's test results it looks like at least not measurable in the questioned use case. However not sure about the rest. For example, we still have option to only enable async channels for tls, but I'm not sure whether it's necessary, either. > > > > It's definitely simpler and I think it works. It's uncomfortably late > > though to add a bunch of glib event loop code to the migration > > thread. Is the suggestion of moving the yield to tlssession.c even > > feasible? > > > > Juraj, are you able to get some numbers with this? > > Yes, I have tested it with this patch and it still works, now without > any noticeable performance regressions. > > Test cases: > [0]: before this patchset with GNUTLS workaround disabled > [1]: before this patchset with GNUTLS workaround enabled > [2]: with this patchset applied and GNUTLS workaround enabled > [2]: with this patchset applied and GNUTLS workaround disabled > > | Total time | Throughput | Transfered bytes | > --+------------+-------------+------------------+ > 0 | 32 464 ms | 5 228 mbps | 21 213 283 021 | > 1 | 32 429 ms | 5 236 mpbs | 21 224 062 348 | > 2 | 32 934 ms | 5 150 mbps | 21 200 558 083 | > 3 | 31 547 ms | 5 384 mbps | 21 229 225 792 | > > > Tested-by: Juraj Marcin <jmarcin@redhat.com> > > > Best regards > > Juraj Marcin > > > > > > Using non-blocking will prevent the return path thuread setting > > > in a read() call, so we won't have mutual exclusion between read > > > and write which this patch was trying to avoid > > > > > > ie, this delta on top of this patch: > > > > > > diff --git a/migration/migration.c b/migration/migration.c > > > index 10c216d25d..1eaabc1f19 100644 > > > --- a/migration/migration.c > > > +++ b/migration/migration.c > > > @@ -4002,7 +4002,7 @@ void migration_connect(MigrationState *s, Error *error_in) > > > } > > > > > > migration_rate_set(rate_limit); > > > - qemu_file_set_blocking(s->to_dst_file, true); > > > + qemu_file_set_blocking(s->to_dst_file, false); > > > > > > /* > > > * Open the return path. For postcopy, it is used exclusively. For > > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > > > index cf6115e699..8ee44c5ac9 100644 > > > --- a/migration/qemu-file.c > > > +++ b/migration/qemu-file.c > > > @@ -338,22 +338,6 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f) > > > return 0; > > > } > > > > > > - /* > > > - * This feature triggers acquisition of mutexes around every > > > - * read and write. Thus we must not sit in a blocking read > > > - * if this is set, but must instead poll proactively. This > > > - * does not work with some channel types, however, so must > > > - * only pre-poll when the featre is set. > > > - */ > > > - if (qio_channel_has_feature(f->ioc, > > > - QIO_CHANNEL_FEATURE_CONCURRENT_IO)) { > > > - if (qemu_in_coroutine()) { > > > - qio_channel_yield(f->ioc, G_IO_IN); > > > - } else { > > > - qio_channel_wait(f->ioc, G_IO_IN); > > > - } > > > - } > > > - > > > do { > > > struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending }; > > > len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0, > > > > > > > > > With regards, > > > Daniel > > > -- Peter Xu ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-08-06 15:09 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-01 17:02 [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Daniel P. Berrangé 2025-08-01 17:02 ` [PATCH 1/2] migration: simplify error reporting after channel read Daniel P. Berrangé 2025-08-04 10:18 ` Prasad Pandit 2025-08-04 10:22 ` Daniel P. Berrangé 2025-08-04 11:03 ` Prasad Pandit 2025-08-06 0:41 ` Peter Xu 2025-08-01 17:02 ` [PATCH 2/2] migration: fix workaround for gnutls thread safety Daniel P. Berrangé 2025-08-04 10:29 ` Prasad Pandit 2025-08-04 18:13 ` Fabiano Rosas 2025-08-04 17:53 ` [PATCH for-10.1 0/2] migration: actually make gnutls workaround functional Juraj Marcin 2025-08-04 19:27 ` Fabiano Rosas 2025-08-05 10:09 ` Daniel P. Berrangé 2025-08-05 13:44 ` Fabiano Rosas 2025-08-05 14:18 ` Daniel P. Berrangé 2025-08-05 15:28 ` Fabiano Rosas 2025-08-05 14:52 ` Juraj Marcin 2025-08-06 14:54 ` Peter Xu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).