* [PATCH 0/4] migration: workaround GNUTLS live migration crashes
@ 2025-07-18 15:05 Daniel P. Berrangé
2025-07-18 15:05 ` [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems Daniel P. Berrangé
` (5 more replies)
0 siblings, 6 replies; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-18 15:05 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau,
Fabiano Rosas
TL:DR: GNUTLS is liable to crash QEMU when live migration is run
with TLS enabled and a return path channel is present, if approx
64 GB of data is transferred. This is easily triggered in a 16 GB
VM with 4 CPUs, by running 'stress-ng --vm 4 --vm-bytes 80%' to
prevent convergance until 64 GB of RAM has been copied. Then
triggering post-copy switchover, or removing the stress workload
to allow completion, will crash it.
The only live migration scenario that should avoid this danger
is multifd, since the high volume data transfers are handled in
dedicated TCP connections which are unidirectional. The main
bi-directionl TCP connection is only for co-ordination purposes
This patch implements a workaround that will prevent future QEMU
versions from triggering the crash.
The only way to avoid the crash with *existing* running QEMU
processes is to change the TLS cipher priority string to avoid
use of AES with TLS 1.3. This can be done with the 'priority'
field in the 'tls-creds-x509' object.eg
-object tls-creds-x509,id=tls0,priority=NORMAL:-AES-256-GCM:-AES-128-GCM:-AES-128-CCM
which should force the use of CHACHA20-POLY1305 which does not
require TLS re-keying after 16 million sent records (64 GB of
migration data).
https://gitlab.com/qemu-project/qemu/-/issues/1937
On RHEL/Fedora distros you can also use the system wide crypto
priorities to override this from the migration *target* host
by creating /etc/crypto-policies/local.d/gnutls-qemu.config
containing
QEMU=NONE:+ECDHE-RSA:+ECDHE-ECDSA:+RSA:+DHE-RSA:+GROUP-X25519:+GROUP-X448:+GROUP-SECP256R1:+GROUP-SECP384R1:+GROUP-SECP521R1:+GROUP-FF
and running 'update-crypto-policies'. I recommend the QEMU
level 'tls-creds-x509' workaround though, which new libvirt
patches can soon do:
https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/LX5KMIUFZSP5DPUXKJDFYBZI5TIE3E5N/
Daniel P. Berrangé (4):
crypto: implement workaround for GNUTLS thread safety problems
io: add support for activating TLS thread safety workaround
migration: activate TLS thread safety workaround
crypto: add tracing & warning about GNUTLS countermeasures
crypto/tlssession.c | 99 +++++++++++++++++++++++++++++++++--
crypto/trace-events | 2 +
include/crypto/tlssession.h | 14 +++++
include/io/channel.h | 1 +
io/channel-tls.c | 5 ++
meson.build | 9 ++++
meson_options.txt | 2 +
migration/tls.c | 9 ++++
scripts/meson-buildoptions.sh | 5 ++
9 files changed, 143 insertions(+), 3 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems
2025-07-18 15:05 [PATCH 0/4] migration: workaround GNUTLS live migration crashes Daniel P. Berrangé
@ 2025-07-18 15:05 ` Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-21 19:19 ` Fabiano Rosas
2025-07-18 15:05 ` [PATCH 2/4] io: add support for activating TLS thread safety workaround Daniel P. Berrangé
` (4 subsequent siblings)
5 siblings, 2 replies; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-18 15:05 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau,
Fabiano Rosas
When TLS 1.3 is negotiated on a TLS session, GNUTLS will perform
automatic rekeying of the session after 16 million records. This
is done for all algorithms except CHACHA20_POLY1305 which does
not require rekeying.
Unfortunately the rekeying breaks GNUTLS' promise that it is safe
to use a gnutls_session_t object concurrently from multiple threads
if they are exclusively calling gnutls_record_send/recv.
This patch implements a workaround for QEMU that adds a mutex lock
around any gnutls_record_send/recv call to serialize execution
within GNUTLS code. When GNUTLS calls into the push/pull functions
we can release the lock so the OS level I/O calls can at least
have some parallelism.
The big downside of this is that the actual encryption/decryption
code is fully serialized, which will halve performance of that
cipher operations if two threads are contending.
The workaround is not enabled by default, since most use of GNUTLS
in QEMU does not tickle the problem, only non-multifd migration
with a return path open is affected. Fortunately the migration
code also won't trigger the halving of performance, since only
the outbound channel diretion needs to sustain high data rates,
the inbound direction is low volume.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
crypto/tlssession.c | 89 +++++++++++++++++++++++++++++++++--
include/crypto/tlssession.h | 14 ++++++
meson.build | 9 ++++
meson_options.txt | 2 +
scripts/meson-buildoptions.sh | 5 ++
5 files changed, 116 insertions(+), 3 deletions(-)
diff --git a/crypto/tlssession.c b/crypto/tlssession.c
index 6d8f8df623..939f69bdb3 100644
--- a/crypto/tlssession.c
+++ b/crypto/tlssession.c
@@ -19,6 +19,7 @@
*/
#include "qemu/osdep.h"
+#include "qemu/thread.h"
#include "crypto/tlssession.h"
#include "crypto/tlscredsanon.h"
#include "crypto/tlscredspsk.h"
@@ -51,6 +52,14 @@ struct QCryptoTLSSession {
*/
Error *rerr;
Error *werr;
+
+ /*
+ * Used to protect against broken GNUTLS thread safety
+ * https://gitlab.com/gnutls/gnutls/-/issues/1717
+ */
+ bool requireThreadSafety;
+ bool lockEnabled;
+ QemuMutex lock;
};
@@ -69,6 +78,7 @@ qcrypto_tls_session_free(QCryptoTLSSession *session)
g_free(session->peername);
g_free(session->authzid);
object_unref(OBJECT(session->creds));
+ qemu_mutex_destroy(&session->lock);
g_free(session);
}
@@ -84,10 +94,19 @@ qcrypto_tls_session_push(void *opaque, const void *buf, size_t len)
return -1;
};
+ if (session->lockEnabled) {
+ qemu_mutex_unlock(&session->lock);
+ }
+
error_free(session->werr);
session->werr = NULL;
ret = session->writeFunc(buf, len, session->opaque, &session->werr);
+
+ if (session->lockEnabled) {
+ qemu_mutex_lock(&session->lock);
+ }
+
if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) {
errno = EAGAIN;
return -1;
@@ -114,7 +133,16 @@ qcrypto_tls_session_pull(void *opaque, void *buf, size_t len)
error_free(session->rerr);
session->rerr = NULL;
+ if (session->lockEnabled) {
+ qemu_mutex_unlock(&session->lock);
+ }
+
ret = session->readFunc(buf, len, session->opaque, &session->rerr);
+
+ if (session->lockEnabled) {
+ qemu_mutex_lock(&session->lock);
+ }
+
if (ret == QCRYPTO_TLS_SESSION_ERR_BLOCK) {
errno = EAGAIN;
return -1;
@@ -153,6 +181,8 @@ qcrypto_tls_session_new(QCryptoTLSCreds *creds,
session->creds = creds;
object_ref(OBJECT(creds));
+ qemu_mutex_init(&session->lock);
+
if (creds->endpoint != endpoint) {
error_setg(errp, "Credentials endpoint doesn't match session");
goto error;
@@ -289,6 +319,11 @@ qcrypto_tls_session_new(QCryptoTLSCreds *creds,
return NULL;
}
+void qcrypto_tls_session_require_thread_safety(QCryptoTLSSession *sess)
+{
+ sess->requireThreadSafety = true;
+}
+
static int
qcrypto_tls_session_check_certificate(QCryptoTLSSession *session,
Error **errp)
@@ -480,7 +515,17 @@ qcrypto_tls_session_write(QCryptoTLSSession *session,
size_t len,
Error **errp)
{
- ssize_t ret = gnutls_record_send(session->handle, buf, len);
+ ssize_t ret;
+
+ if (session->lockEnabled) {
+ qemu_mutex_lock(&session->lock);
+ }
+
+ ret = gnutls_record_send(session->handle, buf, len);
+
+ if (session->lockEnabled) {
+ qemu_mutex_unlock(&session->lock);
+ }
if (ret < 0) {
if (ret == GNUTLS_E_AGAIN) {
@@ -509,7 +554,17 @@ qcrypto_tls_session_read(QCryptoTLSSession *session,
bool gracefulTermination,
Error **errp)
{
- ssize_t ret = gnutls_record_recv(session->handle, buf, len);
+ ssize_t ret;
+
+ if (session->lockEnabled) {
+ qemu_mutex_lock(&session->lock);
+ }
+
+ ret = gnutls_record_recv(session->handle, buf, len);
+
+ if (session->lockEnabled) {
+ qemu_mutex_unlock(&session->lock);
+ }
if (ret < 0) {
if (ret == GNUTLS_E_AGAIN) {
@@ -545,8 +600,29 @@ int
qcrypto_tls_session_handshake(QCryptoTLSSession *session,
Error **errp)
{
- int ret = gnutls_handshake(session->handle);
+ int ret;
+ ret = gnutls_handshake(session->handle);
+
if (!ret) {
+ gnutls_cipher_algorithm_t cipher =
+ gnutls_cipher_get(session->handle);
+
+#ifdef CONFIG_GNUTLS_BUG1717_WORKAROUND
+ /*
+ * Any use of rekeying in TLS 1.3 is unsafe for
+ * a gnutls with bug 1717, however, we know that
+ * QEMU won't initiate manual rekeying. Thus we
+ * only have to protect against automatic rekeying
+ * which doesn't trigger with CHACHA20
+ */
+ if (session->requireThreadSafety &&
+ gnutls_protocol_get_version(session->handle) ==
+ GNUTLS_TLS1_3 &&
+ cipher != GNUTLS_CIPHER_CHACHA20_POLY1305) {
+ session->lockEnabled = true;
+ }
+#endif
+
session->handshakeComplete = true;
return QCRYPTO_TLS_HANDSHAKE_COMPLETE;
}
@@ -584,8 +660,15 @@ qcrypto_tls_session_bye(QCryptoTLSSession *session, Error **errp)
return 0;
}
+ if (session->lockEnabled) {
+ qemu_mutex_lock(&session->lock);
+ }
ret = gnutls_bye(session->handle, GNUTLS_SHUT_WR);
+ if (session->lockEnabled) {
+ qemu_mutex_unlock(&session->lock);
+ }
+
if (!ret) {
return QCRYPTO_TLS_BYE_COMPLETE;
}
diff --git a/include/crypto/tlssession.h b/include/crypto/tlssession.h
index d77ae0d423..2f62ce2d67 100644
--- a/include/crypto/tlssession.h
+++ b/include/crypto/tlssession.h
@@ -165,6 +165,20 @@ void qcrypto_tls_session_free(QCryptoTLSSession *sess);
G_DEFINE_AUTOPTR_CLEANUP_FUNC(QCryptoTLSSession, qcrypto_tls_session_free)
+/**
+ * qcrypto_tls_session_require_thread_safety:
+ * @sess: the TLS session object
+ *
+ * Mark that this TLS session will require thread safety
+ * for concurrent I/O in both directions. This must be
+ * called before the handshake is performed.
+ *
+ * This will activate a workaround for GNUTLS thread
+ * safety issues, where appropriate for the negotiated
+ * TLS session parameters.
+ */
+void qcrypto_tls_session_require_thread_safety(QCryptoTLSSession *sess);
+
/**
* qcrypto_tls_session_check_credentials:
* @sess: the TLS session object
diff --git a/meson.build b/meson.build
index c2bc3eeedc..e53cd5b413 100644
--- a/meson.build
+++ b/meson.build
@@ -1809,6 +1809,7 @@ endif
gnutls = not_found
gnutls_crypto = not_found
+gnutls_bug1717_workaround = false
if get_option('gnutls').enabled() or (get_option('gnutls').auto() and have_system)
# For general TLS support our min gnutls matches
# that implied by our platform support matrix
@@ -1834,6 +1835,12 @@ if get_option('gnutls').enabled() or (get_option('gnutls').auto() and have_syste
method: 'pkg-config',
required: get_option('gnutls'))
endif
+
+ if gnutls.found() and not get_option('gnutls-bug1717-workaround').disabled()
+ # XXX: when bug 1717 is resolved, add logic to probe for
+ # the GNUTLS fixed version number to handle the 'auto' case
+ gnutls_bug1717_workaround = true
+ endif
endif
# We prefer use of gnutls for crypto, unless the options
@@ -2585,6 +2592,7 @@ config_host_data.set('CONFIG_KEYUTILS', keyutils.found())
config_host_data.set('CONFIG_GETTID', has_gettid)
config_host_data.set('CONFIG_GNUTLS', gnutls.found())
config_host_data.set('CONFIG_GNUTLS_CRYPTO', gnutls_crypto.found())
+config_host_data.set('CONFIG_GNUTLS_BUG1717_WORKAROUND', gnutls_bug1717_workaround)
config_host_data.set('CONFIG_TASN1', tasn1.found())
config_host_data.set('CONFIG_GCRYPT', gcrypt.found())
config_host_data.set('CONFIG_NETTLE', nettle.found())
@@ -4869,6 +4877,7 @@ summary_info += {'TLS priority': get_option('tls_priority')}
summary_info += {'GNUTLS support': gnutls}
if gnutls.found()
summary_info += {' GNUTLS crypto': gnutls_crypto.found()}
+ summary_info += {' GNUTLS bug 1717 workaround': gnutls_bug1717_workaround }
endif
summary_info += {'libgcrypt': gcrypt}
summary_info += {'nettle': nettle}
diff --git a/meson_options.txt b/meson_options.txt
index fff1521e58..dd33530750 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -174,6 +174,8 @@ option('libcbor', type : 'feature', value : 'auto',
description: 'libcbor support')
option('gnutls', type : 'feature', value : 'auto',
description: 'GNUTLS cryptography support')
+option('gnutls-bug1717-workaround', type: 'feature', value : 'auto',
+ description: 'GNUTLS workaround for https://gitlab.com/gnutls/gnutls/-/issues/1717')
option('nettle', type : 'feature', value : 'auto',
description: 'nettle cryptography support')
option('gcrypt', type : 'feature', value : 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index e8504689e8..3493952d18 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -123,6 +123,9 @@ meson_options_help() {
printf "%s\n" ' gio use libgio for D-Bus support'
printf "%s\n" ' glusterfs Glusterfs block device driver'
printf "%s\n" ' gnutls GNUTLS cryptography support'
+ printf "%s\n" ' gnutls-bug1717-workaround'
+ printf "%s\n" ' GNUTLS workaround for'
+ printf "%s\n" ' https://gitlab.com/gnutls/gnutls/-/issues/1717'
printf "%s\n" ' gtk GTK+ user interface'
printf "%s\n" ' gtk-clipboard clipboard support for the gtk UI (EXPERIMENTAL, MAY HANG)'
printf "%s\n" ' guest-agent Build QEMU Guest Agent'
@@ -331,6 +334,8 @@ _meson_option_parse() {
--disable-glusterfs) printf "%s" -Dglusterfs=disabled ;;
--enable-gnutls) printf "%s" -Dgnutls=enabled ;;
--disable-gnutls) printf "%s" -Dgnutls=disabled ;;
+ --enable-gnutls-bug1717-workaround) printf "%s" -Dgnutls-bug1717-workaround=enabled ;;
+ --disable-gnutls-bug1717-workaround) printf "%s" -Dgnutls-bug1717-workaround=disabled ;;
--enable-gtk) printf "%s" -Dgtk=enabled ;;
--disable-gtk) printf "%s" -Dgtk=disabled ;;
--enable-gtk-clipboard) printf "%s" -Dgtk_clipboard=enabled ;;
--
2.50.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 2/4] io: add support for activating TLS thread safety workaround
2025-07-18 15:05 [PATCH 0/4] migration: workaround GNUTLS live migration crashes Daniel P. Berrangé
2025-07-18 15:05 ` [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems Daniel P. Berrangé
@ 2025-07-18 15:05 ` Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-18 15:05 ` [PATCH 3/4] migration: activate " Daniel P. Berrangé
` (3 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-18 15:05 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau,
Fabiano Rosas
Add a QIO_CHANNEL_FEATURE_CONCURRENT_IO feature flag.
If this is set on a QIOChannelTLS session object, the TLS
session will be marked as requiring thread safety, which
will activate the workaround for GNUTLS bug 1717 if needed.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
include/io/channel.h | 1 +
io/channel-tls.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/include/io/channel.h b/include/io/channel.h
index 62b657109c..234e5db70d 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -46,6 +46,7 @@ enum QIOChannelFeature {
QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
QIO_CHANNEL_FEATURE_READ_MSG_PEEK,
QIO_CHANNEL_FEATURE_SEEKABLE,
+ QIO_CHANNEL_FEATURE_CONCURRENT_IO,
};
diff --git a/io/channel-tls.c b/io/channel-tls.c
index db2ac1deae..a8248a9216 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -241,6 +241,11 @@ void qio_channel_tls_handshake(QIOChannelTLS *ioc,
{
QIOTask *task;
+ if (qio_channel_has_feature(QIO_CHANNEL(ioc),
+ QIO_CHANNEL_FEATURE_CONCURRENT_IO)) {
+ qcrypto_tls_session_require_thread_safety(ioc->session);
+ }
+
task = qio_task_new(OBJECT(ioc),
func, opaque, destroy);
--
2.50.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 3/4] migration: activate TLS thread safety workaround
2025-07-18 15:05 [PATCH 0/4] migration: workaround GNUTLS live migration crashes Daniel P. Berrangé
2025-07-18 15:05 ` [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems Daniel P. Berrangé
2025-07-18 15:05 ` [PATCH 2/4] io: add support for activating TLS thread safety workaround Daniel P. Berrangé
@ 2025-07-18 15:05 ` Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-18 15:05 ` [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures Daniel P. Berrangé
` (2 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-18 15:05 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau,
Fabiano Rosas
When either the postcopy or return path capabilities are
enabled, the migration code will use the primary channel
for bidirectional I/O.
If either of those capabilities are enabled, the migration
code needs to mark the channel as expecting concurrent I/O
in order to activate the thread safety workarounds for
GNUTLS bug 1717
Closes: https://gitlab.com/qemu-project/qemu/-/issues/1937
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
migration/tls.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/migration/tls.c b/migration/tls.c
index 5cbf952383..284a6194b2 100644
--- a/migration/tls.c
+++ b/migration/tls.c
@@ -90,6 +90,10 @@ void migration_tls_channel_process_incoming(MigrationState *s,
trace_migration_tls_incoming_handshake_start();
qio_channel_set_name(QIO_CHANNEL(tioc), "migration-tls-incoming");
+ if (migrate_postcopy_ram() || migrate_return_path()) {
+ qio_channel_set_feature(QIO_CHANNEL(tioc),
+ QIO_CHANNEL_FEATURE_CONCURRENT_IO);
+ }
qio_channel_tls_handshake(tioc,
migration_tls_incoming_handshake,
NULL,
@@ -149,6 +153,11 @@ void migration_tls_channel_connect(MigrationState *s,
s->hostname = g_strdup(hostname);
trace_migration_tls_outgoing_handshake_start(hostname);
qio_channel_set_name(QIO_CHANNEL(tioc), "migration-tls-outgoing");
+
+ if (migrate_postcopy_ram() || migrate_return_path()) {
+ qio_channel_set_feature(QIO_CHANNEL(tioc),
+ QIO_CHANNEL_FEATURE_CONCURRENT_IO);
+ }
qio_channel_tls_handshake(tioc,
migration_tls_outgoing_handshake,
s,
--
2.50.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures
2025-07-18 15:05 [PATCH 0/4] migration: workaround GNUTLS live migration crashes Daniel P. Berrangé
` (2 preceding siblings ...)
2025-07-18 15:05 ` [PATCH 3/4] migration: activate " Daniel P. Berrangé
@ 2025-07-18 15:05 ` Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-21 19:32 ` Fabiano Rosas
2025-07-21 14:56 ` [PATCH 0/4] migration: workaround GNUTLS live migration crashes Fabiano Rosas
2025-07-26 6:24 ` Michael Tokarev
5 siblings, 2 replies; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-18 15:05 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau,
Fabiano Rosas
We want some visibility on stderr when the GNUTLS thread
safety countermeasures are activated, to encourage people
to get the real fix deployed (once it exists). Some trace
points will also help if we see any further wierd crash
scenario we've not anticipated.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
crypto/tlssession.c | 10 ++++++++++
crypto/trace-events | 2 ++
2 files changed, 12 insertions(+)
diff --git a/crypto/tlssession.c b/crypto/tlssession.c
index 939f69bdb3..246cd6f7c0 100644
--- a/crypto/tlssession.c
+++ b/crypto/tlssession.c
@@ -615,10 +615,20 @@ qcrypto_tls_session_handshake(QCryptoTLSSession *session,
* only have to protect against automatic rekeying
* which doesn't trigger with CHACHA20
*/
+ trace_qcrypto_tls_session_parameters(
+ session,
+ session->requireThreadSafety,
+ gnutls_protocol_get_version(session->handle),
+ cipher);
+
if (session->requireThreadSafety &&
gnutls_protocol_get_version(session->handle) ==
GNUTLS_TLS1_3 &&
cipher != GNUTLS_CIPHER_CHACHA20_POLY1305) {
+ warn_report("WARNING: activating thread safety countermeasures "
+ "for potentially broken GNUTLS with TLS1.3 cipher=%d",
+ cipher);
+ trace_qcrypto_tls_session_bug1717_workaround(session);
session->lockEnabled = true;
}
#endif
diff --git a/crypto/trace-events b/crypto/trace-events
index bccd0bbf29..d0e33427fa 100644
--- a/crypto/trace-events
+++ b/crypto/trace-events
@@ -21,6 +21,8 @@ qcrypto_tls_creds_x509_load_cert_list(void *creds, const char *file) "TLS creds
# tlssession.c
qcrypto_tls_session_new(void *session, void *creds, const char *hostname, const char *authzid, int endpoint) "TLS session new session=%p creds=%p hostname=%s authzid=%s endpoint=%d"
qcrypto_tls_session_check_creds(void *session, const char *status) "TLS session check creds session=%p status=%s"
+qcrypto_tls_session_parameters(void *session, int threadSafety, int protocol, int cipher) "TLS session parameters session=%p threadSafety=%d protocol=%d cipher=%d"
+qcrypto_tls_session_bug1717_workaround(void *session) "TLS session bug1717 workaround session=%p"
# tls-cipher-suites.c
qcrypto_tls_cipher_suite_priority(const char *name) "priority: %s"
--
2.50.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems
2025-07-18 15:05 ` [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems Daniel P. Berrangé
@ 2025-07-21 14:52 ` Fabiano Rosas
2025-07-21 19:19 ` Fabiano Rosas
1 sibling, 0 replies; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 14:52 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> When TLS 1.3 is negotiated on a TLS session, GNUTLS will perform
> automatic rekeying of the session after 16 million records. This
> is done for all algorithms except CHACHA20_POLY1305 which does
> not require rekeying.
>
> Unfortunately the rekeying breaks GNUTLS' promise that it is safe
> to use a gnutls_session_t object concurrently from multiple threads
> if they are exclusively calling gnutls_record_send/recv.
>
> This patch implements a workaround for QEMU that adds a mutex lock
> around any gnutls_record_send/recv call to serialize execution
> within GNUTLS code. When GNUTLS calls into the push/pull functions
> we can release the lock so the OS level I/O calls can at least
> have some parallelism.
>
> The big downside of this is that the actual encryption/decryption
> code is fully serialized, which will halve performance of that
> cipher operations if two threads are contending.
>
> The workaround is not enabled by default, since most use of GNUTLS
> in QEMU does not tickle the problem, only non-multifd migration
> with a return path open is affected. Fortunately the migration
> code also won't trigger the halving of performance, since only
> the outbound channel diretion needs to sustain high data rates,
> the inbound direction is low volume.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/4] io: add support for activating TLS thread safety workaround
2025-07-18 15:05 ` [PATCH 2/4] io: add support for activating TLS thread safety workaround Daniel P. Berrangé
@ 2025-07-21 14:52 ` Fabiano Rosas
0 siblings, 0 replies; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 14:52 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> Add a QIO_CHANNEL_FEATURE_CONCURRENT_IO feature flag.
>
> If this is set on a QIOChannelTLS session object, the TLS
> session will be marked as requiring thread safety, which
> will activate the workaround for GNUTLS bug 1717 if needed.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/4] migration: activate TLS thread safety workaround
2025-07-18 15:05 ` [PATCH 3/4] migration: activate " Daniel P. Berrangé
@ 2025-07-21 14:52 ` Fabiano Rosas
0 siblings, 0 replies; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 14:52 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> When either the postcopy or return path capabilities are
> enabled, the migration code will use the primary channel
> for bidirectional I/O.
>
> If either of those capabilities are enabled, the migration
> code needs to mark the channel as expecting concurrent I/O
> in order to activate the thread safety workarounds for
> GNUTLS bug 1717
>
> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1937
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures
2025-07-18 15:05 ` [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures Daniel P. Berrangé
@ 2025-07-21 14:52 ` Fabiano Rosas
2025-07-21 19:32 ` Fabiano Rosas
1 sibling, 0 replies; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 14:52 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> We want some visibility on stderr when the GNUTLS thread
> safety countermeasures are activated, to encourage people
> to get the real fix deployed (once it exists). Some trace
> points will also help if we see any further wierd crash
> scenario we've not anticipated.
>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] migration: workaround GNUTLS live migration crashes
2025-07-18 15:05 [PATCH 0/4] migration: workaround GNUTLS live migration crashes Daniel P. Berrangé
` (3 preceding siblings ...)
2025-07-18 15:05 ` [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures Daniel P. Berrangé
@ 2025-07-21 14:56 ` Fabiano Rosas
2025-07-21 15:03 ` Daniel P. Berrangé
2025-07-26 6:24 ` Michael Tokarev
5 siblings, 1 reply; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 14:56 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> TL:DR: GNUTLS is liable to crash QEMU when live migration is run
> with TLS enabled and a return path channel is present, if approx
> 64 GB of data is transferred. This is easily triggered in a 16 GB
> VM with 4 CPUs, by running 'stress-ng --vm 4 --vm-bytes 80%' to
> prevent convergance until 64 GB of RAM has been copied. Then
> triggering post-copy switchover, or removing the stress workload
> to allow completion, will crash it.
>
> The only live migration scenario that should avoid this danger
> is multifd, since the high volume data transfers are handled in
> dedicated TCP connections which are unidirectional. The main
> bi-directionl TCP connection is only for co-ordination purposes
>
> This patch implements a workaround that will prevent future QEMU
> versions from triggering the crash.
>
> The only way to avoid the crash with *existing* running QEMU
> processes is to change the TLS cipher priority string to avoid
> use of AES with TLS 1.3. This can be done with the 'priority'
> field in the 'tls-creds-x509' object.eg
>
> -object tls-creds-x509,id=tls0,priority=NORMAL:-AES-256-GCM:-AES-128-GCM:-AES-128-CCM
>
> which should force the use of CHACHA20-POLY1305 which does not
> require TLS re-keying after 16 million sent records (64 GB of
> migration data).
>
> https://gitlab.com/qemu-project/qemu/-/issues/1937
>
> On RHEL/Fedora distros you can also use the system wide crypto
> priorities to override this from the migration *target* host
> by creating /etc/crypto-policies/local.d/gnutls-qemu.config
> containing
>
> QEMU=NONE:+ECDHE-RSA:+ECDHE-ECDSA:+RSA:+DHE-RSA:+GROUP-X25519:+GROUP-X448:+GROUP-SECP256R1:+GROUP-SECP384R1:+GROUP-SECP521R1:+GROUP-FF
>
> and running 'update-crypto-policies'. I recommend the QEMU
> level 'tls-creds-x509' workaround though, which new libvirt
> patches can soon do:
>
> https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/LX5KMIUFZSP5DPUXKJDFYBZI5TIE3E5N/
>
> Daniel P. Berrangé (4):
> crypto: implement workaround for GNUTLS thread safety problems
> io: add support for activating TLS thread safety workaround
> migration: activate TLS thread safety workaround
> crypto: add tracing & warning about GNUTLS countermeasures
>
> crypto/tlssession.c | 99 +++++++++++++++++++++++++++++++++--
> crypto/trace-events | 2 +
> include/crypto/tlssession.h | 14 +++++
> include/io/channel.h | 1 +
> io/channel-tls.c | 5 ++
> meson.build | 9 ++++
> meson_options.txt | 2 +
> migration/tls.c | 9 ++++
> scripts/meson-buildoptions.sh | 5 ++
> 9 files changed, 143 insertions(+), 3 deletions(-)
Hi, thank you for getting to the bottom of this.
Do you think it would be too cumbersome to add a test for this
somewhere? So we don't regress the workaround but also so the test tells
us whether GNUTLS is fixed.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] migration: workaround GNUTLS live migration crashes
2025-07-21 14:56 ` [PATCH 0/4] migration: workaround GNUTLS live migration crashes Fabiano Rosas
@ 2025-07-21 15:03 ` Daniel P. Berrangé
2025-07-21 15:14 ` Fabiano Rosas
0 siblings, 1 reply; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-21 15:03 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé,
Marc-André Lureau
On Mon, Jul 21, 2025 at 11:56:09AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > TL:DR: GNUTLS is liable to crash QEMU when live migration is run
> > with TLS enabled and a return path channel is present, if approx
> > 64 GB of data is transferred. This is easily triggered in a 16 GB
> > VM with 4 CPUs, by running 'stress-ng --vm 4 --vm-bytes 80%' to
> > prevent convergance until 64 GB of RAM has been copied. Then
> > triggering post-copy switchover, or removing the stress workload
> > to allow completion, will crash it.
> >
> > The only live migration scenario that should avoid this danger
> > is multifd, since the high volume data transfers are handled in
> > dedicated TCP connections which are unidirectional. The main
> > bi-directionl TCP connection is only for co-ordination purposes
> >
> > This patch implements a workaround that will prevent future QEMU
> > versions from triggering the crash.
> >
> > The only way to avoid the crash with *existing* running QEMU
> > processes is to change the TLS cipher priority string to avoid
> > use of AES with TLS 1.3. This can be done with the 'priority'
> > field in the 'tls-creds-x509' object.eg
> >
> > -object tls-creds-x509,id=tls0,priority=NORMAL:-AES-256-GCM:-AES-128-GCM:-AES-128-CCM
> >
> > which should force the use of CHACHA20-POLY1305 which does not
> > require TLS re-keying after 16 million sent records (64 GB of
> > migration data).
> >
> > https://gitlab.com/qemu-project/qemu/-/issues/1937
> >
> > On RHEL/Fedora distros you can also use the system wide crypto
> > priorities to override this from the migration *target* host
> > by creating /etc/crypto-policies/local.d/gnutls-qemu.config
> > containing
> >
> > QEMU=NONE:+ECDHE-RSA:+ECDHE-ECDSA:+RSA:+DHE-RSA:+GROUP-X25519:+GROUP-X448:+GROUP-SECP256R1:+GROUP-SECP384R1:+GROUP-SECP521R1:+GROUP-FF
> >
> > and running 'update-crypto-policies'. I recommend the QEMU
> > level 'tls-creds-x509' workaround though, which new libvirt
> > patches can soon do:
> >
> > https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/LX5KMIUFZSP5DPUXKJDFYBZI5TIE3E5N/
> >
> > Daniel P. Berrangé (4):
> > crypto: implement workaround for GNUTLS thread safety problems
> > io: add support for activating TLS thread safety workaround
> > migration: activate TLS thread safety workaround
> > crypto: add tracing & warning about GNUTLS countermeasures
> >
> > crypto/tlssession.c | 99 +++++++++++++++++++++++++++++++++--
> > crypto/trace-events | 2 +
> > include/crypto/tlssession.h | 14 +++++
> > include/io/channel.h | 1 +
> > io/channel-tls.c | 5 ++
> > meson.build | 9 ++++
> > meson_options.txt | 2 +
> > migration/tls.c | 9 ++++
> > scripts/meson-buildoptions.sh | 5 ++
> > 9 files changed, 143 insertions(+), 3 deletions(-)
>
> Hi, thank you for getting to the bottom of this.
>
> Do you think it would be too cumbersome to add a test for this
> somewhere? So we don't regress the workaround but also so the test tells
> us whether GNUTLS is fixed.
The reproducer scenario is very expensive. I'm doing it with a 16 GB RAM
guest, with 4 CPUs, running 'stress-ng' guest workload. With that, it
takes between 10-20 minutes before live migration gets GNUTLS into the
potentially broken state, and the failure is not 100% guaranteed at
that point.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] migration: workaround GNUTLS live migration crashes
2025-07-21 15:03 ` Daniel P. Berrangé
@ 2025-07-21 15:14 ` Fabiano Rosas
2025-07-21 15:28 ` Daniel P. Berrangé
0 siblings, 1 reply; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 15:14 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé,
Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Mon, Jul 21, 2025 at 11:56:09AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>
>> > TL:DR: GNUTLS is liable to crash QEMU when live migration is run
>> > with TLS enabled and a return path channel is present, if approx
>> > 64 GB of data is transferred. This is easily triggered in a 16 GB
>> > VM with 4 CPUs, by running 'stress-ng --vm 4 --vm-bytes 80%' to
>> > prevent convergance until 64 GB of RAM has been copied. Then
>> > triggering post-copy switchover, or removing the stress workload
>> > to allow completion, will crash it.
>> >
>> > The only live migration scenario that should avoid this danger
>> > is multifd, since the high volume data transfers are handled in
>> > dedicated TCP connections which are unidirectional. The main
>> > bi-directionl TCP connection is only for co-ordination purposes
>> >
>> > This patch implements a workaround that will prevent future QEMU
>> > versions from triggering the crash.
>> >
>> > The only way to avoid the crash with *existing* running QEMU
>> > processes is to change the TLS cipher priority string to avoid
>> > use of AES with TLS 1.3. This can be done with the 'priority'
>> > field in the 'tls-creds-x509' object.eg
>> >
>> > -object tls-creds-x509,id=tls0,priority=NORMAL:-AES-256-GCM:-AES-128-GCM:-AES-128-CCM
>> >
>> > which should force the use of CHACHA20-POLY1305 which does not
>> > require TLS re-keying after 16 million sent records (64 GB of
>> > migration data).
>> >
>> > https://gitlab.com/qemu-project/qemu/-/issues/1937
>> >
>> > On RHEL/Fedora distros you can also use the system wide crypto
>> > priorities to override this from the migration *target* host
>> > by creating /etc/crypto-policies/local.d/gnutls-qemu.config
>> > containing
>> >
>> > QEMU=NONE:+ECDHE-RSA:+ECDHE-ECDSA:+RSA:+DHE-RSA:+GROUP-X25519:+GROUP-X448:+GROUP-SECP256R1:+GROUP-SECP384R1:+GROUP-SECP521R1:+GROUP-FF
>> >
>> > and running 'update-crypto-policies'. I recommend the QEMU
>> > level 'tls-creds-x509' workaround though, which new libvirt
>> > patches can soon do:
>> >
>> > https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/LX5KMIUFZSP5DPUXKJDFYBZI5TIE3E5N/
>> >
>> > Daniel P. Berrangé (4):
>> > crypto: implement workaround for GNUTLS thread safety problems
>> > io: add support for activating TLS thread safety workaround
>> > migration: activate TLS thread safety workaround
>> > crypto: add tracing & warning about GNUTLS countermeasures
>> >
>> > crypto/tlssession.c | 99 +++++++++++++++++++++++++++++++++--
>> > crypto/trace-events | 2 +
>> > include/crypto/tlssession.h | 14 +++++
>> > include/io/channel.h | 1 +
>> > io/channel-tls.c | 5 ++
>> > meson.build | 9 ++++
>> > meson_options.txt | 2 +
>> > migration/tls.c | 9 ++++
>> > scripts/meson-buildoptions.sh | 5 ++
>> > 9 files changed, 143 insertions(+), 3 deletions(-)
>>
>> Hi, thank you for getting to the bottom of this.
>>
>> Do you think it would be too cumbersome to add a test for this
>> somewhere? So we don't regress the workaround but also so the test tells
>> us whether GNUTLS is fixed.
>
> The reproducer scenario is very expensive. I'm doing it with a 16 GB RAM
> guest, with 4 CPUs, running 'stress-ng' guest workload. With that, it
> takes between 10-20 minutes before live migration gets GNUTLS into the
> potentially broken state, and the failure is not 100% guaranteed at
> that point.
>
Makes sense. Thanks.
Will you take the series or should I?
> With regards,
> Daniel
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] migration: workaround GNUTLS live migration crashes
2025-07-21 15:14 ` Fabiano Rosas
@ 2025-07-21 15:28 ` Daniel P. Berrangé
0 siblings, 0 replies; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-21 15:28 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé,
Marc-André Lureau
On Mon, Jul 21, 2025 at 12:14:51PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > On Mon, Jul 21, 2025 at 11:56:09AM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >>
> >> > TL:DR: GNUTLS is liable to crash QEMU when live migration is run
> >> > with TLS enabled and a return path channel is present, if approx
> >> > 64 GB of data is transferred. This is easily triggered in a 16 GB
> >> > VM with 4 CPUs, by running 'stress-ng --vm 4 --vm-bytes 80%' to
> >> > prevent convergance until 64 GB of RAM has been copied. Then
> >> > triggering post-copy switchover, or removing the stress workload
> >> > to allow completion, will crash it.
> >> >
> >> > The only live migration scenario that should avoid this danger
> >> > is multifd, since the high volume data transfers are handled in
> >> > dedicated TCP connections which are unidirectional. The main
> >> > bi-directionl TCP connection is only for co-ordination purposes
> >> >
> >> > This patch implements a workaround that will prevent future QEMU
> >> > versions from triggering the crash.
> >> >
> >> > The only way to avoid the crash with *existing* running QEMU
> >> > processes is to change the TLS cipher priority string to avoid
> >> > use of AES with TLS 1.3. This can be done with the 'priority'
> >> > field in the 'tls-creds-x509' object.eg
> >> >
> >> > -object tls-creds-x509,id=tls0,priority=NORMAL:-AES-256-GCM:-AES-128-GCM:-AES-128-CCM
> >> >
> >> > which should force the use of CHACHA20-POLY1305 which does not
> >> > require TLS re-keying after 16 million sent records (64 GB of
> >> > migration data).
> >> >
> >> > https://gitlab.com/qemu-project/qemu/-/issues/1937
> >> >
> >> > On RHEL/Fedora distros you can also use the system wide crypto
> >> > priorities to override this from the migration *target* host
> >> > by creating /etc/crypto-policies/local.d/gnutls-qemu.config
> >> > containing
> >> >
> >> > QEMU=NONE:+ECDHE-RSA:+ECDHE-ECDSA:+RSA:+DHE-RSA:+GROUP-X25519:+GROUP-X448:+GROUP-SECP256R1:+GROUP-SECP384R1:+GROUP-SECP521R1:+GROUP-FF
> >> >
> >> > and running 'update-crypto-policies'. I recommend the QEMU
> >> > level 'tls-creds-x509' workaround though, which new libvirt
> >> > patches can soon do:
> >> >
> >> > https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/LX5KMIUFZSP5DPUXKJDFYBZI5TIE3E5N/
> >> >
> >> > Daniel P. Berrangé (4):
> >> > crypto: implement workaround for GNUTLS thread safety problems
> >> > io: add support for activating TLS thread safety workaround
> >> > migration: activate TLS thread safety workaround
> >> > crypto: add tracing & warning about GNUTLS countermeasures
> >> >
> >> > crypto/tlssession.c | 99 +++++++++++++++++++++++++++++++++--
> >> > crypto/trace-events | 2 +
> >> > include/crypto/tlssession.h | 14 +++++
> >> > include/io/channel.h | 1 +
> >> > io/channel-tls.c | 5 ++
> >> > meson.build | 9 ++++
> >> > meson_options.txt | 2 +
> >> > migration/tls.c | 9 ++++
> >> > scripts/meson-buildoptions.sh | 5 ++
> >> > 9 files changed, 143 insertions(+), 3 deletions(-)
> >>
> >> Hi, thank you for getting to the bottom of this.
> >>
> >> Do you think it would be too cumbersome to add a test for this
> >> somewhere? So we don't regress the workaround but also so the test tells
> >> us whether GNUTLS is fixed.
> >
> > The reproducer scenario is very expensive. I'm doing it with a 16 GB RAM
> > guest, with 4 CPUs, running 'stress-ng' guest workload. With that, it
> > takes between 10-20 minutes before live migration gets GNUTLS into the
> > potentially broken state, and the failure is not 100% guaranteed at
> > that point.
> >
>
> Makes sense. Thanks.
>
> Will you take the series or should I?
I presume you've already got another migration pull request planned, so
feel free to add this
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems
2025-07-18 15:05 ` [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
@ 2025-07-21 19:19 ` Fabiano Rosas
1 sibling, 0 replies; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 19:19 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> When TLS 1.3 is negotiated on a TLS session, GNUTLS will perform
> automatic rekeying of the session after 16 million records. This
> is done for all algorithms except CHACHA20_POLY1305 which does
> not require rekeying.
>
> Unfortunately the rekeying breaks GNUTLS' promise that it is safe
> to use a gnutls_session_t object concurrently from multiple threads
> if they are exclusively calling gnutls_record_send/recv.
>
> This patch implements a workaround for QEMU that adds a mutex lock
> around any gnutls_record_send/recv call to serialize execution
> within GNUTLS code. When GNUTLS calls into the push/pull functions
> we can release the lock so the OS level I/O calls can at least
> have some parallelism.
>
> The big downside of this is that the actual encryption/decryption
> code is fully serialized, which will halve performance of that
> cipher operations if two threads are contending.
>
> The workaround is not enabled by default, since most use of GNUTLS
> in QEMU does not tickle the problem, only non-multifd migration
> with a return path open is affected. Fortunately the migration
> code also won't trigger the halving of performance, since only
> the outbound channel diretion needs to sustain high data rates,
> the inbound direction is low volume.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Hi, the CI caught a couple of issues. I'll fixup on the branch, no need
to resend.
...
> +void qcrypto_tls_session_require_thread_safety(QCryptoTLSSession *sess)
> +{
> + sess->requireThreadSafety = true;
> +}
Needs a stub.
...
> @@ -545,8 +600,29 @@ int
> qcrypto_tls_session_handshake(QCryptoTLSSession *session,
> Error **errp)
> {
> - int ret = gnutls_handshake(session->handle);
> + int ret;
> + ret = gnutls_handshake(session->handle);
> +
> if (!ret) {
> + gnutls_cipher_algorithm_t cipher =
> + gnutls_cipher_get(session->handle);
> +
Unused var when the config is not enabled.
> +#ifdef CONFIG_GNUTLS_BUG1717_WORKAROUND
> + /*
> + * Any use of rekeying in TLS 1.3 is unsafe for
> + * a gnutls with bug 1717, however, we know that
> + * QEMU won't initiate manual rekeying. Thus we
> + * only have to protect against automatic rekeying
> + * which doesn't trigger with CHACHA20
> + */
> + if (session->requireThreadSafety &&
> + gnutls_protocol_get_version(session->handle) ==
> + GNUTLS_TLS1_3 &&
> + cipher != GNUTLS_CIPHER_CHACHA20_POLY1305) {
> + session->lockEnabled = true;
> + }
> +#endif
> +
> session->handshakeComplete = true;
> return QCRYPTO_TLS_HANDSHAKE_COMPLETE;
> }
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures
2025-07-18 15:05 ` [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
@ 2025-07-21 19:32 ` Fabiano Rosas
1 sibling, 0 replies; 17+ messages in thread
From: Fabiano Rosas @ 2025-07-21 19:32 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Daniel P. Berrangé, Peter Xu,
Philippe Mathieu-Daudé, Marc-André Lureau
Daniel P. Berrangé <berrange@redhat.com> writes:
> We want some visibility on stderr when the GNUTLS thread
> safety countermeasures are activated, to encourage people
> to get the real fix deployed (once it exists). Some trace
> points will also help if we see any further wierd crash
> scenario we've not anticipated.
>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
> crypto/tlssession.c | 10 ++++++++++
> crypto/trace-events | 2 ++
> 2 files changed, 12 insertions(+)
>
> diff --git a/crypto/tlssession.c b/crypto/tlssession.c
> index 939f69bdb3..246cd6f7c0 100644
> --- a/crypto/tlssession.c
> +++ b/crypto/tlssession.c
> @@ -615,10 +615,20 @@ qcrypto_tls_session_handshake(QCryptoTLSSession *session,
> * only have to protect against automatic rekeying
> * which doesn't trigger with CHACHA20
> */
> + trace_qcrypto_tls_session_parameters(
> + session,
> + session->requireThreadSafety,
> + gnutls_protocol_get_version(session->handle),
> + cipher);
> +
> if (session->requireThreadSafety &&
> gnutls_protocol_get_version(session->handle) ==
> GNUTLS_TLS1_3 &&
> cipher != GNUTLS_CIPHER_CHACHA20_POLY1305) {
> + warn_report("WARNING: activating thread safety countermeasures "
And this hit the missing error-report.h weirdness.
> + "for potentially broken GNUTLS with TLS1.3 cipher=%d",
> + cipher);
> + trace_qcrypto_tls_session_bug1717_workaround(session);
> session->lockEnabled = true;
> }
> #endif
> diff --git a/crypto/trace-events b/crypto/trace-events
> index bccd0bbf29..d0e33427fa 100644
> --- a/crypto/trace-events
> +++ b/crypto/trace-events
> @@ -21,6 +21,8 @@ qcrypto_tls_creds_x509_load_cert_list(void *creds, const char *file) "TLS creds
> # tlssession.c
> qcrypto_tls_session_new(void *session, void *creds, const char *hostname, const char *authzid, int endpoint) "TLS session new session=%p creds=%p hostname=%s authzid=%s endpoint=%d"
> qcrypto_tls_session_check_creds(void *session, const char *status) "TLS session check creds session=%p status=%s"
> +qcrypto_tls_session_parameters(void *session, int threadSafety, int protocol, int cipher) "TLS session parameters session=%p threadSafety=%d protocol=%d cipher=%d"
> +qcrypto_tls_session_bug1717_workaround(void *session) "TLS session bug1717 workaround session=%p"
>
> # tls-cipher-suites.c
> qcrypto_tls_cipher_suite_priority(const char *name) "priority: %s"
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] migration: workaround GNUTLS live migration crashes
2025-07-18 15:05 [PATCH 0/4] migration: workaround GNUTLS live migration crashes Daniel P. Berrangé
` (4 preceding siblings ...)
2025-07-21 14:56 ` [PATCH 0/4] migration: workaround GNUTLS live migration crashes Fabiano Rosas
@ 2025-07-26 6:24 ` Michael Tokarev
2025-07-28 9:04 ` Daniel P. Berrangé
5 siblings, 1 reply; 17+ messages in thread
From: Michael Tokarev @ 2025-07-26 6:24 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé,
Marc-André Lureau, Fabiano Rosas
On 18.07.2025 18:05, Daniel P. Berrangé wrote:
> TL:DR: GNUTLS is liable to crash QEMU when live migration is run
> with TLS enabled and a return path channel is present, if approx
> 64 GB of data is transferred. This is easily triggered in a 16 GB
> VM with 4 CPUs, by running 'stress-ng --vm 4 --vm-bytes 80%' to
> prevent convergance until 64 GB of RAM has been copied. Then
> triggering post-copy switchover, or removing the stress workload
> to allow completion, will crash it.
>
> The only live migration scenario that should avoid this danger
> is multifd, since the high volume data transfers are handled in
> dedicated TCP connections which are unidirectional. The main
> bi-directionl TCP connection is only for co-ordination purposes
>
> This patch implements a workaround that will prevent future QEMU
> versions from triggering the crash.
>
> The only way to avoid the crash with *existing* running QEMU
> processes is to change the TLS cipher priority string to avoid
> use of AES with TLS 1.3. This can be done with the 'priority'
> field in the 'tls-creds-x509' object.eg
>
> -object tls-creds-x509,id=tls0,priority=NORMAL:-AES-256-GCM:-AES-128-GCM:-AES-128-CCM
>
> which should force the use of CHACHA20-POLY1305 which does not
> require TLS re-keying after 16 million sent records (64 GB of
> migration data).
>
> https://gitlab.com/qemu-project/qemu/-/issues/1937
>
> On RHEL/Fedora distros you can also use the system wide crypto
> priorities to override this from the migration *target* host
> by creating /etc/crypto-policies/local.d/gnutls-qemu.config
> containing
>
> QEMU=NONE:+ECDHE-RSA:+ECDHE-ECDSA:+RSA:+DHE-RSA:+GROUP-X25519:+GROUP-X448:+GROUP-SECP256R1:+GROUP-SECP384R1:+GROUP-SECP521R1:+GROUP-FF
>
> and running 'update-crypto-policies'. I recommend the QEMU
> level 'tls-creds-x509' workaround though, which new libvirt
> patches can soon do:
>
> https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/LX5KMIUFZSP5DPUXKJDFYBZI5TIE3E5N/
>
> Daniel P. Berrangé (4):
> crypto: implement workaround for GNUTLS thread safety problems
> io: add support for activating TLS thread safety workaround
> migration: activate TLS thread safety workaround
> crypto: add tracing & warning about GNUTLS countermeasures
>
> crypto/tlssession.c | 99 +++++++++++++++++++++++++++++++++--
> crypto/trace-events | 2 +
> include/crypto/tlssession.h | 14 +++++
> include/io/channel.h | 1 +
> io/channel-tls.c | 5 ++
> meson.build | 9 ++++
> meson_options.txt | 2 +
> migration/tls.c | 9 ++++
> scripts/meson-buildoptions.sh | 5 ++
> 9 files changed, 143 insertions(+), 3 deletions(-)
Being a large(ish) change, but it looks like this patch set is a good
candidate for qemu-stable series, at least for 10.0.x. What do you
think?
Thanks,
/mjt
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/4] migration: workaround GNUTLS live migration crashes
2025-07-26 6:24 ` Michael Tokarev
@ 2025-07-28 9:04 ` Daniel P. Berrangé
0 siblings, 0 replies; 17+ messages in thread
From: Daniel P. Berrangé @ 2025-07-28 9:04 UTC (permalink / raw)
To: Michael Tokarev
Cc: qemu-devel, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé,
Marc-André Lureau, Fabiano Rosas
On Sat, Jul 26, 2025 at 09:24:10AM +0300, Michael Tokarev wrote:
> On 18.07.2025 18:05, Daniel P. Berrangé wrote:
> > TL:DR: GNUTLS is liable to crash QEMU when live migration is run
> > with TLS enabled and a return path channel is present, if approx
> > 64 GB of data is transferred. This is easily triggered in a 16 GB
> > VM with 4 CPUs, by running 'stress-ng --vm 4 --vm-bytes 80%' to
> > prevent convergance until 64 GB of RAM has been copied. Then
> > triggering post-copy switchover, or removing the stress workload
> > to allow completion, will crash it.
> >
> > The only live migration scenario that should avoid this danger
> > is multifd, since the high volume data transfers are handled in
> > dedicated TCP connections which are unidirectional. The main
> > bi-directionl TCP connection is only for co-ordination purposes
> >
> > This patch implements a workaround that will prevent future QEMU
> > versions from triggering the crash.
> >
> > The only way to avoid the crash with *existing* running QEMU
> > processes is to change the TLS cipher priority string to avoid
> > use of AES with TLS 1.3. This can be done with the 'priority'
> > field in the 'tls-creds-x509' object.eg
> >
> > -object tls-creds-x509,id=tls0,priority=NORMAL:-AES-256-GCM:-AES-128-GCM:-AES-128-CCM
> >
> > which should force the use of CHACHA20-POLY1305 which does not
> > require TLS re-keying after 16 million sent records (64 GB of
> > migration data).
> >
> > https://gitlab.com/qemu-project/qemu/-/issues/1937
> >
> > On RHEL/Fedora distros you can also use the system wide crypto
> > priorities to override this from the migration *target* host
> > by creating /etc/crypto-policies/local.d/gnutls-qemu.config
> > containing
> >
> > QEMU=NONE:+ECDHE-RSA:+ECDHE-ECDSA:+RSA:+DHE-RSA:+GROUP-X25519:+GROUP-X448:+GROUP-SECP256R1:+GROUP-SECP384R1:+GROUP-SECP521R1:+GROUP-FF
> >
> > and running 'update-crypto-policies'. I recommend the QEMU
> > level 'tls-creds-x509' workaround though, which new libvirt
> > patches can soon do:
> >
> > https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/LX5KMIUFZSP5DPUXKJDFYBZI5TIE3E5N/
> >
> > Daniel P. Berrangé (4):
> > crypto: implement workaround for GNUTLS thread safety problems
> > io: add support for activating TLS thread safety workaround
> > migration: activate TLS thread safety workaround
> > crypto: add tracing & warning about GNUTLS countermeasures
> >
> > crypto/tlssession.c | 99 +++++++++++++++++++++++++++++++++--
> > crypto/trace-events | 2 +
> > include/crypto/tlssession.h | 14 +++++
> > include/io/channel.h | 1 +
> > io/channel-tls.c | 5 ++
> > meson.build | 9 ++++
> > meson_options.txt | 2 +
> > migration/tls.c | 9 ++++
> > scripts/meson-buildoptions.sh | 5 ++
> > 9 files changed, 143 insertions(+), 3 deletions(-)
>
> Being a large(ish) change, but it looks like this patch set is a good
> candidate for qemu-stable series, at least for 10.0.x. What do you
> think?
Yeah, given broken gnutls is everywhere, I'd take it to any stable tree
where it is reasonably easy to do a clean backport.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-07-28 9:05 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-18 15:05 [PATCH 0/4] migration: workaround GNUTLS live migration crashes Daniel P. Berrangé
2025-07-18 15:05 ` [PATCH 1/4] crypto: implement workaround for GNUTLS thread safety problems Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-21 19:19 ` Fabiano Rosas
2025-07-18 15:05 ` [PATCH 2/4] io: add support for activating TLS thread safety workaround Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-18 15:05 ` [PATCH 3/4] migration: activate " Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-18 15:05 ` [PATCH 4/4] crypto: add tracing & warning about GNUTLS countermeasures Daniel P. Berrangé
2025-07-21 14:52 ` Fabiano Rosas
2025-07-21 19:32 ` Fabiano Rosas
2025-07-21 14:56 ` [PATCH 0/4] migration: workaround GNUTLS live migration crashes Fabiano Rosas
2025-07-21 15:03 ` Daniel P. Berrangé
2025-07-21 15:14 ` Fabiano Rosas
2025-07-21 15:28 ` Daniel P. Berrangé
2025-07-26 6:24 ` Michael Tokarev
2025-07-28 9:04 ` Daniel P. Berrangé
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).