* [PATCH 1/6] Revert "netdev: set timeout depending on loadavg"
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
@ 2024-01-04 16:29 ` Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 2/6] Revert "osdep: add getloadavg" Daniel P. Berrangé
` (6 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2024-01-04 16:29 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Philippe Mathieu-Daudé,
Daniel P. Berrangé, Michael S . Tsirkin, Jason Wang,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
This reverts commit cadfc7293977ecadc2d6c48d7cffc553ed2f85f1.
The test was not timing out because of slow execution. It was
timing out due to a race condition leading to the client QEMU
attempting (and fatally failing) to connect before the server
QEMU was listening.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
tests/qtest/netdev-socket.c | 28 +---------------------------
1 file changed, 1 insertion(+), 27 deletions(-)
diff --git a/tests/qtest/netdev-socket.c b/tests/qtest/netdev-socket.c
index bb99d08b5e..7ba1eff120 100644
--- a/tests/qtest/netdev-socket.c
+++ b/tests/qtest/netdev-socket.c
@@ -18,32 +18,6 @@
#define CONNECTION_TIMEOUT 120
-static double connection_timeout(void)
-{
- double load;
- int ret = getloadavg(&load, 1);
-
- /*
- * If we can't get load data, or load is low because we just started
- * running, assume load of 1 (we are alone in this system).
- */
- if (ret < 1 || load < 1.0) {
- load = 1.0;
- }
- /*
- * No one wants to wait more than 10 minutes for this test. Higher load?
- * Too bad.
- */
- if (load > 10.0) {
- fprintf(stderr, "Warning: load %f higher than 10 - test might timeout\n",
- load);
- load = 10.0;
- }
-
- /* if load is high increase timeout as we might not get a chance to run */
- return load * CONNECTION_TIMEOUT;
-}
-
#define EXPECT_STATE(q, e, t) \
do { \
char *resp = NULL; \
@@ -57,7 +31,7 @@ do { \
if (g_str_equal(resp, e)) { \
break; \
} \
- } while (g_test_timer_elapsed() < connection_timeout()); \
+ } while (g_test_timer_elapsed() < CONNECTION_TIMEOUT); \
g_assert_cmpstr(resp, ==, e); \
g_free(resp); \
} while (0)
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/6] Revert "osdep: add getloadavg"
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 1/6] Revert "netdev: set timeout depending on loadavg" Daniel P. Berrangé
@ 2024-01-04 16:29 ` Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 3/6] Revert "tests/qtest/netdev-socket: Raise connection timeout to 120 seconds" Daniel P. Berrangé
` (5 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2024-01-04 16:29 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Philippe Mathieu-Daudé,
Daniel P. Berrangé, Michael S . Tsirkin, Jason Wang,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
This reverts commit dc864d3a3777424187280e50c9bfb84dced54f12.
This functionality is not required after the previous revert
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
include/qemu/osdep.h | 10 ----------
meson.build | 1 -
2 files changed, 11 deletions(-)
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index d30ba73eda..475a1c62ff 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -779,16 +779,6 @@ static inline int platform_does_not_support_system(const char *command)
}
#endif /* !HAVE_SYSTEM_FUNCTION */
-/**
- * If the load average was unobtainable, -1 is returned
- */
-#ifndef HAVE_GETLOADAVG_FUNCTION
-static inline int getloadavg(double loadavg[], int nelem)
-{
- return -1;
-}
-#endif /* !HAVE_GETLOADAVG_FUNCTION */
-
#ifdef __cplusplus
}
#endif
diff --git a/meson.build b/meson.build
index 6c77d9687d..67f4ede8ae 100644
--- a/meson.build
+++ b/meson.build
@@ -2296,7 +2296,6 @@ config_host_data.set('HAVE_GLIB_WITH_SLICE_ALLOCATOR', glib_has_gslice)
config_host_data.set('HAVE_OPENPTY', cc.has_function('openpty', dependencies: util))
config_host_data.set('HAVE_STRCHRNUL', cc.has_function('strchrnul'))
config_host_data.set('HAVE_SYSTEM_FUNCTION', cc.has_function('system', prefix: '#include <stdlib.h>'))
-config_host_data.set('HAVE_GETLOADAVG_FUNCTION', cc.has_function('getloadavg', prefix: '#include <stdlib.h>'))
if rbd.found()
config_host_data.set('HAVE_RBD_NAMESPACE_EXISTS',
cc.has_function('rbd_namespace_exists',
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/6] Revert "tests/qtest/netdev-socket: Raise connection timeout to 120 seconds"
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 1/6] Revert "netdev: set timeout depending on loadavg" Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 2/6] Revert "osdep: add getloadavg" Daniel P. Berrangé
@ 2024-01-04 16:29 ` Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 4/6] net: add explicit info about connecting/listening state Daniel P. Berrangé
` (4 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2024-01-04 16:29 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Philippe Mathieu-Daudé,
Daniel P. Berrangé, Michael S . Tsirkin, Jason Wang,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
This reverts commit 0daaf2761f6d268ffaa2d01d450e202e127452b1.
The test was not timing out because of slow execution. It was
timing out due to a race condition leading to the client QEMU
attempting (and fatally failing) to connect before the server
QEMU was listening.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
tests/qtest/netdev-socket.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tests/qtest/netdev-socket.c b/tests/qtest/netdev-socket.c
index 7ba1eff120..3fc2ac26d0 100644
--- a/tests/qtest/netdev-socket.c
+++ b/tests/qtest/netdev-socket.c
@@ -16,7 +16,7 @@
#include "qapi/qobject-input-visitor.h"
#include "qapi/qapi-visit-sockets.h"
-#define CONNECTION_TIMEOUT 120
+#define CONNECTION_TIMEOUT 60
#define EXPECT_STATE(q, e, t) \
do { \
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/6] net: add explicit info about connecting/listening state
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
` (2 preceding siblings ...)
2024-01-04 16:29 ` [PATCH 3/6] Revert "tests/qtest/netdev-socket: Raise connection timeout to 120 seconds" Daniel P. Berrangé
@ 2024-01-04 16:29 ` Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 5/6] net: handle QIOTask completion to report useful error message Daniel P. Berrangé
` (3 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2024-01-04 16:29 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Philippe Mathieu-Daudé,
Daniel P. Berrangé, Michael S . Tsirkin, Jason Wang,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
When running 'info network', if the stream backend is still in
the process of connecting, or waiting for an incoming connection,
no information is displayed.
There is also no way to distinguish whether the server is still
in the process of setting up the listener socket, or whether it
is ready to accept incoming client connections.
This leads to a race condition in the netdev-socket qtest which
launches a server process followed by a client process. Under
high load conditions it is possible for the client to attempt
to connect before the server is accepting clients. For the
scenarios which do not set the 'reconnect' option, this opens
up a race which can lead to the test scenario failing to reach
the expected state.
Now that 'info network' can distinguish between initialization
phase and the listening phase, the netdev-socket qtest will
correctly synchronize, such that the client QEMU is not spawned
until the server is ready.
This should solve the non-deterministic failures seen with the
netdev-socket qtest.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
net/stream.c | 5 ++++-
tests/qtest/netdev-socket.c | 10 +++++-----
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/net/stream.c b/net/stream.c
index 9204b4c96e..0defb21d45 100644
--- a/net/stream.c
+++ b/net/stream.c
@@ -173,7 +173,7 @@ static gboolean net_stream_send(QIOChannel *ioc,
net_socket_rs_init(&s->rs, net_stream_rs_finalize, false);
s->nc.link_down = true;
- qemu_set_info_str(&s->nc, "%s", "");
+ qemu_set_info_str(&s->nc, "listening");
qapi_event_send_netdev_stream_disconnected(s->nc.name);
net_stream_arm_reconnect(s);
@@ -292,6 +292,7 @@ static void net_stream_server_listening(QIOTask *task, gpointer opaque)
s->nc.link_down = true;
s->listener = qio_net_listener_new();
+ qemu_set_info_str(&s->nc, "listening");
net_socket_rs_init(&s->rs, net_stream_rs_finalize, false);
qio_net_listener_set_client_func(s->listener, net_stream_listen, s, NULL);
qio_net_listener_add(s->listener, listen_sioc);
@@ -309,6 +310,7 @@ static int net_stream_server_init(NetClientState *peer,
nc = qemu_new_net_client(&net_stream_info, peer, model, name);
s = DO_UPCAST(NetStreamState, nc, nc);
+ qemu_set_info_str(&s->nc, "initializing");
s->listen_ioc = QIO_CHANNEL(listen_sioc);
qio_channel_socket_listen_async(listen_sioc, addr, 0,
@@ -400,6 +402,7 @@ static int net_stream_client_init(NetClientState *peer,
nc = qemu_new_net_client(&net_stream_info, peer, model, name);
s = DO_UPCAST(NetStreamState, nc, nc);
+ qemu_set_info_str(&s->nc, "connecting");
s->ioc = QIO_CHANNEL(sioc);
s->nc.link_down = true;
diff --git a/tests/qtest/netdev-socket.c b/tests/qtest/netdev-socket.c
index 3fc2ac26d0..91441f7922 100644
--- a/tests/qtest/netdev-socket.c
+++ b/tests/qtest/netdev-socket.c
@@ -127,7 +127,7 @@ static void test_stream_inet_ipv4(void)
"addr.ipv4=on,addr.ipv6=off,"
"addr.host=127.0.0.1,addr.port=%d", port);
- EXPECT_STATE(qts0, "st0: index=0,type=stream,\r\n", 0);
+ EXPECT_STATE(qts0, "st0: index=0,type=stream,listening\r\n", 0);
qts1 = qtest_initf("-nodefaults -M none "
"-netdev stream,server=false,id=st0,addr.type=inet,"
@@ -200,7 +200,7 @@ static void test_stream_unix_reconnect(void)
"-netdev stream,id=st0,server=true,addr.type=unix,"
"addr.path=%s", path);
- EXPECT_STATE(qts0, "st0: index=0,type=stream,\r\n", 0);
+ EXPECT_STATE(qts0, "st0: index=0,type=stream,listening\r\n", 0);
qts1 = qtest_initf("-nodefaults -M none "
"-netdev stream,server=false,id=st0,addr.type=unix,"
@@ -250,7 +250,7 @@ static void test_stream_inet_ipv6(void)
"addr.ipv4=off,addr.ipv6=on,"
"addr.host=::1,addr.port=%d", port);
- EXPECT_STATE(qts0, "st0: index=0,type=stream,\r\n", 0);
+ EXPECT_STATE(qts0, "st0: index=0,type=stream,listening\r\n", 0);
qts1 = qtest_initf("-nodefaults -M none "
"-netdev stream,server=false,id=st0,addr.type=inet,"
@@ -282,7 +282,7 @@ static void test_stream_unix(void)
"addr.type=unix,addr.path=%s,",
path);
- EXPECT_STATE(qts0, "st0: index=0,type=stream,\r\n", 0);
+ EXPECT_STATE(qts0, "st0: index=0,type=stream,listening\r\n", 0);
qts1 = qtest_initf("-nodefaults -M none "
"-netdev stream,id=st0,server=false,"
@@ -314,7 +314,7 @@ static void test_stream_unix_abstract(void)
"addr.abstract=on",
path);
- EXPECT_STATE(qts0, "st0: index=0,type=stream,\r\n", 0);
+ EXPECT_STATE(qts0, "st0: index=0,type=stream,listening\r\n", 0);
qts1 = qtest_initf("-nodefaults -M none "
"-netdev stream,id=st0,server=false,"
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 5/6] net: handle QIOTask completion to report useful error message
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
` (3 preceding siblings ...)
2024-01-04 16:29 ` [PATCH 4/6] net: add explicit info about connecting/listening state Daniel P. Berrangé
@ 2024-01-04 16:29 ` Daniel P. Berrangé
2024-01-04 16:29 ` [PATCH 6/6] qtest: ensure netdev-socket tests have non-overlapping names Daniel P. Berrangé
` (2 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2024-01-04 16:29 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Philippe Mathieu-Daudé,
Daniel P. Berrangé, Michael S . Tsirkin, Jason Wang,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
The network stream backend uses the async QIO socket APIs for listening
and connecting sockets. It does not check the task object completion
status, however, instead just looking at whether the socket FD is -1
or not.
By checking the task completion, we can set a useful error message for
users instead of the non-actionable "connection error" string.
eg so users will see:
(qemu) info network
net: index=0,type=stream,error: Failed to connect to '/foo.unix': No such file or directory
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
net/stream.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/net/stream.c b/net/stream.c
index 0defb21d45..97e6ec6679 100644
--- a/net/stream.c
+++ b/net/stream.c
@@ -165,6 +165,7 @@ static gboolean net_stream_send(QIOChannel *ioc,
s->ioc_write_tag = 0;
}
if (s->listener) {
+ qemu_set_info_str(&s->nc, "listening");
qio_net_listener_set_client_func(s->listener, net_stream_listen,
s, NULL);
}
@@ -173,7 +174,6 @@ static gboolean net_stream_send(QIOChannel *ioc,
net_socket_rs_init(&s->rs, net_stream_rs_finalize, false);
s->nc.link_down = true;
- qemu_set_info_str(&s->nc, "listening");
qapi_event_send_netdev_stream_disconnected(s->nc.name);
net_stream_arm_reconnect(s);
@@ -272,9 +272,11 @@ static void net_stream_server_listening(QIOTask *task, gpointer opaque)
QIOChannelSocket *listen_sioc = QIO_CHANNEL_SOCKET(s->listen_ioc);
SocketAddress *addr;
int ret;
+ Error *err = NULL;
- if (listen_sioc->fd < 0) {
- qemu_set_info_str(&s->nc, "connection error");
+ if (qio_task_propagate_error(task, &err)) {
+ qemu_set_info_str(&s->nc, "error: %s", error_get_pretty(err));
+ error_free(err);
return;
}
@@ -327,9 +329,11 @@ static void net_stream_client_connected(QIOTask *task, gpointer opaque)
SocketAddress *addr;
gchar *uri;
int ret;
+ Error *err = NULL;
- if (sioc->fd < 0) {
- qemu_set_info_str(&s->nc, "connection error");
+ if (qio_task_propagate_error(task, &err)) {
+ qemu_set_info_str(&s->nc, "error: %s", error_get_pretty(err));
+ error_free(err);
goto error;
}
@@ -384,6 +388,7 @@ static gboolean net_stream_reconnect(gpointer data)
static void net_stream_arm_reconnect(NetStreamState *s)
{
if (s->reconnect && s->timer_tag == 0) {
+ qemu_set_info_str(&s->nc, "connecting");
s->timer_tag = g_timeout_add_seconds(s->reconnect,
net_stream_reconnect, s);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 6/6] qtest: ensure netdev-socket tests have non-overlapping names
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
` (4 preceding siblings ...)
2024-01-04 16:29 ` [PATCH 5/6] net: handle QIOTask completion to report useful error message Daniel P. Berrangé
@ 2024-01-04 16:29 ` Daniel P. Berrangé
2024-01-04 17:47 ` Philippe Mathieu-Daudé
2024-01-04 16:45 ` [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Stefan Hajnoczi
2024-01-09 13:49 ` Daniel P. Berrangé
7 siblings, 1 reply; 13+ messages in thread
From: Daniel P. Berrangé @ 2024-01-04 16:29 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Philippe Mathieu-Daudé,
Daniel P. Berrangé, Michael S . Tsirkin, Jason Wang,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
When naming glib tests if the name of one test is a substring of the
name of another test, it is not possible to use the '-p /the/name'
option to run a single test.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
tests/qtest/netdev-socket.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tests/qtest/netdev-socket.c b/tests/qtest/netdev-socket.c
index 91441f7922..fc7d11961e 100644
--- a/tests/qtest/netdev-socket.c
+++ b/tests/qtest/netdev-socket.c
@@ -526,7 +526,7 @@ int main(int argc, char **argv)
#ifndef _WIN32
qtest_add_func("/netdev/dgram/unix", test_dgram_unix);
#endif
- qtest_add_func("/netdev/stream/unix", test_stream_unix);
+ qtest_add_func("/netdev/stream/unix/oneshot", test_stream_unix);
qtest_add_func("/netdev/stream/unix/reconnect",
test_stream_unix_reconnect);
#ifdef CONFIG_LINUX
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 6/6] qtest: ensure netdev-socket tests have non-overlapping names
2024-01-04 16:29 ` [PATCH 6/6] qtest: ensure netdev-socket tests have non-overlapping names Daniel P. Berrangé
@ 2024-01-04 17:47 ` Philippe Mathieu-Daudé
0 siblings, 0 replies; 13+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-01-04 17:47 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel
Cc: Stefan Hajnoczi, Michael S . Tsirkin, Jason Wang, Laurent Vivier,
Thomas Huth, Marc-André Lureau, Paolo Bonzini
On 4/1/24 17:29, Daniel P. Berrangé wrote:
> When naming glib tests if the name of one test is a substring of the
> name of another test, it is not possible to use the '-p /the/name'
> option to run a single test.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
> tests/qtest/netdev-socket.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
` (5 preceding siblings ...)
2024-01-04 16:29 ` [PATCH 6/6] qtest: ensure netdev-socket tests have non-overlapping names Daniel P. Berrangé
@ 2024-01-04 16:45 ` Stefan Hajnoczi
2024-01-15 2:36 ` Jason Wang
2024-01-09 13:49 ` Daniel P. Berrangé
7 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2024-01-04 16:45 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, Stefan Hajnoczi, Philippe Mathieu-Daudé,
Michael S . Tsirkin, Jason Wang, Laurent Vivier, Thomas Huth,
Marc-André Lureau, Paolo Bonzini
On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> We've previously bumped up the timeouts in the netdev-socket qtest
> to supposedly fix non-deterministic failures, however, the failures
> are still hitting CI.
>
> A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> should be very quick to execute, even under high system load, so it
> was never likely that the test was failing due to timeouts being
> reached.
>
> The actual root cause was a race condition in the test design. It
> was spawning a QEMU with a 'server' netdev, and then spawning one
> with the 'client' netdev. There was insufficient synchronization,
> however, so it was possible for the 2nd QEMU process to attempt
> to 'connect()' before the 'listen()' call was made by the 1st QEMU.
>
> In the test scenarios that did not use the 'reconnect' flag, this
> would result in the client QEMU never getting into the expected
> state. The test code would thus loop on 'info network' until
> hitting the maximum wait time.
>
> This series reverts the increased timeouts, and fixes synchronization
> in the test scenarios. It also improves reporting of errors in the
> socket netdev backend so that 'info network' reports what actually
> went wrong rather than a useless generic 'connection error' string.
> This will help us diagnose any future CI problems, should they occurr.
>
> Daniel P. Berrangé (6):
> Revert "netdev: set timeout depending on loadavg"
> Revert "osdep: add getloadavg"
> Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> seconds"
> net: add explicit info about connecting/listening state
> net: handle QIOTask completion to report useful error message
> qtest: ensure netdev-socket tests have non-overlapping names
>
> include/qemu/osdep.h | 10 ---------
> meson.build | 1 -
> net/stream.c | 18 +++++++++++-----
> tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> 4 files changed, 21 insertions(+), 50 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest
2024-01-04 16:45 ` [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Stefan Hajnoczi
@ 2024-01-15 2:36 ` Jason Wang
2024-01-15 10:19 ` Peter Maydell
0 siblings, 1 reply; 13+ messages in thread
From: Jason Wang @ 2024-01-15 2:36 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Daniel P. Berrangé, qemu-devel, Stefan Hajnoczi,
Philippe Mathieu-Daudé, Michael S . Tsirkin, Laurent Vivier,
Thomas Huth, Marc-André Lureau, Paolo Bonzini
On Fri, Jan 5, 2024 at 12:45 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > We've previously bumped up the timeouts in the netdev-socket qtest
> > to supposedly fix non-deterministic failures, however, the failures
> > are still hitting CI.
> >
> > A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> > should be very quick to execute, even under high system load, so it
> > was never likely that the test was failing due to timeouts being
> > reached.
> >
> > The actual root cause was a race condition in the test design. It
> > was spawning a QEMU with a 'server' netdev, and then spawning one
> > with the 'client' netdev. There was insufficient synchronization,
> > however, so it was possible for the 2nd QEMU process to attempt
> > to 'connect()' before the 'listen()' call was made by the 1st QEMU.
> >
> > In the test scenarios that did not use the 'reconnect' flag, this
> > would result in the client QEMU never getting into the expected
> > state. The test code would thus loop on 'info network' until
> > hitting the maximum wait time.
> >
> > This series reverts the increased timeouts, and fixes synchronization
> > in the test scenarios. It also improves reporting of errors in the
> > socket netdev backend so that 'info network' reports what actually
> > went wrong rather than a useless generic 'connection error' string.
> > This will help us diagnose any future CI problems, should they occurr.
> >
> > Daniel P. Berrangé (6):
> > Revert "netdev: set timeout depending on loadavg"
> > Revert "osdep: add getloadavg"
> > Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> > seconds"
> > net: add explicit info about connecting/listening state
> > net: handle QIOTask completion to report useful error message
> > qtest: ensure netdev-socket tests have non-overlapping names
> >
> > include/qemu/osdep.h | 10 ---------
> > meson.build | 1 -
> > net/stream.c | 18 +++++++++++-----
> > tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> > 4 files changed, 21 insertions(+), 50 deletions(-)
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>
Queued.
Thanks
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest
2024-01-15 2:36 ` Jason Wang
@ 2024-01-15 10:19 ` Peter Maydell
2024-01-16 1:05 ` Jason Wang
0 siblings, 1 reply; 13+ messages in thread
From: Peter Maydell @ 2024-01-15 10:19 UTC (permalink / raw)
To: Jason Wang
Cc: Stefan Hajnoczi, Daniel P. Berrangé, qemu-devel,
Stefan Hajnoczi, Philippe Mathieu-Daudé, Michael S . Tsirkin,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
On Mon, 15 Jan 2024 at 02:37, Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 5, 2024 at 12:45 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > We've previously bumped up the timeouts in the netdev-socket qtest
> > > to supposedly fix non-deterministic failures, however, the failures
> > > are still hitting CI.
> > >
> > > A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> > > should be very quick to execute, even under high system load, so it
> > > was never likely that the test was failing due to timeouts being
> > > reached.
> > >
> > > The actual root cause was a race condition in the test design. It
> > > was spawning a QEMU with a 'server' netdev, and then spawning one
> > > with the 'client' netdev. There was insufficient synchronization,
> > > however, so it was possible for the 2nd QEMU process to attempt
> > > to 'connect()' before the 'listen()' call was made by the 1st QEMU.
> > >
> > > In the test scenarios that did not use the 'reconnect' flag, this
> > > would result in the client QEMU never getting into the expected
> > > state. The test code would thus loop on 'info network' until
> > > hitting the maximum wait time.
> > >
> > > This series reverts the increased timeouts, and fixes synchronization
> > > in the test scenarios. It also improves reporting of errors in the
> > > socket netdev backend so that 'info network' reports what actually
> > > went wrong rather than a useless generic 'connection error' string.
> > > This will help us diagnose any future CI problems, should they occurr.
> > >
> > > Daniel P. Berrangé (6):
> > > Revert "netdev: set timeout depending on loadavg"
> > > Revert "osdep: add getloadavg"
> > > Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> > > seconds"
> > > net: add explicit info about connecting/listening state
> > > net: handle QIOTask completion to report useful error message
> > > qtest: ensure netdev-socket tests have non-overlapping names
> > >
> > > include/qemu/osdep.h | 10 ---------
> > > meson.build | 1 -
> > > net/stream.c | 18 +++++++++++-----
> > > tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> > > 4 files changed, 21 insertions(+), 50 deletions(-)
> >
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> >
>
> Queued.
These are already upstream, via thuth's testing queue.
thanks
-- PMM
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest
2024-01-15 10:19 ` Peter Maydell
@ 2024-01-16 1:05 ` Jason Wang
0 siblings, 0 replies; 13+ messages in thread
From: Jason Wang @ 2024-01-16 1:05 UTC (permalink / raw)
To: Peter Maydell
Cc: Stefan Hajnoczi, Daniel P. Berrangé, qemu-devel,
Stefan Hajnoczi, Philippe Mathieu-Daudé, Michael S . Tsirkin,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
On Mon, Jan 15, 2024 at 6:19 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 15 Jan 2024 at 02:37, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jan 5, 2024 at 12:45 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > We've previously bumped up the timeouts in the netdev-socket qtest
> > > > to supposedly fix non-deterministic failures, however, the failures
> > > > are still hitting CI.
> > > >
> > > > A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> > > > should be very quick to execute, even under high system load, so it
> > > > was never likely that the test was failing due to timeouts being
> > > > reached.
> > > >
> > > > The actual root cause was a race condition in the test design. It
> > > > was spawning a QEMU with a 'server' netdev, and then spawning one
> > > > with the 'client' netdev. There was insufficient synchronization,
> > > > however, so it was possible for the 2nd QEMU process to attempt
> > > > to 'connect()' before the 'listen()' call was made by the 1st QEMU.
> > > >
> > > > In the test scenarios that did not use the 'reconnect' flag, this
> > > > would result in the client QEMU never getting into the expected
> > > > state. The test code would thus loop on 'info network' until
> > > > hitting the maximum wait time.
> > > >
> > > > This series reverts the increased timeouts, and fixes synchronization
> > > > in the test scenarios. It also improves reporting of errors in the
> > > > socket netdev backend so that 'info network' reports what actually
> > > > went wrong rather than a useless generic 'connection error' string.
> > > > This will help us diagnose any future CI problems, should they occurr.
> > > >
> > > > Daniel P. Berrangé (6):
> > > > Revert "netdev: set timeout depending on loadavg"
> > > > Revert "osdep: add getloadavg"
> > > > Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> > > > seconds"
> > > > net: add explicit info about connecting/listening state
> > > > net: handle QIOTask completion to report useful error message
> > > > qtest: ensure netdev-socket tests have non-overlapping names
> > > >
> > > > include/qemu/osdep.h | 10 ---------
> > > > meson.build | 1 -
> > > > net/stream.c | 18 +++++++++++-----
> > > > tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> > > > 4 files changed, 21 insertions(+), 50 deletions(-)
> > >
> > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > >
> >
> > Queued.
>
> These are already upstream, via thuth's testing queue.
>
> thanks
> -- PMM
>
Great, so I dropped this.
Thanks
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest
2024-01-04 16:29 [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Daniel P. Berrangé
` (6 preceding siblings ...)
2024-01-04 16:45 ` [PATCH 0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest Stefan Hajnoczi
@ 2024-01-09 13:49 ` Daniel P. Berrangé
7 siblings, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2024-01-09 13:49 UTC (permalink / raw)
To: qemu-devel, Jason Wang
Cc: Stefan Hajnoczi, Philippe Mathieu-Daudé, Michael S . Tsirkin,
Laurent Vivier, Thomas Huth, Marc-André Lureau,
Paolo Bonzini
Hi Jason,
As the net/ maintainer, could you take a look at this series.
This failure has been causing pain for CI for quite a while.
If you're happy with it, I can include it in a pending pull
request of other misc patches I have.
On Thu, Jan 04, 2024 at 04:29:36PM +0000, Daniel P. Berrangé wrote:
> We've previously bumped up the timeouts in the netdev-socket qtest
> to supposedly fix non-deterministic failures, however, the failures
> are still hitting CI.
>
> A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> should be very quick to execute, even under high system load, so it
> was never likely that the test was failing due to timeouts being
> reached.
>
> The actual root cause was a race condition in the test design. It
> was spawning a QEMU with a 'server' netdev, and then spawning one
> with the 'client' netdev. There was insufficient synchronization,
> however, so it was possible for the 2nd QEMU process to attempt
> to 'connect()' before the 'listen()' call was made by the 1st QEMU.
>
> In the test scenarios that did not use the 'reconnect' flag, this
> would result in the client QEMU never getting into the expected
> state. The test code would thus loop on 'info network' until
> hitting the maximum wait time.
>
> This series reverts the increased timeouts, and fixes synchronization
> in the test scenarios. It also improves reporting of errors in the
> socket netdev backend so that 'info network' reports what actually
> went wrong rather than a useless generic 'connection error' string.
> This will help us diagnose any future CI problems, should they occurr.
>
> Daniel P. Berrangé (6):
> Revert "netdev: set timeout depending on loadavg"
> Revert "osdep: add getloadavg"
> Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> seconds"
> net: add explicit info about connecting/listening state
> net: handle QIOTask completion to report useful error message
> qtest: ensure netdev-socket tests have non-overlapping names
>
> include/qemu/osdep.h | 10 ---------
> meson.build | 1 -
> net/stream.c | 18 +++++++++++-----
> tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> 4 files changed, 21 insertions(+), 50 deletions(-)
>
> --
> 2.43.0
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 13+ messages in thread