[PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes


* [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
@ 2025-06-24 15:40 Luigi Leonardi
  2025-06-25  8:26 ` Stefano Garzarella
  0 siblings, 1 reply; 9+ messages in thread
From: Luigi Leonardi @ 2025-06-24 15:40 UTC
  To: Stefano Garzarella, Michal Luczaj
  Cc: virtualization, netdev, linux-kernel, Hyunwoo Kim, Luigi Leonardi

Add a new test to ensure that when the transport changes a null pointer
dereference does not occur. The bug was reported upstream [1] and fixed
with commit 2cb7c756f605 ("vsock/virtio: discard packets if the
transport changes").

KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted
Workqueue: vsock-loopback vsock_loopback_work
RIP: 0010:vsock_stream_has_data+0x44/0x70
Call Trace:
 virtio_transport_do_close+0x68/0x1a0
 virtio_transport_recv_pkt+0x1045/0x2ae4
 vsock_loopback_work+0x27d/0x3f0
 process_one_work+0x846/0x1420
 worker_thread+0x5b3/0xf80
 kthread+0x35a/0x700
 ret_from_fork+0x2d/0x70
 ret_from_fork_asm+0x1a/0x30

Note that this test may not fail in a kernel without the fix, but it may
hang on the client side if it triggers a kernel oops.

This works by creating a socket, trying to connect to a server, and then
executing a second connect operation on the same socket but to a
different CID (0). This triggers a transport change. If the connect
operation is interrupted by a signal, this could cause a null-ptr-deref.
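
For readers unfamiliar with the interruption mechanism, here is a minimal
sketch of it, not the test itself. It assumes a read() on an empty pipe in
place of the vsock connect() and a periodic interval timer in place of the
signalling thread; the helper name interrupted_read_errno() is made up for
illustration. The handler is installed without SA_RESTART, so the blocked
call returns EINTR instead of being restarted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty on purpose: it only replaces the default action (terminate). */
static void noop_handler(int signo)
{
	(void)signo;
}

/* Hypothetical helper: block in read() on an empty pipe while a periodic
 * SIGALRM fires. Returns the errno of the failed read(), 0 if the read
 * unexpectedly succeeded, or -1 if the setup failed.
 */
static int interrupted_read_errno(void)
{
	struct itimerval tick = {
		.it_interval = { .tv_sec = 0, .tv_usec = 10000 },
		.it_value    = { .tv_sec = 0, .tv_usec = 10000 },
	};
	struct itimerval stop;
	struct sigaction sa, old_sa;
	int pipefd[2];
	int err = 0;
	char byte;

	memset(&stop, 0, sizeof(stop));
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = noop_handler;	/* sa_flags stays 0: no SA_RESTART */
	sigemptyset(&sa.sa_mask);

	if (pipe(pipefd) < 0)
		return -1;
	if (sigaction(SIGALRM, &sa, &old_sa) < 0)
		goto fail;
	/* Periodic, so a tick that lands before read() blocks is harmless. */
	if (setitimer(ITIMER_REAL, &tick, NULL) < 0)
		goto fail_restore;

	/* Nobody writes to the pipe, so only a signal can end this call. */
	if (read(pipefd[0], &byte, 1) < 0)
		err = errno;

	/* Stop the timer before restoring the previous disposition. */
	setitimer(ITIMER_REAL, &stop, NULL);
	sigaction(SIGALRM, &old_sa, NULL);
	close(pipefd[0]);
	close(pipefd[1]);
	return err;

fail_restore:
	sigaction(SIGALRM, &old_sa, NULL);
fail:
	close(pipefd[0]);
	close(pipefd[1]);
	return -1;
}
```

Calling interrupted_read_errno() is expected to yield EINTR, which is the
same condition the test tolerates on its first connect().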

Since this bug is non-deterministic, we need to try several times. It
is reasonable to assume that the bug will show up within the timeout
period.
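
The retry-until-timeout pattern can be sketched as below. This is only an
illustration under stated assumptions: now_nsec() is a hypothetical
stand-in for the suite's current_nsec() helper and uses CLOCK_MONOTONIC,
and retry_until_timeout() and noop_attempt() are made-up names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for current_nsec(); returns -1 on failure. */
static int64_t now_nsec(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) < 0)
		return -1;

	return (int64_t)ts.tv_sec * SKETCH_NSEC_PER_SEC + ts.tv_nsec;
}

/* Call attempt() repeatedly until timeout_nsec has elapsed. The body is a
 * do/while, so it runs at least once even with a zero timeout.
 * Returns the number of attempts made.
 */
static unsigned long retry_until_timeout(int64_t timeout_nsec,
					 void (*attempt)(void))
{
	int64_t deadline = now_nsec() + timeout_nsec;
	unsigned long attempts = 0;

	do {
		attempt();
		attempts++;
	} while (now_nsec() < deadline);

	return attempts;
}

/* Placeholder for one connect/connect/close round. */
static void noop_attempt(void)
{
}
```

With a zero timeout the loop body still runs exactly once; with a 2 second
timeout it keeps retrying for the whole window, which is what gives a
non-deterministic bug a chance to show up.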

If there is a G2H transport loaded in the system, the bug is not
triggered and this test will always pass.

[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/

Suggested-by: Hyunwoo Kim <v4bel@theori.io>
Suggested-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
---
This series introduces a new test that checks for a null pointer 
dereference that may happen when there is a transport change[1]. This 
bug was fixed in [2].

Note that this test *cannot* fail, it hangs if it triggers a kernel
oops. The intended use-case is to run it and then check if there is any 
oops in the dmesg.
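
If someone wants to automate the dmesg check, a rough sketch could look
like the following. The marker list is an assumption (a heuristic, not an
exhaustive set of kernel splat prefixes), log_has_splat() is a made-up
name, and obtaining the log text is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the log text contains one of the assumed splat markers. */
static bool log_has_splat(const char *log)
{
	/* Heuristic list, chosen for illustration only. */
	static const char *const markers[] = {
		"KASAN:",
		"Oops:",
		"BUG:",
		"general protection fault",
	};
	size_t i;

	for (i = 0; i < sizeof(markers) / sizeof(markers[0]); i++) {
		if (strstr(log, markers[i]))
			return true;
	}

	return false;
}
```

Fed with the first line of the splat in the commit message, this returns
true; a log without any of the markers returns false.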

This test is based on Hyunwoo Kim's[3] and Michal's python 
reproducers[4].

[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
[2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/
[3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t
[4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/
---
Changes in v4:
- Addressed Stefano's comments:
    - Minor style changes
    - Use `get_transports()` to print a warning when a G2H transport is 
    loaded
    - Removed check on second connect: Because the first connect is
    interrupted, the socket is in an unspecified state (see man connect).
    This can cause strange and unexpected behaviors (connect returning
    success on a non-existing CID).

- Link to v3: 
https://lore.kernel.org/r/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com

Sorry, this took waaay longer than expected.

Changes in v3:
Addressed Stefano's and Michal's comments:
    - Added the splat text to the commit message.
    - Introduced commit hash that fixes the bug.
    - Not using perror anymore on pthread_* functions.
    - Listener is just created once.

- Link to v2:
https://lore.kernel.org/r/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com

Changes in v2:
- Addressed Stefano's comments:
    - Timeout is now using current_nsec()
    - Check for return values
    - Style issues
- Added Hyunwoo Kim to Suggested-by
- Link to v1: 
https://lore.kernel.org/r/20250306-test_vsock-v1-0-0320b5accf92@redhat.com
---
 tools/testing/vsock/Makefile     |   1 +
 tools/testing/vsock/vsock_test.c | 178 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 179 insertions(+)

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 6e0b4e95e230500f99bb9c74350701a037ecd198..88211fd132d23ecdfd56ab0815580a237889e7f2 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -5,6 +5,7 @@ vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o msg_ze
 vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
 vsock_perf: vsock_perf.o msg_zerocopy_common.o
 
+vsock_test: LDLIBS = -lpthread
 vsock_uring_test: LDLIBS = -luring
 vsock_uring_test: control.o util.o vsock_uring_test.o timeout.o msg_zerocopy_common.o
 
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index eb6f54378667ac7ed324f4823e988ec9846e41a3..095705c7b53b7ad38ab3b8bc3cbe54a9eeb76d5c 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -22,6 +22,8 @@
 #include <signal.h>
 #include <sys/ioctl.h>
 #include <linux/time64.h>
+#include <pthread.h>
+#include <fcntl.h>
 
 #include "vsock_test_zerocopy.h"
 #include "timeout.h"
@@ -1867,6 +1869,177 @@ static void test_stream_connect_retry_server(const struct test_opts *opts)
 	close(fd);
 }
 
+#define TRANSPORT_CHANGE_TIMEOUT 2 /* seconds */
+
+static void *test_stream_transport_change_thread(void *vargp)
+{
+	pid_t *pid = (pid_t *)vargp;
+	int ret;
+
+	/* We want this thread to terminate as soon as possible */
+	ret = pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
+	if (ret) {
+		fprintf(stderr, "pthread_setcanceltype: %d\n", ret);
+		exit(EXIT_FAILURE);
+	}
+
+	while (true) {
+		if (kill(*pid, SIGUSR1) < 0) {
+			perror("kill");
+			exit(EXIT_FAILURE);
+		}
+	}
+	return NULL;
+}
+
+static void test_transport_change_signal_handler(int signal)
+{
+	/* We need a custom handler for SIGUSR1 as the default one terminates the process. */
+}
+
+static void test_stream_transport_change_client(const struct test_opts *opts)
+{
+	__sighandler_t old_handler;
+	pid_t pid = getpid();
+	pthread_t thread_id;
+	time_t tout;
+	int ret, tr;
+
+	tr = get_transports();
+	/* Print a warning if there is a G2H transport loaded.
+	 * This is on a best effort basis because VMCI can be either G2H and H2G, and there is
+	 * no easy way to understand it.
+	 * The bug is present in the loopback transport. However, it does not interfere
+	 * if it is loaded.
+	 * The bug we are testing only appears when G2H transports are not loaded.
+	 */
+
+	tr &= ~TRANSPORT_LOOPBACK;
+	if (tr != 0 && tr != TRANSPORT_VHOST)
+		fprintf(stderr, "G2H Transport detected. This test will not fail.\n");
+
+	old_handler = signal(SIGUSR1, test_transport_change_signal_handler);
+	if (old_handler == SIG_ERR) {
+		perror("signal");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = pthread_create(&thread_id, NULL, test_stream_transport_change_thread, &pid);
+	if (ret) {
+		fprintf(stderr, "pthread_create: %d\n", ret);
+		exit(EXIT_FAILURE);
+	}
+
+	control_expectln("LISTENING");
+
+	tout = current_nsec() + TRANSPORT_CHANGE_TIMEOUT * NSEC_PER_SEC;
+	do {
+		struct sockaddr_vm sa = {
+			.svm_family = AF_VSOCK,
+			.svm_cid = opts->peer_cid,
+			.svm_port = opts->peer_port,
+		};
+		int s;
+
+		s = socket(AF_VSOCK, SOCK_STREAM, 0);
+		if (s < 0) {
+			perror("socket");
+			exit(EXIT_FAILURE);
+		}
+
+		ret = connect(s, (struct sockaddr *)&sa, sizeof(sa));
+		/* The connect can fail due to signals coming from the thread.
+		 * or because the receiver connection queue is full.
+		 * Ignoring also the latter case because there is no way
+		 * of synchronizing client's connect and server's accept when
+		 * connect(s) are constantly being interrupted by signals.
+		 */
+		if (ret == -1 && (errno != EINTR && errno != ECONNRESET)) {
+			perror("connect");
+			exit(EXIT_FAILURE);
+		}
+
+		/* Set CID to 0 cause a transport change. */
+		sa.svm_cid = 0;
+		/* Here we ignore the connect return value because we cannot
+		 * safely assume that it will *always* fail.
+		 * This is because the previous connect was interrupted
+		 * during the connection process. The socket state, as stated
+		 * in `man connect`, is unspecified and can result in strange
+		 * behaviors.
+		 */
+		connect(s, (struct sockaddr *)&sa, sizeof(sa));
+
+		close(s);
+
+		control_writeulong(CONTROL_CONTINUE);
+
+	} while (current_nsec() < tout);
+
+	control_writeulong(CONTROL_DONE);
+
+	ret = pthread_cancel(thread_id);
+	if (ret) {
+		fprintf(stderr, "pthread_cancel: %d\n", ret);
+		exit(EXIT_FAILURE);
+	}
+
+	/* Wait for the thread to terminate */
+	ret = pthread_join(thread_id, NULL);
+	if (ret) {
+		fprintf(stderr, "pthread_join: %d\n", ret);
+		exit(EXIT_FAILURE);
+	}
+
+	/* Restore the old handler */
+	if (signal(SIGUSR1, old_handler) == SIG_ERR) {
+		perror("signal");
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void test_stream_transport_change_server(const struct test_opts *opts)
+{
+	int s = vsock_stream_listen(VMADDR_CID_ANY, opts->peer_port);
+
+	/* Set the socket to be nonblocking because connects that have been interrupted
+	 * (EINTR) can fill the receiver's accept queue anyway, leading to connect failure.
+	 * As of today (6.15) in such situation there is no way to understand, from the
+	 * client side, if the connection has been queued in the server or not.
+	 */
+	if (fcntl(s, F_SETFL, fcntl(s, F_GETFL, 0) | O_NONBLOCK) < 0) {
+		perror("fcntl");
+		exit(EXIT_FAILURE);
+	}
+	control_writeln("LISTENING");
+
+	while (control_readulong() == CONTROL_CONTINUE) {
+		struct sockaddr_vm sa_client;
+		socklen_t socklen_client = sizeof(sa_client);
+
+		/* Must accept the connection, otherwise the `listen`
+		 * queue will fill up and new connections will fail.
+		 * There can be more than one queued connection,
+		 * clear them all.
+		 */
+		while (true) {
+			int client = accept(s, (struct sockaddr *)&sa_client, &socklen_client);
+
+			if (client < 0) {
+				if (errno == EAGAIN)
+					break;
+
+				perror("accept");
+				exit(EXIT_FAILURE);
+			}
+
+			close(client);
+		}
+	}
+
+	close(s);
+}
+
 static void test_stream_linger_client(const struct test_opts *opts)
 {
 	int fd;
@@ -2106,6 +2279,11 @@ static struct test_case test_cases[] = {
 		.run_client = test_stream_nolinger_client,
 		.run_server = test_stream_nolinger_server,
 	},
+	{
+		.name = "SOCK_STREAM transport change null-ptr-deref",
+		.run_client = test_stream_transport_change_client,
+		.run_server = test_stream_transport_change_server,
+	},
 	{},
 };
 

---
base-commit: 68d019aa14d97f8d57b0f8d203fd3b44db2ba0c7
change-id: 20250306-test_vsock-3e77a9c7a245

Best regards,
-- 
Luigi Leonardi <leonardi@redhat.com>



* Re: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
  2025-06-24 15:40 Luigi Leonardi
@ 2025-06-25  8:26 ` Stefano Garzarella
  2025-06-30  9:24   ` Luigi Leonardi
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Garzarella @ 2025-06-25  8:26 UTC
  To: Luigi Leonardi
  Cc: Michal Luczaj, virtualization, netdev, linux-kernel, Hyunwoo Kim

On Tue, Jun 24, 2025 at 05:40:15PM +0200, Luigi Leonardi wrote:
>Add a new test to ensure that when the transport changes a null pointer
>dereference does not occur. The bug was reported upstream [1] and fixed
>with commit 2cb7c756f605 ("vsock/virtio: discard packets if the
>transport changes").
>
>KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
>CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted
>Workqueue: vsock-loopback vsock_loopback_work
>RIP: 0010:vsock_stream_has_data+0x44/0x70
>Call Trace:
> virtio_transport_do_close+0x68/0x1a0
> virtio_transport_recv_pkt+0x1045/0x2ae4
> vsock_loopback_work+0x27d/0x3f0
> process_one_work+0x846/0x1420
> worker_thread+0x5b3/0xf80
> kthread+0x35a/0x700
> ret_from_fork+0x2d/0x70
> ret_from_fork_asm+0x1a/0x30
>
>Note that this test may not fail in a kernel without the fix, but it may
>hang on the client side if it triggers a kernel oops.
>
>This works by creating a socket, trying to connect to a server, and then
>executing a second connect operation on the same socket but to a
>different CID (0). This triggers a transport change. If the connect
>operation is interrupted by a signal, this could cause a null-ptr-deref.
>
>Since this bug is non-deterministic, we need to try several times. It
>is reasonable to assume that the bug will show up within the timeout
>period.
>
>If there is a G2H transport loaded in the system, the bug is not
>triggered and this test will always pass.

Can you add the reason?

>
>[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
>
>Suggested-by: Hyunwoo Kim <v4bel@theori.io>
>Suggested-by: Michal Luczaj <mhal@rbox.co>
>Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
>---
>This series introduces a new test that checks for a null pointer
>dereference that may happen when there is a transport change[1]. This
>bug was fixed in [2].
>
>Note that this test *cannot* fail, it hangs if it triggers a kernel
>oops. The intended use-case is to run it and then check if there is any
>oops in the dmesg.
>
>This test is based on Hyunwoo Kim's[3] and Michal's python
>reproducers[4].
>
>[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
>[2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/
>[3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t
>[4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/
>---
>Changes in v4:
>- Addressed Stefano's comments:
>    - Minor style changes
>    - Use `get_transports()` to print a warning when a G2H transport is
>    loaded
>    - Removed check on second connect: Because the first connect is
>    interrupted, the socket is in an unspecified state (see man connect).
>    This can cause strange and unexpected behaviors (connect returning
>    success on a non-existing CID).
>
>- Link to v3:
>https://lore.kernel.org/r/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com
>
>Sorry, this took waaay longer than expected.
>
>Changes in v3:
>Addressed Stefano's and Michal's comments:
>    - Added the splat text to the commit message.
>    - Introduced commit hash that fixes the bug.
>    - Not using perror anymore on pthread_* functions.
>    - Listener is just created once.
>
>- Link to v2:
>https://lore.kernel.org/r/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com
>
>Changes in v2:
>- Addressed Stefano's comments:
>    - Timeout is now using current_nsec()
>    - Check for return values
>    - Style issues
>- Added Hyunwoo Kim to Suggested-by
>- Link to v1:
>https://lore.kernel.org/r/20250306-test_vsock-v1-0-0320b5accf92@redhat.com
>---
> tools/testing/vsock/Makefile     |   1 +
> tools/testing/vsock/vsock_test.c | 178 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 179 insertions(+)
>
>diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
>index 6e0b4e95e230500f99bb9c74350701a037ecd198..88211fd132d23ecdfd56ab0815580a237889e7f2 100644
>--- a/tools/testing/vsock/Makefile
>+++ b/tools/testing/vsock/Makefile
>@@ -5,6 +5,7 @@ vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o msg_ze
> vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
> vsock_perf: vsock_perf.o msg_zerocopy_common.o
>
>+vsock_test: LDLIBS = -lpthread
> vsock_uring_test: LDLIBS = -luring
> vsock_uring_test: control.o util.o vsock_uring_test.o timeout.o msg_zerocopy_common.o
>
>diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
>index eb6f54378667ac7ed324f4823e988ec9846e41a3..095705c7b53b7ad38ab3b8bc3cbe54a9eeb76d5c 100644
>--- a/tools/testing/vsock/vsock_test.c
>+++ b/tools/testing/vsock/vsock_test.c
>@@ -22,6 +22,8 @@
> #include <signal.h>
> #include <sys/ioctl.h>
> #include <linux/time64.h>
>+#include <pthread.h>
>+#include <fcntl.h>
>
> #include "vsock_test_zerocopy.h"
> #include "timeout.h"
>@@ -1867,6 +1869,177 @@ static void test_stream_connect_retry_server(const struct test_opts *opts)
> 	close(fd);
> }
>
>+#define TRANSPORT_CHANGE_TIMEOUT 2 /* seconds */
>+
>+static void *test_stream_transport_change_thread(void *vargp)
>+{
>+	pid_t *pid = (pid_t *)vargp;
>+	int ret;
>+
>+	/* We want this thread to terminate as soon as possible */
>+	ret = pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
>+	if (ret) {
>+		fprintf(stderr, "pthread_setcanceltype: %d\n", ret);
>+		exit(EXIT_FAILURE);
>+	}
>+
>+	while (true) {
>+		if (kill(*pid, SIGUSR1) < 0) {
>+			perror("kill");
>+			exit(EXIT_FAILURE);
>+		}
>+	}
>+	return NULL;
>+}
>+
>+static void test_transport_change_signal_handler(int signal)
>+{
>+	/* We need a custom handler for SIGUSR1 as the default one terminates the process. */
>+}
>+
>+static void test_stream_transport_change_client(const struct test_opts *opts)
>+{
>+	__sighandler_t old_handler;
>+	pid_t pid = getpid();
>+	pthread_t thread_id;
>+	time_t tout;
>+	int ret, tr;
>+
>+	tr = get_transports();

nit: add a blank line here

>+	/* Print a warning if there is a G2H transport loaded.
>+	 * This is on a best effort basis because VMCI can be either G2H and H2G, and there is
>+	 * no easy way to understand it.
>+	 * The bug is present in the loopback transport. However, it does not interfere

nit: s/is/was

>+	 * if it is loaded.

I don't understand this. If the bug was present in the loopback transport, our goal is to stress it, so having it loaded is great. Why should it interfere?

>+	 * The bug we are testing only appears when G2H transports are not loaded.

Please add the reason also here.

>+	 */
>+

nit: remove the blank line here

>+	tr &= ~TRANSPORT_LOOPBACK;
>+	if (tr != 0 && tr != TRANSPORT_VHOST)

Sorry, this is really hard to understand IMO, let's take a step back.
Your goal is to check if there is a G2H transport loaded, right?

I think we have 2 options:
1. similar to yours, just masking the other 2 transports

	if (tr & ~(TRANSPORT_LOOPBACK | TRANSPORT_VHOST))

    IMO this is much clearer to understand, and should have the same effect.

2. (my preference) define in util.h some macros that we can reuse:
	#define TRANSPORTS_G2H   (TRANSPORT_VIRTIO | TRANSPORT_VMCI | TRANSPORT_HYPERV)
	#define TRANSPORTS_H2G   (TRANSPORT_VHOST | TRANSPORT_VMCI)
	#define TRANSPORTS_LOCAL (TRANSPORT_LOOPBACK)

     and here you can just do:
	if (tr & TRANSPORTS_G2H)
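
As a sanity check on "should have the same effect", here is a small sketch
that compares the v4 condition with both options over every combination of
transports. The bit values in the enum are assumed for illustration (the
real ones live in util.h), and the g2h_check_*() names are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout, for illustration only. */
enum {
	TRANSPORT_LOOPBACK = 1 << 0,
	TRANSPORT_VIRTIO   = 1 << 1,
	TRANSPORT_VHOST    = 1 << 2,
	TRANSPORT_VMCI     = 1 << 3,
	TRANSPORT_HYPERV   = 1 << 4,
	TRANSPORT_ALL      = (1 << 5) - 1,
};

#define TRANSPORTS_G2H   (TRANSPORT_VIRTIO | TRANSPORT_VMCI | TRANSPORT_HYPERV)
#define TRANSPORTS_H2G   (TRANSPORT_VHOST | TRANSPORT_VMCI)
#define TRANSPORTS_LOCAL (TRANSPORT_LOOPBACK)

/* The condition as written in v4. */
static bool g2h_check_v4(int tr)
{
	tr &= ~TRANSPORT_LOOPBACK;
	return tr != 0 && tr != TRANSPORT_VHOST;
}

/* Option 1: mask out loopback and vhost, anything left is G2H. */
static bool g2h_check_mask(int tr)
{
	return (tr & ~(TRANSPORT_LOOPBACK | TRANSPORT_VHOST)) != 0;
}

/* Option 2: test against the G2H group directly. */
static bool g2h_check_macro(int tr)
{
	return (tr & TRANSPORTS_G2H) != 0;
}

/* Returns true if all three checks agree for every transport bitmask. */
static bool g2h_checks_agree(void)
{
	int tr;

	for (tr = 0; tr <= TRANSPORT_ALL; tr++) {
		if (g2h_check_v4(tr) != g2h_check_mask(tr) ||
		    g2h_check_mask(tr) != g2h_check_macro(tr))
			return false;
	}

	return true;
}
```

Under the assumption that these five transports are the only known ones,
the three forms are interchangeable, so the choice is purely about
readability.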

>+		fprintf(stderr, "G2H Transport detected. This test will not fail.\n");
>+
>+	old_handler = signal(SIGUSR1, test_transport_change_signal_handler);
>+	if (old_handler == SIG_ERR) {
>+		perror("signal");
>+		exit(EXIT_FAILURE);
>+	}
>+
>+	ret = pthread_create(&thread_id, NULL, test_stream_transport_change_thread, &pid);
>+	if (ret) {
>+		fprintf(stderr, "pthread_create: %d\n", ret);
>+		exit(EXIT_FAILURE);
>+	}
>+
>+	control_expectln("LISTENING");
>+
>+	tout = current_nsec() + TRANSPORT_CHANGE_TIMEOUT * NSEC_PER_SEC;
>+	do {
>+		struct sockaddr_vm sa = {
>+			.svm_family = AF_VSOCK,
>+			.svm_cid = opts->peer_cid,
>+			.svm_port = opts->peer_port,
>+		};
>+		int s;
>+
>+		s = socket(AF_VSOCK, SOCK_STREAM, 0);
>+		if (s < 0) {
>+			perror("socket");
>+			exit(EXIT_FAILURE);
>+		}
>+
>+		ret = connect(s, (struct sockaddr *)&sa, sizeof(sa));
>+		/* The connect can fail due to signals coming from the thread.

. should be ,

>+		 * or because the receiver connection queue is full.
>+		 * Ignoring also the latter case because there is no way
>+		 * of synchronizing client's connect and server's accept when
>+		 * connect(s) are constantly being interrupted by signals.
>+		 */
>+		if (ret == -1 && (errno != EINTR && errno != ECONNRESET)) {
>+			perror("connect");
>+			exit(EXIT_FAILURE);
>+		}
>+
>+		/* Set CID to 0 cause a transport change. */
>+		sa.svm_cid = 0;

nit: add a blank line

>+		/* Here we ignore the connect return value because we cannot
>+		 * safely assume that it will *always* fail.
>+		 * This is because the previous connect was interrupted
>+		 * during the connection process. The socket state, as stated
>+		 * in `man connect`, is unspecified and can result in strange
>+		 * behaviors.

Let's focus on the behaviour and try to be more concise. Something like
this:

		/* Ignore return value since it can fail or not.
		 * If the previous connect is interrupted while the
		 * connection request is already sent, the second
		 * connect() will wait for the response.

>+		 */
>+		connect(s, (struct sockaddr *)&sa, sizeof(sa));
>+
>+		close(s);
>+
>+		control_writeulong(CONTROL_CONTINUE);
>+
>+	} while (current_nsec() < tout);
>+
>+	control_writeulong(CONTROL_DONE);
>+
>+	ret = pthread_cancel(thread_id);
>+	if (ret) {
>+		fprintf(stderr, "pthread_cancel: %d\n", ret);
>+		exit(EXIT_FAILURE);
>+	}
>+
>+	/* Wait for the thread to terminate */

useless comment

>+	ret = pthread_join(thread_id, NULL);
>+	if (ret) {
>+		fprintf(stderr, "pthread_join: %d\n", ret);
>+		exit(EXIT_FAILURE);
>+	}
>+
>+	/* Restore the old handler */

ditto

>+	if (signal(SIGUSR1, old_handler) == SIG_ERR) {
>+		perror("signal");
>+		exit(EXIT_FAILURE);
>+	}
>+}
>+
>+static void test_stream_transport_change_server(const struct test_opts *opts)
>+{
>+	int s = vsock_stream_listen(VMADDR_CID_ANY, opts->peer_port);
>+
>+	/* Set the socket to be nonblocking because connects that have been interrupted
>+	 * (EINTR) can fill the receiver's accept queue anyway, leading to connect failure.
>+	 * As of today (6.15) in such situation there is no way to understand, from the
>+	 * client side, if the connection has been queued in the server or not.
>+	 */
>+	if (fcntl(s, F_SETFL, fcntl(s, F_GETFL, 0) | O_NONBLOCK) < 0) {
>+		perror("fcntl");
>+		exit(EXIT_FAILURE);
>+	}
>+	control_writeln("LISTENING");
>+
>+	while (control_readulong() == CONTROL_CONTINUE) {
>+		struct sockaddr_vm sa_client;
>+		socklen_t socklen_client = sizeof(sa_client);
>+
>+		/* Must accept the connection, otherwise the `listen`
>+		 * queue will fill up and new connections will fail.
>+		 * There can be more than one queued connection,
>+		 * clear them all.
>+		 */
>+		while (true) {
>+			int client = accept(s, (struct sockaddr *)&sa_client, &socklen_client);

We don't use the client address, so IMO we can just pass NULL.
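
A minimal sketch of the drain loop with NULL address arguments follows. To
keep it runnable without vsock it assumes an AF_UNIX listener (using the
Linux autobind feature) in place of AF_VSOCK; drain_accept_queue() and
drain_demo() are made-up names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Accept and close every pending connection on a nonblocking listener.
 * Returns the number of connections drained, or -1 on an unexpected error.
 */
static int drain_accept_queue(int listen_fd)
{
	int drained = 0;

	for (;;) {
		/* The peer address is unused, so both pointers are NULL. */
		int client = accept(listen_fd, NULL, NULL);

		if (client < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return drained;
			if (errno == EINTR)
				continue;
			return -1;
		}

		close(client);
		drained++;
	}
}

/* Hypothetical self-check: queue nconn connections on an AF_UNIX listener
 * and report how many the drain loop accepted, or -1 on setup failure.
 */
static int drain_demo(int nconn)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	socklen_t len = sizeof(addr);
	int listener, flags, drained, i;

	listener = socket(AF_UNIX, SOCK_STREAM, 0);
	if (listener < 0)
		return -1;

	/* Binding only the family asks Linux for an abstract autobind name. */
	if (bind(listener, (struct sockaddr *)&addr, sizeof(sa_family_t)) < 0 ||
	    listen(listener, nconn) < 0 ||
	    getsockname(listener, (struct sockaddr *)&addr, &len) < 0)
		goto fail;

	flags = fcntl(listener, F_GETFL, 0);
	if (flags < 0 || fcntl(listener, F_SETFL, flags | O_NONBLOCK) < 0)
		goto fail;

	for (i = 0; i < nconn; i++) {
		int c = socket(AF_UNIX, SOCK_STREAM, 0);

		if (c < 0)
			goto fail;
		if (connect(c, (struct sockaddr *)&addr, len) < 0) {
			close(c);
			goto fail;
		}
		close(c);
	}

	drained = drain_accept_queue(listener);
	close(listener);
	return drained;

fail:
	close(listener);
	return -1;
}
```

drain_demo(3) should report three drained connections; in the test the same
loop would simply run on the vsock listener after each CONTROL_CONTINUE.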

Thanks,
Stefano

>+
>+			if (client < 0) {
>+				if (errno == EAGAIN)
>+					break;
>+
>+				perror("accept");
>+				exit(EXIT_FAILURE);
>+			}
>+
>+			close(client);
>+		}
>+	}
>+
>+	close(s);
>+}
>+
> static void test_stream_linger_client(const struct test_opts *opts)
> {
> 	int fd;
>@@ -2106,6 +2279,11 @@ static struct test_case test_cases[] = {
> 		.run_client = test_stream_nolinger_client,
> 		.run_server = test_stream_nolinger_server,
> 	},
>+	{
>+		.name = "SOCK_STREAM transport change null-ptr-deref",
>+		.run_client = test_stream_transport_change_client,
>+		.run_server = test_stream_transport_change_server,
>+	},
> 	{},
> };
>
>
>---
>base-commit: 68d019aa14d97f8d57b0f8d203fd3b44db2ba0c7
>change-id: 20250306-test_vsock-3e77a9c7a245
>
>Best regards,
>-- 
>Luigi Leonardi <leonardi@redhat.com>
>



* Re: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
  2025-06-25  8:26 ` Stefano Garzarella
@ 2025-06-30  9:24   ` Luigi Leonardi
  0 siblings, 0 replies; 9+ messages in thread
From: Luigi Leonardi @ 2025-06-30  9:24 UTC
  To: Stefano Garzarella
  Cc: Michal Luczaj, virtualization, netdev, linux-kernel, Hyunwoo Kim

Hi Stefano,

On Wed, Jun 25, 2025 at 10:26:26AM +0200, Stefano Garzarella wrote:
>On Tue, Jun 24, 2025 at 05:40:15PM +0200, Luigi Leonardi wrote:
>>Add a new test to ensure that when the transport changes a null pointer
>>dereference does not occur. The bug was reported upstream [1] and fixed
>>with commit 2cb7c756f605 ("vsock/virtio: discard packets if the
>>transport changes").
>>
>>KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
>>CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted
>>Workqueue: vsock-loopback vsock_loopback_work
>>RIP: 0010:vsock_stream_has_data+0x44/0x70
>>Call Trace:
>>virtio_transport_do_close+0x68/0x1a0
>>virtio_transport_recv_pkt+0x1045/0x2ae4
>>vsock_loopback_work+0x27d/0x3f0
>>process_one_work+0x846/0x1420
>>worker_thread+0x5b3/0xf80
>>kthread+0x35a/0x700
>>ret_from_fork+0x2d/0x70
>>ret_from_fork_asm+0x1a/0x30
>>
>>Note that this test may not fail in a kernel without the fix, but it may
>>hang on the client side if it triggers a kernel oops.
>>
>>This works by creating a socket, trying to connect to a server, and then
>>executing a second connect operation on the same socket but to a
>>different CID (0). This triggers a transport change. If the connect
>>operation is interrupted by a signal, this could cause a null-ptr-deref.
>>
>>Since this bug is non-deterministic, we need to try several times. It
>>is reasonable to assume that the bug will show up within the timeout
>>period.
>>
>>If there is a G2H transport loaded in the system, the bug is not
>>triggered and this test will always pass.
>
>Can you add the reason?
Will do.
>
>>
>>[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
>>
>>Suggested-by: Hyunwoo Kim <v4bel@theori.io>
>>Suggested-by: Michal Luczaj <mhal@rbox.co>
>>Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
>>---
>>This series introduces a new test that checks for a null pointer
>>dereference that may happen when there is a transport change[1]. This
>>bug was fixed in [2].
>>
>>Note that this test *cannot* fail, it hangs if it triggers a kernel
>>oops. The intended use-case is to run it and then check if there is any
>>oops in the dmesg.
>>
>>This test is based on Hyunwoo Kim's[3] and Michal's python
>>reproducers[4].
>>
>>[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
>>[2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/
>>[3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t
>>[4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/
>>---
>>Changes in v4:
>>- Addressed Stefano's comments:
>>   - Minor style changes
>>   - Use `get_transports()` to print a warning when a G2H transport is
>>   loaded
>>   - Removed check on second connect: Because the first connect is
>>   interrupted, the socket is in an unspecified state (see man connect).
>>   This can cause strange and unexpected behaviors (connect returning
>>   success on a non-existing CID).
>>
>>- Link to v3:
>>https://lore.kernel.org/r/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com
>>
>>Sorry, this took waaay longer than expected.
>>
>>Changes in v3:
>>Addressed Stefano's and Michal's comments:
>>   - Added the splat text to the commit message.
>>   - Introduced commit hash that fixes the bug.
>>   - Not using perror anymore on pthread_* functions.
>>   - Listener is just created once.
>>
>>- Link to v2:
>>https://lore.kernel.org/r/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com
>>
>>Changes in v2:
>>- Addressed Stefano's comments:
>>   - Timeout is now using current_nsec()
>>   - Check for return values
>>   - Style issues
>>- Added Hyunwoo Kim to Suggested-by
>>- Link to v1:
>>https://lore.kernel.org/r/20250306-test_vsock-v1-0-0320b5accf92@redhat.com
>>---
>>tools/testing/vsock/Makefile     |   1 +
>>tools/testing/vsock/vsock_test.c | 178 +++++++++++++++++++++++++++++++++++++++
>>2 files changed, 179 insertions(+)
>>
>>diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
>>index 6e0b4e95e230500f99bb9c74350701a037ecd198..88211fd132d23ecdfd56ab0815580a237889e7f2 100644
>>--- a/tools/testing/vsock/Makefile
>>+++ b/tools/testing/vsock/Makefile
>>@@ -5,6 +5,7 @@ vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o msg_ze
>>vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
>>vsock_perf: vsock_perf.o msg_zerocopy_common.o
>>
>>+vsock_test: LDLIBS = -lpthread
>>vsock_uring_test: LDLIBS = -luring
>>vsock_uring_test: control.o util.o vsock_uring_test.o timeout.o msg_zerocopy_common.o
>>
>>diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
>>index eb6f54378667ac7ed324f4823e988ec9846e41a3..095705c7b53b7ad38ab3b8bc3cbe54a9eeb76d5c 100644
>>--- a/tools/testing/vsock/vsock_test.c
>>+++ b/tools/testing/vsock/vsock_test.c
>>@@ -22,6 +22,8 @@
>>#include <signal.h>
>>#include <sys/ioctl.h>
>>#include <linux/time64.h>
>>+#include <pthread.h>
>>+#include <fcntl.h>
>>
>>#include "vsock_test_zerocopy.h"
>>#include "timeout.h"
>>@@ -1867,6 +1869,177 @@ static void test_stream_connect_retry_server(const struct test_opts *opts)
>>	close(fd);
>>}
>>
>>+#define TRANSPORT_CHANGE_TIMEOUT 2 /* seconds */
>>+
>>+static void *test_stream_transport_change_thread(void *vargp)
>>+{
>>+	pid_t *pid = (pid_t *)vargp;
>>+	int ret;
>>+
>>+	/* We want this thread to terminate as soon as possible */
>>+	ret = pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
>>+	if (ret) {
>>+		fprintf(stderr, "pthread_setcanceltype: %d\n", ret);
>>+		exit(EXIT_FAILURE);
>>+	}
>>+
>>+	while (true) {
>>+		if (kill(*pid, SIGUSR1) < 0) {
>>+			perror("kill");
>>+			exit(EXIT_FAILURE);
>>+		}
>>+	}
>>+	return NULL;
>>+}
>>+
>>+static void test_transport_change_signal_handler(int signal)
>>+{
>>+	/* We need a custom handler for SIGUSR1 as the default one terminates the process. */
>>+}
>>+
>>+static void test_stream_transport_change_client(const struct test_opts *opts)
>>+{
>>+	__sighandler_t old_handler;
>>+	pid_t pid = getpid();
>>+	pthread_t thread_id;
>>+	time_t tout;
>>+	int ret, tr;
>>+
>>+	tr = get_transports();
>
>nit: add a blank line here
>
>>+	/* Print a warning if there is a G2H transport loaded.
>>+	 * This is on a best effort basis because VMCI can be either G2H and H2G, and there is
>>+	 * no easy way to understand it.
>>+	 * The bug is present in the loopback transport. However, it does not interfere
>
>nit: s/is/was
>
>>+	 * if it is loaded.
>
>I don't understand this. If the bug was present in the loopback transport, our goal is to stress it, so having it loaded is great. Why should it interfere?
What I meant to say is that, when testing the H2G path to trigger the
issue, a loaded loopback transport does not prevent the bug, whereas a
G2H transport would.

>
>>+	 * The bug we are testing only appears when G2H transports are not loaded.
>
>Please add the reason also here.
>
>>+	 */
>>+
>
>nit: remove the blank line here
>
>>+	tr &= ~TRANSPORT_LOOPBACK;
>>+	if (tr != 0 && tr != TRANSPORT_VHOST)
>
>Sorry, this is really hard to understand IMO, let's take a step back.
>Your goal is to check if there is a G2H transport loaded, right?
correct
>
>I think we have 2 options:
>1. similar to yours, just masking the other 2 transports
>
>	if (tr & ~(TRANSPORT_LOOPBACK | TRANSPORT_VHOST))
>
>   IMO this is much clearer to understand, and should have the same effect.
>
>2. (my preference) define in util.h some macros that we can reuse:
>	#define TRANSPORTS_G2H   (TRANSPORT_VIRTIO | TRANSPORT_VMCI | TRANSPORT_HYPERV)
>	#define TRANSPORTS_H2G   (TRANSPORT_VHOST | TRANSPORT_VMCI)
>	#define TRANSPORTS_LOCAL (TRANSPORT_LOOPBACK)
>
>    and here you can just do:
>	if (tr & TRANSPORTS_G2H)
>

I'll add these defines in a separate commit, thanks for the hint.

>>+		fprintf(stderr, "G2H Transport detected. This test will not fail.\n");
>>+
>>+	old_handler = signal(SIGUSR1, test_transport_change_signal_handler);
>>+	if (old_handler == SIG_ERR) {
>>+		perror("signal");
>>+		exit(EXIT_FAILURE);
>>+	}
>>+
>>+	ret = pthread_create(&thread_id, NULL, test_stream_transport_change_thread, &pid);
>>+	if (ret) {
>>+		fprintf(stderr, "pthread_create: %d\n", ret);
>>+		exit(EXIT_FAILURE);
>>+	}
>>+
>>+	control_expectln("LISTENING");
>>+
>>+	tout = current_nsec() + TRANSPORT_CHANGE_TIMEOUT * NSEC_PER_SEC;
>>+	do {
>>+		struct sockaddr_vm sa = {
>>+			.svm_family = AF_VSOCK,
>>+			.svm_cid = opts->peer_cid,
>>+			.svm_port = opts->peer_port,
>>+		};
>>+		int s;
>>+
>>+		s = socket(AF_VSOCK, SOCK_STREAM, 0);
>>+		if (s < 0) {
>>+			perror("socket");
>>+			exit(EXIT_FAILURE);
>>+		}
>>+
>>+		ret = connect(s, (struct sockaddr *)&sa, sizeof(sa));
>>+		/* The connect can fail due to signals coming from the thread.
>
>. should be ,
>
>>+		 * or because the receiver connection queue is full.
>>+		 * Ignoring also the latter case because there is no way
>>+		 * of synchronizing client's connect and server's accept when
>>+		 * connect(s) are constantly being interrupted by signals.
>>+		 */
>>+		if (ret == -1 && (errno != EINTR && errno != ECONNRESET)) {
>>+			perror("connect");
>>+			exit(EXIT_FAILURE);
>>+		}
>>+
>>+		/* Set CID to 0 cause a transport change. */
>>+		sa.svm_cid = 0;
>
>nit: add a blank line
>
>>+		/* Here we ignore the connect return value because we cannot
>>+		 * safely assume that it will *always* fail.
>>+		 * This is because the previous connect was interrupted
>>+		 * during the connection process. The socket state, as stated
>>+		 * in `man connect`, is unspecified and can result in strange
>>+		 * behaviors.
>
>Let's focus on the behaviour and try to be more concise. Something like
>this:
>
>		/* Ignore return value since it can fail or not.
>		 * If the previous connect is interrupted while the
>		 * connection request is already sent, the second
>		 * connect() will wait for the response.
>
>>+		 */
>>+		connect(s, (struct sockaddr *)&sa, sizeof(sa));
>>+
>>+		close(s);
>>+
>>+		control_writeulong(CONTROL_CONTINUE);
>>+
>>+	} while (current_nsec() < tout);
>>+
>>+	control_writeulong(CONTROL_DONE);
>>+
>>+	ret = pthread_cancel(thread_id);
>>+	if (ret) {
>>+		fprintf(stderr, "pthread_cancel: %d\n", ret);
>>+		exit(EXIT_FAILURE);
>>+	}
>>+
>>+	/* Wait for the thread to terminate */
>
>useless comment
>
>>+	ret = pthread_join(thread_id, NULL);
>>+	if (ret) {
>>+		fprintf(stderr, "pthread_join: %d\n", ret);
>>+		exit(EXIT_FAILURE);
>>+	}
>>+
>>+	/* Restore the old handler */
>
>ditto
>
>>+	if (signal(SIGUSR1, old_handler) == SIG_ERR) {
>>+		perror("signal");
>>+		exit(EXIT_FAILURE);
>>+	}
>>+}
>>+
>>+static void test_stream_transport_change_server(const struct test_opts *opts)
>>+{
>>+	int s = vsock_stream_listen(VMADDR_CID_ANY, opts->peer_port);
>>+
>>+	/* Set the socket to be nonblocking because connects that have been interrupted
>>+	 * (EINTR) can fill the receiver's accept queue anyway, leading to connect failure.
>>+	 * As of today (6.15) in such situation there is no way to understand, from the
>>+	 * client side, if the connection has been queued in the server or not.
>>+	 */
>>+	if (fcntl(s, F_SETFL, fcntl(s, F_GETFL, 0) | O_NONBLOCK) < 0) {
>>+		perror("fcntl");
>>+		exit(EXIT_FAILURE);
>>+	}
>>+	control_writeln("LISTENING");
>>+
>>+	while (control_readulong() == CONTROL_CONTINUE) {
>>+		struct sockaddr_vm sa_client;
>>+		socklen_t socklen_client = sizeof(sa_client);
>>+
>>+		/* Must accept the connection, otherwise the `listen`
>>+		 * queue will fill up and new connections will fail.
>>+		 * There can be more than one queued connection,
>>+		 * clear them all.
>>+		 */
>>+		while (true) {
>>+			int client = accept(s, (struct sockaddr *)&sa_client, &socklen_client);
>
>We don't use the client address, so IMO we can just pass NULL.
>
>Thanks,
>Stefano
>
>>+
>>+			if (client < 0) {
>>+				if (errno == EAGAIN)
>>+					break;
>>+
>>+				perror("accept");
>>+				exit(EXIT_FAILURE);
>>+			}
>>+
>>+			close(client);
>>+		}
>>+	}
>>+
>>+	close(s);
>>+}
>>+
>>static void test_stream_linger_client(const struct test_opts *opts)
>>{
>>	int fd;
>>@@ -2106,6 +2279,11 @@ static struct test_case test_cases[] = {
>>		.run_client = test_stream_nolinger_client,
>>		.run_server = test_stream_nolinger_server,
>>	},
>>+	{
>>+		.name = "SOCK_STREAM transport change null-ptr-deref",
>>+		.run_client = test_stream_transport_change_client,
>>+		.run_server = test_stream_transport_change_server,
>>+	},
>>	{},
>>};
>>
>>
>>---
>>base-commit: 68d019aa14d97f8d57b0f8d203fd3b44db2ba0c7
>>change-id: 20250306-test_vsock-3e77a9c7a245
>>
>>Best regards,
>>-- 
>>Luigi Leonardi <leonardi@redhat.com>
>>
>

Thanks for the review and all the comments!
Luigi



* RE: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
@ 2025-07-09 14:54 Konstantin Shkolnyy
  2025-07-09 14:57 ` Luigi Leonardi
  2025-07-09 15:26 ` Stefano Garzarella
  0 siblings, 2 replies; 9+ messages in thread
From: Konstantin Shkolnyy @ 2025-07-09 14:54 UTC (permalink / raw)
  To: mhal, sgarzare; +Cc: virtualization, netdev, linux-kernel, v4bel, leonardi

I'm seeing a problem on s390 with the new "SOCK_STREAM transport change 
null-ptr-deref" test. Here is how it appears to happen:

test_stream_transport_change_client() spins for 2s and sends 70K+ 
CONTROL_CONTINUE messages to the "control" socket.

test_stream_transport_change_server() spins calling accept() because it 
keeps receiving CONTROL_CONTINUE.

When the client exits, the server has received just under 1K of those 
70K CONTROL_CONTINUE, so it calls accept() again but the client has 
exited, so accept() never returns and the server never exits.


* Re: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
  2025-07-09 14:54 [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes Konstantin Shkolnyy
@ 2025-07-09 14:57 ` Luigi Leonardi
  2025-07-09 16:45   ` Konstantin Shkolnyy
  2025-07-09 15:26 ` Stefano Garzarella
  1 sibling, 1 reply; 9+ messages in thread
From: Luigi Leonardi @ 2025-07-09 14:57 UTC (permalink / raw)
  To: Konstantin Shkolnyy
  Cc: mhal, sgarzare, virtualization, netdev, linux-kernel, v4bel

Hi Konstantin,

On Wed, Jul 09, 2025 at 09:54:03AM -0500, Konstantin Shkolnyy wrote:
>I'm seeing a problem on s390 with the new "SOCK_STREAM transport 
>change null-ptr-deref" test. Here is how it appears to happen:
>
>test_stream_transport_change_client() spins for 2s and sends 70K+ 
>CONTROL_CONTINUE messages to the "control" socket.
>
>test_stream_transport_change_server() spins calling accept() because 
>it keeps receiving CONTROL_CONTINUE.
>
>When the client exits, the server has received just under 1K of those 
>70K CONTROL_CONTINUE, so it calls accept() again but the client has 
>exited, so accept() never returns and the server never exits.
>

Thanks for pointing this out!
I had an offline discussion with Stefano about this issue.
This patch[1] should address it.
Please let us know if it works on s390 too.

Cheers,
Luigi

[1]https://lore.kernel.org/netdev/20250708111701.129585-1-sgarzare@redhat.com/



* Re: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
  2025-07-09 14:54 [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes Konstantin Shkolnyy
  2025-07-09 14:57 ` Luigi Leonardi
@ 2025-07-09 15:26 ` Stefano Garzarella
  2025-07-09 15:41   ` Stefano Garzarella
  1 sibling, 1 reply; 9+ messages in thread
From: Stefano Garzarella @ 2025-07-09 15:26 UTC (permalink / raw)
  To: Konstantin Shkolnyy
  Cc: mhal, virtualization, netdev, linux-kernel, v4bel, leonardi

On Wed, 9 Jul 2025 at 16:54, Konstantin Shkolnyy <kshk@linux.ibm.com> wrote:
>
> I'm seeing a problem on s390 with the new "SOCK_STREAM transport change
> null-ptr-deref" test. Here is how it appears to happen:
>
> test_stream_transport_change_client() spins for 2s and sends 70K+
> CONTROL_CONTINUE messages to the "control" socket.
>
> test_stream_transport_change_server() spins calling accept() because it
> keeps receiving CONTROL_CONTINUE.
>
> When the client exits, the server has received just under 1K of those
> 70K CONTROL_CONTINUE, so it calls accept() again but the client has
> exited, so accept() never returns and the server never exits.
>

Yep, I saw exactly the same issue while testing a new test.
I already sent a fix:
https://lore.kernel.org/netdev/20250708111701.129585-1-sgarzare@redhat.com/

Please, send a T-b/R-b on that if you can.

Stefano



* Re: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
  2025-07-09 15:26 ` Stefano Garzarella
@ 2025-07-09 15:41   ` Stefano Garzarella
  2025-07-09 19:03     ` Konstantin Shkolnyy
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Garzarella @ 2025-07-09 15:41 UTC (permalink / raw)
  To: Konstantin Shkolnyy
  Cc: mhal, virtualization, netdev, linux-kernel, v4bel, leonardi

On Wed, 9 Jul 2025 at 17:26, Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Wed, 9 Jul 2025 at 16:54, Konstantin Shkolnyy <kshk@linux.ibm.com> wrote:
> >
> > I'm seeing a problem on s390 with the new "SOCK_STREAM transport change
> > null-ptr-deref" test. Here is how it appears to happen:
> >
> > test_stream_transport_change_client() spins for 2s and sends 70K+
> > CONTROL_CONTINUE messages to the "control" socket.
> >
> > test_stream_transport_change_server() spins calling accept() because it
> > keeps receiving CONTROL_CONTINUE.
> >
> > When the client exits, the server has received just under 1K of those
> > 70K CONTROL_CONTINUE, so it calls accept() again but the client has
> > exited, so accept() never returns and the server never exits.

Just to be clear, I was seeing something a bit different.
The accept() in the server is non-blocking, since we set O_NONBLOCK on
the socket, so I see the server looping around a failing accept()
(errno == EAGAIN) while dequeueing the CONTROL_CONTINUE messages, and
after 10-15 seconds the server exits in my case.

It seems strange that it blocks in your case, since it should be a
non-blocking call.
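
To illustrate the behaviour I mean, here is a minimal sketch of draining
a non-blocking listening socket. It uses a TCP loopback listener as a
stand-in for AF_VSOCK so that it runs anywhere, and both helper names
are made up for this example; it is not the code from the test:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: TCP listener on 127.0.0.1 with O_NONBLOCK set.
 * Returns the listening fd, or -1 on error.
 */
static int make_nonblocking_listener(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
		.sin_port = 0,	/* let the kernel pick a free port */
	};
	int fd, flags;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 8) < 0)
		goto err;

	flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		goto err;

	return fd;
err:
	close(fd);
	return -1;
}

/* Accept and close every queued connection. Returns how many were
 * accepted, or -1 on an unexpected error. With O_NONBLOCK set, an empty
 * queue makes accept() fail with EAGAIN instead of blocking.
 */
static int drain_accept_queue(int listen_fd)
{
	int accepted = 0;

	assert(listen_fd >= 0);

	for (;;) {
		int client = accept(listen_fd, NULL, NULL);

		if (client < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return accepted;
			if (errno == EINTR)
				continue;
			return -1;
		}

		close(client);
		accepted++;
	}
}
```

With nothing queued, drain_accept_queue() returns 0 immediately, which
is why the server loop keeps spinning rather than hanging in accept().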

Stefano

> >
>
> Yep, I saw exactly the same issue while testing a new test.
> I already sent a fix:
> https://lore.kernel.org/netdev/20250708111701.129585-1-sgarzare@redhat.com/
>
> Please, send a T-b/R-b on that if you can.
>
> Stefano



* Re: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
  2025-07-09 14:57 ` Luigi Leonardi
@ 2025-07-09 16:45   ` Konstantin Shkolnyy
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Shkolnyy @ 2025-07-09 16:45 UTC (permalink / raw)
  To: Luigi Leonardi
  Cc: mhal, sgarzare, virtualization, netdev, linux-kernel, v4bel

On 09-Jul-25 09:57, Luigi Leonardi wrote:
> Hi Konstantin,
> 
> On Wed, Jul 09, 2025 at 09:54:03AM -0500, Konstantin Shkolnyy wrote:
>> I'm seeing a problem on s390 with the new "SOCK_STREAM transport 
>> change null-ptr-deref" test. Here is how it appears to happen:
>>
>> test_stream_transport_change_client() spins for 2s and sends 70K+ 
>> CONTROL_CONTINUE messages to the "control" socket.
>>
>> test_stream_transport_change_server() spins calling accept() because 
>> it keeps receiving CONTROL_CONTINUE.
>>
>> When the client exits, the server has received just under 1K of those 
>> 70K CONTROL_CONTINUE, so it calls accept() again but the client has 
>> exited, so accept() never returns and the server never exits.
>>
> 
> Thanks for pointing this out!
> I had an offline discussion with Stefano about this issue.
> This patch[1] should address it.
> Please let us know if it works on s390 too.
> 
> Cheers,
> Luigi
> 
> [1]https://lore.kernel.org/netdev/20250708111701.129585-1-sgarzare@redhat.com/
> 
I've run it 40 times with this patch, and it seems OK now. You can add 
my "Tested-by" if you wish.


* Re: [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes
  2025-07-09 15:41   ` Stefano Garzarella
@ 2025-07-09 19:03     ` Konstantin Shkolnyy
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Shkolnyy @ 2025-07-09 19:03 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: mhal, virtualization, netdev, linux-kernel, v4bel, leonardi

On 09-Jul-25 10:41, Stefano Garzarella wrote:
> On Wed, 9 Jul 2025 at 17:26, Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> On Wed, 9 Jul 2025 at 16:54, Konstantin Shkolnyy <kshk@linux.ibm.com> wrote:
>>>
>>> I'm seeing a problem on s390 with the new "SOCK_STREAM transport change
>>> null-ptr-deref" test. Here is how it appears to happen:
>>>
>>> test_stream_transport_change_client() spins for 2s and sends 70K+
>>> CONTROL_CONTINUE messages to the "control" socket.
>>>
>>> test_stream_transport_change_server() spins calling accept() because it
>>> keeps receiving CONTROL_CONTINUE.
>>>
>>> When the client exits, the server has received just under 1K of those
>>> 70K CONTROL_CONTINUE, so it calls accept() again but the client has
>>> exited, so accept() never returns and the server never exits.
> 
> Just to be clear, I was seeing something a bit different.
> The accept() in the server is no-blocking, since we set O_NONBLOCK on
> the socket, so I see the server looping around a failing accept()
> (errno == EAGAIN) while dequeueing the CONTROL_CONTINUE messages, so
> after 10/15 seconds the server ends on my case.
> 
> It seems strange that in your case it blocks, since it should be a
> no-blocking call.

It was my mistake: the accept() doesn't block. I've retested it more
carefully. It keeps returning, and the loop eventually consumes all
queued CONTROL_CONTINUE messages and quits, as you described.
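
A small model of what I observed, in case it helps others: messages
queued on the control channel stay buffered after the sender is done,
so the reader has to loop once per queued message before it sees the
final one. The sketch uses a pipe and a made-up one-byte encoding
instead of the real control socket, and the function name is
hypothetical:

```c
#include <assert.h>
#include <unistd.h>

/* Made-up one-byte message encoding, for illustration only. */
enum { MSG_CONTINUE = 'C', MSG_DONE = 'D' };

/* Queue n CONTINUE messages followed by one DONE while no reader is
 * running, then count how many loop iterations a reader needs before it
 * sees DONE. Returns -1 on error. n is capped so that all writes fit in
 * the pipe buffer (assumes a capacity of at least 4096 bytes, as with
 * Linux defaults) and never block.
 */
static int iterations_until_done(int n)
{
	int fds[2], i, iterations = 0;
	char msg;

	assert(n >= 0 && n < 4096);

	if (pipe(fds) < 0)
		return -1;

	/* "Client": fire-and-forget, one message per loop iteration. */
	msg = MSG_CONTINUE;
	for (i = 0; i < n; i++) {
		if (write(fds[1], &msg, 1) != 1)
			goto err;
	}
	msg = MSG_DONE;
	if (write(fds[1], &msg, 1) != 1)
		goto err;
	close(fds[1]);	/* the sender is gone at this point */

	/* "Server": still has to consume every queued CONTINUE. */
	while (read(fds[0], &msg, 1) == 1 && msg == MSG_CONTINUE)
		iterations++;

	close(fds[0]);
	return iterations;
err:
	close(fds[0]);
	close(fds[1]);
	return -1;
}
```

Scaled up to the 70K+ messages from my run, this is the backlog the
server was still working through after the client had exited.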



end of thread, other threads:[~2025-07-09 19:03 UTC | newest]

Thread overview: 9+ messages
2025-07-09 14:54 [PATCH net-next v4] vsock/test: Add test for null ptr deref when transport changes Konstantin Shkolnyy
2025-07-09 14:57 ` Luigi Leonardi
2025-07-09 16:45   ` Konstantin Shkolnyy
2025-07-09 15:26 ` Stefano Garzarella
2025-07-09 15:41   ` Stefano Garzarella
2025-07-09 19:03     ` Konstantin Shkolnyy
  -- strict thread matches above, loose matches on Subject: below --
2025-06-24 15:40 Luigi Leonardi
2025-06-25  8:26 ` Stefano Garzarella
2025-06-30  9:24   ` Luigi Leonardi
