* possible race condition in qemu nat layer or virtio-net
@ 2024-02-25 18:22 g1pi
From: g1pi @ 2024-02-25 18:22 UTC (permalink / raw)
To: qemu-devel
[-- Attachment #1: Type: text/plain, Size: 3179 bytes --]
Hi all.
I believe I have spotted a race condition in virtio-net or in qemu/kvm
(it shows up only when virtio-net is involved).
To replicate it, one needs a virtualization environment similar to the following.
Host:
- debian 12 x86_64, kernel 6.1.0-18-amd64
- caching name server listening on 127.0.0.1
- qemu version 7.2.9 (Debian 1:7.2+dfsg-7+deb12u5)
- command line:
qemu-system-x86_64 \
-enable-kvm \
-daemonize \
-parallel none \
-serial none \
-m 256 \
-drive if=virtio,format=raw,file=void.raw \
-monitor unix:run/void.mon,server,nowait \
-nic user,model=virtio,hostfwd=tcp:127.0.0.1:3822-:22
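(With this hostfwd rule, the guest can be reached from the host with
something like

$ ssh -p 3822 user@127.0.0.1

assuming an SSH server runs in the guest; "user" is just a placeholder.)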
Guest:
- x86_64, linux/musl or linux/glibc or freebsd or openbsd
- /etc/resolv.conf:
nameserver 10.0.2.2        (the caching DNS server on the host)
nameserver 192.168.1.123   (nonexistent)
and run the attached program in the guest.
The program opens a UDP socket, sends out a bunch of (dns) requests,
poll()s on the socket, and then receives the responses.
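The attached m.c builds with any C compiler; the transcripts below assume
the default output name:

$ cc m.c    # produces the ./a.out used below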
If a delay is inserted between the sendto() calls, the (single) response
from the host is received correctly:
$ ./a.out 10.0.2.2 >/dev/null # to warm up the host cache
$ ./a.out 10.0.2.2 delay 192.168.1.123
poll: 1 1 1
recvfrom() 45
<response packet>
recvfrom() -1
If the sendto()s are performed in short order, the response packet
gets lost:
$ ./a.out 10.0.2.2 >/dev/null # to warm up the host cache
$ ./a.out 10.0.2.2 192.168.1.123
poll: 0 1 0
recvfrom() -1
recvfrom() -1
A tcpdump capture on the host side shows no difference between the two cases.
Tcpdump on the guest side is another story: in the good case, it looks like
this
7:32:44.332 IP 10.0.2.15.43276 > 10.0.2.2.53: 33452+ A? example.com. (29)
7:32:44.333 IP 10.0.2.2.53 > 10.0.2.15.43276: 33452 1/0/0 A 93.184.216.34 (45)
7:32:44.349 IP 10.0.2.15.43276 > 192.168.1.123.53: 33452+ A? example.com. (29)
while in the bad case it looks like this
7:32:55.358 IP 10.0.2.15.46537 > 10.0.2.2.53: 33452+ A? example.com. (29)
7:32:55.358 IP 10.0.2.15.46537 > 192.168.1.123.53: 33452+ A? example.com. (29)
7:32:55.358 IP *127.0.0.1*.53 > 10.0.2.15.46537: 33452 1/0/0 A 93.184.216.34 (45)
where the response packet has the wrong source IP (127.0.0.1 instead of
10.0.2.2).
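For reference, captures equivalent to the above can be taken with something
along these lines (the interface names are assumptions and depend on the
actual setup):

host$  tcpdump -n -i lo udp port 53
guest$ tcpdump -n -i eth0 udp port 53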
This looks like a failure in the NAT layer, but it does not happen when
the guest uses another emulated network device: I don't know whether that
is because the relevant code lives in virtio-net, or because the other
devices add overhead that masks the issue.
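For instance, to test with a different device one can simply change the
model in the -nic option (e1000 here is just one non-virtio example):

-nic user,model=e1000,hostfwd=tcp:127.0.0.1:3822-:22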
There's nothing special about port 53: I was just investigating a weird
failure of name resolution in a musl-based guest
(https://www.openwall.com/lists/musl/2024/02/17/3) and wrote the program
to mimic the musl resolver's behaviour.
The test succeeds/fails consistently with a different port too, and in
all the guests I tried (as long as the emulated network device is
virtio-net).
To see the issue, it's important that the response to the first request
arrives fast enough to be effectively simultaneous with the second
request: that is the reason for running a caching nameserver on the host.
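A quick way to check that the cache is warm is to time the query on the
host itself, e.g. with dig (any DNS client will do); after the first run
the reported query time should drop to about 0 msec:

$ dig @127.0.0.1 example.com | grep 'Query time'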
I also filed a bug report with Debian
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064634).
I'm not subscribed to qemu-devel, so please CC me in replies.
Best regards,
g.b.
[-- Attachment #2: m.c --]
[-- Type: text/x-csrc, Size: 2049 bytes --]
#include <stdio.h>
#include <time.h>
#include <poll.h>
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Print a buffer, escaping non-printable bytes as \ooo octal. */
static void dump(const char *s, size_t len) {
    while (len--) {
        char t = *s++;
        if (' ' <= t && t <= '~' && t != '\\')
            printf("%c", t);
        else
            printf("\\%o", t & 0xff);
    }
    printf("\n");
}

int main(int argc, char *argv[]) {
    int sock, rv, n;
    /* Raw DNS query: id 0x82ac (33452), A? example.com. */
    const char req[] =
        "\202\254\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\1\0\1";
    struct timespec delay_l = { 1, 0 };         /* 1 sec */
    struct pollfd pfs;
    struct sockaddr_in me = { 0 };

    sock = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK,
                  IPPROTO_IP);
    assert(sock >= 0);

    /* Bind to the wildcard address on an ephemeral port. */
    me.sin_family = AF_INET;
    me.sin_port = 0;
    me.sin_addr.s_addr = inet_addr("0.0.0.0");
    rv = bind(sock, (struct sockaddr *) &me, sizeof me);
    assert(0 == rv);

    /* Send the same query to every server given on the command line;
     * the literal argument "delay" inserts a short pause instead. */
    for (n = 1; n < argc; n++) {
        if (0 == strcmp("delay", argv[n])) {
            struct timespec delay_s = { 0, (1 << 24) };  /* ~ 16 msec */
            nanosleep(&delay_s, NULL);
        } else {
            struct sockaddr_in dst = { 0 };
            dst.sin_family = AF_INET;
            dst.sin_port = htons(53);
            dst.sin_addr.s_addr = inet_addr(argv[n]);
            rv = sendto(sock, req, sizeof req - 1, MSG_NOSIGNAL,
                        (struct sockaddr *) &dst, sizeof dst);
            assert(rv >= 0);
        }
    }

    /* Give the responses time to arrive, then drain the socket,
     * one recvfrom() per request sent. */
    nanosleep(&delay_l, NULL);
    pfs.fd = sock;
    pfs.events = POLLIN;
    rv = poll(&pfs, 1, 2000);
    printf("poll: %d %d %d\n", rv, pfs.events, pfs.revents);
    for (n = 1; n < argc; n++) {
        char resp[4000];
        if (0 == strcmp("delay", argv[n]))
            continue;
        rv = recvfrom(sock, resp, sizeof resp, 0, NULL, NULL);
        printf("recvfrom() %d\n", rv);
        if (rv > 0)
            dump(resp, rv);
    }
    return 0;
}