netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] virtio: orphan skbs if we're relying on timer to free them
@ 2009-05-18 12:48 Rusty Russell
  2009-05-19  2:40 ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: Rusty Russell @ 2009-05-18 12:48 UTC (permalink / raw)
  To: netdev; +Cc: virtualization

We check for finished xmit skbs on every xmit, or on a timer (unless
the host promises to force an interrupt when the xmit ring is empty).
This can penalize userspace tasks which fill their sockbuf.  Not much
difference with TSO, but measurable with large numbers of packets.

There are a finite number of packets which can be in the transmission
queue.  We could fire the timer more than every 100ms, but that would
just hurt performance for a corner case.  This seems neatest.

With interrupt when Tx ring empty:
				Seconds	TxPkts	TxIRQs
 1G TCP Guest->Host:		3.76	32833	32758
 1M normal pings:		111	1000008	997463
 1M 1k pings (-l 120):		55	1000007	488920

Without interrupt, without orphaning:
 1G TCP Guest->Host:		3.64	32806	1
 1M normal pings:		106	1000008	1
 1M 1k pings (-l 120):		68	1000005	1

With orphaning:
 1G TCP Guest->Host:		3.86	32821	1
 1M normal pings:		102	1000007	1
 1M 1k pings (-l 120):		43	1000005	1

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/virtio_net.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -522,6 +522,11 @@ static int start_xmit(struct sk_buff *sk
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
+	/* We queue a limited number; don't let that delay writers if
+	 * we are slow in getting tx interrupt. */
+	if (!vi->free_in_tasklet)
+		skb_orphan(skb);
+
 again:
 	/* Free up any pending old buffers before queueing new ones. */
 	free_old_xmit_skbs(vi);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio: orphan skbs if we're relying on timer to free them
  2009-05-18 12:48 [RFC] virtio: orphan skbs if we're relying on timer to free them Rusty Russell
@ 2009-05-19  2:40 ` David Miller
  2009-05-21  6:57   ` Rusty Russell
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2009-05-19  2:40 UTC (permalink / raw)
  To: rusty; +Cc: netdev, virtualization

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Mon, 18 May 2009 22:18:47 +0930

> We check for finished xmit skbs on every xmit, or on a timer (unless
> the host promises to force an interrupt when the xmit ring is empty).
> This can penalize userspace tasks which fill their sockbuf.  Not much
> difference with TSO, but measurable with large numbers of packets.
> 
> There are a finite number of packets which can be in the transmission
> queue.  We could fire the timer more than every 100ms, but that would
> just hurt performance for a corner case.  This seems neatest.
 ...
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

If this is so great for virtio it would also be a great idea
universally, but we don't do it.

What you're doing by orphan'ing is creating a situation where a single
UDP socket can loop doing sends and monopolize the TX queue of a
device.  The only control we have over a sender for fairness in
datagram protocols is that send buffer allocation.

I'm guilty of doing this too in the NIU driver, also because there I
lack a "TX queue empty" interrupt and this can keep TCP sockets from
getting stuck.

I think we need a generic solution to this issue because it is getting
quite common to see cases where the packets in the TX queue of a
device can sit there indefinitely.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio: orphan skbs if we're relying on timer to free them
  2009-05-19  2:40 ` David Miller
@ 2009-05-21  6:57   ` Rusty Russell
  2009-05-21  7:15     ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: Rusty Russell @ 2009-05-21  6:57 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, virtualization

On Tue, 19 May 2009 12:10:13 pm David Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Mon, 18 May 2009 22:18:47 +0930
> > We check for finished xmit skbs on every xmit, or on a timer (unless
> > the host promises to force an interrupt when the xmit ring is empty).
> > This can penalize userspace tasks which fill their sockbuf.  Not much
> > difference with TSO, but measurable with large numbers of packets.
> >
> > There are a finite number of packets which can be in the transmission
> > queue.  We could fire the timer more than every 100ms, but that would
> > just hurt performance for a corner case.  This seems neatest.
...
> > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
>
> If this is so great for virtio it would also be a great idea
> universally, but we don't do it.
>
> What you're doing by orphan'ing is creating a situation where a single
> UDP socket can loop doing sends and monopolize the TX queue of a
> device.  The only control we have over a sender for fairness in
> datagram protocols is that send buffer allocation.

Urgh, that hadn't even occurred to me.  Good point.

> I'm guilty of doing this too in the NIU driver, also because there I
> lack a "TX queue empty" interrupt and this can keep TCP sockets from
> getting stuck.
>
> I think we need a generic solution to this issue because it is getting
> quite common to see cases where the packets in the TX queue of a
> device can sit there indefinitely.

I haven't thought this through properly, but how about a hack where we don't 
orphan packets if the ring is over half full?

Then I guess we could overload the watchdog as a more general timer-after-no-
xmit?

Rusty.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio: orphan skbs if we're relying on timer to free them
  2009-05-21  6:57   ` Rusty Russell
@ 2009-05-21  7:15     ` David Miller
  2009-05-21 17:24       ` Dimitris Michailidis
  2009-05-25 11:01       ` Rusty Russell
  0 siblings, 2 replies; 6+ messages in thread
From: David Miller @ 2009-05-21  7:15 UTC (permalink / raw)
  To: rusty; +Cc: netdev, virtualization

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 21 May 2009 16:27:05 +0930

> On Tue, 19 May 2009 12:10:13 pm David Miller wrote:
>> What you're doing by orphan'ing is creating a situation where a single
>> UDP socket can loop doing sends and monopolize the TX queue of a
>> device.  The only control we have over a sender for fairness in
>> datagram protocols is that send buffer allocation.
> 
> Urgh, that hadn't even occurred to me.  Good point.

Now this all is predicated on this actually mattering. :-)

You could argue that the scheduler as well as the size of the
TX queue should be limiting and enforcing fairness.

Someone really needs to test this.  Just skb_orphan() every packet
at the beginning of dev_hard_start_xmit(), then run some test
program with two clients looping out UDP packets to see if one
can monopolize the device and get a significantly larger amount
of TX resources than the other.  Repeat for 3, 4, 5, etc. clients.

> I haven't thought this through properly, but how about a hack where
> we don't orphan packets if the ring is over half full?

That would also work.  And for the NIU case this would be great
because I DO have a marker bit for triggering interrupts in the TX
descriptors.  There's just no "all empty" interrupt on TX (who
designs these things? :( ).

> Then I guess we could overload the watchdog as a more general
> timer-after-no- xmit?

Yes, but it means that teardown of a socket can be delayed up to
the amount of that timer.  Factor in all of this crazy
round_jiffies() stuff people do these days and it could cause
pauses for real use cases and drive users batty.

Probably the most profitable avenue is to see if this is a real issue
afterall (see above).  If we can get away with having the socket
buffer represent socket --> device space only, that's the most ideal
solution.  It will probably also improve performance a lot across the
board, especially on NUMA/SMP boxes as our TX complete events tend to
be in difference places than the SKB producer.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio: orphan skbs if we're relying on timer to free them
  2009-05-21  7:15     ` David Miller
@ 2009-05-21 17:24       ` Dimitris Michailidis
  2009-05-25 11:01       ` Rusty Russell
  1 sibling, 0 replies; 6+ messages in thread
From: Dimitris Michailidis @ 2009-05-21 17:24 UTC (permalink / raw)
  To: David Miller; +Cc: rusty, netdev, virtualization

David Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Thu, 21 May 2009 16:27:05 +0930
> 
>> On Tue, 19 May 2009 12:10:13 pm David Miller wrote:
>>> What you're doing by orphan'ing is creating a situation where a single
>>> UDP socket can loop doing sends and monopolize the TX queue of a
>>> device.  The only control we have over a sender for fairness in
>>> datagram protocols is that send buffer allocation.
>> Urgh, that hadn't even occurred to me.  Good point.
> 
> Now this all is predicated on this actually mattering. :-)
> 
> You could argue that the scheduler as well as the size of the
> TX queue should be limiting and enforcing fairness.
> 
> Someone really needs to test this.  Just skb_orphan() every packet
> at the beginning of dev_hard_start_xmit(), then run some test
> program with two clients looping out UDP packets to see if one
> can monopolize the device and get a significantly larger amount
> of TX resources than the other.  Repeat for 3, 4, 5, etc. clients.

The cxgb3 driver has had skb_orphan in its transmit routine forever (also 
due to lack of Tx interrupts) and I am not aware of adverse effects caused 
by doing so.  It does skip skb_orphan when skb_shared but probably nobody 
sends sharead skbs with destructors.

The only application I know of that has trouble with lazy skb freeing is 
pktgen because it treats freeing as an indication that the packet has been 
transmitted so it's thrown off if packets sit there for a while.  (Also 
freeing just indicates that the DMA is done, not that the packet has been 
sent, and modern devices have quite a bit of buffering.)

> 
>> I haven't thought this through properly, but how about a hack where
>> we don't orphan packets if the ring is over half full?
> 
> That would also work.  And for the NIU case this would be great
> because I DO have a marker bit for triggering interrupts in the TX
> descriptors.  There's just no "all empty" interrupt on TX (who
> designs these things? :( ).
> 
>> Then I guess we could overload the watchdog as a more general
>> timer-after-no- xmit?
> 
> Yes, but it means that teardown of a socket can be delayed up to
> the amount of that timer.  Factor in all of this crazy
> round_jiffies() stuff people do these days and it could cause
> pauses for real use cases and drive users batty.
> 
> Probably the most profitable avenue is to see if this is a real issue
> afterall (see above).  If we can get away with having the socket
> buffer represent socket --> device space only, that's the most ideal
> solution.  It will probably also improve performance a lot across the
> board, especially on NUMA/SMP boxes as our TX complete events tend to
> be in difference places than the SKB producer.

There's a comment in the cxgb3 driver where it calls skb_orphan that 
explains the rationale and includes some of what you're saying.


> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio: orphan skbs if we're relying on timer to free them
  2009-05-21  7:15     ` David Miller
  2009-05-21 17:24       ` Dimitris Michailidis
@ 2009-05-25 11:01       ` Rusty Russell
  1 sibling, 0 replies; 6+ messages in thread
From: Rusty Russell @ 2009-05-25 11:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, virtualization

[-- Attachment #1: Type: text/plain, Size: 7249 bytes --]

On Thu, 21 May 2009 04:45:03 pm David Miller wrote:
> Probably the most profitable avenue is to see if this is a real issue
> afterall (see above).  If we can get away with having the socket
> buffer represent socket --> device space only, that's the most ideal
> solution.  It will probably also improve performance a lot across the
> board, especially on NUMA/SMP boxes as our TX complete events tend to
> be in difference places than the SKB producer.

OK, hacked in skb_orphan() and couldn't see any behavior change here.

Details: I wrote a little test program to do a 1 byte ping pong on a TCP
socket while doing 1k writes to a number of udp sockets.  We never completely
starve the TCP connection, though that may be because it begins at the same
time as the others and gets some traffic before the buffers fill up.

Running it with 0 to 20 sockets, from my gigabit laptop to my 100Mbit
desktop.  Sorry about the attachment, but sometimes a graph is worth about
1k words.

Not sure why write rate shoots up on serious overload though.  Perhaps we're
discarding earlier in the path?

Cheers,
Rusty.

diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1677,6 +1677,8 @@ int dev_hard_start_xmit(struct sk_buff *
 	const struct net_device_ops *ops = dev->netdev_ops;
 	int rc;
 
+	skb_orphan(skb);
+
 	if (likely(!skb->next)) {
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);

Sender: for i in `seq 0 20`; do ~/devel/junk/udpstress 9999 $i debussy 5; sleep 1; done
Receiver: for i in `seq 0 20`; do ./udpstress -r 9999 $i; done

/* udpstress.c: Simple server & client to test multiple UDP streams. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <sys/time.h>
#include <time.h>
#include <netinet/in.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <err.h>
#include <netdb.h>

static volatile unsigned int bytes;

static void dumpstats(int signal)
{
	char buffer[128];

	sprintf(buffer, "%u\n", bytes);
	write(STDOUT_FILENO, buffer, strlen(buffer));
	_exit(0);
}

static void tcp_dumpstats(int signal)
{
	char buffer[128];

	sprintf(buffer, "TCP: %u\n", bytes);
	write(STDOUT_FILENO, buffer, strlen(buffer));
	_exit(0);
}

static void udp_receiver(unsigned int port)
{
	struct sockaddr_in saddr;
	int sock, ret;
	char buffer[65536];
	int val = 1;

	sock = socket(PF_INET, SOCK_DGRAM, 0);
	if (sock < 0)
		err(1, "creating socket");

	if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(&val)) != 0)
		err(1, "setsockopt(SO_REUSEADDR)");

	/* Wait for incoming. */
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);
	saddr.sin_addr.s_addr = INADDR_ANY;
	if (bind(sock, (struct sockaddr *)&saddr, sizeof(saddr)) != 0)
		err(1, "bind to port %u", port);

	/* Read packets until we get signal. */
	signal(SIGALRM, dumpstats);

	while ((ret = read(sock, buffer, sizeof(buffer))) > 0)
		bytes += ret;
	err(1, "Short read %i", ret);
}

static void udp_client(unsigned int ip, unsigned int port, int fd)
{
	/* We're going to send UDP packets to that addr. */
	struct sockaddr_in saddr;
	int sock, ret;
	char buffer[1024];

	sock = socket(PF_INET, SOCK_DGRAM, 0);
	if (sock < 0)
		err(1, "creating socket");

	/* Connect to remote. */
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);
	saddr.sin_addr.s_addr = ip;
	if (connect(sock, (struct sockaddr *)&saddr, sizeof(saddr)))
		err(1, "connecting socket");

	/* Write packets until we get signal. */
	signal(SIGALRM, dumpstats);

	/* Wait for "go" signal. */
	if (read(fd, buffer, 1) != 1)
		err(1, "Reading go signal");

	while ((ret = write(sock, buffer, sizeof(buffer))) > 0)
		bytes += ret;
	err(1, "Short write %i", ret);
}

/* Exits when tcp socket closes. */
static void tcp_receiver(unsigned int port)
{
	/* We're going to send TCP packets to that addr. */
	struct sockaddr_in saddr;
	int sock;
	char c;
	int val = 1;

	sock = socket(PF_INET, SOCK_STREAM, 0);
	if (sock < 0)
		err(1, "creating socket");

	if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(&val)) != 0)
		err(1, "setsockopt(SO_REUSEADDR)");

	/* Wait for incoming. */
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);
	saddr.sin_addr.s_addr = INADDR_ANY;
	if (bind(sock, (struct sockaddr *)&saddr, sizeof(saddr)) != 0)
		err(1, "bind to port %u", port);

	if (listen(sock, 1) != 0)
		err(1, "listen");

	sock = accept(sock, NULL, NULL);
	if (sock < 0)
		err(1, "Accepting");

	while (read(sock, &c, 1) > 0) {
		bytes++;
		if (write(sock, &c, 1) != 1)
			break;
	}
}

static void tcp_client(unsigned int ip, unsigned int port, int fd)
{
	/* We're going to send TCP packets to that addr. */
	struct sockaddr_in saddr;
	int sock;
	char c;

	sock = socket(PF_INET, SOCK_STREAM, 0);
	if (sock < 0)
		err(1, "creating socket");

	/* Connect to remote. */
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);
	saddr.sin_addr.s_addr = ip;
	if (connect(sock, (struct sockaddr *)&saddr, sizeof(saddr)))
		err(1, "connecting socket");

	/* Write packets until we get signal. */
	signal(SIGALRM, tcp_dumpstats);

	/* Wait for "go" signal. */
	if (read(fd, &c, 1) != 1)
		err(1, "Reading go signal");

	while (write(sock, &c, 1) > 0) {
		bytes++;
		if (read(sock, &c, 1) != 1)
			err(1, "Short read");
	}
	err(1, "Short write");
}

static unsigned int host_to_ipaddr(const char *name)
{
	struct hostent *host;
	unsigned int ip = 0;

	host = gethostbyname(name);
	if (!host)
		err(1, "gethostbyname(%s)", name);

	if (host->h_addrtype != AF_INET ||
	    host->h_length != sizeof(struct in_addr))
		errx(1, "gethostbyname gave wrong type");

	memcpy(&ip, host->h_addr_list[0], 4);
	return ip;
}

static void usage(void)
{
	errx(1,
	     "Usage: udpstress -r portnum number OR\n"
	     "	udpstress portnum number ipaddr timeout");
}

int main(int argc, char *argv[])
{
	unsigned int ip, i, port, kids;
	int fds[2];

	if (argc < 4)
		usage();

	if (pipe(fds) != 0)
		err(1, "Creating pipe");

	signal(SIGALRM, SIG_IGN);
	setsid();

	if (strcmp(argv[1], "-r") == 0) {
		port = atoi(argv[2]);
		kids = atoi(argv[3]);
		for (i = 0; i < kids; i++) {
			switch (fork()) {
			case 0:
				udp_receiver(port + i);
			case -1:
				err(1, "forking");
			}
		}

		/* Finally, receive TCP until the sender finishes. */
		tcp_receiver(port);
		{
			char buffer[128];

			/* Don't just use printf here; gets overwritten by
			 * other children when writing to file. */
			sprintf(buffer, "TCP: %u\n", bytes);
			write(STDOUT_FILENO, buffer, strlen(buffer));
		}
	} else {
		char buffer[kids = atoi(argv[2])+1];
		port = atoi(argv[1]);
		ip = host_to_ipaddr(argv[3]);
		for (i = 0; i < kids - 1; i++) {
			switch (fork()) {
			case 0:
				udp_client(ip, port + i, fds[0]);
			case -1:
				err(1, "forking");
			}
		}
		/* Finally, TCP sender. */
		switch (fork()) {
		case 0:
			tcp_client(ip, port, fds[0]);
		case -1:
			err(1, "forking");
		}

		/* Wait for them to settle. */
		sleep(1);

		if (write(fds[1], buffer, kids) != kids)
			errx(1, "Short write on pipe");
		sleep(atoi(argv[4]));
	}

	/* Kill kids. */
	kill(-getpid(), SIGALRM);

	/* We want to exit last. */
	for (i = 0; i < kids; i++)
		wait(NULL);
	return 0;
}


[-- Attachment #2: results.gnumeric --]
[-- Type: application/x-gnumeric, Size: 12248 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-05-25 11:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-18 12:48 [RFC] virtio: orphan skbs if we're relying on timer to free them Rusty Russell
2009-05-19  2:40 ` David Miller
2009-05-21  6:57   ` Rusty Russell
2009-05-21  7:15     ` David Miller
2009-05-21 17:24       ` Dimitris Michailidis
2009-05-25 11:01       ` Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).