strange TCP stack behiviour with write()es in pieces

public inbox for linux-kernel@vger.kernel.org

* strange TCP stack behiviour with write()es in pieces
@ 2002-01-02 16:28 Michal Moskal
  2002-01-02 18:26 ` Edgar Toernig
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Michal Moskal @ 2002-01-02 16:28 UTC (permalink / raw)
  To: linux-kernel

Hi,

I found something interesting (at least to me ;) in the Linux TCP stack.
I don't know whether it should be regarded as a bug, or whether it's known.
Anyway, this email is not meant to start a flame war of any kind (test
results are flammable material... ;)

It occurs in programs doing packet communication over TCP, when the
peer waits for a whole packet before sending an answer. If they send data
with two write() calls (for example, one for the packet header and one for
the packet data), performance decreases dramatically: from several thousand
packet exchanges per second down to exactly 100 (2.2.19) or 25 (2.4.[57])
on x86. The 100 seems to be related to the HZ variable; see also the AXP
results, where HZ is 10 times bigger.

Maybe a code example will tell more:

	struct header {
		int cmd;
		int len;
	};

	void send_packet(int cmd, void *data, int len)
	{
		struct header h = { cmd, len };

		write(fd, &h, sizeof(h));
		write(fd, data, len);
	}

is, let's say, 300 times slower than:

	void send_packet(int cmd, void *data, int len)
	{
		struct header h = { cmd, len };
		char tmp[BUFSIZE];

		memcpy(tmp, &h, sizeof(h));
		memcpy(tmp + sizeof(h), data, len);
		write(fd, tmp, len + sizeof(h));
	}

when running over loopback. Similar effects can be seen when running over
ethernet (the condition is that the next packet is requested only after the
first one is received).

I, personally, would expect the two-write version to be at most two times
slower (as there might be a need to send two IP packets instead of one).
Also note that while it is obvious that the version copying into a buffer
on the stack should be faster, it is not so obvious when there is a need to
malloc() the buffer before sending (for example, if there is no upper limit
on len).  However, even if we need to malloc() the buffer, the single-write
version is still faster by orders of magnitude.
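
A minimal sketch of the malloc() variant mentioned above, for the case where
len has no upper bound. The name send_packet_heap(), the explicit fd
parameter and the 0/-1 return convention are assumptions of this sketch, not
part of the original program; EINTR handling is left out for brevity.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct header {
	int cmd;
	int len;
};

/*
 * Build header and payload in one heap buffer and hand the whole packet
 * to the kernel in a single write() where possible.
 * Returns 0 on success, -1 on error (hypothetical convention).
 */
int send_packet_heap(int fd, int cmd, const void *data, int len)
{
	struct header h = { cmd, len };
	size_t total, off = 0;
	char *tmp;

	if (len < 0)
		return -1;
	total = sizeof(h) + (size_t)len;
	tmp = malloc(total);
	if (tmp == NULL)
		return -1;

	memcpy(tmp, &h, sizeof(h));
	if (len > 0)
		memcpy(tmp + sizeof(h), data, (size_t)len);

	/* write() may accept fewer bytes than requested, so loop. */
	while (off < total) {
		ssize_t n = write(fd, tmp + off, total - off);

		if (n < 0) {
			free(tmp);
			return -1;
		}
		off += (size_t)n;
	}
	free(tmp);
	return 0;
}
```

The extra malloc() and memcpy() cost CPU time only, whereas the two-write
version appears to wait on a timer, which would explain why the copy still
wins.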

I don't know how many user space programs this affects. Probably not many,
as they often use buffering of some kind.

This is true for both 2.2 and 2.4, IPv4 and IPv6. One vs two writes doesn't
seem to make a big difference for unix domain sockets though.

I found it while working on a client/server program that was horribly
slow just because of this (of course I fixed it, but that's not the point).

I tried to find it in kernel sources, but probably I didn't try hard enough ;)

I attach a test program and the results of tests on a few different machines.

Test results follow. In each pair below, the first line is the two-write
mode and the second the single-write mode. Please don't be misled by
differences between 2.2 and 2.4, as they might be results of kernel patches;
also, machines other than roke were under heavy load. The only important
thing is that two-writes mode works at *constant* speed, independent of
machine speed.

(to make things a bit sweeter I can tell that on fbsd 4.4 stable fragmented
writes go at 10 packets/sec; unfortunately I don't have other machines to
check right now ;)

Linux roke 2.4.7 #3 Wed Oct 3 22:22:24 CEST 2001 i686 pld
cpu MHz		: 840.426
model name	: AMD Duron(tm) Processor
IPv4
  25 packets/sec
  31833 packets/sec
IPv6
  25 packets/sec
  31634 packets/sec
UNIX
  66135 packets/sec
  77457 packets/sec

Linux roke 2.2.19 #6 Sun Sep 30 20:25:08 CEST 2001 i686 pld
cpu MHz		: 840.442
model name	: AMD Duron(tm) Processor
IPv4
  100 packets/sec
  34562 packets/sec
IPv6
  100 packets/sec
  38555 packets/sec
UNIX
  72355 packets/sec
  90586 packets/sec

Linux boniek 2.2.19 #2 Tue Mar 27 17:19:45 CEST 2001 alpha pld
IPv4
  1024 packets/sec
  23351 packets/sec
UNIX
  42219 packets/sec
  50643 packets/sec

# and more recent 2.4:

Linux kenny 2.4.16 #1 SMP Thu Dec 20 16:16:22 CET 2001 i686 pld
cpu MHz         : 699.331
cpu MHz         : 699.331
model name      : Pentium III (Cascades)
model name      : Pentium III (Cascades)
IPv4
  25 packets/sec
  16965 packets/sec
IPv6
  25 packets/sec
  14928 packets/sec
UNIX
  30111 packets/sec
  32143 packets/sec

sparc64/2.2.19 behaves similarly to x86/2.2.19
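
The constant two-write rates can be turned into a per-exchange stall with a
little arithmetic. The sketch below assumes HZ is 100 on the x86 kernels and
1024 on alpha; the function names are made up for illustration.

```c
/*
 * Time one request/response exchange must have taken, in milliseconds,
 * given a measured rate in exchanges per second.
 */
double stall_ms(double exchanges_per_sec)
{
	return 1000.0 / exchanges_per_sec;
}

/* The same stall expressed in timer ticks, for a kernel with the given HZ. */
double stall_ticks(double hz, double exchanges_per_sec)
{
	return hz / exchanges_per_sec;
}
```

stall_ms(100) is 10 ms and stall_ms(25) is 40 ms, while stall_ms(1024) is
just under 1 ms. In ticks, stall_ticks(100, 100) and stall_ticks(1024, 1024)
are both 1, and stall_ticks(100, 25) is 4. A whole number of ticks on every
machine is consistent with the sender waiting on a kernel timer rather than
on CPU work.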

-- 
: Michal ``,/\/\,       '' Moskal    | |            : GCS {C,UL}++++$
:          |    |alekith      @    |)|(| . org . pl : {E--, W, w-,M}-
:    Linux: We are dot in .ORG.    |                : {b,e>+}++ !tv h
: CurProj: ftp://ftp.pld.org.pl/people/malekith/ksi : PLD Team member

* Re: strange TCP stack behiviour with write()es in pieces
  2002-01-02 16:28 Michal Moskal
@ 2002-01-02 18:26 ` Edgar Toernig
  2002-01-02 19:46 ` dean gaudet
  2002-01-02 21:49 ` David Schwartz
  2 siblings, 0 replies; 7+ messages in thread
From: Edgar Toernig @ 2002-01-02 18:26 UTC (permalink / raw)
  To: Michal Moskal; +Cc: linux-kernel

Michal Moskal wrote:
> 
> So, it occurs in programs doing packet communication over TCP, when
> peer waits for a packet to send an answer. If they send data with two
> write() calls (for example to write packet header and packet data),
> the performance dramatically decreases (down to exactly 100 (2.2.19)
> or 25 (2.4.[57]) packet exchanges per second on x86, from several
> thousands. 100 seems to be related to HZ variable, see also AXP results,
> where HZ is 10 times bigger).

Try disabling the Nagle algorithm:

        int i = 1;
        if (setsockopt(fd, SOL_TCP, TCP_NODELAY, &i, sizeof(i)) == -1)
            perror("TCP_NODELAY");

Ciao, ET.

* Re: strange TCP stack behiviour with write()es in pieces
  2002-01-02 16:28 Michal Moskal
  2002-01-02 18:26 ` Edgar Toernig
@ 2002-01-02 19:46 ` dean gaudet
  2002-01-02 20:21   ` Richard B. Johnson
  2002-01-02 21:49 ` David Schwartz
  2 siblings, 1 reply; 7+ messages in thread
From: dean gaudet @ 2002-01-02 19:46 UTC (permalink / raw)
  To: Michal Moskal; +Cc: linux-kernel

On Wed, 2 Jan 2002, Michal Moskal wrote:

> 	void send_packet(int cmd, void *data, int len)
> 	{
> 		struct header h = { cmd, len };
>
> 		write(fd, &h, sizeof(h));
> 		write(fd, data, len);
> 	}

you should look into writev(2).
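
A minimal sketch of what that could look like for the send_packet() above,
assuming an explicit fd parameter. The name send_packet_writev() is made up
for this example, and a short write is left for the caller to handle.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct header {
	int cmd;
	int len;
};

/*
 * Submit header and payload with one writev() call, so the kernel sees
 * both pieces at once and no user-space copy is needed.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t send_packet_writev(int fd, int cmd, const void *data, int len)
{
	struct header h = { cmd, len };
	struct iovec iov[2];

	if (len < 0)
		return -1;

	iov[0].iov_base = &h;
	iov[0].iov_len = sizeof(h);
	iov[1].iov_base = (void *)data;	/* iov_base is not const-qualified */
	iov[1].iov_len = (size_t)len;

	return writev(fd, iov, 2);
}
```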

you might also want to look at this paper
<http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html>, it's probably
similar to the problems you're seeing.

-dean


* Re: strange TCP stack behiviour with write()es in pieces
  2002-01-02 19:46 ` dean gaudet
@ 2002-01-02 20:21   ` Richard B. Johnson
  0 siblings, 0 replies; 7+ messages in thread
From: Richard B. Johnson @ 2002-01-02 20:21 UTC (permalink / raw)
  To: dean gaudet; +Cc: Michal Moskal, linux-kernel

On Wed, 2 Jan 2002, dean gaudet wrote:

> On Wed, 2 Jan 2002, Michal Moskal wrote:
> 
> > 	void send_packet(int cmd, void *data, int len)
> > 	{
> > 		struct header h = { cmd, len };
> >
> > 		write(fd, &h, sizeof(h));
> > 		write(fd, data, len);
> > 	}
> 
> you should look into writev(2).
[SNIPPED...]

First, this isn't "TCP stack behavior...". It's an apparent attempt
to write raw (network?) packets using some kernel primitives. I presume
that you have obtained the fd from either socket() or by opening some
device. Whatever. If you are generating a "packet", you need to
make the packet in a buffer and send the packet. You can't presume
that something will concatenate two separate writes into some
kind of "packet". If the hardware is Ethernet, even the hardware
will fight you because it puts a destination-hardware-address,
source-hardware-address, packet-length, data (your packet), then
32-bit CRC into the outgoing packet. FYI, that 'data' is where
the TCP/IP datagram exists.

That said, if you are trying to make some kind of "zero-copy" thing,
you need to leave space in the initial allocation for the header and
other overhead. That way, you do one write to the device.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (797.90 BogoMips).

    I was going to compile a list of innovations that could be
    attributed to Microsoft. Once I realized that Ctrl-Alt-Del
    was handled in the BIOS, I found that there aren't any.

* Re: strange TCP stack behiviour with write()es in pieces
  2002-01-02 16:28 Michal Moskal
  2002-01-02 18:26 ` Edgar Toernig
  2002-01-02 19:46 ` dean gaudet
@ 2002-01-02 21:49 ` David Schwartz
  2 siblings, 0 replies; 7+ messages in thread
From: David Schwartz @ 2002-01-02 21:49 UTC (permalink / raw)
  To: malekith, linux-kernel


On Wed, 2 Jan 2002 17:28:06 +0100, Michal Moskal wrote:

>So, it occurs in programs doing packet communication over TCP, when peer
>waits for a packet to send an answer. If they send data with two write()
>calls (for example to write packet header and packet data), the performance
>dramatically decreases (down to exactly 100 (2.2.19)
>or 25 (2.4.[57]) packet exchanges per second on x86, from several thousands.
>100 seems to be related to HZ variable, see also AXP results, where HZ is 10
>times bigger).

	That's why you should never, ever do anything that stupid. What should the 
kernel do? When it sees the first write, it has no idea there's going to be a 
second write, so it sends a packet. It gives you the benefit of the doubt and 
assumes that you know how to use TCP. When it sees the second write 
immediately thereafter and they're both small, it no longer trusts you and it 
has no idea there isn't going to be a third write a microsecond later, so it 
doesn't send a packet.
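
The decision described above can be modelled roughly as follows. This is a
simplified sketch of the Nagle rule as it is usually stated (RFC 896,
RFC 1122), not the kernel's actual code; struct conn_model and
nagle_should_send_now() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the sender state that the Nagle rule looks at. */
struct conn_model {
	size_t mss;	/* maximum segment size, in bytes */
	size_t unacked;	/* bytes sent but not yet acknowledged */
	bool nodelay;	/* true models TCP_NODELAY being set */
};

/* Should `queued` bytes of pending data go out right now? */
bool nagle_should_send_now(const struct conn_model *c, size_t queued)
{
	if (queued == 0)
		return false;		/* nothing to send */
	if (c->nodelay)
		return true;		/* Nagle switched off */
	if (queued >= c->mss)
		return true;		/* a full segment always goes */
	return c->unacked == 0;		/* a small one only when idle */
}
```

With this model the first small write goes out at once because nothing is
in flight. The second small write is held until the first is acknowledged,
and the peer, still waiting for the rest of the packet, has no reply to
carry that ACK early.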

>I, personally, would expect the second version to be at most two times
>slower (as there might be need to send two IP packets instead of one).
>Also note, that as it is obvious that version with copying to buffer on the
>stack should be faster, it is not so obvious if there is need to malloc()
>buffer before sending (for example if there is no upper limit on len).
>However even if we need to malloc() buffer, second version is still by
>orders of magnitude faster.

	If you can design an algorithm that makes that only two times slower, then 
the world will be excited and interested and perhaps that algorithm will 
replace TCP. But until that time, we're stuck with what we have.

>I found it during work with client/server program that worked horribly slow
>just because of this. (of course I fixed it, but that's not the point).

	THAT IS THE POINT. The problem wasn't in the kernel, it was in the program, 
and you fixed it. If you do smart buffering, TCP can behave efficiently. If 
you don't, it has to guess when to send packets, and it can't possibly 
predict the future and behave in the way you think is optimum.
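
One way to do such buffering in user space is a small output buffer that
collects the pieces of a packet and is flushed once per packet. This is only
a sketch: struct outbuf, outbuf_put(), outbuf_flush() and the 4096-byte size
are all made up for illustration.

```c
#include <string.h>
#include <unistd.h>

#define OUTBUF_SIZE 4096	/* arbitrary size chosen for this sketch */

struct outbuf {
	int fd;
	size_t used;
	char data[OUTBUF_SIZE];
};

/* Push everything queued so far to the kernel. Returns 0 or -1. */
int outbuf_flush(struct outbuf *b)
{
	size_t off = 0;

	while (off < b->used) {
		ssize_t n = write(b->fd, b->data + off, b->used - off);

		if (n < 0)
			return -1;
		off += (size_t)n;
	}
	b->used = 0;
	return 0;
}

/* Queue len bytes, flushing only when the buffer fills up. */
int outbuf_put(struct outbuf *b, const void *src, size_t len)
{
	const char *p = src;

	while (len > 0) {
		size_t room = OUTBUF_SIZE - b->used;
		size_t chunk = len < room ? len : room;

		memcpy(b->data + b->used, p, chunk);
		b->used += chunk;
		p += chunk;
		len -= chunk;
		if (b->used == OUTBUF_SIZE && outbuf_flush(b) < 0)
			return -1;
	}
	return 0;
}
```

A sender would call outbuf_put() for the header, outbuf_put() for the data,
and then outbuf_flush() once, so the kernel normally sees one write per
packet.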

	How does it know you care about latency rather than throughput? And what 
should it do if it sees a steady stream of one byte writes, one every tenth 
of a second?

	DS



* Re: strange TCP stack behiviour with write()es in pieces
@ 2002-01-03 13:22 Michal Moskal
  2002-01-03 20:43 ` David Schwartz
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Moskal @ 2002-01-03 13:22 UTC (permalink / raw)
  To: David Schwartz; +Cc: linux-kernel

On Wed, Jan 02, 2002 at 01:49:56PM -0800, David Schwartz wrote:
> 
> On Wed, 2 Jan 2002 17:28:06 +0100, Michal Moskal wrote:
> >I, personally, would expect the second version to be at most two times
> >slower (as there might be need to send two IP packets instead of one).
> >Also note, that as it is obvious that version with copying to buffer on the
> >stack should be faster, it is not so obvious if there is need to malloc()
> >buffer before sending (for example if there is no upper limit on len).
> >However even if we need to malloc() buffer, second version is still by
> >orders of magnitude faster.
> 
> 	If you can design an algorithm that makes that only two times slower, then 
> the world will be excited and interested and perhaps that algorithm will 
> replace TCP. But until that time, we're stuck with what we have.

With Nagle disabled it works 17/15 times slower, which is much less than
two. Similarly with UNIX domain sockets.

> >I found it during work with client/server program that worked horribly slow
> >just because of this. (of course I fixed it, but that's not the point).
> 
> 	THAT IS THE POINT. The problem wasn't in the kernel, it was in the program, 
> and you fixed it. If you do smart buffering, TCP can behave efficiently. If 
> you don't, it has to guess when to send packets, and it can't possibly 
> predict the future and behave in the way you think is optimum.

Ok, *now* I know that ;)

Thank you all for pointers.

-- 
: Michal ``,/\/\,       '' Moskal    | |            : GCS {C,UL}++++$
:          |    |alekith      @    |)|(| . org . pl : {E--, W, w-,M}-
:    Linux: We are dot in .ORG.    |                : {b,e>+}++ !tv h
: CurProj: ftp://ftp.pld.org.pl/people/malekith/ksi : PLD Team member


* Re: strange TCP stack behiviour with write()es in pieces
  2002-01-03 13:22 strange TCP stack behiviour with write()es in pieces Michal Moskal
@ 2002-01-03 20:43 ` David Schwartz
  0 siblings, 0 replies; 7+ messages in thread
From: David Schwartz @ 2002-01-03 20:43 UTC (permalink / raw)
  To: malekith; +Cc: linux-kernel


On Thu, 3 Jan 2002 14:22:52 +0100, Michal Moskal wrote:

>>    If you can design an algorithm that makes that only two times slower,
>>then the world will be excited and interested and perhaps that algorithm will
>>replace TCP. But until that time, we're stuck with what we have.

>With Nagle disabled it works 17/15 times slower, which is much less than
>two. Similarly with UNIX domain sockets.

	However, with Nagle disabled, there is no bound to how poor network 
efficiency can be. If you do a single byte write every tenth of a second, you 
will send out a packet for each single byte.
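
To put a rough number on that, the sketch below computes the share of each
packet that is payload, assuming minimum IPv4 and TCP headers of 20 bytes
each with no options, and ignoring link-layer framing. The function name is
made up for this example.

```c
#include <stddef.h>

/* Minimum IPv4 and TCP header sizes, without options. */
#define IPV4_HDR_MIN 20
#define TCP_HDR_MIN  20

/* Fraction of the bytes in one packet that is application payload. */
double payload_efficiency(size_t payload_bytes)
{
	size_t on_wire = payload_bytes + IPV4_HDR_MIN + TCP_HDR_MIN;

	return (double)payload_bytes / (double)on_wire;
}
```

payload_efficiency(1) is 1/41, or about 2.4%, while a 1460-byte payload
gives 1460/1500, a little over 97%.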

	You can only disable Nagle if you can assume that the application is smart 
enough to do the coalescing. After all, someone has to. Since we're talking 
about an app that can't coalesce, you cannot disable Nagle. (Unless you 
consider it acceptable to send one byte of data in each packet.)

	Again, an application *must* *not* disable Nagle unless it (the app) takes 
responsibility for ensuring that data is sent in large enough chunks to 
ensure network efficiency. So you can disable Nagle if you want to, but not
until *AFTER* you make sure your application coalesces writes.

	DS


