public inbox for linux-kernel@vger.kernel.org
* Re: [Bench] New benchmark showing fileserver problem in 2.4.12
@ 2001-10-17 13:06 Robert Cohen
  2001-10-17 14:12 ` Marcelo Tosatti
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Robert Cohen @ 2001-10-17 13:06 UTC (permalink / raw)
  To: linux-kernel

I have had a chance to do some more testing with the test program I
posted yesterday. I have been able to try various combinations of
parameters and variations of the programs.

I now have a pretty good idea of what specific activities will see the
performance problems I was seeing. But since I'm not a kernel guru, I
have no idea as to why the problem exists or how to fix it.

I am interested in reports from people who can run the test. I would
like to confirm my findings (or simply confirm that I'm crazy :-).

The problems appear to happen only in a very specific set of
circumstances. It's an incredible coincidence that my original
lantest/netatalk testing happened to hit that specific combination of
factors.
So it looks like I haven't actually found a generic performance problem
with Linux as such. But I would still like to get to the bottom of this.

The factors that cause these problems probably won't occur very often in
real usage, but they are things that are not obviously silly. So it does
indicate a problem with some dark corner of the linux kernel that
probably should be investigated.

I have identified 4 specific factors that contribute to the problem. All
4 have to be present before there is a performance problem.


Summary of the factors involved
===============================

Factor 1: the performance problems only occur when you are rewriting an
existing file in place. That is, writing to an existing file which is
opened without O_TRUNC. Equivalently, if you have written a file and
then seek'ed back to the beginning and started writing again. I admit
this is something that not many real programs (except databases) do. But
it still shouldn't cause any problems.

Factor 2: the performance problems only occur when the part of the file
you are rewriting is not already present in the page cache. This tends
to happen when you are doing I/O to files larger than memory. Or if you
are rewriting an existing file which has just been opened.

Factor 3: the performance problems only happen for I/O that is due to
network traffic, not I/O that was generated locally. I realise this is
extremely strange and I have no idea how it knows that I/O is due to
network traffic, let alone why it cares. But I can assure you that it
does make a difference.

Factor 4: the performance problem is only evident with small writes, e.g.
write calls with an 8k buffer. Actually, the performance hit is there
with larger writes, just not significant enough to be an issue. It's
tempting to say "well just use larger buffers". But this isn't always
possible and anyway, 8k buffers should still work adequately, just not
optimally.



Experimental evidence
=====================


Factor 1: the performance problems only occur when you are rewriting an
existing file in place. That is, writing to an existing file which is
opened without O_TRUNC. Equivalently, if you have written a file and
then seek'ed back to the beginning and started writing again.

Evidence: in the report I posted yesterday, the test I was using
involved 5 clients rewriting 30 Meg files on a 128 Meg machine. The
symptom was that after about 10 seconds, the throughput as shown by
vmstat "bo" drops sharply and we start getting reads occurring as shown
by the "bi" figure. However, with that test the page cache fills up
after 10 seconds. This is only just before the end of the files is
reached and we start rewriting the files. So it's difficult to see which
of those two is causing the problem. Yesterday, I attributed the
problems to the page cache filling up, but I was apparently wrong.

The new test I am using is 5 copies of

./send 200 2 | rsh server ./receive 200 2

Here we have 5 clients, each rewriting a 200 Meg file.
With this test, the page cache fills up after about 10 seconds, but
since we are writing a total of 1 Gig of files, the end of the files is
not reached for 2 minutes or so. It is at this point that we start
rewriting the files.

When the page cache fills up, there is no drop in performance. However,
when the end of the file is reached and we start to rewrite, the
throughput drops and we get the reads happening. So the problems are
obviously due to the rewriting of an existing file, not due to the page
cache filling.

It doesn't make any difference whether the test seeks back to the start
to rewrite or if it closes the file and reopens it without O_TRUNC.
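
The two ways of rewriting in place mentioned above can be sketched
roughly as follows. This is only an illustration, not the code from
receive.c: the function names (fill, rewrite_by_seek, rewrite_by_reopen)
and the 64k cap on the buffer size are made up for the example.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `total` bytes of filler to fd in chunks of at most `bufsize`
 * bytes (bufsize must be 1..65536). Returns 0 on success, -1 on error. */
int fill(int fd, size_t total, size_t bufsize)
{
    static char buf[65536];

    if (bufsize == 0 || bufsize > sizeof buf)
        return -1;
    memset(buf, 'x', bufsize);
    while (total > 0) {
        size_t want = total < bufsize ? total : bufsize;
        ssize_t n = write(fd, buf, want);
        if (n <= 0)
            return -1;
        total -= (size_t)n;
    }
    return 0;
}

/* Variant 1: keep the descriptor open and seek back to the start. */
int rewrite_by_seek(int fd, size_t total, size_t bufsize)
{
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;
    return fill(fd, total, bufsize);
}

/* Variant 2: reopen the existing file without O_TRUNC, so the old
 * contents stay in place and are overwritten from offset 0. */
int rewrite_by_reopen(const char *path, size_t total, size_t bufsize)
{
    int rc;
    int fd = open(path, O_WRONLY);      /* no O_TRUNC, no O_CREAT */

    if (fd < 0)
        return -1;
    rc = fill(fd, total, bufsize);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

In both variants the file keeps its old length, so every write lands on
blocks that already exist on disk, which is the situation Factor 1
describes.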



Factor 2: the performance problems only occur when the part of the file
that is being rewritten is not already present in the page cache.

Evidence: I modified the "receive" test program to write to a named file
and to not delete the file after the run, so I could rewrite existing
files with only one pass. 

On a machine with 128 Megs of memory

I created 5 large test files.
I purged these files from the page cache by writing another file larger
than memory and deleting it.

I did a run of 5 copies of ./send 18 1 | rsh server ./receive 18 1 
(each one on a different file).
I did a second run of ./send 18 1 | rsh server ./receive 18 1
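
The purge step described above (write another file larger than memory,
then delete it) might look something like the sketch below. The function
name write_and_delete is invented for this example. Whether this really
pushes the test files out of the page cache depends on the kernel's
replacement policy, so treat it as an approximation of the procedure
rather than a guaranteed cache flush.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `megs` megabytes of filler to `path`, then delete the file.
 * To mimic the purge step, `megs` should exceed the machine's RAM.
 * Returns 0 on success, -1 on any error. */
int write_and_delete(const char *path, unsigned megs)
{
    static char buf[1 << 20];           /* 1 Meg of filler */
    int rc = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(buf, 0xA5, sizeof buf);
    for (unsigned i = 0; i < megs && rc == 0; i++) {
        size_t off = 0;
        while (off < sizeof buf) {      /* handle short writes */
            ssize_t n = write(fd, buf + off, sizeof buf - off);
            if (n <= 0) {
                rc = -1;
                break;
            }
            off += (size_t)n;
        }
    }
    if (close(fd) != 0)
        rc = -1;
    if (unlink(path) != 0)
        rc = -1;
    return rc;
}
```

On the 128 Meg machine used here, something like
write_and_delete("purge.tmp", 200) would correspond to the step above;
the file name and the figure of 200 are just examples.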

With the first run, the files were not present in page cache and the
performance problems were seen. This run took about 95 seconds. Since
the total size of the 5 files is smaller than page cache available, they
were all still present after the first run.

The second run took about 20 seconds. So the presence of data in the
cache makes a significant difference.

It seems natural to say "of course the cache sped things up, that's what
caches are for". However, the cache should not have sped this operation
up. Only writes are being done, no reads. So there is no reason why the
presence of data in the cache which is going to be overwritten anyway
should speed things up.
Also, the cache shouldn't speed writes up since the program does an fsync
to flush the cache on write. And even if the cache does speed writes, it
should have the same effect on both runs.

I had originally thought the problem occurred when the page cache was
full. I assumed it was due to the extra work to purge something from the
page cache to make space for a new write. However with this test I
observed that the performance was bad even when the page cache did not
fill memory and there was plenty of free memory. So it seems that the
performance problem is purely due to rewriting something which is not
present in page cache. It has nothing to do with the amount of free
memory and whether the page cache is filling memory.

In this kind of test, if the collective size of the files is greater
than the amount of memory available for page cache, then problems can be
observed even with the second run. For example, suppose you are writing
to 120 Megs of files and there is 100 Megs of page cache. On the second
run, even though 100 Megs of the files are present in the page cache, you
get no benefit because each portion of the file will be flushed to make
way for new writes before you get around to rewriting that portion. This
is the standard LRU performance wall when the working set is bigger than
available memory.
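
The LRU wall can be shown with a toy simulation. The sketch below
assumes a strict LRU cache and a purely sequential access pattern
repeated over several passes; the real page cache is more complicated
than this, and the function name lru_cyclic_hits is made up for the
example. One "page" here stands in for one Meg of file data.

```c
#include <stdlib.h>

/* Count cache hits when a file of `file_pages` pages is accessed
 * sequentially, `passes` times over, through a strict-LRU cache that
 * holds `cache_pages` pages. Returns -1 on bad arguments or if
 * memory cannot be allocated. */
long lru_cyclic_hits(int file_pages, int cache_pages, int passes)
{
    long hits = 0, now = 0;
    int cached = 0;
    long *last_use;

    if (file_pages <= 0 || cache_pages <= 0)
        return -1;
    /* last_use[p] is the time page p was last touched, 0 = not cached */
    last_use = calloc((size_t)file_pages, sizeof *last_use);
    if (last_use == NULL)
        return -1;

    for (int pass = 0; pass < passes; pass++) {
        for (int page = 0; page < file_pages; page++) {
            now++;
            if (last_use[page] != 0) {
                hits++;
            } else {
                if (cached == cache_pages) {
                    /* cache full: evict the least recently used page */
                    int victim = -1;
                    for (int v = 0; v < file_pages; v++)
                        if (last_use[v] != 0 &&
                            (victim < 0 || last_use[v] < last_use[victim]))
                            victim = v;
                    last_use[victim] = 0;
                    cached--;
                }
                cached++;
            }
            last_use[page] = now;
        }
    }
    free(last_use);
    return hits;
}
```

With 120 pages of file and 100 pages of cache, lru_cyclic_hits(120, 100, 2)
gives no hits at all on the second pass, because each page is evicted
shortly before it is needed again. With 90 pages of file,
lru_cyclic_hits(90, 100, 2) hits on every page of the second pass.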



Factor 3: the problems only happen for I/O that is due to network
traffic.
Evidence: The problem does occur when you have a second machine
"rsh"ing into the Linux server.
However, if you run the test entirely on the linux server with any of
the following

./send 30 10 | ./receive 30 10
./send 30 10 | rsh localhost ./receive 30 10
./send 30 10 | rsh server ./receive 30 10

then the problem does not occur. Strangely we also don't see any reads
showing up in the vmstat output in these cases.
It seems the page cache is able to rewrite existing files without doing
any reading first under these conditions.

This is the really strange issue. I have no idea why it would make a
difference whether the receive program is taking its standard input from
a local source or from an rsh over the network. Why would the behaviour
of the page cache differ in these circumstances? If any gurus can clue
me in, I would appreciate it.



Factor 4: the performance problem only occurs with small writes.
Evidence: the test programs I posted yesterday were doing IO with 8K
buffers (set by a define) because that was what the original benchmark I
was emulating did. If I modify "receive" to use a 64k buffer, I get
adequate throughput.
The anomalous reads are still happening, but don't seem to impact
performance too much. The throughput ramps smoothly between 8k and 64k
buffers.

One possible response is a variation on the old joke: if you
experience problems when you do 8k writes, then don't do 8k writes.
However, I would like to understand why we are seeing a problem with 8k
writes. It's not as if 8k is *that* small. At worst small writes should
just chew CPU time, but we get lots of CPU idle time during the
benchmark, just poor throughput. The evidence suggests some kind of
constant overhead for each write.
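
If there really is a constant cost per write, the buffer size matters
simply because it sets the number of write calls. A small sketch of that
arithmetic, assuming "Meg" means 2^20 bytes and every write is a full
buffer (write_calls is a made-up helper name):

```c
#include <stddef.h>

/* Number of write() calls needed to write `file_bytes` bytes using a
 * buffer of `buf_bytes` bytes, assuming each call writes a full buffer
 * except possibly the last. Returns 0 if buf_bytes is 0. */
unsigned long write_calls(unsigned long file_bytes, unsigned long buf_bytes)
{
    if (buf_bytes == 0)
        return 0;
    return file_bytes / buf_bytes + (file_bytes % buf_bytes != 0);
}
```

For one pass over a 200 Meg file this gives 25600 calls with an 8k
buffer and 3200 calls with a 64k buffer, so any fixed per-call cost is
paid 8 times as often with 8k writes.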

Modifying the buffer size in send simply reduces the amount of CPU that
send uses, which is as you would expect. Doing this doesn't have much
effect on the overall throughput.


--
Robert Cohen
Unix Support
TLTSU
Australian National University
Ph: 612 58389

* [Bench] New benchmark showing fileserver problem in 2.4.12
@ 2001-10-16  9:07 Robert Cohen
  0 siblings, 0 replies; 13+ messages in thread
From: Robert Cohen @ 2001-10-16  9:07 UTC (permalink / raw)
  To: linux-kernel

I have recently been reporting on problems with file server performance
in recent 2.4 kernels.
Since the setup I was using is difficult for most people to reproduce
(it involved 5 mac clients) I have taken the time to find a benchmark
that more or less reproduces the problems in a more accessible manner.

The original benchmark involved a number of file server clients writing
to the server.

The new benchmark involves two programs "send" and "receive". Send
generates data on standard out.
Receive takes data from stdin and writes it to a file. They are set up to
do this for a number of repetitions.
When "receive" reaches the end of the file it seeks back to the
beginning and rewrites the file.
I think it may be significant that the file is not truncated, it is
overwritten.

Send and Receive are designed to run over an rsh pipe. The programs take
2 parameters: "file_size" and the number of repetitions. The same
parameters should be given to each program.
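
For readers who cannot fetch the sources, the core of a receive-style
program could look roughly like the sketch below. This is NOT the actual
receive.c: the function name receive_like, its parameters and its error
handling are all assumptions made for illustration, and it leaves out
details such as the fsync mentioned elsewhere in this thread.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read data from in_fd and write it to `path`, treating the file as
 * `file_bytes` long and overwriting it in place `reps` times.
 * The file is opened without O_TRUNC, and each repetition seeks back
 * to the start. Returns total bytes written, or -1 on error. */
long long receive_like(int in_fd, const char *path, size_t file_bytes,
                       int reps, size_t bufsize)
{
    long long total = 0;
    int ok = 1, eof = 0;
    char *buf;
    int out_fd;

    if (bufsize == 0)
        return -1;
    buf = malloc(bufsize);
    if (buf == NULL)
        return -1;
    out_fd = open(path, O_WRONLY | O_CREAT, 0600);   /* no O_TRUNC */
    if (out_fd < 0) {
        free(buf);
        return -1;
    }

    for (int r = 0; ok && !eof && r < reps; r++) {
        size_t left = file_bytes;

        if (lseek(out_fd, 0, SEEK_SET) == (off_t)-1) {
            ok = 0;
            break;
        }
        while (ok && !eof && left > 0) {
            size_t want = left < bufsize ? left : bufsize;
            ssize_t got = read(in_fd, buf, want);

            if (got < 0) {
                ok = 0;
            } else if (got == 0) {
                eof = 1;                /* sender finished early */
            } else {
                ssize_t off = 0;
                while (off < got) {     /* handle short writes */
                    ssize_t n = write(out_fd, buf + off,
                                      (size_t)(got - off));
                    if (n <= 0) {
                        ok = 0;
                        break;
                    }
                    off += n;
                }
                left -= (size_t)got;
                total += got;
            }
        }
    }
    if (close(out_fd) != 0)
        ok = 0;
    free(buf);
    return ok ? total : -1;
}
```

Called as receive_like(0, "testfile", 30UL << 20, 10, 8192), this would
correspond to "./receive 30 10" with an 8k buffer; the file name
"testfile" is just a placeholder. Note that a read from a pipe or socket
may return fewer bytes than requested, in which case this sketch writes
out whatever it got.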

To duplicate the activity of the original benchmark, I run 5 copies each
using files of 30 Megs:
./send 30 10 | rsh server ./receive 30 10 &

Since it's a networked benchmark you need at least 2 Linux machines on
a 100 Meg (or faster) network.
Originally I thought I might need to run the "send" programs on separate
machines, but testing indicates that I get the same problems running all
the "send"'s on one machine and the "receives" on another.
I have to admit I used a solaris box to run the sends on since I don't
have 2 linux machines here but I can't see why that would make any
difference.


The source code for send is at http://tltsu.anu.edu.au/~robert/send.c
Receive is at http://tltsu.anu.edu.au/~robert/receive.c

In order to produce the problem, the collective filesize has to be
bigger than the memory in the server.
In this example the collective filesize is 5*30=150 Megs.

You can see the problems most clearly by running vmstat while the
program runs.

So if I run it against a server with 256 Megs of memory, there are no
problems. The run takes about 6 minutes to complete.
A vmstat output is available at
http://tltsu.anu.edu.au/~robert/linux_logs/sr-256

If I run it against a server with 128 Megs of memory, the throughput as
shown by the "bo" stat starts out fine but the page cache usage rises
while the files are written. When the page cache tops out, the "bo"
figure drops sharply. At this point we get reads happening as shown by
"bi" even though the program does no reads. I presume that pages evicted
from page cache need to be read back into page cache before they can be
modified by writes.
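
To make that presumption a bit more concrete: in general, a page that is
not in the cache only has to be read from disk before a write if the
write covers just part of it, since the bytes that are not overwritten
have to come from somewhere. The helper below counts such partly covered
pages for a single write. It is a sketch only; the 4096-byte page size
and the name partial_pages are assumptions, and it says nothing about
what the 2.4 kernel actually does.

```c
#include <stddef.h>

#define PAGE_BYTES 4096UL   /* assumption: 4k pages */

/* For a write covering bytes [offset, offset + len), return how many
 * of the touched pages are only partly overwritten (0, 1 or 2). */
unsigned long partial_pages(unsigned long offset, unsigned long len)
{
    unsigned long n = 0, end, first, last;

    if (len == 0)
        return 0;
    end = offset + len;
    first = offset / PAGE_BYTES;
    last = (end - 1) / PAGE_BYTES;

    if (offset % PAGE_BYTES != 0)
        n++;                            /* first page starts mid-page */
    if (end % PAGE_BYTES != 0 &&
        (last != first || offset % PAGE_BYTES == 0))
        n++;                            /* last page ends mid-page */
    return n;
}
```

An 8k write starting on a page boundary covers two whole pages, so
partial_pages(0, 8192) is 0. The same 8k write starting at offset 100
gives 2, because both the first and the last page it touches are only
partly overwritten.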

With 128 Megs of memory, the benchmark takes about 30 minutes to run. So
it's 5 times slower than with 256 Megs. Given that the system isn't
actually getting any benefit out of the page cache since the files are
never read back in, I would have hoped there wouldn't be much difference.
A vmstat output for a 128 Meg run is at
http://tltsu.anu.edu.au/~robert/linux_logs/sr-128


I can reproduce the problems with 256 Megs of memory by running 5
clients with 60 Meg files instead of 30 Meg files.

I get similar results with the following kernels

2.4.10-ac11 with Rik's Hog patch.
2.4.12-ac3
2.4.11-pre6

With an aa kernel 2.4.13pre2-aa1, once the page cache fills up, we
start getting "order 0 allocation" fails. The mem killer kicks in and
kills one of the receives (even though it only allocates 8k of memory
:-(  ). The remaining clients then show similar throughput problems.

The problem does not occur when the sends and receives are run on the
same machine connected by pipes. This seems to indicate that it's an
interaction between the memory usage by the page cache and the memory
usage by the network subsystem.

Also the problem is not as pronounced if I test with 1 client accessing
150 Megs rather than 5 clients accessing 30 Megs each.

--
Robert Cohen
Unix Support
TLTSU
Australian National University
Ph: 612 58389


end of thread, other threads:[~2001-10-19  6:13 UTC | newest]

Thread overview: 13+ messages
2001-10-17 13:06 [Bench] New benchmark showing fileserver problem in 2.4.12 Robert Cohen
2001-10-17 14:12 ` Marcelo Tosatti
2001-10-17 15:12 ` M. Edward Borasky
2001-10-17 15:18 ` John Stoffel
2001-10-17 15:47 ` Andreas Dilger
2001-10-17 16:44 ` Linus Torvalds
2001-10-18  2:01   ` Leo Mauro
2001-10-18  8:30     ` James Sutherland
2001-10-18 21:36     ` Roger Larsson
2001-10-19  2:53       ` George Greer
2001-10-19  6:08         ` Roger Larsson
     [not found] ` <200110171644.f9HGinZ17717@penguin.transmeta.com>
2001-10-18  4:51   ` Robert Cohen
  -- strict thread matches above, loose matches on Subject: below --
2001-10-16  9:07 Robert Cohen
