* IO scheduling & filesystem v a few processes writing a lot
@ 2005-07-31 16:39 Dr. David Alan Gilbert
2005-07-31 19:16 ` Sander
0 siblings, 1 reply; 5+ messages in thread
From: Dr. David Alan Gilbert @ 2005-07-31 16:39 UTC (permalink / raw)
To: linux-kernel; +Cc: axboe
Hi,
I've got a backup system that I'm trying to eke some more performance
out of - and I don't understand the right way to get the kernel to
do disc writes efficiently - and hence would like some advice.
I was using rsync, but the problem with rsync is that the
backup server then fills up with lots and lots of small files
- I want larger files for spooling to tape.
(Other suggestions welcome)
So I'm trying a switch to streaming gzip'd tars from each
client to the backup server. I have one server that
opens connections to each of the clients and sucks the data
using netcat (now netcat6 in ipv4 mode) and writes it to
disc, one file per client. Now the downside here
relative to rsync is that it is going to transfer and
write a lot more data.
Now the clients are on 100Mb/s, and the server on GigE;
the clients sometimes have to think while they gzip their data, so I'd
like to suck data from multiple clients at once. So I run several of
these netcats in parallel - currently about 9.
I've benchmarked write performance on the filesystem at
60-70MB/s for a single write process (as shown with iostat)
for a simple dd if=/dev/zero of=abigfile bs=1024k
My problem is that with the parallel writes iostat is showing
I'm actually getting ~3MB/s write bandwidth - that stinks!
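For anyone wanting to reproduce the effect, it's roughly this pattern
(a sketch; file names are made up and sizes are kept small here - raise
SIZE_MB for a real benchmark):

```shell
# N concurrent sequential writers to one filesystem; watch the
# aggregate throughput with 'iostat -x 5' in another terminal.
N=9
SIZE_MB=8
for i in $(seq 1 "$N"); do
    dd if=/dev/zero of="stream$i.dat" bs=1M count="$SIZE_MB" 2>/dev/null &
done
wait
ls -l stream?.dat
```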
The machine is a dual xeon with 1GB of RAM, an intel GigE
card and a 2.6.11 kernel, a 3ware-9000 series pci-x controller
with a 1.5TB RAID5 partition running Reiser3. Reiser3 is used
because I couldn't get ext3 stable on a filesystem of this size
(-64ZByte free shown in df), and xfs didn't seem stable on
recovering from an arbitrarily placed reset. The 3ware has
write caching (with battery backup). (Note this is the 9000
series SATA ones, not the older 7000/8000 that really did suck
when writing). The CFQ scheduler is being used and I've turned up
nr_requests to 1024.
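For reference, these are the knobs I'm poking (a sketch; run as root,
and the device name will differ per setup):

```shell
# Inspect and tune the block-layer request queue for the array.
cat /sys/block/sda/queue/scheduler           # active scheduler shown in brackets
echo 1024 > /sys/block/sda/queue/nr_requests # deepen the request queue
cat /sys/block/sda/queue/nr_requests         # confirm the new depth
```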
So as I see it I have to persuade all the appropriate parts to buffer
sensibly and try to generate I/O requests that are as large as possible;
I've told netcat6 (on the read side) to use 128k buffers:
nc6 -4 --recv-only --buffer-size=131072 --idle-timeout=3600
Iostat is showing the world is not happy:
avg-cpu:  %user  %nice %system %iowait  %idle
           0.15   0.00    2.65   97.20   0.00

Device: rrqm/s wrqm/s  r/s    w/s rsec/s  wsec/s rkB/s   wkB/s avgrq-sz avgqu-sz   await svctm  %util
sda       0.00 249.95 0.10 193.62   0.80 3548.55  0.40 1774.28    18.32  1290.24 8225.93  5.15  99.73

avg-cpu:  %user  %nice %system %iowait  %idle
           0.15   0.00    3.05   95.95   0.85

Device: rrqm/s wrqm/s  r/s    w/s rsec/s  wsec/s rkB/s   wkB/s avgrq-sz avgqu-sz   await svctm  %util
sda       0.00 260.96 0.00 217.62   0.00 4342.74  0.00 2171.37    19.96  1630.13 6951.26  4.60 100.12
(I could do with a pointer to something explaining all the fields here!)
But as I read this, the machine is stuck waiting on I/O (OK), but
only managing to write ~2MB/s - and seems to have done that 2MB in
about 200 writes (am I reading that correctly?) - which is rather
on the small side.
So I can understand it going slowly if it is interleaving all the
requests as tiny 10KB writes all over the disc - how do I persuade
it to buffer more and issue larger writes - or failing that,
how do I understand why it isn't?
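One experiment I'm considering - purely a sketch, with GNU dd flags -
is re-blocking each stream through dd so the output side issues 1MB
writes regardless of how the data dribbles in:

```shell
# Demonstrate re-blocking: many 4k input reads become 1MB output writes
# thanks to obs=1M. In the real setup the producer would be the nc6
# stream, e.g.:
#   nc6 -4 --recv-only --buffer-size=131072 ... | dd obs=1M of=client1.tar.gz
dd if=/dev/zero bs=4k count=512 2>/dev/null | dd obs=1M of=reblocked.dat 2>/dev/null
wc -c reblocked.dat
```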
At the moment it is feeling like I'm going to have to write
something that has a single process doing the writes and
accumulates the data from all the clients itself so that there
is only ever one process writing.
I'm open for all suggestions.
Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: IO scheduling & filesystem v a few processes writing a lot
2005-07-31 16:39 IO scheduling & filesystem v a few processes writing a lot Dr. David Alan Gilbert
@ 2005-07-31 19:16 ` Sander
2005-08-01 8:54 ` Dr. David Alan Gilbert
From: Sander @ 2005-07-31 19:16 UTC (permalink / raw)
To: Dr. David Alan Gilbert; +Cc: linux-kernel, axboe
Dr. David Alan Gilbert wrote (ao):
> I've got a backup system that I'm trying to eke some more performance
> out of - and I don't understand the right way to get the kernel to
> do disc writes efficiently - and hence would like some advice.
>
> I was using rsync, but the problem with rsync is that I have
> a back up server then filled with lots and lots of small files
> - I want larger files for spooling to tape.
> (Other suggestions welcome)
Can't you just tar the small files from the backup server to tape? (Or
what is the problem with that?)
> So I'm trying switching to streaming gzip'd tars from each
> client to backup to the server. I have one server that
> opens connections to each of the clients and sucks the data
> using netcat (now netcat6 in ipv4 mode) and writes it to
> disc, one file per client. Now the downside here
> relative to rsync is that it is going to transfer and
> write a lot more data.
You also do incremental backups?
> Now the clients are on 100Mb/s, and the server on GigE,
> the clients sometime have to think while they gzip their data, so I'd
> like to suck data from multiple clients at once. So I run multiple of
> these netcat's in parallel - currently about 9.
>
> I've benchmarked write performance on the filesystem at
> 60-70MB/s for a single write process (as shown with iostat)
> for a simple dd if=/dev/zero of=abigfile bs=1024k
>
> My problem is that with the parallel writes iostat is showing
> I'm actually getting ~3MB/s write bandwidth - that stinks!
How many parallel streams can the system currently handle before the
write bandwidth gets unacceptable?
> The machine is a dual xeon with 1GB of RAM, an intel GigE
> card and a 2.6.11 kernel, a 3ware-9000 series pci-x controller
> with a 1.5TB RAID5 partition running Reiser3.
What mount options? And how many disks?
> Reiser3 is used because I couldn't get ext3 stable on a filesystem of
> this size (-64ZByte free shown in df),
That is not a sign of instability per se AFAIK.
> and xfs didn't seem stable on recovering from an arbitrarily placed
> reset. The 3ware has write caching (with battery backup).
How is the cache configured in the bios?
> I'm open for all suggestions.
Would it be possible to test software raid to see if that gives
different numbers?
Sander
--
Humilis IT Services and Solutions
http://www.humilis.net
* Re: IO scheduling & filesystem v a few processes writing a lot
2005-07-31 19:16 ` Sander
@ 2005-08-01 8:54 ` Dr. David Alan Gilbert
2005-08-01 14:48 ` Sander
From: Dr. David Alan Gilbert @ 2005-08-01 8:54 UTC (permalink / raw)
To: Sander; +Cc: Dr. David Alan Gilbert, linux-kernel, axboe
* Sander (sander@humilis.net) wrote:
> Dr. David Alan Gilbert wrote (ao):
> > I was using rsync, but the problem with rsync is that I have
> > a back up server then filled with lots and lots of small files
> > - I want larger files for spooling to tape.
> > (Other suggestions welcome)
>
> Can't you just tar the small files from the backupserver to tape? (or,
> what is the problem with that?).
Lots of small files->slow; it is an LTO-2 tape drive that is spec'd
at 35MByte/s - it won't get that if I'm feeding it from something
seeking all over.
> > write a lot more data.
>
> You also do incremental backups?
I could - but they are a pain at restore time.
> > I've benchmarked write performance on the filesystem at
> > 60-70MB/s for a single write process (as shown with iostat)
> > for a simple dd if=/dev/zero of=abigfile bs=1024k
> >
> > My problem is that with the parallel writes iostat is showing
> > I'm actually getting ~3MB/s write bandwidth - that stinks!
>
> How many parallel streams can the system currently handle before the
> write bandwidth gets unacceptable?
I'll be honest I don't know; this was running with 9 streams; but
I know the overall speed of the backup goes up as I increase
the parallelism from 5 through 9 - but it still sucks.
> > The machine is a dual xeon with 1GB of RAM, an intel GigE
> > card and a 2.6.11 kernel, a 3ware-9000 series pci-x controller
> > with a 1.5TB RAID5 partition running Reiser3.
>
> What mount options? And how many disks?
7 active discs, raid5; mounted with noatime, nodiratime
> > Reiser3 is used because I couldn't get ext3 stable on a filesystem of
> > this size (-64ZByte free shown in df),
>
> That is not a sign of instability per se AFAIK.
When I fsck it, it fixes things - to me that indicates something is wrong
with the on-disc data; now it might only be the freespace totals - but
the fact that the disc contents are wrong makes me worry - I don't
like having to fsck a 1.5TB partition.
> > and xfs didn't seem stable on recovering from an arbitrarily placed
> > reset. The 3ware has write caching (with battery backup).
>
> How is the cache configured in the bios?
Write cache is on in the 3ware bios as is the battery backup.
> > I'm open for all suggestions.
>
> Would it be possible to test software raid to see if that gives
> different numbers?
Erm I guess I could - but the controller does manage
60/70MB/s write as a raw stream, so as far as I can tell if I can
persuade the kernel not to chop my writes into silly small
chunks things should be good.
Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
* Re: IO scheduling & filesystem v a few processes writing a lot
2005-08-01 8:54 ` Dr. David Alan Gilbert
@ 2005-08-01 14:48 ` Sander
2005-08-02 1:43 ` David Lang
From: Sander @ 2005-08-01 14:48 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Sander, Dr. David Alan Gilbert, linux-kernel, axboe
Dr. David Alan Gilbert wrote (ao):
> * Sander (sander@humilis.net) wrote:
> > Dr. David Alan Gilbert wrote (ao):
> > > I was using rsync, but the problem with rsync is that I have
> > > a back up server then filled with lots and lots of small files
> > > - I want larger files for spooling to tape.
> > > (Other suggestions welcome)
> >
> > Can't you just tar the small files from the backupserver to tape? (or,
> > what is the problem with that?).
>
> Lots of small files->slow; it is an LTO-2 tape drive that is spec'd
> at 35MByte/s - it won't get that if I'm feeding it from something
> seeking all over.
I see. Sorry if the question is stupid, but is it bad not to reach
35MB/sec?
> > > write a lot more data.
> >
> > You also do incremental backups?
>
> I could - but they are a pain at restore time.
Well, bare metal restores are rare, and if you need to do one, IMHO one
full restore and six incrementals (worst case, and with one full backup
a week) are not that painful. Very IMHO of course. You could lessen the
pain with incrementals_since_last_full (can't remember the correct term
ATM).
If you go with weekly fulls, you can have roughly one system per
day streaming a full backup to your backup server.
> > What mount options? And how many disks?
>
> 7 active discs, raid5; mounted with noatime, nodiratime
Should perform at least a bit..
> > > Reiser3 is used because I couldn't get ext3 stable on a filesystem of
> > > this size (-64ZByte free shown in df),
> >
> > That is not a sign of instability per se AFAIK.
>
> When I fsck it, it fixes things - to me that indicates something is
> wrong with the on-disc data; now it might only be the freespace totals
> - but the fact that the disc contents are wrong makes me worry - I
> don't like having to fsck a 1.5TB partition.
Yes, I understand.
> > How is the cache configured in the bios?
>
> Write cache is on in the 3ware bios as is the battery backup.
According to the docs it is just 'enable' or 'disable'. I remember raid
controllers (Intel?) which also have 'write through' or 'write back'.
That was what I was looking for, but no such thing it seems.
> > > I'm open for all suggestions.
> >
> > Would it be possible to test software raid to see if that gives
> > different numbers?
>
> Erm I guess I could - but the controller does manage
> 60/70MB/s write as a raw stream, so as far as I can tell if I can
> persuade the kernel not to chop my writes into silly small
> chunks things should be good.
Yes, but if you can lift the performance at some point, it might get
acceptable. You could for example also try a raid0 stripe across the 7
disks.
Kind regards, Sander
--
Humilis IT Services and Solutions
http://www.humilis.net
* Re: IO scheduling & filesystem v a few processes writing a lot
2005-08-01 14:48 ` Sander
@ 2005-08-02 1:43 ` David Lang
From: David Lang @ 2005-08-02 1:43 UTC (permalink / raw)
To: Sander; +Cc: Dr. David Alan Gilbert, Dr. David Alan Gilbert, linux-kernel, axboe
On Mon, 1 Aug 2005, Sander wrote:
> Dr. David Alan Gilbert wrote (ao):
>> * Sander (sander@humilis.net) wrote:
>>> Dr. David Alan Gilbert wrote (ao):
>>>> I was using rsync, but the problem with rsync is that I have
>>>> a back up server then filled with lots and lots of small files
>>>> - I want larger files for spooling to tape.
>>>> (Other suggestions welcome)
>>>
>>> Can't you just tar the small files from the backupserver to tape? (or,
>>> what is the problem with that?).
>>
>> Lots of small files->slow; it is an LTO-2 tape drive that is spec'd
>> at 35MByte/s - it won't get that if I'm feeding it from something
>> seeking all over.
>
> ic. Sorry if the question is stupid, but is it bad not to reach
> 35MB/sec?
with modern tape drives, when you fall out of streaming mode you are lucky
to get 1/10 of the rated drive performance (not to mention the extra wear
and tear on the tape and the drive)
the common thing is to do disk->disk->tape backups. use rsync to pull your
data from the remote machines to your server, then use tar on the server
to make the one image you want to put on the tape (frequently onto a
different drive), then record that image to tape.
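a rough sketch of that pipeline (hostnames, paths and the tape device
here are made up; the rsync and tape stages are commented out since
they need real hosts and a drive):

```shell
# Stage 2 (building one large image per client) runs as-is on local
# scratch space; stages 1 and 3 are shown as comments.
#   rsync -a client1:/data/ /tmp/backup/client1/            # stage 1: pull
mkdir -p /tmp/backup/client1 /tmp/spool
echo "sample file" > /tmp/backup/client1/f1.txt
tar -C /tmp/backup -czf /tmp/spool/client1.tar.gz client1   # stage 2: aggregate
#   dd if=/tmp/spool/client1.tar.gz of=/dev/st0 bs=256k     # stage 3: stream to tape
tar -tzf /tmp/spool/client1.tar.gz
```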
note that tar is not necessarily the best format to use for this in the
face of tape errors (see backup software companies for details; several
years ago I read an interesting document from the bru backup folks that
went into details)
David Lang
--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare