linux-raid.vger.kernel.org archive mirror
* Re: split RAID1 during backups?
@ 2005-10-24 12:07 Jeff Breidenbach
  2005-10-24 13:26 ` Paul Clements
  2005-10-24 18:55 ` dean gaudet
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff Breidenbach @ 2005-10-24 12:07 UTC (permalink / raw)
  To: linux-raid


>First of all, if the data is mostly static, rsync might work faster.

Any operation that stats the individual files - even to just look at
timestamps - takes about two weeks. Therefore it is hard for me to see
rsync as a viable solution, even though the data is mostly
static. About 400,000 files change between weekly backups.

>I take it sdc and sdd using SATA don't influence each other?

Correct.

>However you will endure a rebuild on md0 when you re-add the disk, but
>given everything is mounted read-only, you should not practically be
>doing anything

If the rebuild operation is a no-op, then that sounds like a great
idea. If the rebuild operation requires scanning over all data in both
drives, I think that's going to be at least as expensive as the
current 10 hour process.

Thanks for the suggestions so far.

Cheers,
Jeff


* Re: split RAID1 during backups?
@ 2005-10-30  3:06 Jeff Breidenbach
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Breidenbach @ 2005-10-30  3:06 UTC (permalink / raw)
  To: linux-raid


Thanks to good advice from many people, here are my findings and
conclusions.

(1) Splitting the RAID works. I have now implemented this technique on
the production system and am making a backup right now.

(2) NBD is cool, works well on Debian, and is very convenient. A
couple of experiments suggest it may be slower than netcat for
blasting data across the network -- by slower, I mean less throughput,
to the point where the network can become a bottleneck. I don't have
conclusive data yet, so take it with a large grain of salt. I am using
netcat for now.

(3) End-to-end throughput is not quite as high as I'd hoped.  At this
point it appears the limiting factor is the speed (throughput) of the
destination disks. During earlier testing, I had been dumping bits to
/dev/null on the destination machine instead of the actual destination
partition. No worries, this can be addressed.

(4) I'll play with fancier options like "write-mostly" when Debian
releases a 2.6.13 kernel, and when I'm convinced that I'm not going to
accidentally introduce slower disks into the RAID and bog the entire
system down for writes.

>sendfile() bypasses the copy to user buffer, which in turn will bypass
>copy to system buffers, which eliminates contention for buffer space. Use
>vmstat to check, if you have a lot of system time and lots of space in
>buffers of various kinds, there's a good possibility that the problem is
>there.

I use the -d option in dd_rescue, which invokes O_DIRECT and therefore
doesn't trash Linux's disk buffer. Unfortunately because of the very
random access patterns of the web server, cache misses are extremely
common anyway.

Cheers,
Jeff

* split RAID1 during backups?
@ 2005-10-26  8:17 Jeff Breidenbach
  2005-10-27 13:23 ` Bill Davidsen
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Breidenbach @ 2005-10-26  8:17 UTC (permalink / raw)
  To: linux-raid


Norman> What you should be able to do with software raid1 is the
Norman> following: Stop the raid, mount both underlying devices
Norman> instead of the raid device, but of course READ ONLY. Both
Norman> contain the complete data and filesystem, and in addition to
Norman> that the md superblock at the end. Both should be identical
Norman> copies of that.  Thus, you do not have to resync
Norman> afterwards. You then can backup the one disk while serving the
Norman> web server from the other. When you are done, unmount,
Norman> assemble the raid, mount it and go on.
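
Spelled out, that procedure would be roughly the following (a sketch
only, untested as written; the /mnt mount points are made up, and
/data1 is where md0 normally lives):

# umount /data1                          (webserver must let go of it)
# mdadm --stop /dev/md0
# mount -o ro /dev/sdc1 /mnt/serve       (webserver reads this copy)
# mount -o ro /dev/sdd1 /mnt/backup      (backup reads this copy)
  ... back up /mnt/backup while serving from /mnt/serve ...
# umount /mnt/serve /mnt/backup
# mdadm --assemble /dev/md0 /dev/sdc1 /dev/sdd1
# mount /dev/md0 /data1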

I tried both variants of Norman's suggestion on a test machine and
they worked great. Shutting down and restarting md0 did not trigger a
rebuild. Perfect! And I could mount the component partitions
read-only at any time. However, on the production machine the
component partitions refused to mount, claiming to be "already
mounted", despite the fact that the component drives do not show up
anywhere in lsof or mtab. When I saw this, I got nervous and did not
even try stopping md0 on the production machine.

# mount -o ro /dev/sdc1 backup
mount: /dev/sdc1 already mounted or backup busy

The two machines hardly match. The test machine has a 2.4.27 kernel
and JBOD drives hanging off a 3ware 7xxx controller. The production
machine has a 2.6.12 kernel and Intel SATA controllers. Both machines
have mdadm 1.9.0, and the discrepancy in behavior seems weird to
me. Any insights?

Paul> There have been a couple bug fixes in the bitmap stuff since
Paul> 2.6.13 was released, but it's stable. You'll need mdadm 2.x as
Paul> well.

It turns out Debian has not yet packaged 2.6.13 even in the unstable
branch. I will wait for this to happen before trying out the whizzy
intent-logging and write-mostly suggestions. I'm brave, but not THAT
brave. 

Dean> i didn't realise you were using reiserfs... i'd suggest
Dean> disabling tail packing... but then i've never used reiser, and
Dean> i've only ever seen reports of tail packing having serious
Dean> performance impact.

Done, thanks.

Bill> If you want to try something "which used to work" see nbd,
Bill> export 500GB from another machine, add the network block device
Bill> to the mirror, let it sync, break the mirror. Haven't tried
Bill> since 2.4.19 or so.

Wow, nbd (network block device) sounds really useful. I wonder if it
is a good way to provide more spindles to a hungry webserver.  Plus
they had a major release yesterday. While I've been focusing on
managing disk contention, if there's an easy way to reduce it, that's
definitely fair game.
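
For my own notes, the export side looks like it is only a couple of
commands (sketch only; the host name, port and spare device below are
placeholders I made up):

on the machine with the spare disk:
# nbd-server 2000 /dev/sdb1              (export the device on TCP port 2000)

on this machine:
# modprobe nbd
# nbd-client sparehost 2000 /dev/nbd0    (the export shows up as /dev/nbd0)

After that, /dev/nbd0 should be usable by md just like a local
partition.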

Some of the other suggestions I'm going to hold off on. For example,
sendfile() doesn't really address the bottleneck of disk contention.
I'm also not so anxious to switch filesystems. That's a two week
endeavor that doesn't really address the contention issue. And it's
also a little hard for me to imagine that someone is going to beat the
pants off of reiserfs, especially since reiserfs was specifically
designed to deal with lots of small files efficiently. Finally, I'm
not going to focus on incremental backups if there's any prayer of
getting a 500GB full backup in 3 hours.  Full backups provide a LOT of
warm fuzzies.

Again, thank you all very much.

-Jeff

* Re: split RAID1 during backups?
@ 2005-10-25  5:01 Jeff Breidenbach
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Breidenbach @ 2005-10-25  5:01 UTC (permalink / raw)
  To: linux-raid


On 10/24/05, Thomas Garner <tlg1466@neo.tamu.edu> wrote:
> Should there be any consideration for the utilization of the gigabit
> interface that is passing all of this backup traffic, as well as the
> speed of the drive that is doing all of the writing during this
> transaction?  Is the 18MB/s how fast the data is being copied over the
> network, or is it some metric within the host system?

The switched gigabit network is plenty fast. The bottleneck is
reading from the RAID1 while it is under contention.

Here are measurements from transferring a chunk of data from
/dev/zero, a single unmounted drive, and RAID1. Measurements are
reported by dd_rescue and reflect how fast data is moving over the
network. I was careful to use smart command line options with
dd_rescue, avoid contaminating Linux's disk cache, and make sure
results were repeatable.

MB/s   Operation
====   ==============================
72.0   dd_rescue /dev/zero - | netcat
61.8   dd_rescue [unmounted single drive] - | netcat
18.8   dd_rescue md0 - | netcat

dd_rescue v1.11 options:
 -B 4096 -q  -l -d -s 11G -m 200M -S 0
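
For completeness, the test pipeline behind those numbers is roughly
the following (the host name and port are placeholders; the option
string is the one listed above):

on the receiving machine, started first:
# netcat -l -p 7000 > /dev/null          (or the real destination partition)

on the machine with md0:
# dd_rescue -B 4096 -q -l -d -s 11G -m 200M -S 0 /dev/md0 - | netcat backuphost 7000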

* Re: split RAID1 during backups?
@ 2005-10-25  3:37 Jeff Breidenbach
  2005-10-25  4:07 ` dean gaudet
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Jeff Breidenbach @ 2005-10-25  3:37 UTC (permalink / raw)
  To: linux-raid


Ok... thanks everyone!

David, you said you are worried about failure scenarios
involved with RAID splitting. Could you please elaborate?
My biggest concern is that I'm going to accidentally trigger
a rebuild no matter what I try, but maybe you have something
more serious in mind.

Brad, your suggestion about kernel 2.6.13 and intent logging and
having mdadm pull a disk sounds like a winner. I'm going to try it
if the software looks mature enough. Should I be scared?

Dean, the comment about "write-mostly" is confusing to me.  Let's say
I somehow marked one of the component drives write-mostly to quiet it
down. How do I get at it? Linux will not let me mount the component
partition if md0 is also mounted. Do you think "write-mostly" or
"write-behind" are likely enough to be magic bullets that I should
learn all about them?
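
If I'm reading the mdadm 2.x documentation right, marking a drive
would go something like this (a sketch I have not tried; note that
without the 2.6.13 bitmap, the re-add below would itself trigger a
full resync):

# mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1
# mdadm /dev/md0 --add --write-mostly /dev/sdd1    (reads now prefer sdc1)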

Bill, thanks for the suggestion to use nbd instead of netcat.  Netcat
is solid software and very fast, but does feel a little like duct
tape. You also suggested putting a third drive (local or nbd remote)
temporarily in the RAID1. What does that buy versus the current
practice of using dd_rescue to copy the data off md0? I'm not
imagining any I/O savings over the current approach.

John, I'm using 4KB blocks in reiserfs with tail packing. All sorts of
other details are in the dmesg output [1]. I agree seeks are a major
bottleneck, and I like your suggestion about putting extra spindles
in. Master-slave won't work because the data is continuously changing.
I'm not going to argue about the optimality of millions of tiny files
(go talk to Hans Reiser about that one!) but I definitely don't foresee
major application redesign any time soon.

Most importantly, thanks for the encouragement. So far it sounds like
there might be some ninja magic required, but I'm becoming
increasingly optimistic that it will be - somehow - possible to manage
disk contention in order to dramatically raise backup speeds.

Cheers,
Jeff

[1] http://www.jab.org/dmesg

* Re: split RAID1 during backups?
@ 2005-10-24 20:28 Jeff Breidenbach
  2005-10-24 20:58 ` John Stoffel
  2005-10-25 22:18 ` David Greaves
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff Breidenbach @ 2005-10-24 20:28 UTC (permalink / raw)
  To: linux-raid


Thanks for all the suggestions.

> a big hint you're suffering from atime updates is write traffic when your
> fs is mounted rw, and your static webserver is the only thing running (and
> your logs go elsewhere)... atime updates are probably the only writes
> then.  try "iostat -x 5".

I think atime updates are an unlikely culprit, since the partition is
mounted with atime turned off. I will look into the directory listing
issue.

# mount | grep md0
/dev/md0 on /data1 type reiserfs (rw,noatime,nodiratime)

# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0  57468 962964 2301336    0    0   345   500    3     2 40  6 31 23


>  The real problem is that you should be able to search the disk faster
>than that, identify the modified files, and do incrementals regularly. I
>have 400GB of 2-5MB files, and it takes minutes, not hours, to scan them.
>That's PATA not SATA, I suspect there may still be issues there, SATA is
>not as well explored, certainly not by me!

I'm not sure this is a relevant comparison. The files in question are
about 1000 times smaller, and there are about 1000 times more of them.

>Get one more drive of the same size, and at backup time add it to
>the mirror. After rebuild take it back out of the mirror. Put a
>remount r/o in there at your discresion. Now you have a valid copy of
>your data, back that up as you like.

Interesting. What is the advantage over the current practice? Is it
faster, or does it use less disk I/O? Reminder: the current practice
(which I think is too slow) is to copy md0 with dd_rescue while the
partition is also feeding a webserver.
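
If I follow, the mechanics would be roughly the following (a sketch;
/dev/sde1 stands in for the extra drive, and I have not checked
whether this kernel can grow a raid1 to three active devices):

# mdadm /dev/md0 --add /dev/sde1           (joins as a spare)
# mdadm --grow /dev/md0 --raid-devices=3   (spare is promoted and resyncs)
  ... wait for the resync to finish in /proc/mdstat ...
# mdadm /dev/md0 --fail /dev/sde1 --remove /dev/sde1
# mdadm --grow /dev/md0 --raid-devices=2

At that point /dev/sde1 holds a consistent copy that can be backed up
at leisure.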

> the md event counts would be out of sync and unless you're using bitmapped
> intent logging this would cause a full resync.  if the raid wasn't online
> you could probably use one of the mdadm options to force the two devices
> to be a sync'd raid1 ... but i'm guessing you wouldn't be able to do it
> online.

I will look into intent logging. This is the first I've heard of it, thanks.
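
From a quick look, turning it on appears to be a one-liner on a
running array (this needs mdadm 2.x and a 2.6.13 kernel, and I have
not tried it):

# mdadm --grow /dev/md0 --bitmap=internal    (add a write-intent bitmap)

With the bitmap in place, a component that drops out and is re-added
only resyncs the blocks that changed while it was gone, instead of the
whole 500GB.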

> other 2.6.x bleeding edge options are to mark one drive as write-mostly
> so that you have no read traffic competition while doing a backup... or
> just use the bitmap intent logging and a nbd to add a third, networked,
> copy of the drive on another machine.

This is also the first I've heard of nbd. Thanks, I'll look into that
too.

>One last thought, the slowness of the disk may result from the extended
>times to do directory operations which you mention. You don't have all
>those thousands of files in a single directory, do you? How long does it
>take to do an unsuccessful "find" from the root? Like "find /base -name
>spvgZy3G" or other name it's not going to find.

Individual directories contain up to about 150,000 files. If I run ls
-U on all directories, it completes in a reasonable amount of time (I
forget how much, but I think it is well under an hour). Reiserfs is
supposed to be good at this sort of thing. If I were to stat each
file, then it's a different story.

Jeff

* split RAID1 during backups?
@ 2005-10-24 10:57 Jeff Breidenbach
  2005-10-24 11:22 ` Jurriaan Kalkman
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Jeff Breidenbach @ 2005-10-24 10:57 UTC (permalink / raw)
  To: linux-raid


Hi all,

I have a two drive RAID1 serving data for a busy website. The
partition is 500GB and contains millions of 10KB files. For reference,
here's /proc/mdstat

Personalities : [raid1]
md0 : active raid1 sdc1[0] sdd1[1]
      488383936 blocks [2/2] [UU]

For backups, I set the md0 partition to readonly and then use dd_rescue
+ netcat to copy the partition over a gigabit network. Unfortunately,
this process takes almost 10 hours. I'm only able to copy about 18MB/s
from md0 due to disk contention with the webserver. If I had the full
attention of a single disk, I could read at nearly 60MB/s.

So - I'm thinking of the following backup scenario.  First, remount
/dev/md0 readonly just to be safe. Then mount the two component
partitions (sdc1, sdd1) readonly. Tell the webserver to work from one
component partition, and tell the backup process to work from the
other component partition. Once the backup is complete, point the
webserver back at /dev/md0, unmount the component partitions, then
switch read-write mode back on.
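
In commands, the idea is roughly this (untested as written; the /mnt
mount points are made up, and /data1 is where md0 is normally
mounted):

# mount -o remount,ro /data1
# mount -o ro /dev/sdc1 /mnt/serve       (point the webserver here)
# mount -o ro /dev/sdd1 /mnt/backup      (back up from here)
  ... run the backup ...
# umount /mnt/serve /mnt/backup
# mount -o remount,rw /data1             (webserver goes back to /dev/md0)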

Am I insane? 

Everything on this system seems bottlenecked by disk I/O. That
includes the rate web pages are served as well as the backup process
described above. While I'm always hungry for performance tips, faster
backups are the current focus. For those interested in gory details
such as drive types, NCQ settings, kernel version and whatnot, I
dumped a copy of dmesg output here: http://www.jab.org/dmesg

Cheers,
Jeff


Thread overview: 25+ messages
2005-10-24 12:07 split RAID1 during backups? Jeff Breidenbach
2005-10-24 13:26 ` Paul Clements
2005-10-24 18:55 ` dean gaudet
  -- strict thread matches above, loose matches on Subject: below --
2005-10-30  3:06 Jeff Breidenbach
2005-10-26  8:17 Jeff Breidenbach
2005-10-27 13:23 ` Bill Davidsen
2005-10-25  5:01 Jeff Breidenbach
2005-10-25  3:37 Jeff Breidenbach
2005-10-25  4:07 ` dean gaudet
2005-10-25  8:35 ` Norman Schmidt
2005-10-25 17:51   ` John Stoffel
2005-10-25 19:20     ` Norman Schmidt
2005-10-25 18:04 ` John Stoffel
2005-10-25 18:13 ` Paul Clements
2005-10-25 20:05 ` Bill Davidsen
2005-10-26 18:15   ` Dan Stromberg
2005-10-24 20:28 Jeff Breidenbach
2005-10-24 20:58 ` John Stoffel
2005-10-25 22:18 ` David Greaves
2005-10-24 10:57 Jeff Breidenbach
2005-10-24 11:22 ` Jurriaan Kalkman
2005-10-24 11:37 ` Brad Campbell
2005-10-24 19:05 ` Bill Davidsen
2005-10-25  4:30 ` Thomas Garner
2005-10-27  0:04 ` Christopher Smith
