* RAID5/6 writes
@ 2008-10-01 17:52 Peter Cordes
2008-10-01 19:36 ` Andi Kleen
2008-10-02 0:32 ` Dave Chinner
0 siblings, 2 replies; 6+ messages in thread
From: Peter Cordes @ 2008-10-01 17:52 UTC (permalink / raw)
To: xfs
I just had an idea for speeding up writes to parity-based RAIDs
(RAID4,5,6).[1] If XFS wants to write blocks 1,2,3 and 5,6,7, but it
knows that block 4 is free space, it might be better to also write block
4 (with zeros; don't put uninitialized kernel memory on disk!). It's
probably only useful to do this if XFS already has data in memory to
prove that the gap is not part of the filesystem. Doing extra reads
probably doesn't make sense except in very special cases (e.g.
repeated writes to the same location with the same hole, where just one
read would let them all become full-block or even full-stripe writes).
XFS knows (or should have been told by the admin with mkfs!) what the
stripe geometry is: block size and stripe width. So it could apply
this optimization only if it would make a write cover more whole
blocks or whole stripes.
[1] See http://www.acnc.com/04_01_05.html if you need a reminder of
which RAID level is which... It has good pictures and explanations. :)
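A rough sketch of the decision I have in mind, in Python (all names
are illustrative, not actual XFS internals):

```python
# Hedged sketch of the proposed heuristic. Given the dirty blocks of
# one stripe and the blocks the filesystem can *prove* are free space,
# decide whether zero-filling the gaps turns this into a full-stripe
# write (so the RAID layer needs no reads to recompute parity).

def can_promote_to_full_stripe(dirty, provably_free, stripe_blocks):
    """dirty / provably_free: sets of block indices within the stripe."""
    gaps = set(range(stripe_blocks)) - dirty
    # Only pad blocks we can prove are free; writing zeros over a
    # block of unknown state would corrupt the filesystem.
    return gaps <= provably_free

# Writing blocks 1,2,3 and 5,6,7 of a stripe covering blocks 0..7,
# where blocks 0 and 4 are known to be free space:
print(can_promote_to_full_stripe({1, 2, 3, 5, 6, 7}, {0, 4}, 8))  # True
```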
I use RAID6 on a Dell PERC 6/e with 8 500GB SATA disks, and I'm still
tuning XFS for it... (I'll start another with some tuning questions...)
RAID5 write performance has the same limitations as RAID6, and more
people know about it, so... RAID5 is OK at sequential writes, but a
non-full-stripe write requires reading the rest of the data in the
stripe so the parity block(s) can be recalculated and rewritten. (A
typical RAID chunk size is 64kiB, and with a 7-disk RAID5, a full
stripe is 64kiB*(7-1) = 384kiB.) Within a single 64kiB chunk, small
scattered writes are deadly: each one is a read-modify-write, because
the whole 64kiB chunk is needed (along with the data from the other
disks in the stripe) to recompute parity. HW RAID controllers have
large (e.g. 256MiB) caches so they can merge writes and sometimes
avoid the extra reads.
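The arithmetic behind that example, spelled out (numbers match the
7-disk RAID5 above; the I/O counts assume the classic small-write RMW
cycle):

```python
# 64 KiB chunks on a 7-disk RAID5: 6 data chunks + 1 parity per stripe.
chunk_kib = 64
disks = 7
data_disks = disks - 1
full_stripe_kib = chunk_kib * data_disks
print(full_stripe_kib)             # 384

# Full-stripe write: parity is computed from data already in hand,
# so no reads at all -- just one write per disk.
full_stripe_reads, full_stripe_writes = 0, disks

# Small write inside one chunk: read old data + old parity, then
# write new data + new parity.
small_reads, small_writes = 2, 2
print(small_reads + small_writes)  # 4 I/Os to update one small region
```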
--
#define X(x,y) x##y
Peter Cordes ; e-mail: X(peter@cor , des.ca)
"The gods confound the man who first found out how to distinguish the hours!
Confound him, too, who in this place set up a sundial, to cut and hack
my day so wretchedly into small pieces!" -- Plautus, 200 BC
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: RAID5/6 writes
2008-10-01 17:52 RAID5/6 writes Peter Cordes
@ 2008-10-01 19:36 ` Andi Kleen
2008-10-01 20:13 ` Peter Cordes
2008-10-02 0:32 ` Dave Chinner
1 sibling, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2008-10-01 19:36 UTC (permalink / raw)
To: Peter Cordes; +Cc: xfs
Peter Cordes <peter@cordes.ca> writes:
>
> XFS knows (or should have been told by the admin with mkfs!) what the
> stripe geometry is: block size and stripe width. So it could apply
> this optimization only if it would make a write cover more whole
> blocks or whole stripes.
It's a nice idea, but I don't think XFS knows the actual RAID level,
only the stripes. And for 0/1 it wouldn't be a good idea.
-Andi
* Re: RAID5/6 writes
2008-10-01 19:36 ` Andi Kleen
@ 2008-10-01 20:13 ` Peter Cordes
2008-10-01 20:44 ` Andi Kleen
0 siblings, 1 reply; 6+ messages in thread
From: Peter Cordes @ 2008-10-01 20:13 UTC (permalink / raw)
To: Andi Kleen; +Cc: xfs
On Wed, Oct 01, 2008 at 09:36:13PM +0200, Andi Kleen wrote:
> Peter Cordes <peter@cordes.ca> writes:
> >
> > XFS knows (or should have been told by the admin with mkfs!) what the
> > stripe geometry is: block size and stripe width. So it could apply
> > this optimization only if it would make a write cover more whole
> > blocks or whole stripes.
>
> It's a nice idea, but I don't think XFS knows the actual RAID level,
> only the stripes. And for 0/1 it wouldn't be a good idea.
Yeah, this would have to be a mount option, like stripewrite=1.
There are already a few other essential mount options people need to
learn about for big RAIDs, e.g. inode64.
AFAIK, XFS only knows the stripe geometry (sunit, swidth), not how
many parity blocks are part of each stripe, so it can't tell the
difference between RAID0 and RAID4,5,6 (let alone RAID60...). XFS
on RAID1 will have swidth=0, though. Probably the only sane default
for such an option is off, even when swidth!=0, to make sure it
doesn't cause problems for anyone or slow down RAID0.
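For reference, this is how the geometry gets in there today (the
mkfs.xfs/mount options are real; the device name and numbers are
illustrative for an 8-disk RAID6, i.e. 6 data disks with 64kiB chunks):

```shell
# Tell XFS the stripe unit and width at mkfs time:
mkfs.xfs -d su=64k,sw=6 /dev/sdb    # sunit/swidth derived from these
mount -o inode64 /dev/sdb /mnt      # plus the usual big-RAID options
xfs_info /mnt                       # shows sunit/swidth -- but no RAID level
```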
Thanks for the CC, since I'm not subscribed to the xfs list.
* Re: RAID5/6 writes
2008-10-01 20:13 ` Peter Cordes
@ 2008-10-01 20:44 ` Andi Kleen
2008-10-01 21:01 ` Peter Cordes
0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2008-10-01 20:44 UTC (permalink / raw)
To: Peter Cordes; +Cc: Andi Kleen, xfs
On Wed, Oct 01, 2008 at 05:13:31PM -0300, Peter Cordes wrote:
> On Wed, Oct 01, 2008 at 09:36:13PM +0200, Andi Kleen wrote:
> > Peter Cordes <peter@cordes.ca> writes:
> > >
> > > XFS knows (or should have been told by the admin with mkfs!) what the
> > > stripe geometry is: block size and stripe width. So it could apply
> > > this optimization only if it would make a write cover more whole
> > > blocks or whole stripes.
> >
> > It's a nice idea, but I don't think XFS knows the actual RAID level,
> > only the stripes. And for 0/1 it wouldn't be a good idea.
>
> Yeah, this would have to be a mount option, like stripewrite=1.
> There are already a few other essential mount options people need to
> learn about for big RAIDs, e.g. inode64.
The other problem I can think of is that determining if something is
free space might need more read IO if the free extent tree is not
completely cached.
-Andi
* Re: RAID5/6 writes
2008-10-01 20:44 ` Andi Kleen
@ 2008-10-01 21:01 ` Peter Cordes
0 siblings, 0 replies; 6+ messages in thread
From: Peter Cordes @ 2008-10-01 21:01 UTC (permalink / raw)
To: Andi Kleen; +Cc: xfs
On Wed, Oct 01, 2008 at 10:44:50PM +0200, Andi Kleen wrote:
> The other problem I can think of is that determining if something is
> free space might need more read IO if the free extent tree is not
> completely cached.
Yeah, I think I mentioned that in my original suggestion. Unless
there are repeated writes with the same hole, it's probably not worth
it to read from disk to figure out if a sector is free. XFS could just
see what it can do with what it already has in memory. This is just
an optimization, so it doesn't have to succeed every time it's possible.
* Re: RAID5/6 writes
2008-10-01 17:52 RAID5/6 writes Peter Cordes
2008-10-01 19:36 ` Andi Kleen
@ 2008-10-02 0:32 ` Dave Chinner
1 sibling, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2008-10-02 0:32 UTC (permalink / raw)
To: Peter Cordes; +Cc: xfs
On Wed, Oct 01, 2008 at 02:52:37PM -0300, Peter Cordes wrote:
> I just had an idea for speeding up writes to parity-based RAIDs
> (RAID4,5,6).[1] If XFS wants to write blocks 1,2,3 and 5,6,7, but it
> knows that block 4 is free space, it might be better to also write
> block 4 (with zeros; don't put uninitialized kernel memory on disk!).
How does XFS know that block 4 is free space? Or indeed that this is
a single block-sized hole in a range of blocks mapped to different
inodes or to filesystem metadata?
If you want something like this, you need to have the lower layer
discover holes like this and, instead of immediately initiating
a RMW cycle, call back to the filesystem to determine whether the hole
is free space. That works for all filesystems, not just XFS.
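A loose model of that layering (all names hypothetical, not a real
kernel interface): the RAID layer, on seeing a sub-stripe write, asks
the filesystem whether the untouched blocks are free before falling
back to RMW.

```python
# Sketch: the query lives in the lower layer; the filesystem only
# answers "are these blocks provably free?".

class Filesystem:
    def __init__(self, allocated):
        self.allocated = set(allocated)  # blocks in use (data or metadata)

    def blocks_are_free(self, blocks):
        """Callback for the lower layer: True only if every block is
        provably free space, so it may be overwritten with zeros."""
        return self.allocated.isdisjoint(blocks)

def plan_stripe_write(fs, dirty, stripe_blocks):
    gaps = set(range(stripe_blocks)) - set(dirty)
    if fs.blocks_are_free(gaps):
        return "full-stripe"         # zero-fill the gaps, skip the reads
    return "read-modify-write"

fs = Filesystem(allocated={1, 2, 3, 5, 6, 7})
print(plan_stripe_write(fs, {1, 2, 3, 5, 6, 7}, 8))  # full-stripe
```

The point of putting the callback at this layer is that any filesystem
can implement it, not just XFS.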
> It's
> probably only useful to do this if XFS has data in memory to prove
> that the gap is not part of the filesystem. Doing extra reads
> probably doesn't make sense except in very special cases. (e.g.
> repeated writes to the same location with the same hole, so just one
> read would let them all become full-block or even full-stripe writes.)
That's the sort of workload the stripe cache is supposed to optimise;
every subsequent sparse write to the same stripe line avoids the
read part of the RMW cycle. The filesystem is the wrong layer to
optimise this type of workload....
FWIW, XFS has its own problems with writeback triggering RMW
cycles - this sort of thing for data could be considered noise
compared to the RMW storm that inode writeback can cause under
memory pressure, as XFS has to do RMW cycles itself on the inode
cluster buffers. See the Inode Writeback section of this
document:
http://oss.sgi.com/archives/xfs/2008-09/msg00289.html
This can only be fixed at the filesystem level because no amount of
tweaking the storage can improve the I/O patterns that XFS is
issuing. These RMW cycles in inode writeback can cause the inode
flush rate to drop to a few tens of inodes per second. When you have
hundreds of thousands of dirty inodes in a system, it can take
*hours* to flush the dirty inodes to disk....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com