Re: What to put for unknown stripe-width?


From: torn5 <torn5@shiftmail.org>
To: Theodore Tso <tytso@MIT.EDU>
Cc: linux-ext4@vger.kernel.org
Subject: Re: What to put for unknown stripe-width?
Date: Tue, 20 Sep 2011 17:29:34 +0200
Message-ID: <4E78B15E.9060702@shiftmail.org>
In-Reply-To: <9D3B900A-8FCF-41B1-852A-FADD953FBDBD@mit.edu>

On 09/20/11 14:47, Theodore Tso wrote:
> But that's OK, because I don't know of any RAID array that supports 
> this kind of radical surgery in parameters in the first case. :-)

Ted, thanks for your reply,

Linux MD RAID supports this; it's called reshape. Most parameter
changes are supported; in particular, adding a new disk and
restriping a raid5 is supported *live*. It's not very stable though...

But apart from the MD live reshape/restripe, what I am more likely to do
is move such a filesystem *live* across the various RAIDs I have,
using LVM's "pvmove". These RAIDs almost all have a 1MB stride, but
different numbers of elements, hence different stripe-widths.
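
To make the stripe-width differences concrete, here is a rough sketch of
the arithmetic. It assumes 4k filesystem blocks and raid5 (one parity
disk); the disk counts and the helper name ext4_raid_hints are only
illustrative, not values from my actual arrays.

```python
# Sketch only: 4 KiB ext4 blocks and RAID5 (one parity disk) are assumptions.
BLOCK_SIZE = 4096
CHUNK_SIZE = 1024 * 1024  # 1MB chunk per device, as described above


def ext4_raid_hints(total_disks, parity_disks=1,
                    chunk_size=CHUNK_SIZE, block_size=BLOCK_SIZE):
    """Return (stride, stripe_width), both in filesystem blocks.

    stride       = filesystem blocks per chunk on one device
    stripe_width = stride * number of data-bearing disks
    """
    data_disks = total_disks - parity_disks
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    stride = chunk_size // block_size
    return stride, stride * data_disks


if __name__ == "__main__":
    for disks in (4, 6, 8):  # hypothetical array sizes
        stride, width = ext4_raid_hints(disks)
        print(f"{disks} disks: stride={stride} stripe_width={width}")
```

The stride stays the same on every array; only the stripe-width moves
with the number of data disks, which is exactly what a pvmove between
arrays would change underneath the filesystem.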

> The other thing to consider is small writes.   If you are doing small writes, a large stripe size is a disaster, because a 32k random write by a program like MySQL will turn into a 3MB read + 3MB write request.

No, this is not correct, for MD at least.
MD computes parity on strips, which are always 4k wide on each
device. In your example that would be a 32k read from each of two
devices, followed by a 32k write to each of two devices. I am testing
this now with iostat to confirm it, using a 4k dd write: I see various
spurious reads and writes (probably due to MD and LVM accounting, dirty
flags etc) which sum to about 108k read and 18k written (aggregated
over all drives) for a single 4k write to the MD device. That's
definitely nowhere near even a single chunk, which is 1MB.
What the chunk size does is determine after how much data the placement
of parity changes (i.e. your ascii-art picture was correct). A large
chunk size like the one I use means that reads smaller than 1MB
hopefully come from 1 spindle only. This is useful for us.
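
As a back-of-the-envelope check of the numbers above, here is a small
sketch. It assumes a plain raid5 read-modify-write (old data and old
parity read, new data and new parity written), a write that is aligned
and stays within one chunk, and it ignores the metadata overhead I see
in iostat. The function names are made up for illustration.

```python
PAGE = 4096               # per-device parity granularity claimed above
CHUNK_SIZE = 1024 * 1024  # 1MB chunk


def rmw_io(write_bytes, page=PAGE):
    """Idealized bytes (read, written), summed over all devices, for one
    small RAID5 write handled as read-modify-write.

    Assumption: the write is page-aligned and lies within a single chunk,
    so only one data device and one parity device are involved.
    """
    rounded = -(-write_bytes // page) * page  # round up to whole pages
    return 2 * rounded, 2 * rounded           # data + parity, each way


def chunks_touched(offset, length, chunk_size=CHUNK_SIZE):
    """Number of chunks (hence, at most, spindles) a read spans."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return last - first + 1


if __name__ == "__main__":
    print(rmw_io(32 * 1024))                        # the 32k MySQL-style write
    print(chunks_touched(0, 512 * 1024))            # aligned 512k read
    print(chunks_touched(768 * 1024, 512 * 1024))   # read crossing a boundary
```

Under these assumptions a 32k write costs 64k of reads and 64k of
writes in total, nowhere near 3MB, and a read shorter than 1MB spans at
most two chunks, and only one when it does not cross a chunk boundary.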

So, regarding my original problem: is the way ext4 uses the stride-size
that every new file begins at the start of a stripe?

When growing an existing file, what do you do? Do you continue writing
from where it left off, without holes, or do you leave a hole, select a
new location at the start of a new stripe and start from there?

What about multiple very small files written out together by pdflush?
Are they packed together on the same stripe without holes, or
does each one go to a different stripe?

Is changing the stripe-width with tune2fs supported on a live, mounted
fs? (I mean maybe with a mount -o remount, but no umount.)

Thanks for your help,

T.
