From: Dave Chinner <david@fromorbit.com>
To: "C. Morgan Hamill" <chamill@wesleyan.edu>
Cc: Stan Hoeppner <stan@hardwarefreak.com>, xfs <xfs@oss.sgi.com>
Subject: Re: Question regarding XFS on LVM over hardware RAID.
Date: Tue, 4 Feb 2014 08:41:28 +1100
Message-ID: <20140203214128.GR13997@dastard>
In-Reply-To: <1391443675-sup-1730@al.wesleyan.edu>

On Mon, Feb 03, 2014 at 11:12:39AM -0500, C. Morgan Hamill wrote:
> Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500:
> > On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
> > > On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
> > > > So, basically, --dataalignment is my friend during pvcreate and
> > > > lvcreate.
> > > 
> > > If the logical sector size reported by your RAID controller is 512
> > > bytes, then "--dataalignment=9216s" should start your data section on a
> > > RAID60 stripe boundary after the metadata section.
> > > 
> > > The PhysicalExtentSize should probably also match the 4608KB stripe
> > > width, but this is apparently not possible.  PhysicalExtentSize must be
> > > a power of 2 value.  I don't know if or how this will affect XFS aligned
> > > write out.  You'll need to consult with someone more knowledgeable of LVM.
> > 
> > You can't do single IOs of that size, anyway, so this is where the
> > BBWC on the raid controller does its magic and caches sequential IOs
> > until it has full stripe writes cached....
> 
> So I am probably missing something here, could you clarify?  Are you
> saying that I can't do single IOs of that size (by which I take your
> meaning to be IOs as small as 9216 sectors) because my RAID controller
> won't let me (i.e., it will cache anything smaller than the
> stripe size anyway)?

Typical limitations on IO size are the size of the hardware DMA
scatter-gather rings of the HBA/raid controller. For example, the
two hardware RAID controllers in my largest test box have
limitations of 70 and 80 segments and maximum IO sizes of 280k and
320k.
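
Those two limits are consistent with each scatter-gather segment
mapping a single 4k page. That page size is an assumption on my part
for illustration, not something read off the hardware. A minimal
sketch of the arithmetic in Python:

```python
# Assumption: each scatter-gather segment maps one 4 KiB page.
SEGMENT_KIB = 4

def max_io_kib(segments, segment_kib=SEGMENT_KIB):
    """Largest single IO, in KiB, that fits in `segments` ring entries."""
    return segments * segment_kib

for segments in (70, 80):
    print(f"{segments} segments -> {max_io_kib(segments)} KiB per IO")
```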

And looking at the IO being dispatched with blktrace, I see the
maximum size is:

  8,80   2       61     0.769857112 44866  D  WS 12423408 + 560 [qemu-system-x86]
  8,80   2       71     0.769877563 44866  D  WS 12423968 + 560 [qemu-system-x86]
  8,80   2       72     0.769889767 44866  D  WS 12424528 + 560 [qemu-system-x86]
                                                            ^^^

560 sectors or 280k. So for this hardware, sequential 280k writes
are hitting the BBWC. And because they are sequential, the BBWC is
writing them back as full stripe writes after aggregating them in
NVRAM. Hence there are no performance-diminishing RMW cycles
occurring, even though the individual IO size is much smaller than
the stripe unit/width....
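
To make the reading of that trace concrete, here is a small Python
sketch that converts the IO length to KiB and checks that the three
dispatches are back to back. It assumes blktrace's usual 512-byte
sector units; the helper names are made up for this example.

```python
SECTOR_BYTES = 512  # assumption: trace offsets/lengths are 512-byte sectors

# (start sector, length in sectors), copied from the three trace lines above
ios = [(12423408, 560), (12423968, 560), (12424528, 560)]

def io_kib(sectors):
    """Convert a length in sectors to KiB."""
    return sectors * SECTOR_BYTES // 1024

def is_sequential(ios):
    """True if every IO starts exactly where the previous one ended."""
    return all(start + length == next_start
               for (start, length), (next_start, _) in zip(ios, ios[1:]))

print(io_kib(560), "KiB per IO, sequential:", is_sequential(ios))
```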

> Or are you saying that XFS with these given
> settings won't make writes that small (which seems false, since I'm
> essentially telling it to do writes of precisely that size).  I'm a bit
> unclear on that.

What su/sw tells XFS is how to align allocation of files, so that
when we dispatch sequential IO to that file it is aligned to the
underlying storage because the extents that the filesystem allocated
for it are aligned. This means that if you write exactly one stripe
width of data, it will hit each disk exactly once. It might take 10
IOs to get the data to the storage, but it will only hit each disk
once.
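
As a rough illustration of "hits each disk exactly once", the Python
sketch below counts how many distinct stripe units a write touches on
each data disk. The geometry is an assumption chosen only so that it
multiplies out to the 4608KB stripe width mentioned earlier (18 data
disks with a 256k stripe unit), and the simple modulo mapping ignores
parity rotation and the RAID 0 layer of a real RAID60.

```python
from collections import Counter

SU_KIB = 256      # assumed stripe unit
DATA_DISKS = 18   # assumed data disk count; 18 * 256 KiB = 4608 KiB
SW_KIB = SU_KIB * DATA_DISKS

def disk_hits(offset_kib, length_kib):
    """Count the distinct stripe units a write touches on each data disk."""
    first = offset_kib // SU_KIB
    last = (offset_kib + length_kib - 1) // SU_KIB
    return Counter(unit % DATA_DISKS for unit in range(first, last + 1))

aligned = disk_hits(0, SW_KIB)    # starts on a stripe boundary
shifted = disk_hits(64, SW_KIB)   # same size, starts 64 KiB late

print("aligned, max stripe units per disk:", max(aligned.values()))
print("shifted, max stripe units per disk:", max(shifted.values()))
```

In this model the aligned write touches one stripe unit on every
disk, while the shifted write spills into a second stripe and touches
the first disk twice. How many IOs it took to deliver the data does
not enter into it.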

The function of the stripe cache (in software raid) and the BBWC (in
hardware RAID) is to prevent RMW cycles while the
filesystem/hardware is still flinging data at the RAID lun. Only
once the controller has complete stripe widths will it calculate
parity and write back the data, thereby avoiding a RMW cycle....
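
A toy model of that aggregation, in Python, might look like the
following. It is only a sketch of the idea: the class name is
invented, writes are assumed not to overlap, and real controller
firmware and the md stripe cache are far more involved. The 9216
sector stripe width and 560 sector IO size are the numbers from this
thread.

```python
class ToyStripeCache:
    """Hold incoming writes until a whole stripe is cached, then flush it."""

    def __init__(self, stripe_sectors):
        self.stripe_sectors = stripe_sectors
        self.pending = {}   # stripe index -> sectors cached so far
        self.flushed = []   # stripe indices written back as full stripes

    def write(self, start, length):
        # Assumes writes never overlap, so sector counts can simply be summed.
        end = start + length
        while start < end:
            stripe = start // self.stripe_sectors
            stripe_end = (stripe + 1) * self.stripe_sectors
            chunk = min(end, stripe_end) - start
            self.pending[stripe] = self.pending.get(stripe, 0) + chunk
            if self.pending[stripe] == self.stripe_sectors:
                # Full stripe cached: parity can be computed without a RMW.
                del self.pending[stripe]
                self.flushed.append(stripe)
            start += chunk

cache = ToyStripeCache(stripe_sectors=9216)
total = 2 * 9216
pos = 0
while pos < total:
    n = min(560, total - pos)   # sequential IOs of at most 560 sectors
    cache.write(pos, n)
    pos += n

print("full stripes written back:", cache.flushed)
```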

> In addition, does this in effect mean that when it comes to LVM, extent
> size makes no difference for alignment purposes?  So I don't have to
> worry about anything other than aligning the beginning and ending of
> logical volumes, volume groups, etc. to 9216 sector multiples?

No, you still have to align everything to the underlying storage so
that the filesystem on top of the volumes is correctly aligned.
Where the data will be written (i.e. how the filesystem allocates the
underlying blocks) determines the IO alignment of sequential/large
user IOs, and that matters far more than the size of the sequential
IOs the kernel uses to write the data.
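
For a feel of what that means with power-of-2 extents, here is a
Python sketch that works out which physical extent boundaries land on
a stripe boundary. The 9216 sector stripe width and data alignment
come from this thread; the 4MiB extent size (8192 sectors of 512
bytes) is an assumption used purely for illustration.

```python
from math import gcd

STRIPE_SECTORS = 9216   # stripe width in 512-byte sectors (from the thread)
PE_START = 9216         # start of the data area, i.e. --dataalignment=9216s
PE_SECTORS = 8192       # assumption: 4 MiB physical extents

def extent_start(n):
    """Sector at which physical extent `n` begins."""
    return PE_START + n * PE_SECTORS

def is_stripe_aligned(sector):
    return sector % STRIPE_SECTORS == 0

# Extent boundaries and stripe boundaries coincide every lcm() sectors.
lcm = STRIPE_SECTORS * PE_SECTORS // gcd(STRIPE_SECTORS, PE_SECTORS)
extents_between_aligned = lcm // PE_SECTORS

aligned_extents = [n for n in range(28) if is_stripe_aligned(extent_start(n))]
print("aligned every", extents_between_aligned, "extents:", aligned_extents)
```

Under those assumptions only every 9th extent starts on a stripe
boundary, so a logical volume that begins on any other extent is not
aligned to the stripe width.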

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
