From: Stan Hoeppner <stan@hardwarefreak.com>
To: Roberto Spadim <roberto@spadim.com.br>
Cc: "Keld Jørn Simonsen" <keld@keldix.com>,
	Mdadm <linux-raid@vger.kernel.org>, NeilBrown <neilb@suse.de>,
	"Christoph Hellwig" <hch@infradead.org>,
	Drew <drew.kay@gmail.com>
Subject: Re: high throughput storage server?
Date: Thu, 24 Mar 2011 00:52:00 -0500	[thread overview]
Message-ID: <4D8ADC00.5010709@hardwarefreak.com> (raw)
In-Reply-To: <AANLkTinONtwNh9KZKe98FL+u2GV_Fqaf1zR8eMyEDGBp@mail.gmail.com>

Roberto Spadim put forth on 3/23/2011 10:57 AM:
> it's something like 'partitioning'? i don't know xfs very well, but ...
> if you use 99% ag16 and 1% ag1-15
> you should use a raid0 with stripe (for better write/read rate),
> linear wouldn't help like stripe, i'm right?

You should really read up on XFS internals to understand exactly how
allocation groups work.

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

I've explained the basics.  What I didn't mention is that an individual
file can be written concurrently to more than one allocation group,
yielding some of the benefit of striping but without the baggage of
RAID0 over 16 RAID10 or a wide stripe RAID10.  However, I've not been
able to find documentation stating exactly how this is done and under
what circumstances, and I would really like to know.  XFS has some good
documentation, but none of it goes into this kind of low level detail
with lay person digestible descriptions.  I'm not a dev, so I'm unable
to understand how this works by reading the code.

Note that once such a large file is written, reading that file later
puts multiple AGs into play so you have read parallelism approaching the
performance of straight disk striping.

The problems with nested RAID0 over RAID10, or simply a very wide array
(384 disks in this case), are twofold:

1.  Lower performance with files smaller than the stripe width
2.  Poor space utilization for the same reason

Let's analyze the wide RAID10 case.  With 384 disks you get a stripe
width of 192 spindles.  A common stripe block size is 64KB, which is 16
filesystem blocks or 128 disk sectors.  Taking that 64KB and multiplying
by 192 stripe spindles, we get a full stripe size of exactly 12MB.
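
To make the arithmetic concrete, here is a quick back-of-the-envelope
sketch in Python (just the numbers from this example, not output from
any real array):

# Toy arithmetic for the wide RAID10 example above (not tied to any real tool).
SECTOR = 512                 # bytes per disk sector
FS_BLOCK = 4096              # bytes per XFS filesystem block
CHUNK = 64 * 1024            # 64KB stripe unit ("stripe block") per member disk

disks = 384                  # total drives in the RAID10
stripe_width = disks // 2    # RAID10 mirroring halves the data spindles -> 192

full_stripe = CHUNK * stripe_width

print(CHUNK // FS_BLOCK)         # 16 filesystem blocks per chunk
print(CHUNK // SECTOR)           # 128 sectors per chunk
print(full_stripe // (1024**2))  # 12 -> full stripe is exactly 12MB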

If you write a file much smaller than the stripe size, say a 1MB file,
to the filesystem atop this wide RAID10, the file will be striped across
only 16 of the 192 spindles, with 64KB going to each stripe member (16
filesystem blocks, 128 sectors).  I don't know about mdraid, but with
many hardware RAID striping implementations the remaining 176 disks in
the stripe will have zeros or nulls written for their portion of the
stripe, since this file is a tiny fraction of the stripe size.  Also,
all modern disk drives are much more efficient at larger multi-sector
transfers of anywhere from 512KB to 1MB or more than at small 64KB
transfers.
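
Same arithmetic for the 1MB file case, again just a sketch using the
64KB chunk and 192-member stripe from above (how mdraid actually
handles the partial stripe is the open question noted above):

# Continuing the example: how much of the 192-member stripe does a 1MB file use?
CHUNK = 64 * 1024
STRIPE_WIDTH = 192

file_size = 1 * 1024 * 1024                     # 1MB file
members_touched = -(-file_size // CHUNK)        # ceiling division -> 16 members
members_idle = STRIPE_WIDTH - members_touched   # 176 members carry no useful data

print(members_touched, members_idle)            # 16 176
print(file_size / (STRIPE_WIDTH * CHUNK))       # ~0.083 -> about 8% of one full stripe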

By using XFS allocation groups for parallelism instead of a wide stripe
array, you don't suffer this massive waste of disk space, and, since
each file is striped across fewer disks (12 in the case of my example
system), you end up with slightly better throughput as each per-disk
transfer is larger, roughly 170 sectors in this case.  The extremely
wide array, or nested stripe over striped array setup, is only useful
in situations where all files being written are close to or larger than
the stripe size.  There are many application areas where this is not
only plausible but preferred.  Most HPC applications work with data
sets far larger than the 12MB in this example, usually hundreds of megs
if not multiple gigs.  In such cases extremely wide arrays are the way
to go, whether using a single large file store, a cluster of
fileservers, or a cluster filesystem on SAN storage such as CXFS.
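
And the per-disk transfer comparison, a sketch assuming the example
system's 16 x 12-spindle RAID10 layout (again illustration only):

# Per-disk transfer size for the same 1MB file: 12-spindle RAID10 per AG
# versus the 192-member wide stripe (numbers from the example above).
SECTOR = 512
file_size = 1 * 1024 * 1024

wide_per_disk = 64 * 1024          # one 64KB chunk per member in the wide stripe
narrow_per_disk = file_size // 12  # file spread over a 12-spindle array instead

print(wide_per_disk // SECTOR)     # 128 sectors per disk
print(narrow_per_disk // SECTOR)   # ~170 sectors per disk -> larger, more efficient I/O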

Most other environments are going to have a mix of small and large
files, and all sizes in between.  This is the case where leveraging XFS
allocation group parallelism makes far more sense than a very wide
array, and why I chose this configuration for my example system.

Do note that XFS will also outperform any other filesystem when used
directly atop this same 192 spindle wide RAID10 array.  You'll still
have 16 allocation groups, but the performance characteristics of the
AGs change when the underlying storage is a wide stripe.  In this case
the AGs become cylinder groups from the outer to inner edge of the
disks, instead of each AG occupying an entire 12 spindle disk array.

In this case the AGs do more to prevent fragmentation than to increase
parallel throughput at the hardware level.  AGs always allow more
filesystem concurrency though, regardless of the underlying hardware
storage structure, because inodes can be allocated and read in parallel.
This is due to the fact that each XFS AG has its own set of B+ trees and
inodes.  Each AG is a "filesystem within a filesystem".

If we pretend for a moment that an EXT4 filesystem can be larger than
16TB, in this case 28TB, and we tested this 192 spindle RAID10 array
under a highly parallel workload with both EXT4 and XFS, you'd find
that EXT4 throughput is a small fraction of XFS's, due to the fact that
so much of EXT4 IO is serialized, precisely because it lacks XFS'
allocation group architecture.
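
To illustrate the serialization point, here is a conceptual toy (it
models allocators as plain locks and is not XFS or EXT4 code) showing
why sixteen independent per-AG structures beat a single shared one
under a parallel load:

# Conceptual toy only: why independent per-AG metadata helps concurrency.
import threading, time

AG_COUNT = 16
per_ag_locks = [threading.Lock() for _ in range(AG_COUNT)]
global_lock = threading.Lock()   # stand-in for a single serialized allocator

def allocate(file_id, serialized):
    lock = global_lock if serialized else per_ag_locks[file_id % AG_COUNT]
    with lock:
        time.sleep(0.01)         # pretend metadata update

def run(serialized):
    start = time.time()
    threads = [threading.Thread(target=allocate, args=(i, serialized))
               for i in range(64)]
    for t in threads: t.start()
    for t in threads: t.join()
    return time.time() - start

print("single lock :", round(run(True), 2), "s")   # ~0.64s, fully serialized
print("per-AG locks:", round(run(False), 2), "s")  # ~0.04s, 16-way concurrent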

> a question... this example was with directories, how files (metadata)
> are saved? and how file content are saved? and journaling?

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

> speed of write and read will be a function of how you designed it to
> use device layer (it's something like a virtual memory utilization, a
> big memory, and many programs trying to use small parts and when need
> use a big part)

Not only that, but how efficiently you can walk the directory tree to
locate inodes.  XFS can walk many directory trees in parallel, partly
due to allocation groups.  This is one huge advantage it has over
EXT2/3/4, ReiserFS, JFS, etc.
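
XFS does that walking in-kernel; just to illustrate the idea from user
space, here is a sketch that walks independent subtrees concurrently
(the /data/agNN paths are hypothetical, not something XFS creates for
you):

# Sketch: walking independent directory trees concurrently from user space.
# The /data/agNN paths are hypothetical; XFS's own parallelism happens in-kernel.
import os
from concurrent.futures import ThreadPoolExecutor

ROOTS = [f"/data/ag{i:02d}" for i in range(16)]   # hypothetical per-AG directories

def count_entries(root):
    total = 0
    for _dirpath, dirs, files in os.walk(root):
        total += len(dirs) + len(files)
    return total

with ThreadPoolExecutor(max_workers=16) as pool:
    for root, n in zip(ROOTS, pool.map(count_entries, ROOTS)):
        print(root, n)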

-- 
Stan
