From: Stan Hoeppner <stan@hardwarefreak.com>
To: "Keld Jørn Simonsen" <keld@keldix.com>
Cc: Mdadm <linux-raid@vger.kernel.org>,
Roberto Spadim <roberto@spadim.com.br>, NeilBrown <neilb@suse.de>,
Christoph Hellwig <hch@infradead.org>, Drew <drew.kay@gmail.com>
Subject: Re: high throughput storage server?
Date: Wed, 23 Mar 2011 03:53:45 -0500 [thread overview]
Message-ID: <4D89B519.3020907@hardwarefreak.com> (raw)
In-Reply-To: <20110322101403.GA9329@www2.open-std.org>
Keld Jørn Simonsen put forth on 3/22/2011 5:14 AM:
> Of course the IO will be randomized if there are more users, but the
> read IO will tend to be quite sequential if the reading of each process
> is sequential. So if a user reads a big file sequentially, and the
> system is lightly loaded, IO schedulers will tend to order all IO
> for the process so that it is served in one series of operations,
> given that the big file is laid out contiguously on the file system.
With the way I've architected this hypothetical system, the read load on
each allocation group (each 12-spindle array) should be relatively low:
about 3 streams on 14 of the AGs and 4 streams on the remaining two
(roughly 50 concurrent streams in total), _assuming_ the files being read
are spread out evenly across at least 16 directories. As you all read in
the docs for which I provided links, XFS AG parallelism functions at the
directory and file level. For example, if we create 32 directories on a
virgin XFS filesystem of 16 allocation groups, the following layout would
result:
AG1:  /general requirements    AG1:  /alabama
AG2:  /site construction       AG2:  /alaska
AG3:  /concrete                AG3:  /arizona
..
..
AG14: /conveying systems       AG14: /indiana
AG15: /mechanical              AG15: /iowa
AG16: /electrical              AG16: /kansas
AIUI, the first 16 directories get created in consecutive AGs until we
hit the last AG. The 17th directory is then created in the first AG and
the cycle starts over. This is how XFS allocation group parallelism
works. It doesn't provide linear IO scaling for all workloads, and it's
not magic, but it works especially well for multiuser fileservers, and
typically better than multiple nested stripe levels or extremely wide
arrays.
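To make that concrete, here's a rough Python sketch of the round-robin
idea (it models the behavior described above, not the actual XFS
allocator internals, and the ~50-stream total is just the read load
assumed earlier):

  # Rough model of round-robin directory-to-AG placement on a virgin
  # filesystem: directory N lands in AG ((N-1) mod 16) + 1, so dirs
  # 1-16 fill AG1..AG16 and dir 17 wraps back around to AG1.
  NUM_AGS = 16

  def ag_for_dir(creation_index):
      return (creation_index - 1) % NUM_AGS + 1

  assert ag_for_dir(1) == 1 and ag_for_dir(16) == 16 and ag_for_dir(17) == 1

  # ~50 concurrent read streams spread evenly over 16 AGs:
  # divmod(50, 16) = (3, 2), i.e. 3 streams on 14 AGs, 4 on the other 2.
  base, extra = divmod(50, NUM_AGS)
  print(f"{NUM_AGS - extra} AGs see {base} streams, {extra} AGs see {base + 1}")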
Imagine you have a 5000-seat company. You'd mount this XFS filesystem at
/home. Each user home directory created would fall into the next AG in
sequence, resulting in about 312 user dirs per AG. In this type of
environment XFS AG parallelism will work marvelously, as you'll achieve
fairly balanced IO across all AGs and thus all 16 arrays.
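If you wanted to seed that layout, a minimal sketch (hypothetical paths
and usernames) would be to create all of the home directories in one
pass on the freshly made filesystem, letting the allocator rotate
through the AGs:

  # Hedged sketch: pre-create 5000 home dirs on a fresh XFS filesystem
  # mounted at /home; consecutive creations rotate through the 16 AGs,
  # giving roughly 312-313 user dirs per AG.
  import os

  for i in range(5000):
      os.makedirs(f"/home/user{i:04d}", exist_ok=True)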
In the case where you have many clients reading files from only one
directory, hence the same AG, IO parallelism is limited to the 12
spindles of that one array. When this happens, we end up with a highly
random workload at the disk heads, resulting in high seek rates and low
throughput. This is one of the reasons I built some "excess" capacity
into the disk subsystem. Using XFS AGs for parallelism doesn't
guarantee even distribution of IO across all 192 spindles of the 16
arrays. It gives good parallelism when clients access different files
in different directories concurrently, but not when they all pile into
the same directory.
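As a back-of-envelope illustration of why the single hot directory case
hurts (the per-spindle figures below are my assumptions for
illustration, not numbers from this thread):

  # All reads hitting one 12-spindle array degrade into random IO.
  # Per-spindle figures are assumed values, for illustration only.
  SPINDLES        = 12
  RANDOM_IOPS     = 150      # assumed random read IOPS per spindle
  AVG_READ_KB     = 64       # assumed average read size
  SEQ_MB_PER_DISK = 100      # assumed streaming rate per spindle

  random_mb = SPINDLES * RANDOM_IOPS * AVG_READ_KB / 1024
  seq_mb    = SPINDLES * SEQ_MB_PER_DISK
  print(f"random-ish: ~{random_mb:.0f} MB/s vs sequential: ~{seq_mb} MB/s")

With assumptions like those, the same 12 spindles that could stream on
the order of 1.2 GB/s sequentially deliver only ~110 MB/s once the
workload goes random at the heads.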
> The block allocation is only done when writing. The system at hand was
> specified as a mostly reading system, where such a block allocation
> bottleneck is not so dominant.
This system would excel at massively parallel writes as well, again, as
long as we have many writers into multiple directories concurrently,
which spreads the write load across all AGs, and thus all arrays.
XFS is legendary for parallel write throughput with multiple large
files, thanks to delayed allocation and some other tricks.
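A minimal sketch of that "many writers into multiple directories"
pattern (directory names are made up; the point is only that each
writer streams into its own directory, and therefore its own AG):

  # Hedged sketch: one writer thread per directory, each streaming a
  # large file, so the write load spreads across AGs and thus arrays.
  # XFS's delayed allocation defers block allocation until writeback.
  import os
  import threading

  def writer(dirpath, size_mb=1024):
      os.makedirs(dirpath, exist_ok=True)
      chunk = b"\0" * (1 << 20)              # 1 MiB per write
      with open(os.path.join(dirpath, "stream.dat"), "wb") as f:
          for _ in range(size_mb):
              f.write(chunk)

  threads = [threading.Thread(target=writer, args=(f"/srv/data/dir{n:02d}",))
             for n in range(16)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()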
--
Stan