From: Stan Hoeppner <stan@hardwarefreak.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Roberto Spadim <roberto@spadim.com.br>, Drew <drew.kay@gmail.com>,
	Mdadm <linux-raid@vger.kernel.org>
Subject: Re: high throughput storage server?
Date: Fri, 18 Mar 2011 08:16:26 -0500
Message-ID: <4D835B2A.1000805@hardwarefreak.com>
In-Reply-To: <20110314124733.GA31377@infradead.org>

Christoph Hellwig put forth on 3/14/2011 7:47 AM:
> On Mon, Mar 14, 2011 at 07:27:00AM -0500, Stan Hoeppner wrote:
>> Is this only an issue with multi-chassis cabled NUMA systems such as
>> Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
>> with their relatively low direct node-node bandwidth, or is this also of
>> concern with single chassis systems with much higher
>> node-node bandwidth, such as the AMD Opteron systems, specifically the
>> newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?
> 
> Just do your math.  Buffered I/O will do two memory copies - a
> copy_to_user into the pagecache and DMA from the pagecache to the device
> (yes, that's also a copy as far as the memory subsystem is concerned,
> even if it is access from the device).

The context of this thread was high throughput NFS serving.  If we
wanted to do 10 GB/s kernel NFS serving, would we still only have two
memory copies, since the NFS server runs in kernel, not user, space?
I.e. in addition to the block device DMA read into the page cache, would
we also have a memcopy into application buffers from the page cache, or
does the kernel NFS server simply work with the data directly from the
page cache without an extra memory copy being needed?  If the latter,
adding in the DMA copy to the NIC would yield two total memory copies.
Is this correct?  Or would we have 3 memcopies?

> So to get 10GB/s throughput you spend 20GB/s on memcpys for the actual
> data alone.  Add to that other system activity and metadata.  Whether you
> hit the interconnect or not depends on your memory configuration, I/O
> attachment, and process locality.  If you have all memory that the
> process uses and all I/O on one node you won't hit the interconnect at
> all, but depending on memory placement and storage attachment you might
> hit it twice:
> 
>  - userspace memory on node A to pagecache on node B to device on node
>    C (or A again for that matter).

Not to mention hardware interrupt processing load, which, in addition to
eating some interconnect bandwidth, will also take a toll on CPU cycles
given the number of RAID HBAs and NICs required to read and push 10GB/s
NFS to clients.

Will achieving 10GB/s NFS likely require intricate manual process
placement, along with restricting interrupt processing to cores on the
nodes directly connected to the IO bridge chips, so that interrupt
packets don't consume interconnect bandwidth?
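To make the interrupt placement idea concrete, here is a minimal Python sketch of how one might build the CPU bitmask for a set of cores on one node. It assumes the usual Linux interfaces (a node's cpulist under /sys/devices/system/node/ and the hex mask written to /proc/irq/<n>/smp_affinity); the helper names, the example cpulist, and the IRQ number in the comment are hypothetical, and the code only computes the mask string without writing anything.

```python
def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-5,12-17' into a sorted
    list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def affinity_mask(cpus):
    """Build a hex CPU bitmask as comma-separated 32-bit groups, most
    significant group first, the format used by smp_affinity."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    groups = []
    while True:
        groups.append("%08x" % (value & 0xFFFFFFFF))
        value >>= 32
        if value == 0:
            break
    return ",".join(reversed(groups))


# Hypothetical example: the cores of the node nearest the IO bridge,
# as they might appear in /sys/devices/system/node/node0/cpulist.
node_cpus = parse_cpulist("0-5,12-17")
mask = affinity_mask(node_cpus)
print(mask)
# As root, the mask could then be written to e.g.
# /proc/irq/42/smp_affinity (IRQ 42 is a made-up number).
```

Whether pinning like this is actually needed to reach 10GB/s is exactly the open question; the sketch only shows the mechanics of expressing "these cores only".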

> In short you need to review your configuration pretty carefully.  With
> direct I/O it's a lot easier as you save a copy.

Is O_DIRECT necessary in this scenario, or does the kernel NFS server
negate the need for direct IO since the worker threads execute in kernel
space not user space?  If not, is it possible to force the kernel NFS
server to always do O_DIRECT reads and writes, or is that the
responsibility of the application on the NFS client?

I was under the impression that the memory manager in recent 2.6
kernels, similar to IRIX on Origin, is sufficiently NUMA aware in the
default configuration to automatically take care of memory placement,
keeping all of a given process/thread's memory on the local node, and in
cases where thread memory ends up on another node for some reason, block
copying that memory to the local node and invalidating the remote CPU
caches, or in certain cases, simply moving the thread execution pointer
to a core in the remote node where the memory resides.

WRT the page cache, if the kernel doesn't automatically place page cache
data associated with a given thread in that thread's local node memory,
is it possible to force this?  It's been a while since I read the
cpumemsets and other related documentation, and I don't recall if page
cache memory is manually locatable.  That doesn't ring a bell.
Obviously it would be a big win from an interconnect utilization and
overall performance standpoint if the thread's working memory and page
cache memory were both on the local node.

-- 
Stan
