From: Eric Barton <eeb@sun.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] create performance
Date: Fri, 06 Mar 2009 13:25:49 +0000
Message-ID: <049301c99e5f$10ee3c60$32cab520$@com>
In-Reply-To: <20090305214620.GJ3199@webber.adilger.int>

Moving this discussion to lustre-devel...

OST object placement is a hard problem with conflicting requirements
including...

1. Even server space balance
2. Even server load balance
3. Minimal network congestion
4. Scalable ultra-wide file layout descriptor
5. Scalable placement algorithm

Implementing a placement algorithm with a centralized server clearly
isn't scalable and will have to be reworked for CMD (clustered
metadata).  A starting point might be to explore how to ensure CROW
(create on write) goes some way toward satisfying requirements 1-3
above.

BTW, I've long believed that it's a mistake not to give Lustre any
inkling that all the creates done by an FPP parallel application are
somehow related - e.g. via a cluster-wide job identifier.  Surely
file-per-process placement is very close to shared file placement
(minus extent locking conflicts :)?  I recognize that fixing this
still leaves the problem of how to get best F/S utilization when
different applications share a cluster - but I don't think they are
necessarily the same problem and trying to address them both with the
same solution seems wrong.

    Cheers,
              Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 05 March 2009 9:46 PM
> To: Nathaniel Rutman
> Cc: lustre-tech-leads at sun.com
> Subject: Re: create performance
> 
> On Mar 05, 2009  09:39 -0800, Nathaniel Rutman wrote:
> > Alex Zhuravlev wrote:
> >>>>>>> Nathaniel Rutman (NR) writes:
> >>  NR> What about preallocating objects per client, on the clients?
> >>  NR> Client still needs to get a namespace entry from MDT, but could then
> >>  NR> hold a write layout lock
> >>  NR> and do its own round-robin allocation.  For clients with subtree
> >>  NR> locks this could avoid any need to talk to the MDT and wouldn't need
> >>  NR> the writeback cache.
> >>
> >> I thought "avoid any need to talk to MDT" implies "writeback cache"
> >
> > Hmm, well, maybe you consider this a limited version of writeback cache?
> > It would be kind of a notification of "here is the layout/objects of my
> > new file, with my new fid."  Fid ranges and object numbers would be
> > granted to clients for their own use, and the MDT would only have to do
> > the namespace entry, asynchronously.  I suppose there are recovery
> > issues we have to worry about then.
> >
> > What I was really trying to get at was to avoid the two step process of
> > client -> MDT -> OST stripe allocation, which includes an extra network
> > hop in some precreation starvation cases, and always includes some (a
> > little?) CPU time on the MDT:
> > 1. clients get object grants for every OST.
> > 2. clients assign objects to new files and send in reqs to MDT, which
> > just records the objects in the LOV EA
> > 3. MDT batches up the assigned objects and sends to OSTs for orphan
> > cleanup llog.
> 
> The main problem with having many clients do precreation themselves is
> that this will invariably cause load imbalance on the OSTs, which will
> cause long-term file IO performance problems (much in excess of the
> performance problems hit during precreate).
> 
> Cray recently filed a bug about file read performance being noticeably
> hurt by QOS object allocation due to space imbalance.  The MDS tries to
> balance across OSTs locally, but it uses random numbers to do this and
> so does not select OSTs evenly.
> 
> In a file-per-process checkpoint (say 100 processes/files on 100 OSTs)
> the MDS round-robin will allocate 1 object per OST evenly across all
> OSTs (excluding the case where an OSC is out of preallocated objects).
> If clients are doing the allocation (or in the past when the MDS did
> "random" OST selection) then the chance of all 100 clients allocating
> on 100 distinct OSTs is vanishingly small.  Instead it is likely that
> some OSTs will have no objects used, and some will have 2 or 3 or 4,
> and the aggregate write performance will FOREVER be 50% or 33% or 25%
> of what MDS round-robin allocation would deliver for that set of files.
> That is far worse than waiting 1s for the MDS to allocate the objects.
> 
> IMHO, if we are doing WBC on the client, then there is no _requirement_
> that the client has to allocate objects for the files at all, and any
> write data could just be in the client page cache.  Until the new file
> is visible on the MDS to another client nobody can even try to access
> the data.  Once the WBC cache is flushed to the MDS then objects can
> be allocated by the MDS evenly (granting an exclusive layout lock to the
> client in the process) until the cached client data is either flushed to
> disk or at least protected by extent locks and can be partially flushed
> as needed.
> 
> Note that I don't totally object to WBC clients doing object allocations
> if they are creating a large number of files, in essence becoming an
> MDS that is tracking the load on the OSTs and balancing object creation
> appropriately.  What I object to is the more common case where each
> client is creating a single file for a large FPP checkpoint, and the
> clients all select their OSTs separately.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
