[Lustre-devel] create performance

* [Lustre-devel] create performance
From: Eric Barton @ 2009-03-06 13:25 UTC (permalink / raw)
  To: lustre-devel

Moving this discussion to lustre-devel...

OST object placement is a hard problem with conflicting requirements
including...

1. Even server space balance
2. Even server load balance
3. Minimal network congestion
4. Scalable ultra-wide file layout descriptor
5. Scalable placement algorithm

Implementing a placement algorithm with a centralized server clearly
isn't scalable and will have to be reworked for CMD.  A starting
point might be to explore how to ensure CROW goes some way to
satisfy requirements 1-3 above.

BTW, I've long believed that it's a mistake not to give Lustre any
inkling that all the creates done by a FPP parallel application are
somehow related - e.g. via a cluster-wide job identifier.  Surely
file-per-process placement is very close to shared file placement
(minus extent locking conflicts :)?  I recognize that fixing this
still leaves the problem of how to get best F/S utilization when
different applications share a cluster - but I don't think they are
necessarily the same problem and trying to address them both with the
same solution seems wrong.

    Cheers,
              Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 05 March 2009 9:46 PM
> To: Nathaniel Rutman
> Cc: lustre-tech-leads at sun.com
> Subject: Re: create performance
> 
> On Mar 05, 2009  09:39 -0800, Nathaniel Rutman wrote:
> > Alex Zhuravlav wrote:
> >>>>>>> Nathaniel Rutman (NR) writes:
> >>  NR> What about preallocating objects per client, on the clients?
> >>  NR> Client still needs to get a namespace entry from MDT, but could then
> >>  NR> hold a write layout lock
> >>  NR> and do its own round-robin allocation.  For clients with subtree
> >>  NR> locks this could avoid any need to talk to the MDT and wouldn't need
> >>  NR> the writeback cache.
> >>
> >> I thought "avoid any need to talk to MDT" implies "writeback cache"
> >
> > Hmm, well, maybe you consider this a limited version of writeback cache?
> > It would be kind of a notification of "here is the layout/objects of my
> > new file, with my new fid."  Fid ranges and object numbers would be
> > granted to clients for their own use, and the MDT would only have to do
> > the namespace entry, asynchronously.  I suppose there's recovery issues
> > we have to worry about then.
> >
> > What I was really trying to get at was to avoid the two step process of
> > client -> MDT -> OST stripe allocation, which includes an extra network
> > hop in some precreation starvation cases, and always includes some (a
> > little?) cpu on the MDT:
> > 1. clients get object grants for every OST.
> > 2. clients assign objects to new files and send in reqs to MDT, which
> > just records the objects in the LOV EA
> > 3. MDT batches up the assigned objects and sends to OSTs for orphan
> > cleanup llog.
> 
> The main problem with having many clients do precreation themselves is
> that this will invariably cause load imbalance on the OSTs, which will
> cause long-term file IO performance problems (much in excess of the
> performance problems hit during precreate).
> 
> Cray recently filed a bug on the read performance of files being noticeably
> hurt by QOS object allocation due to space imbalance, even though the MDS
> is trying to balance across OSTs locally, but is using random numbers to
> do this and is not selecting OSTs evenly.
> 
> In a file-per-process checkpoint (say 100 processes/files on 100 OSTs)
> the MDS round-robin will allocate 1 object per OST evenly across all
> OSTs (excluding the case where an OSC is out of preallocated objects).
> If clients are doing the allocation (or in the past when the MDS did
> "random" OST selection) then the chance of all 100 clients allocating
> on 100 OSTs is vanishingly small.  Instead it is likely that some OSTs
> will have no objects used, and some will have 2 or 3 or 4, and the
> aggregate write performance will FOREVER be 50% or 33% or 25% of the
> MDS-round-robin allocated objects for that set of files.  That is far
> worse than waiting 1s for the MDS to allocate the objects.
> 
> IMHO, if we are doing WBC on the client, then there is no _requirement_
> that the client has to allocate objects for the files at all, and any
> write data could just be in the client page cache.  Until the new file
> is visible on the MDS to another client nobody can even try to access
> the data.  Once the WBC cache is flushed to the MDS then objects can
> be allocated by the MDS evenly (granting an exclusive layout lock to the
> client in the process) until the cached client data is either flushed to
> disk or at least protected by extent locks and can be partially flushed
> as needed.
> 
> Note that I don't totally object to WBC clients doing object allocations
> if they are creating a large number of files, in essence becoming an
> MDS that is tracking the load on the OSTs and balancing object creation
> appropriately.  What I object to is the more common case where each
> client is creating a single file for a large FPP checkpoint, and the
> clients all selecting the OSTs separately.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

* [Lustre-devel] create performance
From: 'Andreas Dilger' @ 2009-03-09  6:05 UTC (permalink / raw)
  To: lustre-devel

On Mar 06, 2009  13:25 +0000, Eric Barton wrote:
> OST object placement is a hard problem with conflicting requirements
> including...
> 
> 1. Even server space balance
> 2. Even server load balance
> 3. Minimal network congestion
> 4. Scalable ultra-wide file layout descriptor
> 5. Scalable placement algorithm
> 
> Implementing a placement algorithm with a centralized server clearly
> isn't scalable and will have to be reworked for CMD.  A starting
> point might be to explore how to ensure CROW goes some way to
> satisfy requirements 1-3 above.

While CROW can help avoid latency for precreating objects (which can
avoid some of the object allocation imbalances hit today when OSTs
are slow precreating objects), it doesn't really fundamentally help
to balance space and performance of the OSTs.  With any filesystem
with more than a handful of OSTs there shouldn't be any reason why
the OSTs precreating can't keep up with the MDS create rate.  Johann
and I were discussing this problem and I suspect it is only a defect
in the object precreation code and not a fundamental problem in the
design.

I definitely agree that for CMD we will have distributed object
allocation, but so far it isn't clear whether having more than the
MDSes and/or WBC clients doing the allocation will improve the
situation or make it worse.

> BTW, I've long believed that it's a mistake not to give Lustre any
> inkling that all the creates done by a FPP parallel application are
> somehow related - e.g. via a cluster-wide job identifier.  Surely
> file-per-process placement is very close to shared file placement
> (minus extent locking conflicts :)?

Yes, I agree.  In theory it should be possible to extract this kind
of information from the client processes themselves, either by
examining the process environment (some MPI job launchers store the
MPI rank there for pre-launch shell scripts) or by comparing the
filenames being created by the clients.  Any file-per-process job
will invariably create filenames with the rank in the filename.
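
A minimal sketch of the filename-comparison idea, assuming that names
which differ only in their digit runs belong to the same job.  The
helper names (job_key, group_creates) are hypothetical and not part of
Lustre, and the heuristic would also collapse timestamps or step
counters, so treat it as an illustration only:

```python
import re
from collections import defaultdict

_DIGITS = re.compile(r"\d+")


def job_key(filename):
    """Replace every run of digits with '#' so that names differing
    only by rank (or any other number) map to the same key."""
    return _DIGITS.sub("#", filename)


def group_creates(filenames):
    """Group created filenames by their digit-stripped key."""
    groups = defaultdict(list)
    for name in filenames:
        groups[job_key(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    creates = ["ckpt.0000", "ckpt.0001", "ckpt.0002", "run.log"]
    for key, names in group_creates(creates).items():
        print(key, len(names))
```

A group with many members created close together would be a hint that
the creates come from one file-per-process job.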

> I recognize that fixing this
> still leaves the problem of how to get best F/S utilization when
> different applications share a cluster - but I don't think they are
> necessarily the same problem and trying to address them both with the
> same solution seems wrong.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* [Lustre-devel] create performance
From: Nicolas Williams @ 2009-06-02 19:38 UTC (permalink / raw)
  To: lustre-devel

On Mon, Mar 09, 2009 at 12:05:34AM -0600, 'Andreas Dilger' wrote:
> On Mar 06, 2009  13:25 +0000, Eric Barton wrote:
> > OST object placement is a hard problem with conflicting requirements
> > including...
> > 
> > 1. Even server space balance
> > 2. Even server load balance
> > 3. Minimal network congestion
> > 4. Scalable ultra-wide file layout descriptor
> > 5. Scalable placement algorithm
> > 
> > Implementing a placement algorithm with a centralized server clearly
> > isn't scalable and will have to be reworked for CMD.  A starting
> > point might be to explore how to ensure CROW goes some way to
> > satisfy requirements 1-3 above.

CROW should satisfy #4 easily because it would allow us to have the same
OST-side FID for all stripes of a file, which combined with a
compression of the stripe configuration of the file (the ordered list of
OSTs) should result in fixed-sized FID for all files.  (For compat,
small FIDs can be expanded when talking to old clients.)

CROW should be mostly orthogonal to #1-3 and #5 though, except that a
good compression technique for the stripe configuration might make it
easier to get even server space and load balance.  Imagine an algorithm
that takes a list of OSTs, stripe count and index as inputs and quickly
outputs an ordered list of <stripe-count> OSTs, such that for each index
value you get a pseudo-random permutation of a pseudo-randomly picked
combination of <stripe-count> OSTs.  Then we could monotonically
increment that index as a way to generate the next new file's placement.

For this use an LFSR would be a perfect way to get pseudo-randomness (we
don't need cryptographic strength for this purpose).  The index becomes
a seed for the LFSR.  We might need two indexes, actually, one for the
combination of OSTs and one for the permutation thereof.  With a
pseudo-random distribution of combinations and permutations we ought to
get a fair distribution of data and load.
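
A rough sketch of such a function, under several assumptions: a 16-bit
Galois LFSR with tap mask 0xB400 (a known maximal-length polynomial),
the index mapped to a nonzero seed, and a partial Fisher-Yates shuffle
so that a single index yields both the combination and its order.  The
names lfsr16_stream and place are hypothetical, and the modulo step is
slightly biased, which seems acceptable for illustration:

```python
def lfsr16_stream(seed):
    """Yield 16-bit values from a Galois LFSR (tap mask 0xB400).
    The register is clocked 16 times per value so that successive
    outputs do not share shifted bits.  seed must be in 1..0xFFFF."""
    state = seed
    while True:
        for _ in range(16):
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= 0xB400
        yield state


def place(osts, stripe_count, index):
    """Return an ordered list of stripe_count distinct OSTs that is
    fully determined by (osts, stripe_count, index)."""
    pool = list(osts)
    if not 0 < stripe_count <= len(pool):
        raise ValueError("stripe_count must be in 1..len(osts)")
    rng = lfsr16_stream(index % 0xFFFF + 1)  # never seed with zero
    # Partial Fisher-Yates: fix one position per step.
    for i in range(stripe_count):
        j = i + next(rng) % (len(pool) - i)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:stripe_count]


if __name__ == "__main__":
    for index in range(3):
        print(index, place(range(100), 4, index))
```

Because the output depends only on the inputs, a layout could in
principle be stored as (stripe count, index) instead of an explicit
OST list.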

> While CROW can help avoid latency for precreating objects (which can
> avoid some of the object allocation imbalances hit today when OSTs
> are slow precreating objects), it doesn't really fundamentally help
> to balance space and performance of the OSTs.  With any filesystem
> with more than a handful of OSTs there shouldn't be any reason why
> the OSTs precreating can't keep up with the MDS create rate.  Johann
> and I were discussing this problem and I suspect it is only a defect
> in the object precreation code and not a fundamental problem in the
> design.
> 
> I definitely agree that for CMD we will have distributed object
> allocation, but so far it isn't clear whether having more than the
> MDSes and/or WBC clients doing the allocation will improve the
> situation or make it worse.

We really should use CROW for these reasons:

 - CROW enables fixed sized FIDs no matter how large the stripe count
 - no need to go destroy unused pre-created files on MGS reboot

> > BTW, I've long believed that it's a mistake not to give Lustre any
> > inkling that all the creates done by a FPP parallel application are
> > somehow related - e.g. via a cluster-wide job identifier.  Surely
> > file-per-process placement is very close to shared file placement
> > (minus extent locking conflicts :)?
> 
> Yes, I agree.  In theory it should be possible to extract this kind
> of information from the client processes themselves, either by
> examining the process environment (some MPI job launchers store the
> MPI rank there for pre-launch shell scripts) or by comparing the
> filenames being created by the clients.  Any file-per-process job
> will invariably create filenames with the rank in the filename.

Sounds like a good idea, and configurable via regexes (ick, I know).

Even better would be a way to associate a cluster job ID with a set of
processes.  This could be done via Linux keyrings, say.

Nico
-- 

* [Lustre-devel] create performance
From: 'Andreas Dilger' @ 2009-06-03  9:50 UTC (permalink / raw)
  To: lustre-devel

On Jun 02, 2009  14:38 -0500, Nicolas Williams wrote:
> On Mon, Mar 09, 2009 at 12:05:34AM -0600, 'Andreas Dilger' wrote:
> > On Mar 06, 2009  13:25 +0000, Eric Barton wrote:
> > > OST object placement is a hard problem with conflicting requirements
> > > including...
> > > 
> > > 1. Even server space balance
> > > 2. Even server load balance
> > > 3. Minimal network congestion
> > > 4. Scalable ultra-wide file layout descriptor
> > > 5. Scalable placement algorithm
> > > 
> > > Implementing a placement algorithm with a centralized server clearly
> > > isn't scalable and will have to be reworked for CMD.  A starting
> > > point might be to explore how to ensure CROW goes some way to
> > > satisfy requirements 1-3 above.
> 
> CROW should satisfy #4 easily because it would allow us to have the same
> OST-side FID for all stripes of a file, which combined with a
> compression of the stripe configuration of the file (the ordered list of
> OSTs) should result in fixed-sized FID for all files.  (For compat,
> small FIDs can be expanded when talking to old clients.)

CROW itself isn't required for wide striping.  It is possible to allocate
FID sequences to OSTs in a manner that will allow widely striped files
to be specified in a compact manner.

The main problem with widely-striped files is that they add overhead to
file IO operations, because the client might potentially have to get
hundreds or thousands of locks per file.

> CROW should be mostly orthogonal to #1-3 and #5 though, except that a
> good compression technique for the stripe configuration might make it
> easier to get even server space and load balance.  Imagine an algorithm
> that takes a list of OSTs, stripe count and index as inputs and quickly
> outputs an ordered list of <stripe-count> OSTs, such that for each index
> value you get a pseudo-random permutation of a pseudo-randomly picked
> combination of <stripe-count> OSTs.  Then we could monotonically
> increment that index as a way to generate the next new file's placement.
> 
> For this use an LFSR would be a perfect way to get pseudo-randomness (we
> don't need cryptographic strength for this purpose).  The index becomes
> a seed for the LFSR.  We might need two indexes, actually, one for the
> combination of OSTs and one for the permutation thereof.  With a
> pseudo-random distribution of combinations and permutations we ought to
> get a fair distribution of data and load.

In our previous testing, any kind of random OST selection is sub-optimal
compared to round robin.  The problem is that RNG/PRNG OST selection,
while uniform on average, is definitely non-uniform locally, and this
results in non-uniform OST selection and clients competing for OSS/OST
resources.

For example, if 100 MPI clients are creating 100 files on 100 OSTs, then
on average there would be 1 file/OST, but typically some OSTs will have 2
or 3 files, while others are idle.  This will result in IO being 2-3x
slower on those OSTs, and often result in the entire IO being 2-3x slower.
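
The effect is easy to reproduce with a small simulation.  This is only
a sketch, assuming one single-stripe file per client and independent
uniform choices for the random case; the function names are
hypothetical.  For uniform random placement the expected number of
idle OSTs is n * (1 - 1/n)^n, which is roughly 37 for n = 100:

```python
import random
from collections import Counter


def round_robin(num_files, num_osts, start=0):
    """Assign file i to OST (start + i) mod num_osts."""
    return [(start + i) % num_osts for i in range(num_files)]


def random_choice(num_files, num_osts, rng):
    """Each file independently picks a uniformly random OST."""
    return [rng.randrange(num_osts) for _ in range(num_files)]


def summarize(placement, num_osts):
    """Return (number of idle OSTs, most files on any single OST)."""
    load = Counter(placement)
    return num_osts - len(load), max(load.values())


if __name__ == "__main__":
    rng = random.Random(2009)
    print("round-robin (idle, max):", summarize(round_robin(100, 100), 100))
    for _ in range(5):
        print("random      (idle, max):",
              summarize(random_choice(100, 100, rng), 100))
```

Since the job finishes only when its slowest file does, the maximum
files-per-OST figure is the one that bounds aggregate throughput.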

While we do something similar to this for the case of unbalanced OSTs,
we want to move to a round-robin scheme even in the case of unbalanced
OSTs.  This would use a "freespace accumulator" similar to a Bresenham
line algorithm, so that OSTs which are below the average freespace will
be skipped until their "accumulated freespace" is temporarily above average.
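
One way such an accumulator might look is sketched below.  This is not
the Lustre implementation: it assumes a static free-space snapshot and
uses the largest free-space value as the threshold (rather than the
average) so that no OST is picked more than once per pass.  The name
make_allocator is hypothetical:

```python
from collections import Counter
from itertools import islice


def make_allocator(free_space):
    """Yield OST indices in round-robin order, skipping an OST on any
    pass where its accumulated free space is still below the threshold.
    Integer arithmetic only, as in a Bresenham error term."""
    threshold = max(free_space)
    if threshold <= 0:
        raise ValueError("at least one OST needs free space")
    acc = [0] * len(free_space)
    while True:
        for ost, free in enumerate(free_space):
            acc[ost] += free
            if acc[ost] >= threshold:
                acc[ost] -= threshold
                yield ost


if __name__ == "__main__":
    # OSTs 0 and 1 are emptiest; OST 2 has half and OST 3 a quarter
    # of their free space.
    picks = list(islice(make_allocator([100, 100, 50, 25]), 11))
    print(picks)
    print(Counter(picks))
```

With equal free space this degenerates to plain round-robin; otherwise
each OST is chosen in proportion to its free space while the visiting
order stays deterministic.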

> > > BTW, I've long believed that it's a mistake not to give Lustre any
> > > inkling that all the creates done by a FPP parallel application are
> > > somehow related - e.g. via a cluster-wide job identifier.  Surely
> > > file-per-process placement is very close to shared file placement
> > > (minus extent locking conflicts :)?
> > 
> > Yes, I agree.  In theory it should be possible to extract this kind
> > of information from the client processes themselves, either by
> > examining the process environment (some MPI job launchers store the
> > MPI rank there for pre-launch shell scripts) or by comparing the
> > filenames being created by the clients.  Any file-per-process job
> > will invariably create filenames with the rank in the filename.
> 
> Sounds like a good idea, and configurable via regexes (ick, I know).
> 
> Even better would be a way to associate a cluster job ID with a set of
> processes.  This could be done via Linux keyrings, say.

It is probably easiest to start with MPI-IO ADIO ioctls sent directly to
Lustre.  Once we know it helps we can look at other mechanisms to get
this information from applications that don't use MPI-IO.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
