From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Barton
Date: Fri, 06 Mar 2009 13:25:49 +0000
Subject: [Lustre-devel] create performance
In-Reply-To: <20090305214620.GJ3199@webber.adilger.int>
References: <49AEE953.2030401@sun.com> <49B00E4D.2080806@sun.com> <20090305214620.GJ3199@webber.adilger.int>
Message-ID: <049301c99e5f$10ee3c60$32cab520$@com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lustre-devel@lists.lustre.org

Moving this discussion to lustre-devel...

OST object placement is a hard problem with conflicting requirements
including...

1. Even server space balance
2. Even server load balance
3. Minimal network congestion
4. Scalable ultra-wide file layout descriptor
5. Scalable placement algorithm

Implementing a placement algorithm with a centralized server clearly
isn't scalable and will have to be reworked for CMD. A starting point
might be to explore how to ensure CROW goes some way to satisfy
requirements 1-3 above.

BTW, I've long believed that it's a mistake not to give Lustre any
inkling that all the creates done by a FPP parallel application are
somehow related - e.g. via a cluster-wide job identifier. Surely
file-per-process placement is very close to shared file placement
(minus extent locking conflicts :)?

I recognize that fixing this still leaves the problem of how to get
best F/S utilization when different applications share a cluster - but
I don't think they are necessarily the same problem and trying to
address them both with the same solution seems wrong.

Cheers,
Eric

> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> Sent: 05 March 2009 9:46 PM
> To: Nathaniel Rutman
> Cc: lustre-tech-leads at sun.com
> Subject: Re: create performance
>
> On Mar 05, 2009 09:39 -0800, Nathaniel Rutman wrote:
> > Alex Zhuravlav wrote:
> >>>>>>> Nathaniel Rutman (NR) writes:
> >> NR> What about preallocating objects per client, on the clients?
> >> NR> Client still needs to get a namespace entry from MDT, but could then
> >> NR> hold a write layout lock
> >> NR> and do its own round-robin allocation. For clients with subtree
> >> NR> locks this could avoid any need to talk to the MDT and wouldn't need
> >> NR> the writeback cache.
> >>
> >> I thought "avoid any need to talk to MDT" implies "writeback cache"
> >
> > Hmm, well, maybe you consider this a limited version of writeback cache?
> > It would be kind of a notification of "here is the layout/objects of my
> > new file, with my new fid." Fid ranges and object numbers would be
> > granted to clients for their own use, and the MDT would only have to do
> > the namespace entry, asynchronously. I suppose there's recovery issues
> > we have to worry about then.
> >
> > What I was really trying to get at was to avoid the two step process of
> > client -> MDT -> OST stripe allocation, which includes an extra network
> > hop in some precreation starvation cases, and always includes some (a
> > little?) cpu on the MDT:
> > 1. clients get object grants for every OST.
> > 2. clients assign objects to new files and send in reqs to MDT, which
> > just records the objects in the LOV EA
> > 3. MDT batches up the assigned objects and sends to OSTs for orphan
> > cleanup llog.
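
The three steps proposed above can be sketched as a toy model. This is
only an illustration of the message flow, not Lustre code: the class
names Client and MDT, the methods assign, record and flush_batches, and
the string fids are all hypothetical, and recovery and grant exhaustion
are deliberately ignored.

```python
from collections import defaultdict
from itertools import cycle


class Client:
    """Hypothetical client holding object grants for every OST (step 1)."""

    def __init__(self, grants):
        # grants maps an OST index to the object ids granted on that OST.
        self.grants = {ost: list(ids) for ost, ids in grants.items()}
        self._next_ost = cycle(sorted(self.grants))

    def assign(self, stripe_count):
        """Step 2, client side: pick objects round-robin from local grants."""
        layout = []
        for _ in range(stripe_count):
            ost = next(self._next_ost)
            if not self.grants[ost]:
                raise RuntimeError(f"grant exhausted on OST {ost}")
            layout.append((ost, self.grants[ost].pop(0)))
        return layout


class MDT:
    """Hypothetical MDT that only records layouts and batches cleanup records."""

    def __init__(self):
        self.lov_ea = {}                   # fid -> list of (ost, object id)
        self.pending = defaultdict(list)   # ost -> object ids not yet sent

    def record(self, fid, layout):
        """Step 2, MDT side: store the client-chosen objects for the file."""
        self.lov_ea[fid] = layout
        for ost, obj in layout:
            self.pending[ost].append(obj)

    def flush_batches(self):
        """Step 3: hand one batch per OST to the orphan-cleanup log."""
        batches = dict(self.pending)
        self.pending = defaultdict(list)
        return batches


if __name__ == "__main__":
    client = Client({0: [10, 11], 1: [20, 21]})
    mdt = MDT()
    mdt.record("fid-1", client.assign(stripe_count=2))
    mdt.record("fid-2", client.assign(stripe_count=1))
    print(mdt.flush_batches())
```

Note that each client here balances only its own creates across OSTs;
nothing in the model coordinates placement between clients.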
>
> The main problem with having many clients do precreation themselves is
> that this will invariably cause load imbalance on the OSTs, which will
> cause long-term file IO performance problems (much in excess of the
> performance problems hit during precreate).
>
> Cray recently filed a bug on the read performance of files being noticeably
> hurt by QOS object allocation due to space imbalance, even though the MDS
> is trying to balance across OSTs locally, but is using random numbers to
> do this and is not selecting OSTs evenly.
>
> In a file-per-process checkpoint (say 100 processes/files on 100 OSTs)
> the MDS round-robin will allocate 1 object per OST evenly across all
> OSTs (excluding the case where an OSC is out of preallocated objects).
> If clients are doing the allocation (or in the past when the MDS did
> "random" OST selection) then the chance of all 100 clients allocating
> on 100 OSTs is vanishingly small. Instead it is likely that some OSTs
> will have no objects used, and some will have 2 or 3 or 4, and the
> aggregate write performance will FOREVER be 50% or 33% or 25% of the
> MDS-round-robin allocated objects for that set of files. That is far
> worse than waiting 1s for the MDS to allocate the objects.
>
> IMHO, if we are doing WBC on the client, then there is no _requirement_
> that the client has to allocate objects for the files at all, and any
> write data could just be in the client page cache. Until the new file
> is visible on the MDS to another client nobody can even try to access
> the data. Once the WBC cache is flushed to the MDS then objects can
> be allocated by the MDS evenly (granting an exclusive layout lock to the
> client in the process) until the cached client data is either flushed to
> disk or at least protected by extent locks and can be partially flushed
> as needed.
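
The 100-files-on-100-OSTs argument above can be checked with a small
simulation. This is a sketch under simplifying assumptions that are not
from the thread: every client picks one OST uniformly at random and
independently, an OST divides its bandwidth equally among its objects,
and the checkpoint finishes when the slowest file does, so relative
throughput is taken as 1 / (largest number of objects on any OST). All
function names are made up for the illustration. Under the uniform
assumption the expected number of unused OSTs is N * (1 - 1/N)^N, which
is about 37 for N = 100.

```python
import random
from collections import Counter


def round_robin(num_files, num_osts):
    """MDS-style round-robin: file i lands on OST i mod num_osts."""
    return [i % num_osts for i in range(num_files)]


def independent_random(num_files, num_osts, rng):
    """Each client picks an OST on its own, uniformly at random."""
    return [rng.randrange(num_osts) for _ in range(num_files)]


def summarize(placement, num_osts):
    """Return (max objects on one OST, number of OSTs with no object)."""
    counts = Counter(placement)
    return max(counts.values()), num_osts - len(counts)


def relative_throughput(max_load):
    """Assumed model: the most loaded OST gates the whole checkpoint."""
    return 1.0 / max_load


def expected_idle(num_files, num_osts):
    """Expected number of OSTs that receive no object under uniform choice."""
    return num_osts * (1.0 - 1.0 / num_osts) ** num_files


if __name__ == "__main__":
    n = 100
    rng = random.Random(2009)  # fixed seed so runs are repeatable
    for name, placement in (
        ("round-robin", round_robin(n, n)),
        ("random", independent_random(n, n, rng)),
    ):
        max_load, idle = summarize(placement, n)
        print(f"{name}: max objects/OST={max_load}, idle OSTs={idle}, "
              f"relative throughput={relative_throughput(max_load):.0%}")
    print(f"expected idle OSTs (random): {expected_idle(n, n):.1f}")
```

A maximum load of 2, 3 or 4 objects on one OST corresponds to the 50%,
33% and 25% figures quoted above.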
>
> Note that I don't totally object to WBC clients doing object allocations
> if they are creating a large number of files, in essence becoming an
> MDS that is tracking the load on the OSTs and balancing object creation
> appropriately. What I object to is the more common case where each
> client is creating a single file for a large FPP checkpoint, and the
> clients all selecting the OSTs separately.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.