From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nicolas Williams
Date: Tue, 2 Jun 2009 14:38:43 -0500
Subject: [Lustre-devel] create performance
In-Reply-To: <20090309060534.GK3199@webber.adilger.int>
References: <49AEE953.2030401@sun.com> <49B00E4D.2080806@sun.com>
 <20090305214620.GJ3199@webber.adilger.int>
 <049301c99e5f$10ee3c60$32cab520$@com>
 <20090309060534.GK3199@webber.adilger.int>
Message-ID: <20090602193838.GA18161@Sun.COM>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lustre-devel@lists.lustre.org

On Mon, Mar 09, 2009 at 12:05:34AM -0600, 'Andreas Dilger' wrote:
> On Mar 06, 2009 13:25 +0000, Eric Barton wrote:
> > OST object placement is a hard problem with conflicting requirements
> > including...
> >
> > 1. Even server space balance
> > 2. Even server load balance
> > 3. Minimal network congestion
> > 4. Scalable ultra-wide file layout descriptor
> > 5. Scalable placement algorithm
> >
> > Implementing a placement algorithm with a centralized server clearly
> > isn't scalable and will have to be reworked for CMD.  A starting
> > point might be to explore how to ensure CROW goes some way to
> > satisfy requirements 1-3 above.

CROW should satisfy #4 easily because it would allow us to have the same
OST-side FID for all stripes of a file, which, combined with a
compression of the stripe configuration of the file (the ordered list of
OSTs), should result in fixed-size FIDs for all files.  (For compat,
small FIDs can be expanded when talking to old clients.)

CROW should be mostly orthogonal to #1-3 and #5 though, except that a
good compression technique for the stripe configuration might make it
easier to get even server space and load balance.

Imagine an algorithm that takes a list of OSTs, stripe count and index
as inputs and quickly outputs an ordered list of OSTs, such that for
each index value you get a pseudo-random permutation of a
pseudo-randomly picked combination of OSTs.
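
A rough sketch of such an algorithm in Python, to make the idea
concrete, with a simple LFSR as the pseudo-random source.  Everything
here is an assumption rather than Lustre code: the names place(),
lfsr_step() and lfsr_next(), the 32-bit Galois register and its tap
mask, the warm-up clocking, and the use of a partial Fisher-Yates
shuffle to get both the combination and its order from a single index.

```python
LFSR_TAPS = 0x80200003  # assumed 32-bit Galois tap mask, not from Lustre
MASK32 = 0xFFFFFFFF


def lfsr_step(state):
    """Advance a 32-bit Galois LFSR by one bit."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= LFSR_TAPS
    return state


def lfsr_next(state, bits=16):
    """Clock the register `bits` times; return (new_state, low `bits` bits)."""
    for _ in range(bits):
        state = lfsr_step(state)
    return state, state & ((1 << bits) - 1)


def place(osts, stripe_count, index):
    """Return an ordered list of `stripe_count` distinct OSTs for `index`.

    The same (osts, stripe_count, index) always yields the same layout,
    so the layout can be recomputed from the index instead of stored.
    """
    if not 0 < stripe_count <= len(osts):
        raise ValueError("stripe_count must be between 1 and len(osts)")

    # The seed must be non-zero, or the register stays at zero forever.
    state = ((index + 1) & MASK32) or 1
    # Warm-up clocks so neighbouring indexes diverge (an assumption).
    for _ in range(64):
        state = lfsr_step(state)

    # Partial Fisher-Yates shuffle: the first stripe_count slots end up
    # holding a pseudo-randomly chosen, pseudo-randomly ordered subset.
    pool = list(osts)
    n = len(pool)
    for i in range(stripe_count):
        state, r = lfsr_next(state)
        j = i + r % (n - i)  # slight modulo bias, ignored in this sketch
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:stripe_count]


if __name__ == "__main__":
    osts = ["OST%04x" % i for i in range(16)]
    for index in range(3):
        print(index, place(osts, 4, index))
```

A two-index variant would seed one register for picking the combination
and a second one for ordering it.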
Then we could monotonically increment that index as a way to generate
the next new file's placement.  For this use, an LFSR would be a perfect
way to get pseudo-randomness (we don't need cryptographic strength for
this purpose).  The index becomes a seed for the LFSR.  We might need
two indexes, actually, one for the combination of OSTs and one for the
permutation thereof.

With a pseudo-random distribution of combinations and permutations we
ought to get a fair distribution of data and load.

> While CROW can help avoid latency for precreating objects (which can
> avoid some of the object allocation imbalances hit today when OSTs
> are slow precreating objects), it doesn't really fundamentally help
> to balance space and performance of the OSTs.  With any filesystem
> with more than a handful of OSTs there shouldn't be any reason why
> the OSTs precreating can't keep up with the MDS create rate.  Johann
> and I were discussing this problem and I suspect it is only a defect
> in the object precreation code and not a fundamental problem in the
> design.
>
> I definitely agree that for CMD we will have distributed object
> allocation, but so far it isn't clear whether having more than the
> MDSes and/or WBC clients doing the allocation will improve the
> situation or make it worse.

We really should use CROW for these reasons:

 - CROW enables fixed-size FIDs no matter how large the stripe count
 - no need to go destroy unused pre-created files on MGS reboot

> > BTW, I've long believed that it's a mistake not to give Lustre any
> > inkling that all the creates done by a FPP parallel application are
> > somehow related - e.g. via a cluster-wide job identifier.  Surely
> > file-per-process placement is very close to shared file placement
> > (minus extent locking conflicts :)?
>
> Yes, I agree.
> In theory it should be possible to extract this
> kind of information from the client processes themselves, either by
> examining the process environment (some MPI job launchers store the
> MPI rank there for pre-launch shell scripts) or by comparing the
> filenames being created by the clients.  Any file-per-process job
> will invariably create filenames with the rank in the filename.

Sounds like a good idea, and it could be made configurable via regexes
(ick, I know).  Even better would be a way to associate a cluster job ID
with a set of processes.  This could be done via Linux keyrings, say.

Nico
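
To illustrate the filename comparison with a configurable regex, here is
a minimal Python sketch.  The names RANK_PATTERN, job_key() and
group_creates() are hypothetical, and treating every run of digits as
the rank is only an assumed default; a real pattern would need tuning so
that timestep or version numbers are not mistaken for ranks.

```python
import re
from collections import defaultdict

# Hypothetical default: any run of digits is treated as the rank.
RANK_PATTERN = re.compile(r"\d+")


def job_key(filename, pattern=RANK_PATTERN):
    """Replace the rank-like parts of a name with a placeholder.

    Names that differ only in rank then map to the same key.
    """
    return pattern.sub("#", filename)


def group_creates(filenames, pattern=RANK_PATTERN):
    """Group created filenames by their rank-stripped key."""
    groups = defaultdict(list)
    for name in filenames:
        groups[job_key(name, pattern)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = ["ckpt.0000.dat", "ckpt.0001.dat", "ckpt.0002.dat", "run.log"]
    for key, members in group_creates(names).items():
        print(key, members)
```

Creates that land in the same group could then be treated as belonging
to one job when choosing placement.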