Message-ID: <47BA2AFD.2060409@tlinx.org>
Date: Mon, 18 Feb 2008 17:03:57 -0800
From: Linda Walsh
Subject: Re: tuning, many small files, small blocksize
References: <47BA10EC.3090004@tlinx.org> <20080218235103.GW155407@sgi.com>
In-Reply-To: <20080218235103.GW155407@sgi.com>
List-Id: xfs
To: David Chinner
Cc: Jeff Breidenbach, xfs@oss.sgi.com

David Chinner wrote:
> That makes no sense. Inodes are *unique* - they are not shared with
> any other inode at all. Could you explain why you think that 256
> byte inodes are any different to larger inodes in this respect?
---
Sorry to be unclear, but it seems to me that if the minimum physical
block size on disk is 512 bytes, then either a 256-byte inode will
share that block with another inode, or 256 bytes are wasted on each
inode. The latter interpretation doesn't make logical sense. If the
minimum physical I/O size is larger than 512 bytes, then I would
assume even more *unique* inodes could be packed in per block.

>
>> Remember, in xfs, if the last bit of left-over data in an inode will fit
>> into the inode, it can save a block-allocation, though I don't know
>> how this will affect speed.
>
> No, that's wrong. We never put data in inodes.
---
You mean file data, no? Doesn't directory and link data get packed in?
It has always gnawed at me why packing small bits of data into inodes
is disallowed for file data but not for other types of data. How about
extended attribute data?
Is it always allocated in separate data blocks as well, or can it be
put into an inode if it fits? Why not include file data as a type of
data that could be packed into an inode? I'm sure there's a good
reason, but it seems other types of file system data can be packed
into inodes -- just not file data...or am I really misinformed? :-)

>
>> Space-wise, a 2k block size and 1k-inode size might be good, but don't
>> know how that would affect performance.
>
> Inode size vs block size is pretty much irrelevant w.r.t performance,
> except for the fact inode size can't be larger than the block size.
----
If you have a small directory, can't it be stored in the inode?
Wouldn't that save some bit (or block) of I/O?

>> I'm sure you are familiar with mount options noatime,nodiratime -- same
>> concepts, but dir's are split out.
>
> noatime implies nodiratime.
----
Well dang...thanks! Ever since the nodiratime option came out, I
thought I had to specify it in addition. Now my fstabs can be shorter!

>> Also, it depends on the situation, but sometimes flattening out the
>> directory structure can speed up lookup time.
>
> Like using large directory block sizes to make large directory
> btrees wider and flatter and therefore use less seeks for any given
> random directory lookup? ;)
---
Are you saying that directory entries are stored in sorted order in a
B-tree? Hmmm... Well, I did say it depended on the situation -- you
are right that time lost to seeks might overshadow time lost to the
number of blocks read in, and I'd think it might depend on how the
directories are laid out on disk. In benchmarks, though, I've noticed
larger slowdowns when using more files per directory than when
distributing the same number of files among more directories. It could
have been something about my test setup, and I did not test with
varying directory block sizes -- either I overlooked the naming
option's size parameter, or I was limited to version=1 for some reason
(I don't remember when version=2 was added...).
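For the archive, here is a rough sketch of the knobs discussed in this
thread. The device name is hypothetical and the flag spellings should
be checked against your mkfs.xfs(8); only the arithmetic lines actually
run -- this is a sketch, not a tested recipe:

```shell
# Packing arithmetic from earlier in the thread: 256-byte inodes in
# 512-byte sectors and 4 KiB filesystem blocks (illustrative numbers).
echo "inodes per 512B sector: $((512 / 256))"
echo "inodes per 4 KiB block: $((4096 / 256))"

# Hypothetical mkfs invocation: -i size= sets the inode size, -n size=
# the directory block size Dave refers to (wider, flatter directory
# btrees), and -n version=2 the newer directory format mentioned above.
# mkfs.xfs -i size=512 -n size=16384,version=2 /dev/sdX1

# And since noatime implies nodiratime, one fstab option suffices:
# /dev/sdX1  /data  xfs  noatime  0 2
```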
>
>> Sometime back someone did some benchmarks involving log size and it seemed
>> that 32768b (4k) or ~128MB seemed optimal, if memory serves me correctly.
>
> 128MB is the maximum size currently.
---
Maybe that's why it's optimal? :-)

Thanks for the corrections...I appreciate it!
-l
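P.S. A quick check that the figures quoted above agree with each other
(just shell arithmetic; the mkfs flag in the comment is hypothetical,
shown only as the way one would request that log size):

```shell
# 32768 log blocks * 4096 bytes per block, expressed in MiB:
echo $((32768 * 4096 / 1024 / 1024))   # 128, the stated current maximum
# e.g. (hypothetical device): mkfs.xfs -l size=128m /dev/sdX1
```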