From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ric Wheeler
Subject: Re: large fs testing
Date: Tue, 26 May 2009 18:17:21 -0400
Message-ID: <4A1C6A71.7010300@redhat.com>
References: <4A17FFD8.80401@redhat.com> <5971.1243359565@gamaville.dokosmarshall.org> <4A1C2B40.30102@redhat.com> <20090526212132.GE3218@webber.adilger.int>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: nicholas.dokos@hp.com, linux-fsdevel@vger.kernel.org, Christoph Hellwig , Douglas Shakshober , Joshua Giles , Valerie Aurora , Eric Sandeen , Steven Whitehouse , Edward Shishkin , Josef Bacik , Jeff Moyer , Chris Mason , "Whitney, Eric" , Theodore Tso
To: Andreas Dilger
Return-path:
Received: from mx2.redhat.com ([66.187.237.31]:39292 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752563AbZEZWSo (ORCPT ); Tue, 26 May 2009 18:18:44 -0400
In-Reply-To: <20090526212132.GE3218@webber.adilger.int>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 05/26/2009 05:21 PM, Andreas Dilger wrote:
> On May 26, 2009 13:47 -0400, Ric Wheeler wrote:
>> These runs were without lazy init, so I would expect to be a little more
>> than twice as slow as your second run (not the three times I saw)
>> assuming that it scales linearly.
>
> Making lazy_itable_init the default formatting option for ext4 is/was
> dependent upon the kernel doing the zeroing of the inode table blocks
> at first mount time. I'm not sure if that was implemented yet.
>
>> This run was with limited DRAM on the
>> box (6GB) and only a single HBA, but I am afraid that I did not get any
>> good insight into what was the bottleneck during my runs.
>
> For a very large array (80TB) this could be 1TB or more of inode tables
> that are being zeroed out at format time. After 64TB the default mke2fs
> options will cap out at 4B inodes in the filesystem. 1TB/90min ~= 200MB/s
> so this is probably your bottleneck.
>
>> Do you have any access to even larger storage, say the mythical 100TB :-)
>> ? Any insight on interesting workloads?
>
> I would definitely be most interested in e2fsck performance at this scale
> (RAM usage and elapsed time) because this will in the end be the defining
> limit on how large a usable filesystem can actually be in practise.
>
> Cheers, Andreas

Not sure why, but the box rebooted (crashed?) a couple of hours into the run (no hints in the logs pointed at anything suspicious).

What I did get was the following from the fsck run:

root@l82bi250:/home/redhatYou have new mail in /var/spool/mail/root
[root@l82bi250 redhat]# time /sbin/fsck.ext4 -tt -y /dev/mapper/Big_boy-Big_boy
e2fsck 1.41.4 (27-Jan-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 1596k/1177752k (1447k/150k), time: 1184.73/514.16/344.38
Pass 1: I/O read: 50655MB, write: 0MB, rate: 42.76MB/s
Pass 2: Checking directory structure
Entry '4a1590dc~~~~~~~~O4A0SMJ1VC34YQ1PD3B5DL9Q' in /da (188378) references inode 196988 in group 30 where _INODE_UNINIT is set.
Fix? yes

Restarting e2fsck from the beginning...
Group descriptor 15 checksum is invalid. Fix? yes

Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 120396k/-1389015k (120134k/263k), time: 1134.71/522.48/323.65
Pass 1: I/O read: 50656MB, write: 0MB, rate: 44.64MB/s
Pass 2: Checking directory structure
Entry '4a15910c~~~~~~~~H8099TRM701Q29CSTCWBVIHJ' in /0b (404925) references inode 413100 in group 62 where _INODE_UNINIT is set.
Fix? yes

Restarting e2fsck from the beginning...
Group descriptor 31 checksum is invalid. Fix? yes

Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 231360k/246272k (231083k/278k), time: 1140.48/521.00/334.74
Pass 1: I/O read: 50658MB, write: 0MB, rate: 44.42MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 231360k/1290436k (231083k/278k), time: 538.22/264.56/83.49
Pass 2: I/O read: 13749MB, write: 0MB, rate: 25.55MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 231360k/1789000k (231083k/278k), time: 4221.57/1947.37/1116.21
Pass 3A: Memory used: 231360k/1789000k (231083k/278k), time: 0.00/ 0.00/ 0.00
Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 3: Memory used: 231360k/1290436k (231083k/278k), time: 9.99/ 0.26/ 1.37
Pass 3: I/O read: 1MB, write: 0MB, rate: 0.10MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 231360k/-1481575k (231082k/279k), time: 147.16/139.87/ 1.94
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Inode bitmap differences: -(98404--98405)

Note that it got truncated in Pass 5 - just after writing out some values that look like they sign wrapped?

-(103650--103655) -(103659--103660) -103663 -103665 -103667 -(103669--103670) -(103673--103676) -103679 -103684 -103687 -10

ric
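
For anyone reading the negative "Memory used" figures above (-1389015k in the second Pass 1, -1481575k in Pass 4), here is a minimal sketch of how to recover a plausible real value. It assumes the counter is a signed 32-bit byte count that wrapped exactly once and was divided by 1024 before printing; that is an assumption about how the numbers were produced, not something confirmed from the e2fsck source. The function name `unwrap_kbytes` is hypothetical.

```python
def unwrap_kbytes(printed_k: int, bits: int = 32) -> int:
    """Undo a single signed wrap on a value printed in kilobytes.

    Assumption: the underlying counter was a signed `bits`-wide byte
    count that overflowed once and was then divided by 1024 for display.
    Rounding in the original division may shift the result by about 1k.
    """
    if printed_k >= 0:
        # Non-negative values are taken at face value.
        return printed_k
    # One full wrap of a `bits`-wide byte counter, expressed in kilobytes.
    wrap_k = (1 << bits) // 1024
    return printed_k + wrap_k


if __name__ == "__main__":
    # The two negative values copied from the fsck output in this message.
    for label, value in (("Pass 1 (second run)", -1389015), ("Pass 4", -1481575)):
        print(f"{label}: {value}k -> about {unwrap_kbytes(value)}k")
```

Under that assumption the two figures correspond to roughly 2805289k and 2712729k, that is, somewhere around 2.6 to 2.7 GiB, which is above the 2 GiB limit of a signed 32-bit byte counter.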