From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id ; Thu, 20 Jun 2002 02:05:36 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id ; Thu, 20 Jun 2002 02:05:35 -0400 Received: from h24-67-14-151.cg.shawcable.net ([24.67.14.151]:7678 "EHLO webber.adilger.int") by vger.kernel.org with ESMTP id ; Thu, 20 Jun 2002 02:05:32 -0400 From: Andreas Dilger Date: Thu, 20 Jun 2002 00:03:38 -0600 To: Dave Hansen Cc: Andrew Morton , mgross@unix-os.sc.intel.com, Linux Kernel Mailing List , lse-tech@lists.sourceforge.net, richard.a.griffiths@intel.com Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large Message-ID: <20020620060337.GJ22427@clusterfs.com> Mail-Followup-To: Dave Hansen , Andrew Morton , mgross@unix-os.sc.intel.com, Linux Kernel Mailing List , lse-tech@lists.sourceforge.net, richard.a.griffiths@intel.com References: <200206200022.g5K0MKP27994@unix-os.sc.intel.com> <3D1127D6.F6988C4B@zip.com.au> <3D115563.4020402@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3D115563.4020402@us.ibm.com> User-Agent: Mutt/1.3.28i X-GPG-Key: 1024D/0D35BED6 X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6 Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Jun 19, 2002 21:09 -0700, Dave Hansen wrote: > Andrew Morton wrote: > >The vague plan there is to replace lock_kernel with lock_journal > >where appropriate. But ext3 scalability work of this nature > >will be targetted at the 2.5 kernel, most probably. > > I really doubt that dropping in lock_journal will help this case very > much. Every single kernel_flag entry in the lockmeter output where > Util > 0.00% is caused by ext3. The schedule entry is probably caused > by something in ext3 grabbing BKL, getting scheduled out for some > reason, then having it implicitly released in schedule(). The > schedule() contention comes from the reacquire_kernel_lock(). > > We used to see plenty of ext2 BKL contention, but Al Viro did a good > job fixing that early in 2.5 using a per-inode rwlock. I think that > this is the required level of lock granularity, another global lock > just won't cut it. > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock There are a variety of different efforts that could be made towards removing the BKL from ext2 and ext3. The first, of course, would be to have a per-filesystem lock instead of taking the BKL (I don't know if Al has changed lock_super() in 2.5 to be a real semaphore or not). As Andrew mentioned, there would also need to be be a per-journal lock to ensure coherency of the journal data. Currently the per-filesystem and per-journal lock would be equivalent, but when a single journal device can be shared among multiple filesystems they would be different locks. I will leave it up to Andrew and Stephen to discuss locking scalability within the journal layer. Within the filesystem there can be a large number of increasingly fine locks added - a superblock-only lock with per-group locks, or even per-bitmap and per-inode-table(-block) locks if needed. This would allow multi- threaded inode and block allocations, but a sane lock ranking strategy would have to be developed. The bitmap locks would only need to be 2-state locks, because you only look at the bitmaps when you want to modify them. The inode table locks would be read/write locks. If there is a try-writelock mechanism for the individual inode table blocks you can avoid write lock contention for creations by simply finding the first un-write-locked block in the target group's inode table (usually in the hundreds of blocks per group for default parameters). For inode allocation you don't really care which inode you get, as long as you get one in the preferred group (even that isn't critical for directory creation). For inode deletions you will get essentially random block locking, which is actually improved by the find-first-unlocked allocation policy (at the expense of dirtying more inode table blocks). Contention for the superblock lock for updates to the superblock free block and free inode counts could be mitigated by keeping "per-group delta buckets" in memory, that are written into the superblock only once every few seconds or at statfs time instead of needing multiple locks for each block/inode alloc/free. The groups already keep their own summary counts for free blocks and inodes. The coherency of these fields with the superblock on recovery would be handled at journal recovery time (either in the kernel or e2fsck*). Other than these two fields there are few write updates to the superblock (on ext3 there is also the orphan list, modified at truncate and when an open file is unlinked and when such a file is closed). I have even been thinking about multi-threaded directory-entry creation in a single directory. One nice thing about ext2/ext3 directory blocks is that each one is self-contained and can be modified independently. For regular ext2/ext3 directories you would only be able to do multi-threaded deletes by having a lock for each directory block. For creations you would need to lock the entire directory to ensure exclusive access for a create, which is the same single-threaded behaviour for a single directory we have today with the directory i_sem. However, if you are using the htree indexed directory layout (which you will be, if you care about scalable filesystem performance) then there is only a single[**] block into which a given filename can be added, so you can have per-block locks even for file creation. As the number of directory entries grows (and hence more directory blocks) the locking becomes increasingly more fine-grained so you get better scalability with larger directories, which is what you want. Cheers, Andreas [*] If we think that we will go to any kind of per-group locking in the near future, the support for this could be added into e2fsck and existing kernels today with read support for a COMPAT flag to ensure maximal forwards compatibility. On e2fsck runs we already validate the superblock on each boot, and the group descriptor table is contiguous with the superblock, so the amount of extra checking at boot time would be very minimal. The kernel already has ext[23]_count_free_{blocks,inodes} functions that just need a bit of tweaking to check only the descriptor summaries unless mounted with debug and check options, and to update the superblock counts at mount time if the COMPAT flag is set. [**] In rare circumstances you may have a large number of hash collisions for a single hash value which fill more than one block, so an entry with that hash value could live in more than a single block. This would need to be handled somehow (e.g. always getting the locks on all such blocks in order at create time; you only need a single block lock at delete time). -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/