From: Linas Jankauskas
Date: Thu, 15 Nov 2012 10:34:02 +0200
To: xfs@oss.sgi.com
Subject: Re: Slow performance after ~4.5TB
Message-ID: <50A4A8FA.30403@iv.lt>
In-Reply-To: <20121114211356.GJ1710@dastard>

Ok, thanks for your help.

We will try to make 200 allocation groups and enable the inode64
option. Hopefully that will solve the problem.

Thanks
Linas

On 11/14/2012 11:13 PM, Dave Chinner wrote:
> On Tue, Nov 13, 2012 at 11:13:55AM +0200, Linas Jankauskas wrote:
>> The trace-cmd output was about 300MB, so I'm pasting the first
>> 100 lines of it; is that enough?
> ....
>>
>> Rsync command:
>>
>> /usr/bin/rsync -e ssh -c blowfish -a --inplace --numeric-ids
>> --hard-links --ignore-errors --delete --force
>
> Ok, so you are overwriting in place and deleting files/dirs that
> don't exist anymore. And they are all small files.
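The plan above (200 AGs plus inode64) might look something like the
following sketch. The device name and mount point are assumptions
carried over from the quoted xfs_db output, and mkfs.xfs -f destroys
the existing filesystem, so this only applies after the data has been
copied off:

```shell
# Assumed device and mount point -- substitute the real ones.
# Recreate the filesystem with 200 allocation groups
# (this wipes all existing data!):
mkfs.xfs -f -d agcount=200 /dev/sda5

# Mount with the inode64 allocator; allocsize=4k disables
# speculative EOF preallocation, per the advice in this thread:
mount -t xfs -o inode64,allocsize=4k /dev/sda5 /backup
```

Both agcount and the inode64/allocsize mount options are standard
mkfs.xfs / XFS mount knobs; only the device and mount point are made up
here.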
>
>> xfs_bmap on one random file:
>>
>> EXT: FILE-OFFSET  BLOCK-RANGE               AG AG-OFFSET               TOTAL FLAGS
>>   0: [0..991]:    26524782560..26524783551  12 (754978880..754979871)    992 00000
>>
>> xfs_db -r -c "frag" /dev/sda5
>> actual 81347252, ideal 80737778, fragmentation factor 0.75%
>
> And that indicates file fragmentation is not an issue.
>
>> agno: 0
>
> Not too bad.
>
>> agno: 1
>>
>>    from      to  extents    blocks    pct
>>       1       1    74085     74085   0.05
>>       2       3    97017    237788   0.15
>>       4       7   165766    918075   0.59
>>       8      15  2557055  35731152  22.78
>
> And there's the problem. Free space is massively fragmented in the
> 8-16 block size (32-64k) range. All the other AGs show the same
> pattern:
>
>>       8      15  2477693  34631683  18.51
>>       8      15  2479273  34656696  20.37
>>       8      15  2440290  34132542  20.51
>>       8      15  2461646  34419704  20.38
>>       8      15  2463571  34439233  21.06
>>       8      15  2487324  34785498  19.92
>>       8      15  2474275  34589732  19.85
>>       8      15  2438528  34100460  20.69
>>       8      15  2467056  34493555  20.04
>>       8      15  2457983  34364055  20.14
>>       8      15  2438076  34112592  22.48
>>       8      15  2465147  34481897  19.79
>>       8      15  2466844  34492253  21.44
>>       8      15  2445986  34205258  21.35
>>       8      15  2436154  34060275  19.60
>>       8      15  2438373  34082653  20.59
>>       8      15  2435860  34057838  21.01
>
> Given the uniform distribution of the freespace fragmentation, the
> problem is most likely the fact you are using the inode32 allocator.
>
> What it does is keep inodes in AG 0 (below 1TB) and rotors data
> extents across all the other AGs. Hence AG 0 has a different
> freespace pattern because it mainly contains metadata. The data AGs
> are showing the signs of files with no reference locality being
> packed adjacent to each other when written, then randomly removed,
> which leaves a swiss-cheese style of freespace fragmentation.
>
> The result is freespace btrees that are much, much larger than
> usual, and each AG is being randomly accessed by each userspace
> process.
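For reference, the per-AG histograms quoted above come from xfs_db's
freesp command, which is read-only and safe to re-run on the mounted
filesystem. A sketch, assuming the same /dev/sda5 device and the 18 AGs
(0..17) visible in the quoted output:

```shell
# Whole-filesystem summary of free space extent sizes:
xfs_db -r -c "freesp -s" /dev/sda5

# Per-AG histograms, as quoted above (this fs has AGs 0..17):
for ag in $(seq 0 17); do
    echo "agno: $ag"
    xfs_db -r -c "freesp -a $ag" /dev/sda5
done
```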
> This leads to long lock hold times during searches, and access
> from multiple CPUs at once slows things down and adds to lock
> contention.
>
> It appears that the threshold that limits performance for your
> workload and configuration is around 2.5 million freespace extents
> in a single size range. Most likely it is a linear scan of
> duplicate sizes trying to find the best block number match that is
> chewing up all the CPU. That's roughly what the event trace shows.
>
> I don't think you can fix a filesystem once it's got into this
> state. It's aged severely, and the only way to fix freespace
> fragmentation is to remove files from the filesystem. In this case,
> mkfs.xfs is going to be the only sane way to do that, because it's
> much faster than removing 90 million inodes...
>
> So, how to prevent it from happening again on a new filesystem?
>
> Using the inode64 allocator should prevent this freespace
> fragmentation from happening. It allocates file data in the same AG
> as the inode, and inodes are grouped in an AG based on the parent
> directory location. Directory inodes are rotored across AGs to
> spread them out. The way it searches for free space for new files
> is different, too, and will tend to fill holes near the inode
> before searching wider. Hence it's a much more local search, and it
> will fill holes created by deleting files/dirs much faster, leaving
> less swiss-cheese freespace fragmentation around.
>
> The other thing, given you have lots of rsyncs running at once, is
> to increase the number of AGs to reduce their size. More AGs will
> increase allocation parallelism, reducing contention, and also
> reduce the size of each freespace tree if freespace fragmentation
> does occur. Given you are tracking lots of small files (90 million
> inodes so far), I'd suggest increasing the number of AGs by an
> order of magnitude so that the size drops from 1TB down to 100GB.
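As a quick sanity check of that sizing: assuming an ~18 TiB device
(the quoted filesystem shows 18 AGs at the current ~1 TiB each) and a
~100 GiB target AG size, the arithmetic works out to roughly:

```shell
# Assumed figures for illustration: ~18 TiB device, ~100 GiB per AG.
dev_bytes=$((18 * 1024 ** 4))   # ~18 TiB
ag_bytes=$((100 * 1024 ** 3))   # ~100 GiB target AG size
# Round up so a final partial AG is counted too:
agcount=$(( (dev_bytes + ag_bytes - 1) / ag_bytes ))
echo "agcount ~= $agcount"      # ~= 185
```

That lands in the same ballpark as the 200 AGs planned above; mkfs.xfs
accepts the value directly via its -d agcount= option.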
> Even if freespace fragmentation then does occur, it is spread over
> 10x the number of freespace trees, and hence will have
> significantly less effect on performance.
>
> FWIW, you probably also want to set allocsize=4k as well, as you
> don't need speculative EOF preallocation on your workload to avoid
> file fragmentation....
>
> Cheers,
>
> Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs