Date: Fri, 04 Jul 2008 15:48:24 +0530
From: Sagar Borikar
Subject: Re: Xfs Access to block zero exception and system crash
To: Eric Sandeen
Cc: Nathan Scott, xfs@oss.sgi.com
Message-ID: <486DF8F0.5010700@pmc-sierra.com>
In-Reply-To: <486CE9EA.90502@sandeen.net>

Eric Sandeen wrote:
> Sagar Borikar wrote:
>> Eric Sandeen wrote:
>>>> Eric, could you please let me know about the bits and pieces that we
>>>> need to remember while backporting xfs to 2.6.18?
>>>> If you could share patches which take care of it, that would be great.
>>>
>>> http://sandeen.net/rhel5_xfs/xfs-2.6.25-for-rhel5-testing.tar.bz2
>>>
>>> should be pretty close. It was quick 'n' dirty and it has some warts,
>>> but it should give an idea of what backporting was done (see patches/
>>> and the associated quilt series; quilt push -a to apply them all).
>>
>> Thanks a lot, Eric. I'll go through it. I am actually trying another
>> option of regularly defragmenting the file system under stress.
>
> Ok, but that won't get to the bottom of the problem. It might alleviate
> it at best, but if I were shipping a product using xfs I'd want to know
> that it was properly solved. :)

We don't want to leave it as it is either. I am still working on backporting
the latest xfs code, and your patches are helping a lot. Just to check
whether the issue lies with 2.6.18 or with the MIPS port, I tested it on a
2.6.24 x86 platform. There we created a 10 GB loopback device and mounted
xfs on it. What I observe is that xfs_repair reports quite a few bad blocks
and bad extents on that setup as well. So is developing bad blocks and
extents normal behavior in xfs that gets recovered in the background, or is
it a bug? I still didn't see the exception, but the bad blocks and extents
are generated within 10 minutes of running the tests. Attaching the log.
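For reference, the loopback setup described above can be reproduced roughly
as follows. This is only a sketch; the image path, loop device and mount
point are illustrative, not the ones actually used in the test:

  # create a sparse 10 GB backing file and attach it to a loop device
  dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1 seek=10239
  losetup /dev/loop0 /tmp/xfs-test.img
  mkfs.xfs /dev/loop0
  mkdir -p /mnt/xfstest
  mount /dev/loop0 /mnt/xfstest

  # ... run the stress workload against /mnt/xfstest ...

  # then unmount and check the filesystem in no-modify mode; -n is what
  # produces the "would reset" / "would have junked" messages seen in the
  # attached log
  umount /mnt/xfstest
  xfs_repair -n /dev/loop0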
> The tarball above should give you almost everything you need to run your
> testcase with current xfs code on your older kernel to see if the bug
> persists or if it's been fixed upstream, in which case you have a
> relatively easy path to an actual solution that your customers can
> depend on.
>
>> I wanted to understand a couple of things about using the xfs_fsr utility:
>>
>> 1. What should the state of the filesystem be when I am running xfs_fsr?
>>    Ideally we should stop all IO before running defragmentation.
>
> You can run it in any state. Some files will not get defragmented due to
> busy-ness or other conditions; look at the xfs_swap_extents() function
> in the kernel, which is very well documented; some cases return EBUSY.
>
>> 2. How effective is the utility when run on a highly fragmented
>>    filesystem? I saw that if the filesystem is 99.89% fragmented, the
>>    recovery is very slow. It took around 25 minutes to clean up a 100GB
>>    JBOD volume, and after that the filesystem was still 82% fragmented.
>>    So I was confused about how exactly the defragmentation works.
>
> Again, read the code, but basically it tries to preallocate as much space
> as the file is currently using, then checks that it is more contiguous
> space than the file currently has, and if so, it copies the data from old
> to new and swaps the new allocation for the old. Note, this involves a
> fair amount of IO.
>
> Also don't get hung up on that fragmentation factor, at least not until
> you've read the xfs_db code to see how it's reported, and you've thought
> about what that means. For example: a 100G filesystem with 10 10G files,
> each with 5x2G extents, will report 80% fragmentation. Now, ask
> yourself, is a 10G file in 5x2G extents "bad" fragmentation?

Agreed; on x86 too I see 99.12% fragmentation when I run the above-mentioned
test, and xfs_fsr doesn't help much even after freezing the file system.

>> Any pointers on the probable optimum use of xfs_fsr?
>>
>> 3. Any precautions I need to take when working with it from a data
>>    consistency and robustness point of view? Any disadvantages?
>
> Anything which corrupts data is a bug, and I'm not aware of any such
> bugs in the defragmentation process.

Assuming that we get some improvement by running xfs_fsr, is it safe to run
the defragmentation utility regularly at some periodic interval?

>> 4. Any threshold for starting the defragmentation on xfs?
>
> Pretty well determined by your individual use case and requirements, I
> think.
>
> -Eric

Thanks for the detailed response, Eric.

Sagar
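To make the fragmentation factor concrete: xfs_db reports it roughly as
(actual extents - ideal extents) / actual extents, though as Eric suggests
the xfs_db frag code is the authority on the exact accounting. For the 100G
example above:

  actual extents = 10 files x 5 extents = 50
  ideal extents  = 10 files x 1 extent  = 10
  factor         = (50 - 10) / 50       = 80%

so a filesystem made up entirely of 2G extents still reports a high
percentage even though such extents are perfectly usable.

If periodic defragmentation does turn out to help, a cron entry along the
following lines would run it nightly on all mounted xfs filesystems. The
schedule, time limit and log path here are only an example (by default
xfs_fsr caps its own run time at two hours):

  # hypothetical nightly run at 03:00, capped at one hour, verbose output
  0 3 * * *  /usr/sbin/xfs_fsr -t 3600 -v >> /var/log/xfs_fsr.log 2>&1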
[Attachment: xfs_repair_log]

bad nblocks 13345 for inode 50331785, would reset to 19431
bad nextents 156 for inode 50331785, would reset to 251
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
entry "testfile" in shortform directory 132 references free inode 142
would have junked entry "testfile" in directory inode 132
entry "testfile" in shortform directory 138 references free inode 143
would have junked entry "testfile" in directory inode 138
entry "testfile" in shortform directory 140 references free inode 144
would have junked entry "testfile" in directory inode 140
bad nblocks 15848 for inode 141, would reset to 18634
bad nextents 269 for inode 141, would reset to 306
bad nblocks 18888 for inode 16777350, would reset to 19144
bad nextents 303 for inode 16777350, would reset to 309
bad nblocks 18704 for inode 16777351, would reset to 19144
bad nextents 291 for inode 16777351, would reset to 299
bad fwd (right) sibling pointer (saw 107678 should be NULLDFSBNO) in inode 142 ((null) fork) bmap btree block 236077307437232
would have cleared inode 142
bad fwd (right) sibling pointer (saw 1139882 should be NULLDFSBNO) in inode 143 ((null) fork) bmap btree block 4556402090352816
would have cleared inode 143
bad fwd (right) sibling pointer (saw 1138473 should be NULLDFSBNO) in inode 144 ((null) fork) bmap btree block 4564279060373680
would have cleared inode 144
bad nblocks 13825 for inode 145, would reset to 18503
bad nextents 221 for inode 145, would reset to 222
        - agno = 2
entry "testfile" in shortform directory 33595588 references free inode 33595593
would have junked entry "testfile" in directory inode 33595588
bad nblocks 18704 for inode 33595589, would reset to 19121
bad nextents 306 for inode 33595589, would reset to 314
bad nblocks 18704 for inode 33595590, would reset to 19432
bad nextents 302 for inode 33595590, would reset to 313
bad nblocks 18640 for inode 33595591, would reset to 19432
bad nextents 311 for inode 33595591, would reset to 317
bad nblocks 18888 for inode 33595592, would reset to 19432
bad nextents 312 for inode 33595592, would reset to 322
bad fwd (right) sibling pointer (saw 104113 should be NULLDFSBNO) in inode 33595593 ((null) fork) bmap btree block 9041060911947952
would have cleared inode 33595593
        - agno = 3
bad nblocks 18888 for inode 50331781, would reset to 19432
bad nextents 315 for inode 50331781, would reset to 324
bad nblocks 18888 for inode 50331782, would reset to 19432
bad nextents 326 for inode 50331782, would reset to 333
bad nblocks 18888 for inode 50331783, would reset to 19432
bad nblocks 18428 for inode 50331784, would reset to 19784
bad nextents 285 for inode 50331784, would reset to 306
bad nblocks 18704 for inode 16777352, would reset to 19144
bad nextents 311 for inode 16777352, would reset to 315
bad nblocks 13345 for inode 50331785, would reset to 19431
bad nextents 156 for inode 50331785, would reset to 251
bad nblocks 18888 for inode 16777353, would reset to 19144
bad nextents 318 for inode 16777353, would reset to 321
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
entry "testfile" in shortform directory inode 132 points to free inode 142
would junk entry
entry "testfile" in shortform directory inode 138 points to free inode 143
would junk entry
entry "testfile" in shortform directory inode 140 points to free inode 144
would junk entry
        - agno = 1
        - agno = 2
entry "testfile" in shortform directory inode 33595588 points to free inode 33595593
would junk entry
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
XFS_REPAIR Summary    Fri Jul 4 15:34:47 2008

Phase           Start           End             Duration
Phase 1:        07/04 15:34:00  07/04 15:34:04  4 seconds
Phase 2:        07/04 15:34:04  07/04 15:34:31  27 seconds
Phase 3:        07/04 15:34:31  07/04 15:34:47  16 seconds
Phase 4:        07/04 15:34:47  07/04 15:34:47
Phase 5:        Skipped
Phase 6:        07/04 15:34:47  07/04 15:34:47
Phase 7:        07/04 15:34:47  07/04 15:34:47

Total run time: 47 seconds