Date: Tue, 11 Mar 2008 09:21:35 +1100
From: David Chinner
To: Christian Røsnes
Cc: David Chinner, xfs@oss.sgi.com
Subject: Re: XFS internal error xfs_trans_cancel at line 1150 of file fs/xfs/xfs_trans.c
Message-ID: <20080310222135.GZ155407@sgi.com>
In-Reply-To: <1a4a774c0803100302y17530814wee7522aa0dfd7668@mail.gmail.com>
 <1a4a774c0803100134k258e1bcfma95e7969bc44b2af@mail.gmail.com>
List-Id: xfs

On Mon, Mar 10, 2008 at 09:34:14AM +0100, Christian Røsnes wrote:
> On Mon, Mar 10, 2008 at 1:08 AM, David Chinner wrote:
> > On Fri, Mar 07, 2008 at 12:19:28PM +0100, Christian Røsnes wrote:
> > > > Actually, a single mkdir command is enough to trigger the filesystem
> > > > shutdown when its 99% full (according to df -k):
> > > >
> > > > /data# mkdir test
> > > > mkdir: cannot create directory `test': No space left on device
> > Ok, that's helpful ;)
> > So, can you dump the directory inode with
> > xfs_db? i.e.
> > # ls -ia /data
>
> # ls -ia /data
>   128 .  128 ..  131 content  149256847 rsync
>
> > The directory inode is the inode at ".", and if this is the root of
> > the filesystem it will probably be 128. Then run:
> > # xfs_db -r -c 'inode 128' -c p /dev/sdb1
>
> # xfs_db -r -c 'inode 128' -c p /dev/sdb1
> core.magic = 0x494e
> core.mode = 040755
> core.version = 1
> core.format = 1 (local)
.....
> core.size = 32
....
> u.sfdir2.hdr.count = 2
> u.sfdir2.hdr.i8count = 0
> u.sfdir2.hdr.parent.i4 = 128
> u.sfdir2.list[0].namelen = 7
> u.sfdir2.list[0].offset = 0x30
> u.sfdir2.list[0].name = "content"
> u.sfdir2.list[0].inumber.i4 = 131
> u.sfdir2.list[1].namelen = 5
> u.sfdir2.list[1].offset = 0x48
> u.sfdir2.list[1].name = "rsync"
> u.sfdir2.list[1].inumber.i4 = 149256847

Ok, so a shortform directory still with heaps of space in it. So
it's definitely not a directory namespace creation issue.

> > > > xfs_db -r -c 'sb 0' -c p /dev/sdb1
> > > > ----------------------------------
> > .....
> > > > fdblocks = 847484
> >
> > Apparently there are still lots of free blocks. I wonder if you are out of
> > space in the metadata AGs.
> >
> > Can you do this for me:
> >
> > -------
> > #!/bin/bash
> >
> > for i in `seq 0 1 15`; do
> >     echo freespace histogram for AG $i
> >     xfs_db -r -c "freesp -bs -a $i" /dev/sdb1
> > done
> > ------
> freespace histogram for AG 0
>    from      to extents  blocks    pct
>       1       1    2098    2098   3.77
>       2       3    8032   16979  30.54
>       4       7    6158   33609  60.46
>       8      15     363    2904   5.22

So with 256 byte inodes, we need a 16k allocation or a 4 block extent.
There's plenty of extents large enough to use for that, so it's not an
inode chunk allocation error.

> Btw - to debug this on a test-system, can I do a dd if=/dev/sdb1 or dd
> if=/dev/sdb,
> and output it to an image which is then loopback mounted on the test-system ?

That would work. Use /dev/sdb1 as the source so all you copy are
filesystem blocks.

> Ie.
> is there some sort of "best practice" on how to copy this
> partition to a test-system
> for further testing ?

Do what fits your needs - for debugging, identical images are generally
best. For debugging metadata or repair problems, xfs_metadump works very
well (replaces data with zeros, though), and for imaging purposes xfs_copy
is very efficient.

On Mon, Mar 10, 2008 at 11:02:28AM +0100, Christian Røsnes wrote:
> On Mon, Mar 10, 2008 at 9:34 AM, Christian Røsnes wrote:
> > On Mon, Mar 10, 2008 at 1:08 AM, David Chinner wrote:
> > > This does not appear to be the case I was expecting, though I can
> > > see how we can get an ENOSPC here with plenty of blocks free - none
> > > are large enough to allocate an inode chunk. What would be worth
> > > knowing is the value of resblks when this error is reported.
> >
> > Ok. I'll see if I can print it out.
>
> Ok. I added printk statments to xfs_mkdir in xfs_vnodeops.c:
>
> 'resblks=45' is the value returned by:
>
> resblks = XFS_MKDIR_SPACE_RES(mp, dir_namelen);
>
> and this is the value when the error_return label is called.

That confirms we're not out of directory space or filesystem space.

> --
>
> and inside xfs_dir_ialloc (file: xfs_utils.c) this is where it returns
>
> ...
>
>	code = xfs_ialloc(tp, dp, mode, nlink, rdev, credp, prid, okalloc,
>			  &ialloc_context, &call_again, &ip);
>
>	/*
>	 * Return an error if we were unable to allocate a new inode.
>	 * This should only happen if we run out of space on disk or
>	 * encounter a disk error.
>	 */
>	if (code) {
>		*ipp = NULL;
>		return code;
>	}
>	if (!call_again && (ip == NULL)) {
>		*ipp = NULL;
>		return XFS_ERROR(ENOSPC); <============== returns here
>	}

Interesting. That implies that xfs_ialloc() failed here:

	/*
	 * Call the space management code to pick
	 * the on-disk inode to be allocated.
	 */
	error = xfs_dialloc(tp, pip ? pip->i_ino : 0, mode, okalloc,
			    ialloc_context, call_again, &ino);
	if (error != 0) {
		return error;
	}
	if (*call_again || ino == NULLFSINO) {	<<<<<<<<<<<<<<<<
		*ipp = NULL;
		return 0;
	}

Which means that xfs_dialloc() failed without an error or setting
*call_again but setting ino == NULLFSINO. That leaves these possible
failure places:

	agbp = xfs_ialloc_ag_select(tp, parent, mode, okalloc);
	/*
	 * Couldn't find an allocation group satisfying the
	 * criteria, give up.
	 */
	if (!agbp) {
		*inop = NULLFSINO;
>>>>>>>>>>	return 0;
	}
........
	/*
	 * If we have already hit the ceiling of inode blocks then clear
	 * okalloc so we scan all available agi structures for a free
	 * inode.
	 */

	if (mp->m_maxicount &&
	    mp->m_sb.sb_icount + XFS_IALLOC_INODES(mp) > mp->m_maxicount) {
		noroom = 1;
		okalloc = 0;
	}
........
		if ((error = xfs_ialloc_ag_alloc(tp, agbp, &ialloced))) {
			xfs_trans_brelse(tp, agbp);
			if (error == ENOSPC) {
				*inop = NULLFSINO;
>>>>>>>>>>			return 0;
			} else
				return error;
........
nextag:
		if (++tagno == agcount)
			tagno = 0;
		if (tagno == agno) {
			*inop = NULLFSINO;
>>>>>>>>>>		return noroom ? ENOSPC : 0;
		}

Note that for the last case, we don't know what the value of "noroom"
is. noroom gets set to 1 if we've reached the maximum number of inodes
in the filesystem. From the earlier superblock dump you did:

> dblocks = 71627792
.....
> inopblog = 3
.....
> imax_pct = 25
> icount = 3570112
> ifree = 0

and the code that calculates this is:

	icount = sbp->sb_dblocks * sbp->sb_imax_pct;
	do_div(icount, 100);
	do_div(icount, mp->m_ialloc_blks);
	mp->m_maxicount = (icount * mp->m_ialloc_blks) << sbp->sb_inopblog;

therefore:

	m_maxicount = (((((71627792 * 25) / 100) / 4) * 4) << 3)
		    = 143,255,584

which is way larger than the 3,570,112 that you have already allocated.
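
For anyone who wants to re-check that arithmetic outside the kernel, here is a minimal userspace sketch in Python. It only mirrors the integer divisions of the quoted calculation; the function name max_inode_count is made up for illustration, and m_ialloc_blks = 4 is the value assumed in the calculation above, not something read from the filesystem.

```python
def max_inode_count(dblocks: int, imax_pct: int, ialloc_blks: int, inopblog: int) -> int:
    """Mirror the quoted m_maxicount arithmetic (hypothetical helper, not kernel code)."""
    icount = dblocks * imax_pct
    icount //= 100           # stands in for do_div(icount, 100)
    icount //= ialloc_blks   # stands in for do_div(icount, mp->m_ialloc_blks)
    # Round down to a whole number of inode chunks, then convert blocks to inodes.
    return (icount * ialloc_blks) << inopblog


# Values taken from the superblock dump quoted above.
# ialloc_blks = 4 is an assumption carried over from the mail's calculation.
maxicount = max_inode_count(dblocks=71627792, imax_pct=25, ialloc_blks=4, inopblog=3)
allocated = 3570112  # icount from the superblock dump

# Inodes that could still be allocated before the m_maxicount ceiling is reached.
headroom = maxicount - allocated

print(f"m_maxicount = {maxicount:,}")
print(f"headroom    = {headroom:,}")
```

With these inputs the ceiling works out to the same 143,255,584, and the headroom is far larger than any single inode chunk, which is consistent with the noroom check not firing.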
Hence I think that noroom == 0 and the last chunk of code above is
a possibility.

Further - we need to allocate new inodes as there are none free. That
implies we are calling xfs_ialloc_ag_alloc(). Taking a stab in the dark,
I suspect that we are not getting an error from xfs_ialloc_ag_alloc()
but we are not allocating inode chunks. Why?

Back to the superblock:

> unit = 16
> width = 32

You've got a filesystem with stripe alignment set. In xfs_ialloc_ag_alloc()
we attempt inode allocation by the following rules:

	1. a) If we haven't previously allocated inodes, fall through to 2.
	   b) If we have previously allocated inodes, attempt to allocate
	      next to the last inode chunk.

	2. If we do not have an extent now:
		a) if we have stripe alignment, try with alignment
		b) if we don't have stripe alignment try cluster alignment

	3. If we do not have an extent now:
		a) if we have stripe alignment, try with cluster alignment
		b) no stripe alignment, turn off alignment.

	4. If we do not have an extent now: FAIL.

Note the case missing from the stripe alignment fallback path - it does
not try without alignment at all. That means if all those extents large
enough that we found above are not correctly aligned, then we will still
fail to allocate an inode chunk. If all the AGs are like this, then we'll
fail to allocate at all and fall out of xfs_dialloc() through the last
fragment I quoted above.

As to the shutdown that this triggers - the attempt to allocate dirties
the AGFL and the AGF by moving free blocks into the free list for btree
splits, and cancelling a dirty transaction results in a shutdown.

Now, to test this theory. ;) Luckily, it's easy to test. Mount the
filesystem with the mount option "noalign" and rerun the mkdir test.
If it is an alignment problem, then setting noalign will prevent
this ENOSPC and shutdown as the filesystem will be able to allocate
more inodes.

Can you test this for me, Christian?

Cheers,

Dave.
>
>
> Christian

--
Dave Chinner
Principal Engineer
SGI Australian Software Group