* XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c
@ 2008-02-25 20:58 Wolfgang Karall
0 siblings, 0 replies; 6+ messages in thread
From: Wolfgang Karall @ 2008-02-25 20:58 UTC (permalink / raw)
To: xfs
Hello,
today I ran across the above error, resulting in an inaccessible FS;
after remounting once, xfs_repair didn't show any signs of errors. At
the time of the error the machine and also the concerned FS was under
rather high I/O load (doing test restores of some SVN repositories).
The FS is now working again without a hitch (and did before for close
to a year), latest operation on the filesystem was an xfs_growfs a
couple of days ago. I read a couple of reports about similar errors
(with differing line numbers), but didn't find conclusive
information, so I'm asking here what the problem might be.
For details see below, if anything is amiss, please let me know. And
please CC me on the replies, since I'm not subscribed to the list.
Kind regards
Wolfgang Karall
Full dmesg output of the problem:
---------
Filesystem "dm-13": XFS internal error xfs_trans_cancel at line 1138
of file fs/xfs/xfs_trans.c. Caller 0xffffffff88158042
Call Trace:
[<ffffffff8814fd54>] :xfs:xfs_trans_cancel+0x5b/0xfe
[<ffffffff88158042>] :xfs:xfs_create+0x58b/0x5dd
[<ffffffff8816053e>] :xfs:xfs_vn_mknod+0x1bd/0x3c8
[<ffffffff8815c793>] :xfs:xfs_buf_rele+0x32/0xa1
[<ffffffff8812bd3d>] :xfs:xfs_da_brelse+0x70/0x9d
[<ffffffff80207138>] kmem_cache_free+0x77/0xca
[<ffffffff8813379b>] :xfs:xfs_dir2_node_lookup+0xa9/0xb8
[<ffffffff8812e887>] :xfs:xfs_dir2_isleaf+0x19/0x4a
[<ffffffff8812f0c4>] :xfs:xfs_dir_lookup+0x100/0x122
[<ffffffff8020f89e>] link_path_walk+0xd3/0xe5
[<ffffffff8023b184>] vfs_create+0xe7/0x12c
[<ffffffff8021adfc>] open_namei+0x18c/0x6a0
[<ffffffff8023beda>] chown_common+0xa5/0xb0
[<ffffffff80227624>] do_filp_open+0x1c/0x3d
[<ffffffff80219a7f>] do_sys_open+0x44/0xc5
[<ffffffff8025be0e>] system_call+0x86/0x8b
[<ffffffff8025bd88>] system_call+0x0/0x8b
Filesystem "dm-13": Corruption of in-memory data detected. Shutting
down filesystem: dm-13
Please umount the filesystem, and rectify the problem(s)
---------
More information about the software versions in use, mount options, etc.:
---------
test:~# uname -a
Linux test 2.6.18-6-xen-amd64 #1 SMP Sun Feb 10 18:02:52 UTC 2008
x86_64 GNU/Linux
test:~# dpkg -l xfsprogs
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name           Version      Description
+++-==============-============-=========================================
ii  xfsprogs       2.8.11-1     Utilities for managing the XFS filesystem
test:~# mount | grep /srv
/dev/mapper/test-srv on /srv type xfs (rw)
test:~# grep /srv /etc/fstab
/dev/test/srv /srv xfs rw 0 2
test:~# xfs_info /srv
meta-data=/dev/test/srv      isize=256    agcount=56, agsize=65536 blks
         =                   sectsz=512   attr=0
data     =                   bsize=4096   blocks=3670016, imaxpct=25
         =                   sunit=0      swidth=0 blks, unwritten=1
naming   =version 2          bsize=4096
log      =internal           bsize=4096   blocks=2560, version=1
         =                   sectsz=512   sunit=0 blks
realtime =none               extsz=65536  blocks=0, rtextents=0
---------
^ permalink raw reply [flat|nested] 6+ messages in thread
* XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c
@ 2008-09-21 19:29 Kevin Jamieson
2008-09-23 7:50 ` Mark Goodwin
2008-09-23 9:18 ` Dave Chinner
0 siblings, 2 replies; 6+ messages in thread
From: Kevin Jamieson @ 2008-09-21 19:29 UTC (permalink / raw)
To: xfs
Hello,
We have encountered the following error several times on some of our
customer systems running SLES 10 SP1 (kernel version 2.6.16.46-0.12):
Sep 12 01:24:41 gn1 kernel: Filesystem "dm-0": XFS internal error
xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. Caller 0xf9429f23
Sep 12 01:24:41 gn1 kernel: [<f942277d>] xfs_trans_cancel+0x4d/0xd6 [xfs]
Sep 12 01:24:41 gn1 kernel: [<f9429f23>] xfs_create+0x4f1/0x52a [xfs]
Sep 12 01:24:41 gn1 kernel: [<f9429f23>] xfs_create+0x4f1/0x52a [xfs]
Sep 12 01:24:41 gn1 kernel: [<f9433657>] xfs_vn_mknod+0x172/0x276 [xfs]
Sep 12 01:24:41 gn1 kernel: [<c016c45b>] vfs_create+0xd1/0x138
Sep 12 01:24:41 gn1 kernel: [<c016e862>] sys_mknodat+0xfa/0x164
Sep 12 01:24:41 gn1 kernel: [<c016fc47>] do_ioctl+0x4f/0x5e
Sep 12 01:24:41 gn1 kernel: [<c016fea2>] vfs_ioctl+0x24c/0x25e
Sep 12 01:24:41 gn1 kernel: [<c016e8df>] sys_mknod+0x13/0x17
Sep 12 01:24:41 gn1 kernel: [<c0103bdb>] sysenter_past_esp+0x54/0x79
Sep 12 01:24:41 gn1 kernel: xfs_force_shutdown(dm-0,0x8) called from
line 1139 of file fs/xfs/xfs_trans.c. Return address = 0xf94359db
Sep 12 01:24:41 gn1 kernel: Filesystem "dm-0": Corruption of in-memory
data detected. Shutting down filesystem: dm-0
From a recent occurrence of this, I was able to obtain a copy of the
file system prior to it being restored from backup.
The file system in question is relatively full -- 800GB with 25GB free
and 70M inodes used (the file system is used with a proprietary
DMAPI-based HSM application, so most of the file data is punched out).
With the file system in this state, the forced shutdown error is
reproducible when attempting to create a file in a particular directory
-- an ENOSPC error is returned to userspace, and the file system shuts down.
The forced shutdown is also reproducible with this file system mounted
on a more recent kernel version -- here is a stack trace from the same
file system mounted on a 2.6.26 kernel built from oss.sgi.com cvs on Sep
19 2008:
Sep 21 06:35:41 gn1 kernel: Filesystem "loop0": XFS internal error
xfs_trans_cancel at line 1164 of file fs/xfs/xfs_trans.c. Caller 0xf93c8195
Sep 21 06:35:41 gn1 kernel: [<f93c2fc0>] xfs_trans_cancel+0x4d/0xd3 [xfs]
Sep 21 06:35:41 gn1 kernel: [<f93c8195>] xfs_create+0x49b/0x4db [xfs]
Sep 21 06:35:41 gn1 kernel: [<f93c8195>] xfs_create+0x49b/0x4db [xfs]
Sep 21 06:35:41 gn1 kernel: [<f93d166b>] xfs_vn_mknod+0x128/0x1e3 [xfs]
Sep 21 06:35:41 gn1 kernel: [<c0170e9d>] vfs_create+0xb4/0x117
Sep 21 06:35:41 gn1 kernel: [<c0172c46>] do_filp_open+0x1a0/0x671
Sep 21 06:35:41 gn1 kernel: [<c01681da>] do_sys_open+0x40/0xb6
Sep 21 06:35:41 gn1 kernel: [<c0168294>] sys_open+0x1e/0x23
Sep 21 06:35:41 gn1 kernel: [<c0104791>] sysenter_past_esp+0x6a/0x99
Sep 21 06:35:41 gn1 kernel: [<c02b0000>] unix_listen+0x8/0xc9
Sep 21 06:35:41 gn1 kernel: =======================
Sep 21 06:35:41 gn1 kernel: xfs_force_shutdown(loop0,0x8) called from
line 1165 of file fs/xfs/xfs_trans.c. Return address = 0xf93c2fd6
Sep 21 06:35:41 gn1 kernel: Filesystem "loop0": Corruption of in-memory
data detected. Shutting down filesystem: loop0
Tracing through the XFS code, the ENOSPC error is returned here from
fs/xfs/xfs_da_btree.c:
xfs_da_grow_inode(xfs_da_args_t *args, xfs_dablk_t *new_blkno)
{
...
if (got != count || mapp[0].br_startoff != bno ||
...
return XFS_ERROR(ENOSPC);
}
...
}
where got = 0 and count = 1 and xfs_da_grow_inode() is called from
xfs_create() -> xfs_dir_createname() -> xfs_dir2_node_addname() ->
xfs_da_split() -> xfs_da_root_split()
xfs_repair -n (the latest version of xfs_repair from cvs, as the SLES 10
SP1 version just runs out of memory) does not report any problems with
the file system, but after running xfs_repair (without -n) on the file
system, the error can no longer be triggered. Based on this, I suspect a
problem with the free space btrees, as I understand that xfs_repair
rebuilds them. I tried running xfs_check (latest cvs version also) as
well but it runs out of memory and dies.
Are there any known issues in 2.6.16 that could lead to this sort of
problem? If there is any additional information that would be helpful in
tracking this down, please let me know. If needed, I can probably make a
xfs_metadump of the file system available to someone from SGI later this
week.
Some additional information that may be useful:
# xfs_info /fs
meta-data=/dev/loop0         isize=1024   agcount=32, agsize=6179712 blks
         =                   sectsz=512   attr=1
data     =                   bsize=4096   blocks=197750784, imaxpct=0
         =                   sunit=0      swidth=0 blks, unwritten=1
naming   =version 2          bsize=4096
log      =internal           bsize=4096   blocks=32768, version=1
         =                   sectsz=512   sunit=0 blks
realtime =none               extsz=4096   blocks=0, rtextents=0
# xfs_db -f -r -c 'sb 0' -c p /mnt/scratch3/xfs.fs
magicnum = 0x58465342
blocksize = 4096
dblocks = 197750784
rblocks = 0
rextents = 0
uuid = 18602b1d-e419-4f97-9bd7-37c4b64b884a
logstart = 134217732
rootino = 32
rbmino = 33
rsumino = 34
rextsize = 1
agblocks = 6179712
agcount = 32
rbmblocks = 0
logblocks = 32768
versionnum = 0x3094
sectsize = 512
inodesize = 1024
inopblock = 4
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 10
inopblog = 2
agblklog = 23
rextslog = 0
inprogress = 0
imax_pct = 0
icount = 72288384
ifree = 1131
fdblocks = 6101857
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 0
features2 = 0
This is for the directory that triggers the problem:
# xfs_db -f -r -c 'inode 304409506' -c p /mnt/scratch3/xfs.fs
core.magic = 0x494e
core.mode = 040700
core.version = 1
core.format = 2 (extents)
core.nlinkv1 = 2
core.uid = 60000
core.gid = 65534
core.flushiter = 0
core.atime.sec = Fri Sep 12 01:15:17 2008
core.atime.nsec = 169962854
core.mtime.sec = Fri Sep 12 01:24:41 2008
core.mtime.nsec = 058880718
core.ctime.sec = Fri Sep 12 01:24:41 2008
core.ctime.nsec = 058880718
core.size = 40960
core.nblocks = 12
core.extsize = 0
core.nextents = 12
core.naextents = 0
core.forkoff = 24
core.aformat = 1 (local)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.gen = 0
next_unlinked = null
u.bmx[0-11] = [startoff,startblock,blockcount,extentflag]
0:[0,76102390,1,0] 1:[1,76102361,1,0] 2:[2,76102054,1,0]
3:[3,76102053,1,0] 4:[4,76101744,1,0] 5:[5,76101743,1,0]
6:[6,76101742,1,0] 7:[7,76101741,1,0] 8:[8,76101740,1,0]
9:[9,76101707,1,0] 10:[8388608,76102391,1,0] 11:[16777216,76102719,1,0]
a.sfattr.hdr.totsize = 74
a.sfattr.hdr.count = 2
a.sfattr.list[0].namelen = 16
a.sfattr.list[0].valuelen = 8
a.sfattr.list[0].root = 1
a.sfattr.list[0].secure = 0
a.sfattr.list[0].name = "\004,\005XfY\016oapoBO\bR)"
a.sfattr.list[0].value = "\000\000\000\000\000\000\000\000"
a.sfattr.list[1].namelen = 16
a.sfattr.list[1].valuelen = 24
a.sfattr.list[1].root = 1
a.sfattr.list[1].secure = 0
a.sfattr.list[1].name = "\003>\016oMa7\034@#I8\004X\027G"
a.sfattr.list[1].value =
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
# xfs_repair -f /mnt/scratch4/xfs.fs
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
Thanks,
Kevin Jamieson
* Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c
2008-09-21 19:29 XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c Kevin Jamieson
@ 2008-09-23 7:50 ` Mark Goodwin
2008-09-23 9:18 ` Dave Chinner
1 sibling, 0 replies; 6+ messages in thread
From: Mark Goodwin @ 2008-09-23 7:50 UTC (permalink / raw)
To: kevin; +Cc: xfs
Hi Kevin, sorry for the delay - yes AFAICT we have seen this before,
see the discussion and patches in the xfs@oss archives at
http://oss.sgi.com/archives/xfs/2008-03/msg00144.html
I think several patches have been committed to recent mainline kernels,
but it looks like your testing on 2.6.26 indicates either it still isn't
fixed, or you've hit a different problem, despite the similar stack
traces. Tim might have additional info to add here ...
In any case, this has also been reported in Novell bugzilla #243029
against SLES10 SP1. Since you're running SLES10 SP1, perhaps you could get
in touch with Novell for support?
Thanks
-- Mark
Kevin Jamieson wrote:
> Hello,
>
> We have encountered the following error several times on some of our
> customer systems running SLES 10 SP1 (kernel version 2.6.16.46-0.12):
>
> [...]
--
Mark Goodwin markgw@sgi.com
Engineering Manager for XFS and PCP Phone: +61-3-99631937
SGI Australian Software Group Cell: +61-4-18969583
-------------------------------------------------------------
* Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c
2008-09-21 19:29 XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c Kevin Jamieson
2008-09-23 7:50 ` Mark Goodwin
@ 2008-09-23 9:18 ` Dave Chinner
2008-09-24 2:49 ` Kevin Jamieson
[not found] ` <54241.24.80.224.145.1222383385.squirrel@squirrel.kevinjamieson.com>
1 sibling, 2 replies; 6+ messages in thread
From: Dave Chinner @ 2008-09-23 9:18 UTC (permalink / raw)
To: Kevin Jamieson; +Cc: xfs
On Sun, Sep 21, 2008 at 12:29:49PM -0700, Kevin Jamieson wrote:
> The forced shutdown is also reproducible with this file system mounted
> on a more recent kernel version -- here is a stack trace from the same
> file system mounted on a 2.6.26 kernel built from oss.sgi.com cvs on Sep
> 19 2008:
>
> Sep 21 06:35:41 gn1 kernel: Filesystem "loop0": XFS internal error
> xfs_trans_cancel at line 1164 of file fs/xfs/xfs_trans.c. Caller
> 0xf93c8195
> Sep 21 06:35:41 gn1 kernel: [<f93c2fc0>] xfs_trans_cancel+0x4d/0xd3 [xfs]
> Sep 21 06:35:41 gn1 kernel: [<f93c8195>] xfs_create+0x49b/0x4db [xfs]
> Sep 21 06:35:41 gn1 kernel: [<f93c8195>] xfs_create+0x49b/0x4db [xfs]
> Sep 21 06:35:41 gn1 kernel: [<f93d166b>] xfs_vn_mknod+0x128/0x1e3 [xfs]
> Sep 21 06:35:41 gn1 kernel: [<c0170e9d>] vfs_create+0xb4/0x117
> Sep 21 06:35:41 gn1 kernel: [<c0172c46>] do_filp_open+0x1a0/0x671
> Sep 21 06:35:41 gn1 kernel: [<c01681da>] do_sys_open+0x40/0xb6
> Sep 21 06:35:41 gn1 kernel: [<c0168294>] sys_open+0x1e/0x23
> Sep 21 06:35:41 gn1 kernel: [<c0104791>] sysenter_past_esp+0x6a/0x99
> Sep 21 06:35:41 gn1 kernel: [<c02b0000>] unix_listen+0x8/0xc9
> Sep 21 06:35:41 gn1 kernel: =======================
> Sep 21 06:35:41 gn1 kernel: xfs_force_shutdown(loop0,0x8) called from
> line 1165 of file fs/xfs/xfs_trans.c. Return address = 0xf93c2fd6
> Sep 21 06:35:41 gn1 kernel: Filesystem "loop0": Corruption of in-memory
> data detected. Shutting down filesystem: loop0
Oh, that's interesting. I've been trying to track down the problem
on TOT kernels without much luck recently.
> Tracing through the XFS code, the ENOSPC error is returned here from
> fs/xfs/xfs_da_btree.c:
>
> xfs_da_grow_inode(xfs_da_args_t *args, xfs_dablk_t *new_blkno)
> {
> ...
> if (got != count || mapp[0].br_startoff != bno ||
> ...
> return XFS_ERROR(ENOSPC);
> }
> ...
> }
>
> where got = 0 and count = 1 and xfs_da_grow_inode() is called from
> xfs_create() -> xfs_dir_createname() -> xfs_dir2_node_addname() ->
> xfs_da_split() -> xfs_da_root_split()
got = 0 means that xfs_bmapi() returned zero blocks. Given that it
was only being asked for a single block (from the xfs_info output),
that implies that either the FS was out of space or that the order
of AG locking meant we couldn't get to the AGs that had space in
them. Given that the transaction reservation or the
xfs_dir_can_enter() check should ensure we have space available,
I'm inclined to think that the free space is in an AG we can't
currently allocate out of because of previous allocations for
other blocks needed by the split....
> xfs_repair -n (the latest version of xfs_repair from cvs, as the SLES 10
> SP1 version just runs out of memory) does not report any problems with
> the file system, but after running xfs_repair (without -n) on the file
> system, the error can no longer be triggered. Based on this, I suspect a
> problem with the free space btrees, as I understand that xfs_repair
> rebuilds them. I tried running xfs_check (latest cvs version also) as
> well but it runs out of memory and dies.
Rebuilding the freespace trees will change the pattern of free space
in each AG, which means the same sequence of events could result in
different allocation patterns.
> Are there any known issues in 2.6.16 that could lead to this sort of
> problem? If there is any additional information that would be helpful in
> tracking this down, please let me know. If needed, I can probably make a
> xfs_metadump of the file system available to someone from SGI later this
> week.
A metadump will tell us what the freespace patterns are....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c
2008-09-23 9:18 ` Dave Chinner
@ 2008-09-24 2:49 ` Kevin Jamieson
[not found] ` <54241.24.80.224.145.1222383385.squirrel@squirrel.kevinjamieson.com>
1 sibling, 0 replies; 6+ messages in thread
From: Kevin Jamieson @ 2008-09-24 2:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Tue, September 23, 2008 2:18 am, Dave Chinner wrote:
> A metadump will tell us what the freespace patterns are....
I should be able to get a metadump for you soon, just waiting for approval
from management (sigh).
> I'm inclined to think that the free space is in an AG we can't
> currently allocate out of because of previous allocations for
> other blocks needed by the split....
Adding some debug messages (this is with the latest oss.sgi.com CVS
kernel) into xfs_alloc_vextent() and xfs_alloc_fix_freelist() shows the
following allocations for the create transaction, all in AG 9 which has
only 40 free blocks:
kernel: XFS: ENTER function xfs_alloc_vextent file fs/xfs/xfs_alloc.c line
2285 agno 9 type 4 called from:
kernel: [<f951f5c0>] xfs_alloc_vextent+0xe4/0x520 [xfs]
kernel: [<f952fe4f>] xfs_bmap_btalloc+0x791/0xa80 [xfs]
kernel: [<f9530b2f>] xfs_bmapi+0x9cf/0x148c [xfs]
kernel: [<f95372bd>] xfs_da_grow_inode+0xe6/0x2a1 [xfs]
kernel: [<f954009a>] xfs_dir2_leafn_split+0x29/0x27f [xfs]
kernel: [<f9539419>] xfs_da_split+0xc2/0x236 [xfs]
kernel: [<f953f55b>] xfs_dir2_node_addname+0x845/0x8ae [xfs]
kernel: [<f95646ca>] kmem_zone_alloc+0x46/0x8a [xfs]
kernel: [<f956472b>] kmem_zone_zalloc+0x1d/0x41 [xfs]
kernel: [<f953a149>] xfs_dir_createname+0x12c/0x13a [xfs]
kernel: [<f9561447>] xfs_create+0x2d9/0x4e6 [xfs]
kernel: =======================
kernel: XFS: function xfs_alloc_fix_freelist file fs/xfs/xfs_alloc.c line
1855 agno 9 minlen 1 alignment 1 minalignslop 0 longest 10 minleft 1
pagf_freeblks 36 pagf_flcount 4 need 4 total 35
kernel: XFS: EXIT function xfs_alloc_vextent file fs/xfs/xfs_alloc.c line
2454 fsbno 76102855
kernel: XFS: ENTER function xfs_alloc_vextent file fs/xfs/xfs_alloc.c line
2285 agno 9 type 5 called from:
kernel: [<f951f5c0>] xfs_alloc_vextent+0xe4/0x520 [xfs]
kernel: [<f952aeb3>] xfs_bmap_extents_to_btree+0x26d/0x4eb [xfs]
kernel: [<f95490e6>] xfs_iext_add+0xe0/0x213 [xfs]
kernel: [<f952d10d>] xfs_bmap_add_extent+0x338/0x3cb [xfs]
kernel: [<f9530d50>] xfs_bmapi+0xbf0/0x148c [xfs]
kernel: [<f95372bd>] xfs_da_grow_inode+0xe6/0x2a1 [xfs]
kernel: [<f954009a>] xfs_dir2_leafn_split+0x29/0x27f [xfs]
kernel: [<f9539419>] xfs_da_split+0xc2/0x236 [xfs]
kernel: [<f953f55b>] xfs_dir2_node_addname+0x845/0x8ae [xfs]
kernel: [<f95646ca>] kmem_zone_alloc+0x46/0x8a [xfs]
kernel: [<f956472b>] kmem_zone_zalloc+0x1d/0x41 [xfs]
kernel: [<f953a149>] xfs_dir_createname+0x12c/0x13a [xfs]
kernel: [<f9561447>] xfs_create+0x2d9/0x4e6 [xfs]
kernel: =======================
kernel: XFS: function xfs_alloc_fix_freelist file fs/xfs/xfs_alloc.c line
1855 agno 9 minlen 1 alignment 1 minalignslop 0 longest 10 minleft 0
pagf_freeblks 35 pagf_flcount 4 need 4 total 0
kernel: XFS: EXIT function xfs_alloc_vextent file fs/xfs/xfs_alloc.c line
2454 fsbno 76103705
kernel: XFS: ENTER function xfs_alloc_vextent file fs/xfs/xfs_alloc.c line
2285 agno 9 type 5 called from:
kernel: [<f951f5c0>] xfs_alloc_vextent+0xe4/0x520 [xfs]
kernel: [<c0132a98>] up+0x9/0x2a
kernel: [<f952fe4f>] xfs_bmap_btalloc+0x791/0xa80 [xfs]
kernel: [<f952d10d>] xfs_bmap_add_extent+0x338/0x3cb [xfs]
kernel: [<f95485e3>] xfs_iext_bno_to_ext+0xd8/0x191 [xfs]
kernel: [<f9530b2f>] xfs_bmapi+0x9cf/0x148c [xfs]
kernel: [<f956713f>] _xfs_buf_lookup_pages+0x25e/0x285 [xfs]
kernel: [<f95372bd>] xfs_da_grow_inode+0xe6/0x2a1 [xfs]
kernel: [<f953ed09>] xfs_dir2_leafn_add+0x29c/0x2a9 [xfs]
kernel: [<f953c7c7>] xfs_dir2_leaf_log_header+0x9/0xb [xfs]
kernel: [<f9538df4>] xfs_da_root_split+0x1d/0x15e [xfs]
kernel: [<f953756e>] xfs_da_fixhashpath+0x4c/0xbe [xfs]
kernel: [<f95394ba>] xfs_da_split+0x163/0x236 [xfs]
kernel: [<f953f55b>] xfs_dir2_node_addname+0x845/0x8ae [xfs]
kernel: [<f95646ca>] kmem_zone_alloc+0x46/0x8a [xfs]
kernel: [<f956472b>] kmem_zone_zalloc+0x1d/0x41 [xfs]
kernel: [<f953a149>] xfs_dir_createname+0x12c/0x13a [xfs]
kernel: [<f9561447>] xfs_create+0x2d9/0x4e6 [xfs]
kernel: =======================
kernel: XFS: function xfs_alloc_fix_freelist file fs/xfs/xfs_alloc.c line
1855 agno 9 minlen 1 alignment 1 minalignslop 0 longest 10 minleft 0
pagf_freeblks 34 pagf_flcount 4 need 4 total 35
kernel: XFS: EXIT function xfs_alloc_vextent file fs/xfs/xfs_alloc.c line
2454 fsbno -1
So this last allocation is failing this test in xfs_alloc_fix_freelist()
and not allocating anything:

	/*
	 * If it looks like there isn't a long enough extent, or enough
	 * total blocks, reject it.
	 */
	if ((args->minlen + args->alignment + args->minalignslop - 1) >
			longest ||
	    ((int)(pag->pagf_freeblks + pag->pagf_flcount -
		   need - args->total) < (int)args->minleft)) {
		if (agbp)
			xfs_trans_brelse(tp, agbp);
		args->agbp = NULL;
		return 0;
	}
Thanks for your help,
Kevin Jamieson
* Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c
[not found] ` <62255.192.168.1.1.1222403942.squirrel@squirrel.kevinjamieson.com>
@ 2008-09-26 10:16 ` Dave Chinner
0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2008-09-26 10:16 UTC (permalink / raw)
To: Kevin Jamieson; +Cc: Mark Goodwin, xfs
On Thu, Sep 25, 2008 at 09:39:02PM -0700, Kevin Jamieson wrote:
> On Thu, September 25, 2008 6:27 pm, Dave Chinner wrote:
> > On Thu, Sep 25, 2008 at 03:56:25PM -0700, Kevin Jamieson wrote:
> >> On Tue, September 23, 2008 2:18 am, Dave Chinner wrote:
> >>
> >> > A metadump will tell us what the freespace patterns are....
> >>
> >> Hi Dave,
> >>
> >> A metadump of a file system that triggers this issue is now available on
Cc'ing this back to the open list because there have been several
other occurrences of this problem recently, so I want this to
hit the public archives.
Firstly, Kevin, thank you for the image and the trivial test case.
This is exactly how I found the last one of these problems - 20
minutes with UML and single-stepping through gdb....
Breakpoint 5, xfs_dir_createname (tp=0x7f3bba18, dp=0x7f4b90c0,
    name=0x7f273c30, inum=308318065, first=0x7f273bc0, flist=0x7f273b90,
    total=35) at fs/xfs/xfs_dir2.c:207
We have a reservation of 35 blocks for this operation.
xfs_dir_createname()
  xfs_dir2_node_addname()
    xfs_dir2_node_addname_int() - adds new block,
      xfs_dir2_leafn_add() - full block, no stale
    xfs_da_split()
      xfs_dir2_leafn_split() - single block allocated out of AG 9
        xfs_da_grow_inode()
      xfs_da_root_split()
        xfs_da_grow_inode() - fails to allocate single block
Allocation fails with AG 9 having 34 free blocks and it does not
try any other AG. Now to trace the second xfs_bmapi call to see why
it fails.
xfs_bmapi()
xfs_bmap_alloc()
(gdb) p *ap
$26 = {firstblock = 83559978, rval = 1612378431, off = 8388610, tp = 0x7f14da18,
ip = 0x7f4ba0c0, prevp = 0x7f0c7790, gotp = 0x7f0c77b0, alen = 1, total = 35,
minlen = 1, minleft = 0, eof = 0 '\0', wasdel = 0 '\0', userdata = 0 '\0',
low = 0 '\0', aeof = 0 '\0', conv = 0 '\0'}
xfs_bmap_btalloc()
xfs_alloc_vextent()
(gdb) p *args
$30 = {tp = 0x7f14da18, mp = 0x7f12f800, agbp = 0x7f0c77f4, pag = 0x7f595e80, fsbno = 83559979,
agno = 9, agbno = 0, minlen = 1, maxlen = 1, mod = 0, prod = 1, minleft = 0,
total = 35, alignment = 1, minalignslop = 0, len = 2131523184, type = XFS_ALLOCTYPE_NEAR_BNO,
otype = 1612065600, wasdel = 0 '\0', wasfromfl = 0 '\0', isfl = 0 '\0', userdata = 0 '\0',
firstblock = 83559978}
xfs_alloc_fix_freelist()
1842	if (!(flags & XFS_ALLOC_FLAG_FREEING)) {
1843		need = XFS_MIN_FREELIST_PAG(pag, mp);
1844		delta = need > pag->pagf_flcount ? need - pag->pagf_flcount : 0;
1845		/*
1846		 * If it looks like there isn't a long enough extent, or enough
1847		 * total blocks, reject it.
1848		 */
1849		longest = (pag->pagf_longest > delta) ?
1850			  (pag->pagf_longest - delta) :
1851			  (pag->pagf_flcount > 0 || pag->pagf_longest > 0);
1852		if ((args->minlen + args->alignment + args->minalignslop - 1) >
1853				longest ||
1854	>>>>>>>	    ((int)(pag->pagf_freeblks + pag->pagf_flcount -
1855	>>>>>>>		   need - args->total) < (int)args->minleft)) {
1856			if (agbp)
1857				xfs_trans_brelse(tp, agbp);
1858	>>>>>>>		args->agbp = NULL;
1859	>>>>>>>		return 0;
1860		}
1861	}
We are failing the marked check:
pag->pagf_freeblks + pag->pagf_flcount - need - args->total = 34 + 4 - 4 - 35 = -1
and args->minleft = 0, so -1 < 0 and the allocation is rejected.
The problem is that AG 9 has only 34 free blocks left when the root
split occurs.
So, what has happened is this:
- transaction block reservation is for 35 blocks
- directory located in AG 9
- AG 9 has 35 free blocks.
- we've allocated a new block in the directory for the name
- allocation set up with args->total = 35
- single block allocated reduces AG 9 to 34 free blocks
- node is full, so can't add pointer to new free block
- triggers root split
- root split tries to allocate new block with:
- allocation set up with args->total = 35
- AG 9 only has 34 free blocks now.
- fails with not enough space for "entire transaction"
in the AG.
Hence an ENOSPC with plenty of space left in the AG and huge
amounts of free space in the filesystem, and a shutdown because we
are cancelling a dirty transaction.
There are several problems here:
1. the directory code does not account for blocks that get allocated
by reducing args->total as blocks are allocated. That directly
causes this particular shutdown.
2. the xfs bmap code has no way of passing back how many blocks
were allocated to the inode. We're going to have to infer it
from the number of reserved blocks used in the transaction
structure or from the change in the inode block count
across the allocation....
3. we're going to have to audit and fix all the allocation
calls in the directory code to ensure the accounting is
correct.
4. the check in xfs_alloc_fix_freelist() is incorrect.
- it assumes that we can completely empty the AG
- we must leave 4 blocks behind in the AG so that the
first extent free on a full AG can succeed.
- hence even if we fix 1), this case could still fail
once we get to 32 of 35 blocks allocated.
5. The metadata allocation is a XFS_ALLOCTYPE_NEAR_BNO
allocation with no fallback if the AG is ENOSPC - if we
can't allocate in that AG, we fail. Why isn't there a
fallback in this case? Directory btree blocks are not
confined to a single AG, right?
This is going to take a bit of work to fix....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 6+ messages
2008-09-21 19:29 XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c Kevin Jamieson
2008-09-23 7:50 ` Mark Goodwin
2008-09-23 9:18 ` Dave Chinner
2008-09-24 2:49 ` Kevin Jamieson
[not found] ` <54241.24.80.224.145.1222383385.squirrel@squirrel.kevinjamieson.com>
[not found] ` <20080926012704.GI27997@disturbed>
[not found] ` <62255.192.168.1.1.1222403942.squirrel@squirrel.kevinjamieson.com>
2008-09-26 10:16 ` Dave Chinner
-- strict thread matches above, loose matches on Subject: below --
2008-02-25 20:58 Wolfgang Karall