* kernel oops on debian, 2.6.18-5, large xfs volume
From: lxh @ 2008-01-25 7:16 UTC (permalink / raw)
To: xfs
Hello,
We have dozens of file servers, each with a 1.5 TB or 2.5 TB XFS volume running on a RAID6 SATA array. Each volume holds about 10,000,000 files. The operating system is Debian GNU/Linux, kernel 2.6.18-5-amd64 #1 SMP. We have been getting this kernel oops frequently over the past year.
Here is the oops:
Filesystem "cciss/c0d1": XFS internal error xfs_trans_cancel at line 1138
of file fs/xfs/xfs_trans.c. Caller 0xffffffff881df006
Call Trace:
[<ffffffff881fed18>] :xfs:xfs_trans_cancel+0x5b/0xfe
[<ffffffff88207006>] :xfs:xfs_create+0x58b/0x5dd
[<ffffffff8820f496>] :xfs:xfs_vn_mknod+0x1bd/0x3c8
[<ffffffff8027d27d>] default_wake_function+0x0/0xe
[<ffffffff802200e5>] __up_read+0x13/0x8a
[<ffffffff881eb682>] :xfs:xfs_iunlock+0x57/0x79
[<ffffffff88204180>] :xfs:xfs_lookup+0x6c/0x7d
[<ffffffff802200e5>] __up_read+0x13/0x8a
[<ffffffff881eb682>] :xfs:xfs_iunlock+0x57/0x79
[<ffffffff882041ce>] :xfs:xfs_access+0x3d/0x46
[<ffffffff8820fa4b>] :xfs:xfs_vn_permission+0x14/0x18
[<ffffffff8020cc7d>] permission+0x87/0xce
[<ffffffff80208f26>] __link_path_walk+0x16a/0xf3c
[<ffffffff8022ae52>] mntput_no_expire+0x19/0x8b
[<ffffffff8020dd5f>] link_path_walk+0xd3/0xe5
[<ffffffff802381ed>] vfs_create+0xe7/0x12c
[<ffffffff80218efb>] open_namei+0x18d/0x69c
[<ffffffff802252f1>] do_filp_open+0x1c/0x3d
[<ffffffff80217baa>] do_sys_open+0x44/0xc5
[<ffffffff802584d6>] system_call+0x7e/0x83
Every time the error occurs, the volume cannot be accessed, so we have to unmount it, run xfs_repair, and then remount it. This problem seriously impacts our service.
Could you help me resolve it?
Luo xiaohua
lxhzju@163.com
2008-01-25
* Re: kernel oops on debian, 2.6.18-5, large xfs volume
From: David Chinner @ 2008-01-25 8:01 UTC (permalink / raw)
To: lxh; +Cc: xfs
On Fri, Jan 25, 2008 at 03:16:36PM +0800, lxh wrote:
> Hello,
> We have dozens of file servers, each with a 1.5 TB or 2.5 TB XFS
> volume running on a RAID6 SATA array. Each volume holds about
> 10,000,000 files. The operating system is Debian GNU/Linux, kernel
> 2.6.18-5-amd64 #1 SMP. We have been getting this kernel oops
> frequently over the past year.
>
> here is the oops :
> Filesystem "cciss/c0d1": XFS internal error xfs_trans_cancel at line 1138
> of file fs/xfs/xfs_trans.c. Caller 0xffffffff881df006
> Call Trace:
> [<ffffffff881fed18>] :xfs:xfs_trans_cancel+0x5b/0xfe
> [<ffffffff88207006>] :xfs:xfs_create+0x58b/0x5dd
> [<ffffffff8820f496>] :xfs:xfs_vn_mknod+0x1bd/0x3c8
Are you running out of space in the filesystem?
The only vectors I've seen that can cause this are I/O errors
or ENOSPC during file create after we've already checked that
this cannot happen. Are there any I/O errors in the log?
This commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=45c34141126a89da07197d5b89c04c6847f1171a
which is in 2.6.23 fixed the last known cause of the ENOSPC
issue, so upgrading the kernel or patching this fix back
to the 2.6.18 kernel may fix the problem if it is related to
ENOSPC.
> Every time the error occurs, the volume cannot be accessed, so we have
> to unmount it, run xfs_repair, and then remount it. This problem
> seriously impacts our service.
Anyway, next time it happens, can you please run xfs_check on the
filesystem first and post the output? If there is no output, then
the filesystem is fine and you don't need to run repair.
If it is not fine, can you also post the output of xfs_repair?
Once the filesystem has been fixed up, can you then post the
output of this command to tell us the space usage in the filesystems?
# xfs_db -r -c 'sb 0' -c p <dev>
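[Editorial aside: Dave's check-then-repair sequence can be sketched as a small script. The device node and mount point below are placeholders for this thread's setup, not values from the original mail; the DRYRUN=echo prefix makes the script only print the commands, so clear it to actually execute them.]

```shell
#!/bin/sh
# Placeholder paths: substitute your actual device node and mount point.
DEV=/dev/cciss/c0d1
MNT=/mnt/data
DRYRUN=echo            # set DRYRUN= (empty) to actually run the commands

$DRYRUN umount "$MNT"
$DRYRUN xfs_check "$DEV"                   # no output means the fs is fine
$DRYRUN xfs_repair "$DEV"                  # only if xfs_check found problems
$DRYRUN mount "$DEV" "$MNT"
$DRYRUN xfs_db -r -c 'sb 0' -c p "$DEV"    # superblock dump for space usage
```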
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: Re: kernel oops on debian, 2.6.18-5, large xfs volume
From: lxh @ 2008-02-21 7:34 UTC (permalink / raw)
To: David Chinner; +Cc: xfs
Hello,
Yesterday, this failure happened again on a file server with a 1.5 TB XFS volume running on a RAID6 SATA array. Here is the kernel oops:
Filesystem "cciss/c0d2": XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. Caller 0xffffffff881df006
Call Trace:
[<ffffffff881fed18>] :xfs:xfs_trans_cancel+0x5b/0xfe
[<ffffffff88207006>] :xfs:xfs_create+0x58b/0x5dd
[<ffffffff8820f496>] :xfs:xfs_vn_mknod+0x1bd/0x3c8
Then we unmounted the volume and ran xfs_check, which told us to replay the log first, i.e. to mount the volume and then unmount it again. But we found that the volume could not be remounted, so we restarted the server. Here is the log replay:
Feb 20 20:31:44 fs-10 kernel: Filesystem "cciss/c0d2": Disabling barriers, not supported by the underlying device
Feb 20 20:31:44 fs-10 kernel: XFS mounting filesystem cciss/c0d2
Feb 20 20:31:44 fs-10 kernel: Starting XFS recovery on filesystem: cciss/c0d2 (logdev: internal)
Feb 20 20:31:44 fs-10 kernel: Ending XFS recovery on filesystem: cciss/c0d2 (logdev: internal)
After that we mounted the volume successfully, and xfs_check found nothing. The output of the xfs_db command was:
magicnum = 0x58465342
blocksize = 4096
dblocks = 406961280
rblocks = 0
rextents = 0
uuid = 435ae4a0-0143-447a-aadb-7c25f8cace17
logstart = 268435460
rootino = 128
rbmino = 129
rsumino = 130
rextsize = 16
agblocks = 12717540
agcount = 32
rbmblocks = 0
logblocks = 32768
versionnum = 0x3084
sectsize = 512
inodesize = 256
inopblock = 16
fname = "fs2\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 24
rextslog = 0
inprogress = 0
imax_pct = 25
icount = 14902784
ifree = 531
fdblocks = 4886264
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 0
features2 = 0
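[Editorial note: the blocksize, dblocks, and fdblocks fields in the dump above are enough to estimate how full this volume was, which bears directly on Dave's ENOSPC question. A quick back-of-the-envelope check:]

```python
# Values copied from the xfs_db superblock dump above.
blocksize = 4096          # bytes per filesystem block
dblocks   = 406961280     # total data blocks in the filesystem
fdblocks  = 4886264       # free data blocks

free_bytes = fdblocks * blocksize
free_pct   = 100.0 * fdblocks / dblocks

# prints: free: 18.6 GiB (1.2% of the volume)
print(f"free: {free_bytes / 2**30:.1f} GiB ({free_pct:.1f}% of the volume)")
```

So at the time of the dump the volume was roughly 99% full, i.e. close to the ENOSPC territory Dave asked about.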
======= 2008-01-25 16:01:54 =======
>On Fri, Jan 25, 2008 at 03:16:36PM +0800, lxh wrote:
>> Hello,
>> We have dozens of file servers, each with a 1.5 TB or 2.5 TB XFS
>> volume running on a RAID6 SATA array. Each volume holds about
>> 10,000,000 files. The operating system is Debian GNU/Linux, kernel
>> 2.6.18-5-amd64 #1 SMP. We have been getting this kernel oops
>> frequently over the past year.
>>
>> here is the oops :
>> Filesystem "cciss/c0d1": XFS internal error xfs_trans_cancel at line 1138
>> of file fs/xfs/xfs_trans.c. Caller 0xffffffff881df006
>> Call Trace:
>> [<ffffffff881fed18>] :xfs:xfs_trans_cancel+0x5b/0xfe
>> [<ffffffff88207006>] :xfs:xfs_create+0x58b/0x5dd
>> [<ffffffff8820f496>] :xfs:xfs_vn_mknod+0x1bd/0x3c8
>
>Are you running out of space in the filesystem?
>
>The only vectors I've seen that can cause this are I/O errors
>or ENOSPC during file create after we've already checked that
>this cannot happen. Are there any I/O errors in the log?
>
>This commit:
>
>http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=45c34141126a89da07197d5b89c04c6847f1171a
>
>which is in 2.6.23 fixed the last known cause of the ENOSPC
>issue, so upgrading the kernel or patching this fix back
>to the 2.6.18 kernel may fix the problem if it is related to
>ENOSPC.
>
>> Every time the error occurs, the volume cannot be accessed, so we have
>> to unmount it, run xfs_repair, and then remount it. This problem
>> seriously impacts our service.
>
>Anyway, next time it happens, can you please run xfs_check on the
>filesystem first and post the output? If there is no output, then
>the filesystem is fine and you don't need to run repair.
>
>If it is not fine, can you also post the output of xfs_repair?
>
>Once the filesystem has been fixed up, can you then post the
>output of this command to tell us the space usage in the filesystems?
>
># xfs_db -r -c 'sb 0' -c p <dev>
>
>Cheers,
>
>Dave.
>--
>Dave Chinner
>Principal Engineer
>SGI Australian Software Group
= = = = = = = = = = = = = = = = = = = =
Cheers,
Luo xiaohua
lxhzju@163.com
2008-02-21
* Re: Re: kernel oops on debian, 2.6.18-5, large xfs volume
From: lxh @ 2008-01-25 9:41 UTC (permalink / raw)
To: xfs
Hi,
======= 2008-01-25 16:01:54 =======
>On Fri, Jan 25, 2008 at 03:16:36PM +0800, lxh wrote:
>> Hello,
>> We have dozens of file servers, each with a 1.5 TB or 2.5 TB XFS
>> volume running on a RAID6 SATA array. Each volume holds about
>> 10,000,000 files. The operating system is Debian GNU/Linux, kernel
>> 2.6.18-5-amd64 #1 SMP. We have been getting this kernel oops
>> frequently over the past year.
>>
>> here is the oops :
>> Filesystem "cciss/c0d1": XFS internal error xfs_trans_cancel at line 1138
>> of file fs/xfs/xfs_trans.c. Caller 0xffffffff881df006
>> Call Trace:
>> [<ffffffff881fed18>] :xfs:xfs_trans_cancel+0x5b/0xfe
>> [<ffffffff88207006>] :xfs:xfs_create+0x58b/0x5dd
>> [<ffffffff8820f496>] :xfs:xfs_vn_mknod+0x1bd/0x3c8
>
>Are you running out of space in the filesystem?
We did not run out of space; there is enough room for writing.
>
>The only vectors I've seen that can cause this are I/O errors
>or ENOSPC during file create after we've already checked that
>this cannot happen. Are there any I/O errors in the log?
>
When we ran xfs_repair, it reported nothing unusual.
I suspect this problem is related to the combination of a large volume and a mass of small files: some servers with the same hardware and software, but configured with 1 TB volumes storing large files, have never shown this problem.
>This commit:
>
>http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=45c34141126a89da07197d5b89c04c6847f1171a
>
>which is in 2.6.23 fixed the last known cause of the ENOSPC
>issue, so upgrading the kernel or patching this fix back
>to the 2.6.18 kernel may fix the problem if it is related to
>ENOSPC.
Thank you very much for your help! I will try this patch on some machines.
>
>> Every time the error occurs, the volume cannot be accessed, so we have
>> to unmount it, run xfs_repair, and then remount it. This problem
>> seriously impacts our service.
>
>Anyway, next time it happens, can you please run xfs_check on the
>filesystem first and post the output? If there is no output, then
>the filesystem is fine and you don't need to run repair.
The volume is unusable when this happens, so we run xfs_repair. It reports nothing unusual, but afterwards we can access the volume again. I don't know why.
>
>If it is not fine, can you also post the output of xfs_repair?
>
>Once the filesystem has been fixed up, can you then post the
>output of this command to tell us the space usage in the filesystems?
>
># xfs_db -r -c 'sb 0' -c p <dev>
I will follow your suggestions the next time it happens, and report back.
>
>Cheers,
>
>Dave.
>--
>Dave Chinner
>Principal Engineer
>SGI Australian Software Group
= = = = = = = = = = = = = = = = = = = =
Cheers,
Luoxiaohua
NetEase.com Inc
lxhzju@163.com
2008-01-25