Date: Fri, 25 Jan 2008 17:41:05 +0800
From: "lxh"
Reply-To: lxhzju@163.com
Subject: Re: Re: kernel oops on debian, 2.6.18-5, large xfs volume
Message-ID: <200801251741035934497@163.com>
List-Id: xfs
To: xfs

Hi,

======= 2008-01-25 16:01:54 =======

>On Fri, Jan 25, 2008 at 03:16:36PM +0800, lxh wrote:
>> Hello,
>> we have dozens of file servers with a 1.5TB/2.5 TB large xfs file system
>> volume running on a RAID6 SATA array. Each volume contains about
>> 10,000,000 files. The Operating system is debian GNU/Linux 2.6.18-5-amd64
>> #1 SMP. we got a kernel oops frequently last year.
>>
>> here is the oops :
>> Filesystem "cciss/c0d1": XFS internal error xfs_trans_cancel at line 1138
>> of file fs/xfs/xfs_trans.c. Caller 0xffffffff881df006
>> Call Trace:
>> [] :xfs:xfs_trans_cancel+0x5b/0xfe
>> [] :xfs:xfs_create+0x58b/0x5dd
>> [] :xfs:xfs_vn_mknod+0x1bd/0x3c8
>
>Are you running out of space in the filesystem?

We did not run out of space. There is enough free space for writing.

>
>The only vectors I've seen that can cause this are I/O errors
>or ENOSPC during file create after we've already checked that
>this cannot happen. Are there any I/O errors in the log?
>

When we run xfs_repair, it reports nothing unusual. I suspect the problem is
related to the large volume size combined with a huge number of small files.
Some of our servers have the same hardware and software but are configured
with 1TB volumes and store large files; the problem has never occurred on them.

>This commit:
>
>http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=45c34141126a89da07197d5b89c04c6847f1171a
>
>which is in 2.6.23 fixed the last known cause of the ENOSPC
>issue, so upgrading the kernel or patching this fix back
>to the 2.6.18 kernel may fix the problem if it is related to
>ENOSPC.

Thank you very much for your help! I will try this patch on some machines.

>
>> Every time the error occurs, the volume can not be accessed. So we have to
>> umount this volume, run xfs_repair, and then remount it. This problem
>> causes seriously impact of our service.
>
>Anyway, next time it happens, can you please run xfs_check on the
>filesystem first and post the output? If there is no output, then
>the filesystem is fine and you don't need to run repair.

The volume is unusable when it happens, so we run xfs_repair. xfs_repair
reports nothing unusual, but afterwards we can access the volume again.
I don't know why.

>
>If it is not fine, can also post the output of xfs_repair?
>
>Once the filesystem has been fixed up, can you then post the
>output of this command to tell us the space usage in the filesystems?
>
># xfs_db -r -c 'sb 0' -c p

I will follow your suggestions the next time it happens and then get back
to you.

>
>Cheers,
>
>Dave.
>--
>Dave Chinner
>Principal Engineer
>SGI Australian Software Group

= = = = = = = = = = = = = = = = = = = =

Cheers,
Luoxiaohua
NetEase.com Inc
lxhzju@163.com
2008-01-25
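
For reference, the sequence Dave asks for above (unmount, run xfs_check, run xfs_repair only if xfs_check prints something, then dump superblock 0 with xfs_db) could be scripted roughly as below. This is only a sketch under assumptions that are not in the thread: the device path /dev/cciss/c0d1 is inferred from the 'Filesystem "cciss/c0d1"' line in the oops and may be wrong for a given server, and the DEV and DRY_RUN variables and the diagnose function are hypothetical names. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the diagnostic sequence suggested in the quoted reply.
# Assumptions (not from the thread): the device path, and the DEV/DRY_RUN
# variables and diagnose() function, which are illustrative names only.
DEV="${DEV:-/dev/cciss/c0d1}"   # inferred from the oops line; adjust per server
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands, do not run them

diagnose() {
    dev="$1"

    if [ "$DRY_RUN" = 1 ]; then
        # Print the plan without touching the filesystem.
        echo "umount $dev"
        echo "xfs_check $dev"
        echo "xfs_repair $dev    # only if xfs_check printed anything"
        echo "xfs_db -r -c 'sb 0' -c p $dev"
        return 0
    fi

    # The filesystem must be unmounted before checking it.
    umount "$dev" || return 1

    # Per the reply: no output from xfs_check means the filesystem is fine
    # and xfs_repair is not needed.
    check_out=$(xfs_check "$dev" 2>&1)
    if [ -n "$check_out" ]; then
        printf '%s\n' "$check_out"
        xfs_repair "$dev"
    fi

    # Read-only dump of superblock 0, which shows the space usage.
    xfs_db -r -c 'sb 0' -c p "$dev"
}

diagnose "$DEV"
```

Running it with DRY_RUN=0 as root would execute the commands for real; the output of each step is what should be posted back to the list.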