From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com
	(8.12.11.20060308/8.12.11/SuSE Linux 0.7) with ESMTP id n023OYdj020689
	for ; Thu, 1 Jan 2009 21:24:34 -0600
Received: from mail.sandeen.net (localhost [127.0.0.1]) by cuda.sgi.com
	(Spam Firewall) with ESMTP id E84F358C99
	for ; Thu, 1 Jan 2009 19:24:32 -0800 (PST)
Received: from mail.sandeen.net (sandeen.net [209.173.210.139]) by
	cuda.sgi.com with ESMTP id 9iR9Q5toQcGjGnns
	for ; Thu, 01 Jan 2009 19:24:32 -0800 (PST)
Message-ID: <495D88EE.2040406@sandeen.net>
Date: Thu, 01 Jan 2009 21:24:30 -0600
From: Eric Sandeen 
MIME-Version: 1.0
Subject: Re: Corruption of in-memory data detected
References: <169670ec0901011846q1d370e6cu31514519afc8295d@mail.gmail.com>
In-Reply-To: <169670ec0901011846q1d370e6cu31514519afc8295d@mail.gmail.com>
List-Id: XFS Filesystem from SGI 
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: xfs-bounces@oss.sgi.com
Errors-To: xfs-bounces@oss.sgi.com
To: Thomas Gutzler 
Cc: xfs@oss.sgi.com

Thomas Gutzler wrote:
> Hi,
>
> I've been running an 8x500G hardware SATA RAID5 on an adaptec 31605
> controller for a while. The operating system is ubuntu feisty with the
> 2.6.22-16-server kernel. Recently, I added a disk. After the array
> rebuild was completed, I kept getting errors from the xfs module such
> as this one:
> Dec 30 22:55:39 io kernel: [21844.939832] Filesystem "sda":
> xfs_iflush: Bad inode 1610669723 magic number 0xec9d, ptr 0xe523eb00
> Dec 30 22:55:39 io kernel: [21844.939879] xfs_force_shutdown(sda,0x8)
> called from line 3277 of file
> /build/buildd/linux-source-2.6.22-2.6.22/fs/xfs/xfs_inode.c. Return
> address = 0xf8af263c
> Dec 30 22:55:39 io kernel: [21844.939885] Filesystem "sda": Corruption
> of in-memory data detected.
> Shutting down filesystem: sda
>
> My first thought was to run memcheck on the machine, which completed
> several passes without error; the raid controller doesn't report any
> SMART failures either.

Both good ideas, but note that "Corruption of in-memory data detected"
doesn't necessarily mean bad memory (though it might, so memcheck was
prudent). 0xec9d is not the correct magic number for an on-disk inode,
so that's why things went south.

Were there no storage-related errors prior to this?

> After an xfs_repair, which fixed a few things,

Knowing which things were fixed might lend some clues ...

> I mounted the file
> system but the error kept reappearing after a few hours unless I
> mounted read-only. Since xfs_ncheck -i always exited with 'Out of
> memory'

xfs_check takes a ton of memory; xfs_repair much less so.

> I decided to reduce the max amount of inodes to 1% (156237488)
> by running xfs_growfs -m 1 - the total amount of inodes used is still
> less than 1%. Unfortunately, both xfs_check and xfs_ncheck still say
> 'out of memory' with 2GB installed.

The max inodes really have no bearing on check or repair memory usage;
it's just an upper limit on how many inodes *could* be created.
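As an aside on why the kernel flagged that inode: every on-disk XFS inode core begins with a 16-bit big-endian magic number, XFS_DINODE_MAGIC, which is 0x494e (the ASCII bytes "IN"). The 0xec9d in the log cannot be a valid inode, which is exactly what xfs_iflush tripped over. A minimal Python sketch of that kind of check (illustrative only -- the helper name `inode_magic_ok` is made up; the real validation lives inside the kernel, not in a userspace tool):

```python
import struct

# XFS_DINODE_MAGIC is 0x494e, the ASCII bytes "IN" (from the XFS on-disk
# format headers). Anything else at offset 0 of an inode buffer, such as
# the 0xec9d reported in the log above, means the buffer is corrupt.
XFS_DINODE_MAGIC = 0x494E

def inode_magic_ok(buf: bytes) -> bool:
    """Return True if buf starts with a valid big-endian on-disk inode magic."""
    (magic,) = struct.unpack_from(">H", buf, 0)
    return magic == XFS_DINODE_MAGIC

# A healthy (pre-v5, 96-byte) inode core starts with b"IN"; the bad one didn't.
print(inode_magic_ok(b"IN" + bytes(94)))                 # True
print(inode_magic_ok(bytes([0xEC, 0x9D]) + bytes(94)))   # False
```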
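For reference, the 0x8 passed to xfs_force_shutdown() in the log above is the "corrupt in-core data" shutdown reason bit (SHUTDOWN_CORRUPT_INCORE in the kernel's fs/xfs/xfs_mount.h), which is what produces the "Corruption of in-memory data detected" message. A small sketch decoding that flags argument (the flag values are taken from the kernel headers as I recall them; the `decode_shutdown_flags` helper itself is hypothetical, for illustration only):

```python
# XFS shutdown reason bits, as defined (to the best of my knowledge) in
# the kernel's fs/xfs/xfs_mount.h. The decoder below is a made-up helper
# to show how the 0x8 in the log maps to a reason.
SHUTDOWN_FLAGS = {
    0x0001: "SHUTDOWN_META_IO_ERROR",   # metadata I/O error
    0x0002: "SHUTDOWN_LOG_IO_ERROR",    # log I/O error
    0x0004: "SHUTDOWN_FORCE_UMOUNT",    # shutdown from a forced unmount
    0x0008: "SHUTDOWN_CORRUPT_INCORE",  # corrupt in-memory data structures
}

def decode_shutdown_flags(flags: int) -> list:
    """Return the names of all shutdown reason bits set in flags."""
    return [name for bit, name in SHUTDOWN_FLAGS.items() if flags & bit]

print(decode_shutdown_flags(0x8))  # ['SHUTDOWN_CORRUPT_INCORE']
```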
> After the modification, the file system survived for a day until the
> following happened:
> Jan 2 09:33:29 io kernel: [232751.699812] BUG: unable to handle
> kernel paging request at virtual address 0003fffb
> Jan 2 09:33:29 io kernel: [232751.699848] printing eip:
> Jan 2 09:33:29 io kernel: [232751.699863] c017d872
> Jan 2 09:33:29 io kernel: [232751.699865] *pdpt = 000000003711e001
> Jan 2 09:33:29 io kernel: [232751.699881] *pde = 0000000000000000
> Jan 2 09:33:29 io kernel: [232751.699898] Oops: 0002 [#1]
> Jan 2 09:33:29 io kernel: [232751.699913] SMP
> Jan 2 09:33:29 io kernel: [232751.699931] Modules linked in: nfs nfsd
> exportfs lockd sunrpc xt_tcpudp nf_conntrack_ipv4 xt_state
> nf_conntrack nfnetlink iptable_filter ip_tables x_tables ipv6 ext2
> mbcache coretemp w83627ehf i2c_isa i2c_core acpi_cpufreq
> cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand
> freq_table cpufreq_conservative psmouse serio_raw pcspkr shpchp
> pci_hotplug evdev intel_agp agpgart xfs sr_mod cdrom pata_jmicron
> ata_piix sg sd_mod ata_generic ohci1394 ieee1394 ahci libata e1000
> aacraid scsi_mod uhci_hcd ehci_hcd usbcore thermal processor fan fuse
> apparmor commoncap
> Jan 2 09:33:29 io kernel: [232751.700180] CPU: 1
> Jan 2 09:33:29 io kernel: [232751.700181] EIP:
> 0060:[__slab_free+50/672] Not tainted VLI
> Jan 2 09:33:29 io kernel: [232751.700182] EFLAGS: 00010046
> (2.6.22-16-server #1)
> Jan 2 09:33:29 io kernel: [232751.700234] EIP is at __slab_free+0x32/0x2a0

Memory corruption perhaps?
> Jan 2 09:33:29 io kernel: [232751.700252] eax: 0000ffff ebx:
> ffffffff ecx: ffffffff edx: 000014aa
> Jan 2 09:33:29 io kernel: [232751.700273] esi: c17fffe0 edi:
> e6b8e0c0 ebp: f8ac2c8c esp: c21dfe44
> Jan 2 09:33:29 io kernel: [232751.700293] ds: 007b es: 007b fs:
> 00d8 gs: 0000 ss: 0068
> Jan 2 09:33:29 io kernel: [232751.700313] Process kswapd0 (pid: 198,
> ti=c21de000 task=c21f39f0 task.ti=c21de000)
> Jan 2 09:33:29 io kernel: [232751.700334] Stack: 00000000 00000065
> 00000000 fffffffe ffffffff c17fffe0 00000287 e6b8e0c0
> Jan 2 09:33:29 io kernel: [232751.700378] 00000001 c017e3fe
> f8ac2c8c cecb7d20 00000001 df2e2600 f8ac2c8c df2e2600
> Jan 2 09:33:29 io kernel: [232751.700422] f8d7559c e8247900
> f8ac5224 df2e2600 f8d7559c e8247900 f8ae1606 00000001
> Jan 2 09:33:29 io kernel: [232751.700466] Call Trace:
> Jan 2 09:33:29 io kernel: [232751.700499] [kfree+126/192] kfree+0x7e/0xc0
> Jan 2 09:33:29 io kernel: [232751.700519] []
> xfs_idestroy_fork+0x2c/0xf0 [xfs]
> Jan 2 09:33:29 io kernel: [232751.700561] []
> xfs_idestroy_fork+0x2c/0xf0 [xfs]
> Jan 2 09:33:29 io kernel: [232751.700601] []
> xfs_idestroy+0x44/0xb0 [xfs]
> Jan 2 09:33:29 io kernel: [232751.700640] []
> xfs_finish_reclaim+0x36/0x160 [xfs]
> Jan 2 09:33:29 io kernel: [232751.700681] []
> xfs_fs_clear_inode+0x97/0xc0 [xfs]
> Jan 2 09:33:29 io kernel: [232751.700721] [clear_inode+143/320]
> clear_inode+0x8f/0x140
> Jan 2 09:33:29 io kernel: [232751.700743] [dispose_list+26/224]
> dispose_list+0x1a/0xe0
> Jan 2 09:33:29 io kernel: [232751.700765]
> [shrink_icache_memory+379/592] shrink_icache_memory+0x17b/0x250
> Jan 2 09:33:29 io kernel: [232751.700789] [shrink_slab+279/368]
> shrink_slab+0x117/0x170
> Jan 2 09:33:29 io kernel: [232751.700815] [kswapd+859/1136] kswapd+0x35b/0x470
> Jan 2 09:33:29 io kernel: [232751.700842]
> [autoremove_wake_function+0/80] autoremove_wake_function+0x0/0x50
> Jan 2 09:33:29 io kernel: [232751.700867] [kswapd+0/1136] kswapd+0x0/0x470
> Jan 2 09:33:29 io kernel:
> [232751.700886] [kthread+66/112] kthread+0x42/0x70
> Jan 2 09:33:29 io kernel: [232751.700904] [kthread+0/112] kthread+0x0/0x70
> Jan 2 09:33:29 io kernel: [232751.700923]
> [kernel_thread_helper+7/28] kernel_thread_helper+0x7/0x1c
> Jan 2 09:33:29 io kernel: [232751.700946] =======================
> Jan 2 09:33:29 io kernel: [232751.700962] Code: 53 89 cb 83 ec 14 8b
> 6c 24 28 f0 0f ba 2e 00 19 c0 85 c0 74 0a 8b 06 a8 01 74 ef f3 90 eb
> f6 f6 06 02 75 48 0f b7 46 0a 8b 56 14 <89> 14 83 0f b7 46 08 89 5e 14
> 83 e8 01 f6 06 40 66 89 46 08 75
> Jan 2 09:33:29 io kernel: [232751.701128] EIP: [__slab_free+50/672]
> __slab_free+0x32/0x2a0 SS:ESP 0068:c21dfe44
>
> Any thoughts what this could be or what could be done to fix it?

Seems like maybe something went wrong w/ the raid rebuild, if that's
when things started going south. Do you get any storage error related
messages at all?

Ubuntu knows best what's in this oldish distro kernel, I guess; I don't
know offhand what might be going wrong. If they have a debug kernel
variant, you could run that to see if you get earlier indications of
problems.

If you can reproduce on a more recent upstream kernel, that would be
interesting.

-Eric

> Cheers,
> Tom

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs