From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932281AbWGGVPM (ORCPT ); Fri, 7 Jul 2006 17:15:12 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932288AbWGGVPM (ORCPT ); Fri, 7 Jul 2006 17:15:12 -0400 Received: from tornado.reub.net ([202.89.145.182]:6044 "EHLO tornado.reub.net") by vger.kernel.org with ESMTP id S932281AbWGGVPK (ORCPT ); Fri, 7 Jul 2006 17:15:10 -0400 Message-ID: <44AECEDD.201@reub.net> Date: Sat, 08 Jul 2006 09:15:09 +1200 From: Reuben Farrelly User-Agent: Thunderbird 3.0a1 (Windows/20060707) MIME-Version: 1.0 To: Andrew Morton CC: linux-kernel@vger.kernel.org Subject: Re: 2.6.17-mm6 References: <20060703030355.420c7155.akpm@osdl.org> <44AE268F.7080409@reub.net> <20060707023518.f621bcf2.akpm@osdl.org> In-Reply-To: <20060707023518.f621bcf2.akpm@osdl.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On 7/07/2006 9:35 p.m., Andrew Morton wrote: > On Fri, 07 Jul 2006 21:17:03 +1200 > Reuben Farrelly wrote: > >> >> On 3/07/2006 10:03 p.m., Andrew Morton wrote: >>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17/2.6.17-mm6/ >>> >>> >>> - A major update to the e1000 driver. >>> >>> - 1394 updates >> This release has been working quite well after some initial hiccups (see -mm >> hotfix), but a couple of hours ago another bad thing (tm) happened. >> >> At the time I was moving 100G of data from one partition on a non-raid partition >> to a new RAID-1 md that I had created earlier. Both are ext3 partitions. >> >> The word "barrier" of course comes to mind again, I'm not sure NeilB is the >> culprit this time either but I've cc'd him in just in case. >> >> The file copy went on happily for quite a while (maybe 10 mins or so) under very >> high IO load before blowing up as below. The terminal was spewing out constant >> traces but hopefully the right ones are here as these are the first few (if not, >> I have copied a bit more). >> > > Yes, the very first oops is by far the most important to capture. > > You can add pause_on_oops=100000 to the kernel boot command line to make > the machine freeze after outputting the first oops. That'll certainly > prevent the oops messages from getting to the log files, but it will also > prevent it from scolling away or swamping your log device. > > >> sh-3.1# mv /store-old/* /store/ >> Unable to handle kernel paging request at ffff81043e345490 RIP: >> [] memcpy+0x12/0xb0 >> PGD 8063 PUD 0 >> Oops: 0000 [1] SMP >> last sysfs file: /kernel/uevent_seqnum >> CPU 1 >> Modules linked in: binfmt_misc ide_cd iTCO_wdt i2c_i801 cdrom serio_raw ide_disk >> Pid: 165, comm: pdflush Not tainted 2.6.17-mm6 #2 >> RIP: 0010:[] [] memcpy+0x12/0xb0 >> RSP: 0018:ffff81003ed31828 EFLAGS: 00010002 >> RAX: ffff810001faec18 RBX: 0000000000000010 RCX: 0000000000000001 >> RDX: 0000000000000080 RSI: ffff81043e345490 RDI: ffff810001faec18 >> RBP: ffff81003ed31898 R08: ffff81003f66a800 R09: 0000000000000000 >> R10: ffff810029b364b0 R11: 0000000000000000 R12: ffff810001faec00 >> R13: ffff81003f6ff140 R14: ffff81003f6ea240 R15: 0000000000000010 >> FS: 0000000000000000(0000) GS:ffff810037ffe440(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b >> CR2: ffff81043e345490 CR3: 000000003dc32000 CR4: 00000000000006e0 >> Process pdflush (pid: 165, threadinfo ffff81003ed30000, task ffff810037f2b840) >> Stack: 0000000000000010 ffffffff8025e9f0 ffff81003f66a800 0001120000000000 >> 0000000000011200 000112003f650080 ffff81003eca39f0 ffff81003ed318c8 >> ffff81003ed31898 0000000000000246 ffff81003f6ff140 0000000000011200 >> Call Trace: >> [] cache_alloc_refill+0xc9/0x538 >> [] __kmalloc+0x86/0x96 >> [] __kzalloc+0xf/0x2f >> [] r1bio_pool_alloc+0x21/0x3a >> [] mempool_alloc+0x44/0xfb > > The core slab data structures were wrecked. For kmalloc(), no less. > Something secretly destroyed your kernel, and it could be anything. Nice. Having now turned on slab debugging, is it possibly related to this message which appeared in my log when I booting up earlier? Jul 8 02:49:39 tornado kernel: EXT3-fs: mounted filesystem with ordered data mode. Jul 8 02:49:39 tornado kernel: Adding 497972k swap on /dev/sdc9. Priority:-1 extents:1 across:497972k Jul 8 02:49:40 tornado kernel: ip_tables: (C) 2000-2006 Netfilter Core Team Jul 8 02:49:40 tornado kernel: Netfilter messages via NETLINK v0.30. Jul 8 02:49:40 tornado kernel: ip_conntrack version 2.4 (4060 buckets, 32480 max) - 288 bytes per conntrack Jul 8 02:49:40 tornado kernel: Slab corruption: start=ffff81003efd7000, len=4096 Jul 8 02:49:40 tornado kernel: 170: ff ff ff ff 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b Jul 8 02:49:40 tornado kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex Jul 8 02:49:40 tornado kernel: GRE over IPv4 tunneling driver reuben