From: Jose Manuel dos Santos Calhariz <jose.spam@netvisao.pt>
To: NeilBrown <neilb@suse.de>
Cc: Christian Balzer <chibi@gol.com>,
linux-raid@vger.kernel.org,
Jose Manuel dos Santos Calhariz <jose.spam@netvisao.pt>
Subject: Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!
Date: Mon, 25 Jun 2012 11:59:04 +0100
Message-ID: <20120625105903.GA6005@calhariz.com>
In-Reply-To: <20120625164230.2ba8f72c@notabene.brown>
On Mon, Jun 25, 2012 at 04:42:30PM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 11:58:33 +0900 Christian Balzer <chibi@gol.com> wrote:
>
> > On Mon, 25 Jun 2012 12:39:06 +1000 NeilBrown wrote:
> >
> > > On Sun, 24 Jun 2012 18:02:34 +0100 Jose Manuel dos Santos Calhariz
> > > <jose.spam@netvisao.pt> wrote:
> > >
> > > > On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> > > > > On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
> > > > > <jose.spam@netvisao.pt> wrote:
> > > > >
> > > > > >
> > > > > > In another day during the periodic mdadm RAID check:
> > > > > > - the linux kernel gave a kernel BUG,
> > > > > > - tried to kick out a failed disk and
> > > > > > - stopped accepting I/O to the affected raid.
> > > > > >
> > > > > > The affected programs were in state D. The only way to recover
> > > > > > was to do a reboot. After reboot the problematic disk was
> > > > > > replaced.
> > > > > >
> > > > > > > I reported the bug to Debian, and all the information about
> > > > > > > it is there:
> > > > > >
> > > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> > > > > >
> > > > > > I was asked to report the BUG here in case someone knows what
> > > > > > happened.
> > > > > >
> > > > > > Here is a summary of the more relevant information:
> > > > > >
> > > > > > > This machine has 2 x RAID6 with 6 disks each, for a total of 12
> > > > > > > disks.
> > > > > > >
> > > > > > > I have 5 systems with a similar setup and only one failed, maybe
> > > > > > > because of the failing disk. I will use one of the systems to try
> > > > > > > to reproduce the bug before trying a new kernel.
> > > > > >
> > > > > >
> > > > > > The proprietary module is the openafs filesystem v1.6.1 backported
> > > > > > from Debian testing.
> > > > > >
> > > > > > The kernel bug is:
> > > > > >
> > > > > >
> > > > > > build/source_i386_none/drivers/md/raid5.c:2764!
> > > >
> > > > >
> > > > > This bug was fixed in 2.6.32.49 and 3.2
> > > > >
> > > > > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
> > > > >
> > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
> > > > >
> > > > > NeilBrown
> > > >
> > > > The failing kernel already had that fix. The machine was running
> > > > the Debian kernel 2.6.32-41squeeze2. According to the changelog,
> > > > this kernel has all the fixes up to 2.6.32.51 plus other fixes.
> > > >
> > > > Jose Calhariz
> > > >
> > >
> > > The oops report said:
> > >
> > > (2.6.32-5-686 #1)
> > >
> > > is "5" the same as "41squeeze2" ??? This is a genuine question - I have
> > > little idea about Debian versioning so maybe these are the same thing
> > > somehow. But they look different.
> > >
> > Yes, the "name" of the kernel and its actual detailed version are disjunct
> > like that in Debian; the current kernel of that vintage is:
> > ---
> > Package: linux-image-2.6.32-5-amd64
> > Source: linux-2.6
> > Version: 2.6.32-44
> > ---
>
> Ok.
> So the version number reported by "uname -a" doesn't change when you upgrade
> a Debian kernel? That's rather sad.
> It means that one has to take the reporter's word for which kernel was
> running rather than looking in the oops message, where the kernel tells me
> what version it was.
>
> Given the report, it is entirely possible that an older kernel was running
> while a newer kernel was installed.
>
> Jose: how certain are you that the kernel that was running at the time was
> exactly the kernel that was installed at the time. i.e. you had not
> performed a software update since the last reboot?
Whenever I reboot a server I run a script to collect information about
it: kernel boot messages, kernel version, kernel modules, md raid
information, etc.

So I have the kernel boot messages for the precise boot that gave the
BUG. From that boot log:
[ 0.000000] Linux version 2.6.32-5-686 (Debian 2.6.32-41squeeze2)
(dannf@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon
Mar 26 05:20:33 UTC 2012
The version of the running kernel is 2.6.32-41squeeze2. The changelog
of the Debian package has this entry for version 2.6.32-41:

  * Add longterm release 2.6.32.54

The complete changelog, in case someone wants to look into it:
http://packages.debian.org/changelogs/pool/main/l/linux-2.6/linux-2.6_2.6.32-45/changelog
For the previous Debian version, 2.6.32-40, the changelog has this
entry:

  * Add longterm release 2.6.32.49, including:
    - SCSI: st: fix race in st_scsi_execute_end
    - NFS/sunrpc: don't use a credential with extra groups.
    - netlink: validate NLA_MSECS length
    - hfs: add sanity check for file name length (CVE-2011-4330)
    - md/raid5: abort any pending parity operations when array fails.
    - mm: avoid null pointer access in vm_struct via /proc/vmallocinfo
    - ipv6: udp: fix the wrong headroom check (CVE-2011-4326)
    - USB: Fix Corruption issue in USB ftdi driver ftdi_sio.c

The complete boot log is at:
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=15;filename=kernel-boot;att=1;bug=675969
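
On the version question: the Debian package version can be read from
the same banner line that appears in the boot log (on a running system
the banner is also in /proc/version). A minimal Python sketch, assuming
the banner keeps the "Linux version <ABI name> (Debian <package
version>)" layout seen in the log excerpt; the function name and the
regular expression are only illustrative, not part of any Debian tool:

```python
import re

# Assumed banner layout, taken from the boot log excerpt:
#   Linux version <ABI name> (Debian <package version>) ...
BANNER_RE = re.compile(r"Linux version (\S+) \(Debian ([^)\s]+)\)")


def parse_debian_banner(text):
    """Return (abi_name, package_version), or None if no banner is found."""
    match = BANNER_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


line = ("[ 0.000000] Linux version 2.6.32-5-686 "
        "(Debian 2.6.32-41squeeze2) (dannf@debian.org)")
print(parse_debian_banner(line))
# prints ('2.6.32-5-686', '2.6.32-41squeeze2')
```

The first field is the ABI name that "uname -r" reports; the second is
the package version that changes with every Debian kernel update.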
>
> However even if you can confirm that a new kernel was running I doubt I could
> find an answer. There isn't really much info to go on. So unless you can
> reproduce the problem, I doubt I'll even start looking.
I have a lot of information about the system that gave the BUG, but no
way to sort out what is relevant and what is not. Is there anything
more you would like to know?

I understand if you can't help me. I have 5 similar servers that have
been running 2.6.32.x for 3 months, and I have seen the BUG only once.
I have one server where I am trying to reproduce the BUG, so far to no
avail.

- Doing a re-sync of the RAID when there is a "read error corrected"
  message does not trigger the BUG.
- Hot-unplugging a disk does not trigger the BUG.
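
For anyone who wants to repeat this kind of test: a check pass like the
periodic one can be requested by hand through the md sysfs interface,
by writing "check" to the array's sync_action file. A minimal Python
sketch, assuming the array is md0 and the script runs as root; the
helper name and the sysfs_root parameter are made up for illustration:

```python
import os


def request_md_action(array="md0", action="check", sysfs_root="/sys"):
    """Write an action such as "check" to the array's sync_action file.

    Returns the path that was written. sysfs_root is a hypothetical
    parameter that exists only so the helper can be tried on a scratch
    directory.
    """
    path = os.path.join(sysfs_root, "block", array, "md", "sync_action")
    with open(path, "w") as handle:
        handle.write(action + "\n")
    return path


# Example (needs root and an existing md array):
# request_md_action("md0", "check")
```

Progress of the pass can then be followed in /proc/mdstat.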

My guess is that this bug is related to bad disks and the error
messages that such disks sometimes give to the kernel. But it is more
difficult to find disks that give these error messages in a
reproducible way than to find disks with bad sectors for the test
server.
>
> NeilBrown
--
Ambition: a supreme desire to be vilified by your enemies while you are alive and to be ridiculed by your friends when you are dead
--Ambrose Bierce