From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S263024AbTLDGmP (ORCPT ); Thu, 4 Dec 2003 01:42:15 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S263135AbTLDGmP (ORCPT ); Thu, 4 Dec 2003 01:42:15 -0500 Received: from mtvcafw.SGI.COM ([192.48.171.6]:36019 "EHLO zok.sgi.com") by vger.kernel.org with ESMTP id S263024AbTLDGmN (ORCPT ); Thu, 4 Dec 2003 01:42:13 -0500 Date: Thu, 4 Dec 2003 17:39:21 +1100 From: Nathan Scott To: Neil Brown Cc: Linus Torvalds , Jens Axboe , David =?iso-8859-1?Q?Mart=EDnez?= Moreno , Kernel Mailing List , clubinfo.servers@adi.uam.es, Ingo Molnar Subject: Re: Errors and later panics in 2.6.0-test11. Message-ID: <20031204063921.GD1967@frodo> References: <200312031417.18462.ender@debian.org> <20031203162045.GA27964@suse.de> <16334.17126.154718.614827@notabene.cse.unsw.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <16334.17126.154718.614827@notabene.cse.unsw.edu.au> User-Agent: Mutt/1.5.3i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 04, 2003 at 07:09:10AM +1100, Neil Brown wrote: > On Wednesday December 3, torvalds@osdl.org wrote: > > > > > > On Wed, 3 Dec 2003, Jens Axboe wrote: > > > > > > > > Interesting. Another RAID 0 problem report.. > > > > > > Hmm did _all_ reports include raid-0, or just "some" raid? I'm looking > > > at the bio_pair stuff which raid-0 is the only user of, something looks > > > fishy there. > > > > The ones I've seen seem to be raid-0, but Nathan (nathans@sgi.com) > > reported problems in RAID-5 under load. I didn't decode the full oops on > > that one, but it really looked like a stale "bi" bio that trapped on the > > PAGE_ALLOC debug code. > > > > Nathan's had a second oops that turned out to be a bi_next pointer > being bad in a bio that raid5 had just about finished writing out. > So there does seem to be something wrong with bio handling, quite > possibly in raid5. > > The only thing I could find was that if raid5 received two overlapping > bios concurrently (or atleast received the second before it had > finished with the first) it could get confused. I've asked Nathan to > try a patch that BUGs when that happens. I haven't tripped the bug so far today, although have been running with page-sized fs blocksize so far - perhaps that is implicated, and makes it less likely to trigger (when I say "the bug" there, I mean neither the panic, nor the new BUG_ON(); I'll revert back to smaller block sizes next). That error path bio_put issue you spotted in XFS, Neil, I think is a valid problem - I'm not sure that is reachable code in practice (possibly overly defensive XFS bio code), I'll go investigate that some more. cheers. -- Nathan