Date: Thu, 29 Mar 2012 14:45:10 -0700
From: "Ted Ts'o"
To: Dave Jones, Linus Torvalds, Wu Fengguang, Linux Kernel Mailing List
Subject: Re: lockups shortly after booting in current git.
Message-ID: <20120329214510.GD13970@thunk.org>
References: <20120329155542.GA31285@redhat.com> <20120329182632.GA6891@redhat.com> <20120329195354.GA11790@redhat.com> <20120329202619.GA14001@redhat.com> <20120329203926.GA13970@thunk.org> <20120329211244.GA18684@redhat.com>
In-Reply-To: <20120329211244.GA18684@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 29, 2012 at 05:12:44PM -0400, Dave Jones wrote:
>
> I'll try a build with just that reverted, given the bisect build is taking a while.
>

Something else you could try, without even having to do a rebuild, is
to mount the filesystem with the mount option nomblk_io_submit, which
avoids using any of the code in fs/ext4/page_io.c.  (This option
causes ext4 to send blocks to the block layer the old-fashioned way,
one 4k block at a time, and rely on the elevator code to coalesce the
write requests.)

> Any thoughts on any printk's I could add to verify a situation occurred or not ?
> The problem with bisecting a bug like this is that it's hard to tell if
> the bug has been fixed, or if I've just not hit it yet.

If it really is about the PageWriteback bit not getting cleared, not
really.  If you're willing to expand the struct page to include a
timestamp, we could use that to note pages which have been in
writeback for a long time, but that's obviously quite expensive.

But normally, when it's a PageWriteback stall, you get a soft lockup
warning, assuming that was compiled into the system.  And I didn't
see that in your trace, which is surprising given the symptoms you
described.  Was it perhaps not included in your log file snippet?  Or
was soft lockup detection not enabled?

						- Ted
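
To make the timestamp idea above concrete, here is a rough user-space
model of it.  This is only a sketch, not kernel code: struct fake_page,
wb_start, STALL_SECS, and the three helper functions are names made up
for illustration, and the 30 second threshold is arbitrary.  A real
version would have to grow struct page itself, which is the expense
mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct fake_page {
	bool   writeback;	/* models the PageWriteback bit */
	time_t wb_start;	/* extra field: when writeback began */
};

#define STALL_SECS 30		/* arbitrary threshold for this sketch */

/* Set the writeback bit and remember when that happened. */
static void start_writeback(struct fake_page *p, time_t now)
{
	p->writeback = true;
	p->wb_start = now;
}

/* Clear the writeback bit; the bit must have been set beforehand. */
static void end_writeback(struct fake_page *p)
{
	assert(p->writeback);
	p->writeback = false;
}

/*
 * Print every page whose writeback bit has been set for more than
 * STALL_SECS seconds, and return how many such pages were found.
 */
static size_t report_stalled(const struct fake_page *pages, size_t n,
			     time_t now)
{
	size_t stalled = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].writeback &&
		    now - pages[i].wb_start > STALL_SECS) {
			printf("page %zu: in writeback for %ld s\n",
			       i, (long)(now - pages[i].wb_start));
			stalled++;
		}
	}
	return stalled;
}
```

A caller would invoke report_stalled() periodically.  Any page it
reports is one whose writeback bit was set and never cleared within
the threshold, which is the condition being hunted for here.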