From mboxrd@z Thu Jan 1 00:00:00 1970 From: Albert Chu Subject: Re: RE: RE: oops in 2.4.25 prune_icache() called from kswapd Date: Sun, 19 Jun 2005 23:20:26 -0700 Message-ID: <339573ada4.3ada433957@llnl.gov> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7BIT Cc: "Velupula, Prakash" , Marcelo Tosatti , David Woodhouse , Al Viro , linux-fsdevel@vger.kernel.org, lwoodman@redhat.com, behlendorf1@llnl.gov Return-path: Received: from smtp-1.llnl.gov ([128.115.250.81]:58777 "EHLO smtp-1.llnl.gov") by vger.kernel.org with ESMTP id S261468AbVFTGVv (ORCPT ); Mon, 20 Jun 2005 02:21:51 -0400 To: Chris Caputo Content-language: en Content-disposition: inline Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org > > Albert, Prakash, any of you using stock v2.4? Sorry, we don't use a stock kernel in our environment :-( > Do you have any inputs on reproducing this? Sorry. We were a bit resource fortunate when debugging this. Heavy I/O scientific simulations on a 1000 node cluster got us about 5-10 hits a day. Al -- Albert Chu chu11@llnl.gov Lawrence Livermore National Laboratory ----- Original Message ----- From: Chris Caputo Date: Sunday, June 19, 2005 4:07 pm Subject: RE: RE: oops in 2.4.25 prune_icache() called from kswapd > My basic repro method was: > > -- > 0) start irqbalance > 1) run loop_dbench, which is the following dbench script which uses > client_plain.txt: > > #!/bin/sh > > while [ 1 ] > do > date > dbench 2 > 2) wait for oops > -- > > I think I was using dbench-2.1: > > http://samba.org/ftp/tridge/dbench/dbench-2.1.tar.gz > > In my case irqbalance was key. If I didn't run it I never got the > problem. I think irqbalance just did a good job of exasperating a > race > condition in some way. > > Chris > > On Sun, 19 Jun 2005, Velupula, Prakash wrote: > > Hi Marcelo, > > > > We are using 2.4.20 and would be able test it. But before that, > we are > > trying to recreate the problem. We just had one occurrence of this > > problem. Do you have any inputs on reproducing this? > > > > Thanks, > > Prakash > > > > -----Original Message----- > > From: Marcelo Tosatti [mailto:marcelo.tosatti@cyclades.com] > > Sent: Saturday, June 18, 2005 3:34 PM > > To: Chris Caputo > > Cc: Albert Chu; David Woodhouse; Al Viro; Velupula, Prakash; > > linux-fsdevel@vger.kernel.org; lwoodman@redhat.com > > Subject: Re: RE: oops in 2.4.25 prune_icache() called from kswapd > > > > On Sun, Jun 19, 2005 at 01:11:20AM +0000, Chris Caputo wrote: > >> Hi, > >> > >> Marcello, your patch didn't come through. Unfortunately I can't > test>> it anymore since my environment has now changed, but > hopefully Albert > >> or Prakash can try it when you resend. > > > > Albert, Prakash, any of you using stock v2.4? > > > > Here's the diff: > > > > --- a/fs/inode.c.orig 2005-06-18 11:19:21.508857600 -0300 > > +++ b/fs/inode.c 2005-06-18 11:21:03.925287936 -0300 > > @@ -1238,15 +1238,18 @@ > > BUG(); > > } else { > > if (!list_empty(&inode->i_hash)) { > > - if (!(inode->i_state & > > (I_DIRTY|I_LOCK))) > > - __refile_inode(inode); > > - inodes_stat.nr_unused++; > > - spin_unlock(&inode_lock); > > - if (!sb || (sb->s_flags & MS_ACTIVE)) > > + if (!sb || (sb->s_flags & MS_ACTIVE)) { > > + if (!(inode->i_state & > > + (I_DIRTY|I_LOCK))) { > > + __refile_inode(inode); > > + inodes_stat.nr_unused++; > > + } > > + spin_unlock(&inode_lock); > > return; > > + } > > + spin_unlock(&inode_lock); > > write_inode_now(inode, 1); > > spin_lock(&inode_lock); > > - inodes_stat.nr_unused--; > > list_del_init(&inode->i_hash); > > } > > list_del_init(&inode->i_list); > >> > >> Thanks, > >> Chris > >> > >> On Sat, 18 Jun 2005, Marcelo Tosatti wrote: > >>> Hi, > >>> > >>> Shame the RH bugzilla (#155289) requires super priviledges to be > >>> accessed. > >>> > >>> I've got around reading Albert's description of the race - thanks > > BTW. > >>> (its attached below for reference) > >>> > >>> It seems to me that only window open in mainline is between iput() > >>> and prune_icache(), while iput() sleeps on sync_one() with the > inode>>> being: > >>> > >>> - on the unused list > >>> - and with i_count set to zero > >>> > >>> prune_icache() is free to invalidate and destroy the inode in the > >>> meantime, causing iput()'s sync_one() to __refile_inode() the NULL > >>> entry to the unused list later on. > >>> > >>> If that is indeed the case, removing __refile_inode() from the > >>> nonzero > >>> inode->i_nlink path should close that window. > >>> > >>> Chris, can you please test the attached "iput-iprune- > race.patch" with > > > >>> your usual irqbalance enable environment ? > >>> > >>> > >>> On Sat, Jun 18, 2005 at 01:02:20PM -0700, Albert Chu wrote: > >>>> Howdy everyone, > >>>> > >>>>> Albert Chu, CC'ed, has suggested the below as a fix. Albert, > any>>>>> new info on this, or have these two patches cleared up the > problem> well? > >>>> > >>>> The __refile_inode() patch below fixed the problem for us on our > >>>> clusters (running RHEL3). The clear_inode() patch is something > >>>> Redhat (I think Larry Woodman) gave to us to fix a problem he > >>>> believes is in the same general area. We haven't had any > additional> > >>>> problems adding his second patch. > >>>> > >>>>> BTW, the below mail says that these are workarounds and a > real fix > >>>>> is on the way. Has that been rolled in as well? > >>>> > >>>> Sorry, not too sure. :-( > >>> > >>> Albert's description: > >>> > >>>> Just thought I'd let you know what's up. We think we're close to > >>>> getting to the bottom of this. We think the race is between > iput()>>>> and __refile_inode(). The Redhat kernel is of course > different than > > > >>>> the mainline kernel, but I think the same bug exists in the > >>>> mainline. The example below illustrates the race between > iput() and > > > >>>> __sync_one(), but it could occur with other areas that all > >>>> __refile_inode(). For us, I think we're hitting it in > __sync_one()>>>> and prune_icache(). (The > >>>> prune_icache() call to __refile_inode() doesn't seem to be in the > >>>> mainline though). > >>>> > >>>> proc 0: > >>>> calls iput() and locks inode_lock. > >>>> iput removes the inode off of the i_list unlocks inode_lock > (proc 1 > >>>> will now at some point grab inode_lock) calls clear_inode() > (this is > > > >>>> key) gets past the call to wait_on_inode(); at this point > >>>> clear_inode() and the remainder of iput() does not care about > I_LOCK> > >>>> or inode_lock. > >>>> > >>>> proc 1: > >>>> calls __sync_one() with inode_lock. > >>>> sets I_LOCK > >>>> do stuff before __refile_inode() is called, all I_LOCK/inode_lock > >>>> stuff doesn't matter. > >>>> > >>>> proc 0: > >>>> sets i_state = I_CLEAR > >>>> iput calls destroy_inode() > >>>> > >>>> proc 1: > >>>> calls __refile_inode > >>>> > >>>> and we have ourselves a corrupted inode on the inode_unused list. > >>>> > >>>> I'm not sure if you can see it or not, but Redhat bugzilla > 155289 is > > > >>>> tracking this. > >>>> > >>>> Al > > >