From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Zhang, Yanmin"
Subject: Re: [PATCH 0/15] Per-bdi writeback flusher threads v10
Date: Thu, 18 Jun 2009 09:01:06 +0800
Message-ID: <1245286866.2560.407.camel@ymzhang>
References: <1244811255-5391-1-git-send-email-jens.axboe@oracle.com> <1245114397.2560.368.camel@ymzhang> <20090616080036.GO11363@kernel.dk> <20090616195329.GH11363@kernel.dk>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, chris.mason@oracle.com, david@fromorbit.com, hch@infradead.org, akpm@linux-foundation.org, jack@suse.cz, richard@rsk.demon.co.uk, damien.wyart@free.fr, dedekind1@gmail.com, fweisbec@gmail.com
To: Jens Axboe
Return-path:
Received: from mga06.intel.com ([134.134.136.21]:15511 "EHLO orsmga101.jf.intel.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751218AbZFRBBF (ORCPT ); Wed, 17 Jun 2009 21:01:05 -0400
In-Reply-To: <20090616195329.GH11363@kernel.dk>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Tue, 2009-06-16 at 21:53 +0200, Jens Axboe wrote:
> On Tue, Jun 16 2009, Jens Axboe wrote:
> > On Tue, Jun 16 2009, Zhang, Yanmin wrote:
> > > On Fri, 2009-06-12 at 14:54 +0200, Jens Axboe wrote:
> > > > Hi,
> > > >
> > > > Here's the 10th version of the writeback patches. Changes since v9:
> > > >
> > > > - Fix bdi task exit race leaving work on the list, flush it after we
> > > >   know we cannot be found anymore.
> > > > - Rename flusher tasks from bdi-foo to flush-foo. Should make it more
> > > >   clear to the casual observer.
> > > > - Fix a problem with the btrfs bdi register patch that would spew
> > > >   warnings for > 1 mounted btrfs file system.
> > > > - Rebase to current -git, there were some conflicts with the latest work
> > > >   from viro/hch.
> > > > - Fix a block layer core problem were stacked devices would overwrite
> > > >   the bdi state, causing problems and warning spew.
> > > > - In bdi_writeback_all(), in the race occurence of a work allocation
> > > >   failure, restart scanning from the beginning. Then we can drop the
> > > >   bdi_lock mutex before diving into bdi specific writeback.
> > > > - Convert bdi_lock to a spinlock.
> > > > - Use spin_trylock() in bdi_writeback_all(), if this isn't a data
> > > >   integrity writeback. Debatable, I kind of like it...
> > > > - Get rid of BDI_CAP_FLUSH_FORKER, just check for match with the
> > > >   default_backing_dev_info.
> > > > - Fix race in list checking in bdi_forker_task().
> > > >
> > > >
> > > > For ease of patching, I've put the full diff here:
> > > >
> > > > http://kernel.dk/writeback-v10.patch
> > > Jens,
> > >
> > > I applied the patch to 2.6.30 and got a confliction. The attachment is
> > > the patch I ported to 2.6.30. Did I miss anything?
> > >
> > >
> > > With the patch, kernel reports below messages on 2 machines.
> > >
> > > INFO: task sync:29984 blocked for more than 120 seconds.
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > sync D ffff88002805e300 6168 29984 24581
> > >  ffff88022f84b780 0000000000000082 7fffffffffffffff ffff880133dbfe70
> > >  0000000000000000 ffff88022e2b4c50 ffff88022e2b4fd8 00000001000c7bb8
> > >  ffff88022f513fd0 ffff880133dbfde8 ffff880133dbfec8 ffff88022d5d13c8
> > > Call Trace:
> > >  [] ? bdi_sched_wait+0x0/0xd
> > >  [] ? schedule+0x9/0x1d
> > >  [] ? bdi_sched_wait+0x9/0xd
> > >  [] ? __wait_on_bit+0x40/0x6f
> > >  [] ? bdi_sched_wait+0x0/0xd
> > >  [] ? out_of_line_wait_on_bit+0x6c/0x78
> > >  [] ? wake_bit_function+0x0/0x23
> > >  [] ? bdi_writeback_all+0x12a/0x152
> > >  [] ? generic_sync_sb_inodes+0x31/0xde
> > >  [] ? sync_inodes_sb+0x83/0x88
> > >  [] ? __sync_inodes+0x46/0x8f
> > >  [] ? do_sync+0x36/0x5a
> > >  [] ? sys_sync+0xe/0x12
> > >  [] ? system_call_fastpath+0x16/0x1b
> >
> > I don't think it is your backport, for some reason the v10 missed a
> > change that I think could solve this race. If not, there's another in
> > there that I need to look at.
> >
> > So against your current base, could you try with the below added as
> > well? The printk() is just so we can see if this triggers for you or
> > not.
>
> OK that wont work, since we need to actually wait for the work to be
> flushed, otherwise we wreak things when we free the bdi immediately
> after that.
>
> Can you try with this patch?
Jens,

I tested the patch below on 4 machines (ran all fio sub-test cases twice,
which takes more than 10 hours). The previous 2 machines did not hang this
time. Unfortunately, the 3rd machine hung. I double-checked the
disassembled kernel code and made sure bdi_start_fn really calls
wb_do_writeback.

INFO: task sync:30618 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sync D ffffc20000011300 4736 30618 28522
 ffff8800bd25b090 0000000000000082 ffff8800bd763780 ffff8800bd763b08
 00000000bd582e68 0000000000004000 0000000000011300 000000000000c868
 ffff8800bd9c5df8 0000000000000000 ffff8800bd763780 ffff8800bd763b08
Call Trace:
 [] ? find_get_pages_tag+0x46/0xdd
 [] ? bdi_sched_wait+0x0/0xd
 [] ? schedule+0x9/0x1e
 [] ? bdi_sched_wait+0x9/0xd
 [] ? __wait_on_bit+0x41/0x71
 [] ? bdi_sched_wait+0x0/0xd
 [] ? out_of_line_wait_on_bit+0x6b/0x77
 [] ? wake_bit_function+0x0/0x23
 [] ? bdi_writeback_all+0x134/0x16b
 [] ? generic_sync_sb_inodes+0x31/0xdc
 [] ? sync_inodes_sb+0x83/0x88
 [] ? __sync_inodes+0x46/0x8f
 [] ? do_sync+0x36/0x5a
 [] ? sys_sync+0xe/0x14
 [] ? system_call_fastpath+0x16/0x1b

>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5a1837f..4a6859e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -409,7 +409,7 @@ static struct bdi_work *get_next_work_item(struct backing_dev_info *bdi,
>  /*
>   * Retrieve work items and do the writeback they describe
>   */
> -static long wb_writeback(struct bdi_writeback *wb)
> +static long wb_writeback(struct bdi_writeback *wb, int force_wait)
>  {
>  	struct backing_dev_info *bdi = wb->bdi;
>  	struct bdi_work *work;
> @@ -418,7 +418,12 @@ static long wb_writeback(struct bdi_writeback *wb)
>  	while ((work = get_next_work_item(bdi, wb)) != NULL) {
>  		struct super_block *sb = bdi_work_sb(work);
>  		long nr_pages = work->nr_pages;
> -		enum writeback_sync_modes sync_mode = work->sync_mode;
> +		enum writeback_sync_modes sync_mode;
> +
> +		if (force_wait)
> +			sync_mode = WB_SYNC_ALL;
> +		else
> +			sync_mode = work->sync_mode;
>
>  		/*
>  		 * If this isn't a data integrity operation, just notify
> @@ -444,7 +449,7 @@ static long wb_writeback(struct bdi_writeback *wb)
>   * This will be inlined in bdi_writeback_task() once we get rid of any
>   * dirty inodes on the default_backing_dev_info
>   */
> -long wb_do_writeback(struct bdi_writeback *wb)
> +long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
>  {
>  	long wrote;
>
> @@ -461,7 +466,7 @@ long wb_do_writeback(struct bdi_writeback *wb)
>  	if (list_empty(&wb->bdi->work_list))
>  		wrote = wb_kupdated(wb);
>  	else
> -		wrote = wb_writeback(wb);
> +		wrote = wb_writeback(wb, force_wait);
>
>  	return wrote;
>  }
> @@ -477,7 +482,7 @@ int bdi_writeback_task(struct bdi_writeback *wb)
>  	long pages_written;
>
>  	while (!kthread_should_stop()) {
> -		pages_written = wb_do_writeback(wb);
> +		pages_written = wb_do_writeback(wb, 0);
>
>  		if (pages_written)
>  			last_active = jiffies;
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 0d4e31d..e070b91 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -68,7 +68,7 @@ struct writeback_control {
>  void writeback_inodes(struct writeback_control *wbc);
>  int inode_wait(void *);
>  void sync_inodes_sb(struct super_block *, int wait);
> -long wb_do_writeback(struct bdi_writeback *wb);
> +long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 23013d5..0c91add 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -389,7 +389,7 @@ static int bdi_start_fn(void *ptr)
>  	 * will be added, since this bdi isn't discoverable anymore.
>  	 */
>  	if (!list_empty(&bdi->work_list))
> -		wb_do_writeback(wb);
> +		wb_do_writeback(wb, 1);
>
>  	bdi_put_wb(bdi, wb);
>  	return ret;
> @@ -484,7 +484,7 @@ static int bdi_forker_task(void *ptr)
>  		 * dirty data on the default backing_dev_info
>  		 */
>  		if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))
> -			wb_do_writeback(me);
> +			wb_do_writeback(me, 0);
>
>  		spin_lock(&bdi_lock);
>
>
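
For reference, the sync-mode selection that the quoted patch adds to
wb_writeback() can be modeled in user space roughly as below. This is only
a sketch under assumptions: the enum is redeclared locally so the snippet
builds outside the kernel tree, and struct work_item and pick_sync_mode()
are hypothetical names used for illustration, not kernel API.

```c
/* Local stand-in for the kernel enum of the same name, redeclared here
 * so that the sketch compiles outside the kernel tree. */
enum writeback_sync_modes {
	WB_SYNC_NONE,	/* do not wait for the writeback to complete */
	WB_SYNC_ALL,	/* data integrity: wait for everything */
};

/* Hypothetical, trimmed-down work item: only the field the patch reads. */
struct work_item {
	enum writeback_sync_modes sync_mode;
};

/*
 * Mirrors the if/else the patch adds at the top of the wb_writeback()
 * loop: a caller that passes force_wait overrides whatever mode the
 * queued work item asked for.
 */
static enum writeback_sync_modes
pick_sync_mode(const struct work_item *work, int force_wait)
{
	if (force_wait)
		return WB_SYNC_ALL;
	return work->sync_mode;
}
```

In the patch, only the bdi_start_fn() exit path passes force_wait = 1, so a
queued WB_SYNC_NONE item is handled as WB_SYNC_ALL there; bdi_writeback_task()
and bdi_forker_task() pass 0 and keep the work item's own mode.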