From: Wu Fengguang <fengguang.wu@intel.com>
To: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Jan Kara <jack@suse.cz>, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: deadlock balance_dirty_pages() to be expected?
Date: Fri, 7 Oct 2011 22:38:51 +0800
Message-ID: <20111007143851.GB14427@localhost>
In-Reply-To: <4E8F0CFA.3010205@itwm.fraunhofer.de>
On Fri, Oct 07, 2011 at 10:30:18PM +0800, Bernd Schubert wrote:
> On 10/07/2011 04:21 PM, Wu Fengguang wrote:
> > On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> >> Hello Fengguang,
> >>
> >> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> >>> Hi Bernd,
> >>>
> >>> On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >>>> Hello,
> >>>>
> >>>> while I'm working on the page-cache mode in FhGFS (*) I noticed a
> >>>> deadlock in balance_dirty_pages().
> >>>>
> >>>> sysrq-w showed that it never started background write-out due to
> >>>>
> >>>>   if (bdi_nr_reclaimable > bdi_thresh) {
> >>>>           pages_written += writeback_inodes_wb(&bdi->wb,
> >>>>                                                write_chunk);
> >>>>
> >>>>
> >>>> and therefore also did not leave that loop with
> >>>>
> >>>>   if (pages_written >= write_chunk)
> >>>>           break;          /* We've done our duty */
> >>>>
> >>>>
> >>>> So my process stays in uninterruptible D-state forever.
> >>>
> >>> If writeback_inodes_wb() is not triggered, the process should still be
> >>> able to proceed, presumably with longer delays, but never stuck forever.
> >>> That's because the flusher thread should still be cleaning the pages
> >>> in the background which will knock down the dirty pages and eventually
> >>> unthrottle the dirtier process.
> >>
> >> Hmm, that does not seem to work:
> >>
> >> 1330 pts/0 D+ 0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M
> >> count=100
> >
> > That's normal: dd will be in D state the vast majority of the time,
> > but the point is, a single balance_dirty_pages() call should not take
> > forever, and dd should be able to leave the D state (and re-enter it
> > almost immediately) from time to time.
> >
> >> So the process has been in D state ever since I wrote the first mail,
> >> just for 100MB of writes. Even if it were still doing something, it
> >> would be extremely slow. Sysrq-w then shows:
> >
> > So it's normal to catch such a trace 99% of the time. But do you mean
> > the writeout bandwidth is lower than expected?
>
> If it really is still doing something, it is *way* slower. Once I added
> bdi support, it finished writing the 100MB file in my kvm test instance
> within a few seconds. Right now it has been running for hours already...
> As I added a dump_stack() to our writepages() method, I also see that
> this function is never called.
In your case it should be the default/forker thread that's doing the
(suboptimal) writeout:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 17 0.0 0.0 0 0 ? S 21:12 0:00 [bdi-default]
In normal cases there are the flush-* threads doing the writeout:
root 1146 0.0 0.0 0 0 ? S 21:12 0:00 [flush-8:0]
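For reference, the fallback is roughly what inode_init_always() does
(a sketch from memory, not verbatim fs/inode.c): with no block device
and no filesystem override, the mapping keeps pointing at the global
default bdi, whose dirty pages are served by the bdi-default/forker
thread.

        struct address_space *mapping = &inode->i_data;

        /* every new inode starts out on the global default bdi */
        mapping->backing_dev_info = &default_backing_dev_info;

        /* block device based filesystems inherit the bdev's bdi... */
        if (sb->s_bdev) {
                struct backing_dev_info *bdi;

                bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
                mapping->backing_dev_info = bdi;
        }
        /* ...while a network fs like fhgfs has to override it itself */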
> >
> >>> [ 6727.616976] SysRq : Show Blocked State
> >>> [ 6727.617575] task PC stack pid father
> >>> [ 6727.618252] dd D 0000000000000000 3544 1330 1306 0x00000000
> >>> [ 6727.619002] ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> >>> [ 6727.620157] 0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> >>> [ 6727.620466] ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> >>> [ 6727.620466] Call Trace:
> >>> [ 6727.620466] [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> >>> [ 6727.620466] [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> >>> [ 6727.620466] [<ffffffff8139884f>] schedule+0x3f/0x60
> >>> [ 6727.620466] [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> >>> [ 6727.620466] [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> >>> [ 6727.620466] [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> >>> [ 6727.620466] [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> >>> [ 6727.620466] [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> >>> [ 6727.620466] [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> >>> [ 6727.620466] [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466] [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> >>> [ 6727.620466] [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466] [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> >>> [ 6727.620466] [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> >>> [ 6727.620466] [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> >>> [ 6727.620466] [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> >>> [ 6727.620466] [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> >>> [ 6727.620466] [<ffffffff8115b661>] sys_write+0x51/0x90
> >>> [ 6727.620466] [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> >>> [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> >>
> >>
> >>>
> >>>> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >>>> file system, the deadlock did not happen anymore.
> >>>
> >>> What's the workload and change exactly?
> >>
> >> I wish I could simply send the patch, but until all the paperwork is
> >> done I'm not allowed to :(
> >>
> >> The basic idea is:
> >>
> >> 1) During mount, when setting up the super block from
> >>
> >> static struct file_system_type fhgfs_fs_type =
> >> {
> >>         .mount = fhgfs_mount,
> >> };
> >>
> >> Then in fhgfs_mount():
> >>
> >> bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> >> sb->s_bdi = &sbInfo->bdi;
> >>
> >>
> >>
> >> 2) When new (S_IFREG) inodes are allocated, for example from
> >>
> >> static struct inode_operations fhgfs_dir_inode_ops =
> >> {
> >>         .lookup = ...,
> >>         .create = ...,
> >>         .link = ...,
> >> };
> >>
> >> inode->i_data.backing_dev_info = &sbInfo->bdi;
> >
> > Ah, when you didn't register the "fhgfs" bdi, there was no dedicated
> > flusher thread doing the writeout. Which is obviously
> > suboptimal.
> >
> >>>> So my question is simply whether we should expect this deadlock if
> >>>> the file system does not set up backing device information, and if
> >>>> so, shouldn't this be documented?
> >>>
> >>> Such a deadlock is not expected..
> >>
> >> Ok thanks, then we should figure out why it happens. Due to a network
> >> outage here I won't have time before Monday to track down which kernel
> >> version introduced it, though.
> >
> > Per-bdi writeback was introduced a long time ago, I suspect.
>
> Ok, I can start by testing whether 2.6.32 already deadlocks.
I found the commit; it was introduced right in 2.6.32, hehe.
commit 03ba3782e8dcc5b0e1efe440d33084f066e38cae
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Wed Sep 9 09:08:54 2009 +0200

    writeback: switch to per-bdi threads for flushing data

    This gets rid of pdflush for bdi writeout and kupdated style cleaning.
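So with per-bdi flusher threads, writeback work is only ever executed
by the flusher thread of a registered bdi; a filesystem that never
registers one is left to the default bdi. An untested sketch of the
two steps you described, consolidated (the fhgfs_fill_super and
fhgfs_new_inode names are just illustrative here, not your real code):

        #include <linux/backing-dev.h>
        #include <linux/fs.h>
        #include <linux/slab.h>

        struct fhgfs_sb_info {          /* illustrative layout only */
                struct backing_dev_info bdi;
                /* ... */
        };

        /* 1) At mount time: register the bdi and hook it into the
         *    super block, so a dedicated flush-fhgfs thread can be
         *    forked once dirty pages show up. */
        static int fhgfs_fill_super(struct super_block *sb, void *data,
                                    int silent)
        {
                struct fhgfs_sb_info *sbInfo;
                int err;

                sbInfo = kzalloc(sizeof(*sbInfo), GFP_KERNEL);
                if (!sbInfo)
                        return -ENOMEM;

                err = bdi_setup_and_register(&sbInfo->bdi, "fhgfs",
                                             BDI_CAP_MAP_COPY);
                if (err) {
                        kfree(sbInfo);
                        return err;
                }

                sb->s_fs_info = sbInfo;
                sb->s_bdi = &sbInfo->bdi;
                /* ... remaining fill_super work ... */
                return 0;
        }

        /* 2) For each new regular inode: point the mapping at the
         *    registered bdi instead of the default one. */
        static void fhgfs_new_inode(struct super_block *sb,
                                    struct inode *inode)
        {
                struct fhgfs_sb_info *sbInfo = sb->s_fs_info;

                inode->i_data.backing_dev_info = &sbInfo->bdi;
        }

And the bdi then needs a matching bdi_destroy(&sbInfo->bdi) on umount.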
Thanks,
Fengguang