Re: deadlock balance_dirty_pages() to be expected?

From: Wu Fengguang <fengguang.wu@intel.com>
To: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>, Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: deadlock balance_dirty_pages() to be expected?
Date: Fri, 7 Oct 2011 22:38:51 +0800
Message-ID: <20111007143851.GB14427@localhost>
In-Reply-To: <4E8F0CFA.3010205@itwm.fraunhofer.de>

On Fri, Oct 07, 2011 at 10:30:18PM +0800, Bernd Schubert wrote:
> On 10/07/2011 04:21 PM, Wu Fengguang wrote:
> > On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> >> Hello Fengguang,
> >>
> >> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> >>> Hi Bernd,
> >>>
> >>> On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >>>> Hello,
> >>>>
> >>>> while I'm working on the page cached mode in FhGFS (*) I noticed a
> >>>> deadlock in balance_dirty_pages().
> >>>>
> >>>> sysrq-w showed that it never started background write-out due to
> >>>>
> >>>> if (bdi_nr_reclaimable > bdi_thresh) {
> >>>> 	pages_written += writeback_inodes_wb(&bdi->wb,
> >>>> 					     write_chunk);
> >>>>
> >>>>
> >>>> and therefore also did not leave that loop with
> >>>>
> >>>> 	if (pages_written >= write_chunk)
> >>>> 		break;	/* We've done our duty */
> >>>>
> >>>>
> >>>> So my process stays in uninterruptible D-state forever.
> >>>
> >>> If writeback_inodes_wb() is not triggered, the process should still be
> >>> able to proceed, presumably with longer delays, but never stuck forever.
> >>> That's because the flusher thread should still be cleaning the pages
> >>> in the background which will knock down the dirty pages and eventually
> >>> unthrottle the dirtier process.
> >>
> >> Hmm, that does not seem to work:
> >>
> >> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M count=100
> >
> > That's normal: dd will be in D state the vast majority of the time,
> > but the point is, one single balance_dirty_pages() call should not
> > take forever, and dd should be able to leave the D state (and
> > re-enter it almost immediately) from time to time.
> >
> >> So the process is in D state ever since I wrote the first mail, just for
> >> 100MB writes. Even if it still would do something, it would be extremely
> >> slow. Sysrq-w then shows:
> >
> > So it's normal to catch such a trace 99% of the time.  But do you
> > mean the writeout bandwidth is lower than expected?
> 
> If it really is still doing something, it is *way* slower. Once I added
> bdi support, it finishes writing the 100MB file in my kvm test instance
> within a few seconds. Right now it has been running for hours already... As I
> added a dump_stack() to our writepages() method, I also see that this
> function is never called.

In your case it should be the default/forker thread that's doing the
(suboptimal) writeout: 

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        17  0.0  0.0      0     0 ?        S    21:12   0:00 [bdi-default]

In normal cases there are the flush-* threads doing the writeout:

root      1146  0.0  0.0      0     0 ?        S    21:12   0:00 [flush-8:0]
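
For illustration only, the expectation discussed in this thread (background
cleaning eventually brings the dirty count below the threshold and
unthrottles the dirtier) can be sketched as a toy user-space model. This is
not kernel code: throttle_ticks() and all of its parameters are hypothetical
names, and the "flusher" is reduced to a fixed number of pages cleaned per
sleep interval.

```c
#include <assert.h>

/*
 * Toy model, NOT kernel code; every name here is hypothetical.
 *
 * A throttled writer sleeps one tick at a time while a background
 * flusher cleans `flush_per_tick` pages per tick. Returns the number
 * of ticks slept until the dirty count is no longer above `thresh`,
 * or -1 if the writer is still throttled after `max_ticks` ticks.
 */
static long throttle_ticks(long dirty, long thresh,
                           long flush_per_tick, long max_ticks)
{
    assert(dirty >= 0 && thresh >= 0);
    assert(flush_per_tick >= 0 && max_ticks >= 0);

    long ticks = 0;

    while (dirty > thresh) {
        if (ticks == max_ticks)
            return -1;          /* gave up: looks "stuck in D state" */

        /* The flusher cleans some pages while the writer sleeps. */
        long cleaned = flush_per_tick < dirty ? flush_per_tick : dirty;
        dirty -= cleaned;
        ticks++;
    }
    return ticks;
}
```

In this model a slow flusher only lengthens the delay: with 1000 dirty pages
and a threshold of 500, cleaning 1 page per tick takes 500 ticks, while
cleaning 100 pages per tick takes 5. Only a flush rate of 0 keeps the writer
throttled until max_ticks runs out.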

> >
> >>> [ 6727.616976] SysRq : Show Blocked State
> >>> [ 6727.617575]   task                        PC stack   pid father
> >>> [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> >>> [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> >>> [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> >>> [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> >>> [ 6727.620466] Call Trace:
> >>> [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> >>> [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> >>> [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> >>> [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> >>> [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> >>> [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> >>> [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> >>> [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> >>> [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> >>> [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> >>> [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> >>> [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> >>> [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> >>> [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> >>> [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> >>> [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> >>
> >>
> >>>
> >>>> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >>>> file system, the deadlock did not happen anymore.
> >>>
> >>> What's the workload and change exactly?
> >>
> >> I wish I could simply send the patch, but until all the paper work is
> >> done I'm not allowed to :(
> >>
> >> The basic idea is:
> >>
> >> 1) During mount and setting the super block from
> >>
> >> static struct file_system_type fhgfs_fs_type =
> >> {
> >> 	.mount = fhgfs_mount,
> >> }
> >>
> >> Then in fhgfs_mount():
> >>
> >> bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> >> sb->s_bdi = &sbInfo->bdi;
> >>
> >>
> >>
> >> 2) When new (S_IFREG) inodes are allocated, for example from
> >>
> >> static struct inode_operations fhgfs_dir_inode_ops
> >> {
> >> 	.lookup,
> >> 	.create,
> >> 	.link
> >> }
> >>
> >> inode->i_data.backing_dev_info = &sbInfo->bdi;
> >
> > Ah when you didn't register the "fhgfs" bdi, there should be no
> > dedicated flusher thread for doing the writeout.  Which is obviously
> > suboptimal.
> >
> >>>> So my question is simply if we should expect this deadlock, if the file
> >>>> system does not set up backing device information and if so, shouldn't
> >>>> this be documented?
> >>>
> >>> Such deadlock is not expected..
> >>
> >> Ok thanks, then we should figure out why it happens. Due to a network
> >> outage here I won't have time before Monday to track down which kernel
> >> version introduced it, though.
> >
> > I suspect it dates back to when per-bdi writeback was introduced, a long time ago.
> 
> Ok, I can start to test if 2.6.32 also already deadlocks.

I found the commit; it was introduced right in .32, hehe.

commit 03ba3782e8dcc5b0e1efe440d33084f066e38cae
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Wed Sep 9 09:08:54 2009 +0200

    writeback: switch to per-bdi threads for flushing data
    
    This gets rid of pdflush for bdi writeout and kupdated style cleaning.

Thanks,
Fengguang
