linux-fsdevel.vger.kernel.org archive mirror
* deadlock balance_dirty_pages() to be expected?
@ 2011-10-07 12:34 Bernd Schubert
  2011-10-07 13:37 ` Wu Fengguang
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2011-10-07 12:34 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang

Hello,

while I'm working on the page-cache mode in FhGFS (*), I noticed a 
deadlock in balance_dirty_pages().

sysrq-w showed that it never started background write-out due to

if (bdi_nr_reclaimable > bdi_thresh) {
	pages_written += writeback_inodes_wb(&bdi->wb,
					     write_chunk);


and therefore also did not leave that loop with

	if (pages_written >= write_chunk)
  				break;	/* We've done our duty */


So my process stays in uninterruptible D-state forever.
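
For reference, the surrounding throttle loop in mm/page-writeback.c of this 
kernel looks roughly like this (heavily condensed and partly paraphrased from 
memory, not an exact copy of the source):

for (;;) {
	nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
				global_page_state(NR_UNSTABLE_NFS);
	global_dirty_limits(&background_thresh, &dirty_thresh);

	/* done once we are reasonably below the global limits again */
	if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
			(background_thresh + dirty_thresh) / 2)
		break;

	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
	/* ... compute bdi_nr_reclaimable etc. for this bdi ... */

	if (bdi_nr_reclaimable > bdi_thresh) {
		pages_written += writeback_inodes_wb(&bdi->wb,
						     write_chunk);
		if (pages_written >= write_chunk)
			break;	/* We've done our duty */
	}

	/* otherwise sleep and hope the flusher thread cleans pages */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	io_schedule_timeout(pause);
	/* ... pause grows, then loop again ... */
}

/* after the loop, background writeback gets kicked if needed */
if ((laptop_mode && pages_written) ||
    (!laptop_mode && (nr_reclaimable > background_thresh)))
	bdi_start_background_writeback(bdi);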

Once I added basic inode->i_data.backing_dev_info bdi support to our 
file system, the deadlock did not happen anymore.

So my question is simply whether this deadlock is to be expected if the file 
system does not set up backing device information, and if so, shouldn't 
this be documented?


Thanks,
Bernd


PS: While FhGFS has a proprietary license right now, we will soon at 
least provide the client kernel modules under the GPL.


* Re: deadlock balance_dirty_pages() to be expected?
  2011-10-07 12:34 deadlock balance_dirty_pages() to be expected? Bernd Schubert
@ 2011-10-07 13:37 ` Wu Fengguang
  2011-10-07 14:08   ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2011-10-07 13:37 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Peter Zijlstra

Hi Bernd,

On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> Hello,
> 
> while I'm working on the page cached mode in FhGFS (*) I noticed a 
> deadlock in balance_dirty_pages().
> 
> sysrq-w showed that it never started background write-out due to
> 
> if (bdi_nr_reclaimable > bdi_thresh) {
> 	pages_written += writeback_inodes_wb(&bdi->wb,
> 					    (write_chunk);
> 
> 
> and therefore also did not leave that loop with
> 
> 	if (pages_written >= write_chunk)
>   				break;	/* We've done our duty */
> 
> 
> So my process stay in uninterruptible D-state forever.

If writeback_inodes_wb() is not triggered, the process should still be
able to proceed, presumably with longer delays, but it should never be stuck
forever. That's because the flusher thread should still be cleaning the pages
in the background, which will knock down the dirty pages and eventually
unthrottle the dirtier process.

> Once I added basic inode->i_data.backing_dev_info bdi support to our 
> file system, the deadlock did not happen anymore.

What's the workload and change exactly?

> So my question is simply if we should expect this deadlock, if the file 
> system does not set up backing device information and if so, shouldn't 
> this be documented?

Such a deadlock is not expected.

Thanks,
Fengguang


* Re: deadlock balance_dirty_pages() to be expected?
  2011-10-07 13:37 ` Wu Fengguang
@ 2011-10-07 14:08   ` Bernd Schubert
  2011-10-07 14:21     ` Wu Fengguang
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2011-10-07 14:08 UTC (permalink / raw)
  To: Wu Fengguang, linux-fsdevel; +Cc: Jan Kara, Peter Zijlstra

Hello Fengguang,

On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> Hi Bernd,
>
> On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
>> Hello,
>>
>> while I'm working on the page cached mode in FhGFS (*) I noticed a
>> deadlock in balance_dirty_pages().
>>
>> sysrq-w showed that it never started background write-out due to
>>
>> if (bdi_nr_reclaimable>  bdi_thresh) {
>> 	pages_written += writeback_inodes_wb(&bdi->wb,
>> 					    (write_chunk);
>>
>>
>> and therefore also did not leave that loop with
>>
>> 	if (pages_written>= write_chunk)
>>    				break;	/* We've done our duty */
>>
>>
>> So my process stay in uninterruptible D-state forever.
>
> If writeback_inodes_wb() is not triggered, the process should still be
> able to proceed, presumably with longer delays, but never stuck forever.
> That's because the flusher thread should still be cleaning the pages
> in the background which will knock down the dirty pages and eventually
> unthrottle the dirtier process.

Hmm, that does not seem to work:

1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M 
count=100

So the process has been in D state ever since I wrote the first mail, just for 
a 100MB write. Even if it were still doing something, it would be extremely 
slow. Sysrq-w then shows:

> [ 6727.616976] SysRq : Show Blocked State
> [ 6727.617575]   task                        PC stack   pid father
> [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> [ 6727.620466] Call Trace:
> [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47


>
>> Once I added basic inode->i_data.backing_dev_info bdi support to our
>> file system, the deadlock did not happen anymore.
>
> What's the workload and change exactly?

I wish I could simply send the patch, but until all the paperwork is 
done I'm not allowed to :(

The basic idea is:

1) During mount, the superblock is set up via

static struct file_system_type fhgfs_fs_type =
{
	.mount = fhgfs_mount,
	/* ... */
};

Then in fhgfs_mount():

bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
sb->s_bdi = &sbInfo->bdi;



2) When new (S_IFREG) inodes are allocated, for example from the .lookup, 
.create and .link operations in

static struct inode_operations fhgfs_dir_inode_ops =
{
	.lookup = /* ... */,
	.create = /* ... */,
	.link   = /* ... */,
};

the inode's mapping is pointed at that bdi:

inode->i_data.backing_dev_info = &sbInfo->bdi;
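
Spelled out a little more, the idea looks roughly like this; everything here 
is a placeholder sketch (sbInfo, fhgfs_setup_bdi and fhgfs_init_inode_bdi are 
made-up names, error handling and the rest of the inode setup are left out):

struct fhgfs_sb_info {			/* per-mount private data */
	struct backing_dev_info bdi;
	/* ... */
};

/* called from fhgfs_mount(): register a private bdi and hook it into the sb */
static int fhgfs_setup_bdi(struct super_block *sb, struct fhgfs_sb_info *sbInfo)
{
	int err;

	err = bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
	if (err)
		return err;

	sb->s_bdi = &sbInfo->bdi;	/* writeback for this sb now targets our bdi */
	return 0;
}

/* called whenever a new inode is set up, e.g. from .create/.lookup */
static void fhgfs_init_inode_bdi(struct inode *inode)
{
	struct fhgfs_sb_info *sbInfo = inode->i_sb->s_fs_info;

	if (S_ISREG(inode->i_mode))	/* only regular files carry data pages here */
		inode->i_data.backing_dev_info = &sbInfo->bdi;
}

(On unmount the bdi of course has to be torn down again with 
bdi_destroy(&sbInfo->bdi) once the superblock is gone.)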


>
>> So my question is simply if we should expect this deadlock, if the file
>> system does not set up backing device information and if so, shouldn't
>> this be documented?
>
> Such deadlock is not expected..

Ok thanks, then we should figure out why it happens. Due to a network 
outage here I won't have time before Monday to track down which kernel 
version introduced it, though.


Thanks,
Bernd


* Re: deadlock balance_dirty_pages() to be expected?
  2011-10-07 14:08   ` Bernd Schubert
@ 2011-10-07 14:21     ` Wu Fengguang
  2011-10-07 14:30       ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2011-10-07 14:21 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Peter Zijlstra

On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> Hello Fengguang,
> 
> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> > Hi Bernd,
> >
> > On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >> Hello,
> >>
> >> while I'm working on the page cached mode in FhGFS (*) I noticed a
> >> deadlock in balance_dirty_pages().
> >>
> >> sysrq-w showed that it never started background write-out due to
> >>
> >> if (bdi_nr_reclaimable>  bdi_thresh) {
> >> 	pages_written += writeback_inodes_wb(&bdi->wb,
> >> 					    (write_chunk);
> >>
> >>
> >> and therefore also did not leave that loop with
> >>
> >> 	if (pages_written>= write_chunk)
> >>    				break;	/* We've done our duty */
> >>
> >>
> >> So my process stay in uninterruptible D-state forever.
> >
> > If writeback_inodes_wb() is not triggered, the process should still be
> > able to proceed, presumably with longer delays, but never stuck forever.
> > That's because the flusher thread should still be cleaning the pages
> > in the background which will knock down the dirty pages and eventually
> > unthrottle the dirtier process.
> 
> Hmm, that does not seem to work:
> 
> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M 
> count=100

That's normal: dd will be in D state the vast majority of the time, but
the point is that a single balance_dirty_pages() call should not take
forever, and dd should be able to get out of the D state (and
re-enter it almost immediately) from time to time.

> So the process is in D state ever since I wrote the first mail, just for 
> 100MB writes. Even if it still would do something, it would be extremely 
> slow. Sysrq-w then shows:

So it's normal to catch such a trace 99% of the time.  But do you mean the
writeout bandwidth is lower than expected?

> > [ 6727.616976] SysRq : Show Blocked State
> > [ 6727.617575]   task                        PC stack   pid father
> > [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> > [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> > [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> > [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> > [ 6727.620466] Call Trace:
> > [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> > [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> > [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> > [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> > [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> > [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> > [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> > [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> > [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> > [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> > [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> > [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> > [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> > [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> > [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> > [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> > [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> > [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> > [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> > [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> 
> 
> >
> >> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >> file system, the deadlock did not happen anymore.
> >
> > What's the workload and change exactly?
> 
> I wish I could simply send the patch, but until all the paper work is 
> done I'm not allowed to :(
> 
> The basic idea is:
> 
> 1) During mount and setting the super block from
> 
> static struct file_system_type fhgfs_fs_type =
> {
> 	.mount = fhgfs_mount,
> }
> 
> Then in fhgfs_mount():
> 
> bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> sb->s_bdi = &sbInfo->bdi;
> 
> 
> 
> 2) When new (S_IFREG) inodes are allocated, for example from
> 
> static struct inode_operations fhgfs_dir_inode_ops
> {
> 	.lookup,
> 	.create,
> 	.link
> }
> 
> inode->i_data.backing_dev_info = &sbInfo->bdi;

Ah, since you didn't register the "fhgfs" bdi, there was no
dedicated flusher thread doing the writeout.  Which is obviously
suboptimal.
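
Without a registered bdi, the inodes of a non-block filesystem end up on the
global default_backing_dev_info; IIRC the relevant bit is in
inode_init_always(), roughly (quoting from memory, not an exact copy):

	mapping->backing_dev_info = &default_backing_dev_info;

	/*
	 * If the block_device provides a backing_dev_info for client
	 * inodes then use that.  Otherwise the inode shares the bdev's
	 * backing_dev_info.
	 */
	if (sb->s_bdev) {
		struct backing_dev_info *bdi;

		bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
		mapping->backing_dev_info = bdi;
	}

And default_backing_dev_info is serviced by the bdi-default forker thread
rather than by a per-bdi flush-* thread.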

> >> So my question is simply if we should expect this deadlock, if the file
> >> system does not set up backing device information and if so, shouldn't
> >> this be documented?
> >
> > Such deadlock is not expected..
> 
> Ok thanks, then we should figure out why it happens. Due to a network 
> outage here I won't have time before Monday to track down which kernel 
> version introduced it, though.

I suspect it goes back to when per-bdi writeback was introduced, which was a long time ago.

Thanks,
Fengguang


* Re: deadlock balance_dirty_pages() to be expected?
  2011-10-07 14:21     ` Wu Fengguang
@ 2011-10-07 14:30       ` Bernd Schubert
  2011-10-07 14:38         ` Wu Fengguang
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2011-10-07 14:30 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Peter Zijlstra

On 10/07/2011 04:21 PM, Wu Fengguang wrote:
> On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
>> Hello Fengguang,
>>
>> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
>>> Hi Bernd,
>>>
>>> On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
>>>> Hello,
>>>>
>>>> while I'm working on the page cached mode in FhGFS (*) I noticed a
>>>> deadlock in balance_dirty_pages().
>>>>
>>>> sysrq-w showed that it never started background write-out due to
>>>>
>>>> if (bdi_nr_reclaimable>   bdi_thresh) {
>>>> 	pages_written += writeback_inodes_wb(&bdi->wb,
>>>> 					    (write_chunk);
>>>>
>>>>
>>>> and therefore also did not leave that loop with
>>>>
>>>> 	if (pages_written>= write_chunk)
>>>>     				break;	/* We've done our duty */
>>>>
>>>>
>>>> So my process stay in uninterruptible D-state forever.
>>>
>>> If writeback_inodes_wb() is not triggered, the process should still be
>>> able to proceed, presumably with longer delays, but never stuck forever.
>>> That's because the flusher thread should still be cleaning the pages
>>> in the background which will knock down the dirty pages and eventually
>>> unthrottle the dirtier process.
>>
>> Hmm, that does not seem to work:
>>
>> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M
>> count=100
>
> That's normal: dd will be in D state in the vast majority time, but
> the point is, one single balance_dirty_pages() call should not take
> forever time, and dd should be able to go out of the D state (and
> re-enter it almost immediately) from time to time.
>
>> So the process is in D state ever since I wrote the first mail, just for
>> 100MB writes. Even if it still would do something, it would be extremely
>> slow. Sysrq-w then shows:
>
> So it's normal to catch such trace for 99% times.  But do you mean the
> writeout bandwidth is lower than expected?

If it really is still doing something, it is *way* slower. Once I added 
bdi support, it finished writing the 100MB file in my kvm test instance 
within a few seconds. Right now it has been running for hours already... And 
since I added a dump_stack() to our writepages() method, I can also see that 
this function is never called.
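
(The debug hook itself is nothing fancy, just something along these lines; 
the names here are placeholders, not our actual code:)

static int FhgfsOps_writepages(struct address_space *mapping,
			       struct writeback_control *wbc)
{
	dump_stack();	/* log who (if anybody) triggers writeback for us */

	return fhgfs_do_writepages(mapping, wbc);	/* placeholder for the real work */
}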

>
>>> [ 6727.616976] SysRq : Show Blocked State
>>> [ 6727.617575]   task                        PC stack   pid father
>>> [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
>>> [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
>>> [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
>>> [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
>>> [ 6727.620466] Call Trace:
>>> [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
>>> [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
>>> [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
>>> [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
>>> [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
>>> [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
>>> [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
>>> [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
>>> [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
>>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
>>> [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
>>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
>>> [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
>>> [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
>>> [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
>>> [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
>>> [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
>>> [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
>>> [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
>>> [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
>>
>>
>>>
>>>> Once I added basic inode->i_data.backing_dev_info bdi support to our
>>>> file system, the deadlock did not happen anymore.
>>>
>>> What's the workload and change exactly?
>>
>> I wish I could simply send the patch, but until all the paper work is
>> done I'm not allowed to :(
>>
>> The basic idea is:
>>
>> 1) During mount and setting the super block from
>>
>> static struct file_system_type fhgfs_fs_type =
>> {
>> 	.mount = fhgfs_mount,
>> }
>>
>> Then in fhgfs_mount():
>>
>> bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
>> sb->s_bdi =&sbInfo->bdi;
>>
>>
>>
>> 2) When new (S_IFREG) inodes are allocated, for example from
>>
>> static struct inode_operations fhgfs_dir_inode_ops
>> {
>> 	.lookup,
>> 	.create,
>> 	.link
>> }
>>
>> inode->i_data.backing_dev_info =&sbInfo->bdi;
>
> Ah when you didn't register the "fhgfs" bdi, there should be no
> dedicated flusher thread for doing the writeout.  Which is obviously
> suboptimal.
>
>>>> So my question is simply if we should expect this deadlock, if the file
>>>> system does not set up backing device information and if so, shouldn't
>>>> this be documented?
>>>
>>> Such deadlock is not expected..
>>
>> Ok thanks, then we should figure out why it happens. Due to a network
>> outage here I won't have time before Monday to track down which kernel
>> version introduced it, though.
>
> It's long time ago when the per-bdi writeback is introduced, I suspect.

Ok, I can start by testing whether 2.6.32 already deadlocks as well.

Thanks,
Bernd


* Re: deadlock balance_dirty_pages() to be expected?
  2011-10-07 14:30       ` Bernd Schubert
@ 2011-10-07 14:38         ` Wu Fengguang
  2011-10-11 14:55           ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2011-10-07 14:38 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Peter Zijlstra

On Fri, Oct 07, 2011 at 10:30:18PM +0800, Bernd Schubert wrote:
> On 10/07/2011 04:21 PM, Wu Fengguang wrote:
> > On Fri, Oct 07, 2011 at 10:08:06PM +0800, Bernd Schubert wrote:
> >> Hello Fengguang,
> >>
> >> On 10/07/2011 03:37 PM, Wu Fengguang wrote:
> >>> Hi Bernd,
> >>>
> >>> On Fri, Oct 07, 2011 at 08:34:33PM +0800, Bernd Schubert wrote:
> >>>> Hello,
> >>>>
> >>>> while I'm working on the page cached mode in FhGFS (*) I noticed a
> >>>> deadlock in balance_dirty_pages().
> >>>>
> >>>> sysrq-w showed that it never started background write-out due to
> >>>>
> >>>> if (bdi_nr_reclaimable>   bdi_thresh) {
> >>>> 	pages_written += writeback_inodes_wb(&bdi->wb,
> >>>> 					    (write_chunk);
> >>>>
> >>>>
> >>>> and therefore also did not leave that loop with
> >>>>
> >>>> 	if (pages_written>= write_chunk)
> >>>>     				break;	/* We've done our duty */
> >>>>
> >>>>
> >>>> So my process stay in uninterruptible D-state forever.
> >>>
> >>> If writeback_inodes_wb() is not triggered, the process should still be
> >>> able to proceed, presumably with longer delays, but never stuck forever.
> >>> That's because the flusher thread should still be cleaning the pages
> >>> in the background which will knock down the dirty pages and eventually
> >>> unthrottle the dirtier process.
> >>
> >> Hmm, that does not seem to work:
> >>
> >> 1330 pts/0    D+     0:13 dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M
> >> count=100
> >
> > That's normal: dd will be in D state in the vast majority time, but
> > the point is, one single balance_dirty_pages() call should not take
> > forever time, and dd should be able to go out of the D state (and
> > re-enter it almost immediately) from time to time.
> >
> >> So the process is in D state ever since I wrote the first mail, just for
> >> 100MB writes. Even if it still would do something, it would be extremely
> >> slow. Sysrq-w then shows:
> >
> > So it's normal to catch such trace for 99% times.  But do you mean the
> > writeout bandwidth is lower than expected?
> 
> If it really is still doing something, it is *ways* slower. Once I added 
> bdi support, it finishes to write the 100MB file in my kvm test instance 
> within a few seconds. Right now it is running for hours already... As I 
> added a dump_stack() to our writepages() method, I also see that this 
> function is never called.

In your case it should be the default/forker thread that's doing the
(suboptimal) writeout: 

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        17  0.0  0.0      0     0 ?        S    21:12   0:00 [bdi-default]

In the normal case there are flush-* threads doing the writeout:

root      1146  0.0  0.0      0     0 ?        S    21:12   0:00 [flush-8:0]

> >
> >>> [ 6727.616976] SysRq : Show Blocked State
> >>> [ 6727.617575]   task                        PC stack   pid father
> >>> [ 6727.618252] dd              D 0000000000000000  3544  1330   1306 0x00000000
> >>> [ 6727.619002]  ffff88000ddfb9a8 0000000000000046 ffffffff81398627 0000000000000046
> >>> [ 6727.620157]  0000000000000000 ffff88000ddfa000 ffff88000ddfa000 ffff88000ddfbfd8
> >>> [ 6727.620466]  ffff88000ddfa010 ffff88000ddfa000 ffff88000ddfbfd8 ffff88000ddfa000
> >>> [ 6727.620466] Call Trace:
> >>> [ 6727.620466]  [<ffffffff81398627>] ? __schedule+0x697/0x7e0
> >>> [ 6727.620466]  [<ffffffff8109be70>] ? trace_hardirqs_on_caller+0x20/0x1b0
> >>> [ 6727.620466]  [<ffffffff8139884f>] schedule+0x3f/0x60
> >>> [ 6727.620466]  [<ffffffff81398c44>] schedule_timeout+0x164/0x2f0
> >>> [ 6727.620466]  [<ffffffff81070930>] ? lock_timer_base+0x70/0x70
> >>> [ 6727.620466]  [<ffffffff81397bc9>] io_schedule_timeout+0x69/0x90
> >>> [ 6727.620466]  [<ffffffff81109854>] balance_dirty_pages_ratelimited_nr+0x234/0x640
> >>> [ 6727.620466]  [<ffffffff8110070f>] ? iov_iter_copy_from_user_atomic+0xaf/0x180
> >>> [ 6727.620466]  [<ffffffff811009ae>] generic_file_buffered_write+0x1ce/0x270
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff81101358>] __generic_file_aio_write+0x238/0x460
> >>> [ 6727.620466]  [<ffffffff811015dc>] ? generic_file_aio_write+0x5c/0xf0
> >>> [ 6727.620466]  [<ffffffff811015f8>] generic_file_aio_write+0x78/0xf0
> >>> [ 6727.620466]  [<ffffffffa034f539>] FhgfsOps_aio_write+0xdc/0x144 [fhgfs]
> >>> [ 6727.620466]  [<ffffffff8115af8a>] do_sync_write+0xda/0x120
> >>> [ 6727.620466]  [<ffffffff8112146c>] ? might_fault+0x9c/0xb0
> >>> [ 6727.620466]  [<ffffffff8115b4b8>] vfs_write+0xc8/0x180
> >>> [ 6727.620466]  [<ffffffff8115b661>] sys_write+0x51/0x90
> >>> [ 6727.620466]  [<ffffffff813a3702>] system_call_fastpath+0x16/0x1b
> >>> [ 6727.620466] Sched Debug Version: v0.10, 3.1.0-rc9+ #47
> >>
> >>
> >>>
> >>>> Once I added basic inode->i_data.backing_dev_info bdi support to our
> >>>> file system, the deadlock did not happen anymore.
> >>>
> >>> What's the workload and change exactly?
> >>
> >> I wish I could simply send the patch, but until all the paper work is
> >> done I'm not allowed to :(
> >>
> >> The basic idea is:
> >>
> >> 1) During mount and setting the super block from
> >>
> >> static struct file_system_type fhgfs_fs_type =
> >> {
> >> 	.mount = fhgfs_mount,
> >> }
> >>
> >> Then in fhgfs_mount():
> >>
> >> bdi_setup_and_register(&sbInfo->bdi, "fhgfs", BDI_CAP_MAP_COPY);
> >> sb->s_bdi =&sbInfo->bdi;
> >>
> >>
> >>
> >> 2) When new (S_IFREG) inodes are allocated, for example from
> >>
> >> static struct inode_operations fhgfs_dir_inode_ops
> >> {
> >> 	.lookup,
> >> 	.create,
> >> 	.link
> >> }
> >>
> >> inode->i_data.backing_dev_info =&sbInfo->bdi;
> >
> > Ah when you didn't register the "fhgfs" bdi, there should be no
> > dedicated flusher thread for doing the writeout.  Which is obviously
> > suboptimal.
> >
> >>>> So my question is simply if we should expect this deadlock, if the file
> >>>> system does not set up backing device information and if so, shouldn't
> >>>> this be documented?
> >>>
> >>> Such deadlock is not expected..
> >>
> >> Ok thanks, then we should figure out why it happens. Due to a network
> >> outage here I won't have time before Monday to track down which kernel
> >> version introduced it, though.
> >
> > It's long time ago when the per-bdi writeback is introduced, I suspect.
> 
> Ok, I can start to test if 2.6.32 also already deadlocks.

I found the commit; it was introduced right in .32, hehe.

commit 03ba3782e8dcc5b0e1efe440d33084f066e38cae
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Wed Sep 9 09:08:54 2009 +0200

    writeback: switch to per-bdi threads for flushing data
    
    This gets rid of pdflush for bdi writeout and kupdated style cleaning.

Thanks,
Fengguang


* Re: deadlock balance_dirty_pages() to be expected?
  2011-10-07 14:38         ` Wu Fengguang
@ 2011-10-11 14:55           ` Bernd Schubert
  2011-10-12  1:45             ` Wu Fengguang
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2011-10-11 14:55 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Peter Zijlstra

>>>
>>>> So the process is in D state ever since I wrote the first mail, just for
>>>> 100MB writes. Even if it still would do something, it would be extremely
>>>> slow. Sysrq-w then shows:
>>>
>>> So it's normal to catch such trace for 99% times.  But do you mean the
>>> writeout bandwidth is lower than expected?
>>
>> If it really is still doing something, it is *ways* slower. Once I added
>> bdi support, it finishes to write the 100MB file in my kvm test instance
>> within a few seconds. Right now it is running for hours already... As I
>> added a dump_stack() to our writepages() method, I also see that this
>> function is never called.
>
> In your case it should be the default/forker thread that's doing the
> (suboptimal) writeout:
>
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root        17  0.0  0.0      0     0 ?        S    21:12   0:00 [bdi-default]

Ah, thanks. Good to know, I will start looking into why this isn't doing anything.

>
> In normal cases there are the flush-* threads doing the writeout:
>
> root      1146  0.0  0.0      0     0 ?        S    21:12   0:00 [flush-8:0]
>
>>>

[...]

>>> It's long time ago when the per-bdi writeback is introduced, I suspect.
>>
>> Ok, I can start to test if 2.6.32 also already deadlocks.
>
> I found the commit, it's introduced right in .32, hehe.
>
> commit 03ba3782e8dcc5b0e1efe440d33084f066e38cae
> Author: Jens Axboe<jens.axboe@oracle.com>
> Date:   Wed Sep 9 09:08:54 2009 +0200
>
>      writeback: switch to per-bdi threads for flushing data
>
>      This gets rid of pdflush for bdi writeout and kupdated style cleaning.

Yeah, that is why I wrote 2.6.32. But I just tested it: with 2.6.32 it 
also works without the additional bdi code. Sometime later this week I will 
try to figure out the exact kernel version, and then the commit, that caused 
the issue (2.6.35 had several bdi additions, I think).


Cheers,
Bernd



* Re: deadlock balance_dirty_pages() to be expected?
  2011-10-11 14:55           ` Bernd Schubert
@ 2011-10-12  1:45             ` Wu Fengguang
  0 siblings, 0 replies; 8+ messages in thread
From: Wu Fengguang @ 2011-10-12  1:45 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-fsdevel@vger.kernel.org, Jan Kara, Peter Zijlstra

On Tue, Oct 11, 2011 at 10:55:37PM +0800, Bernd Schubert wrote:
> >>>
> >>>> So the process is in D state ever since I wrote the first mail, just for
> >>>> 100MB writes. Even if it still would do something, it would be extremely
> >>>> slow. Sysrq-w then shows:
> >>>
> >>> So it's normal to catch such trace for 99% times.  But do you mean the
> >>> writeout bandwidth is lower than expected?
> >>
> >> If it really is still doing something, it is *ways* slower. Once I added
> >> bdi support, it finishes to write the 100MB file in my kvm test instance
> >> within a few seconds. Right now it is running for hours already... As I
> >> added a dump_stack() to our writepages() method, I also see that this
> >> function is never called.
> >
> > In your case it should be the default/forker thread that's doing the
> > (suboptimal) writeout:
> >
> > USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > root        17  0.0  0.0      0     0 ?        S    21:12   0:00 [bdi-default]
> 
> Ah thanks. Good to know, I will starting why this isn't doing anything.

The forker thread is simply not designed to do efficient and
well-behaved flushing.  So I'd suggest leaving it as is and focusing on
ensuring that the flush-XXX thread is started for your bdi.

> > In normal cases there are the flush-* threads doing the writeout:
> >
> > root      1146  0.0  0.0      0     0 ?        S    21:12   0:00 [flush-8:0]
> >
> >>>
> 
> [...]
> 
> >>> It's long time ago when the per-bdi writeback is introduced, I suspect.
> >>
> >> Ok, I can start to test if 2.6.32 also already deadlocks.
> >
> > I found the commit, it's introduced right in .32, hehe.
> >
> > commit 03ba3782e8dcc5b0e1efe440d33084f066e38cae
> > Author: Jens Axboe<jens.axboe@oracle.com>
> > Date:   Wed Sep 9 09:08:54 2009 +0200
> >
> >      writeback: switch to per-bdi threads for flushing data
> >
> >      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
> 
> Yeah, that is why I wrote 2.6.32. But I just tested it - with 2.6.32 it 
> also works without additional bdi code. Sometime later this week I will 
> try to figure out the exact kernel and then commit causing the issue 
> (2.6.35 had several bdi additions I think).

OK, thanks for the feedback!

Thanks,
Fengguang

