Re: [patch 03/22] fix deadlock in balance_dirty

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [patch 03/22] fix deadlock in balance_dirty_pages
       [not found] ` <20070227223911.472192712@szeredi.hu>
@ 2007-03-01  6:58   ` Andrew Morton
  2007-03-01  7:35     ` Miklos Szeredi
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2007-03-01  6:58 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel

On Tue, 27 Feb 2007 23:38:12 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> This deadlock happens, when dirty pages from one filesystem are
> written back through another filesystem.  It easiest to demonstrate
> with fuse although it could affect looback mounts as well (see
> following patches).
> 
> Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> writing to A, and process Pr_b is writing to B.
> 
> Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> (fusexmp_fh), for simplicity let's assume that Pr_b is single
> threaded.
> 
> These are the simplified stack traces of these processes after the
> deadlock:
> 
> Pr_a (bash-shared-mapping):
> 
>   (block on queue)
>   fuse_writepage
>   generic_writepages
>   writeback_inodes
>   balance_dirty_pages
>   balance_dirty_pages_ratelimited_nr
>   set_page_dirty_mapping_balance
>   do_no_page
> 
> 
> Pr_b (fusexmp_fh):
> 
>   io_schedule_timeout
>   congestion_wait
>   balance_dirty_pages
>   balance_dirty_pages_ratelimited_nr
>   generic_file_buffered_write
>   generic_file_aio_write
>   ext3_file_write
>   do_sync_write
>   vfs_write
>   sys_pwrite64
> 
> 
> Thanks to the aggressive nature of Pr_a, it can happen, that
> 
>   nr_file_dirty > dirty_thresh + margin
> 
> This is due to both nr_dirty growing and dirty_thresh shrinking, which
> in turn is due to nr_file_mapped rapidly growing.  The exact size of
> the margin at which the deadlock happens is not known, but it's around
> 100 pages.
> 
> At this point Pr_a enters balance_dirty_pages and starts to write back
> some if it's dirty pages.  After submitting some requests, it blocks
> on the request queue.
> 
> The first write request will trigger Pr_b to perform a write()
> syscall.  This will submit a write request to the block device and
> then may enter balance_dirty_pages().
> 
> The condition for exiting balance_dirty_pages() is
> 
>  - either that write_chunk pages have been written
> 
>  - or nr_file_dirty + nr_writeback < dirty_thresh
> 
> It is entirely possible that less than write_chunk pages were written,
> in which case balance_dirty_pages() will not exit even after all the
> submitted requests have been succesfully completed.
> 
> Which means that the write() syscall does not return.

But the balance_dirty_pages() loop does more than just wait for those two
conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
should be feeding more of file A's pages into writepage.

Why isn't that happening?

> Which means, that no more dirty pages from A will be written back, and
> neither nr_writeback nor nr_file_dirty will decrease.
> 
> Which means, that balance_dirty_pages() will loop forever.
> 
> Q.E.D.
> 
> The solution is to exit balance_dirty_pages() on the condition, that
> there are only a few dirty + writeback pages for this backing dev.  This
> makes sure, that there is always some progress with this setup.
> 
> The number of outstanding dirty + written pages is limited to 8, which
> means that when over the threshold (dirty_exceeded == 1), each
> filesystem may only effectively pin a maximum of 16 (+8 because of
> ratelimiting) extra pages.
> 
> Note: a similar safety vent is always needed if there's a global limit
> for the dirty+writeback pages, even if in the future there will be
> some per-queue (or other) soft limit.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
> 
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
> +++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
> @@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
>  		if (!dirty_exceeded)
>  			dirty_exceeded = 1;
>  
> +		/*
> +		 * Acquit producer of dirty pages if there's little or
> +		 * nothing to write back to this particular queue.
> +		 *
> +		 * Without this check a deadlock is possible for if
> +		 * one filesystem is writing data through another.
> +		 */
> +		if (atomic_long_read(&bdi->nr_dirty) +
> +		    atomic_long_read(&bdi->nr_writeback) < 8)
> +			break;
> +
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
>  		 * filesystems (i.e. NFS) in which data may have been
> 
> --

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 03/22] fix deadlock in balance_dirty_pages
  2007-03-01  6:58   ` [patch 03/22] fix deadlock in balance_dirty_pages Andrew Morton
@ 2007-03-01  7:35     ` Miklos Szeredi
  2007-03-01  8:27       ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Miklos Szeredi @ 2007-03-01  7:35 UTC (permalink / raw)
  To: akpm; +Cc: miklos, linux-kernel, linux-fsdevel

> > This deadlock happens, when dirty pages from one filesystem are
> > written back through another filesystem.  It easiest to demonstrate
> > with fuse although it could affect looback mounts as well (see
> > following patches).
> > 
> > Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> > writing to A, and process Pr_b is writing to B.
> > 
> > Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> > (fusexmp_fh), for simplicity let's assume that Pr_b is single
> > threaded.
> > 
> > These are the simplified stack traces of these processes after the
> > deadlock:
> > 
> > Pr_a (bash-shared-mapping):
> > 
> >   (block on queue)
> >   fuse_writepage
> >   generic_writepages
> >   writeback_inodes
> >   balance_dirty_pages
> >   balance_dirty_pages_ratelimited_nr
> >   set_page_dirty_mapping_balance
> >   do_no_page
> > 
> > 
> > Pr_b (fusexmp_fh):
> > 
> >   io_schedule_timeout
> >   congestion_wait
> >   balance_dirty_pages
> >   balance_dirty_pages_ratelimited_nr
> >   generic_file_buffered_write
> >   generic_file_aio_write
> >   ext3_file_write
> >   do_sync_write
> >   vfs_write
> >   sys_pwrite64
> > 
> > 
> > Thanks to the aggressive nature of Pr_a, it can happen, that
> > 
> >   nr_file_dirty > dirty_thresh + margin
> > 
> > This is due to both nr_dirty growing and dirty_thresh shrinking, which
> > in turn is due to nr_file_mapped rapidly growing.  The exact size of
> > the margin at which the deadlock happens is not known, but it's around
> > 100 pages.
> > 
> > At this point Pr_a enters balance_dirty_pages and starts to write back
> > some if it's dirty pages.  After submitting some requests, it blocks
> > on the request queue.
> > 
> > The first write request will trigger Pr_b to perform a write()
> > syscall.  This will submit a write request to the block device and
> > then may enter balance_dirty_pages().
> > 
> > The condition for exiting balance_dirty_pages() is
> > 
> >  - either that write_chunk pages have been written
> > 
> >  - or nr_file_dirty + nr_writeback < dirty_thresh
> > 
> > It is entirely possible that less than write_chunk pages were written,
> > in which case balance_dirty_pages() will not exit even after all the
> > submitted requests have been succesfully completed.
> > 
> > Which means that the write() syscall does not return.
> 
> But the balance_dirty_pages() loop does more than just wait for those two
> conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
> should be feeding more of file A's pages into writepage.
> 
> Why isn't that happening?

All of A's data is actually written by B.  So just submitting more
pages to some queue doesn't help, it will just make the queue longer.

If the queue length were not limited, and B would have limitless
threads, and the write() wouldn't exclude other writes to the same
file (i_mutex), then there would be no deadlock.

But for fuse the first and the last condition isn't met.

For the loop device the second condition isn't met, loop is single
threaded.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 03/22] fix deadlock in balance_dirty_pages
  2007-03-01  7:35     ` Miklos Szeredi
@ 2007-03-01  8:27       ` Andrew Morton
  2007-03-01  8:37         ` Miklos Szeredi
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2007-03-01  8:27 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel

On Thu, 01 Mar 2007 08:35:28 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > This deadlock happens, when dirty pages from one filesystem are
> > > written back through another filesystem.  It easiest to demonstrate
> > > with fuse although it could affect looback mounts as well (see
> > > following patches).
> > > 
> > > Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> > > writing to A, and process Pr_b is writing to B.
> > > 
> > > Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> > > (fusexmp_fh), for simplicity let's assume that Pr_b is single
> > > threaded.
> > > 
> > > These are the simplified stack traces of these processes after the
> > > deadlock:
> > > 
> > > Pr_a (bash-shared-mapping):
> > > 
> > >   (block on queue)
> > >   fuse_writepage
> > >   generic_writepages
> > >   writeback_inodes
> > >   balance_dirty_pages
> > >   balance_dirty_pages_ratelimited_nr
> > >   set_page_dirty_mapping_balance
> > >   do_no_page
> > > 
> > > 
> > > Pr_b (fusexmp_fh):
> > > 
> > >   io_schedule_timeout
> > >   congestion_wait
> > >   balance_dirty_pages
> > >   balance_dirty_pages_ratelimited_nr
> > >   generic_file_buffered_write
> > >   generic_file_aio_write
> > >   ext3_file_write
> > >   do_sync_write
> > >   vfs_write
> > >   sys_pwrite64
> > > 
> > > 
> > > Thanks to the aggressive nature of Pr_a, it can happen, that
> > > 
> > >   nr_file_dirty > dirty_thresh + margin
> > > 
> > > This is due to both nr_dirty growing and dirty_thresh shrinking, which
> > > in turn is due to nr_file_mapped rapidly growing.  The exact size of
> > > the margin at which the deadlock happens is not known, but it's around
> > > 100 pages.
> > > 
> > > At this point Pr_a enters balance_dirty_pages and starts to write back
> > > some if it's dirty pages.  After submitting some requests, it blocks
> > > on the request queue.
> > > 
> > > The first write request will trigger Pr_b to perform a write()
> > > syscall.  This will submit a write request to the block device and
> > > then may enter balance_dirty_pages().
> > > 
> > > The condition for exiting balance_dirty_pages() is
> > > 
> > >  - either that write_chunk pages have been written
> > > 
> > >  - or nr_file_dirty + nr_writeback < dirty_thresh
> > > 
> > > It is entirely possible that less than write_chunk pages were written,
> > > in which case balance_dirty_pages() will not exit even after all the
> > > submitted requests have been succesfully completed.
> > > 
> > > Which means that the write() syscall does not return.
> > 
> > But the balance_dirty_pages() loop does more than just wait for those two
> > conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
> > should be feeding more of file A's pages into writepage.
> > 
> > Why isn't that happening?
> 
> All of A's data is actually written by B.  So just submitting more
> pages to some queue doesn't help, it will just make the queue longer.
> 
> If the queue length were not limited, and B would have limitless
> threads, and the write() wouldn't exclude other writes to the same
> file (i_mutex), then there would be no deadlock.
> 
> But for fuse the first and the last condition isn't met.
> 
> For the loop device the second condition isn't met, loop is single
> threaded.

Sigh.  What's this about i_mutex?  That appears to be some critical
information which _still_ isn't being communicated.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 03/22] fix deadlock in balance_dirty_pages
  2007-03-01  8:27       ` Andrew Morton
@ 2007-03-01  8:37         ` Miklos Szeredi
  2007-03-01  8:41           ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Miklos Szeredi @ 2007-03-01  8:37 UTC (permalink / raw)
  To: akpm; +Cc: miklos, linux-kernel, linux-fsdevel

> > > > This deadlock happens, when dirty pages from one filesystem are
> > > > written back through another filesystem.  It easiest to demonstrate
> > > > with fuse although it could affect looback mounts as well (see
> > > > following patches).
> > > > 
> > > > Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> > > > writing to A, and process Pr_b is writing to B.
> > > > 
> > > > Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> > > > (fusexmp_fh), for simplicity let's assume that Pr_b is single
> > > > threaded.
> > > > 
> > > > These are the simplified stack traces of these processes after the
> > > > deadlock:
> > > > 
> > > > Pr_a (bash-shared-mapping):
> > > > 
> > > >   (block on queue)
> > > >   fuse_writepage
> > > >   generic_writepages
> > > >   writeback_inodes
> > > >   balance_dirty_pages
> > > >   balance_dirty_pages_ratelimited_nr
> > > >   set_page_dirty_mapping_balance
> > > >   do_no_page
> > > > 
> > > > 
> > > > Pr_b (fusexmp_fh):
> > > > 
> > > >   io_schedule_timeout
> > > >   congestion_wait
> > > >   balance_dirty_pages
> > > >   balance_dirty_pages_ratelimited_nr
> > > >   generic_file_buffered_write
> > > >   generic_file_aio_write
> > > >   ext3_file_write
> > > >   do_sync_write
> > > >   vfs_write
> > > >   sys_pwrite64
> > > > 
> > > > 
> > > > Thanks to the aggressive nature of Pr_a, it can happen, that
> > > > 
> > > >   nr_file_dirty > dirty_thresh + margin
> > > > 
> > > > This is due to both nr_dirty growing and dirty_thresh shrinking, which
> > > > in turn is due to nr_file_mapped rapidly growing.  The exact size of
> > > > the margin at which the deadlock happens is not known, but it's around
> > > > 100 pages.
> > > > 
> > > > At this point Pr_a enters balance_dirty_pages and starts to write back
> > > > some if it's dirty pages.  After submitting some requests, it blocks
> > > > on the request queue.
> > > > 
> > > > The first write request will trigger Pr_b to perform a write()
> > > > syscall.  This will submit a write request to the block device and
> > > > then may enter balance_dirty_pages().
> > > > 
> > > > The condition for exiting balance_dirty_pages() is
> > > > 
> > > >  - either that write_chunk pages have been written
> > > > 
> > > >  - or nr_file_dirty + nr_writeback < dirty_thresh
> > > > 
> > > > It is entirely possible that less than write_chunk pages were written,
> > > > in which case balance_dirty_pages() will not exit even after all the
> > > > submitted requests have been succesfully completed.
> > > > 
> > > > Which means that the write() syscall does not return.
> > > 
> > > But the balance_dirty_pages() loop does more than just wait for those two
> > > conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
> > > should be feeding more of file A's pages into writepage.
> > > 
> > > Why isn't that happening?
> > 
> > All of A's data is actually written by B.  So just submitting more
> > pages to some queue doesn't help, it will just make the queue longer.
> > 
> > If the queue length were not limited, and B would have limitless
> > threads, and the write() wouldn't exclude other writes to the same
> > file (i_mutex), then there would be no deadlock.
> > 
> > But for fuse the first and the last condition isn't met.
> > 
> > For the loop device the second condition isn't met, loop is single
> > threaded.
> 
> Sigh.  What's this about i_mutex?  That appears to be some critical
> information which _still_ isn't being communicated.
> 

This:

ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
		unsigned long nr_segs, loff_t pos)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
	ssize_t ret;

	BUG_ON(iocb->ki_pos != pos);

	mutex_lock(&inode->i_mutex);
	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
			&iocb->ki_pos);
	mutex_unlock(&inode->i_mutex);


It's in the stack trace.  I thought it was obvious.

Miklos

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 03/22] fix deadlock in balance_dirty_pages
  2007-03-01  8:37         ` Miklos Szeredi
@ 2007-03-01  8:41           ` Andrew Morton
  2007-03-01  8:58             ` Miklos Szeredi
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2007-03-01  8:41 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel

On Thu, 01 Mar 2007 09:37:06 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > Sigh.  What's this about i_mutex?  That appears to be some critical
> > information which _still_ isn't being communicated.
> > 
> 
> This:
> 
> ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> 		unsigned long nr_segs, loff_t pos)
> {
> 	struct file *file = iocb->ki_filp;
> 	struct address_space *mapping = file->f_mapping;
> 	struct inode *inode = mapping->host;
> 	ssize_t ret;
> 
> 	BUG_ON(iocb->ki_pos != pos);
> 
> 	mutex_lock(&inode->i_mutex);
> 	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> 			&iocb->ki_pos);
> 	mutex_unlock(&inode->i_mutex);
> 
> 
> It's in the stack trace.  I thought it was obvious.

No, it is not obvious.

I'm just not going to apply weird hacks to work around a bug which
I do not understand, and I have spent way too much time trying to understand
this one.

So let us persist.

Please fully describe the role of i_mutex in this hang.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 03/22] fix deadlock in balance_dirty_pages
  2007-03-01  8:41           ` Andrew Morton
@ 2007-03-01  8:58             ` Miklos Szeredi
  0 siblings, 0 replies; 9+ messages in thread
From: Miklos Szeredi @ 2007-03-01  8:58 UTC (permalink / raw)
  To: akpm; +Cc: miklos, linux-kernel, linux-fsdevel

> I'm just not going to apply weird hacks to work around a bug which
> I do not understand, and I have spent way too much time trying to understand
> this one.

I suggest you apply patch #5 "balance dirty pages from loop device"
and see for yourself the same deadlock with a simple loopback mount.

I beleive _not_ doing balance_dirty_pages() in loop is rather a bigger
hack, then mine ;)

> So let us persist.
> 
> Please fully describe the role of i_mutex in this hang.

OK.  Added description of the multithreaded case, with i_mutex's role:

+ What if Pr_b is multithreaded?  The first thread will enter
+ balance_dirty_pages() and loop there as shown above.  It will hold
+ i_mutex for the inode, taken in generic_file_aio_write().
+ 
+ The other theads now try to write back more data into the same file,
+ but will block on i_mutex.  So even with unlimited number of threads
+ no progress is made.

Thanks,
Miklos

From: Miklos Szeredi <mszeredi@suse.cz>

This deadlock happens, when dirty pages from one filesystem are
written back through another filesystem.  It easiest to demonstrate
with fuse although it could affect looback mounts as well (see
following patches).

Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
writing to A, and process Pr_b is writing to B.

Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
(fusexmp_fh), for simplicity let's assume that Pr_b is single
threaded.

These are the simplified stack traces of these processes after the
deadlock:

Pr_a (bash-shared-mapping):

  (block on queue)
  fuse_writepage
  generic_writepages
  writeback_inodes
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  set_page_dirty_mapping_balance
  do_no_page

Pr_b (fusexmp_fh):

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64

Thanks to the aggressive nature of Pr_a, it can happen, that

  nr_file_dirty > dirty_thresh + margin

This is due to both nr_dirty growing and dirty_thresh shrinking, which
in turn is due to nr_file_mapped rapidly growing.  The exact size of
the margin at which the deadlock happens is not known, but it's around
100 pages.

At this point Pr_a enters balance_dirty_pages and starts to write back
some if it's dirty pages.  After submitting some requests, it blocks
on the request queue.

The first write request will trigger Pr_b to perform a write()
syscall.  This will submit a write request to the block device and
then may enter balance_dirty_pages().

The condition for exiting balance_dirty_pages() is

 - either that write_chunk pages have been written

 - or nr_file_dirty + nr_writeback < dirty_thresh

It is entirely possible that less than write_chunk pages were written,
in which case balance_dirty_pages() will not exit even after all the
submitted requests have been succesfully completed.

Which means that the write() syscall does not return.

Which means, that no more dirty pages from A will be written back, and
neither nr_writeback nor nr_file_dirty will decrease.

Which means, that balance_dirty_pages() will loop forever.

What if Pr_b is multithreaded?  The first thread will enter
balance_dirty_pages() and loop there as shown above.  It will hold
i_mutex for the inode, taken in generic_file_aio_write().

The other theads now try to write back more data into the same file,
but will block on i_mutex.  So even with unlimited number of threads
no progress is made.

Q.E.D.

The solution is to exit balance_dirty_pages() on the condition, that
there are only a few dirty + writeback pages for this backing dev.  This
makes sure, that there is always some progress with this setup.

The number of outstanding dirty + written pages is limited to 8, which
means that when over the threshold (dirty_exceeded == 1), each
filesystem may only effectively pin a maximum of 16 (+8 because of
ratelimiting) extra pages.

Note: a similar safety vent is always needed if there's a global limit
for the dirty+writeback pages, even if in the future there will be
some per-queue (or other) soft limit.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
@@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;

+		/*
+		 * Acquit producer of dirty pages if there's little or
+		 * nothing to write back to this particular queue.
+		 *
+		 * Without this check a deadlock is possible for if
+		 * one filesystem is writing data through another.
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 8)
+			break;
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been

^ permalink raw reply	[flat|nested] 9+ messages in thread

[parent not found: <20070227223914.057085427@szeredi.hu>]

* Re: [patch 04/22] fix deadlock in throttle_vm_writeout
       [not found] ` <20070227223914.057085427@szeredi.hu>
@ 2007-03-01  7:11   ` Andrew Morton
  2007-03-01  7:48     ` Miklos Szeredi
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2007-03-01  7:11 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel

On Tue, 27 Feb 2007 23:38:13 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> This deadlock is similar to the one in balance_dirty_pages, but
> instead of waiting in balance_dirty_pages after submitting a write
> request, it happens during a memory allocation for filesystem B before
> submitting a write request.
> 
> It is easy to reproduce on a machine with not too much memory.
> E.g. try this on 2.6.21-rc1 UML with 32MB (works on physical hw as
> well):
> 
>   dd if=/dev/zero of=/tmp/tmp.img bs=1048576 count=40
>   mke2fs -j -F /tmp/tmp.img
>   mkdir /tmp/img
>   mount -oloop /tmp/tmp.img /tmp/img
>   bash-shared-mapping /tmp/img/foo 30000000
> 
> The deadlock doesn't happen immediately, sometimes only after a few
> minutes.
> 
> Simplified stack trace for bash-shared-mapping after the deadlock:
> 
>   io_schedule_timeout
>   congestion_wait
>   balance_dirty_pages
>   balance_dirty_pages_ratelimited_nr
>   generic_file_buffered_write
>   __generic_file_aio_write_nolock
>   generic_file_aio_write
>   ext3_file_write
>   do_sync_write
>   vfs_write
>   sys_pwrite64
> 
> and for [loop0]:
> 
>   io_schedule_timeout
>   congestion_wait
>   throttle_vm_writeout
>   shrink_zone
>   shrink_zones
>   try_to_free_pages
>   __alloc_pages
>   find_or_create_page
>   do_lo_send_aops
>   lo_send
>   do_bio_filebacked
>   loop_thread
> 
> The requirement for the deadlock is that
> 
>   nr_writeback > dirty_thresh * 1.1 + margin
> 
> Again margin seems to be in the 100 page range.
> 
> The task of throttle_vm_writeout is to limit the rate at which
> under-writeback pages are created due to swapping.  There's no other
> way direct reclaim can increase the nr_writeback + nr_file_dirty.
> 
> So when there are few or no under-swap pages, it is safe for this
> function to return.  This ensures, that there's progress with writing
> back dirty pages.
> 

Would this also be solved by the below just-submitted bugfix?  I guess not.

I think the basic problem here is that the loop thread is reponsible for cleaning
memory, but in throttle_vm_writeout(), the loop thread can get stuck waiting
for some other thread to clean memory, but that ain't going to happen.

throttle_vm_writeout() wasn't very well thought through, I suspect.


I suspect we can just delete throttle_vm_writeout() now.  The original
rationale was:

    [PATCH] vm: pageout throttling
    
    With silly pageout testcases it is possible to place huge amounts of memory
    under I/O.  With a large request queue (CFQ uses 8192 requests) it is
    possible to place _all_ memory under I/O at the same time.
    
    This means that all memory is pinned and unreclaimable and the VM gets
    upset and goes oom.
    
    The patch limits the amount of memory which is under pageout writeout to be
    a little more than the amount of memory at which balance_dirty_pages()
    callers will synchronously throttle.
    
    This means that heavy pageout activity can starve heavy writeback activity
    completely, but heavy writeback activity will not cause starvation of
    pageout.  Because we don't want a simple `dd' to be causing excessive
    latencies in page reclaim.

but now that we limit the amount of dirty MAP_SHARED memory, and given that
the various pieces of the dirty-memory limiting puzzle also take the number
of under-writeback pages into account, we should no longer be able to get
in the situation where the total number of writeback+dirty pages exceeds
dirty_ratio.



From: Andrew Morton <akpm@linux-foundation.org>

throttle_vm_writeout() is designed to wait for the dirty levels to subside. 
But if the caller holds IO or FS locks, we might be holding up that writeout.

So change it to take a single nap to give other devices a chance to clean some
memory, then return.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/writeback.h |    2 +-
 mm/page-writeback.c       |   13 +++++++++++--
 mm/vmscan.c               |    2 +-
 3 files changed, 13 insertions(+), 4 deletions(-)

diff -puN include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations include/linux/writeback.h
--- a/include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
+++ a/include/linux/writeback.h
@@ -84,7 +84,7 @@ static inline void wait_on_inode(struct 
 int wakeup_pdflush(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(void);
+void throttle_vm_writeout(gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -puN mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/page-writeback.c
--- a/mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
+++ a/mm/page-writeback.c
@@ -296,11 +296,21 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(void)
+void throttle_vm_writeout(gfp_t gfp_mask)
 {
 	long background_thresh;
 	long dirty_thresh;
 
+	if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
+		/*
+		 * The caller might hold locks which can prevent IO completion
+		 * or progress in the filesystem.  So we cannot just sit here
+		 * waiting for IO to complete.
+		 */
+		congestion_wait(WRITE, HZ/10);
+		return;
+	}
+
         for ( ; ; ) {
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
 
@@ -317,7 +327,6 @@ void throttle_vm_writeout(void)
         }
 }
 
-
 /*
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
diff -puN mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/vmscan.c
--- a/mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
+++ a/mm/vmscan.c
@@ -952,7 +952,7 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
-	throttle_vm_writeout();
+	throttle_vm_writeout(sc->gfp_mask);
 
 	atomic_dec(&zone->reclaim_in_progress);
 	return nr_reclaimed;
_


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 04/22] fix deadlock in throttle_vm_writeout
  2007-03-01  7:11   ` [patch 04/22] fix deadlock in throttle_vm_writeout Andrew Morton
@ 2007-03-01  7:48     ` Miklos Szeredi
  0 siblings, 0 replies; 9+ messages in thread
From: Miklos Szeredi @ 2007-03-01  7:48 UTC (permalink / raw)
  To: akpm; +Cc: miklos, linux-kernel, linux-fsdevel

> > From: Miklos Szeredi <mszeredi@suse.cz>
> > 
> > This deadlock is similar to the one in balance_dirty_pages, but
> > instead of waiting in balance_dirty_pages after submitting a write
> > request, it happens during a memory allocation for filesystem B before
> > submitting a write request.
> > 
> > It is easy to reproduce on a machine with not too much memory.
> > E.g. try this on 2.6.21-rc1 UML with 32MB (works on physical hw as
> > well):
> > 
> >   dd if=/dev/zero of=/tmp/tmp.img bs=1048576 count=40
> >   mke2fs -j -F /tmp/tmp.img
> >   mkdir /tmp/img
> >   mount -oloop /tmp/tmp.img /tmp/img
> >   bash-shared-mapping /tmp/img/foo 30000000
> > 
> > The deadlock doesn't happen immediately, sometimes only after a few
> > minutes.
> > 
> > Simplified stack trace for bash-shared-mapping after the deadlock:
> > 
> >   io_schedule_timeout
> >   congestion_wait
> >   balance_dirty_pages
> >   balance_dirty_pages_ratelimited_nr
> >   generic_file_buffered_write
> >   __generic_file_aio_write_nolock
> >   generic_file_aio_write
> >   ext3_file_write
> >   do_sync_write
> >   vfs_write
> >   sys_pwrite64
> > 
> > and for [loop0]:
> > 
> >   io_schedule_timeout
> >   congestion_wait
> >   throttle_vm_writeout
> >   shrink_zone
> >   shrink_zones
> >   try_to_free_pages
> >   __alloc_pages
> >   find_or_create_page
> >   do_lo_send_aops
> >   lo_send
> >   do_bio_filebacked
> >   loop_thread
> > 
> > The requirement for the deadlock is that
> > 
> >   nr_writeback > dirty_thresh * 1.1 + margin
> > 
> > Again margin seems to be in the 100 page range.
> > 
> > The task of throttle_vm_writeout is to limit the rate at which
> > under-writeback pages are created due to swapping.  There's no other
> > way direct reclaim can increase the nr_writeback + nr_file_dirty.
> > 
> > So when there are few or no under-swap pages, it is safe for this
> > function to return.  This ensures, that there's progress with writing
> > back dirty pages.
> > 
> 
> Would this also be solved by the below just-submitted bugfix?  I guess not.
> 
> I think the basic problem here is that the loop thread is reponsible for cleaning
> memory, but in throttle_vm_writeout(), the loop thread can get stuck waiting
> for some other thread to clean memory, but that ain't going to happen.
> 
> throttle_vm_writeout() wasn't very well thought through, I suspect.
> 
> 
> I suspect we can just delete throttle_vm_writeout() now.  The original
> rationale was:
> 
>     [PATCH] vm: pageout throttling
>     
>     With silly pageout testcases it is possible to place huge amounts of memory
>     under I/O.  With a large request queue (CFQ uses 8192 requests) it is
>     possible to place _all_ memory under I/O at the same time.
>     
>     This means that all memory is pinned and unreclaimable and the VM gets
>     upset and goes oom.
>     
>     The patch limits the amount of memory which is under pageout writeout to be
>     a little more than the amount of memory at which balance_dirty_pages()
>     callers will synchronously throttle.
>     
>     This means that heavy pageout activity can starve heavy writeback activity
>     completely, but heavy writeback activity will not cause starvation of
>     pageout.  Because we don't want a simple `dd' to be causing excessive
>     latencies in page reclaim.
> 
> but now that we limit the amount of dirty MAP_SHARED memory, and given that
> the various pieces of the dirty-memory limiting puzzle also take the number
> of under-writeback pages into account, we should no longer be able to get
> in the situation where the total number of writeback+dirty pages exceeds
> dirty_ratio.

Only with swap, which can generate a lot of writeback pages, without
limiting.  Does this matter?  I guess that depends on the queue
length.  If a lot of swap requests are using up memory, than that's
bad.

> From: Andrew Morton <akpm@linux-foundation.org>
> 
> throttle_vm_writeout() is designed to wait for the dirty levels to subside. 
> But if the caller holds IO or FS locks, we might be holding up that writeout.
> 
> So change it to take a single nap to give other devices a chance to clean some
> memory, then return.
> 

Why does it nap unconditionally for GFP_NOFS/NOIO?  That seems to be a
waste of time.

Btw, I did try not calling throttle_vm_writeout() for GFP_NOFS/NOIO
and IIRC it didn't help.  I'll check again.

Thanks,
Miklos


> Cc: Nick Piggin <nickpiggin@yahoo.com.au>
> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
> Cc: Kumar Gala <galak@kernel.crashing.org>
> Cc: Pete Zaitcev <zaitcev@redhat.com>
> Cc: <stable@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/writeback.h |    2 +-
>  mm/page-writeback.c       |   13 +++++++++++--
>  mm/vmscan.c               |    2 +-
>  3 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff -puN include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations include/linux/writeback.h
> --- a/include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
> +++ a/include/linux/writeback.h
> @@ -84,7 +84,7 @@ static inline void wait_on_inode(struct 
>  int wakeup_pdflush(long nr_pages);
>  void laptop_io_completion(void);
>  void laptop_sync_completion(void);
> -void throttle_vm_writeout(void);
> +void throttle_vm_writeout(gfp_t gfp_mask);
>  
>  /* These are exported to sysctl. */
>  extern int dirty_background_ratio;
> diff -puN mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/page-writeback.c
> --- a/mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
> +++ a/mm/page-writeback.c
> @@ -296,11 +296,21 @@ void balance_dirty_pages_ratelimited_nr(
>  }
>  EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>  
> -void throttle_vm_writeout(void)
> +void throttle_vm_writeout(gfp_t gfp_mask)
>  {
>  	long background_thresh;
>  	long dirty_thresh;
>  
> +	if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
> +		/*
> +		 * The caller might hold locks which can prevent IO completion
> +		 * or progress in the filesystem.  So we cannot just sit here
> +		 * waiting for IO to complete.
> +		 */
> +		congestion_wait(WRITE, HZ/10);
> +		return;
> +	}
> +
>          for ( ; ; ) {
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
>  
> @@ -317,7 +327,6 @@ void throttle_vm_writeout(void)
>          }
>  }
>  
> -
>  /*
>   * writeback at least _min_pages, and keep writing until the amount of dirty
>   * memory is less than the background threshold, or until we're all clean.
> diff -puN mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/vmscan.c
> --- a/mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
> +++ a/mm/vmscan.c
> @@ -952,7 +952,7 @@ static unsigned long shrink_zone(int pri
>  		}
>  	}
>  
> -	throttle_vm_writeout();
> +	throttle_vm_writeout(sc->gfp_mask);
>  
>  	atomic_dec(&zone->reclaim_in_progress);
>  	return nr_reclaimed;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support
@ 2007-02-27 23:14 Miklos Szeredi
  2007-02-27 23:14 ` [patch 03/22] fix deadlock in balance_dirty_pages Miklos Szeredi
  0 siblings, 1 reply; 9+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

The first part of this series (1-7) contains miscellaneous patches,
some of which are needed for fuse writable mmap to work correctly.

Some of these are resends of patches already in -mm, with minor
updates.

The rest of the series adds shared writable mapping support to fuse,
with some write performance improvements thrown in.  Survives
fsx-linux, LTP and bash-shared-mappings.

There are dependencies on patches in -mm, notably

  per-backing_dev-dirty-and-writeback-page-accounting.patch

--

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [patch 03/22] fix deadlock in balance_dirty_pages
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  0 siblings, 0 replies; 9+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: dirty_balancing_fix.patch --]
[-- Type: text/plain, Size: 3829 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This deadlock happens, when dirty pages from one filesystem are
written back through another filesystem.  It easiest to demonstrate
with fuse although it could affect looback mounts as well (see
following patches).

Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
writing to A, and process Pr_b is writing to B.

Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
(fusexmp_fh), for simplicity let's assume that Pr_b is single
threaded.

These are the simplified stack traces of these processes after the
deadlock:

Pr_a (bash-shared-mapping):

  (block on queue)
  fuse_writepage
  generic_writepages
  writeback_inodes
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  set_page_dirty_mapping_balance
  do_no_page

Pr_b (fusexmp_fh):

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64

Thanks to the aggressive nature of Pr_a, it can happen, that

  nr_file_dirty > dirty_thresh + margin

This is due to both nr_dirty growing and dirty_thresh shrinking, which
in turn is due to nr_file_mapped rapidly growing.  The exact size of
the margin at which the deadlock happens is not known, but it's around
100 pages.

At this point Pr_a enters balance_dirty_pages and starts to write back
some if it's dirty pages.  After submitting some requests, it blocks
on the request queue.

The first write request will trigger Pr_b to perform a write()
syscall.  This will submit a write request to the block device and
then may enter balance_dirty_pages().

The condition for exiting balance_dirty_pages() is

 - either that write_chunk pages have been written

 - or nr_file_dirty + nr_writeback < dirty_thresh

It is entirely possible that less than write_chunk pages were written,
in which case balance_dirty_pages() will not exit even after all the
submitted requests have been succesfully completed.

Which means that the write() syscall does not return.

Which means, that no more dirty pages from A will be written back, and
neither nr_writeback nor nr_file_dirty will decrease.

Which means, that balance_dirty_pages() will loop forever.

Q.E.D.

The solution is to exit balance_dirty_pages() on the condition, that
there are only a few dirty + writeback pages for this backing dev.  This
makes sure, that there is always some progress with this setup.

The number of outstanding dirty + written pages is limited to 8, which
means that when over the threshold (dirty_exceeded == 1), each
filesystem may only effectively pin a maximum of 16 (+8 because of
ratelimiting) extra pages.

Note: a similar safety vent is always needed if there's a global limit
for the dirty+writeback pages, even if in the future there will be
some per-queue (or other) soft limit.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
@@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;

+		/*
+		 * Acquit producer of dirty pages if there's little or
+		 * nothing to write back to this particular queue.
+		 *
+		 * Without this check a deadlock is possible for if
+		 * one filesystem is writing data through another.
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 8)
+			break;
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been

--

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-03-01  8:59 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20070227223809.684624012@szeredi.hu>
     [not found] ` <20070227223911.472192712@szeredi.hu>
2007-03-01  6:58   ` [patch 03/22] fix deadlock in balance_dirty_pages Andrew Morton
2007-03-01  7:35     ` Miklos Szeredi
2007-03-01  8:27       ` Andrew Morton
2007-03-01  8:37         ` Miklos Szeredi
2007-03-01  8:41           ` Andrew Morton
2007-03-01  8:58             ` Miklos Szeredi
     [not found] ` <20070227223914.057085427@szeredi.hu>
2007-03-01  7:11   ` [patch 04/22] fix deadlock in throttle_vm_writeout Andrew Morton
2007-03-01  7:48     ` Miklos Szeredi
2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
2007-02-27 23:14 ` [patch 03/22] fix deadlock in balance_dirty_pages Miklos Szeredi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox