public inbox for linux-kernel@vger.kernel.org
From: Andrew Morton <akpm@linux-foundation.org>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@vger.kernel.org,
	agk@redhat.com, mbroz@redhat.com, chris@arachsys.com
Subject: Re: [PATCH] Memory management livelock
Date: Tue, 23 Sep 2008 15:49:05 -0700
Message-ID: <20080923154905.50d4b0fa.akpm@linux-foundation.org>
In-Reply-To: <Pine.LNX.4.64.0809231817390.11559@hs20-bc2-1.build.redhat.com>

On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > 
> > > The bug happens when one process is doing sequential buffered writes to
> > > a block device (or file) and another process is attempting to execute
> > > sync(), fsync() or direct-IO on that device (or file). This syncing
> > > process will wait indefinitely, until the first writing process
> > > finishes.
> > >
> > > For example, run these two commands:
> > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > >
> > > The bug is caused by sequential walking of address space in
> > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > process is constantly making dirty and writeback pages while these
> > > functions run, the functions will wait on every new page, resulting in
> > > indefinite wait.
> > 
> > Shouldn't happen. All the data-syncing functions should have an upper
> > bound on the number of pages which they attempt to write. In the
> > example above, we end up in here:
> > 
> > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> > 				loff_t end, int sync_mode)
> > {
> > 	int ret;
> > 	struct writeback_control wbc = {
> > 		.sync_mode = sync_mode,
> > 		.nr_to_write = mapping->nrpages * 2,	<<--
> > 		.range_start = start,
> > 		.range_end = end,
> > 	};
> > 
> > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> > to write at most 2* the number of pages which are in cache for that inode.
> 
> See write_cache_pages:
> 
> if (wbc->sync_mode != WB_SYNC_NONE)
>         wait_on_page_writeback(page);	(1)
> if (PageWriteback(page) ||
>     !clear_page_dirty_for_io(page)) {
>         unlock_page(page);		(2)
>         continue;
> }
> ret = (*writepage)(page, wbc, data);
> if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
>         unlock_page(page);
>         ret = 0;
> }
> if (ret || (--(wbc->nr_to_write) <= 0))
>         done = 1;
> 
> --- so if it goes by points (1) and (2), the counter is not decremented,
> yet the function waits for the page. If there is a constant stream of
> writeback pages being generated, it waits on each of them --- that is,
> forever. I have seen a livelock in this function. Does that example with
> two dd's, one buffered write and the other direct-IO read, not reproduce
> the problem for you? For me it livelocks here.
>
> wait_on_page_writeback_range is another example where the livelock
> happened; there is no protection at all against starvation.

um, OK.  So someone else is initiating IO for this inode and this
thread *never* gets to initiate any writeback.  That's a bit of a
surprise.

How do we fix that?  Maybe decrement nr_to_write for these pages as
well?
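
To make the accounting concrete, here is a minimal user-space sketch of
the loop quoted above.  It is not kernel code: simulate_waits(),
stream_len and charge_skipped are hypothetical names, and it assumes the
worst case in which every page the scanner reaches is already under
writeback by the other writer.  With charge_skipped false it models the
quoted logic (skipped pages cost nothing); with charge_skipped true it
models the idea of decrementing nr_to_write for those pages too.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the write_cache_pages() accounting.
 * Assumption: each of the stream_len pages is already under writeback
 * (started by a competing writer) when the scanner reaches it.
 * Returns the number of pages the scanner had to wait on.
 */
long simulate_waits(long stream_len, long nr_to_write, bool charge_skipped)
{
	long waited = 0;

	assert(stream_len >= 0);

	for (long idx = 0; idx < stream_len; idx++) {
		/* (1) sync mode: wait for the other writer's I/O on this page */
		waited++;

		/* (2) page is clean after the wait, so it is skipped */
		if (charge_skipped && --nr_to_write <= 0)
			break;	/* budget used up: stop scanning */

		/* without charging, the budget is untouched and the scan goes on */
	}
	return waited;
}
```

Under these assumptions, simulate_waits(1000000, 8, false) waits on all
1000000 pages, so the wait lasts as long as the other writer keeps
producing pages, while simulate_waits(1000000, 8, true) stops after 8.
Note that in the second case the scan also ends without having written
anything itself.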

> 
> BTW, that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing
> to me.
> 
> Imagine this case: You have two pages with indices 4 and 5 dirty in a 
> file. You call fsync(). It sets nr_to_write to 4.
> 
> Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
> 
> The fsync() process goes to write_cache_pages, writes the first 4 dirty 
> pages and exits because it goes over the limit.
> 
> Result --- you violate fsync() semantics: pages that were dirty before the
> call to fsync() are not written when fsync() exits.

yup, that's pretty much unfixable, really, unless new locks are added
which block threads which are writing to unrelated sections of the
file, and that could hurt some workloads quite a lot, I expect.
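
The quoted scenario (pages 4 and 5 dirty, budget of 4, another process
dirtying pages 0-3) can be replayed in a small user-space model.  This
is only a sketch, not kernel code: writeback_with_budget() and
replay_fsync_race() are hypothetical names, and it assumes that only
pages 4 and 5 are cached at the moment the budget is computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 6	/* hypothetical tiny file: page indices 0..5 */

/*
 * Write dirty pages in index order until the budget is exhausted.
 * Returns the number of pages written.
 */
long writeback_with_budget(bool dirty[], size_t npages, long nr_to_write)
{
	long written = 0;

	assert(dirty != NULL);

	for (size_t i = 0; i < npages && nr_to_write > 0; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;	/* "write" the page */
		written++;
		nr_to_write--;
	}
	return written;
}

/*
 * Replay the race.  Returns how many of the pages that were dirty
 * before the sync started (4 and 5) are still dirty afterwards.
 */
int replay_fsync_race(void)
{
	bool dirty[NPAGES] = { false };
	long nr_to_write;

	dirty[4] = dirty[5] = true;	/* dirty before fsync() is called */
	nr_to_write = 2 * 2;		/* nrpages * 2, with two cached pages */

	/* meanwhile, another process dirties pages 0..3 */
	for (size_t i = 0; i < 4; i++)
		dirty[i] = true;

	writeback_with_budget(dirty, NPAGES, nr_to_write);

	return (int)dirty[4] + (int)dirty[5];
}
```

In this model replay_fsync_race() returns 2: the budget is spent on
pages 0-3, and both pages that were dirty before the call are left
unwritten.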

Hopefully high performance applications are instantiating the file
up-front and are using sync_file_range() to prevent these sorts of
things from happening.  But they probably aren't.
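
For reference, a minimal sketch of what such an application might do to
flush one byte range and wait for it.  flush_range() is a hypothetical
helper name; sync_file_range() itself is Linux-specific and, per its
man page, does not write out file metadata, which is why the file
would need to be instantiated beforehand.

```c
#define _GNU_SOURCE	/* needed for sync_file_range() in glibc */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write out the dirty pages in
 * [offset, offset + nbytes) of fd and wait for that I/O to finish.
 * Returns 0 on success, -1 with errno set on failure.
 * Only page data in the range is covered, not file metadata.
 */
int flush_range(int fd, off_t offset, off_t nbytes)
{
	assert(fd >= 0);

	return sync_file_range(fd, offset, nbytes,
			       SYNC_FILE_RANGE_WAIT_BEFORE |	/* wait for I/O already in flight */
			       SYNC_FILE_RANGE_WRITE |		/* start writeout of dirty pages */
			       SYNC_FILE_RANGE_WAIT_AFTER);	/* wait for that writeout */
}
```

A caller that has just written, say, the first 5 bytes of an already
allocated file would call flush_range(fd, 0, 5) and check the return
value; writes by other threads to unrelated ranges of the file are not
waited on.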


