From: Nick Piggin <npiggin@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org, mpatocka@redhat.com
Subject: Re: [patch 6/6] mm: fsync livelock avoidance
Date: Thu, 11 Dec 2008 23:32:13 +0100
Message-ID: <20081211223213.GC8294@wotan.suse.de>
In-Reply-To: <20081211135111.cada5b8b.akpm@linux-foundation.org>

On Thu, Dec 11, 2008 at 01:51:11PM -0800, Andrew Morton wrote:
> On Wed, 10 Dec 2008 08:42:09 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> > 
> > This lock also solves a real data integrity problem that I only noticed as
> > I was writing the livelock avoidance code. If we consider the lock as the
> > solution to this bug, this makes the livelock avoidance code much more
> > attractive because then it does not introduce the new lock.
> > 
> > The bug is that fsync errors do not get propagated back up to the caller
> > properly in some cases. Consider where we write a page in the writeout path,
> > then it encounters an IO error and finishes writeback, in the meantime, another
> > process (eg. via sys_sync, or another fsync) clears the mapping error bits.
> > Then our fsync will have appeared to finish successfully, but actually should
> > have returned error.
> 
> Has *anybody* *ever* complained about this behaviour?  I think maybe
> one person after sixish years?

The livelock behaviour, or the error propagation?

I first heard about it from Mikulas, where some dm tool locks up because
it does direct IO on the block device of a mounted filesystem (or something
like that). That case is actually mostly solved by my first patch in the
series.

> Why fix it?

Good question. My earlier patches already in your tree removed some starvation
avoidance code because it was breaking data integrity semantics. So in
theory, your tree today is more susceptible to this sync/fsync starvation
than mainline. I care most about the correctness, and it would be great if
nobody cares about this starvation problem so we don't need the extra
complexity.
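
The error propagation race from the quoted changelog can be replayed in a
few lines of userspace C. This is only a sketch under stated assumptions:
the toy_* names are hypothetical, and a single sticky flag stands in for
the mapping error bits. It shows the interleaving, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an address_space: one sticky error flag. */
struct toy_mapping {
	bool io_error;
};

/* Writeback completion records a failure on the mapping. */
static void toy_end_writeback(struct toy_mapping *m, bool failed)
{
	if (failed)
		m->io_error = true;
}

/* Waiters test and clear the flag, so only the first one sees the error. */
static int toy_wait_and_check(struct toy_mapping *m)
{
	if (m->io_error) {
		m->io_error = false;
		return -EIO;
	}
	return 0;
}

/*
 * Replays the interleaving: our page fails, another syncer collects the
 * error, then our fsync finds nothing left to report.
 */
static void toy_replay_race(int *other_ret, int *our_ret)
{
	struct toy_mapping m = { .io_error = false };

	toy_end_writeback(&m, true);          /* our write hits an IO error */
	*other_ret = toy_wait_and_check(&m);  /* e.g. sys_sync gets there first */
	*our_ret = toy_wait_and_check(&m);    /* our fsync: error already cleared */
}
```

Holding one lock across the write-and-wait sequence keeps the second
caller from slipping in between the failure and our check, which is the
role the mapping lock plays in the patch.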

Thanks for the review.

> > +void mapping_fsync_lock(struct address_space *mapping)
> > +{
> > +	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
> > +							TASK_UNINTERRUPTIBLE);
> > +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> > +}
> > +
> > +void mapping_fsync_unlock(struct address_space *mapping)
> > +{
> > +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> > +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> > +	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
> > +	smp_mb__after_clear_bit();
> 
> hm, I wonder why clear_bit_unlock() didn't already do that.

It has strictly unlock (release) semantics, so there is a memory barrier
before the bit is cleared, but none after.
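
A userspace analogue with C11 atomics may make the distinction clearer.
This is a sketch, not the kernel implementation: the toy_* names and
TOY_LOCK_BIT are hypothetical, and the waiter counter is an assumption
standing in for the waitqueue check that wake_up_bit() performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_LOCK_BIT 0x1UL	/* hypothetical, stands in for AS_FSYNC_LOCK */

/* Acquire: returns true if this caller took the lock. */
static bool toy_trylock_bit(atomic_ulong *flags)
{
	unsigned long old = atomic_fetch_or_explicit(flags, TOY_LOCK_BIT,
						     memory_order_acquire);
	return !(old & TOY_LOCK_BIT);
}

/*
 * Release: earlier loads and stores are ordered before the clear, but
 * nothing prevents later accesses from being reordered ahead of it.
 */
static void toy_unlock_bit(atomic_ulong *flags)
{
	atomic_fetch_and_explicit(flags, ~TOY_LOCK_BIT, memory_order_release);
}

/*
 * Unlock, then look for waiters. The full fence keeps the load of
 * *waiters from moving before the clear; without it a waiter could be
 * missed. Returns true if someone needs waking.
 */
static bool toy_unlock_and_check_waiters(atomic_ulong *flags,
					 atomic_int *waiters)
{
	toy_unlock_bit(flags);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(waiters, memory_order_relaxed) > 0;
}
```

The fence here corresponds to the smp_mb__after_clear_bit() call that
precedes wake_up_bit() in the patch.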

 
> The clear_bit_unlock() documentation is rather crappy.
> 
> > +	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
> > +}
> 
> It wouldn't hurt to document this interface a little bit.

True.


> > +int wait_on_page_writeback_range_fsync(struct address_space *mapping,
> > +				pgoff_t start, pgoff_t end)
> 
> We already have a wait_on_page_writeback_range().  The reader of your
> code will be wondering why this variant exists, and how it differs. 
> Sigh.

Ah OK, wait_on_page_writeback_range_fsync can be used when the caller
has set up the fsync tags and holds the mapping bit lock. So unconverted
filesystems hopefully can use the old code without blowing up (they just
would be more prone to starvation).

> >  	if (!mapping_cap_writeback_dirty(mapping) || !count)
> >  		return 0;
> > +	mutex_lock(&inode->i_mutex);
> 
> I am unable to determine why i_mutex is being taken here.

Oh, good point (which I probably didn't mention). mapping_fsync_lock
nests inside i_mutex. Will add that to documentation and lock order
graphs.

> > @@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
> >  			range_whole = 1;
> >  		cycled = 1; /* ignore range_cyclic tests */
> >  	}
> > +
> > +	if (sync) {
> > +		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> 
> hm.  Is fsync the only caller of write_cache_pages(!WB_SYNC_NONE)? 
> Surprised.

No, some filesystems and other callers also use e.g. filemap_write_and_wait,
which should come here too. But they also need to take the lock.

> > +		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
> > +							index, end,
> > +				(1UL << PAGECACHE_TAG_DIRTY) |
> > +				(1UL << PAGECACHE_TAG_WRITEBACK),
> > +				(1UL << PAGECACHE_TAG_FSYNC))) {
> 
> ooh, so that's what that thing does.

Maybe an example is better documentation than my waffling.
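
Something along these lines, perhaps. It is a userspace sketch over a flat
array rather than a radix tree; the toy_* names are hypothetical, and the
"any of iftags" matching and the returned count are my assumptions about
the semantics, inferred from how the call is used in the hunk above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TAG_DIRTY		(1UL << 0)
#define TOY_TAG_WRITEBACK	(1UL << 1)
#define TOY_TAG_FSYNC		(1UL << 2)

/*
 * For every slot in [first, last] that carries any tag in iftags, also
 * set settags. Returns the number of slots that were tagged, so a zero
 * return means "nothing tagged".
 */
static size_t toy_gang_tag_set_if_tagged(unsigned long *tags, size_t nr,
					 size_t first, size_t last,
					 unsigned long iftags,
					 unsigned long settags)
{
	size_t i, count = 0;

	for (i = first; i <= last && i < nr; i++) {
		if (tags[i] & iftags) {
			tags[i] |= settags;
			count++;
		}
	}
	return count;
}
```

The point of the snapshot is that pages dirtied after the call do not
carry the fsync tag, so the fsync only has to write and wait on the set
of pages that existed when it started.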

 
> > +			/* nothing tagged */
> > +			spin_unlock_irq(&mapping->tree_lock);
> > +			return 0;
> 
> Can we please avoid the deeply-nested-return hand grenade?

Hmm, we could 

   goto out;
...
out:
 return ret;

But is that less hand grenadie than the plain return?
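
For comparison, here is a sketch of the single-exit form with the unlock
on the common path. The toy_* names are hypothetical and the counter-based
"lock" only exists so the pairing can be checked; it is not the
write_cache_pages() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock that just counts acquisitions, for illustration. */
struct toy_tree_lock {
	int held;
};

static void toy_lock_acquire(struct toy_tree_lock *l) { l->held++; }
static void toy_lock_release(struct toy_tree_lock *l) { l->held--; }

/*
 * One exit label: every path drops the lock in the same place, so a
 * later edit cannot add an early return that forgets the unlock.
 */
static int toy_tag_and_write(struct toy_tree_lock *tree_lock,
			     bool anything_tagged)
{
	int ret = 0;

	toy_lock_acquire(tree_lock);
	if (!anything_tagged)
		goto out_unlock;	/* nothing tagged */

	ret = 1;	/* placeholder for the real work */

out_unlock:
	toy_lock_release(tree_lock);
	return ret;
}
```

Whether this reads better than an unlock followed by a plain return is a
matter of taste when there is only one early exit.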

> > ===================================================================
> > --- linux-2.6.orig/drivers/usb/gadget/file_storage.c
> > +++ linux-2.6/drivers/usb/gadget/file_storage.c
> > @@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
> >  
> >  	inode = filp->f_path.dentry->d_inode;
> >  	mutex_lock(&inode->i_mutex);
> > +	mapping_fsync_lock(mapping);
> 
> Dood. Do `make allmodconfig ; make'
 
OK.


> >  	rc = filemap_fdatawrite(inode->i_mapping);
> >  	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
> >  	if (!rc)
> >  		rc = err;
> > -	err = filemap_fdatawait(inode->i_mapping);
> > +	err = filemap_fdatawait_fsync(inode->i_mapping);
> >  	if (!rc)
> >  		rc = err;
> > +	mapping_fsync_unlock(mapping);
> >
> > ...
> >
> 
> 
> I won't apply this because of the build breakage.

OK.

