Re: [patch 6/6] mm: fsync livelock avoidance

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrew Morton <akpm@linux-foundation.org>
To: Nick Piggin <npiggin@suse.de>
Cc: linux-fsdevel@vger.kernel.org, mpatocka@redhat.com
Subject: Re: [patch 6/6] mm: fsync livelock avoidance
Date: Thu, 11 Dec 2008 13:51:11 -0800	[thread overview]
Message-ID: <20081211135111.cada5b8b.akpm@linux-foundation.org> (raw)
In-Reply-To: <20081210074209.GG27096@wotan.suse.de>

On Wed, 10 Dec 2008 08:42:09 +0100
Nick Piggin <npiggin@suse.de> wrote:

> OK, there has not been any further discussion on this approach since I last
> posted it, so I am going to go out on a limb and suggest that we take this
> approach, if any, rather than Mikulas' one.
> 
> The advantages of my approach I think:
> - nothing added to non-sync fastpaths
> - close to theoretically fewest number of pages will be written / waited for
>   in the fsync case
> - works nicely even in unusual cases (eg. file with 1GB of dirty data, but
>   the fsync_range only needs to sync a few pages will not get stuck behind
>   a concurrent dirtier)
> 
> And some comments:
> - adds 8 bytes to the radix tree node, but this doesn't really change its
>   fitting into slab or cachelines, so effective impact is basically zero
>   for this addition.
> - adds an extra lock, but as per the comments, this lock seems to be required
>   in order to fix a bug anyway. And we already tend to hold i_mutex over at
>   least some of the fsync operation. Although if anybody thinks this will be
>   a problem, I'd like to hear.
> 
> Disadvantages:
> - more complex. Although in a way I consider Mikulas' change to have
>   more complex heuristics, which I don't like. I think Mikulas' version
>   would be more complex to analyse at runtime. Also, much of the complexity
>   comes from the extra lock, which as I said fixes a bug.
> 
> Any additions or disputes? :)
> 
> --
> This patch fixes fsync starvation problems in the presence of concurrent
> dirtiers.
> 
> To take an extreme example: if thread A calls fsync on a file with one dirty
> page, at index 1 000 000; at the same time, thread B starts dirtying the
> file from offset 0 onwards.
> 
> Thead B perhaps will be allowed to dirty 1 000 pages before hitting its dirty
> threshold, then it will start throttling. Thread A will start writing out B's
> pages. They'll proceed more or less in lockstep until thread B finishes
> writing.
> 
> While these semantics are correct, we'd prefer a more timely notification that
> pages dirty at the time fsync was called are safely on disk. In the above
> scenario, we may have to wait until many times the machine's RAM capacity has
> been written to disk before the fsync returns success. Ideally, thread A would
> write the single page at index 1 000 000, then return.
> 
> This patch introduces a new pagecache tag, PAGECACHE_TAG_FSYNC. Data integrity
> syncs then start by looking through the pagecache for pages which are DIRTY
> and/or WRITEBACK within the requested range, and tagging all those as FSYNC.
> 
> Subsequent writeout and wait phases need then only look up those pages in the
> pagecache which are tagged with PAGECACHE_TAG_FSYNC.
> 
> After the sync operation has completed, the FSYNC tags are removed from the
> radix tree. This design requires exclusive usage of the FSYNC tags for the
> duration of a sync operation, so a lock on the address space is required.
> 
> For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> logic from a couple of places. I don't know if this is really worth the added
> complexity (EIO will still get reported, but it will just take a bit longer;
> an app can't rely in specific behaviour or timeliness here).
> 
> This lock also solves a real data integrity problem that I only noticed as
> I was writing the livelock avoidance code. If we consider the lock as the
> solution to this bug, this makes the livelock avoidance code much more
> attractive because then it does not introduce the new lock.
> 
> The bug is that fsync errors do not get propogated back up to the caller
> properly in some cases. Consider where we write a page in the writeout path,
> then it encounters an IO error and finishes writeback, in the meantime, another
> process (eg. via sys_sync, or another fsync) clears the mapping error bits.
> Then our fsync will have appeared to finish successfully, but actually should
> have returned error.

Has *anybody* *ever* complained about this behaviour?  I think maybe
one person after sixish years?

Why fix it?

>
> ...
>
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -147,6 +147,28 @@ void remove_from_page_cache(struct page
>  	spin_unlock_irq(&mapping->tree_lock);
>  }
>  
> +static int sleep_on_fsync(void *word)
> +{
> +	io_schedule();
> +	return 0;
> +}
> +
> +void mapping_fsync_lock(struct address_space *mapping)
> +{
> +	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
> +							TASK_UNINTERRUPTIBLE);
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +}
> +
> +void mapping_fsync_unlock(struct address_space *mapping)
> +{
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
> +	smp_mb__after_clear_bit();

hm, I wonder why clear_bit_unlock() didn't already do that.

The clear_bit_unlock() documentation is rather crappy.

> +	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
> +}

It wouldn't hurt to document this interface a little bit.

>  static int sync_page(void *word)
>  {
>  	struct address_space *mapping;
> @@ -287,7 +309,64 @@ int wait_on_page_writeback_range(struct
>  
>  			/* until radix tree lookup accepts end_index */
>  			if (page->index > end)
> -				continue;
> +				break;
> +
> +			wait_on_page_writeback(page);
> +			if (PageError(page))
> +				ret = -EIO;
> +		}
> +		pagevec_release(&pvec);
> +		cond_resched();
> +	}
> +
> +	/* Check for outstanding write errors */
> +	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
> +		ret = -ENOSPC;
> +	if (test_and_clear_bit(AS_EIO, &mapping->flags))
> +		ret = -EIO;
> +
> +	return ret;
> +}
> +
> +int wait_on_page_writeback_range_fsync(struct address_space *mapping,
> +				pgoff_t start, pgoff_t end)

We already have a wait_on_page_writeback_range().  The reader of your
code will be wondering why this variant exists, and how it differs. 
Sigh.


> +{
> +	struct pagevec pvec;
> +	int nr_pages;
> +	int ret = 0;
> +	pgoff_t index;
> +
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +
> +	if (end < start)
> +		goto out;
> +
> +	pagevec_init(&pvec, 0);
> +	index = start;
> +	while ((index <= end) &&
> +			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> +			PAGECACHE_TAG_FSYNC,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
> +		unsigned i;
> +
> +		spin_lock_irq(&mapping->tree_lock);
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
> +
> +			radix_tree_tag_clear(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);

Don't mucky up the nice kernel code please.

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
>  
>  			wait_on_page_writeback(page);
>  			if (PageError(page))
> @@ -303,6 +382,7 @@ int wait_on_page_writeback_range(struct
>  	if (test_and_clear_bit(AS_EIO, &mapping->flags))
>  		ret = -EIO;
>  
> +out:
>  	return ret;
>  }
>  
> @@ -325,18 +405,20 @@ int sync_page_range(struct inode *inode,
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>  
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mutex_lock(&inode->i_mutex);

I am unable to determine why i_mutex is being taken here.

> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
> -	if (ret == 0) {
> -		mutex_lock(&inode->i_mutex);
> +	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> -		mutex_unlock(&inode->i_mutex);
> -	}
> +	mutex_unlock(&inode->i_mutex);

Nor why it was dropped at this particular place.

> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range);
> @@ -357,15 +439,18 @@ int sync_page_range_nolock(struct inode
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>  
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
>  	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range_nolock);
> @@ -389,23 +474,30 @@ int filemap_fdatawait(struct address_spa
>  }
>  EXPORT_SYMBOL(filemap_fdatawait);
>  
> +int filemap_fdatawait_fsync(struct address_space *mapping)
> +{
> +	loff_t i_size = i_size_read(mapping->host);
> +
> +	if (i_size == 0)
> +		return 0;
> +
> +	return wait_on_page_writeback_range_fsync(mapping, 0,
> +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> +}

filemap_fdatawait() is documented...

>  int filemap_write_and_wait(struct address_space *mapping)
>  {
>  	int err = 0;
>  
>  	if (mapping->nrpages) {
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
>  		err = filemap_fdatawrite(mapping);
> -		/*
> -		 * Even if the above returned error, the pages may be
> -		 * written partially (e.g. -ENOSPC), so we wait for it.
> -		 * But the -EIO is special case, it may indicate the worst
> -		 * thing (e.g. bug) happened, so we avoid waiting for it.
> -		 */
> -		if (err != -EIO) {
> -			int err2 = filemap_fdatawait(mapping);
> -			if (!err)
> -				err = err2;
> -		}
> +		err2 = filemap_fdatawait_fsync(mapping);
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
> @@ -428,16 +520,16 @@ int filemap_write_and_wait_range(struct
>  	int err = 0;
>  
>  	if (mapping->nrpages) {
> -		err = __filemap_fdatawrite_range(mapping, lstart, lend,
> -						 WB_SYNC_ALL);
> -		/* See comment of filemap_write_and_wait() */
> -		if (err != -EIO) {
> -			int err2 = wait_on_page_writeback_range(mapping,
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
> +		err = filemap_fdatawrite_range(mapping, lstart, lend);
> +		err2 = wait_on_page_writeback_range_fsync(mapping,
>  						lstart >> PAGE_CACHE_SHIFT,
>  						lend >> PAGE_CACHE_SHIFT);
> -			if (!err)
> -				err = err2;
> -		}
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
>
> ...
>
> --- linux-2.6.orig/mm/page-writeback.c
> +++ linux-2.6/mm/page-writeback.c
> @@ -872,9 +872,11 @@ int write_cache_pages(struct address_spa
>  	pgoff_t index;
>  	pgoff_t end;		/* Inclusive */
>  	pgoff_t done_index;
> +	unsigned int tag = PAGECACHE_TAG_DIRTY;
>  	int cycled;
>  	int range_whole = 0;
>  	long nr_to_write = wbc->nr_to_write;
> +	int sync = wbc->sync_mode != WB_SYNC_NONE;
>  
>  	if (wbc->nonblocking && bdi_write_congested(bdi)) {
>  		wbc->encountered_congestion = 1;
> @@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
>  			range_whole = 1;
>  		cycled = 1; /* ignore range_cyclic tests */
>  	}
> +
> +	if (sync) {
> +		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));

hm.  Is fsync the only caller of write_cache_pages(!WB_SYNC_NONE)? 
Surprised.

> +		/*
> +		 * If any pages are writeback or dirty, mark them fsync now.
> +		 * These are the pages we need to wait in in order to meet our

s/in/on/

> +		 * data integrity contract.
> +		 *
> +		 * Writeback pages need to be tagged, so we'll wait for them
> +		 * at the end of the writeout phase. However, the lookup below
> +		 * could just look up pages which are _DIRTY AND _FSYNC,
> +		 * because we don't care about them for the writeout phase.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
> +							index, end,
> +				(1UL << PAGECACHE_TAG_DIRTY) |
> +				(1UL << PAGECACHE_TAG_WRITEBACK),
> +				(1UL << PAGECACHE_TAG_FSYNC))) {

ooh, so that's what that thing does.

> +			/* nothing tagged */
> +			spin_unlock_irq(&mapping->tree_lock);
> +			return 0;

Can we please avoid the deeply-nested-return hand grenade?

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +		tag = PAGECACHE_TAG_FSYNC;
> +	}
> +
>  retry:
>  	done_index = index;
>  	while (!done && (index <= end)) {
>  		int i;
>  
>  		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -			      PAGECACHE_TAG_DIRTY,
> +			      tag,
>  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
>  		if (nr_pages == 0)
>  			break;
> @@ -951,7 +980,7 @@ continue_unlock:
>  			}
>  
>  			if (PageWriteback(page)) {
> -				if (wbc->sync_mode != WB_SYNC_NONE)
> +				if (sync)
>  					wait_on_page_writeback(page);
>  				else
>  					goto continue_unlock;
> @@ -981,7 +1010,7 @@ continue_unlock:
>  				}
>   			}
>  
> -			if (wbc->sync_mode == WB_SYNC_NONE) {
> +			if (!sync) {
>  				wbc->nr_to_write--;
>  				if (wbc->nr_to_write <= 0) {
>  					done = 1;
> Index: linux-2.6/drivers/usb/gadget/file_storage.c
> ===================================================================
> --- linux-2.6.orig/drivers/usb/gadget/file_storage.c
> +++ linux-2.6/drivers/usb/gadget/file_storage.c
> @@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
>  
>  	inode = filp->f_path.dentry->d_inode;
>  	mutex_lock(&inode->i_mutex);
> +	mapping_fsync_lock(mapping);

Dood. Do `make allmodconfig ; make'

>  	rc = filemap_fdatawrite(inode->i_mapping);
>  	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
>  	if (!rc)
>  		rc = err;
> -	err = filemap_fdatawait(inode->i_mapping);
> +	err = filemap_fdatawait_fsync(inode->i_mapping);
>  	if (!rc)
>  		rc = err;
> +	mapping_fsync_unlock(mapping);
>
> ...
>


I won't apply this because of the build breakage.

next prev parent reply	other threads:[~2008-12-11 21:51 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-12-10  7:24 [patch 1/6] mm: direct IO starvation improvement Nick Piggin
2008-12-10  7:25 ` [patch 2/6] fs: remove WB_SYNC_HOLD Nick Piggin
2008-12-10  7:27 ` [patch 3/6] fs: sync_sb_inodes fix Nick Piggin
2008-12-11 21:51   ` Andrew Morton
2008-12-11 22:34     ` Nick Piggin
2008-12-10  7:27 ` [patch 4/6] fs: sys_sync fix Nick Piggin
2008-12-10  7:28 ` [patch 5/6] radix-tree: gang set if tagged operation Nick Piggin
2008-12-11 21:20   ` Andrew Morton
2008-12-11 22:10     ` Nick Piggin
2008-12-10  7:42 ` [patch 6/6] mm: fsync livelock avoidance Nick Piggin
2008-12-10  9:15   ` steve
2008-12-11 21:51   ` Andrew Morton [this message]
2008-12-11 22:32     ` Nick Piggin
2008-12-11 22:41       ` Andrew Morton
2008-12-11 22:45       ` Andrew Morton
2008-12-11 22:59         ` Nick Piggin
2008-12-11 21:51   ` Andrew Morton
2008-12-11 22:23   ` Andrew Morton
2008-12-11 22:45     ` Nick Piggin
2008-12-11 23:14       ` Andrew Morton
2008-12-11 23:43         ` Nick Piggin
2008-12-12  0:39           ` Andrew Morton
2008-12-12  4:01             ` Nick Piggin
2008-12-12 16:04 ` [patch 1/6] mm: direct IO starvation improvement Jeff Moyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20081211135111.cada5b8b.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=npiggin@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.