Re: [PATCH v3] writeback: Do not sync data dirtied after sync start

From: Andrew Morton <akpm@linux-foundation.org>
To: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>,
	Wu Fengguang <fengguang.wu@intel.com>,
	linux-fsdevel@vger.kernel.org, dchinner@redhat.com
Subject: Re: [PATCH v3] writeback: Do not sync data dirtied after sync start
Date: Tue, 8 Oct 2013 15:14:09 -0700
Message-ID: <20131008151409.6b7415fc9ad7108f5de04873@linux-foundation.org>
In-Reply-To: <1380275080-31736-1-git-send-email-jack@suse.cz>

On Fri, 27 Sep 2013 11:44:40 +0200 Jan Kara <jack@suse.cz> wrote:

> When there are processes heavily creating small files while sync(2) is
> running, it can easily happen that quite some new files are created
> between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
> especially if there are several busy filesystems (remember that sync
> traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
> fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
> (e.g. causes a transaction commit and cache flush for each inode in
> ext3), resulting sync(2) times are rather large.
> 
> The following script reproduces the problem:
> 
> function run_writers
> {
>   for (( i = 0; i < 10; i++ )); do
>     mkdir $1/dir$i
>     for (( j = 0; j < 40000; j++ )); do
>       dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
>     done &
>   done
> }
> 
> for dir in "$@"; do
>   run_writers $dir
> done
> 
> sleep 40
> time sync
> ======
> 
> Fix the problem by disregarding inodes dirtied after sync(2) was called
> in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes a
> time stamp when sync has started which is used for setting up work for
> flusher threads.
> 
> To give some numbers, when the above script is run on two ext4 filesystems
> on a simple SATA drive, the average sync time from 10 runs is 267.549 seconds
> with standard deviation 104.799426. With the patched kernel, the average
> sync time from 10 runs is 2.995 seconds with standard deviation 0.096.

We need to be really careful about this - it's easy to make mistakes
and the consequences are nasty.

> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -39,7 +39,7 @@
>  struct wb_writeback_work {
>  	long nr_pages;
>  	struct super_block *sb;
> -	unsigned long *older_than_this;
> +	unsigned long older_than_this;
>  	enum writeback_sync_modes sync_mode;
>  	unsigned int tagged_writepages:1;
>  	unsigned int for_kupdate:1;
> @@ -248,8 +248,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
>  
>  	while (!list_empty(delaying_queue)) {
>  		inode = wb_inode(delaying_queue->prev);
> -		if (work->older_than_this &&
> -		    inode_dirtied_after(inode, *work->older_than_this))
> +		if (inode_dirtied_after(inode, work->older_than_this))
>  			break;
>  		list_move(&inode->i_wb_list, &tmp);
>  		moved++;
> @@ -791,12 +790,11 @@ static long wb_writeback(struct bdi_writeback *wb,
>  {
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
> -	unsigned long oldest_jif;
>  	struct inode *inode;
>  	long progress;
>  
> -	oldest_jif = jiffies;
> -	work->older_than_this = &oldest_jif;
> +	if (!work->older_than_this)
> +		work->older_than_this = jiffies;

So wb_writeback_work.older_than_this==0 has special (and undocumented!)
meaning.  But 0 is a valid jiffies value (it occurs 5 minutes after
boot, too).  What happens?

If the caller passed in "jiffies" at that moment, things will presumably
work, by luck, because we'll overwrite the caller's zero with another
zero.  Most of the time, anyway - things might go wrong if jiffies
increments to 1 in between.

But what happens if the caller was kupdate, exactly 330 seconds after
boot?  Won't we overwrite the caller's "older than 330 seconds" with
"older than 300 seconds" (or something like that)?

If this has all been thought through then let's explain how it works,
please.
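
To make the kupdate case concrete, here is a small userspace model of the
arithmetic. It is only a sketch under stated assumptions: a 32-bit jiffies
counter that starts 300 seconds before the wrap, HZ=1000, and a 30-second
dirty expire interval. The helper names (jiffies_at, kupdate_cutoff,
resolve_cutoff) are made up for illustration and are not kernel functions.

```c
#include <stdint.h>

/* Assumptions for this model only: 32-bit jiffies, HZ = 1000. */
#define HZ              1000
#define INITIAL_JIFFIES ((uint32_t)(-300 * HZ))   /* wraps to 0 at 300 s */
#define DIRTY_EXPIRE    ((uint32_t)(30 * HZ))     /* assumed 30 s interval */

/* Hypothetical helper: modelled jiffies value 'secs' seconds after boot. */
static uint32_t jiffies_at(uint32_t secs)
{
	return INITIAL_JIFFIES + secs * (uint32_t)HZ;
}

/* Hypothetical helper: the cutoff a kupdate-style caller would ask for. */
static uint32_t kupdate_cutoff(uint32_t now)
{
	return now - DIRTY_EXPIRE;
}

/* Models the patched check: zero is treated as "no cutoff supplied". */
static uint32_t resolve_cutoff(uint32_t older_than_this, uint32_t now)
{
	if (!older_than_this)
		older_than_this = now;
	return older_than_this;
}
```

Under these assumptions jiffies_at(330) is 30000, kupdate_cutoff(30000) is
0, and resolve_cutoff(0, 30000) returns 30000. The caller asked for inodes
at least 30 seconds old and got a cutoff of "now" instead.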

Perhaps it would be better to just stop using the
wb_writeback_work.older_than_this==0 magic sentinel and add a new
older_than_this_is_set:1 to the wb_writeback_work.
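
Roughly what I have in mind, as a userspace sketch rather than a patch.
The struct below is a cut-down stand-in for wb_writeback_work, and
wb_cutoff() is a made-up helper name; only the older_than_this_is_set
bit is the actual suggestion.

```c
/* Cut-down stand-in for wb_writeback_work (illustration only). */
struct wb_work_sketch {
	unsigned long older_than_this;
	unsigned int older_than_this_is_set:1;	/* caller supplied a cutoff */
};

/*
 * Hypothetical helper: pick the cutoff for this work item.  A
 * caller-supplied value is honoured even when it happens to be 0;
 * 'now' is used only when the caller set nothing.
 */
static unsigned long wb_cutoff(struct wb_work_sketch *work, unsigned long now)
{
	if (!work->older_than_this_is_set) {
		work->older_than_this = now;
		work->older_than_this_is_set = 1;
	}
	return work->older_than_this;
}
```

With the flag, a cutoff of 0 from the caller stays 0, and a work item
with nothing set gets the current jiffies, so no value of
older_than_this carries a hidden meaning.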

Thread overview: 7+ messages
2013-09-27  9:44 [PATCH v3] writeback: Do not sync data dirtied after sync start Jan Kara
2013-09-29 23:12 ` Dave Chinner
2013-10-08 22:14 ` Andrew Morton [this message]
2013-10-09 14:02   ` Jan Kara
2013-10-09 15:03     ` Jan Kara
2013-10-09 21:21       ` Andrew Morton
2013-10-09 22:14         ` Jan Kara
