Re: [PATCH 2/4] vfs: add support for a lazytime mount option

From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	linux-btrfs@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option
Date: Tue, 25 Nov 2014 12:52:39 +1100
Message-ID: <20141125015239.GD27262@dastard>
In-Reply-To: <1416599964-21892-3-git-send-email-tytso@mit.edu>

On Fri, Nov 21, 2014 at 02:59:22PM -0500, Theodore Ts'o wrote:
> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
> 
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
> 
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
> 
> Google-Bug-Id: 18297052
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/fs-writeback.c       | 38 +++++++++++++++++++++++++++++++++++++-
>  fs/inode.c              | 18 ++++++++++++++++++
>  fs/proc_namespace.c     |  1 +
>  fs/sync.c               |  7 +++++++
>  include/linux/fs.h      |  1 +
>  include/uapi/linux/fs.h |  1 +
>  6 files changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ef9bef1..ce7de22 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
>  		inode->i_state &= ~I_DIRTY_PAGES;
>  	dirty = inode->i_state & I_DIRTY;
> -	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> +	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
>  	spin_unlock(&inode->i_lock);
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
>  	iput(old_inode);
>  }
>  
> +/*
> + * This works like wait_sb_inodes(), but it is called *before* we kick
> + * the bdi so the inodes can get written out.
> + */
> +static void flush_sb_dirty_time(struct super_block *sb)
> +{
> +	struct inode *inode, *old_inode = NULL;
> +
> +	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> +	spin_lock(&inode_sb_list_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		int dirty_time;
> +
> +		spin_lock(&inode->i_lock);
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		dirty_time = inode->i_state & I_DIRTY_TIME;
> +		__iget(inode);
> +		spin_unlock(&inode->i_lock);
> +		spin_unlock(&inode_sb_list_lock);
> +
> +		iput(old_inode);
> +		old_inode = inode;
> +
> +		if (dirty_time)
> +			mark_inode_dirty(inode);
> +		cond_resched();
> +		spin_lock(&inode_sb_list_lock);
> +	}
> +	spin_unlock(&inode_sb_list_lock);
> +	iput(old_inode);
> +}

This just seems wrong to me, not to mention extremely expensive when we have
millions of cached inodes on the superblock.

Why can't we just add a function like mark_inode_dirty_time() which
puts the inode on the dirty inode list with a writeback time 24
hours in the future rather than 30s in the future?
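
A rough userspace sketch of that idea, for illustration only. It
assumes a single dirty list where each inode carries a "write back
after" deadline; every name below (model_inode, model_queue, the
MODEL_* constants) is hypothetical and none of it is kernel API. The
30s and 24 hour figures are the ones mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_DIRTY_EXPIRE_SECS       30UL              /* ordinary dirty inode */
#define MODEL_DIRTY_TIME_EXPIRE_SECS  (24UL * 60 * 60)  /* timestamp-only dirty inode */

struct model_inode {
	bool on_dirty_list;
	unsigned long writeback_after;	/* earliest time (secs) writeback is due */
};

/* Queue the inode, or pull its deadline earlier; never push it later. */
static void model_queue(struct model_inode *inode, unsigned long deadline)
{
	assert(inode != NULL);
	if (!inode->on_dirty_list || deadline < inode->writeback_after)
		inode->writeback_after = deadline;
	inode->on_dirty_list = true;
}

/* Ordinary dirtying: writeback becomes due roughly 30s from now. */
void model_mark_inode_dirty(struct model_inode *inode, unsigned long now)
{
	model_queue(inode, now + MODEL_DIRTY_EXPIRE_SECS);
}

/* Timestamp-only dirtying: same list, but due 24 hours from now. */
void model_mark_inode_dirty_time(struct model_inode *inode, unsigned long now)
{
	model_queue(inode, now + MODEL_DIRTY_TIME_EXPIRE_SECS);
}

/* Would periodic writeback pick this inode up at time 'now'? */
bool model_writeback_due(const struct model_inode *inode, unsigned long now)
{
	return inode->on_dirty_list && now >= inode->writeback_after;
}
```

In this model an inode dirtied only by a timestamp update at t=100 is
not due at t=130 but is due at t=100+86400, and a later ordinary
dirtying pulls the deadline in to 30s. Nothing has to walk every
cached inode on the superblock to find the timestamp-only ones,
because they are already on the list.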

> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -534,6 +534,18 @@ static void evict(struct inode *inode)
>  	BUG_ON(!(inode->i_state & I_FREEING));
>  	BUG_ON(!list_empty(&inode->i_lru));
>  
> +	if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
> +		if (inode->i_op->write_time)
> +			inode->i_op->write_time(inode);
> +		else if (inode->i_sb->s_op->write_inode) {
> +			struct writeback_control wbc = {
> +				.sync_mode = WB_SYNC_NONE,
> +			};
> +			mark_inode_dirty(inode);
> +			inode->i_sb->s_op->write_inode(inode, &wbc);
> +		}
> +	}
> +

Eviction is too late for this. I'm pretty sure we won't even get
this far, as iput_final() should catch I_DIRTY_TIME in the !drop
case via write_inode_now().


>  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
>  {
> +	struct inode *inode = file->f_mapping->host;
> +
>  	if (!file->f_op->fsync)
>  		return -EINVAL;
> +	if (!datasync && inode->i_state & I_DIRTY_TIME) {

FWIW, I'm surprised gcc isn't throwing warnings about that. Please
make sure there isn't any ambiguity in the code by writing those
checks like this:

	if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
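
For reference, a small standalone check of how the two spellings
parse. In C, binary & binds more tightly than &&, so both forms
evaluate the same way and the parentheses only remove the ambiguity
for the reader. The function names are made up for this sketch, and
the I_DIRTY_SYNC value is illustrative; I_DIRTY_TIME uses the value
from the posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define I_DIRTY_SYNC	(1u << 0)	/* illustrative value */
#define I_DIRTY_TIME	(1u << 11)	/* value from the posted patch */

/* The check as posted, without parentheses. */
bool time_sync_needed_posted(int datasync, unsigned int i_state)
{
	return !datasync && i_state & I_DIRTY_TIME;
}

/* The same check with explicit parentheses. */
bool time_sync_needed_parens(int datasync, unsigned int i_state)
{
	return !datasync && (i_state & I_DIRTY_TIME);
}

/* Assert that both forms agree for every combination tried here. */
void check_forms_agree(void)
{
	const unsigned int states[] = {
		0, I_DIRTY_SYNC, I_DIRTY_TIME, I_DIRTY_SYNC | I_DIRTY_TIME,
	};

	for (int datasync = 0; datasync <= 1; datasync++)
		for (size_t i = 0; i < sizeof(states) / sizeof(states[0]); i++)
			assert(time_sync_needed_posted(datasync, states[i]) ==
			       time_sync_needed_parens(datasync, states[i]));
}
```

Calling check_forms_agree() from a test main() should pass silently.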

> +		spin_lock(&inode->i_lock);
> +		inode->i_state |= I_DIRTY_SYNC;
> +		spin_unlock(&inode->i_lock);
> +	}
>  	return file->f_op->fsync(file, start, end, datasync);

When we mark the inode I_DIRTY_TIME, we should also be marking it
I_DIRTY_SYNC so that all the sync operations know that they should
be writing this inode. That's partly why I also think these inodes
should be tracked on the dirty inode list....

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3633239..489b2f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,6 +1721,7 @@ struct super_operations {
>  #define __I_DIO_WAKEUP		9
>  #define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
>  #define I_LINKABLE		(1 << 10)
> +#define I_DIRTY_TIME		(1 << 11)
>  
>  #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

Shouldn't I_DIRTY also include I_DIRTY_TIME now?
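
To illustrate the question, here is a userspace sketch comparing the
mask as posted with a widened one. I_DIRTY_AS_POSTED, I_DIRTY_WIDENED
and model_inode_is_dirty() are hypothetical names for this example.
The low three bit values are assumed for illustration; I_DIRTY_TIME
is the value from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define I_DIRTY_SYNC		(1u << 0)	/* assumed value */
#define I_DIRTY_DATASYNC	(1u << 1)	/* assumed value */
#define I_DIRTY_PAGES		(1u << 2)	/* assumed value */
#define I_DIRTY_TIME		(1u << 11)	/* value from the posted patch */

/* The mask as the patch leaves it. */
#define I_DIRTY_AS_POSTED	(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
/* Hypothetical widened mask that also covers timestamp-only dirtiness. */
#define I_DIRTY_WIDENED		(I_DIRTY_AS_POSTED | I_DIRTY_TIME)

/* Models any "i_state & I_DIRTY" style test against a given mask. */
bool model_inode_is_dirty(unsigned int i_state, unsigned int dirty_mask)
{
	return (i_state & dirty_mask) != 0;
}

/* A timestamp-only dirty inode is only visible through the widened mask. */
void check_dirty_masks(void)
{
	assert(!model_inode_is_dirty(I_DIRTY_TIME, I_DIRTY_AS_POSTED));
	assert(model_inode_is_dirty(I_DIRTY_TIME, I_DIRTY_WIDENED));
	assert(model_inode_is_dirty(I_DIRTY_SYNC, I_DIRTY_AS_POSTED));
}
```

With the mask as posted, code that tests "i_state & I_DIRTY" treats an
inode carrying only I_DIRTY_TIME as clean. Whether widening the mask
is right for every existing user of I_DIRTY is a separate question
that this sketch does not answer.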

-- 
Dave Chinner
david@fromorbit.com
