From: Dave Chinner <david@fromorbit.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org,
Ext4 Developers List <linux-ext4@vger.kernel.org>,
linux-btrfs@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option
Date: Tue, 25 Nov 2014 12:52:39 +1100 [thread overview]
Message-ID: <20141125015239.GD27262@dastard> (raw)
In-Reply-To: <1416599964-21892-3-git-send-email-tytso@mit.edu>
On Fri, Nov 21, 2014 at 02:59:22PM -0500, Theodore Ts'o wrote:
> Add a new mount option which enables a new "lazytime" mode. This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode. The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
>
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
>
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table. The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives. Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
> fs/fs-writeback.c | 38 +++++++++++++++++++++++++++++++++++++-
> fs/inode.c | 18 ++++++++++++++++++
> fs/proc_namespace.c | 1 +
> fs/sync.c | 7 +++++++
> include/linux/fs.h | 1 +
> include/uapi/linux/fs.h | 1 +
> 6 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ef9bef1..ce7de22 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> inode->i_state &= ~I_DIRTY_PAGES;
> dirty = inode->i_state & I_DIRTY;
> - inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> + inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
> spin_unlock(&inode->i_lock);
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
> iput(old_inode);
> }
>
> +/*
> + * This works like wait_sb_inodes(), but it is called *before* we kick
> + * the bdi so the inodes can get written out.
> + */
> +static void flush_sb_dirty_time(struct super_block *sb)
> +{
> + struct inode *inode, *old_inode = NULL;
> +
> + WARN_ON(!rwsem_is_locked(&sb->s_umount));
> + spin_lock(&inode_sb_list_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + int dirty_time;
> +
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> + dirty_time = inode->i_state & I_DIRTY_TIME;
> + __iget(inode);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&inode_sb_list_lock);
> +
> + iput(old_inode);
> + old_inode = inode;
> +
> + if (dirty_time)
> + mark_inode_dirty(inode);
> + cond_resched();
> + spin_lock(&inode_sb_list_lock);
> + }
> + spin_unlock(&inode_sb_list_lock);
> + iput(old_inode);
> +}
This just seems wrong to me, not to mention extremely expensive when we have
millions of cached inodes on the superblock.
Why can't we just add a function like mark_inode_dirty_time() which
puts the inode on the dirty inode list with a writeback time 24
hours in the future rather than 30s in the future?
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -534,6 +534,18 @@ static void evict(struct inode *inode)
> BUG_ON(!(inode->i_state & I_FREEING));
> BUG_ON(!list_empty(&inode->i_lru));
>
> + if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
> + if (inode->i_op->write_time)
> + inode->i_op->write_time(inode);
> + else if (inode->i_sb->s_op->write_inode) {
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + };
> + mark_inode_dirty(inode);
> + inode->i_sb->s_op->write_inode(inode, &wbc);
> + }
> + }
> +
Eviction is too late for this. I'm pretty sure that it won't get
this far as iput_final() should catch the I_DIRTY_TIME in the !drop
case via write_inode_now().
> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> {
> + struct inode *inode = file->f_mapping->host;
> +
> if (!file->f_op->fsync)
> return -EINVAL;
> + if (!datasync && inode->i_state & I_DIRTY_TIME) {
FWIW, I'm surprised gcc isn't throwing warnings about that. Please
make sure there isn't any ambiguity in the code by writing those
checks like this:
if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> + spin_lock(&inode->i_lock);
> + inode->i_state |= I_DIRTY_SYNC;
> + spin_unlock(&inode->i_lock);
> + }
> return file->f_op->fsync(file, start, end, datasync);
When we mark the inode I_DIRTY_TIME, we should also be marking it
I_DIRTY_SYNC so that all the sync operations know that they should
be writing this inode. That's partly why I also think these inodes
should be tracked on the dirty inode list....
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3633239..489b2f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,6 +1721,7 @@ struct super_operations {
> #define __I_DIO_WAKEUP 9
> #define I_DIO_WAKEUP (1 << I_DIO_WAKEUP)
> #define I_LINKABLE (1 << 10)
> +#define I_DIRTY_TIME (1 << 11)
>
> #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
Shouldn't I_DIRTY also include I_DIRTY_TIME now?
--
Dave Chinner
david@fromorbit.com
WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org,
Ext4 Developers List <linux-ext4@vger.kernel.org>,
linux-btrfs@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: [PATCH 2/4] vfs: add support for a lazytime mount option
Date: Tue, 25 Nov 2014 12:52:39 +1100 [thread overview]
Message-ID: <20141125015239.GD27262@dastard> (raw)
In-Reply-To: <1416599964-21892-3-git-send-email-tytso@mit.edu>
On Fri, Nov 21, 2014 at 02:59:22PM -0500, Theodore Ts'o wrote:
> Add a new mount option which enables a new "lazytime" mode. This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode. The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
>
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
>
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table. The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives. Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
> fs/fs-writeback.c | 38 +++++++++++++++++++++++++++++++++++++-
> fs/inode.c | 18 ++++++++++++++++++
> fs/proc_namespace.c | 1 +
> fs/sync.c | 7 +++++++
> include/linux/fs.h | 1 +
> include/uapi/linux/fs.h | 1 +
> 6 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ef9bef1..ce7de22 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> inode->i_state &= ~I_DIRTY_PAGES;
> dirty = inode->i_state & I_DIRTY;
> - inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> + inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
> spin_unlock(&inode->i_lock);
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
> iput(old_inode);
> }
>
> +/*
> + * This works like wait_sb_inodes(), but it is called *before* we kick
> + * the bdi so the inodes can get written out.
> + */
> +static void flush_sb_dirty_time(struct super_block *sb)
> +{
> + struct inode *inode, *old_inode = NULL;
> +
> + WARN_ON(!rwsem_is_locked(&sb->s_umount));
> + spin_lock(&inode_sb_list_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + int dirty_time;
> +
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> + dirty_time = inode->i_state & I_DIRTY_TIME;
> + __iget(inode);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&inode_sb_list_lock);
> +
> + iput(old_inode);
> + old_inode = inode;
> +
> + if (dirty_time)
> + mark_inode_dirty(inode);
> + cond_resched();
> + spin_lock(&inode_sb_list_lock);
> + }
> + spin_unlock(&inode_sb_list_lock);
> + iput(old_inode);
> +}
This just seems wrong to me, not to mention extremely expensive when we have
millions of cached inodes on the superblock.
Why can't we just add a function like mark_inode_dirty_time() which
puts the inode on the dirty inode list with a writeback time 24
hours in the future rather than 30s in the future?
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -534,6 +534,18 @@ static void evict(struct inode *inode)
> BUG_ON(!(inode->i_state & I_FREEING));
> BUG_ON(!list_empty(&inode->i_lru));
>
> + if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
> + if (inode->i_op->write_time)
> + inode->i_op->write_time(inode);
> + else if (inode->i_sb->s_op->write_inode) {
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + };
> + mark_inode_dirty(inode);
> + inode->i_sb->s_op->write_inode(inode, &wbc);
> + }
> + }
> +
Eviction is too late for this. I'm pretty sure that it won't get
this far as iput_final() should catch the I_DIRTY_TIME in the !drop
case via write_inode_now().
> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> {
> + struct inode *inode = file->f_mapping->host;
> +
> if (!file->f_op->fsync)
> return -EINVAL;
> + if (!datasync && inode->i_state & I_DIRTY_TIME) {
FWIW, I'm surprised gcc isn't throwing warnings about that. Please
make sure there isn't any ambiguity in the code by writing those
checks like this:
if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> + spin_lock(&inode->i_lock);
> + inode->i_state |= I_DIRTY_SYNC;
> + spin_unlock(&inode->i_lock);
> + }
> return file->f_op->fsync(file, start, end, datasync);
When we mark the inode I_DIRTY_TIME, we should also be marking it
I_DIRTY_SYNC so that all the sync operations know that they should
be writing this inode. That's partly why I also think these inodes
should be tracked on the dirty inode list....
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3633239..489b2f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,6 +1721,7 @@ struct super_operations {
> #define __I_DIO_WAKEUP 9
> #define I_DIO_WAKEUP (1 << I_DIO_WAKEUP)
> #define I_LINKABLE (1 << 10)
> +#define I_DIRTY_TIME (1 << 11)
>
> #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
Shouldn't I_DIRTY also include I_DIRTY_TIME now?
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-11-25 1:52 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-11-21 19:59 [PATCH 0/4] add support for a lazytime mount option Theodore Ts'o
2014-11-21 19:59 ` Theodore Ts'o
2014-11-21 19:59 ` [PATCH 1/4] fs: split update_time() into update_time() and write_time() Theodore Ts'o
2014-11-21 19:59 ` Theodore Ts'o
2014-11-21 20:08 ` Chris Mason
2014-11-21 20:08 ` Chris Mason
2014-11-21 21:42 ` Theodore Ts'o
2014-11-21 21:42 ` Theodore Ts'o
2014-11-24 16:38 ` David Sterba
2014-11-24 16:38 ` David Sterba
2014-11-24 17:22 ` Theodore Ts'o
2014-11-24 17:22 ` Theodore Ts'o
2014-11-24 18:09 ` David Sterba
2014-11-24 18:09 ` David Sterba
2014-11-24 15:21 ` Christoph Hellwig
2014-11-24 15:21 ` Christoph Hellwig
2014-11-24 15:56 ` Theodore Ts'o
2014-11-24 15:56 ` Theodore Ts'o
2014-11-24 17:34 ` David Sterba
2014-11-24 17:34 ` David Sterba
2014-11-25 15:51 ` David Sterba
2014-11-25 15:51 ` David Sterba
2014-11-25 17:01 ` Christoph Hellwig
2014-11-21 19:59 ` [PATCH 2/4] vfs: add support for a lazytime mount option Theodore Ts'o
2014-11-21 19:59 ` Theodore Ts'o
2014-11-25 1:52 ` Dave Chinner [this message]
2014-11-25 1:52 ` Dave Chinner
2014-11-25 4:33 ` Theodore Ts'o
2014-11-25 4:33 ` Theodore Ts'o
2014-11-25 15:32 ` Boaz Harrosh
2014-11-25 15:32 ` Boaz Harrosh
2014-11-25 17:19 ` Jan Kara
2014-11-25 17:19 ` Jan Kara
2014-11-25 17:57 ` Theodore Ts'o
2014-11-25 17:57 ` Theodore Ts'o
2014-11-25 20:18 ` Jan Kara
2014-11-25 20:18 ` Jan Kara
2014-11-25 17:30 ` Jan Kara
2014-11-25 17:30 ` Jan Kara
2014-11-25 19:26 ` Theodore Ts'o
2014-11-25 19:26 ` Theodore Ts'o
2014-11-26 0:24 ` Dave Chinner
2014-11-26 0:24 ` Dave Chinner
2014-11-21 19:59 ` [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale Theodore Ts'o
2014-11-21 19:59 ` Theodore Ts'o
2014-11-21 20:19 ` Andreas Dilger
2014-11-21 20:19 ` Andreas Dilger
2014-11-21 21:36 ` Theodore Ts'o
2014-11-21 21:36 ` Theodore Ts'o
2014-11-21 23:09 ` Andreas Dilger
2014-11-25 1:53 ` Dave Chinner
2014-11-25 1:53 ` Dave Chinner
2014-11-25 4:45 ` Theodore Ts'o
2014-11-25 4:45 ` Theodore Ts'o
2014-11-25 23:48 ` Dave Chinner
2014-11-25 23:48 ` Dave Chinner
2014-11-26 10:20 ` Theodore Ts'o
2014-11-26 10:20 ` Theodore Ts'o
2014-11-26 22:39 ` Dave Chinner
2014-11-26 22:39 ` Dave Chinner
2014-11-25 17:31 ` Jan Kara
2014-11-25 17:31 ` Jan Kara
2014-11-21 19:59 ` [PATCH 4/4] ext4: add support for a lazytime mount option Theodore Ts'o
2014-11-21 19:59 ` Theodore Ts'o
2014-11-25 17:34 ` Jan Kara
2014-11-25 17:34 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20141125015239.GD27262@dastard \
--to=david@fromorbit.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.