* [PATCH-v4 0/7] add support for a lazytime mount option
@ 2014-11-26 10:23 Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time() Theodore Ts'o
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Theodore Ts'o,
	Linux btrfs Developers List, XFS Developers
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.
This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.
There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.
Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!
Any objections to my carrying these patches in the ext4 git tree?
Changes since -v3:
   - inodes with I_DIRTY_TIME set are placed on a new bdi list,
        b_dirty_time.  This allows filesystem-level syncs to more
        easily iterate over those inodes that need to have their
        timestamps written to disk.
   - dirty timestamps will be written out asynchronously on the final
        iput, instead of when the inode gets evicted.
   - separate the definition of the new function
        find_active_inode_nowait() to a separate patch
   - create separate flag masks: I_DIRTY_WB and I_DIRTY_INODE, which
       indicate whether the inode needs to be on the write back lists,
       or whether the inode itself is dirty, while I_DIRTY means any one
       of the inode dirty flags are set.  This simplifies the fs
       writeback logic which needs to test for different combinations of
       the inode dirty flags in different places.
Changes since -v2:
   - If update_time() updates i_version, it will not use lazytime (i..e,
       the inode will be marked dirty so the change will be persisted on to
       disk sooner rather than later).  Yes, this eliminates the
       benefits of lazytime if the user is experting the file system via
       NFSv4.  Sad, but NFS's requirements seem to mandate this.
   - Fix time wrapping bug 49 days after the system boots (on a system
        with a 32-bit jiffies).   Use get_monotonic_boottime() instead.
   - Clean up type warning in include/tracing/ext4.h
   - Added explicit parenthesis for stylistic reasons    
   - Added an is_readonly() inode operations method so btrfs doesn't
       have to duplicate code in update_time().
Changes since -v1:
   - Added explanatory comments in update_time() regarding i_ts_dirty_days
   - Fix type used for days_since_boot
   - Improve SMP scalability in update_time and ext4_update_other_inodes_time
   - Added tracepoints to help test and characterize how often and under
         what circumstances inodes have their timestamps lazily updated
Theodore Ts'o (7):
  vfs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  vfs: add lazytime tracepoints for better debugging
  vfs: add find_active_inode_nowait() function
  ext4: add support for a lazytime mount option
  btrfs: add an is_readonly() so btrfs can use common code for
    update_time()
 Documentation/filesystems/Locking |   2 +
 fs/btrfs/inode.c                  |  34 +++++-------
 fs/ext4/inode.c                   |  49 ++++++++++++++++-
 fs/ext4/super.c                   |   9 +++
 fs/fs-writeback.c                 |  57 +++++++++++++++++--
 fs/inode.c                        | 113 +++++++++++++++++++++++++++++++++++---
 fs/proc_namespace.c               |   1 +
 fs/sync.c                         |   8 +++
 fs/xfs/xfs_iops.c                 |  39 ++++++-------
 include/linux/backing-dev.h       |   1 +
 include/linux/fs.h                |  17 +++++-
 include/trace/events/ext4.h       |  30 ++++++++++
 include/uapi/linux/fs.h           |   1 +
 mm/backing-dev.c                  |   9 ++-
 14 files changed, 306 insertions(+), 64 deletions(-)
-- 
2.1.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
@ 2014-11-26 10:23 ` Theodore Ts'o
  2014-11-26 19:23   ` Christoph Hellwig
  2014-11-26 10:23 ` [PATCH-v4 2/7] vfs: add support for a lazytime mount option Theodore Ts'o
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Theodore Ts'o,
	Linux btrfs Developers List, XFS Developers
In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().
We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.cz>
---
 Documentation/filesystems/Locking |  2 ++
 fs/btrfs/inode.c                  | 10 ++++++++++
 fs/inode.c                        | 29 ++++++++++++++++++-----------
 fs/xfs/xfs_iops.c                 | 39 ++++++++++++++++-----------------------
 include/linux/fs.h                |  1 +
 5 files changed, 47 insertions(+), 34 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index b30753c..e49861d 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -63,6 +63,7 @@ prototypes:
 	int (*removexattr) (struct dentry *, const char *);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
 	void (*update_time)(struct inode *, struct timespec *, int);
+	void (*write_time)(struct inode *);
 	int (*atomic_open)(struct inode *, struct dentry *,
 				struct file *, unsigned open_flag,
 				umode_t create_mode, int *opened);
@@ -95,6 +96,7 @@ listxattr:	no
 removexattr:	yes
 fiemap:		no
 update_time:	no
+write_time:	no
 atomic_open:	yes
 tmpfile:	no
 dentry_open:	no
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
 		inode->i_mtime = *now;
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
+	return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
 	return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 	.tmpfile        = btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations btrfs_file_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
 	.getattr	= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations btrfs_special_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
 	.readlink	= generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
 	.listxattr	= btrfs_listxattr,
 	.removexattr	= btrfs_removexattr,
 	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-	if (inode->i_op->update_time)
-		return inode->i_op->update_time(inode, time, flags);
-
-	if (flags & S_ATIME)
-		inode->i_atime = *time;
-	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
-	if (flags & S_CTIME)
-		inode->i_ctime = *time;
-	if (flags & S_MTIME)
-		inode->i_mtime = *time;
+	int ret;
+
+	if (inode->i_op->update_time) {
+		ret = inode->i_op->update_time(inode, time, flags);
+		if (ret)
+			return ret;
+	} else {
+		if (flags & S_ATIME)
+			inode->i_atime = *time;
+		if (flags & S_VERSION)
+			inode_inc_iversion(inode);
+		if (flags & S_CTIME)
+			inode->i_ctime = *time;
+		if (flags & S_MTIME)
+			inode->i_mtime = *time;
+	}
+	if (inode->i_op->write_time)
+		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
 	return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
 }
 
 STATIC int
-xfs_vn_update_time(
-	struct inode		*inode,
-	struct timespec		*now,
-	int			flags)
+xfs_vn_write_time(
+	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(
 	}
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (flags & S_CTIME) {
-		inode->i_ctime = *now;
-		ip->i_d.di_ctime.t_sec = (__int32_t)now->tv_sec;
-		ip->i_d.di_ctime.t_nsec = (__int32_t)now->tv_nsec;
-	}
-	if (flags & S_MTIME) {
-		inode->i_mtime = *now;
-		ip->i_d.di_mtime.t_sec = (__int32_t)now->tv_sec;
-		ip->i_d.di_mtime.t_nsec = (__int32_t)now->tv_nsec;
-	}
-	if (flags & S_ATIME) {
-		inode->i_atime = *now;
-		ip->i_d.di_atime.t_sec = (__int32_t)now->tv_sec;
-		ip->i_d.di_atime.t_nsec = (__int32_t)now->tv_nsec;
-	}
+
+	ip->i_d.di_ctime.t_sec = (__int32_t) inode->i_ctime.tv_sec;
+	ip->i_d.di_ctime.t_nsec = (__int32_t) inode->i_ctime.tv_nsec;
+
+	ip->i_d.di_mtime.t_sec = (__int32_t)inode->i_mtime.tv_sec;
+	ip->i_d.di_mtime.t_nsec = (__int32_t) inode->i_mtime.tv_nsec;
+
+	ip->i_d.di_atime.t_sec = (__int32_t) inode->i_atime.tv_sec;
+	ip->i_d.di_atime.t_nsec = (__int32_t) inode->i_atime.tv_nsec;
+
 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
 	return xfs_trans_commit(tp, 0);
@@ -1129,7 +1122,7 @@ static const struct inode_operations xfs_inode_operations = {
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
 	.fiemap			= xfs_vn_fiemap,
-	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 };
 
 static const struct inode_operations xfs_dir_inode_operations = {
@@ -1156,7 +1149,7 @@ static const struct inode_operations xfs_dir_inode_operations = {
 	.getxattr		= generic_getxattr,
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
-	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 	.tmpfile		= xfs_vn_tmpfile,
 };
 
@@ -1184,7 +1177,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = {
 	.getxattr		= generic_getxattr,
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
-	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 	.tmpfile		= xfs_vn_tmpfile,
 };
 
@@ -1198,7 +1191,7 @@ static const struct inode_operations xfs_symlink_inode_operations = {
 	.getxattr		= generic_getxattr,
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
-	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 };
 
 STATIC void
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ab779e..3633239 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,6 +1545,7 @@ struct inode_operations {
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
 	int (*update_time)(struct inode *, struct timespec *, int);
+	int (*write_time)(struct inode *);
 	int (*atomic_open)(struct inode *, struct dentry *,
 			   struct file *, unsigned open_flag,
 			   umode_t create_mode, int *opened);
-- 
2.1.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH-v4 2/7] vfs: add support for a lazytime mount option
  2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time() Theodore Ts'o
@ 2014-11-26 10:23 ` Theodore Ts'o
  2014-11-27 13:14   ` Jan Kara
  2014-11-26 10:23 ` [PATCH-v4 3/7] vfs: don't let the dirty time inodes get more than a day stale Theodore Ts'o
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Theodore Ts'o,
	Linux btrfs Developers List, XFS Developers
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.
This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.
For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).
Google-Bug-Id: 18297052
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/fs-writeback.c           | 55 ++++++++++++++++++++++++++++++++++++++++-----
 fs/inode.c                  | 43 ++++++++++++++++++++++++++++++-----
 fs/proc_namespace.c         |  1 +
 fs/sync.c                   |  8 +++++++
 include/linux/backing-dev.h |  1 +
 include/linux/fs.h          | 11 +++++++--
 include/uapi/linux/fs.h     |  1 +
 mm/backing-dev.c            |  9 ++++++--
 8 files changed, 113 insertions(+), 16 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ef8c5d8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -397,7 +397,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * shot. If still dirty, it will be redirty_tail()'ed below.  Update
 	 * the dirty time to prevent enqueue and sync it again.
 	 */
-	if ((inode->i_state & I_DIRTY) &&
+	if ((inode->i_state & I_DIRTY_WB) &&
 	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
 		inode->dirtied_when = jiffies;
 
@@ -428,13 +428,15 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 			 */
 			redirty_tail(inode, wb);
 		}
-	} else if (inode->i_state & I_DIRTY) {
+	} else if (inode->i_state & I_DIRTY_WB) {
 		/*
 		 * Filesystems can dirty the inode during writeback operations,
 		 * such as delayed allocation during submission or metadata
 		 * updates after data IO completion.
 		 */
 		redirty_tail(inode, wb);
+	} else if (inode->i_state & I_DIRTY_TIME) {
+		list_move(&inode->i_wb_list, &wb->b_dirty_time);
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
 		list_del_init(&inode->i_wb_list);
@@ -482,11 +484,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
 	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		inode->i_state &= ~I_DIRTY_PAGES;
-	dirty = inode->i_state & I_DIRTY;
-	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	dirty = inode->i_state & I_DIRTY_INODE;
+	inode->i_state &= ~I_DIRTY_INODE;
 	spin_unlock(&inode->i_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
-	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+	if (dirty) {
 		int err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
@@ -1162,7 +1164,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
-		const int was_dirty = inode->i_state & I_DIRTY;
+		const int was_dirty = inode->i_state & I_DIRTY_WB;
 
 		inode->i_state |= flags;
 
@@ -1224,6 +1226,24 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+void inode_requeue_dirtytime(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & I_DIRTY_WB) == 0) {
+		if (inode->i_state & I_DIRTY_TIME)
+			list_move(&inode->i_wb_list, &bdi->wb.b_dirty_time);
+		else
+			list_del_init(&inode->i_wb_list);
+	}
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+
+}
+EXPORT_SYMBOL(inode_requeue_dirtytime);
+
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
@@ -1277,6 +1297,28 @@ static void wait_sb_inodes(struct super_block *sb)
 	iput(old_inode);
 }
 
+/*
+ * Take all of the indoes on the dirty_time list, and mark them as
+ * dirty, so they will be written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+	struct bdi_writeback *wb = &sb->s_bdi->wb;
+	LIST_HEAD(tmp);
+
+	spin_lock(&wb->list_lock);
+	list_cut_position(&tmp, &wb->b_dirty_time, wb->b_dirty_time.prev);
+	while (!list_empty(&tmp)) {
+		struct inode *inode = wb_inode(tmp.prev);
+
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&wb->list_lock);
+		mark_inode_dirty_sync(inode);
+		spin_lock(&wb->list_lock);
+	}
+	spin_unlock(&wb->list_lock);
+}
+
 /**
  * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1430,7 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	flush_sb_dirty_time(sb);
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..9e464cc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -30,7 +30,7 @@
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
- *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
+ *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_wb_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
  *
@@ -1430,11 +1430,22 @@ static void iput_final(struct inode *inode)
  */
 void iput(struct inode *inode)
 {
-	if (inode) {
-		BUG_ON(inode->i_state & I_CLEAR);
-
-		if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
-			iput_final(inode);
+	if (!inode)
+		return;
+	BUG_ON(inode->i_state & I_CLEAR);
+retry:
+	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
+		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
+			atomic_inc(&inode->i_count);
+			inode->i_state &= ~I_DIRTY_TIME;
+			spin_unlock(&inode->i_lock);
+			if (inode->i_op->write_time)
+				inode->i_op->write_time(inode);
+			else if (inode->i_sb->s_op->write_inode)
+				mark_inode_dirty_sync(inode);
+			goto retry;
+		}
+		iput_final(inode);
 	}
 }
 EXPORT_SYMBOL(iput);
@@ -1515,6 +1526,26 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
+	if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+	    !(flags & S_VERSION) &&
+	    !(inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))) {
+		if (inode->i_state & I_DIRTY_TIME)
+			return 0;
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+			spin_unlock(&inode->i_lock);
+			goto force_dirty;
+		}
+		if (inode->i_state & I_DIRTY_TIME) {
+			spin_unlock(&inode->i_lock);
+			return 0;
+		}
+		inode->i_state |= I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+		inode_requeue_dirtytime(inode);
+		return 0;
+	}
+force_dirty:
 	if (inode->i_op->write_time)
 		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 73ca174..f98234a 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -44,6 +44,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 		{ MS_SYNCHRONOUS, ",sync" },
 		{ MS_DIRSYNC, ",dirsync" },
 		{ MS_MANDLOCK, ",mand" },
+		{ MS_LAZYTIME, ",lazytime" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
diff --git a/fs/sync.c b/fs/sync.c
index bdc729d..6ac7bf0 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -177,8 +177,16 @@ SYSCALL_DEFINE1(syncfs, int, fd)
  */
 int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
 {
+	struct inode *inode = file->f_mapping->host;
+
 	if (!file->f_op->fsync)
 		return -EINVAL;
+	if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
+		spin_lock(&inode->i_lock);
+		inode->i_state &= ~I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+		mark_inode_dirty_sync(inode);
+	}
 	return file->f_op->fsync(file, start, end, datasync);
 }
 EXPORT_SYMBOL(vfs_fsync_range);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5da6012..4cdf733 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -55,6 +55,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_dirty_time;	/* time stamps are dirty */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3633239..55cf34d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1721,19 +1721,26 @@ struct super_operations {
 #define __I_DIO_WAKEUP		9
 #define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
+#define I_DIRTY_TIME		(1 << 11)
 
-#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
+/* Inode should be on the b_dirty/b_io/b_more_io lists */
+#define I_DIRTY_WB (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
+/* Inode should be on the b_dirty/b_io/b_more_io/b_dirty_time lists */
+#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES | I_DIRTY_TIME)
+/* The inode itself is dirty  */
+#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME)
 
 extern void __mark_inode_dirty(struct inode *, int);
 static inline void mark_inode_dirty(struct inode *inode)
 {
-	__mark_inode_dirty(inode, I_DIRTY);
+	__mark_inode_dirty(inode, I_DIRTY_WB);
 }
 
 static inline void mark_inode_dirty_sync(struct inode *inode)
 {
 	__mark_inode_dirty(inode, I_DIRTY_SYNC);
 }
+extern void inode_requeue_dirtytime(struct inode *);
 
 extern void inc_nlink(struct inode *inode);
 extern void drop_nlink(struct inode *inode);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..cc9713a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -90,6 +90,7 @@ struct inodes_stat_t {
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
 /* These sb flags are internal to the kernel */
 #define MS_NOSEC	(1<<28)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0ae0df5..14851fe 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -69,10 +69,10 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = 0;
+	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -80,6 +80,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	list_for_each_entry(inode, &wb->b_dirty_time, i_wb_list)
+		nr_dirty_time++;
 	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -98,6 +100,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
 		   "b_more_io:          %10lu\n"
+		   "b_dirty_time:       %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -111,6 +114,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,
+		   nr_dirty_time,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
@@ -418,6 +422,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_dirty_time);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
-- 
2.1.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH-v4 3/7] vfs: don't let the dirty time inodes get more than a day stale
  2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time() Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 2/7] vfs: add support for a lazytime mount option Theodore Ts'o
@ 2014-11-26 10:23 ` Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 4/7] vfs: add lazytime tracepoints for better debugging Theodore Ts'o
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Linux btrfs Developers List, XFS Developers,
	Theodore Ts'o
Guarantee that the on-disk timestamps will be no more than 24 hours
stale.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/fs-writeback.c  |  1 +
 fs/inode.c         | 18 ++++++++++++++++++
 include/linux/fs.h |  1 +
 3 files changed, 20 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef8c5d8..529480a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1143,6 +1143,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		trace_writeback_dirty_inode_start(inode, flags);
 
+		inode->i_ts_dirty_day = 0;
 		if (sb->s_op->dirty_inode)
 			sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 9e464cc..37c0429 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1510,6 +1510,8 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+	struct timespec uptime;
+	unsigned short days_since_boot;
 	int ret;
 
 	if (inode->i_op->update_time) {
@@ -1526,11 +1528,26 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
+	/*
+	 * If i_ts_dirty_day is zero, then either we have not deferred
+	 * timestamp updates, or the system has been up for less than
+	 * a day (so days_since_boot is zero), so we defer timestamp
+	 * updates in that case and set the I_DIRTY_TIME flag.  If a
+	 * day or more has passed, then i_ts_dirty_day will be
+	 * different from days_since_boot, and then we should update
+	 * the on-disk inode and then we can clear i_ts_dirty_day.
+	 */
 	if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
 	    !(flags & S_VERSION) &&
 	    !(inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))) {
 		if (inode->i_state & I_DIRTY_TIME)
 			return 0;
+		get_monotonic_boottime(&uptime);
+		days_since_boot = div_u64(uptime.tv_sec, 86400);
+		if (inode->i_ts_dirty_day &&
+		    (inode->i_ts_dirty_day != days_since_boot))
+			goto force_dirty;
+
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 			spin_unlock(&inode->i_lock);
@@ -1541,6 +1558,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 			return 0;
 		}
 		inode->i_state |= I_DIRTY_TIME;
+		inode->i_ts_dirty_day = days_since_boot;
 		spin_unlock(&inode->i_lock);
 		inode_requeue_dirtytime(inode);
 		return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 55cf34d..d0a2181 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
 	struct timespec		i_ctime;
 	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
 	unsigned short          i_bytes;
+	unsigned short		i_ts_dirty_day;
 	unsigned int		i_blkbits;
 	blkcnt_t		i_blocks;
 
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH-v4 4/7] vfs: add lazytime tracepoints for better debugging
  2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
                   ` (2 preceding siblings ...)
  2014-11-26 10:23 ` [PATCH-v4 3/7] vfs: don't let the dirty time inodes get more than a day stale Theodore Ts'o
@ 2014-11-26 10:23 ` Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 5/7] vfs: add find_active_inode_nowait() function Theodore Ts'o
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Theodore Ts'o,
	Linux btrfs Developers List, XFS Developers
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/fs-writeback.c | 1 +
 fs/inode.c        | 5 +++++
 2 files changed, 6 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 529480a..3d87174 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include <linux/backing-dev.h>
 #include <linux/tracepoint.h>
 #include <linux/device.h>
+#include <trace/events/fs.h>
 #include "internal.h"
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index 37c0429..b2fea60 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
 #include <linux/list_lru.h>
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs.h>
+
 /*
  * Inode locking rules:
  *
@@ -1443,6 +1446,7 @@ retry:
 				inode->i_op->write_time(inode);
 			else if (inode->i_sb->s_op->write_inode)
 				mark_inode_dirty_sync(inode);
+			trace_fs_lazytime_iput(inode);
 			goto retry;
 		}
 		iput_final(inode);
@@ -1561,6 +1565,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		inode->i_ts_dirty_day = days_since_boot;
 		spin_unlock(&inode->i_lock);
 		inode_requeue_dirtytime(inode);
+		trace_fs_lazytime_defer(inode);
 		return 0;
 	}
 force_dirty:
-- 
2.1.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH-v4 5/7] vfs: add find_active_inode_nowait() function
  2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
                   ` (3 preceding siblings ...)
  2014-11-26 10:23 ` [PATCH-v4 4/7] vfs: add lazytime tracepoints for better debugging Theodore Ts'o
@ 2014-11-26 10:23 ` Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 6/7] ext4: add support for a lazytime mount option Theodore Ts'o
  2014-11-26 10:23 ` [PATCH-v4 7/7] btrfs: add an is_readonly() so btrfs can use common code for update_time() Theodore Ts'o
  6 siblings, 0 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Theodore Ts'o,
	Linux btrfs Developers List, XFS Developers
Add a new function find_active_inode_nowait() which will never block.
If there is an inode being freed or is still being initialized, this
function will return NULL instead of blocking waiting for an inode to
be freed or to finish initializing.  Hence, a negative return from
this function does not mean that inode number is free for use.  It is
useful for callers that want to opportunistically do some work on an
inode only if it is present and available in the cache, and where
blocking is not an option.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/inode.c         | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 38 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index b2fea60..0b4c6ae 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1283,6 +1283,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
 }
 EXPORT_SYMBOL(ilookup);
 
+/**
+ * find_active_inode_nowait - find an active inode in the inode cache
+ * @sb:		super block of file system to search
+ * @ino:	inode number to search for
+ *
+ * Search for an active inode @ino in the inode cache, and if the
+ * inode is in the cache, the inode is returned with an incremented
+ * reference count.  If the inode is being freed or is newly
+ * initialized, return nothing instead of trying to wait for the inode
+ * initialization or destruction to be complete.
+ */
+struct inode *find_active_inode_nowait(struct super_block *sb,
+				       unsigned long ino)
+{
+	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct inode *inode, *ret_inode = NULL;
+
+	spin_lock(&inode_hash_lock);
+	hlist_for_each_entry(inode, head, i_hash) {
+		if ((inode->i_ino != ino) ||
+		    (inode->i_sb != sb))
+			continue;
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
+			__iget(inode);
+			ret_inode = inode;
+		}
+		spin_unlock(&inode->i_lock);
+		goto out;
+	}
+out:
+	spin_unlock(&inode_hash_lock);
+	return ret_inode;
+}
+EXPORT_SYMBOL(find_active_inode_nowait);
+
 int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d0a2181..dc615ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2419,6 +2419,8 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
 
 extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
+extern struct inode *find_active_inode_nowait(struct super_block *,
+					      unsigned long);
 extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-- 
2.1.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
                   ` (4 preceding siblings ...)
  2014-11-26 10:23 ` [PATCH-v4 5/7] vfs: add find_active_inode_nowait() function Theodore Ts'o
@ 2014-11-26 10:23 ` Theodore Ts'o
  2014-11-26 19:24   ` Christoph Hellwig
  2014-11-26 22:48   ` Dave Chinner
  2014-11-26 10:23 ` [PATCH-v4 7/7] btrfs: add an is_readonly() so btrfs can use common code for update_time() Theodore Ts'o
  6 siblings, 2 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Theodore Ts'o,
	Linux btrfs Developers List, XFS Developers
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.
Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.
Google-Bug-Id: 18297052
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/inode.c             | 49 ++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/super.c             |  9 +++++++++
 include/trace/events/ext4.h | 30 +++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5653fa4..8308c82 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+					  unsigned long orig_ino, char *buf)
+{
+	struct ext4_inode_info	*ei;
+	struct ext4_inode	*raw_inode;
+	unsigned long		ino;
+	struct inode		*inode;
+	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+	int		inode_size = EXT4_INODE_SIZE(sb);
+
+	ino = orig_ino & ~(inodes_per_block - 1);
+	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+		if (ino == orig_ino)
+			continue;
+		inode = find_active_inode_nowait(sb, ino);
+		if (!inode ||
+		    (inode->i_state & I_DIRTY_TIME) == 0 ||
+		    !spin_trylock(&inode->i_lock)) {
+			iput(inode);
+			continue;
+		}
+		inode->i_state &= ~I_DIRTY_TIME;
+		inode->i_ts_dirty_day = 0;
+		spin_unlock(&inode->i_lock);
+		inode_requeue_dirtytime(inode);
+
+		ei = EXT4_I(inode);
+		raw_inode = (struct ext4_inode *) buf;
+
+		spin_lock(&ei->i_raw_lock);
+		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+		ext4_inode_csum_set(inode, raw_inode, ei);
+		spin_unlock(&ei->i_raw_lock);
+		trace_ext4_other_inode_update_time(inode, orig_ino);
+		iput(inode);
+	}
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4237,7 +4282,6 @@ static int ext4_do_update_inode(handle_t *handle,
 		for (block = 0; block < EXT4_N_BLOCKS; block++)
 			raw_inode->i_block[block] = ei->i_data[block];
 	}
-
 	if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
 		raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
 		if (ei->i_extra_isize) {
@@ -4248,10 +4292,9 @@ static int ext4_do_update_inode(handle_t *handle,
 				cpu_to_le16(ei->i_extra_isize);
 		}
 	}
-
 	ext4_inode_csum_set(inode, raw_inode, ei);
-
 	spin_unlock(&ei->i_raw_lock);
+	ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
 
 	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
 	rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 58859bc..93a2b7a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1132,6 +1132,7 @@ enum {
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
 	Opt_usrquota, Opt_grpquota, Opt_i_version,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+	Opt_lazytime, Opt_nolazytime,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
 	{Opt_i_version, "i_version"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
+	{Opt_lazytime, "lazytime"},
+	{Opt_nolazytime, "nolazytime"},
 	{Opt_nodelalloc, "nodelalloc"},
 	{Opt_removed, "mblk_io_submit"},
 	{Opt_removed, "nomblk_io_submit"},
@@ -1452,6 +1455,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 	case Opt_i_version:
 		sb->s_flags |= MS_I_VERSION;
 		return 1;
+	case Opt_lazytime:
+		sb->s_flags |= MS_LAZYTIME;
+		return 1;
+	case Opt_nolazytime:
+		sb->s_flags &= ~MS_LAZYTIME;
+		return 1;
 	}
 
 	for (m = ext4_mount_opts; m->token != Opt_err; m++)
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 6cfb841..6e5abd6 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -73,6 +73,36 @@ struct extent_status;
 	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
 
 
+TRACE_EVENT(ext4_other_inode_update_time,
+	TP_PROTO(struct inode *inode, ino_t orig_ino),
+
+	TP_ARGS(inode, orig_ino),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,	dev			)
+		__field(	ino_t,	ino			)
+		__field(	ino_t,	orig_ino		)
+		__field(	uid_t,	uid			)
+		__field(	gid_t,	gid			)
+		__field(	__u16, mode			)
+	),
+
+	TP_fast_assign(
+		__entry->orig_ino = orig_ino;
+		__entry->dev	= inode->i_sb->s_dev;
+		__entry->ino	= inode->i_ino;
+		__entry->uid	= i_uid_read(inode);
+		__entry->gid	= i_gid_read(inode);
+		__entry->mode	= inode->i_mode;
+	),
+
+	TP_printk("dev %d,%d orig_ino %lu ino %lu mode 0%o uid %u gid %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long) __entry->orig_ino,
+		  (unsigned long) __entry->ino, __entry->mode,
+		  __entry->uid, __entry->gid)
+);
+
 TRACE_EVENT(ext4_free_inode,
 	TP_PROTO(struct inode *inode),
 
-- 
2.1.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* [PATCH-v4 7/7] btrfs: add an is_readonly() so btrfs can use common code for update_time()
  2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
                   ` (5 preceding siblings ...)
  2014-11-26 10:23 ` [PATCH-v4 6/7] ext4: add support for a lazytime mount option Theodore Ts'o
@ 2014-11-26 10:23 ` Theodore Ts'o
  6 siblings, 0 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-26 10:23 UTC (permalink / raw)
  To: Linux Filesystem Development List
  Cc: Ext4 Developers List, Linux btrfs Developers List, XFS Developers,
	Theodore Ts'o
The only reason btrfs cloned code from the VFS layer was so it could
add a check to see if a subvolume is read-ony.  Instead of doing that,
let's add a new inode operation which allows a file system to return
an error if the inode is read-only, and use that in update_time().
There may be other places where the VFS layer may want to know that
btrfs would want to treat an inode is read-only.
With this commit, there are no remaining users of update_time() in the
inode operations structure, so we can remove it and simply things
further.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.cz>
---
 fs/btrfs/inode.c   | 26 ++++++--------------------
 fs/inode.c         | 22 +++++++++++-----------
 include/linux/fs.h |  2 +-
 3 files changed, 18 insertions(+), 32 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a5e0d0d..0bfe3a4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5554,26 +5554,12 @@ static int btrfs_dirty_inode(struct inode *inode)
 	return ret;
 }
 
-/*
- * This is a copy of file_update_time.  We need this so we can return error on
- * ENOSPC for updating the inode in the case of file write and mmap writes.
- */
-static int btrfs_update_time(struct inode *inode, struct timespec *now,
-			     int flags)
+static int btrfs_is_readonly(struct inode *inode)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 
 	if (btrfs_root_readonly(root))
 		return -EROFS;
-
-	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
-	if (flags & S_CTIME)
-		inode->i_ctime = *now;
-	if (flags & S_MTIME)
-		inode->i_mtime = *now;
-	if (flags & S_ATIME)
-		inode->i_atime = *now;
 	return 0;
 }
 
@@ -9466,8 +9452,8 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.permission	= btrfs_permission,
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
-	.update_time	= btrfs_update_time,
 	.write_time	= btrfs_write_time,
+	.is_readonly	= btrfs_is_readonly,
 	.tmpfile        = btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9475,8 +9461,8 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
 	.permission	= btrfs_permission,
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
-	.update_time	= btrfs_update_time,
 	.write_time	= btrfs_write_time,
+	.is_readonly	= btrfs_is_readonly,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9546,8 +9532,8 @@ static const struct inode_operations btrfs_file_inode_operations = {
 	.fiemap		= btrfs_fiemap,
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
-	.update_time	= btrfs_update_time,
 	.write_time	= btrfs_write_time,
+	.is_readonly	= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
 	.getattr	= btrfs_getattr,
@@ -9559,8 +9545,8 @@ static const struct inode_operations btrfs_special_inode_operations = {
 	.removexattr	= btrfs_removexattr,
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
-	.update_time	= btrfs_update_time,
 	.write_time	= btrfs_write_time,
+	.is_readonly	= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
 	.readlink	= generic_readlink,
@@ -9573,8 +9559,8 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
 	.getxattr	= btrfs_getxattr,
 	.listxattr	= btrfs_listxattr,
 	.removexattr	= btrfs_removexattr,
-	.update_time	= btrfs_update_time,
 	.write_time	= btrfs_write_time,
+	.is_readonly	= btrfs_is_readonly,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 0b4c6ae..e29bd2d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1554,20 +1554,20 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 	unsigned short days_since_boot;
 	int ret;
 
-	if (inode->i_op->update_time) {
-		ret = inode->i_op->update_time(inode, time, flags);
+	if (inode->i_op->is_readonly) {
+		ret = inode->i_op->is_readonly(inode);
 		if (ret)
 			return ret;
-	} else {
-		if (flags & S_ATIME)
-			inode->i_atime = *time;
-		if (flags & S_VERSION)
-			inode_inc_iversion(inode);
-		if (flags & S_CTIME)
-			inode->i_ctime = *time;
-		if (flags & S_MTIME)
-			inode->i_mtime = *time;
 	}
+	if (flags & S_ATIME)
+		inode->i_atime = *time;
+	if (flags & S_VERSION)
+		inode_inc_iversion(inode);
+	if (flags & S_CTIME)
+		inode->i_ctime = *time;
+	if (flags & S_MTIME)
+		inode->i_mtime = *time;
+
 	/*
 	 * If i_ts_dirty_day is zero, then either we have not deferred
 	 * timestamp updates, or the system has been up for less than
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc615ec..6e41107 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,8 +1545,8 @@ struct inode_operations {
 	int (*removexattr) (struct dentry *, const char *);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
-	int (*update_time)(struct inode *, struct timespec *, int);
 	int (*write_time)(struct inode *);
+	int (*is_readonly)(struct inode *);
 	int (*atomic_open)(struct inode *, struct dentry *,
 			   struct file *, unsigned open_flag,
 			   umode_t create_mode, int *opened);
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-26 10:23 ` [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time() Theodore Ts'o
@ 2014-11-26 19:23   ` Christoph Hellwig
  2014-11-27 12:34     ` Jan Kara
  2014-11-27 14:41     ` Theodore Ts'o
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Hellwig @ 2014-11-26 19:23 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
As mentioned last round please move the addition of the is_readonly
operation to the first thing in the series, so that the ordering makes
more sense.
Second I think this patch is incorrect for XFS - XFS uses ->update_time
to set the time stampst in the dinode.  These two need to be coherent
as we can write out a dirty inode any time, so it needs to have the
timestamp uptodate.
Third update_time now calls mark_inode_dirty unconditionally, while
previously it wasn't called when ->update_time was set.  At least
for XFS that's a major change in behavior as XFS never used VFS dirty
tracking for metadata updates.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-26 10:23 ` [PATCH-v4 6/7] ext4: add support for a lazytime mount option Theodore Ts'o
@ 2014-11-26 19:24   ` Christoph Hellwig
  2014-11-26 22:48   ` Dave Chinner
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2014-11-26 19:24 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
The subject line seems incorrect, this seems to implement some form
of dirty inode writeback clustering.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-26 10:23 ` [PATCH-v4 6/7] ext4: add support for a lazytime mount option Theodore Ts'o
  2014-11-26 19:24   ` Christoph Hellwig
@ 2014-11-26 22:48   ` Dave Chinner
  2014-11-26 23:10     ` Andreas Dilger
  1 sibling, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2014-11-26 22:48 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
> Add an optimization for the MS_LAZYTIME mount option so that we will
> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> in a particular inode table block when we need to update some inode in
> that inode table block anyway.
> 
> Also add some temporary code so that we can set the lazytime mount
> option without needing a modified /sbin/mount program which can set
> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> added support.
> 
> Google-Bug-Id: 18297052
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/ext4/inode.c             | 49 ++++++++++++++++++++++++++++++++++++++++++---
>  fs/ext4/super.c             |  9 +++++++++
>  include/trace/events/ext4.h | 30 +++++++++++++++++++++++++++
>  3 files changed, 85 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5653fa4..8308c82 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
>  }
>  
>  /*
> + * Opportunistically update the other time fields for other inodes in
> + * the same inode table block.
> + */
> +static void ext4_update_other_inodes_time(struct super_block *sb,
> +					  unsigned long orig_ino, char *buf)
> +{
> +	struct ext4_inode_info	*ei;
> +	struct ext4_inode	*raw_inode;
> +	unsigned long		ino;
> +	struct inode		*inode;
> +	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> +	int		inode_size = EXT4_INODE_SIZE(sb);
> +
> +	ino = orig_ino & ~(inodes_per_block - 1);
> +	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> +		if (ino == orig_ino)
> +			continue;
> +		inode = find_active_inode_nowait(sb, ino);
> +		if (!inode ||
> +		    (inode->i_state & I_DIRTY_TIME) == 0 ||
> +		    !spin_trylock(&inode->i_lock)) {
> +			iput(inode);
> +			continue;
> +		}
> +		inode->i_state &= ~I_DIRTY_TIME;
> +		inode->i_ts_dirty_day = 0;
> +		spin_unlock(&inode->i_lock);
> +		inode_requeue_dirtytime(inode);
> +
> +		ei = EXT4_I(inode);
> +		raw_inode = (struct ext4_inode *) buf;
> +
> +		spin_lock(&ei->i_raw_lock);
> +		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
> +		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
> +		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
> +		ext4_inode_csum_set(inode, raw_inode, ei);
> +		spin_unlock(&ei->i_raw_lock);
> +		trace_ext4_other_inode_update_time(inode, orig_ino);
> +		iput(inode);
> +	}
> +}
Am I right in that this now does unlogged timestamp updates of
inodes? What happens when that buffer gets overwritten by log
recover after a crash? The timestamp updates get lost?
FYI, XFS has had all sorts of nasty log recovery corner cases
caused by log recovery overwriting non-logged inode updates like
this. In the past few years we've removed every single non-logged
inode update "optimisation" so that all changes (including timestamps)
are transactional so inode state on disk not matching what log
recovery wrote to disk for all the other inode metadata...
Optimistic unlogged inode updates are a slippery slope, and history
tells me that it doesn't lead to a nice place....
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-26 22:48   ` Dave Chinner
@ 2014-11-26 23:10     ` Andreas Dilger
  2014-11-26 23:35       ` Dave Chinner
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Dilger @ 2014-11-26 23:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
On Nov 26, 2014, at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
>> Add an optimization for the MS_LAZYTIME mount option so that we will
>> opportunistically write out any inodes with the I_DIRTY_TIME flag set
>> in a particular inode table block when we need to update some inode
>> in that inode table block anyway.
>> 
>> Also add some temporary code so that we can set the lazytime mount
>> option without needing a modified /sbin/mount program which can set
>> MS_LAZYTIME.  We can eventually make this go away once util-linux has
>> added support.
>> 
>> Google-Bug-Id: 18297052
>> 
>> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>> ---
>> fs/ext4/inode.c             | 49 ++++++++++++++++++++++++++++++++++++++++++---
>> fs/ext4/super.c             |  9 +++++++++
>> include/trace/events/ext4.h | 30 +++++++++++++++++++++++++++
>> 3 files changed, 85 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 5653fa4..8308c82 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
>> }
>> 
>> /*
>> + * Opportunistically update the other time fields for other inodes in
>> + * the same inode table block.
>> + */
>> +static void ext4_update_other_inodes_time(struct super_block *sb,
>> +					  unsigned long orig_ino, char *buf)
>> +{
>> +	struct ext4_inode_info	*ei;
>> +	struct ext4_inode	*raw_inode;
>> +	unsigned long		ino;
>> +	struct inode		*inode;
>> +	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
>> +	int		inode_size = EXT4_INODE_SIZE(sb);
>> +
>> +	ino = orig_ino & ~(inodes_per_block - 1);
>> +	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
>> +		if (ino == orig_ino)
>> +			continue;
>> +		inode = find_active_inode_nowait(sb, ino);
>> +		if (!inode ||
>> +		    (inode->i_state & I_DIRTY_TIME) == 0 ||
>> +		    !spin_trylock(&inode->i_lock)) {
>> +			iput(inode);
>> +			continue;
>> +		}
>> +		inode->i_state &= ~I_DIRTY_TIME;
>> +		inode->i_ts_dirty_day = 0;
>> +		spin_unlock(&inode->i_lock);
>> +		inode_requeue_dirtytime(inode);
>> +
>> +		ei = EXT4_I(inode);
>> +		raw_inode = (struct ext4_inode *) buf;
>> +
>> +		spin_lock(&ei->i_raw_lock);
>> +		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
>> +		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
>> +		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
>> +		ext4_inode_csum_set(inode, raw_inode, ei);
>> +		spin_unlock(&ei->i_raw_lock);
>> +		trace_ext4_other_inode_update_time(inode, orig_ino);
>> +		iput(inode);
>> +	}
>> +}
> 
> Am I right in that this now does unlogged timestamp updates of
> inodes? What happens when that buffer gets overwritten by log
> recover after a crash? The timestamp updates get lost?
> 
> FYI, XFS has had all sorts of nasty log recovery corner cases
> caused by log recovery overwriting non-logged inode updates like
> this. In the past few years we've removed every single non-logged
> inode update "optimisation" so that all changes (including timestamps)
> are transactional so inode state on disk not matching what log
> recovery wrote to disk for all the other inode metadata...
> 
> Optimistic unlogged inode updates are a slippery slope, and history
> tells me that it doesn't lead to a nice place....
Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
logical journaling, this isn't an unlogged update.  It is just taking
advantage of the fact that the whole block is going to be logged and
written to the disk anyway.  If the only update needed for other inodes
in the block is the timestamp then they may as well be flushed to disk
at the same time and avoid the need for another update later on.
Cheers, Andreas
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-26 23:10     ` Andreas Dilger
@ 2014-11-26 23:35       ` Dave Chinner
  2014-11-27 13:27         ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2014-11-26 23:35 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
On Wed, Nov 26, 2014 at 04:10:44PM -0700, Andreas Dilger wrote:
> On Nov 26, 2014, at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
> >> Add an optimization for the MS_LAZYTIME mount option so that we will
> >> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> >> in a particular inode table block when we need to update some inode
> >> in that inode table block anyway.
> >> 
> >> Also add some temporary code so that we can set the lazytime mount
> >> option without needing a modified /sbin/mount program which can set
> >> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> >> added support.
> >> 
> >> Google-Bug-Id: 18297052
> >> 
> >> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> >> ---
> >> fs/ext4/inode.c             | 49 ++++++++++++++++++++++++++++++++++++++++++---
> >> fs/ext4/super.c             |  9 +++++++++
> >> include/trace/events/ext4.h | 30 +++++++++++++++++++++++++++
> >> 3 files changed, 85 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 5653fa4..8308c82 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
> >> }
> >> 
> >> /*
> >> + * Opportunistically update the other time fields for other inodes in
> >> + * the same inode table block.
> >> + */
> >> +static void ext4_update_other_inodes_time(struct super_block *sb,
> >> +					  unsigned long orig_ino, char *buf)
> >> +{
> >> +	struct ext4_inode_info	*ei;
> >> +	struct ext4_inode	*raw_inode;
> >> +	unsigned long		ino;
> >> +	struct inode		*inode;
> >> +	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> >> +	int		inode_size = EXT4_INODE_SIZE(sb);
> >> +
> >> +	ino = orig_ino & ~(inodes_per_block - 1);
> >> +	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> >> +		if (ino == orig_ino)
> >> +			continue;
> >> +		inode = find_active_inode_nowait(sb, ino);
> >> +		if (!inode ||
> >> +		    (inode->i_state & I_DIRTY_TIME) == 0 ||
> >> +		    !spin_trylock(&inode->i_lock)) {
> >> +			iput(inode);
> >> +			continue;
> >> +		}
> >> +		inode->i_state &= ~I_DIRTY_TIME;
> >> +		inode->i_ts_dirty_day = 0;
> >> +		spin_unlock(&inode->i_lock);
> >> +		inode_requeue_dirtytime(inode);
> >> +
> >> +		ei = EXT4_I(inode);
> >> +		raw_inode = (struct ext4_inode *) buf;
> >> +
> >> +		spin_lock(&ei->i_raw_lock);
> >> +		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
> >> +		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
> >> +		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
> >> +		ext4_inode_csum_set(inode, raw_inode, ei);
> >> +		spin_unlock(&ei->i_raw_lock);
> >> +		trace_ext4_other_inode_update_time(inode, orig_ino);
> >> +		iput(inode);
> >> +	}
> >> +}
> > 
> > Am I right in that this now does unlogged timestamp updates of
> > inodes? What happens when that buffer gets overwritten by log
> > recover after a crash? The timestamp updates get lost?
> > 
> > FYI, XFS has had all sorts of nasty log recovery corner cases
> > caused by log recovery overwriting non-logged inode updates like
> > this. In the past few years we've removed every single non-logged
> > inode update "optimisation" so that all changes (including timestamps)
> > are transactional so inode state on disk not matching what log
> > recovery wrote to disk for all the other inode metadata...
> > 
> > Optimistic unlogged inode updates are a slippery slope, and history
> > tells me that it doesn't lead to a nice place....
> 
> Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
> logical journaling, this isn't an unlogged update.  It is just taking
> advantage of the fact that the whole block is going to be logged and
> written to the disk anyway.
Urk - that's worse, isn't it? i.e the code above calls iput() from
within a current transaction context?  What happens if that drops
the last reference to the inode and it gets evicted due to racing
with an unlink? Won't that try to start another transaction to free
the inode (i.e. through ext4_evict_inode())?
>
> If the only update needed for other inodes
> in the block is the timestamp then they may as well be flushed to disk
> at the same time and avoid the need for another update later on.
> 
> Cheers, Andreas
> 
> 
> 
> 
> 
> 
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-26 19:23   ` Christoph Hellwig
@ 2014-11-27 12:34     ` Jan Kara
  2014-11-27 15:25       ` Christoph Hellwig
  2014-11-27 14:41     ` Theodore Ts'o
  1 sibling, 1 reply; 36+ messages in thread
From: Jan Kara @ 2014-11-27 12:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
On Wed 26-11-14 11:23:28, Christoph Hellwig wrote:
> As mentioned last round please move the addition of the is_readonly
> operation to the first thing in the series, so that the ordering makes
> more sense.
> 
> Second I think this patch is incorrect for XFS - XFS uses ->update_time
> to set the time stampst in the dinode.  These two need to be coherent
> as we can write out a dirty inode any time, so it needs to have the
> timestamp uptodate.
  But Ted changed XFS to copy timestamps to on-disk structure from the
in-memory inode fields after VFS updated the timestamps. So the stamps
should be coherent AFAICT, shouldn't they?
> Third update_time now calls mark_inode_dirty unconditionally, while
> previously it wasn't called when ->update_time was set.  At least
> for XFS that's a major change in behavior as XFS never used VFS dirty
> tracking for metadata updates.
  We don't call mark_inode_dirty() when ->write_time is set (note the
return, I missed it on the first reading) which looks sensible to me.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option
  2014-11-26 10:23 ` [PATCH-v4 2/7] vfs: add support for a lazytime mount option Theodore Ts'o
@ 2014-11-27 13:14   ` Jan Kara
  2014-11-27 20:19     ` Theodore Ts'o
  2014-11-27 23:00     ` Theodore Ts'o
  0 siblings, 2 replies; 36+ messages in thread
From: Jan Kara @ 2014-11-27 13:14 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Wed 26-11-14 05:23:52, Ted Tso wrote:
> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
> 
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
> 
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
  So this looks better to me than previous versions but I'm still not 100%
happy :)
Looking into the code & your patch I'd prefer to do something like:
* add support for I_DIRTY_TIME in __mark_inode_dirty() - update_time will
  call __mark_inode_dirty() with this flag if any of the times was updated.
  That way we can just remove your ->write_time() callback - filesystems
  can just handle this in their ->dirty_inode() methods if they wish.
  __mark_inode_dirty() will take care of moving inode into proper writeback
  list (i_dirty / i_dirty_time), dirtied_when will be set to current time.
* change queue_io() to also call
	moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, time + 24hours)
  For this you need to tweak move_expired_inodes() to take pointer to
  timestamp instead of pointer to work but that's trivial. Also you want
  probably leave time ->older_than_this value (i.e. without +24 hours) if
  we are doing WB_SYNC_ALL writeback. With this you can remove
  flush_sb_dirty_time() completely.
* Changes for iput() & fsync stay as they are.
And this should be all that's necessary. I'm not 100% sure about your dirty
bits naming changes - let's see how that will look like when the above more
substantial changes are done.
One technical detail below:
> diff --git a/fs/inode.c b/fs/inode.c
> index 8f5c4b5..9e464cc 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1430,11 +1430,22 @@ static void iput_final(struct inode *inode)
>   */
>  void iput(struct inode *inode)
>  {
> -	if (inode) {
> -		BUG_ON(inode->i_state & I_CLEAR);
> -
> -		if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
> -			iput_final(inode);
> +	if (!inode)
> +		return;
> +	BUG_ON(inode->i_state & I_CLEAR);
  I think we can better handle this without retry at this place like:
	if (atomic_read(&inode->i_count) == 1 && inode->i_nlink &&
	    (inode->i_state & I_DIRTY_TIME)) {
		if (inode->i_op->write_time)
			inode->i_op->write_time(inode);
		else if (inode->i_sb->s_op->write_inode)
			mark_inode_dirty_sync(inode);
	}
  Sure it will be one more read of i_count in the fast path but that's IMO
negligible.
BTW: Is the test for ->write_inode really needed? We don't do it e.g. in
update_time().
> +retry:
> +	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> +		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> +			atomic_inc(&inode->i_count);
> +			inode->i_state &= ~I_DIRTY_TIME;
> +			spin_unlock(&inode->i_lock);
> +			if (inode->i_op->write_time)
> +				inode->i_op->write_time(inode);
> +			else if (inode->i_sb->s_op->write_inode)
> +				mark_inode_dirty_sync(inode);
> +			goto retry;
> +		}
> +		iput_final(inode);
>  	}
>  }
>  EXPORT_SYMBOL(iput);
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-26 23:35       ` Dave Chinner
@ 2014-11-27 13:27         ` Jan Kara
  2014-11-27 13:32           ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2014-11-27 13:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, Theodore Ts'o,
	Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu 27-11-14 10:35:37, Dave Chinner wrote:
> On Wed, Nov 26, 2014 at 04:10:44PM -0700, Andreas Dilger wrote:
> > On Nov 26, 2014, at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
> > >> Add an optimization for the MS_LAZYTIME mount option so that we will
> > >> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> > >> in a particular inode table block when we need to update some inode
> > >> in that inode table block anyway.
> > >> 
> > >> Also add some temporary code so that we can set the lazytime mount
> > >> option without needing a modified /sbin/mount program which can set
> > >> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> > >> added support.
> > >> 
> > >> Google-Bug-Id: 18297052
> > >> 
> > >> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> > >> ---
> > >> fs/ext4/inode.c             | 49 ++++++++++++++++++++++++++++++++++++++++++---
> > >> fs/ext4/super.c             |  9 +++++++++
> > >> include/trace/events/ext4.h | 30 +++++++++++++++++++++++++++
> > >> 3 files changed, 85 insertions(+), 3 deletions(-)
> > >> 
> > >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > >> index 5653fa4..8308c82 100644
> > >> --- a/fs/ext4/inode.c
> > >> +++ b/fs/ext4/inode.c
> > >> @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
> > >> }
> > >> 
> > >> /*
> > >> + * Opportunistically update the other time fields for other inodes in
> > >> + * the same inode table block.
> > >> + */
> > >> +static void ext4_update_other_inodes_time(struct super_block *sb,
> > >> +					  unsigned long orig_ino, char *buf)
> > >> +{
> > >> +	struct ext4_inode_info	*ei;
> > >> +	struct ext4_inode	*raw_inode;
> > >> +	unsigned long		ino;
> > >> +	struct inode		*inode;
> > >> +	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> > >> +	int		inode_size = EXT4_INODE_SIZE(sb);
> > >> +
> > >> +	ino = orig_ino & ~(inodes_per_block - 1);
> > >> +	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> > >> +		if (ino == orig_ino)
> > >> +			continue;
> > >> +		inode = find_active_inode_nowait(sb, ino);
> > >> +		if (!inode ||
> > >> +		    (inode->i_state & I_DIRTY_TIME) == 0 ||
> > >> +		    !spin_trylock(&inode->i_lock)) {
> > >> +			iput(inode);
> > >> +			continue;
> > >> +		}
> > >> +		inode->i_state &= ~I_DIRTY_TIME;
> > >> +		inode->i_ts_dirty_day = 0;
> > >> +		spin_unlock(&inode->i_lock);
> > >> +		inode_requeue_dirtytime(inode);
> > >> +
> > >> +		ei = EXT4_I(inode);
> > >> +		raw_inode = (struct ext4_inode *) buf;
> > >> +
> > >> +		spin_lock(&ei->i_raw_lock);
> > >> +		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
> > >> +		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
> > >> +		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
> > >> +		ext4_inode_csum_set(inode, raw_inode, ei);
> > >> +		spin_unlock(&ei->i_raw_lock);
> > >> +		trace_ext4_other_inode_update_time(inode, orig_ino);
> > >> +		iput(inode);
> > >> +	}
> > >> +}
> > > 
> > > Am I right in that this now does unlogged timestamp updates of
> > > inodes? What happens when that buffer gets overwritten by log
> > > recover after a crash? The timestamp updates get lost?
> > > 
> > > FYI, XFS has had all sorts of nasty log recovery corner cases
> > > caused by log recovery overwriting non-logged inode updates like
> > > this. In the past few years we've removed every single non-logged
> > > inode update "optimisation" so that all changes (including timestamps)
> > > are transactional so inode state on disk not matching what log
> > > recovery wrote to disk for all the other inode metadata...
> > > 
> > > Optimistic unlogged inode updates are a slippery slope, and history
> > > tells me that it doesn't lead to a nice place....
> > 
> > Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
> > logical journaling, this isn't an unlogged update.  It is just taking
> > advantage of the fact that the whole block is going to be logged and
> > written to the disk anyway.
> 
> Urk - that's worse, isn't it? i.e the code above calls iput() from
> within a current transaction context?  What happens if that drops
> the last reference to the inode and it gets evicted due to racing
> with an unlink? Won't that try to start another transaction to free
> the inode (i.e. through ext4_evict_inode())?
  Yeah, the patch looks buggy (and racy wrt concurrent updates of time
stamps as well). I think if we want to do this optimization, we would need
a function like "clear inode dirty bits for this range of inode numbers".
That is doable atomically within VFS and although it looks somewhat ugly,
the performance
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-27 13:27         ` Jan Kara
@ 2014-11-27 13:32           ` Jan Kara
  2014-11-27 15:25             ` Theodore Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2014-11-27 13:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, Theodore Ts'o,
	Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu 27-11-14 14:27:52, Jan Kara wrote:
> On Thu 27-11-14 10:35:37, Dave Chinner wrote:
> > On Wed, Nov 26, 2014 at 04:10:44PM -0700, Andreas Dilger wrote:
> > > On Nov 26, 2014, at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > 
> > > > On Wed, Nov 26, 2014 at 05:23:56AM -0500, Theodore Ts'o wrote:
> > > >> Add an optimization for the MS_LAZYTIME mount option so that we will
> > > >> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> > > >> in a particular inode table block when we need to update some inode
> > > >> in that inode table block anyway.
> > > >> 
> > > >> Also add some temporary code so that we can set the lazytime mount
> > > >> option without needing a modified /sbin/mount program which can set
> > > >> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> > > >> added support.
> > > >> 
> > > >> Google-Bug-Id: 18297052
> > > >> 
> > > >> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> > > >> ---
> > > >> fs/ext4/inode.c             | 49 ++++++++++++++++++++++++++++++++++++++++++---
> > > >> fs/ext4/super.c             |  9 +++++++++
> > > >> include/trace/events/ext4.h | 30 +++++++++++++++++++++++++++
> > > >> 3 files changed, 85 insertions(+), 3 deletions(-)
> > > >> 
> > > >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > >> index 5653fa4..8308c82 100644
> > > >> --- a/fs/ext4/inode.c
> > > >> +++ b/fs/ext4/inode.c
> > > >> @@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
> > > >> }
> > > >> 
> > > >> /*
> > > >> + * Opportunistically update the other time fields for other inodes in
> > > >> + * the same inode table block.
> > > >> + */
> > > >> +static void ext4_update_other_inodes_time(struct super_block *sb,
> > > >> +					  unsigned long orig_ino, char *buf)
> > > >> +{
> > > >> +	struct ext4_inode_info	*ei;
> > > >> +	struct ext4_inode	*raw_inode;
> > > >> +	unsigned long		ino;
> > > >> +	struct inode		*inode;
> > > >> +	int		i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> > > >> +	int		inode_size = EXT4_INODE_SIZE(sb);
> > > >> +
> > > >> +	ino = orig_ino & ~(inodes_per_block - 1);
> > > >> +	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> > > >> +		if (ino == orig_ino)
> > > >> +			continue;
> > > >> +		inode = find_active_inode_nowait(sb, ino);
> > > >> +		if (!inode ||
> > > >> +		    (inode->i_state & I_DIRTY_TIME) == 0 ||
> > > >> +		    !spin_trylock(&inode->i_lock)) {
> > > >> +			iput(inode);
> > > >> +			continue;
> > > >> +		}
> > > >> +		inode->i_state &= ~I_DIRTY_TIME;
> > > >> +		inode->i_ts_dirty_day = 0;
> > > >> +		spin_unlock(&inode->i_lock);
> > > >> +		inode_requeue_dirtytime(inode);
> > > >> +
> > > >> +		ei = EXT4_I(inode);
> > > >> +		raw_inode = (struct ext4_inode *) buf;
> > > >> +
> > > >> +		spin_lock(&ei->i_raw_lock);
> > > >> +		EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
> > > >> +		EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
> > > >> +		EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
> > > >> +		ext4_inode_csum_set(inode, raw_inode, ei);
> > > >> +		spin_unlock(&ei->i_raw_lock);
> > > >> +		trace_ext4_other_inode_update_time(inode, orig_ino);
> > > >> +		iput(inode);
> > > >> +	}
> > > >> +}
> > > > 
> > > > Am I right in that this now does unlogged timestamp updates of
> > > > inodes? What happens when that buffer gets overwritten by log
> > > > recover after a crash? The timestamp updates get lost?
> > > > 
> > > > FYI, XFS has had all sorts of nasty log recovery corner cases
> > > > caused by log recovery overwriting non-logged inode updates like
> > > > this. In the past few years we've removed every single non-logged
> > > > inode update "optimisation" so that all changes (including timestamps)
> > > > are transactional so inode state on disk not matching what log
> > > > recovery wrote to disk for all the other inode metadata...
> > > > 
> > > > Optimistic unlogged inode updates are a slippery slope, and history
> > > > tells me that it doesn't lead to a nice place....
> > > 
> > > Since ext4/jbd2 is logging the whole block, unlike XFS which is doing
> > > logical journaling, this isn't an unlogged update.  It is just taking
> > > advantage of the fact that the whole block is going to be logged and
> > > written to the disk anyway.
> > 
> > Urk - that's worse, isn't it? i.e the code above calls iput() from
> > within a current transaction context?  What happens if that drops
> > the last reference to the inode and it gets evicted due to racing
> > with an unlink? Won't that try to start another transaction to free
> > the inode (i.e. through ext4_evict_inode())?
>   Yeah, the patch looks buggy (and racy wrt concurrent updates of time
> stamps as well). I think if we want to do this optimization, we would need
> a function like "clear inode dirty bits for this range of inode numbers".
> That is doable atomically within VFS and although it looks somewhat ugly,
> the performance
  Sorry, I sent this too early (did send instead of postpone). So the patch
looks buggy because of iput() but it isn't racy wrt time updates as I
checked now. So it would be enough to move calling of this outside of the
transaction and start new handle for each inode.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-26 19:23   ` Christoph Hellwig
  2014-11-27 12:34     ` Jan Kara
@ 2014-11-27 14:41     ` Theodore Ts'o
  2014-11-27 15:28       ` Christoph Hellwig
  2014-11-27 15:33       ` Theodore Ts'o
  1 sibling, 2 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-27 14:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Wed, Nov 26, 2014 at 11:23:28AM -0800, Christoph Hellwig wrote:
> As mentioned last round please move the addition of the is_readonly
> operation to the first thing in the series, so that the ordering makes
> more sense.
OK, will fix.
> Second I think this patch is incorrect for XFS - XFS uses ->update_time
> to set the time stampst in the dinode.  These two need to be coherent
> as we can write out a dirty inode any time, so it needs to have the
> timestamp uptodate.
> 
> Third update_time now calls mark_inode_dirty unconditionally, while
> previously it wasn't called when ->update_time was set.  At least
> for XFS that's a major change in behavior as XFS never used VFS dirty
> tracking for metadata updates.
Thanks, I'll fix both of the above.
							- Ted
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-27 12:34     ` Jan Kara
@ 2014-11-27 15:25       ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2014-11-27 15:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
On Thu, Nov 27, 2014 at 01:34:29PM +0100, Jan Kara wrote:
>   But Ted changed XFS to copy timestamps to on-disk structure from the
> in-memory inode fields after VFS updated the timestamps. So the stamps
> should be coherent AFAICT, shouldn't they?
Not coherent enough.  We need the XFS ilock to synchronize reading from
and writing to the time stamps.  update_time() only has i_mutex, which
we can't take for the transaction commit path.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-27 13:32           ` Jan Kara
@ 2014-11-27 15:25             ` Theodore Ts'o
  2014-11-27 15:41               ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-27 15:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Andreas Dilger, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
This is what I'm currently playing with which I believe fixes the iput()
problem.  In fs/ext4/inode.c:
struct other_inode {
	unsigned long		orig_ino;
	struct ext4_inode	*raw_inode;
};
static int other_inode_match(struct inode * inode, unsigned long ino,
			     void *data);
/*
 * Opportunistically update the other time fields for other inodes in
 * the same inode table block.
 */
static void ext4_update_other_inodes_time(struct super_block *sb,
					  unsigned long orig_ino, char *buf)
{
	struct other_inode oi;
	unsigned long ino;
	int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
	int inode_size = EXT4_INODE_SIZE(sb);
	oi.orig_ino = orig_ino;
	ino = orig_ino & ~(inodes_per_block - 1);
	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
		if (ino == orig_ino)
			continue;
		oi.raw_inode = (struct ext4_inode *) buf;
		(void) find_inode_nowait(sb, ino, other_inode_match, &oi);
	}
}
static int other_inode_match(struct inode * inode, unsigned long ino,
			     void *data)
{
	struct other_inode *oi = (struct other_inode *) data;
	if ((inode->i_ino != ino) ||
	    (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
	    ((inode->i_state & I_DIRTY_TIME) == 0))
		return 0;
	spin_lock(&inode->i_lock);
	if (((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) &&
	    (inode->i_state & I_DIRTY_TIME)) {
		struct ext4_inode_info	*ei = EXT4_I(inode);
		inode->i_state &= ~I_DIRTY_TIME;
		inode->i_ts_dirty_day = 0;
		spin_unlock(&inode->i_lock);
		inode_requeue_dirtytime(inode);
		spin_lock(&ei->i_raw_lock);
		EXT4_INODE_SET_XTIME(i_ctime, inode, oi->raw_inode);
		EXT4_INODE_SET_XTIME(i_mtime, inode, oi->raw_inode);
		EXT4_INODE_SET_XTIME(i_atime, inode, oi->raw_inode);
		ext4_inode_csum_set(inode, oi->raw_inode, ei);
		spin_unlock(&ei->i_raw_lock);
		trace_ext4_other_inode_update_time(inode, oi->orig_ino);
		return -1;
	}
	spin_unlock(&inode->i_lock);
	return -1;
}
The above uses the following in fs/inode.c (which gets added instead of
find_active_inode_nowait):
/**
 * find_inode_nowait - find an inode in the inode cache
 * @sb:		super block of file system to search
 * @hashval:	hash value (usually inode number) to search for
 * @match:	callback used for comparisons between inodes
 * @data:	opaque data pointer to pass to @match
 *
 * Search for the inode specified by @hashval and @data in the inode
 * cache, where the helper function @match will return 0 if the inode
 * does not match, 1 if the inode does match, and -1 if the search
 * should be stopped.  The @match function must be responsible for
 * taking the i_lock spin_lock and checking i_state for an inode being
 * freed or being initialized, and incrementing the reference count
 * before returning 1.  It also must not sleep, since it is called with
 * the inode_hash_lock spinlock held.
 *
 * This is a even more generalized version of ilookup5() when the
 * function must never block --- find_inode() can block in
 * __wait_on_freeing_inode() --- or when the caller can not increment
 * the reference count because the resulting iput() might cause an
 * inode eviction().  The tradeoff is that the @match funtion must be
 * very carefully implemented.
 */
struct inode *find_inode_nowait(struct super_block *sb,
				unsigned long hashval,
				int (*match)(struct inode *, unsigned long,
					     void *),
				void *data)
{
	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
	struct inode *inode, *ret_inode = NULL;
	int mval;
	spin_lock(&inode_hash_lock);
	hlist_for_each_entry(inode, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		mval = match(inode, hashval, data);
		if (mval == 0)
			continue;
		if (mval == 1)
			ret_inode = inode;
		goto out;
	}
out:
	spin_unlock(&inode_hash_lock);
	return ret_inode;
}
EXPORT_SYMBOL(find_inode_nowait);
Comments?
							- Ted
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-27 14:41     ` Theodore Ts'o
@ 2014-11-27 15:28       ` Christoph Hellwig
  2014-11-27 15:33       ` Theodore Ts'o
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2014-11-27 15:28 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
FYI, I suspect for now the best might be to let filesystems that define
->update_times work as-is and not tie them into the infrastructure.  At
least for XFS I suspect the lazy updates might better be handled
internally, although I'm not entirely sure yet.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-27 14:41     ` Theodore Ts'o
  2014-11-27 15:28       ` Christoph Hellwig
@ 2014-11-27 15:33       ` Theodore Ts'o
  2014-11-27 16:49         ` Christoph Hellwig
  1 sibling, 1 reply; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-27 15:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
Christoph, can you take a quick look at this?  I'm not sure I got the
xfs inode transaction logging correct.
Thanks!!
						- Ted
commit cd58addfa340c9cf88b1f9b2d31a42e2e65c7252
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Nov 27 10:14:27 2014 -0500
    vfs: split update_time() into update_time() and write_time()
    
    In preparation for adding support for the lazytime mount option, we
    need to be able to separate out the update_time() and write_time()
    inode operations.  Previously, only btrfs and xfs uses update_time().
    
    With this patch, btrfs only needs write_time(), and xfs uses
    update_time() to synchronize its on-disk and in-memory timestamps.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Acked-by: David Sterba <dsterba@suse.cz>
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4b0ecd..befd5d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,7 +1545,8 @@ struct inode_operations {
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
 	int (*is_readonly)(struct inode *);
-	int (*update_time)(struct inode *, struct timespec *, int);
+	void (*update_time)(struct inode *);
+	int (*write_time)(struct inode *);
 	int (*atomic_open)(struct inode *, struct dentry *,
 			   struct file *, unsigned open_flag,
 			   umode_t create_mode, int *opened);
diff --git a/fs/inode.c b/fs/inode.c
index 53f0173..94bc908 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1506,9 +1506,6 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
  		if (ret)
  			return ret;
 	}
-	if (inode->i_op->update_time)
-		return inode->i_op->update_time(inode, time, flags);
-
 	if (flags & S_ATIME)
 		inode->i_atime = *time;
 	if (flags & S_VERSION)
@@ -1517,6 +1514,10 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		inode->i_ctime = *time;
 	if (flags & S_MTIME)
 		inode->i_mtime = *time;
+	if (inode->i_op->update_time)
+		inode->i_op->update_time(inode);
+	if (inode->i_op->write_time)
+		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
 	return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..b69493d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -983,42 +983,42 @@ xfs_vn_setattr(
 	return error;
 }
 
-STATIC int
+STATIC void
 xfs_vn_update_time(
-	struct inode		*inode,
-	struct timespec		*now,
-	int			flags)
+	struct inode		*inode)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	trace_xfs_update_time(ip);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	ip->i_d.di_ctime.t_sec = (__int32_t) inode->i_ctime.tv_sec;
+	ip->i_d.di_ctime.t_nsec = (__int32_t) inode->i_ctime.tv_nsec;
+
+	ip->i_d.di_mtime.t_sec = (__int32_t)inode->i_mtime.tv_sec;
+	ip->i_d.di_mtime.t_nsec = (__int32_t) inode->i_mtime.tv_nsec;
+
+	ip->i_d.di_atime.t_sec = (__int32_t) inode->i_atime.tv_sec;
+	ip->i_d.di_atime.t_nsec = (__int32_t) inode->i_atime.tv_nsec;
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+}
+
+STATIC int
+xfs_vn_write_time(
+	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	int			error;
 
-	trace_xfs_update_time(ip);
-
+	trace_xfs_write_time(ip);
 	tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
 	if (error) {
 		xfs_trans_cancel(tp, 0);
 		return error;
 	}
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (flags & S_CTIME) {
-		inode->i_ctime = *now;
-		ip->i_d.di_ctime.t_sec = (__int32_t)now->tv_sec;
-		ip->i_d.di_ctime.t_nsec = (__int32_t)now->tv_nsec;
-	}
-	if (flags & S_MTIME) {
-		inode->i_mtime = *now;
-		ip->i_d.di_mtime.t_sec = (__int32_t)now->tv_sec;
-		ip->i_d.di_mtime.t_nsec = (__int32_t)now->tv_nsec;
-	}
-	if (flags & S_ATIME) {
-		inode->i_atime = *now;
-		ip->i_d.di_atime.t_sec = (__int32_t)now->tv_sec;
-		ip->i_d.di_atime.t_nsec = (__int32_t)now->tv_nsec;
-	}
 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
 	return xfs_trans_commit(tp, 0);
@@ -1130,6 +1130,7 @@ static const struct inode_operations xfs_inode_operations = {
 	.listxattr		= xfs_vn_listxattr,
 	.fiemap			= xfs_vn_fiemap,
 	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 };
 
 static const struct inode_operations xfs_dir_inode_operations = {
@@ -1157,6 +1158,7 @@ static const struct inode_operations xfs_dir_inode_operations = {
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
 	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 	.tmpfile		= xfs_vn_tmpfile,
 };
 
@@ -1185,6 +1187,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = {
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
 	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 	.tmpfile		= xfs_vn_tmpfile,
 };
 
@@ -1199,6 +1202,7 @@ static const struct inode_operations xfs_symlink_inode_operations = {
 	.removexattr		= generic_removexattr,
 	.listxattr		= xfs_vn_listxattr,
 	.update_time		= xfs_vn_update_time,
+	.write_time		= xfs_vn_write_time,
 };
 
 STATIC void
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 51372e3..09d261c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -677,6 +677,7 @@ DEFINE_INODE_EVENT(xfs_file_fsync);
 DEFINE_INODE_EVENT(xfs_destroy_inode);
 DEFINE_INODE_EVENT(xfs_evict_inode);
 DEFINE_INODE_EVENT(xfs_update_time);
+DEFINE_INODE_EVENT(xfs_write_time);
 
 DEFINE_INODE_EVENT(xfs_dquot_dqalloc);
 DEFINE_INODE_EVENT(xfs_dquot_dqdetach);
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index b30753c..ee94a66 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -62,7 +62,8 @@ prototypes:
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
-	void (*update_time)(struct inode *, struct timespec *, int);
+	void (*update_time)(struct inode *);
+	void (*write_time)(struct inode *);
 	int (*atomic_open)(struct inode *, struct dentry *,
 				struct file *, unsigned open_flag,
 				umode_t create_mode, int *opened);
@@ -95,6 +96,7 @@ listxattr:	no
 removexattr:	yes
 fiemap:		no
 update_time:	no
+write_time:	no
 atomic_open:	yes
 tmpfile:	no
 dentry_open:	no
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bd46a22..a81a0652 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5563,26 +5563,8 @@ static int btrfs_is_readonly(struct inode *inode)
 	return 0;
 }
 
-/*
- * This is a copy of file_update_time.  We need this so we can return error on
- * ENOSPC for updating the inode in the case of file write and mmap writes.
- */
-static int btrfs_update_time(struct inode *inode, struct timespec *now,
-			     int flags)
+static int btrfs_write_time(struct inode *inode)
 {
-	struct btrfs_root *root = BTRFS_I(inode)->root;
-
-	if (btrfs_root_readonly(root))
-		return -EROFS;
-
-	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
-	if (flags & S_CTIME)
-		inode->i_ctime = *now;
-	if (flags & S_MTIME)
-		inode->i_mtime = *now;
-	if (flags & S_ATIME)
-		inode->i_atime = *now;
 	return btrfs_dirty_inode(inode);
 }
 
@@ -9471,7 +9453,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.is_readonly	= btrfs_is_readonly,
-	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 	.tmpfile        = btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9480,7 +9462,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.is_readonly	= btrfs_is_readonly,
-	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9551,7 +9533,7 @@ static const struct inode_operations btrfs_file_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.is_readonly	= btrfs_is_readonly,
-	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
 	.getattr	= btrfs_getattr,
@@ -9564,7 +9546,7 @@ static const struct inode_operations btrfs_special_inode_operations = {
 	.get_acl	= btrfs_get_acl,
 	.set_acl	= btrfs_set_acl,
 	.is_readonly	= btrfs_is_readonly,
-	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
 	.readlink	= generic_readlink,
@@ -9578,7 +9560,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
 	.listxattr	= btrfs_listxattr,
 	.removexattr	= btrfs_removexattr,
 	.is_readonly	= btrfs_is_readonly,
-	.update_time	= btrfs_update_time,
+	.write_time	= btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-27 15:25             ` Theodore Ts'o
@ 2014-11-27 15:41               ` Jan Kara
  2014-11-27 20:13                 ` Theodore Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2014-11-27 15:41 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Andreas Dilger, Jan Kara, XFS Developers,
	Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List
On Thu 27-11-14 10:25:24, Ted Tso wrote:
> This is what I'm currently playing with which I believe fixes the iput()
> problem.  In fs/ext4/inode.c:
> 
> struct other_inode {
> 	unsigned long		orig_ino;
> 	struct ext4_inode	*raw_inode;
> };
> static int other_inode_match(struct inode * inode, unsigned long ino,
> 			     void *data);
> 
> /*
>  * Opportunistically update the other time fields for other inodes in
>  * the same inode table block.
>  */
> static void ext4_update_other_inodes_time(struct super_block *sb,
> 					  unsigned long orig_ino, char *buf)
> {
> 	struct other_inode oi;
> 	unsigned long ino;
> 	int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> 	int inode_size = EXT4_INODE_SIZE(sb);
> 
> 	oi.orig_ino = orig_ino;
> 	ino = orig_ino & ~(inodes_per_block - 1);
> 	for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> 		if (ino == orig_ino)
> 			continue;
> 		oi.raw_inode = (struct ext4_inode *) buf;
> 		(void) find_inode_nowait(sb, ino, other_inode_match, &oi);
> 	}
> }
> 
> static int other_inode_match(struct inode * inode, unsigned long ino,
> 			     void *data)
> {
> 	struct other_inode *oi = (struct other_inode *) data;
> 
> 	if ((inode->i_ino != ino) ||
> 	    (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
> 	    ((inode->i_state & I_DIRTY_TIME) == 0))
> 		return 0;
> 	spin_lock(&inode->i_lock);
> 	if (((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) &&
> 	    (inode->i_state & I_DIRTY_TIME)) {
> 		struct ext4_inode_info	*ei = EXT4_I(inode);
> 
> 		inode->i_state &= ~I_DIRTY_TIME;
> 		inode->i_ts_dirty_day = 0;
> 		spin_unlock(&inode->i_lock);
> 		inode_requeue_dirtytime(inode);
> 
> 		spin_lock(&ei->i_raw_lock);
> 		EXT4_INODE_SET_XTIME(i_ctime, inode, oi->raw_inode);
> 		EXT4_INODE_SET_XTIME(i_mtime, inode, oi->raw_inode);
> 		EXT4_INODE_SET_XTIME(i_atime, inode, oi->raw_inode);
> 		ext4_inode_csum_set(inode, oi->raw_inode, ei);
> 		spin_unlock(&ei->i_raw_lock);
> 		trace_ext4_other_inode_update_time(inode, oi->orig_ino);
> 		return -1;
> 	}
> 	spin_unlock(&inode->i_lock);
> 	return -1;
> }
  Hum, but this puts lots of stuff under inode_hash_lock, including
writeback list lock. I don't like this too much. I understand that getting
handle for each inode is rather more CPU intensive but it should still be a
clear win over the current situation and avoids entangling locks like this.
								Honza
> The above uses the following in fs/inode.c (which gets added instead of
> find_active_inode_nowait):
> 
> /**
>  * find_inode_nowait - find an inode in the inode cache
>  * @sb:		super block of file system to search
>  * @hashval:	hash value (usually inode number) to search for
>  * @match:	callback used for comparisons between inodes
>  * @data:	opaque data pointer to pass to @match
>  *
>  * Search for the inode specified by @hashval and @data in the inode
>  * cache, where the helper function @match will return 0 if the inode
>  * does not match, 1 if the inode does match, and -1 if the search
>  * should be stopped.  The @match function must be responsible for
>  * taking the i_lock spin_lock and checking i_state for an inode being
>  * freed or being initialized, and incrementing the reference count
>  * before returning 1.  It also must not sleep, since it is called with
>  * the inode_hash_lock spinlock held.
>  *
>  * This is a even more generalized version of ilookup5() when the
>  * function must never block --- find_inode() can block in
>  * __wait_on_freeing_inode() --- or when the caller can not increment
>  * the reference count because the resulting iput() might cause an
>  * inode eviction().  The tradeoff is that the @match funtion must be
>  * very carefully implemented.
>  */
> struct inode *find_inode_nowait(struct super_block *sb,
> 				unsigned long hashval,
> 				int (*match)(struct inode *, unsigned long,
> 					     void *),
> 				void *data)
> {
> 	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> 	struct inode *inode, *ret_inode = NULL;
> 	int mval;
> 
> 	spin_lock(&inode_hash_lock);
> 	hlist_for_each_entry(inode, head, i_hash) {
> 		if (inode->i_sb != sb)
> 			continue;
> 		mval = match(inode, hashval, data);
> 		if (mval == 0)
> 			continue;
> 		if (mval == 1)
> 			ret_inode = inode;
> 		goto out;
> 	}
> out:
> 	spin_unlock(&inode_hash_lock);
> 	return ret_inode;
> }
> EXPORT_SYMBOL(find_inode_nowait);
> 
> Comments?
> 
> 							- Ted
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-27 15:33       ` Theodore Ts'o
@ 2014-11-27 16:49         ` Christoph Hellwig
  2014-11-27 20:27           ` Theodore Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2014-11-27 16:49 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
I don't think this scheme works well.  As mentioned earlier XFS doesn't
even use vfs dirty tracking at the moment, so introducing this in a
hidden way sounds like a bad idea.  Probably the same for btrfs.
I'd rather keep update_time as-is for now, don't add ->write_time and
let btrfs and XFS figure out how to implement the semantics on their
own.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option
  2014-11-27 15:41               ` Jan Kara
@ 2014-11-27 20:13                 ` Theodore Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-27 20:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, XFS Developers, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List
On Thu, Nov 27, 2014 at 04:41:59PM +0100, Jan Kara wrote:
>   Hum, but this puts lots of stuff under inode_hash_lock, including
> writeback list lock. I don't like this too much. I understand that getting
> handle for each inode is rather more CPU intensive but it should still be a
> clear win over the current situation and avoids entangling locks like this.
Hmm, if we dropped the inode_requeue_dirtytime(), then we can avoid
taking the writeback list lock.  The net result is that we would have
some inodes still on the b_dirty_time list that were no longer
I_DIRTY_TIME, but since I_DIRTY_TIME wouldn't be set, it's mostly
harmless since when we do iterate over the b_dirty_time list, those
inodes can be quickly identified and skipped over.  (And if the inode
ever gets dirtied for real, then it will get moved onto the b_dirty
list and that will be that.)
The problem with getting a handle on the inode is not just that it is
more CPU intensive, but that can't let the iput_final() call happen
until after we have finished the transaction handle.  We could keep a
linked list of inodes attached to the handle, and then only call iput
on them once ext4_journal_stop(handle) gets called, but that's a
complication I'd like to avoid if at all possible.
Being able to opporunistically write the timestamps when we are
journalling an inode table block is a pretty big win, so if we end up
extending the hold time on inode_hash_lock (only when we come across a
I_DIRTY_TIME inode that we can clear) a tiny bit, there will be a lot
of workloads where I think it's a worthwhile tradeoff.  If we can
avoid entangling the writebakc list lock, does that make you happier
about this approach?
	     	      	       	     	 - Ted
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option
  2014-11-27 13:14   ` Jan Kara
@ 2014-11-27 20:19     ` Theodore Ts'o
  2014-11-28 12:41       ` Jan Kara
  2014-11-27 23:00     ` Theodore Ts'o
  1 sibling, 1 reply; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-27 20:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> Looking into the code & your patch I'd prefer to do something like:
> * add support for I_DIRTY_TIME in __mark_inode_dirty() - update_time will
>   call __mark_inode_dirty() with this flag if any of the times was updated.
>   That way we can just remove your ->write_time() callback - filesystems
>   can just handle this in their ->dirty_inode() methods if they wish.
>   __mark_inode_dirty() will take care of moving inode into proper writeback
>   list (i_dirty / i_dirty_time), dirtied_when will be set to current time.
One of the tricky bits about this is that btrfs wants to be able to
return an error from write_time() which gets reflected up through
update_time() to the callers of file_update_time().  Currently
__mark_inode_dirty() and family return a void, and changing this is
going to be a bit of a mess, since doing this correctly would require
auditing all of the callers of mark_inode_dirty(),
mark_inode_dirty_sync(), __mark_inode_dirty(), etc.
Doing this would be a good thing, and eventually I think it would be
nice if we could allow the mark_inode_dirty() functions return an
error instead of void, but I wonder if that's a cleanup that's better
saved for later.  While we were at it, maybe we should rename
mark_inode_dirty() to inode_dirty(), since that way we can be sure
we've looked at all of the call site of mark_inode_dirty() and friends
--- and we have a number of file systems, including btrfs, ext3, and
ext4, where mark_inode_dirty() is doing a lot more than just marking
the inode is dirty, and the only reason why it's named that is mostly
historical.
						- Ted
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-27 16:49         ` Christoph Hellwig
@ 2014-11-27 20:27           ` Theodore Ts'o
  2014-12-01  9:28             ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-27 20:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu, Nov 27, 2014 at 08:49:52AM -0800, Christoph Hellwig wrote:
> I don't think this scheme works well.  As mentioned earlier XFS doesn't
> even use vfs dirty tracking at the moment, so introducing this in a
> hidden way sounds like a bad idea.  Probably the same for btrfs.
> 
> I'd rather keep update_time as-is for now, don't add ->write_time and
> let btrfs and XFS figure out how to implement the semantics on their
> own.
I can do that, but part of the reason why we were doing this rather
involved set of changes was to allow other file systems to be able to
take advantage of lazytime.  I suppose there is value in allowing
other file systems, such as jfs, f2fs, etc., to use it, but still,
it's a bit of a shame to drop btrfs and xfs support for this feature.
I'll note by the way that ext3 and ext4 doesn't really use VFS dirty
tracking either --- see my other comments about the naming of
"mark_inode_dirty" being a bit misleading, at least for all/most of
the major file systems.  The problem seems to be that replacement
schemes that we've all using are slightly different.  :-/
I suppose should let the btrfs folks decide whether they want to add
is_readonly() and write_time() function --- or maybe help with the
cleanup work so that mark_inode_dirty() can reflect an error to its
callers.   Chris, David, what do you think?
		   	  	      	     - Ted
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option
  2014-11-27 13:14   ` Jan Kara
  2014-11-27 20:19     ` Theodore Ts'o
@ 2014-11-27 23:00     ` Theodore Ts'o
  2014-11-28  5:36       ` Theodore Ts'o
  2014-11-28 16:24       ` Jan Kara
  1 sibling, 2 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-27 23:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> * change queue_io() to also call
> 	moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, time + 24hours)
>   For this you need to tweak move_expired_inodes() to take pointer to
>   timestamp instead of pointer to work but that's trivial. Also you want
>   probably leave time ->older_than_this value (i.e. without +24 hours) if
>   we are doing WB_SYNC_ALL writeback. With this you can remove
>   flush_sb_dirty_time() completely.
Well.... it's not quite enough.  The problem is that for ext3 and
ext4, the actual work of writing the inode happens in dirty_inode(),
not in write_inode().  Which means we need to do something like this.
I'm not entirely sure whether or not this is too ugly to live;
personally, I think my hack of handling this in update_time() might be
preferable....
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b93c529..95a42b3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -253,7 +253,7 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
  */
 static int move_expired_inodes(struct list_head *delaying_queue,
 			       struct list_head *dispatch_queue,
-			       struct wb_writeback_work *work)
+			       unsigned long *older_than_this)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -264,8 +264,8 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
-		if (work->older_than_this &&
-		    inode_dirtied_after(inode, *work->older_than_this))
+		if (older_than_this &&
+		    inode_dirtied_after(inode, *older_than_this))
 			break;
 		list_move(&inode->i_wb_list, &tmp);
 		moved++;
@@ -309,9 +309,14 @@ out:
 static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
 {
 	int moved;
+	unsigned long one_day_later = jiffies + (HZ * 86400);
+
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
+	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io,
+				    work->older_than_this);
+	moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
+				     &one_day_later);
 	trace_writeback_queue_io(wb, work, moved);
 }
 
@@ -637,6 +642,17 @@ static long writeback_sb_inodes(struct super_block *sb,
 		}
 
 		/*
+		 * If the inode is marked dirty time but is not dirty,
+		 * then at last for ext3 and ext4 we need to call
+		 * mark_inode_dirty_sync in order to get the inode
+		 * timestamp transferred to the on disk inode, since
+		 * write_inode is a no-op for those file systems.  HACK HACK HACK
+		 */
+		if ((inode->i_state & I_DIRTY_TIME) &&
+		    ((inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) == 0))
+			mark_inode_dirty_sync(inode);
+
+		/*
 		 * Don't bother with new inodes or inodes being freed, first
 		 * kind does not need periodic writeout yet, and for the latter
 		 * kind writeout is handled by the freer.
@@ -1233,9 +1249,10 @@ void inode_requeue_dirtytime(struct inode *inode)
 	spin_lock(&bdi->wb.list_lock);
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & I_DIRTY_WB) == 0) {
-		if (inode->i_state & I_DIRTY_TIME)
+		if (inode->i_state & I_DIRTY_TIME) {
+			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty_time);
-		else
+		} else
 			list_del_init(&inode->i_wb_list);
 	}
 	spin_unlock(&inode->i_lock);
Comments?
						- Ted
^ permalink raw reply related	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option
  2014-11-27 23:00     ` Theodore Ts'o
@ 2014-11-28  5:36       ` Theodore Ts'o
  2014-11-28 16:24       ` Jan Kara
  1 sibling, 0 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-11-28  5:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu, Nov 27, 2014 at 06:00:16PM -0500, Theodore Ts'o wrote:
> Well.... it's not quite enough.  The problem is that for ext3 and
> ext4, the actual work of writing the inode happens in dirty_inode(),
> not in write_inode().  Which means we need to do something like this.
> 
> I'm not entirely sure whether or not this is too ugly to live;
> personally, I think my hack of handling this in update_time() might be
> preferable....
.... and this doesn't work because it breaks ext3/ext4's transaction
handling, since the writeback thread could be racing against some
transactional update of the inode.  So I don't see a way of making the
queue_io / move_expired_inodes approach to the 24 hour time being
tractable.
So the alternatives that I can see at this point is either give up on
the 24 hour timeout, or we fall back to handling this in
update_time().
					- Ted
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option
  2014-11-27 20:19     ` Theodore Ts'o
@ 2014-11-28 12:41       ` Jan Kara
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Kara @ 2014-11-28 12:41 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu 27-11-14 15:19:54, Ted Tso wrote:
> On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> > Looking into the code & your patch I'd prefer to do something like:
> > * add support for I_DIRTY_TIME in __mark_inode_dirty() - update_time will
> >   call __mark_inode_dirty() with this flag if any of the times was updated.
> >   That way we can just remove your ->write_time() callback - filesystems
> >   can just handle this in their ->dirty_inode() methods if they wish.
> >   __mark_inode_dirty() will take care of moving inode into proper writeback
> >   list (i_dirty / i_dirty_time), dirtied_when will be set to current time.
> 
> One of the tricky bits about this is that btrfs wants to be able to
> return an error from write_time() which gets reflected up through
> update_time() to the callers of file_update_time().  Currently
> __mark_inode_dirty() and family return a void, and changing this is
> going to be a bit of a mess, since doing this correctly would require
> auditing all of the callers of mark_inode_dirty(),
> mark_inode_dirty_sync(), __mark_inode_dirty(), etc.
> 
> Doing this would be a good thing, and eventually I think it would be
> nice if we could allow the mark_inode_dirty() functions return an
> error instead of void, but I wonder if that's a cleanup that's better
> saved for later.  While we were at it, maybe we should rename
> mark_inode_dirty() to inode_dirty(), since that way we can be sure
> we've looked at all of the call site of mark_inode_dirty() and friends
> --- and we have a number of file systems, including btrfs, ext3, and
> ext4, where mark_inode_dirty() is doing a lot more than just marking
> the inode is dirty, and the only reason why it's named that is mostly
> historical.
  Except that lots of callers of update_time() / file_update_time() just
ignore the return value anyway. And frankly most of the time it's a
simplification we can get away with. I agree that ultimately we should
propagate and handle these errors but as you say above handling errors from
__mark_inode_dirty() is what we'd really need - that handles the whole
class of errors. So for now I would be OK, with just ignoring the error
when updating time stamps.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option
  2014-11-27 23:00     ` Theodore Ts'o
  2014-11-28  5:36       ` Theodore Ts'o
@ 2014-11-28 16:24       ` Jan Kara
  1 sibling, 0 replies; 36+ messages in thread
From: Jan Kara @ 2014-11-28 16:24 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu 27-11-14 18:00:16, Ted Tso wrote:
> On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> > * change queue_io() to also call
> > 	moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, time + 24hours)
> >   For this you need to tweak move_expired_inodes() to take pointer to
> >   timestamp instead of pointer to work but that's trivial. Also you want
> >   probably leave time ->older_than_this value (i.e. without +24 hours) if
> >   we are doing WB_SYNC_ALL writeback. With this you can remove
> >   flush_sb_dirty_time() completely.
> 
> Well.... it's not quite enough.  The problem is that for ext3 and
> ext4, the actual work of writing the inode happens in dirty_inode(),
> not in write_inode().  Which means we need to do something like this.
  Right, I didn't realize this problem.
> I'm not entirely sure whether or not this is too ugly to live;
> personally, I think my hack of handling this in update_time() might be
> preferable....
  Actually handling the copying of timestamps in __writeback_single_inode()
would look fine to me. You mention in your next email, calling
mark_inode_dirty_sync() from flusher may be problematic - why? How is this
any different from calling mark_inode_dirty_sync() from
flush_sb_dirty_time()?
I will note that for a while I thought copying the full inode to on-disk
buffer may be problematic because inode may be in an intermediate state of
some transactional change. But that isn't an issue - if there's any
transactional change in progress, it has a handle open and until the
change is node, thus the buffer with the partial change cannot go to the
journal (transaction cannot commit) until mark_inode_dirty_sync() copies
the final state of the inode.
Another solution may be to convey the information that copying of timestamps
is necessary to ->write_inode method. We could do that via a
flag bit in writeback_control. Each filesystem can then copy timestamps
when this bit is set. But calling mark_inode_dirty_sync() from
__writeback_single_inode() looks simpler to me.
								Honza
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index b93c529..95a42b3 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -253,7 +253,7 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
>   */
>  static int move_expired_inodes(struct list_head *delaying_queue,
>  			       struct list_head *dispatch_queue,
> -			       struct wb_writeback_work *work)
> +			       unsigned long *older_than_this)
>  {
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
> @@ -264,8 +264,8 @@ static int move_expired_inodes(struct list_head *delaying_queue,
>  
>  	while (!list_empty(delaying_queue)) {
>  		inode = wb_inode(delaying_queue->prev);
> -		if (work->older_than_this &&
> -		    inode_dirtied_after(inode, *work->older_than_this))
> +		if (older_than_this &&
> +		    inode_dirtied_after(inode, *older_than_this))
>  			break;
>  		list_move(&inode->i_wb_list, &tmp);
>  		moved++;
> @@ -309,9 +309,14 @@ out:
>  static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
>  {
>  	int moved;
> +	unsigned long one_day_later = jiffies + (HZ * 86400);
> +
>  	assert_spin_locked(&wb->list_lock);
>  	list_splice_init(&wb->b_more_io, &wb->b_io);
> -	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
> +	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io,
> +				    work->older_than_this);
> +	moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
> +				     &one_day_later);
>  	trace_writeback_queue_io(wb, work, moved);
>  }
>  
> @@ -637,6 +642,17 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		}
>  
>  		/*
> +		 * If the inode is marked dirty time but is not dirty,
> +		 * then at last for ext3 and ext4 we need to call
> +		 * mark_inode_dirty_sync in order to get the inode
> +		 * timestamp transferred to the on disk inode, since
> +		 * write_inode is a no-op for those file systems.  HACK HACK HACK
> +		 */
> +		if ((inode->i_state & I_DIRTY_TIME) &&
> +		    ((inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) == 0))
> +			mark_inode_dirty_sync(inode);
> +
> +		/*
>  		 * Don't bother with new inodes or inodes being freed, first
>  		 * kind does not need periodic writeout yet, and for the latter
>  		 * kind writeout is handled by the freer.
> @@ -1233,9 +1249,10 @@ void inode_requeue_dirtytime(struct inode *inode)
>  	spin_lock(&bdi->wb.list_lock);
>  	spin_lock(&inode->i_lock);
>  	if ((inode->i_state & I_DIRTY_WB) == 0) {
> -		if (inode->i_state & I_DIRTY_TIME)
> +		if (inode->i_state & I_DIRTY_TIME) {
> +			inode->dirtied_when = jiffies;
>  			list_move(&inode->i_wb_list, &bdi->wb.b_dirty_time);
> -		else
> +		} else
>  			list_del_init(&inode->i_wb_list);
>  	}
>  	spin_unlock(&inode->i_lock);
> 
> Comments?
> 
> 						- Ted
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-11-27 20:27           ` Theodore Ts'o
@ 2014-12-01  9:28             ` Christoph Hellwig
  2014-12-01 15:04               ` Theodore Ts'o
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2014-12-01  9:28 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Thu, Nov 27, 2014 at 03:27:31PM -0500, Theodore Ts'o wrote:
> I can do that, but part of the reason why we were doing this rather
> involved set of changes was to allow other file systems to be able to
> take advantage of lazytime.  I suppose there is value in allowing
> other file systems, such as jfs, f2fs, etc., to use it, but still,
> it's a bit of a shame to drop btrfs and xfs support for this feature.
I want to see xfs and btrfs support, but I think we're running in some
conceptual problems here.  I don't have the time right now to fully
review the XFS changes for correctness and test them, and I'd rather
keep things as-is for a while and then add properly designed and fully
teste support in rather than something possible broken.
> I'll note by the way that ext3 and ext4 doesn't really use VFS dirty
> tracking either --- see my other comments about the naming of
> "mark_inode_dirty" being a bit misleading, at least for all/most of
> the major file systems.  The problem seems to be that replacement
> schemes that we've all using are slightly different.  :-/
Indeed.  It seems all existing ->dirty_inode instances basically
just try to work around the problem that the VFS simply updates
timestamps by writing into the inode without involving the filesystem.
There are all kinds of bugs in different instances, as well as comments
mentioning an assumption that this only happens for atime although
the VFS also dos this "trick" for c/mtime, including a caller from
the page fault code that the filesystems can't even avoid by providing
non-default methods everywhere.
> I suppose should let the btrfs folks decide whether they want to add
> is_readonly() and write_time() function --- or maybe help with the
> cleanup work so that mark_inode_dirty() can reflect an error to its
> callers.   Chris, David, what do you think?
The ->is_readonly method seems like a clear winner to me, I'm all for
adding it, and thus suggested moving it first in the series.
I've read a bit more through the series and would like to suggest
the following approach for the rest:
 - convert ext3/4 to use ->update_time instead of the ->dirty_time
   callout so it gets and exact notifications (preferably the few
   remaining filesystems as well, although that shouldn't really be a
   blocker)
 - defer timestamp updates for any filesystems not defining
   ->update_time (or ->dirty_time for now), and allow filesystems
   using ->update_time to defer the update as well by calling
   mark_inode_dirty with the I_DIRTY_TIME flag so that XFS and btrfs
   don't have to opt-in without testing.
 - Convert xfs, btrfs and the remaining filesystes using ->dirty_inode
   incrementally.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-12-01  9:28             ` Christoph Hellwig
@ 2014-12-01 15:04               ` Theodore Ts'o
  2014-12-01 17:18                 ` David Sterba
  2014-12-02  9:20                 ` Christoph Hellwig
  0 siblings, 2 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-12-01 15:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Filesystem Development List, Ext4 Developers List,
	Linux btrfs Developers List, XFS Developers
On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote:
> 
> The ->is_readonly method seems like a clear winner to me, I'm all for
> adding it, and thus suggested moving it first in the series.
It's a real winner for me as well, but the reason why I dropped it is
because if btrfs() has to keep its ->update_time function, we wouldn't
actually have a user for is_readonly().  I suppose we could have
update_time() call ->is_readonly() and then ->update_time() if they
exist, but it only seemed to add an extra call and a bit of extra
overhead without really simplifying things for btrfs.
If there were other users of ->is_readonly, then it would make sense,
but it seemed better to move into a separate code refactoring series.
> I've read a bit more through the series and would like to suggest
> the following approach for the rest:
> 
>  - convert ext3/4 to use ->update_time instead of the ->dirty_time
>    callout so it gets and exact notifications (preferably the few
>    remaining filesystems as well, although that shouldn't really be a
>    blocker)
We could do that, although ext3/4's ->update_time() would be exactly
the same as the generic update_time() function, so there would be code
duplication.  If the goal is to get rid of the magic in
-->dirty_inode() being used to work around how the VFS makes changes
to fields that end up in the on-disk inode, we would need to audit a
lot of extra code paths; at the very least, in how the generic quota
code handles updates to i_size and i_blocks (for example).
And BTW, we don't actually have a dirty_time() function any more in
the current patch series.  update_time() is currently looking like
this:
static int update_time(struct inode *inode, struct timespec *time, int flags)
{
	if (inode->i_op->update_time)
		return inode->i_op->update_time(inode, time, flags);
	if (flags & S_ATIME)
		inode->i_atime = *time;
	if (flags & S_VERSION)
		inode_inc_iversion(inode);
	if (flags & S_CTIME)
		inode->i_ctime = *time;
	if (flags & S_MTIME)
		inode->i_mtime = *time;
	if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
	    !(inode->i_state & I_DIRTY))
		__mark_inode_dirty(inode, I_DIRTY_TIME);
	else
		__mark_inode_dirty(inode, I_DIRTY_SYNC);
	return 0;
}
>  - Convert xfs, btrfs and the remaining filesystes using ->dirty_inode
>    incrementally.
Right, so xfs and btrfs (which are the two file systems that have
update_time at the moment) can just drop update_time() and then check
the ->dirty_time() for (flags & I_DIRTY_TIME).  Hmm, I suspect this
might be better for xfs, yes?
	if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
	    !(inode->i_state & I_DIRTY))
		__mark_inode_dirty(inode, I_DIRTY_TIME);
	else
		__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_TIME);
XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
use the I_DIRTY_TIME flag to log the journal timestamps if it so
desires, and perhaps drop the need for it to use update_time().  (And
with XFS doing logical journalling, it may be that you might want to
include the timestamp update in the journal if you have a journal
transaction open already, so the disk is spun up or likely to be spin
up anyway, right?)
						- Ted
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-12-01 15:04               ` Theodore Ts'o
@ 2014-12-01 17:18                 ` David Sterba
  2014-12-02  9:20                 ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: David Sterba @ 2014-12-01 17:18 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote:
> On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote:
> > 
> > The ->is_readonly method seems like a clear winner to me, I'm all for
> > adding it, and thus suggested moving it first in the series.
> 
> It's a real winner for me as well, but the reason why I dropped it is
> because if btrfs() has to keep its ->update_time function, we wouldn't
> actually have a user for is_readonly().  I suppose we could have
> update_time() call ->is_readonly() and then ->update_time() if they
> exist, but it only seemed to add an extra call and a bit of extra
> overhead without really simplifying things for btrfs.
We would use is_readonly in order to remove some extra checks from btrfs
(setxattr, removexattr, possibly setsize).
> If there were other users of ->is_readonly, then it would make sense,
> but it seemed better to move into a separate code refactoring series.
Yeah it would be better addressed separately as it's not the point of
lazytime patchset and only turned out to be a good idea during the
iterations.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-12-01 15:04               ` Theodore Ts'o
  2014-12-01 17:18                 ` David Sterba
@ 2014-12-02  9:20                 ` Christoph Hellwig
  2014-12-02 15:09                   ` Theodore Ts'o
  1 sibling, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2014-12-02  9:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, Linux Filesystem Development List,
	Ext4 Developers List, Linux btrfs Developers List, XFS Developers,
	Jan Kara
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote:
> >  - convert ext3/4 to use ->update_time instead of the ->dirty_time
> >    callout so it gets and exact notifications (preferably the few
> >    remaining filesystems as well, although that shouldn't really be a
> >    blocker)
> 
> We could do that, although ext3/4's ->update_time() would be exactly
> the same as the generic update_time() function, so there would be code
> duplication.  If the goal is to get rid of the magic in
> -->dirty_inode() being used to work around how the VFS makes changes
> to fields that end up in the on-disk inode, we would need to audit a
> lot of extra code paths; at the very least, in how the generic quota
> code handles updates to i_size and i_blocks (for example).
> 
> And BTW, we don't actually have a dirty_time() function any more in
> the current patch series.  update_time() is currently looking like
> this:
Sorry, I actually meant ->dirty_inode, which is where ext4 currently
hooks in for time updates.  ->update_time was introduced to
 a) more specificly catch the kind of update
 b) allow the filesystem to take locks or a start a transaction
    before the inode fields are updated to provide proper atomicy.
It seems like the quota code has the same problem, but given that
neither XFS nor btrfs use it it seems like no one cared enough to sort
it out properly..
> static int update_time(struct inode *inode, struct timespec *time, int flags)
> {
> 	if (inode->i_op->update_time)
> 		return inode->i_op->update_time(inode, time, flags);
> 
> 	if (flags & S_ATIME)
> 		inode->i_atime = *time;
> 	if (flags & S_VERSION)
> 		inode_inc_iversion(inode);
> 	if (flags & S_CTIME)
> 		inode->i_ctime = *time;
> 	if (flags & S_MTIME)
> 		inode->i_mtime = *time;
> 
> 	if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
> 	    !(inode->i_state & I_DIRTY))
> 		__mark_inode_dirty(inode, I_DIRTY_TIME);
> 	else
> 		__mark_inode_dirty(inode, I_DIRTY_SYNC);
> 	return 0;
Why do you need the additional I_DIRTY flag?  A "lesser"
__mark_inode_dirty should never override a stronger one.
Otherwise this looks fine to me, except that I would split the default
implementation into a new generic_update_time helper.
> XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
> use the I_DIRTY_TIME flag to log the journal timestamps if it so
> desires, and perhaps drop the need for it to use update_time().
We will probably always need a ->update_time to proide proper locking
around the timestamp updates.
> (And
> with XFS doing logical journalling, it may be that you might want to
> include the timestamp update in the journal if you have a journal
> transaction open already, so the disk is spun up or likely to be spin
> up anyway, right?)
XFS transactions are explicitly opened and closed, so during the atime
updates we'll never have one open.
What we could try is to have CIL items that are on "indefinit" hold
before they are batched into a checkpoint.  We'd still commit them to
an in-memory transaction in ->upate_time for that.  All this requires
a lot of through and will take some time, though.
In the current from the generic lazytime might even be a loss for XFS as
we're already really good at batching updates from multiple inodes in
the same cluster for the in-place writeback, so I really don't want
to just enable it without those optimizations without a lot of testing.
^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()
  2014-12-02  9:20                 ` Christoph Hellwig
@ 2014-12-02 15:09                   ` Theodore Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Ts'o @ 2014-12-02 15:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux Filesystem Development List, Ext4 Developers List, Jan Kara,
	Linux btrfs Developers List, XFS Developers
On Tue, Dec 02, 2014 at 01:20:33AM -0800, Christoph Hellwig wrote:
> Why do you need the additional I_DIRTY flag?  A "lesser"
> __mark_inode_dirty should never override a stronger one.
Agreed, will fix.
> Otherwise this looks fine to me, except that I would split the default
> implementation into a new generic_update_time helper.
Sure, I can do that.
> > XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
> > use the I_DIRTY_TIME flag to log the journal timestamps if it so
> > desires, and perhaps drop the need for it to use update_time().
> 
> We will probably always need a ->update_time to proide proper locking
> around the timestamp updates.
Couldn't you let the VFS set the inode timesstamps and then have xfs's
->dirty_time(inode, I_DIRTY_TIME) copy the timestamps to the on-disk
inode structure under the appropriate lock, or am I missing something?
> In the current from the generic lazytime might even be a loss for XFS as
> we're already really good at batching updates from multiple inodes in
> the same cluster for the in-place writeback, so I really don't want
> to just enable it without those optimizations without a lot of testing.
Fair enough; it's not surprising that this might be much more
effective as an optimization for ext4, for no other reason that
timestamp updates are so much heavyweight for us.  I suspect that it
should be a win for btrfs, though, and it should definitely be a win
for those file systems that don't use journalling at all.
    	       	       	    	      		     - Ted
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 36+ messages in thread
end of thread, other threads:[~2014-12-02 15:09 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-26 10:23 [PATCH-v4 0/7] add support for a lazytime mount option Theodore Ts'o
2014-11-26 10:23 ` [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time() Theodore Ts'o
2014-11-26 19:23   ` Christoph Hellwig
2014-11-27 12:34     ` Jan Kara
2014-11-27 15:25       ` Christoph Hellwig
2014-11-27 14:41     ` Theodore Ts'o
2014-11-27 15:28       ` Christoph Hellwig
2014-11-27 15:33       ` Theodore Ts'o
2014-11-27 16:49         ` Christoph Hellwig
2014-11-27 20:27           ` Theodore Ts'o
2014-12-01  9:28             ` Christoph Hellwig
2014-12-01 15:04               ` Theodore Ts'o
2014-12-01 17:18                 ` David Sterba
2014-12-02  9:20                 ` Christoph Hellwig
2014-12-02 15:09                   ` Theodore Ts'o
2014-11-26 10:23 ` [PATCH-v4 2/7] vfs: add support for a lazytime mount option Theodore Ts'o
2014-11-27 13:14   ` Jan Kara
2014-11-27 20:19     ` Theodore Ts'o
2014-11-28 12:41       ` Jan Kara
2014-11-27 23:00     ` Theodore Ts'o
2014-11-28  5:36       ` Theodore Ts'o
2014-11-28 16:24       ` Jan Kara
2014-11-26 10:23 ` [PATCH-v4 3/7] vfs: don't let the dirty time inodes get more than a day stale Theodore Ts'o
2014-11-26 10:23 ` [PATCH-v4 4/7] vfs: add lazytime tracepoints for better debugging Theodore Ts'o
2014-11-26 10:23 ` [PATCH-v4 5/7] vfs: add find_active_inode_nowait() function Theodore Ts'o
2014-11-26 10:23 ` [PATCH-v4 6/7] ext4: add support for a lazytime mount option Theodore Ts'o
2014-11-26 19:24   ` Christoph Hellwig
2014-11-26 22:48   ` Dave Chinner
2014-11-26 23:10     ` Andreas Dilger
2014-11-26 23:35       ` Dave Chinner
2014-11-27 13:27         ` Jan Kara
2014-11-27 13:32           ` Jan Kara
2014-11-27 15:25             ` Theodore Ts'o
2014-11-27 15:41               ` Jan Kara
2014-11-27 20:13                 ` Theodore Ts'o
2014-11-26 10:23 ` [PATCH-v4 7/7] btrfs: add an is_readonly() so btrfs can use common code for update_time() Theodore Ts'o
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).