public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: timestamp updates cause excessive fdatasync log traffic
@ 2015-08-28  1:23 Dave Chinner
  2015-08-28  4:32 ` [PATCH V2] " Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-08-28  1:23 UTC (permalink / raw)
  To: xfs; +Cc: sage

From: Dave Chinner <dchinner@redhat.com>

Sage Weil reported that a ceph test workload was writing to the
log on every fdatasync during an overwrite workload. Event tracing
showed that the only metadata modification being made was the
timestamp updates during the write(2) syscall, but fdatasync(2)
is supposed to ignore them. The key observation was that the
transactions in the log all looked like this:

INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32

And contained a flags field of 0x45 or 0x85, and had data and
attribute forks following the inode core. This means that the
timestamp updates were triggering dirty relogging of previously
logged parts of the inode that hadn't yet been flushed back to
disk.

This is caused by xfs_trans_log_inode(), where it folds the dirty
fields that have previously been logged back into the current
transaction dirty fields from the inode item ili_last_fields. The
issue is that ili_last_fields only contains a non-zero value when
the inode is in the process of being flushed. The typical state
progression is this, using a core field update as the modification
occuring:

state			ili_fields		ili_last_fields
clean			    0			      0
modified, logged	XFS_ILOG_CORE		      0
flushed to buffer	    0			XFS_ILOG_CORE
<buffer submitted>
buffer IO done (clean)	    0			      0

However, if we run a new transaction after it has been flushed to
buffer but before the buffer IO is done, we don't know if the
modifications in the inode buffer (i.e. what is in ili_last_fields)
will reach the disk before the new transaction reaches the log.
Hence to keep transactional ordering correct in recovery, we need to
ensure the new transaction re-logs the modifications that are being
flushed to disk.

By relogging, we ensure that if the transaction makes it to disk and
the buffer doesn't, then recovery replays all the changes upto that
point correctly. If the transaction does not make it disk, but the
buffer does, then recovery will see that the inode on disk is more
recent than in the log and won't overwrite it with changes that it
already contains.

The upshot of this is that while the inode buffer sits in memory
with the inode in the "flushed" state, fdatasync is going to see
the relogged state in the ili_fields. i.e:

What is happening here is this:

state			ili_fields		ili_last_fields
clean			    0			      0
modified, logged	   CORE			      0
flushed to buffer	    0			     CORE
<write>
timestamp update	 TIMESTAMP		     CORE
<xfs_trans_inode_log pulls in ili_last_fields>
		       CORE|TIMESTAMP		     CORE
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode

<buffer submittted>
buffer IO done (clean) CORE|TIMESTAMP		      0
.....
<write>
timestamp update       CORE|TIMESTAMP		      0
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode
.....
<write>
timestamp update       CORE|TIMESTAMP		      0
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode

So, essentially, once a race condition on the buffer flush has
occurred, the condition is not cleared until the inode is flushed to
the buffer again and it is written without racing against the inode
being dirtied again.

The behaviour we really want here is to capture the timestamp
update transactionally, but not trigger fdatasync because of the
relogged fields that haven't been modified since the inode was
flushed to the buffer. We still need to relog them, but we don't
need to force the log in the fdatasync case.

To do this, don't fold the ili_last_fields value into ili_fields
when logging the inode. Instead, fold it into the fields that get
logged during formatting of the inode item. This means that we will
stop logging those fields the moment we know that there is a more
recent version of the inode on disk than we have in the log and so
we don't need to log those fields anymore as the next transaction on
disk doesn't need to replay them.

Doing this separates changes that are in memory but are not being
flushed from those that have been flushed. Hence we can now look at
ili_fields and hence see what fields have been modified since the
last flush, and hence whether fdatasync needs to force the log or
not. If non-timestamp changes have been made since the inode was
flushed to the backing buffer, then fdatasync will do exactly the
right thing (i.e. force the log).

Reported-by: Sage Weil <sage@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 161 ++++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_inode_item.h |   1 +
 2 files changed, 115 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 62bd80f..be04eb2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -29,6 +29,63 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 
+/*
+ * Notes on ili_fields, ili_format_fields and ili_last_fields.
+ *
+ * ili_fields contains the flags that reflect the current changes that are in
+ * memory and have been logged, but have not been flushed to the backing buffer
+ * for writeback yet. When a transaction is done, the fields that are modified
+ * are added to ili_fields and all those fields are logged. This results in
+ * repeated transactions correctly relogging all the other changes in memory and
+ * allows the inode to be re-ordered (moved forward) in the AIL safely.
+ * ili_fields is copied to ili_last_fields when the inode is flushed to the
+ * backing buffer and is then cleared, indicating that the inode in memory is
+ * now clean from a transactional change point of view and does not need to
+ * relog those fields anymore on tranaction commit.
+ *
+ * ili_last_fields, therefore, is only set while there is an unresolved inode
+ * flush being done (i.e. flush lock is held). We need to keep this state
+ * available to avoid a transaction recovery vs inode buffer IO completion race
+ * if we crash. If the inode is logged again makes it to disk before the inode
+ * buffer IO complete, log recovery will replay the latest inode transaction and
+ * can lose the changes that were in the inode buffer. Hence we still need to
+ * log the changes that have been flushed so that transactions issued during the
+ * inode buffer write still contain the modifications being flushed. Once the
+ * inode buffer IO completes, we no longer need to log those changes as the
+ * inode on disk the same as the inode in memory apart from the changes made
+ * since the inode was flushed (i.e. those recorded in ili_fields).
+ *
+ * ili_format_fields is only used during transaction commit. It is the
+ * aggregation of ili_fields and ili_last_fields, as sampled during the sizing
+ * of the inode item to be logged. We need this field definition to be constant
+ * between sizing and formatting, but inode buffer IO can complete
+ * asynchronously with transaction commit and so we must only read it once so
+ * that the different stages of the item formatting work correctly. We don't
+ * care about the async completion clearing ili_last_fields after we've sampled
+ * it - if we log too much then so be it - we won't log it next time. Once we've
+ * formatted the inode item, we need to propagate the ili_format_fields value to
+ * the on-disk inode log item format field, and then use it to clear all the
+ * fields that we marked for logging but were not dirty from ili_fields.
+ *
+ * This separation of changes allows us to accurately determine what fields in
+ * the inode have changed since it was last flushed to disk. This is important
+ * for fdatasync() performance as there are certain fields that, if modified,
+ * should not trigger log forces because they contain metadata that fdatasync()
+ * does not need to guarantee is safe on disk. This is, currently, just
+ * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata.
+ *
+ * If we simply fold ili_last_fields back into ili_fields when we log the inode
+ * in a transaction, then if we have a flush/modification race it results in
+ * every timestamp update causing fdatasync to flush the log because ili_fields
+ * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will
+ * persist until the inode is next flushed to it's backing buffer and that
+ * buffer is written and completed without another flush/modification race
+ * occuring.  Hence we keep the change state separate and only combine them when
+ * formatting the inode into the log.
+ *
+ * For more information, there's another big comment in xfs_iflush_int() about
+ * this flush/modification race condition.
+ */
 
 kmem_zone_t	*xfs_ili_zone;		/* inode log item zone */
 
@@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 STATIC void
 xfs_inode_item_data_fork_size(
 	struct xfs_inode_log_item *iip,
+	unsigned int		fields,
 	int			*nvecs,
 	int			*nbytes)
 {
@@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size(
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
+		if ((fields & XFS_ILOG_DEXT) &&
 		    ip->i_d.di_nextents > 0 &&
 		    ip->i_df.if_bytes > 0) {
 			/* worst case, doesn't subtract delalloc extents */
@@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size(
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
+		if ((fields & XFS_ILOG_DBROOT) &&
 		    ip->i_df.if_broot_bytes > 0) {
 			*nbytes += ip->i_df.if_broot_bytes;
 			*nvecs += 1;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
+		if ((fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
 			*nbytes += roundup(ip->i_df.if_bytes, 4);
 			*nvecs += 1;
@@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size(
 STATIC void
 xfs_inode_item_attr_fork_size(
 	struct xfs_inode_log_item *iip,
+	unsigned int		fields,
 	int			*nvecs,
 	int			*nbytes)
 {
@@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size(
 
 	switch (ip->i_d.di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
-		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
+		if ((fields & XFS_ILOG_AEXT) &&
 		    ip->i_d.di_anextents > 0 &&
 		    ip->i_afp->if_bytes > 0) {
 			/* worst case, doesn't subtract unused space */
@@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size(
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
+		if ((fields & XFS_ILOG_ABROOT) &&
 		    ip->i_afp->if_broot_bytes > 0) {
 			*nbytes += ip->i_afp->if_broot_bytes;
 			*nvecs += 1;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
+		if ((fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
 			*nbytes += roundup(ip->i_afp->if_bytes, 4);
 			*nvecs += 1;
@@ -133,13 +192,17 @@ xfs_inode_item_size(
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
 
+	iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields;
+
 	*nvecs += 2;
 	*nbytes += sizeof(struct xfs_inode_log_format) +
 		   xfs_icdinode_size(ip->i_d.di_version);
 
-	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
+	xfs_inode_item_data_fork_size(iip, iip->ili_format_fields,
+				      nvecs, nbytes);
 	if (XFS_IFORK_Q(ip))
-		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
+		xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields,
+					      nvecs, nbytes);
 }
 
 STATIC void
@@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork(
 {
 	struct xfs_inode	*ip = iip->ili_inode;
 	size_t			data_bytes;
+	unsigned int		fields = iip->ili_format_fields;
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
 
-		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
+		if ((fields & XFS_ILOG_DEXT) &&
 		    ip->i_d.di_nextents > 0 &&
 		    ip->i_df.if_bytes > 0) {
 			struct xfs_bmbt_rec *p;
@@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_DEXT;
+			fields &= ~XFS_ILOG_DEXT;
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
 
-		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
+		if ((fields & XFS_ILOG_DBROOT) &&
 		    ip->i_df.if_broot_bytes > 0) {
 			ASSERT(ip->i_df.if_broot != NULL);
 			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT,
@@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = ip->i_df.if_broot_bytes;
 			ilf->ilf_size++;
 		} else {
-			ASSERT(!(iip->ili_fields &
-				 XFS_ILOG_DBROOT));
-			iip->ili_fields &= ~XFS_ILOG_DBROOT;
+			ASSERT(!(fields & XFS_ILOG_DBROOT));
+			fields &= ~XFS_ILOG_DBROOT;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		iip->ili_fields &=
-			~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
-		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
+		fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
+
+		if ((fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
 			/*
 			 * Round i_bytes up to a word boundary.
@@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = (unsigned)data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_DDATA;
+			fields &= ~XFS_ILOG_DDATA;
 		}
 		break;
 	case XFS_DINODE_FMT_DEV:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEXT | XFS_ILOG_UUID);
-		if (iip->ili_fields & XFS_ILOG_DEV)
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEXT | XFS_ILOG_UUID);
+		if (fields & XFS_ILOG_DEV)
 			ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev;
 		break;
 	case XFS_DINODE_FMT_UUID:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEXT | XFS_ILOG_DEV);
-		if (iip->ili_fields & XFS_ILOG_UUID)
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEXT | XFS_ILOG_DEV);
+		if (fields & XFS_ILOG_UUID)
 			ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid;
 		break;
 	default:
 		ASSERT(0);
 		break;
 	}
+
+	/* reflect the logged fields back in ili_format_fields */
+	iip->ili_format_fields = fields;
 }
 
 STATIC void
@@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork(
 {
 	struct xfs_inode	*ip = iip->ili_inode;
 	size_t			data_bytes;
+	unsigned int		fields = iip->ili_format_fields;
 
 	switch (ip->i_d.di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
-		iip->ili_fields &=
-			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
+		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
 
-		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
+		if ((fields & XFS_ILOG_AEXT) &&
 		    ip->i_d.di_anextents > 0 &&
 		    ip->i_afp->if_bytes > 0) {
 			struct xfs_bmbt_rec *p;
@@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_AEXT;
+			fields &= ~XFS_ILOG_AEXT;
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		iip->ili_fields &=
-			~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
+		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
 
-		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
+		if ((fields & XFS_ILOG_ABROOT) &&
 		    ip->i_afp->if_broot_bytes > 0) {
 			ASSERT(ip->i_afp->if_broot != NULL);
 
@@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = ip->i_afp->if_broot_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_ABROOT;
+			fields &= ~XFS_ILOG_ABROOT;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		iip->ili_fields &=
-			~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
+		fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
 
-		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
+		if ((fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
 			/*
 			 * Round i_bytes up to a word boundary.
@@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = (unsigned)data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_ADATA;
+			fields &= ~XFS_ILOG_ADATA;
 		}
 		break;
 	default:
 		ASSERT(0);
 		break;
 	}
+
+	/* reflect the logged fields back in ili_format_fields */
+	iip->ili_format_fields = fields;
 }
 
 /*
@@ -359,12 +422,16 @@ xfs_inode_item_format(
 	if (XFS_IFORK_Q(ip)) {
 		xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp);
 	} else {
-		iip->ili_fields &=
+		iip->ili_format_fields &=
 			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT);
 	}
 
 	/* update the format with the exact fields we actually logged */
-	ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP);
+	ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP);
+
+	/* clear any fields we didn't log from ili_fields */
+	iip->ili_fields &= iip->ili_format_fields;
+	iip->ili_format_fields = 0;
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 488d812..43a9e1c 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
 	unsigned short		ili_logged;	   /* flushed logged data */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
+	unsigned int		ili_format_fields; /* combined log fields */
 } xfs_inode_log_item_t;
 
 static inline int xfs_inode_clean(xfs_inode_t *ip)
-- 
2.5.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-28  1:23 [PATCH] xfs: timestamp updates cause excessive fdatasync log traffic Dave Chinner
@ 2015-08-28  4:32 ` Dave Chinner
  2015-08-28 15:11   ` Sage Weil
  2015-09-08 14:45   ` Brian Foster
  0 siblings, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2015-08-28  4:32 UTC (permalink / raw)
  To: xfs; +Cc: sage


From: Dave Chinner <dchinner@redhat.com>

Sage Weil reported that a ceph test workload was writing to the
log on every fdatasync during an overwrite workload. Event tracing
showed that the only metadata modification being made was the
timestamp updates during the write(2) syscall, but fdatasync(2)
is supposed to ignore them. The key observation was that the
transactions in the log all looked like this:

INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32

And contained a flags field of 0x45 or 0x85, and had data and
attribute forks following the inode core. This means that the
timestamp updates were triggering dirty relogging of previously
logged parts of the inode that hadn't yet been flushed back to
disk.

This is caused by xfs_trans_log_inode(), where it folds the dirty
fields that have previously been logged back into the current
transaction dirty fields from the inode item ili_last_fields. The
issue is that ili_last_fields only contains a non-zero value when
the inode is in the process of being flushed. The typical state
progression is this, using a core field update as the modification
occuring:

state			ili_fields		ili_last_fields
clean			    0			      0
modified, logged	XFS_ILOG_CORE		      0
flushed to buffer	    0			XFS_ILOG_CORE
<buffer submitted>
buffer IO done (clean)	    0			      0

However, if we run a new transaction after it has been flushed to
buffer but before the buffer IO is done, we don't know if the
modifications in the inode buffer (i.e. what is in ili_last_fields)
will reach the disk before the new transaction reaches the log.
Hence to keep transactional ordering correct in recovery, we need to
ensure the new transaction re-logs the modifications that are being
flushed to disk.

By relogging, we ensure that if the transaction makes it to disk and
the buffer doesn't, then recovery replays all the changes upto that
point correctly. If the transaction does not make it disk, but the
buffer does, then recovery will see that the inode on disk is more
recent than in the log and won't overwrite it with changes that it
already contains.

The upshot of this is that while the inode buffer sits in memory
with the inode in the "flushed" state, fdatasync is going to see
the relogged state in the ili_fields. i.e:

What is happening here is this:

state			ili_fields		ili_last_fields
clean			    0			      0
modified, logged	   CORE			      0
flushed to buffer	    0			     CORE
<write>
timestamp update	 TIMESTAMP		     CORE
<xfs_trans_inode_log pulls in ili_last_fields>
		       CORE|TIMESTAMP		     CORE
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode

<buffer submittted>
buffer IO done (clean) CORE|TIMESTAMP		      0
.....
<write>
timestamp update       CORE|TIMESTAMP		      0
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode
.....
<write>
timestamp update       CORE|TIMESTAMP		      0
<fdatasync>
  sees field other than TIMESTAMP in ili_fields,
  triggers xfs_log_force_lsn to flush inode

So, essentially, once a race condition on the buffer flush has
occurred, the condition is not cleared until the inode is flushed to
the buffer again and it is written without racing against the inode
being dirtied again.

The behaviour we really want here is to capture the timestamp
update transactionally, but not trigger fdatasync because of the
relogged fields that haven't been modified since the inode was
flushed to the buffer. We still need to relog them, but we don't
need to force the log in the fdatasync case.

To do this, don't fold the ili_last_fields value into ili_fields
when logging the inode. Instead, fold it into the fields that get
logged during formatting of the inode item. This means that we will
stop logging those fields the moment we know that there is a more
recent version of the inode on disk than we have in the log and so
we don't need to log those fields anymore as the next transaction on
disk doesn't need to replay them.

Doing this separates changes that are in memory but are not being
flushed from those that have been flushed. Hence we can now look at
ili_fields and hence see what fields have been modified since the
last flush, and hence whether fdatasync needs to force the log or
not. If non-timestamp changes have been made since the inode was
flushed to the backing buffer, then fdatasync will do exactly the
right thing (i.e. force the log).

Reported-by: Sage Weil <sage@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- include the hunk from fs/xfs/xfs_trans_inode.c that I missed
  when committing the patch locally the first time.

 fs/xfs/xfs_inode_item.c  | 161 +++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_inode_item.h  |   1 +
 fs/xfs/xfs_trans_inode.c |   8 ---
 3 files changed, 115 insertions(+), 55 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 62bd80f..be04eb2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -29,6 +29,63 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 
+/*
+ * Notes on ili_fields, ili_format_fields and ili_last_fields.
+ *
+ * ili_fields contains the flags that reflect the current changes that are in
+ * memory and have been logged, but have not been flushed to the backing buffer
+ * for writeback yet. When a transaction is done, the fields that are modified
+ * are added to ili_fields and all those fields are logged. This results in
+ * repeated transactions correctly relogging all the other changes in memory and
+ * allows the inode to be re-ordered (moved forward) in the AIL safely.
+ * ili_fields is copied to ili_last_fields when the inode is flushed to the
+ * backing buffer and is then cleared, indicating that the inode in memory is
+ * now clean from a transactional change point of view and does not need to
+ * relog those fields anymore on tranaction commit.
+ *
+ * ili_last_fields, therefore, is only set while there is an unresolved inode
+ * flush being done (i.e. flush lock is held). We need to keep this state
+ * available to avoid a transaction recovery vs inode buffer IO completion race
+ * if we crash. If the inode is logged again makes it to disk before the inode
+ * buffer IO complete, log recovery will replay the latest inode transaction and
+ * can lose the changes that were in the inode buffer. Hence we still need to
+ * log the changes that have been flushed so that transactions issued during the
+ * inode buffer write still contain the modifications being flushed. Once the
+ * inode buffer IO completes, we no longer need to log those changes as the
+ * inode on disk the same as the inode in memory apart from the changes made
+ * since the inode was flushed (i.e. those recorded in ili_fields).
+ *
+ * ili_format_fields is only used during transaction commit. It is the
+ * aggregation of ili_fields and ili_last_fields, as sampled during the sizing
+ * of the inode item to be logged. We need this field definition to be constant
+ * between sizing and formatting, but inode buffer IO can complete
+ * asynchronously with transaction commit and so we must only read it once so
+ * that the different stages of the item formatting work correctly. We don't
+ * care about the async completion clearing ili_last_fields after we've sampled
+ * it - if we log too much then so be it - we won't log it next time. Once we've
+ * formatted the inode item, we need to propagate the ili_format_fields value to
+ * the on-disk inode log item format field, and then use it to clear all the
+ * fields that we marked for logging but were not dirty from ili_fields.
+ *
+ * This separation of changes allows us to accurately determine what fields in
+ * the inode have changed since it was last flushed to disk. This is important
+ * for fdatasync() performance as there are certain fields that, if modified,
+ * should not trigger log forces because they contain metadata that fdatasync()
+ * does not need to guarantee is safe on disk. This is, currently, just
+ * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata.
+ *
+ * If we simply fold ili_last_fields back into ili_fields when we log the inode
+ * in a transaction, then if we have a flush/modification race it results in
+ * every timestamp update causing fdatasync to flush the log because ili_fields
+ * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will
+ * persist until the inode is next flushed to it's backing buffer and that
+ * buffer is written and completed without another flush/modification race
+ * occuring.  Hence we keep the change state separate and only combine them when
+ * formatting the inode into the log.
+ *
+ * For more information, there's another big comment in xfs_iflush_int() about
+ * this flush/modification race condition.
+ */
 
 kmem_zone_t	*xfs_ili_zone;		/* inode log item zone */
 
@@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 STATIC void
 xfs_inode_item_data_fork_size(
 	struct xfs_inode_log_item *iip,
+	unsigned int		fields,
 	int			*nvecs,
 	int			*nbytes)
 {
@@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size(
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
+		if ((fields & XFS_ILOG_DEXT) &&
 		    ip->i_d.di_nextents > 0 &&
 		    ip->i_df.if_bytes > 0) {
 			/* worst case, doesn't subtract delalloc extents */
@@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size(
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
+		if ((fields & XFS_ILOG_DBROOT) &&
 		    ip->i_df.if_broot_bytes > 0) {
 			*nbytes += ip->i_df.if_broot_bytes;
 			*nvecs += 1;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
+		if ((fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
 			*nbytes += roundup(ip->i_df.if_bytes, 4);
 			*nvecs += 1;
@@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size(
 STATIC void
 xfs_inode_item_attr_fork_size(
 	struct xfs_inode_log_item *iip,
+	unsigned int		fields,
 	int			*nvecs,
 	int			*nbytes)
 {
@@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size(
 
 	switch (ip->i_d.di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
-		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
+		if ((fields & XFS_ILOG_AEXT) &&
 		    ip->i_d.di_anextents > 0 &&
 		    ip->i_afp->if_bytes > 0) {
 			/* worst case, doesn't subtract unused space */
@@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size(
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
+		if ((fields & XFS_ILOG_ABROOT) &&
 		    ip->i_afp->if_broot_bytes > 0) {
 			*nbytes += ip->i_afp->if_broot_bytes;
 			*nvecs += 1;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
+		if ((fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
 			*nbytes += roundup(ip->i_afp->if_bytes, 4);
 			*nvecs += 1;
@@ -133,13 +192,17 @@ xfs_inode_item_size(
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
 
+	iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields;
+
 	*nvecs += 2;
 	*nbytes += sizeof(struct xfs_inode_log_format) +
 		   xfs_icdinode_size(ip->i_d.di_version);
 
-	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
+	xfs_inode_item_data_fork_size(iip, iip->ili_format_fields,
+				      nvecs, nbytes);
 	if (XFS_IFORK_Q(ip))
-		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
+		xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields,
+					      nvecs, nbytes);
 }
 
 STATIC void
@@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork(
 {
 	struct xfs_inode	*ip = iip->ili_inode;
 	size_t			data_bytes;
+	unsigned int		fields = iip->ili_format_fields;
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
 
-		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
+		if ((fields & XFS_ILOG_DEXT) &&
 		    ip->i_d.di_nextents > 0 &&
 		    ip->i_df.if_bytes > 0) {
 			struct xfs_bmbt_rec *p;
@@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_DEXT;
+			fields &= ~XFS_ILOG_DEXT;
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
 
-		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
+		if ((fields & XFS_ILOG_DBROOT) &&
 		    ip->i_df.if_broot_bytes > 0) {
 			ASSERT(ip->i_df.if_broot != NULL);
 			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT,
@@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = ip->i_df.if_broot_bytes;
 			ilf->ilf_size++;
 		} else {
-			ASSERT(!(iip->ili_fields &
-				 XFS_ILOG_DBROOT));
-			iip->ili_fields &= ~XFS_ILOG_DBROOT;
+			ASSERT(!(fields & XFS_ILOG_DBROOT));
+			fields &= ~XFS_ILOG_DBROOT;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		iip->ili_fields &=
-			~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEV | XFS_ILOG_UUID);
-		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
+		fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEV | XFS_ILOG_UUID);
+
+		if ((fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
 			/*
 			 * Round i_bytes up to a word boundary.
@@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork(
 			ilf->ilf_dsize = (unsigned)data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_DDATA;
+			fields &= ~XFS_ILOG_DDATA;
 		}
 		break;
 	case XFS_DINODE_FMT_DEV:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEXT | XFS_ILOG_UUID);
-		if (iip->ili_fields & XFS_ILOG_DEV)
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEXT | XFS_ILOG_UUID);
+		if (fields & XFS_ILOG_DEV)
 			ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev;
 		break;
 	case XFS_DINODE_FMT_UUID:
-		iip->ili_fields &=
-			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
-			  XFS_ILOG_DEXT | XFS_ILOG_DEV);
-		if (iip->ili_fields & XFS_ILOG_UUID)
+		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
+			    XFS_ILOG_DEXT | XFS_ILOG_DEV);
+		if (fields & XFS_ILOG_UUID)
 			ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid;
 		break;
 	default:
 		ASSERT(0);
 		break;
 	}
+
+	/* reflect the logged fields back in ili_format_fields */
+	iip->ili_format_fields = fields;
 }
 
 STATIC void
@@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork(
 {
 	struct xfs_inode	*ip = iip->ili_inode;
 	size_t			data_bytes;
+	unsigned int		fields = iip->ili_format_fields;
 
 	switch (ip->i_d.di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
-		iip->ili_fields &=
-			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
+		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
 
-		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
+		if ((fields & XFS_ILOG_AEXT) &&
 		    ip->i_d.di_anextents > 0 &&
 		    ip->i_afp->if_bytes > 0) {
 			struct xfs_bmbt_rec *p;
@@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_AEXT;
+			fields &= ~XFS_ILOG_AEXT;
 		}
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		iip->ili_fields &=
-			~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
+		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
 
-		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
+		if ((fields & XFS_ILOG_ABROOT) &&
 		    ip->i_afp->if_broot_bytes > 0) {
 			ASSERT(ip->i_afp->if_broot != NULL);
 
@@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = ip->i_afp->if_broot_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_ABROOT;
+			fields &= ~XFS_ILOG_ABROOT;
 		}
 		break;
 	case XFS_DINODE_FMT_LOCAL:
-		iip->ili_fields &=
-			~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
+		fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
 
-		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
+		if ((fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
 			/*
 			 * Round i_bytes up to a word boundary.
@@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork(
 			ilf->ilf_asize = (unsigned)data_bytes;
 			ilf->ilf_size++;
 		} else {
-			iip->ili_fields &= ~XFS_ILOG_ADATA;
+			fields &= ~XFS_ILOG_ADATA;
 		}
 		break;
 	default:
 		ASSERT(0);
 		break;
 	}
+
+	/* reflect the logged fields back in ili_format_fields */
+	iip->ili_format_fields = fields;
 }
 
 /*
@@ -359,12 +422,16 @@ xfs_inode_item_format(
 	if (XFS_IFORK_Q(ip)) {
 		xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp);
 	} else {
-		iip->ili_fields &=
+		iip->ili_format_fields &=
 			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT);
 	}
 
 	/* update the format with the exact fields we actually logged */
-	ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP);
+	ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP);
+
+	/* clear any fields we didn't log from ili_fields */
+	iip->ili_fields &= iip->ili_format_fields;
+	iip->ili_format_fields = 0;
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 488d812..43a9e1c 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
 	unsigned short		ili_logged;	   /* flushed logged data */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
+	unsigned int		ili_format_fields; /* combined log fields */
 } xfs_inode_log_item_t;
 
 static inline int xfs_inode_clean(xfs_inode_t *ip)
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 17280cd..77f1e8a 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -123,13 +123,5 @@ xfs_trans_log_inode(
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY;
 
-	/*
-	 * Always OR in the bits from the ili_last_fields field.
-	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
-	 * routines in the eventual clearing of the ili_fields bits.
-	 * See the big comment in xfs_iflush() for an explanation of
-	 * this coordination mechanism.
-	 */
-	flags |= ip->i_itemp->ili_last_fields;
 	ip->i_itemp->ili_fields |= flags;
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-28  4:32 ` [PATCH V2] " Dave Chinner
@ 2015-08-28 15:11   ` Sage Weil
  2015-08-28 22:04     ` Dave Chinner
  2015-09-08 14:45   ` Brian Foster
  1 sibling, 1 reply; 9+ messages in thread
From: Sage Weil @ 2015-08-28 15:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave,

On Fri, 28 Aug 2015, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Sage Weil reported that a ceph test workload was writing to the
> log on every fdatasync during an overwrite workload. Event tracing
> showed that the only metadata modification being made was the
> timestamp updates during the write(2) syscall, but fdatasync(2)
> is supposed to ignore them. The key observation was that the
> transactions in the log all looked like this:
> 
> INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
> 
> And contained a flags field of 0x45 or 0x85, and had data and
> attribute forks following the inode core. This means that the
> timestamp updates were triggering dirty relogging of previously
> logged parts of the inode that hadn't yet been flushed back to
> disk.
> 
> This is caused by xfs_trans_log_inode(), where it folds the dirty
> fields that have previously been logged back into the current
> transaction dirty fields from the inode item ili_last_fields. The
> issue is that ili_last_fields only contains a non-zero value when
> the inode is in the process of being flushed. The typical state
> progression is this, using a core field update as the modification
> occuring:
> 
> state			ili_fields		ili_last_fields
> clean			    0			      0
> modified, logged	XFS_ILOG_CORE		      0
> flushed to buffer	    0			XFS_ILOG_CORE
> <buffer submitted>
> buffer IO done (clean)	    0			      0
> 
> However, if we run a new transaction after it has been flushed to
> buffer but before the buffer IO is done, we don't know if the
> modifications in the inode buffer (i.e. what is in ili_last_fields)
> will reach the disk before the new transaction reaches the log.
> Hence to keep transactional ordering correct in recovery, we need to
> ensure the new transaction re-logs the modifications that are being
> flushed to disk.
> 
> By relogging, we ensure that if the transaction makes it to disk and
> the buffer doesn't, then recovery replays all the changes upto that
> point correctly. If the transaction does not make it disk, but the
> buffer does, then recovery will see that the inode on disk is more
> recent than in the log and won't overwrite it with changes that it
> already contains.
> 
> The upshot of this is that while the inode buffer sits in memory
> with the inode in the "flushed" state, fdatasync is going to see
> the relogged state in the ili_fields. i.e:
> 
> What is happening here is this:
> 
> state			ili_fields		ili_last_fields
> clean			    0			      0
> modified, logged	   CORE			      0
> flushed to buffer	    0			     CORE
> <write>
> timestamp update	 TIMESTAMP		     CORE
> <xfs_trans_inode_log pulls in ili_last_fields>
> 		       CORE|TIMESTAMP		     CORE
> <fdatasync>
>   sees field other than TIMESTAMP in ili_fields,
>   triggers xfs_log_force_lsn to flush inode
> 
> <buffer submittted>
> buffer IO done (clean) CORE|TIMESTAMP		      0
> .....
> <write>
> timestamp update       CORE|TIMESTAMP		      0
> <fdatasync>
>   sees field other than TIMESTAMP in ili_fields,
>   triggers xfs_log_force_lsn to flush inode
> .....
> <write>
> timestamp update       CORE|TIMESTAMP		      0
> <fdatasync>
>   sees field other than TIMESTAMP in ili_fields,
>   triggers xfs_log_force_lsn to flush inode
> 
> So, essentially, once a race condition on the buffer flush has
> occurred, the condition is not cleared until the inode is flushed to
> the buffer again and it is written without racing against the inode
> being dirtied again.
> 
> The behaviour we really want here is to capture the timestamp
> update transactionally, but not trigger fdatasync because of the
> relogged fields that haven't been modified since the inode was
> flushed to the buffer. We still need to relog them, but we don't
> need to force the log in the fdatasync case.
> 
> To do this, don't fold the ili_last_fields value into ili_fields
> when logging the inode. Instead, fold it into the fields that get
> logged during formatting of the inode item. This means that we will
> stop logging those fields the moment we know that there is a more
> recent version of the inode on disk than we have in the log and so
> we don't need to log those fields anymore as the next transaction on
> disk doesn't need to replay them.
> 
> Doing this separates changes that are in memory but are not being
> flushed from those that have been flushed. Hence we can now look at
> ili_fields and hence see what fields have been modified since the
> last flush, and hence whether fdatasync needs to force the log or
> not. If non-timestamp changes have been made since the inode was
> flushed to the backing buffer, then fdatasync will do exactly the
> right thing (i.e. force the log).
> 
> Reported-by: Sage Weil <sage@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - include the hunk from fs/xfs/xfs_trans_inode.c that I missed
>   when committing the patch locally the first time.

I gave this a go on my machine but I'm still seeing the same symptom.  
I've gathered the trace, strace, and other useful bits at

   http://newdream.net/~sage/drop/rocksdb.2/

This is pretty easy to reproduce with the ceph_test_keyvaluedb binary 
(built on fedora 22), also in that dir:

   rm -rf kv_test_temp_dir/
   ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1

Thanks for looking into this!
sage

> 
>  fs/xfs/xfs_inode_item.c  | 161 +++++++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_inode_item.h  |   1 +
>  fs/xfs/xfs_trans_inode.c |   8 ---
>  3 files changed, 115 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 62bd80f..be04eb2 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -29,6 +29,63 @@
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
>  
> +/*
> + * Notes on ili_fields, ili_format_fields and ili_last_fields.
> + *
> + * ili_fields contains the flags that reflect the current changes that are in
> + * memory and have been logged, but have not been flushed to the backing buffer
> + * for writeback yet. When a transaction is done, the fields that are modified
> + * are added to ili_fields and all those fields are logged. This results in
> + * repeated transactions correctly relogging all the other changes in memory and
> + * allows the inode to be re-ordered (moved forward) in the AIL safely.
> + * ili_fields is copied to ili_last_fields when the inode is flushed to the
> + * backing buffer and is then cleared, indicating that the inode in memory is
> + * now clean from a transactional change point of view and does not need to
> + * relog those fields anymore on tranaction commit.
> + *
> + * ili_last_fields, therefore, is only set while there is an unresolved inode
> + * flush being done (i.e. flush lock is held). We need to keep this state
> + * available to avoid a transaction recovery vs inode buffer IO completion race
> + * if we crash. If the inode is logged again makes it to disk before the inode
> + * buffer IO complete, log recovery will replay the latest inode transaction and
> + * can lose the changes that were in the inode buffer. Hence we still need to
> + * log the changes that have been flushed so that transactions issued during the
> + * inode buffer write still contain the modifications being flushed. Once the
> + * inode buffer IO completes, we no longer need to log those changes as the
> + * inode on disk the same as the inode in memory apart from the changes made
> + * since the inode was flushed (i.e. those recorded in ili_fields).
> + *
> + * ili_format_fields is only used during transaction commit. It is the
> + * aggregation of ili_fields and ili_last_fields, as sampled during the sizing
> + * of the inode item to be logged. We need this field definition to be constant
> + * between sizing and formatting, but inode buffer IO can complete
> + * asynchronously with transaction commit and so we must only read it once so
> + * that the different stages of the item formatting work correctly. We don't
> + * care about the async completion clearing ili_last_fields after we've sampled
> + * it - if we log too much then so be it - we won't log it next time. Once we've
> + * formatted the inode item, we need to propagate the ili_format_fields value to
> + * the on-disk inode log item format field, and then use it to clear all the
> + * fields that we marked for logging but were not dirty from ili_fields.
> + *
> + * This separation of changes allows us to accurately determine what fields in
> + * the inode have changed since it was last flushed to disk. This is important
> + * for fdatasync() performance as there are certain fields that, if modified,
> + * should not trigger log forces because they contain metadata that fdatasync()
> + * does not need to guarantee is safe on disk. This is, currently, just
> + * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata.
> + *
> + * If we simply fold ili_last_fields back into ili_fields when we log the inode
> + * in a transaction, then if we have a flush/modification race it results in
> + * every timestamp update causing fdatasync to flush the log because ili_fields
> + * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will
> + * persist until the inode is next flushed to it's backing buffer and that
> + * buffer is written and completed without another flush/modification race
> + * occuring.  Hence we keep the change state separate and only combine them when
> + * formatting the inode into the log.
> + *
> + * For more information, there's another big comment in xfs_iflush_int() about
> + * this flush/modification race condition.
> + */
>  
>  kmem_zone_t	*xfs_ili_zone;		/* inode log item zone */
>  
> @@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
>  STATIC void
>  xfs_inode_item_data_fork_size(
>  	struct xfs_inode_log_item *iip,
> +	unsigned int		fields,
>  	int			*nvecs,
>  	int			*nbytes)
>  {
> @@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size(
>  
>  	switch (ip->i_d.di_format) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
> +		if ((fields & XFS_ILOG_DEXT) &&
>  		    ip->i_d.di_nextents > 0 &&
>  		    ip->i_df.if_bytes > 0) {
>  			/* worst case, doesn't subtract delalloc extents */
> @@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size(
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
> +		if ((fields & XFS_ILOG_DBROOT) &&
>  		    ip->i_df.if_broot_bytes > 0) {
>  			*nbytes += ip->i_df.if_broot_bytes;
>  			*nvecs += 1;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
> +		if ((fields & XFS_ILOG_DDATA) &&
>  		    ip->i_df.if_bytes > 0) {
>  			*nbytes += roundup(ip->i_df.if_bytes, 4);
>  			*nvecs += 1;
> @@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size(
>  STATIC void
>  xfs_inode_item_attr_fork_size(
>  	struct xfs_inode_log_item *iip,
> +	unsigned int		fields,
>  	int			*nvecs,
>  	int			*nbytes)
>  {
> @@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size(
>  
>  	switch (ip->i_d.di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
> +		if ((fields & XFS_ILOG_AEXT) &&
>  		    ip->i_d.di_anextents > 0 &&
>  		    ip->i_afp->if_bytes > 0) {
>  			/* worst case, doesn't subtract unused space */
> @@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size(
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
> +		if ((fields & XFS_ILOG_ABROOT) &&
>  		    ip->i_afp->if_broot_bytes > 0) {
>  			*nbytes += ip->i_afp->if_broot_bytes;
>  			*nvecs += 1;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
> +		if ((fields & XFS_ILOG_ADATA) &&
>  		    ip->i_afp->if_bytes > 0) {
>  			*nbytes += roundup(ip->i_afp->if_bytes, 4);
>  			*nvecs += 1;
> @@ -133,13 +192,17 @@ xfs_inode_item_size(
>  	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
>  	struct xfs_inode	*ip = iip->ili_inode;
>  
> +	iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields;
> +
>  	*nvecs += 2;
>  	*nbytes += sizeof(struct xfs_inode_log_format) +
>  		   xfs_icdinode_size(ip->i_d.di_version);
>  
> -	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
> +	xfs_inode_item_data_fork_size(iip, iip->ili_format_fields,
> +				      nvecs, nbytes);
>  	if (XFS_IFORK_Q(ip))
> -		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
> +		xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields,
> +					      nvecs, nbytes);
>  }
>  
>  STATIC void
> @@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork(
>  {
>  	struct xfs_inode	*ip = iip->ili_inode;
>  	size_t			data_bytes;
> +	unsigned int		fields = iip->ili_format_fields;
>  
>  	switch (ip->i_d.di_format) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEV | XFS_ILOG_UUID);
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEV | XFS_ILOG_UUID);
>  
> -		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
> +		if ((fields & XFS_ILOG_DEXT) &&
>  		    ip->i_d.di_nextents > 0 &&
>  		    ip->i_df.if_bytes > 0) {
>  			struct xfs_bmbt_rec *p;
> @@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork(
>  			ilf->ilf_dsize = data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_DEXT;
> +			fields &= ~XFS_ILOG_DEXT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
> -			  XFS_ILOG_DEV | XFS_ILOG_UUID);
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
> +			    XFS_ILOG_DEV | XFS_ILOG_UUID);
>  
> -		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
> +		if ((fields & XFS_ILOG_DBROOT) &&
>  		    ip->i_df.if_broot_bytes > 0) {
>  			ASSERT(ip->i_df.if_broot != NULL);
>  			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT,
> @@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork(
>  			ilf->ilf_dsize = ip->i_df.if_broot_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			ASSERT(!(iip->ili_fields &
> -				 XFS_ILOG_DBROOT));
> -			iip->ili_fields &= ~XFS_ILOG_DBROOT;
> +			ASSERT(!(fields & XFS_ILOG_DBROOT));
> +			fields &= ~XFS_ILOG_DBROOT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEV | XFS_ILOG_UUID);
> -		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
> +		fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEV | XFS_ILOG_UUID);
> +
> +		if ((fields & XFS_ILOG_DDATA) &&
>  		    ip->i_df.if_bytes > 0) {
>  			/*
>  			 * Round i_bytes up to a word boundary.
> @@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork(
>  			ilf->ilf_dsize = (unsigned)data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_DDATA;
> +			fields &= ~XFS_ILOG_DDATA;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_DEV:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEXT | XFS_ILOG_UUID);
> -		if (iip->ili_fields & XFS_ILOG_DEV)
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEXT | XFS_ILOG_UUID);
> +		if (fields & XFS_ILOG_DEV)
>  			ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev;
>  		break;
>  	case XFS_DINODE_FMT_UUID:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEXT | XFS_ILOG_DEV);
> -		if (iip->ili_fields & XFS_ILOG_UUID)
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEXT | XFS_ILOG_DEV);
> +		if (fields & XFS_ILOG_UUID)
>  			ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid;
>  		break;
>  	default:
>  		ASSERT(0);
>  		break;
>  	}
> +
> +	/* reflect the logged fields back in ili_format_fields */
> +	iip->ili_format_fields = fields;
>  }
>  
>  STATIC void
> @@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork(
>  {
>  	struct xfs_inode	*ip = iip->ili_inode;
>  	size_t			data_bytes;
> +	unsigned int		fields = iip->ili_format_fields;
>  
>  	switch (ip->i_d.di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
> +		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
>  
> -		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
> +		if ((fields & XFS_ILOG_AEXT) &&
>  		    ip->i_d.di_anextents > 0 &&
>  		    ip->i_afp->if_bytes > 0) {
>  			struct xfs_bmbt_rec *p;
> @@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork(
>  			ilf->ilf_asize = data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_AEXT;
> +			fields &= ~XFS_ILOG_AEXT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
> +		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
>  
> -		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
> +		if ((fields & XFS_ILOG_ABROOT) &&
>  		    ip->i_afp->if_broot_bytes > 0) {
>  			ASSERT(ip->i_afp->if_broot != NULL);
>  
> @@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork(
>  			ilf->ilf_asize = ip->i_afp->if_broot_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_ABROOT;
> +			fields &= ~XFS_ILOG_ABROOT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
> +		fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
>  
> -		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
> +		if ((fields & XFS_ILOG_ADATA) &&
>  		    ip->i_afp->if_bytes > 0) {
>  			/*
>  			 * Round i_bytes up to a word boundary.
> @@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork(
>  			ilf->ilf_asize = (unsigned)data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_ADATA;
> +			fields &= ~XFS_ILOG_ADATA;
>  		}
>  		break;
>  	default:
>  		ASSERT(0);
>  		break;
>  	}
> +
> +	/* reflect the logged fields back in ili_format_fields */
> +	iip->ili_format_fields = fields;
>  }
>  
>  /*
> @@ -359,12 +422,16 @@ xfs_inode_item_format(
>  	if (XFS_IFORK_Q(ip)) {
>  		xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp);
>  	} else {
> -		iip->ili_fields &=
> +		iip->ili_format_fields &=
>  			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT);
>  	}
>  
>  	/* update the format with the exact fields we actually logged */
> -	ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP);
> +	ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP);
> +
> +	/* clear any fields we didn't log from ili_fields */
> +	iip->ili_fields &= iip->ili_format_fields;
> +	iip->ili_format_fields = 0;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 488d812..43a9e1c 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
>  	unsigned short		ili_logged;	   /* flushed logged data */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
> +	unsigned int		ili_format_fields; /* combined log fields */
>  } xfs_inode_log_item_t;
>  
>  static inline int xfs_inode_clean(xfs_inode_t *ip)
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index 17280cd..77f1e8a 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -123,13 +123,5 @@ xfs_trans_log_inode(
>  	tp->t_flags |= XFS_TRANS_DIRTY;
>  	ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY;
>  
> -	/*
> -	 * Always OR in the bits from the ili_last_fields field.
> -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> -	 * routines in the eventual clearing of the ili_fields bits.
> -	 * See the big comment in xfs_iflush() for an explanation of
> -	 * this coordination mechanism.
> -	 */
> -	flags |= ip->i_itemp->ili_last_fields;
>  	ip->i_itemp->ili_fields |= flags;
>  }
> 
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-28 15:11   ` Sage Weil
@ 2015-08-28 22:04     ` Dave Chinner
  2015-08-31  2:21       ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-08-28 22:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: xfs

On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote:
> Hi Dave,
> 
> On Fri, 28 Aug 2015, Dave Chinner wrote:
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Sage Weil reported that a ceph test workload was writing to the
> > log on every fdatasync during an overwrite workload. Event tracing
> > showed that the only metadata modification being made was the
> > timestamp updates during the write(2) syscall, but fdatasync(2)
> > is supposed to ignore them. The key observation was that the
> > transactions in the log all looked like this:
[....]

> > ---
> > Version 2:
> > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed
> >   when committing the patch locally the first time.
> 
> I gave this a go on my machine but I'm still seeing the same symptom.  

OK, that implies the inode buffer has not been submitted for IO and
so the inode is being held in "flushing" state for an extended
period of time.

> I've gathered the trace, strace, and other useful bits at
> 
>    http://newdream.net/~sage/drop/rocksdb.2/
> 
> This is pretty easy to reproduce with the ceph_test_keyvaluedb binary 
> (built on fedora 22), also in that dir:
> 
>    rm -rf kv_test_temp_dir/
>    ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1

I'll have a deeper look.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-28 22:04     ` Dave Chinner
@ 2015-08-31  2:21       ` Dave Chinner
  2015-08-31  8:48         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-08-31  2:21 UTC (permalink / raw)
  To: Sage Weil; +Cc: xfs

On Sat, Aug 29, 2015 at 08:04:54AM +1000, Dave Chinner wrote:
> On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote:
> > Hi Dave,
> > 
> > On Fri, 28 Aug 2015, Dave Chinner wrote:
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Sage Weil reported that a ceph test workload was writing to the
> > > log on every fdatasync during an overwrite workload. Event tracing
> > > showed that the only metadata modification being made was the
> > > timestamp updates during the write(2) syscall, but fdatasync(2)
> > > is supposed to ignore them. The key observation was that the
> > > transactions in the log all looked like this:
> [....]
> 
> > > ---
> > > Version 2:
> > > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed
> > >   when committing the patch locally the first time.
> > 
> > I gave this a go on my machine but I'm still seeing the same symptom.  
> 
> OK, that implies the inode buffer has not been submitted for IO and
> so the inode is being held in "flushing" state for an extended
> period of time.
> 
> > I've gathered the trace, strace, and other useful bits at
> > 
> >    http://newdream.net/~sage/drop/rocksdb.2/
> > 
> > This is pretty easy to reproduce with the ceph_test_keyvaluedb binary 
> > (built on fedora 22), also in that dir:
> > 
> >    rm -rf kv_test_temp_dir/
> >    ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1
> 
> I'll have a deeper look.

Ok, I was assuming this is a longer running test than it is - it
only takes about 2300ms to run on my test box. Hence the problem is
that the inode has never been flushed out, and so it's being
relogged in full on every fdatasync() operation. Another, similar
change is necessary to track the changes since the last time the
inode was flushed to the log.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-31  2:21       ` Dave Chinner
@ 2015-08-31  8:48         ` Dave Chinner
  2015-08-31 12:40           ` Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-08-31  8:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: xfs

On Mon, Aug 31, 2015 at 12:21:55PM +1000, Dave Chinner wrote:
> On Sat, Aug 29, 2015 at 08:04:54AM +1000, Dave Chinner wrote:
> > On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote:
> > > Hi Dave,
> > > 
> > > On Fri, 28 Aug 2015, Dave Chinner wrote:
> > > > 
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Sage Weil reported that a ceph test workload was writing to the
> > > > log on every fdatasync during an overwrite workload. Event tracing
> > > > showed that the only metadata modification being made was the
> > > > timestamp updates during the write(2) syscall, but fdatasync(2)
> > > > is supposed to ignore them. The key observation was that the
> > > > transactions in the log all looked like this:
> > [....]
> > 
> > > > ---
> > > > Version 2:
> > > > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed
> > > >   when committing the patch locally the first time.
> > > 
> > > I gave this a go on my machine but I'm still seeing the same symptom.  
> > 
> > OK, that implies the inode buffer has not been submitted for IO and
> > so the inode is being held in "flushing" state for an extended
> > period of time.
> > 
> > > I've gathered the trace, strace, and other useful bits at
> > > 
> > >    http://newdream.net/~sage/drop/rocksdb.2/
> > > 
> > > This is pretty easy to reproduce with the ceph_test_keyvaluedb binary 
> > > (built on fedora 22), also in that dir:
> > > 
> > >    rm -rf kv_test_temp_dir/
> > >    ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1
> > 
> > I'll have a deeper look.
> 
> Ok, I was assuming this is a longer running test than it is - it
> only takes about 2300ms to run on my test box. Hence the problem is
> that the inode has never been flushed out, and so it's being
> relogged in full on every fdatasync() operation. Another, similar
> change is necessary to track the changes since the last time the
> inode was flushed to the log.

This is more complex than I expected, and I ended up chasing down a
rathole because performance is now *highly variable*.

Before making these changes, the performance was consistent at
2.3-2.6s for a test run on my test machine. It's on SSDs, with a 1GB
BBWC in front of them, so there's minimal IO latency, and hence the
average commit latency was around 2.5ms for the original (bad)
fdatasync behaviour.

priming
now doing small writes
1024 commits in 2.582939, avg latency 0.0025224
[       OK ] KeyValueDB/KVTest.BenchCommit/1 (2600 ms)
[----------] 1 test from KeyValueDB/KVTest (2600 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (2600 ms total)
[  PASSED  ] 1 test.


Now I see the test often complete in about 1.1s, or ~1.1ms per
write/fdatasync operation. That's definitely an improvement,

priming
now doing small writes
1024 commits in 1.133530, avg latency 0.00110696
[       OK ] KeyValueDB/KVTest.BenchCommit/1 (1148 ms)
[----------] 1 test from KeyValueDB/KVTest (1148 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1149 ms total)
[  PASSED  ] 1 test.

but every 4-5 test runs I'd see a test run that would take 8s to
complete:

priming
now doing small writes
1024 commits in 8.373202, avg latency 0.00817696
[       OK ] KeyValueDB/KVTest.BenchCommit/1 (8429 ms)
[----------] 1 test from KeyValueDB/KVTest (8430 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (8430 ms total)
[  PASSED  ] 1 test.

This is completely reproducable, and very annoying as it doesn't
happen without these fdatasync optimisation patches.

After taking a tangent to find a tracepoint regression that was
getting in my way, I found that there was a significant pause
between the inode locking calls within xfs_file_fsync and the inode
locking calls on the buffered write. Roughly 8ms, in fact, on almost
every call. After adding a couple more test trace points into the
XFS fsync code, it turns out that a hardware cache flush is causing
the delay. That is, because we aren't doing log writes that trigger
cache flushes and FUA writes, we have to issue a
blkdev_issue_flush() call from xfs_file_fsync and that is taking 8ms
to complete.

IOWs, on my hardware, making fdatasync work correctly completely
destroys workload determinism.  The full fsync adds 1ms to each
operation, but it is constant and unchanging. The fdatasyncs fast
most of the time, but every so often the cache flush blows
performance all to hell and so over time (e.g. 100 test runs)
fdatasync is little faster than using a full fsync.

So, that's the rat hole I've been down today - once again proving
that Modern Storage Hardware Still Sucks.

Anyway, I'm going to do correctness testing now that I know the
performance problems are hardware and not software.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-31  8:48         ` Dave Chinner
@ 2015-08-31 12:40           ` Sage Weil
  2015-08-31 21:51             ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2015-08-31 12:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, 31 Aug 2015, Dave Chinner wrote:
> On Mon, Aug 31, 2015 at 12:21:55PM +1000, Dave Chinner wrote:
> > On Sat, Aug 29, 2015 at 08:04:54AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote:
> > > > Hi Dave,
> > > > 
> > > > On Fri, 28 Aug 2015, Dave Chinner wrote:
> > > > > 
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Sage Weil reported that a ceph test workload was writing to the
> > > > > log on every fdatasync during an overwrite workload. Event tracing
> > > > > showed that the only metadata modification being made was the
> > > > > timestamp updates during the write(2) syscall, but fdatasync(2)
> > > > > is supposed to ignore them. The key observation was that the
> > > > > transactions in the log all looked like this:
> > > [....]
> > > 
> > > > > ---
> > > > > Version 2:
> > > > > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed
> > > > >   when committing the patch locally the first time.
> > > > 
> > > > I gave this a go on my machine but I'm still seeing the same symptom.  
> > > 
> > > OK, that implies the inode buffer has not been submitted for IO and
> > > so the inode is being held in "flushing" state for an extended
> > > period of time.
> > > 
> > > > I've gathered the trace, strace, and other useful bits at
> > > > 
> > > >    http://newdream.net/~sage/drop/rocksdb.2/
> > > > 
> > > > This is pretty easy to reproduce with the ceph_test_keyvaluedb binary 
> > > > (built on fedora 22), also in that dir:
> > > > 
> > > >    rm -rf kv_test_temp_dir/
> > > >    ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1
> > > 
> > > I'll have a deeper look.
> > 
> > Ok, I was assuming this is a longer running test than it is - it
> > only takes about 2300ms to run on my test box. Hence the problem is
> > that the inode has never been flushed out, and so it's being
> > relogged in full on every fdatasync() operation. Another, similar
> > change is necessary to track the changes since the last time the
> > inode was flushed to the log.
> 
> This is more complex than I expected, and I ended up chasing down a
> rathole because performance is now *highly variable*.
> 
> Before making these changes, the performance was consistent at
> 2.3-2.6s for a test run on my test machine. It's on SSDs, with a 1GB
> BBWC in front of them, so there's minimal IO latency, and hence the
> average commit latency was around 2.5ms for the original (bad)
> fdatasync behaviour.
> 
> priming
> now doing small writes
> 1024 commits in 2.582939, avg latency 0.0025224
> [       OK ] KeyValueDB/KVTest.BenchCommit/1 (2600 ms)
> [----------] 1 test from KeyValueDB/KVTest (2600 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (2600 ms total)
> [  PASSED  ] 1 test.
> 
> 
> Now I see the test often complete in about 1.1s, or ~1.1ms per
> write/fdatasync operation. That's definitely an improvement,
> 
> priming
> now doing small writes
> 1024 commits in 1.133530, avg latency 0.00110696
> [       OK ] KeyValueDB/KVTest.BenchCommit/1 (1148 ms)
> [----------] 1 test from KeyValueDB/KVTest (1148 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (1149 ms total)
> [  PASSED  ] 1 test.
> 
> but every 4-5 test runs I'd see a test run that would take 8s to
> complete:
> 
> priming
> now doing small writes
> 1024 commits in 8.373202, avg latency 0.00817696
> [       OK ] KeyValueDB/KVTest.BenchCommit/1 (8429 ms)
> [----------] 1 test from KeyValueDB/KVTest (8430 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (8430 ms total)
> [  PASSED  ] 1 test.
> 
> This is completely reproducable, and very annoying as it doesn't
> happen without these fdatasync optimisation patches.

Note that rocksdb itself might be doing something nondeterministic too 
(like writing out 4mb sst files from the priming stage).  Though it sounds 
like that isn't (the only thing) happening here... and it certainly 
wouldn't account for 7 seconds.

> After taking a tangent to find a tracepoint regression that was
> getting in my way, I found that there was a significant pause
> between the inode locking calls within xfs_file_fsync and the inode
> locking calls on the buffered write. Roughly 8ms, in fact, on almost
> every call. After adding a couple more test trace points into the
> XFS fsync code, it turns out that a hardware cache flush is causing
> the delay. That is, because we aren't doing log writes that trigger
> cache flushes and FUA writes, we have to issue a
> blkdev_issue_flush() call from xfs_file_fsync and that is taking 8ms
> to complete.

This is where my understanding of block layer flushing really breaks down, 
but in both cases we're issues flush requests to the hardware, right? Is 
the difference that the log write is a FUA flush request with data, and 
blkdev_issue_flush() issues a flush request without associated data?

> IOWs, on my hardware, making fdatasync work correctly completely
> destroys workload determinism.  The full fsync adds 1ms to each
> operation, but it is constant and unchanging. The fdatasyncs fast
> most of the time, but every so often the cache flush blows
> performance all to hell and so over time (e.g. 100 test runs)
> fdatasync is little faster than using a full fsync.
> 
> So, that's the rat hole I've been down today - once again proving
> that Modern Storage Hardware Still Sucks.
> 
> Anyway, I'm going to do correctness testing now that I know the
> performance problems are hardware and not software.....

Thanks, Dave!  Sorry this snowballed...

sage

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-31 12:40           ` Sage Weil
@ 2015-08-31 21:51             ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2015-08-31 21:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: xfs

On Mon, Aug 31, 2015 at 05:40:04AM -0700, Sage Weil wrote:
> On Mon, 31 Aug 2015, Dave Chinner wrote:
> > After taking a tangent to find a tracepoint regression that was
> > getting in my way, I found that there was a significant pause
> > between the inode locking calls within xfs_file_fsync and the inode
> > locking calls on the buffered write. Roughly 8ms, in fact, on almost
> > every call. After adding a couple more test trace points into the
> > XFS fsync code, it turns out that a hardware cache flush is causing
> > the delay. That is, because we aren't doing log writes that trigger
> > cache flushes and FUA writes, we have to issue a
> > blkdev_issue_flush() call from xfs_file_fsync and that is taking 8ms
> > to complete.
> 
> This is where my understanding of block layer flushing really breaks down, 
> but in both cases we're issues flush requests to the hardware, right? Is 
> the difference that the log write is a FUA flush request with data, and 
> blkdev_issue_flush() issues a flush request without associated data?

Pretty much, though th elog write also does a cache flush before the
FUA write. i.e.  The log writes consist of a bio with data issued via:

	submit_bio(REQ_FUA | REQ_FLUSH | WRITE_SYNC, bio);

blkdev_issue_flush consists of an empty bio issued via:

	submit_bio(REQ_FLUSH | WRITE_SYNC, bio);

So from a block layer and filesystem point of view there is little
difference, and the only difference at the SCSI layer is the WRITE
w/ FUA that is issued after the cache flush in the log write case
(see https://lwn.net/Articles/400541/ fo a bit more background).

I haven't looked any deeper than this so far - I don't have time
right now to do so...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic
  2015-08-28  4:32 ` [PATCH V2] " Dave Chinner
  2015-08-28 15:11   ` Sage Weil
@ 2015-09-08 14:45   ` Brian Foster
  1 sibling, 0 replies; 9+ messages in thread
From: Brian Foster @ 2015-09-08 14:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: sage, xfs

On Fri, Aug 28, 2015 at 02:32:53PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Sage Weil reported that a ceph test workload was writing to the
> log on every fdatasync during an overwrite workload. Event tracing
> showed that the only metadata modification being made was the
> timestamp updates during the write(2) syscall, but fdatasync(2)
> is supposed to ignore them. The key observation was that the
> transactions in the log all looked like this:
> 
> INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
> 
> And contained a flags field of 0x45 or 0x85, and had data and
> attribute forks following the inode core. This means that the
> timestamp updates were triggering dirty relogging of previously
> logged parts of the inode that hadn't yet been flushed back to
> disk.
> 
> This is caused by xfs_trans_log_inode(), where it folds the dirty
> fields that have previously been logged back into the current
> transaction dirty fields from the inode item ili_last_fields. The
> issue is that ili_last_fields only contains a non-zero value when
> the inode is in the process of being flushed. The typical state
> progression is this, using a core field update as the modification
> occuring:
> 
> state			ili_fields		ili_last_fields
> clean			    0			      0
> modified, logged	XFS_ILOG_CORE		      0
> flushed to buffer	    0			XFS_ILOG_CORE
> <buffer submitted>
> buffer IO done (clean)	    0			      0
> 
> However, if we run a new transaction after it has been flushed to
> buffer but before the buffer IO is done, we don't know if the
> modifications in the inode buffer (i.e. what is in ili_last_fields)
> will reach the disk before the new transaction reaches the log.
> Hence to keep transactional ordering correct in recovery, we need to
> ensure the new transaction re-logs the modifications that are being
> flushed to disk.
> 
> By relogging, we ensure that if the transaction makes it to disk and
> the buffer doesn't, then recovery replays all the changes upto that
> point correctly. If the transaction does not make it disk, but the
> buffer does, then recovery will see that the inode on disk is more
> recent than in the log and won't overwrite it with changes that it
> already contains.
> 
> The upshot of this is that while the inode buffer sits in memory
> with the inode in the "flushed" state, fdatasync is going to see
> the relogged state in the ili_fields. i.e:
> 
> What is happening here is this:
> 
> state			ili_fields		ili_last_fields
> clean			    0			      0
> modified, logged	   CORE			      0
> flushed to buffer	    0			     CORE
> <write>
> timestamp update	 TIMESTAMP		     CORE
> <xfs_trans_inode_log pulls in ili_last_fields>
> 		       CORE|TIMESTAMP		     CORE
> <fdatasync>
>   sees field other than TIMESTAMP in ili_fields,
>   triggers xfs_log_force_lsn to flush inode
> 
> <buffer submittted>
> buffer IO done (clean) CORE|TIMESTAMP		      0
> .....

Shouldn't both states be cleared here? If so, I don't see how the
problem persists unless the race is perpetual...

> <write>
> timestamp update       CORE|TIMESTAMP		      0
> <fdatasync>
>   sees field other than TIMESTAMP in ili_fields,
>   triggers xfs_log_force_lsn to flush inode
> .....
> <write>
> timestamp update       CORE|TIMESTAMP		      0
> <fdatasync>
>   sees field other than TIMESTAMP in ili_fields,
>   triggers xfs_log_force_lsn to flush inode
> 
> So, essentially, once a race condition on the buffer flush has
> occurred, the condition is not cleared until the inode is flushed to
> the buffer again and it is written without racing against the inode
> being dirtied again.
> 

Ok, so we would have to continuously race between the buffer I/O and
inode logging events for this behavior to continue. How likely is that
to happen on a constant workload? I ask because I see Sage had pointed
out that this might not address his originally reported problem, so I'm
wondering whether we still want to pursue this change.

> The behaviour we really want here is to capture the timestamp
> update transactionally, but not trigger fdatasync because of the
> relogged fields that haven't been modified since the inode was
> flushed to the buffer. We still need to relog them, but we don't
> need to force the log in the fdatasync case.
> 
> To do this, don't fold the ili_last_fields value into ili_fields
> when logging the inode. Instead, fold it into the fields that get
> logged during formatting of the inode item. This means that we will
> stop logging those fields the moment we know that there is a more
> recent version of the inode on disk than we have in the log and so
> we don't need to log those fields anymore as the next transaction on
> disk doesn't need to replay them.
> 
> Doing this separates changes that are in memory but are not being
> flushed from those that have been flushed. Hence we can now look at
> ili_fields and hence see what fields have been modified since the
> last flush, and hence whether fdatasync needs to force the log or
> not. If non-timestamp changes have been made since the inode was
> flushed to the backing buffer, then fdatasync will do exactly the
> right thing (i.e. force the log).
> 
> Reported-by: Sage Weil <sage@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - include the hunk from fs/xfs/xfs_trans_inode.c that I missed
>   when committing the patch locally the first time.
> 
>  fs/xfs/xfs_inode_item.c  | 161 +++++++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_inode_item.h  |   1 +
>  fs/xfs/xfs_trans_inode.c |   8 ---
>  3 files changed, 115 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 62bd80f..be04eb2 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -29,6 +29,63 @@
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
>  
> +/*
> + * Notes on ili_fields, ili_format_fields and ili_last_fields.
> + *
> + * ili_fields contains the flags that reflect the current changes that are in
> + * memory and have been logged, but have not been flushed to the backing buffer
> + * for writeback yet. When a transaction is done, the fields that are modified
> + * are added to ili_fields and all those fields are logged. This results in
> + * repeated transactions correctly relogging all the other changes in memory and
> + * allows the inode to be re-ordered (moved forward) in the AIL safely.
> + * ili_fields is copied to ili_last_fields when the inode is flushed to the
> + * backing buffer and is then cleared, indicating that the inode in memory is
> + * now clean from a transactional change point of view and does not need to
> + * relog those fields anymore on tranaction commit.
> + *
> + * ili_last_fields, therefore, is only set while there is an unresolved inode
> + * flush being done (i.e. flush lock is held). We need to keep this state
> + * available to avoid a transaction recovery vs inode buffer IO completion race
> + * if we crash. If the inode is logged again makes it to disk before the inode
					       ^ and


> + * buffer IO complete, log recovery will replay the latest inode transaction and
		completes

> + * can lose the changes that were in the inode buffer. Hence we still need to
> + * log the changes that have been flushed so that transactions issued during the
> + * inode buffer write still contain the modifications being flushed. Once the
> + * inode buffer IO completes, we no longer need to log those changes as the
> + * inode on disk the same as the inode in memory apart from the changes made
		   ^ is

> + * since the inode was flushed (i.e. those recorded in ili_fields).
> + *
> + * ili_format_fields is only used during transaction commit. It is the
> + * aggregation of ili_fields and ili_last_fields, as sampled during the sizing
> + * of the inode item to be logged. We need this field definition to be constant
> + * between sizing and formatting, but inode buffer IO can complete
> + * asynchronously with transaction commit and so we must only read it once so
> + * that the different stages of the item formatting work correctly. We don't

>From the comment alone, I'm a little confused as to whether this means
we must combine and sample ili_format_fields only once, or just
ili_last_fields as aggregrated into the former (and ili_format_fields is
safe for the remainder of the transaction sizing/format sequence).

> + * care about the async completion clearing ili_last_fields after we've sampled
> + * it - if we log too much then so be it - we won't log it next time. Once we've
> + * formatted the inode item, we need to propagate the ili_format_fields value to
> + * the on-disk inode log item format field, and then use it to clear all the
> + * fields that we marked for logging but were not dirty from ili_fields.
> + *
> + * This separation of changes allows us to accurately determine what fields in
> + * the inode have changed since it was last flushed to disk. This is important
> + * for fdatasync() performance as there are certain fields that, if modified,
> + * should not trigger log forces because they contain metadata that fdatasync()
> + * does not need to guarantee is safe on disk. This is, currently, just
> + * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata.
> + *
> + * If we simply fold ili_last_fields back into ili_fields when we log the inode
> + * in a transaction, then if we have a flush/modification race it results in
> + * every timestamp update causing fdatasync to flush the log because ili_fields
> + * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will
> + * persist until the inode is next flushed to it's backing buffer and that
> + * buffer is written and completed without another flush/modification race
> + * occuring.  Hence we keep the change state separate and only combine them when
> + * formatting the inode into the log.
> + *

Thanks for the big comment. I think the above two paragraphs could
probably be a bit more generic/succinct, however:

"This separation of changes is necessary because certain codepaths
require precise knowledge about what has been dirtied in the inode since
the last transaction commit. For example, fdatasync() is allowed to skip
log flushes in the event that only XFS_ILOG_TIMESTAMP is specified in
ili_fields. Repopulating ili_fields from ili_last_fields at log time, as
was previously done, pollutes this information and can defeat the
optimization."

> + * For more information, there's another big comment in xfs_iflush_int() about
> + * this flush/modification race condition.
> + */
>  
>  kmem_zone_t	*xfs_ili_zone;		/* inode log item zone */
>  
> @@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
>  STATIC void
>  xfs_inode_item_data_fork_size(
>  	struct xfs_inode_log_item *iip,
> +	unsigned int		fields,
>  	int			*nvecs,
>  	int			*nbytes)
>  {
> @@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size(
>  
>  	switch (ip->i_d.di_format) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
> +		if ((fields & XFS_ILOG_DEXT) &&
>  		    ip->i_d.di_nextents > 0 &&
>  		    ip->i_df.if_bytes > 0) {
>  			/* worst case, doesn't subtract delalloc extents */
> @@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size(
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
> +		if ((fields & XFS_ILOG_DBROOT) &&
>  		    ip->i_df.if_broot_bytes > 0) {
>  			*nbytes += ip->i_df.if_broot_bytes;
>  			*nvecs += 1;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
> +		if ((fields & XFS_ILOG_DDATA) &&
>  		    ip->i_df.if_bytes > 0) {
>  			*nbytes += roundup(ip->i_df.if_bytes, 4);
>  			*nvecs += 1;
> @@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size(
>  STATIC void
>  xfs_inode_item_attr_fork_size(
>  	struct xfs_inode_log_item *iip,
> +	unsigned int		fields,
>  	int			*nvecs,
>  	int			*nbytes)
>  {
> @@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size(
>  
>  	switch (ip->i_d.di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
> +		if ((fields & XFS_ILOG_AEXT) &&
>  		    ip->i_d.di_anextents > 0 &&
>  		    ip->i_afp->if_bytes > 0) {
>  			/* worst case, doesn't subtract unused space */
> @@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size(
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
> +		if ((fields & XFS_ILOG_ABROOT) &&
>  		    ip->i_afp->if_broot_bytes > 0) {
>  			*nbytes += ip->i_afp->if_broot_bytes;
>  			*nvecs += 1;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
> +		if ((fields & XFS_ILOG_ADATA) &&
>  		    ip->i_afp->if_bytes > 0) {
>  			*nbytes += roundup(ip->i_afp->if_bytes, 4);
>  			*nvecs += 1;
> @@ -133,13 +192,17 @@ xfs_inode_item_size(
>  	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
>  	struct xfs_inode	*ip = iip->ili_inode;
>  
> +	iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields;
> +
>  	*nvecs += 2;
>  	*nbytes += sizeof(struct xfs_inode_log_format) +
>  		   xfs_icdinode_size(ip->i_d.di_version);
>  
> -	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
> +	xfs_inode_item_data_fork_size(iip, iip->ili_format_fields,
> +				      nvecs, nbytes);
>  	if (XFS_IFORK_Q(ip))
> -		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
> +		xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields,
> +					      nvecs, nbytes);

We aggregate ili_format_fields and pass the result as a parameter to the
sizing helpers, but we just use ili_format_fields directly from the
subsequent formatting helpers. The latter seems more clean to me since
we pass iip everywhere already. Could we use that same pattern here as
well?

I suppose this also answers my question above in that ili_last_fields is
sampled once and ili_format_fields used thereafter in the commit
sequence.

Brian

>  }
>  
>  STATIC void
> @@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork(
>  {
>  	struct xfs_inode	*ip = iip->ili_inode;
>  	size_t			data_bytes;
> +	unsigned int		fields = iip->ili_format_fields;
>  
>  	switch (ip->i_d.di_format) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEV | XFS_ILOG_UUID);
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEV | XFS_ILOG_UUID);
>  
> -		if ((iip->ili_fields & XFS_ILOG_DEXT) &&
> +		if ((fields & XFS_ILOG_DEXT) &&
>  		    ip->i_d.di_nextents > 0 &&
>  		    ip->i_df.if_bytes > 0) {
>  			struct xfs_bmbt_rec *p;
> @@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork(
>  			ilf->ilf_dsize = data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_DEXT;
> +			fields &= ~XFS_ILOG_DEXT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
> -			  XFS_ILOG_DEV | XFS_ILOG_UUID);
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT |
> +			    XFS_ILOG_DEV | XFS_ILOG_UUID);
>  
> -		if ((iip->ili_fields & XFS_ILOG_DBROOT) &&
> +		if ((fields & XFS_ILOG_DBROOT) &&
>  		    ip->i_df.if_broot_bytes > 0) {
>  			ASSERT(ip->i_df.if_broot != NULL);
>  			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT,
> @@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork(
>  			ilf->ilf_dsize = ip->i_df.if_broot_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			ASSERT(!(iip->ili_fields &
> -				 XFS_ILOG_DBROOT));
> -			iip->ili_fields &= ~XFS_ILOG_DBROOT;
> +			ASSERT(!(fields & XFS_ILOG_DBROOT));
> +			fields &= ~XFS_ILOG_DBROOT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEV | XFS_ILOG_UUID);
> -		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
> +		fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEV | XFS_ILOG_UUID);
> +
> +		if ((fields & XFS_ILOG_DDATA) &&
>  		    ip->i_df.if_bytes > 0) {
>  			/*
>  			 * Round i_bytes up to a word boundary.
> @@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork(
>  			ilf->ilf_dsize = (unsigned)data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_DDATA;
> +			fields &= ~XFS_ILOG_DDATA;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_DEV:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEXT | XFS_ILOG_UUID);
> -		if (iip->ili_fields & XFS_ILOG_DEV)
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEXT | XFS_ILOG_UUID);
> +		if (fields & XFS_ILOG_DEV)
>  			ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev;
>  		break;
>  	case XFS_DINODE_FMT_UUID:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> -			  XFS_ILOG_DEXT | XFS_ILOG_DEV);
> -		if (iip->ili_fields & XFS_ILOG_UUID)
> +		fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT |
> +			    XFS_ILOG_DEXT | XFS_ILOG_DEV);
> +		if (fields & XFS_ILOG_UUID)
>  			ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid;
>  		break;
>  	default:
>  		ASSERT(0);
>  		break;
>  	}
> +
> +	/* reflect the logged fields back in ili_format_fields */
> +	iip->ili_format_fields = fields;
>  }
>  
>  STATIC void
> @@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork(
>  {
>  	struct xfs_inode	*ip = iip->ili_inode;
>  	size_t			data_bytes;
> +	unsigned int		fields = iip->ili_format_fields;
>  
>  	switch (ip->i_d.di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
> +		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT);
>  
> -		if ((iip->ili_fields & XFS_ILOG_AEXT) &&
> +		if ((fields & XFS_ILOG_AEXT) &&
>  		    ip->i_d.di_anextents > 0 &&
>  		    ip->i_afp->if_bytes > 0) {
>  			struct xfs_bmbt_rec *p;
> @@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork(
>  			ilf->ilf_asize = data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_AEXT;
> +			fields &= ~XFS_ILOG_AEXT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_BTREE:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
> +		fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT);
>  
> -		if ((iip->ili_fields & XFS_ILOG_ABROOT) &&
> +		if ((fields & XFS_ILOG_ABROOT) &&
>  		    ip->i_afp->if_broot_bytes > 0) {
>  			ASSERT(ip->i_afp->if_broot != NULL);
>  
> @@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork(
>  			ilf->ilf_asize = ip->i_afp->if_broot_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_ABROOT;
> +			fields &= ~XFS_ILOG_ABROOT;
>  		}
>  		break;
>  	case XFS_DINODE_FMT_LOCAL:
> -		iip->ili_fields &=
> -			~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
> +		fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT);
>  
> -		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
> +		if ((fields & XFS_ILOG_ADATA) &&
>  		    ip->i_afp->if_bytes > 0) {
>  			/*
>  			 * Round i_bytes up to a word boundary.
> @@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork(
>  			ilf->ilf_asize = (unsigned)data_bytes;
>  			ilf->ilf_size++;
>  		} else {
> -			iip->ili_fields &= ~XFS_ILOG_ADATA;
> +			fields &= ~XFS_ILOG_ADATA;
>  		}
>  		break;
>  	default:
>  		ASSERT(0);
>  		break;
>  	}
> +
> +	/* reflect the logged fields back in ili_format_fields */
> +	iip->ili_format_fields = fields;
>  }
>  
>  /*
> @@ -359,12 +422,16 @@ xfs_inode_item_format(
>  	if (XFS_IFORK_Q(ip)) {
>  		xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp);
>  	} else {
> -		iip->ili_fields &=
> +		iip->ili_format_fields &=
>  			~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT);
>  	}
>  
>  	/* update the format with the exact fields we actually logged */
> -	ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP);
> +	ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP);
> +
> +	/* clear any fields we didn't log from ili_fields */
> +	iip->ili_fields &= iip->ili_format_fields;
> +	iip->ili_format_fields = 0;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 488d812..43a9e1c 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
>  	unsigned short		ili_logged;	   /* flushed logged data */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
> +	unsigned int		ili_format_fields; /* combined log fields */
>  } xfs_inode_log_item_t;
>  
>  static inline int xfs_inode_clean(xfs_inode_t *ip)
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index 17280cd..77f1e8a 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -123,13 +123,5 @@ xfs_trans_log_inode(
>  	tp->t_flags |= XFS_TRANS_DIRTY;
>  	ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY;
>  
> -	/*
> -	 * Always OR in the bits from the ili_last_fields field.
> -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> -	 * routines in the eventual clearing of the ili_fields bits.
> -	 * See the big comment in xfs_iflush() for an explanation of
> -	 * this coordination mechanism.
> -	 */
> -	flags |= ip->i_itemp->ili_last_fields;
>  	ip->i_itemp->ili_fields |= flags;
>  }
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-09-08 14:45 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-28  1:23 [PATCH] xfs: timestamp updates cause excessive fdatasync log traffic Dave Chinner
2015-08-28  4:32 ` [PATCH V2] " Dave Chinner
2015-08-28 15:11   ` Sage Weil
2015-08-28 22:04     ` Dave Chinner
2015-08-31  2:21       ` Dave Chinner
2015-08-31  8:48         ` Dave Chinner
2015-08-31 12:40           ` Sage Weil
2015-08-31 21:51             ` Dave Chinner
2015-09-08 14:45   ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox