* [PATCH] xfs: timestamp updates cause excessive fdatasync log traffic @ 2015-08-28 1:23 Dave Chinner 2015-08-28 4:32 ` [PATCH V2] " Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2015-08-28 1:23 UTC (permalink / raw) To: xfs; +Cc: sage From: Dave Chinner <dchinner@redhat.com> Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this: INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 And contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk. This is caused by xfs_trans_log_inode(), where it folds the dirty fields that have previously been logged back into the current transaction dirty fields from the inode item ili_last_fields. The issue is that ili_last_fields only contains a non-zero value when the inode is in the process of being flushed. The typical state progression is this, using a core field update as the modification occuring: state ili_fields ili_last_fields clean 0 0 modified, logged XFS_ILOG_CORE 0 flushed to buffer 0 XFS_ILOG_CORE <buffer submitted> buffer IO done (clean) 0 0 However, if we run a new transaction after it has been flushed to buffer but before the buffer IO is done, we don't know if the modifications in the inode buffer (i.e. what is in ili_last_fields) will reach the disk before the new transaction reaches the log. Hence to keep transactional ordering correct in recovery, we need to ensure the new transaction re-logs the modifications that are being flushed to disk. By relogging, we ensure that if the transaction makes it to disk and the buffer doesn't, then recovery replays all the changes upto that point correctly. If the transaction does not make it disk, but the buffer does, then recovery will see that the inode on disk is more recent than in the log and won't overwrite it with changes that it already contains. The upshot of this is that while the inode buffer sits in memory with the inode in the "flushed" state, fdatasync is going to see the relogged state in the ili_fields. i.e: What is happening here is this: state ili_fields ili_last_fields clean 0 0 modified, logged CORE 0 flushed to buffer 0 CORE <write> timestamp update TIMESTAMP CORE <xfs_trans_inode_log pulls in ili_last_fields> CORE|TIMESTAMP CORE <fdatasync> sees field other than TIMESTAMP in ili_fields, triggers xfs_log_force_lsn to flush inode <buffer submittted> buffer IO done (clean) CORE|TIMESTAMP 0 ..... <write> timestamp update CORE|TIMESTAMP 0 <fdatasync> sees field other than TIMESTAMP in ili_fields, triggers xfs_log_force_lsn to flush inode ..... <write> timestamp update CORE|TIMESTAMP 0 <fdatasync> sees field other than TIMESTAMP in ili_fields, triggers xfs_log_force_lsn to flush inode So, essentially, once a race condition on the buffer flush has occurred, the condition is not cleared until the inode is flushed to the buffer again and it is written without racing against the inode being dirtied again. The behaviour we really want here is to capture the timestamp update transactionally, but not trigger fdatasync because of the relogged fields that haven't been modified since the inode was flushed to the buffer. We still need to relog them, but we don't need to force the log in the fdatasync case. To do this, don't fold the ili_last_fields value into ili_fields when logging the inode. Instead, fold it into the fields that get logged during formatting of the inode item. This means that we will stop logging those fields the moment we know that there is a more recent version of the inode on disk than we have in the log and so we don't need to log those fields anymore as the next transaction on disk doesn't need to replay them. Doing this separates changes that are in memory but are not being flushed from those that have been flushed. Hence we can now look at ili_fields and hence see what fields have been modified since the last flush, and hence whether fdatasync needs to force the log or not. If non-timestamp changes have been made since the inode was flushed to the backing buffer, then fdatasync will do exactly the right thing (i.e. force the log). Reported-by: Sage Weil <sage@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_inode_item.c | 161 ++++++++++++++++++++++++++++++++++-------------- fs/xfs/xfs_inode_item.h | 1 + 2 files changed, 115 insertions(+), 47 deletions(-) diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 62bd80f..be04eb2 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -29,6 +29,63 @@ #include "xfs_trans_priv.h" #include "xfs_log.h" +/* + * Notes on ili_fields, ili_format_fields and ili_last_fields. + * + * ili_fields contains the flags that reflect the current changes that are in + * memory and have been logged, but have not been flushed to the backing buffer + * for writeback yet. When a transaction is done, the fields that are modified + * are added to ili_fields and all those fields are logged. This results in + * repeated transactions correctly relogging all the other changes in memory and + * allows the inode to be re-ordered (moved forward) in the AIL safely. + * ili_fields is copied to ili_last_fields when the inode is flushed to the + * backing buffer and is then cleared, indicating that the inode in memory is + * now clean from a transactional change point of view and does not need to + * relog those fields anymore on tranaction commit. + * + * ili_last_fields, therefore, is only set while there is an unresolved inode + * flush being done (i.e. flush lock is held). We need to keep this state + * available to avoid a transaction recovery vs inode buffer IO completion race + * if we crash. If the inode is logged again makes it to disk before the inode + * buffer IO complete, log recovery will replay the latest inode transaction and + * can lose the changes that were in the inode buffer. Hence we still need to + * log the changes that have been flushed so that transactions issued during the + * inode buffer write still contain the modifications being flushed. Once the + * inode buffer IO completes, we no longer need to log those changes as the + * inode on disk the same as the inode in memory apart from the changes made + * since the inode was flushed (i.e. those recorded in ili_fields). + * + * ili_format_fields is only used during transaction commit. It is the + * aggregation of ili_fields and ili_last_fields, as sampled during the sizing + * of the inode item to be logged. We need this field definition to be constant + * between sizing and formatting, but inode buffer IO can complete + * asynchronously with transaction commit and so we must only read it once so + * that the different stages of the item formatting work correctly. We don't + * care about the async completion clearing ili_last_fields after we've sampled + * it - if we log too much then so be it - we won't log it next time. Once we've + * formatted the inode item, we need to propagate the ili_format_fields value to + * the on-disk inode log item format field, and then use it to clear all the + * fields that we marked for logging but were not dirty from ili_fields. + * + * This separation of changes allows us to accurately determine what fields in + * the inode have changed since it was last flushed to disk. This is important + * for fdatasync() performance as there are certain fields that, if modified, + * should not trigger log forces because they contain metadata that fdatasync() + * does not need to guarantee is safe on disk. This is, currently, just + * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata. + * + * If we simply fold ili_last_fields back into ili_fields when we log the inode + * in a transaction, then if we have a flush/modification race it results in + * every timestamp update causing fdatasync to flush the log because ili_fields + * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will + * persist until the inode is next flushed to it's backing buffer and that + * buffer is written and completed without another flush/modification race + * occuring. Hence we keep the change state separate and only combine them when + * formatting the inode into the log. + * + * For more information, there's another big comment in xfs_iflush_int() about + * this flush/modification race condition. + */ kmem_zone_t *xfs_ili_zone; /* inode log item zone */ @@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip) STATIC void xfs_inode_item_data_fork_size( struct xfs_inode_log_item *iip, + unsigned int fields, int *nvecs, int *nbytes) { @@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size( switch (ip->i_d.di_format) { case XFS_DINODE_FMT_EXTENTS: - if ((iip->ili_fields & XFS_ILOG_DEXT) && + if ((fields & XFS_ILOG_DEXT) && ip->i_d.di_nextents > 0 && ip->i_df.if_bytes > 0) { /* worst case, doesn't subtract delalloc extents */ @@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size( } break; case XFS_DINODE_FMT_BTREE: - if ((iip->ili_fields & XFS_ILOG_DBROOT) && + if ((fields & XFS_ILOG_DBROOT) && ip->i_df.if_broot_bytes > 0) { *nbytes += ip->i_df.if_broot_bytes; *nvecs += 1; } break; case XFS_DINODE_FMT_LOCAL: - if ((iip->ili_fields & XFS_ILOG_DDATA) && + if ((fields & XFS_ILOG_DDATA) && ip->i_df.if_bytes > 0) { *nbytes += roundup(ip->i_df.if_bytes, 4); *nvecs += 1; @@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size( STATIC void xfs_inode_item_attr_fork_size( struct xfs_inode_log_item *iip, + unsigned int fields, int *nvecs, int *nbytes) { @@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size( switch (ip->i_d.di_aformat) { case XFS_DINODE_FMT_EXTENTS: - if ((iip->ili_fields & XFS_ILOG_AEXT) && + if ((fields & XFS_ILOG_AEXT) && ip->i_d.di_anextents > 0 && ip->i_afp->if_bytes > 0) { /* worst case, doesn't subtract unused space */ @@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size( } break; case XFS_DINODE_FMT_BTREE: - if ((iip->ili_fields & XFS_ILOG_ABROOT) && + if ((fields & XFS_ILOG_ABROOT) && ip->i_afp->if_broot_bytes > 0) { *nbytes += ip->i_afp->if_broot_bytes; *nvecs += 1; } break; case XFS_DINODE_FMT_LOCAL: - if ((iip->ili_fields & XFS_ILOG_ADATA) && + if ((fields & XFS_ILOG_ADATA) && ip->i_afp->if_bytes > 0) { *nbytes += roundup(ip->i_afp->if_bytes, 4); *nvecs += 1; @@ -133,13 +192,17 @@ xfs_inode_item_size( struct xfs_inode_log_item *iip = INODE_ITEM(lip); struct xfs_inode *ip = iip->ili_inode; + iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields; + *nvecs += 2; *nbytes += sizeof(struct xfs_inode_log_format) + xfs_icdinode_size(ip->i_d.di_version); - xfs_inode_item_data_fork_size(iip, nvecs, nbytes); + xfs_inode_item_data_fork_size(iip, iip->ili_format_fields, + nvecs, nbytes); if (XFS_IFORK_Q(ip)) - xfs_inode_item_attr_fork_size(iip, nvecs, nbytes); + xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields, + nvecs, nbytes); } STATIC void @@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork( { struct xfs_inode *ip = iip->ili_inode; size_t data_bytes; + unsigned int fields = iip->ili_format_fields; switch (ip->i_d.di_format) { case XFS_DINODE_FMT_EXTENTS: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | - XFS_ILOG_DEV | XFS_ILOG_UUID); + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | + XFS_ILOG_DEV | XFS_ILOG_UUID); - if ((iip->ili_fields & XFS_ILOG_DEXT) && + if ((fields & XFS_ILOG_DEXT) && ip->i_d.di_nextents > 0 && ip->i_df.if_bytes > 0) { struct xfs_bmbt_rec *p; @@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork( ilf->ilf_dsize = data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_DEXT; + fields &= ~XFS_ILOG_DEXT; } break; case XFS_DINODE_FMT_BTREE: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | - XFS_ILOG_DEV | XFS_ILOG_UUID); + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | + XFS_ILOG_DEV | XFS_ILOG_UUID); - if ((iip->ili_fields & XFS_ILOG_DBROOT) && + if ((fields & XFS_ILOG_DBROOT) && ip->i_df.if_broot_bytes > 0) { ASSERT(ip->i_df.if_broot != NULL); xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT, @@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork( ilf->ilf_dsize = ip->i_df.if_broot_bytes; ilf->ilf_size++; } else { - ASSERT(!(iip->ili_fields & - XFS_ILOG_DBROOT)); - iip->ili_fields &= ~XFS_ILOG_DBROOT; + ASSERT(!(fields & XFS_ILOG_DBROOT)); + fields &= ~XFS_ILOG_DBROOT; } break; case XFS_DINODE_FMT_LOCAL: - iip->ili_fields &= - ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | - XFS_ILOG_DEV | XFS_ILOG_UUID); - if ((iip->ili_fields & XFS_ILOG_DDATA) && + fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | + XFS_ILOG_DEV | XFS_ILOG_UUID); + + if ((fields & XFS_ILOG_DDATA) && ip->i_df.if_bytes > 0) { /* * Round i_bytes up to a word boundary. @@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork( ilf->ilf_dsize = (unsigned)data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_DDATA; + fields &= ~XFS_ILOG_DDATA; } break; case XFS_DINODE_FMT_DEV: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | - XFS_ILOG_DEXT | XFS_ILOG_UUID); - if (iip->ili_fields & XFS_ILOG_DEV) + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | + XFS_ILOG_DEXT | XFS_ILOG_UUID); + if (fields & XFS_ILOG_DEV) ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev; break; case XFS_DINODE_FMT_UUID: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | - XFS_ILOG_DEXT | XFS_ILOG_DEV); - if (iip->ili_fields & XFS_ILOG_UUID) + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | + XFS_ILOG_DEXT | XFS_ILOG_DEV); + if (fields & XFS_ILOG_UUID) ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid; break; default: ASSERT(0); break; } + + /* reflect the logged fields back in ili_format_fields */ + iip->ili_format_fields = fields; } STATIC void @@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork( { struct xfs_inode *ip = iip->ili_inode; size_t data_bytes; + unsigned int fields = iip->ili_format_fields; switch (ip->i_d.di_aformat) { case XFS_DINODE_FMT_EXTENTS: - iip->ili_fields &= - ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); - if ((iip->ili_fields & XFS_ILOG_AEXT) && + if ((fields & XFS_ILOG_AEXT) && ip->i_d.di_anextents > 0 && ip->i_afp->if_bytes > 0) { struct xfs_bmbt_rec *p; @@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork( ilf->ilf_asize = data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_AEXT; + fields &= ~XFS_ILOG_AEXT; } break; case XFS_DINODE_FMT_BTREE: - iip->ili_fields &= - ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); - if ((iip->ili_fields & XFS_ILOG_ABROOT) && + if ((fields & XFS_ILOG_ABROOT) && ip->i_afp->if_broot_bytes > 0) { ASSERT(ip->i_afp->if_broot != NULL); @@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork( ilf->ilf_asize = ip->i_afp->if_broot_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_ABROOT; + fields &= ~XFS_ILOG_ABROOT; } break; case XFS_DINODE_FMT_LOCAL: - iip->ili_fields &= - ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); + fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); - if ((iip->ili_fields & XFS_ILOG_ADATA) && + if ((fields & XFS_ILOG_ADATA) && ip->i_afp->if_bytes > 0) { /* * Round i_bytes up to a word boundary. @@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork( ilf->ilf_asize = (unsigned)data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_ADATA; + fields &= ~XFS_ILOG_ADATA; } break; default: ASSERT(0); break; } + + /* reflect the logged fields back in ili_format_fields */ + iip->ili_format_fields = fields; } /* @@ -359,12 +422,16 @@ xfs_inode_item_format( if (XFS_IFORK_Q(ip)) { xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp); } else { - iip->ili_fields &= + iip->ili_format_fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT); } /* update the format with the exact fields we actually logged */ - ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP); + ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP); + + /* clear any fields we didn't log from ili_fields */ + iip->ili_fields &= iip->ili_format_fields; + iip->ili_format_fields = 0; } /* diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index 488d812..43a9e1c 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item { unsigned short ili_logged; /* flushed logged data */ unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ + unsigned int ili_format_fields; /* combined log fields */ } xfs_inode_log_item_t; static inline int xfs_inode_clean(xfs_inode_t *ip) -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-28 1:23 [PATCH] xfs: timestamp updates cause excessive fdatasync log traffic Dave Chinner @ 2015-08-28 4:32 ` Dave Chinner 2015-08-28 15:11 ` Sage Weil 2015-09-08 14:45 ` Brian Foster 0 siblings, 2 replies; 9+ messages in thread From: Dave Chinner @ 2015-08-28 4:32 UTC (permalink / raw) To: xfs; +Cc: sage From: Dave Chinner <dchinner@redhat.com> Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this: INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 And contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk. This is caused by xfs_trans_log_inode(), where it folds the dirty fields that have previously been logged back into the current transaction dirty fields from the inode item ili_last_fields. The issue is that ili_last_fields only contains a non-zero value when the inode is in the process of being flushed. The typical state progression is this, using a core field update as the modification occuring: state ili_fields ili_last_fields clean 0 0 modified, logged XFS_ILOG_CORE 0 flushed to buffer 0 XFS_ILOG_CORE <buffer submitted> buffer IO done (clean) 0 0 However, if we run a new transaction after it has been flushed to buffer but before the buffer IO is done, we don't know if the modifications in the inode buffer (i.e. what is in ili_last_fields) will reach the disk before the new transaction reaches the log. Hence to keep transactional ordering correct in recovery, we need to ensure the new transaction re-logs the modifications that are being flushed to disk. By relogging, we ensure that if the transaction makes it to disk and the buffer doesn't, then recovery replays all the changes upto that point correctly. If the transaction does not make it disk, but the buffer does, then recovery will see that the inode on disk is more recent than in the log and won't overwrite it with changes that it already contains. The upshot of this is that while the inode buffer sits in memory with the inode in the "flushed" state, fdatasync is going to see the relogged state in the ili_fields. i.e: What is happening here is this: state ili_fields ili_last_fields clean 0 0 modified, logged CORE 0 flushed to buffer 0 CORE <write> timestamp update TIMESTAMP CORE <xfs_trans_inode_log pulls in ili_last_fields> CORE|TIMESTAMP CORE <fdatasync> sees field other than TIMESTAMP in ili_fields, triggers xfs_log_force_lsn to flush inode <buffer submittted> buffer IO done (clean) CORE|TIMESTAMP 0 ..... <write> timestamp update CORE|TIMESTAMP 0 <fdatasync> sees field other than TIMESTAMP in ili_fields, triggers xfs_log_force_lsn to flush inode ..... <write> timestamp update CORE|TIMESTAMP 0 <fdatasync> sees field other than TIMESTAMP in ili_fields, triggers xfs_log_force_lsn to flush inode So, essentially, once a race condition on the buffer flush has occurred, the condition is not cleared until the inode is flushed to the buffer again and it is written without racing against the inode being dirtied again. The behaviour we really want here is to capture the timestamp update transactionally, but not trigger fdatasync because of the relogged fields that haven't been modified since the inode was flushed to the buffer. We still need to relog them, but we don't need to force the log in the fdatasync case. To do this, don't fold the ili_last_fields value into ili_fields when logging the inode. Instead, fold it into the fields that get logged during formatting of the inode item. This means that we will stop logging those fields the moment we know that there is a more recent version of the inode on disk than we have in the log and so we don't need to log those fields anymore as the next transaction on disk doesn't need to replay them. Doing this separates changes that are in memory but are not being flushed from those that have been flushed. Hence we can now look at ili_fields and hence see what fields have been modified since the last flush, and hence whether fdatasync needs to force the log or not. If non-timestamp changes have been made since the inode was flushed to the backing buffer, then fdatasync will do exactly the right thing (i.e. force the log). Reported-by: Sage Weil <sage@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- Version 2: - include the hunk from fs/xfs/xfs_trans_inode.c that I missed when committing the patch locally the first time. fs/xfs/xfs_inode_item.c | 161 +++++++++++++++++++++++++++++++++-------------- fs/xfs/xfs_inode_item.h | 1 + fs/xfs/xfs_trans_inode.c | 8 --- 3 files changed, 115 insertions(+), 55 deletions(-) diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 62bd80f..be04eb2 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -29,6 +29,63 @@ #include "xfs_trans_priv.h" #include "xfs_log.h" +/* + * Notes on ili_fields, ili_format_fields and ili_last_fields. + * + * ili_fields contains the flags that reflect the current changes that are in + * memory and have been logged, but have not been flushed to the backing buffer + * for writeback yet. When a transaction is done, the fields that are modified + * are added to ili_fields and all those fields are logged. This results in + * repeated transactions correctly relogging all the other changes in memory and + * allows the inode to be re-ordered (moved forward) in the AIL safely. + * ili_fields is copied to ili_last_fields when the inode is flushed to the + * backing buffer and is then cleared, indicating that the inode in memory is + * now clean from a transactional change point of view and does not need to + * relog those fields anymore on tranaction commit. + * + * ili_last_fields, therefore, is only set while there is an unresolved inode + * flush being done (i.e. flush lock is held). We need to keep this state + * available to avoid a transaction recovery vs inode buffer IO completion race + * if we crash. If the inode is logged again makes it to disk before the inode + * buffer IO complete, log recovery will replay the latest inode transaction and + * can lose the changes that were in the inode buffer. Hence we still need to + * log the changes that have been flushed so that transactions issued during the + * inode buffer write still contain the modifications being flushed. Once the + * inode buffer IO completes, we no longer need to log those changes as the + * inode on disk the same as the inode in memory apart from the changes made + * since the inode was flushed (i.e. those recorded in ili_fields). + * + * ili_format_fields is only used during transaction commit. It is the + * aggregation of ili_fields and ili_last_fields, as sampled during the sizing + * of the inode item to be logged. We need this field definition to be constant + * between sizing and formatting, but inode buffer IO can complete + * asynchronously with transaction commit and so we must only read it once so + * that the different stages of the item formatting work correctly. We don't + * care about the async completion clearing ili_last_fields after we've sampled + * it - if we log too much then so be it - we won't log it next time. Once we've + * formatted the inode item, we need to propagate the ili_format_fields value to + * the on-disk inode log item format field, and then use it to clear all the + * fields that we marked for logging but were not dirty from ili_fields. + * + * This separation of changes allows us to accurately determine what fields in + * the inode have changed since it was last flushed to disk. This is important + * for fdatasync() performance as there are certain fields that, if modified, + * should not trigger log forces because they contain metadata that fdatasync() + * does not need to guarantee is safe on disk. This is, currently, just + * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata. + * + * If we simply fold ili_last_fields back into ili_fields when we log the inode + * in a transaction, then if we have a flush/modification race it results in + * every timestamp update causing fdatasync to flush the log because ili_fields + * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will + * persist until the inode is next flushed to it's backing buffer and that + * buffer is written and completed without another flush/modification race + * occuring. Hence we keep the change state separate and only combine them when + * formatting the inode into the log. + * + * For more information, there's another big comment in xfs_iflush_int() about + * this flush/modification race condition. + */ kmem_zone_t *xfs_ili_zone; /* inode log item zone */ @@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip) STATIC void xfs_inode_item_data_fork_size( struct xfs_inode_log_item *iip, + unsigned int fields, int *nvecs, int *nbytes) { @@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size( switch (ip->i_d.di_format) { case XFS_DINODE_FMT_EXTENTS: - if ((iip->ili_fields & XFS_ILOG_DEXT) && + if ((fields & XFS_ILOG_DEXT) && ip->i_d.di_nextents > 0 && ip->i_df.if_bytes > 0) { /* worst case, doesn't subtract delalloc extents */ @@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size( } break; case XFS_DINODE_FMT_BTREE: - if ((iip->ili_fields & XFS_ILOG_DBROOT) && + if ((fields & XFS_ILOG_DBROOT) && ip->i_df.if_broot_bytes > 0) { *nbytes += ip->i_df.if_broot_bytes; *nvecs += 1; } break; case XFS_DINODE_FMT_LOCAL: - if ((iip->ili_fields & XFS_ILOG_DDATA) && + if ((fields & XFS_ILOG_DDATA) && ip->i_df.if_bytes > 0) { *nbytes += roundup(ip->i_df.if_bytes, 4); *nvecs += 1; @@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size( STATIC void xfs_inode_item_attr_fork_size( struct xfs_inode_log_item *iip, + unsigned int fields, int *nvecs, int *nbytes) { @@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size( switch (ip->i_d.di_aformat) { case XFS_DINODE_FMT_EXTENTS: - if ((iip->ili_fields & XFS_ILOG_AEXT) && + if ((fields & XFS_ILOG_AEXT) && ip->i_d.di_anextents > 0 && ip->i_afp->if_bytes > 0) { /* worst case, doesn't subtract unused space */ @@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size( } break; case XFS_DINODE_FMT_BTREE: - if ((iip->ili_fields & XFS_ILOG_ABROOT) && + if ((fields & XFS_ILOG_ABROOT) && ip->i_afp->if_broot_bytes > 0) { *nbytes += ip->i_afp->if_broot_bytes; *nvecs += 1; } break; case XFS_DINODE_FMT_LOCAL: - if ((iip->ili_fields & XFS_ILOG_ADATA) && + if ((fields & XFS_ILOG_ADATA) && ip->i_afp->if_bytes > 0) { *nbytes += roundup(ip->i_afp->if_bytes, 4); *nvecs += 1; @@ -133,13 +192,17 @@ xfs_inode_item_size( struct xfs_inode_log_item *iip = INODE_ITEM(lip); struct xfs_inode *ip = iip->ili_inode; + iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields; + *nvecs += 2; *nbytes += sizeof(struct xfs_inode_log_format) + xfs_icdinode_size(ip->i_d.di_version); - xfs_inode_item_data_fork_size(iip, nvecs, nbytes); + xfs_inode_item_data_fork_size(iip, iip->ili_format_fields, + nvecs, nbytes); if (XFS_IFORK_Q(ip)) - xfs_inode_item_attr_fork_size(iip, nvecs, nbytes); + xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields, + nvecs, nbytes); } STATIC void @@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork( { struct xfs_inode *ip = iip->ili_inode; size_t data_bytes; + unsigned int fields = iip->ili_format_fields; switch (ip->i_d.di_format) { case XFS_DINODE_FMT_EXTENTS: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | - XFS_ILOG_DEV | XFS_ILOG_UUID); + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | + XFS_ILOG_DEV | XFS_ILOG_UUID); - if ((iip->ili_fields & XFS_ILOG_DEXT) && + if ((fields & XFS_ILOG_DEXT) && ip->i_d.di_nextents > 0 && ip->i_df.if_bytes > 0) { struct xfs_bmbt_rec *p; @@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork( ilf->ilf_dsize = data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_DEXT; + fields &= ~XFS_ILOG_DEXT; } break; case XFS_DINODE_FMT_BTREE: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | - XFS_ILOG_DEV | XFS_ILOG_UUID); + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | + XFS_ILOG_DEV | XFS_ILOG_UUID); - if ((iip->ili_fields & XFS_ILOG_DBROOT) && + if ((fields & XFS_ILOG_DBROOT) && ip->i_df.if_broot_bytes > 0) { ASSERT(ip->i_df.if_broot != NULL); xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT, @@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork( ilf->ilf_dsize = ip->i_df.if_broot_bytes; ilf->ilf_size++; } else { - ASSERT(!(iip->ili_fields & - XFS_ILOG_DBROOT)); - iip->ili_fields &= ~XFS_ILOG_DBROOT; + ASSERT(!(fields & XFS_ILOG_DBROOT)); + fields &= ~XFS_ILOG_DBROOT; } break; case XFS_DINODE_FMT_LOCAL: - iip->ili_fields &= - ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | - XFS_ILOG_DEV | XFS_ILOG_UUID); - if ((iip->ili_fields & XFS_ILOG_DDATA) && + fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | + XFS_ILOG_DEV | XFS_ILOG_UUID); + + if ((fields & XFS_ILOG_DDATA) && ip->i_df.if_bytes > 0) { /* * Round i_bytes up to a word boundary. @@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork( ilf->ilf_dsize = (unsigned)data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_DDATA; + fields &= ~XFS_ILOG_DDATA; } break; case XFS_DINODE_FMT_DEV: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | - XFS_ILOG_DEXT | XFS_ILOG_UUID); - if (iip->ili_fields & XFS_ILOG_DEV) + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | + XFS_ILOG_DEXT | XFS_ILOG_UUID); + if (fields & XFS_ILOG_DEV) ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev; break; case XFS_DINODE_FMT_UUID: - iip->ili_fields &= - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | - XFS_ILOG_DEXT | XFS_ILOG_DEV); - if (iip->ili_fields & XFS_ILOG_UUID) + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | + XFS_ILOG_DEXT | XFS_ILOG_DEV); + if (fields & XFS_ILOG_UUID) ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid; break; default: ASSERT(0); break; } + + /* reflect the logged fields back in ili_format_fields */ + iip->ili_format_fields = fields; } STATIC void @@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork( { struct xfs_inode *ip = iip->ili_inode; size_t data_bytes; + unsigned int fields = iip->ili_format_fields; switch (ip->i_d.di_aformat) { case XFS_DINODE_FMT_EXTENTS: - iip->ili_fields &= - ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); - if ((iip->ili_fields & XFS_ILOG_AEXT) && + if ((fields & XFS_ILOG_AEXT) && ip->i_d.di_anextents > 0 && ip->i_afp->if_bytes > 0) { struct xfs_bmbt_rec *p; @@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork( ilf->ilf_asize = data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_AEXT; + fields &= ~XFS_ILOG_AEXT; } break; case XFS_DINODE_FMT_BTREE: - iip->ili_fields &= - ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); - if ((iip->ili_fields & XFS_ILOG_ABROOT) && + if ((fields & XFS_ILOG_ABROOT) && ip->i_afp->if_broot_bytes > 0) { ASSERT(ip->i_afp->if_broot != NULL); @@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork( ilf->ilf_asize = ip->i_afp->if_broot_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_ABROOT; + fields &= ~XFS_ILOG_ABROOT; } break; case XFS_DINODE_FMT_LOCAL: - iip->ili_fields &= - ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); + fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); - if ((iip->ili_fields & XFS_ILOG_ADATA) && + if ((fields & XFS_ILOG_ADATA) && ip->i_afp->if_bytes > 0) { /* * Round i_bytes up to a word boundary. @@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork( ilf->ilf_asize = (unsigned)data_bytes; ilf->ilf_size++; } else { - iip->ili_fields &= ~XFS_ILOG_ADATA; + fields &= ~XFS_ILOG_ADATA; } break; default: ASSERT(0); break; } + + /* reflect the logged fields back in ili_format_fields */ + iip->ili_format_fields = fields; } /* @@ -359,12 +422,16 @@ xfs_inode_item_format( if (XFS_IFORK_Q(ip)) { xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp); } else { - iip->ili_fields &= + iip->ili_format_fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT); } /* update the format with the exact fields we actually logged */ - ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP); + ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP); + + /* clear any fields we didn't log from ili_fields */ + iip->ili_fields &= iip->ili_format_fields; + iip->ili_format_fields = 0; } /* diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index 488d812..43a9e1c 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item { unsigned short ili_logged; /* flushed logged data */ unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ + unsigned int ili_format_fields; /* combined log fields */ } xfs_inode_log_item_t; static inline int xfs_inode_clean(xfs_inode_t *ip) diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index 17280cd..77f1e8a 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -123,13 +123,5 @@ xfs_trans_log_inode( tp->t_flags |= XFS_TRANS_DIRTY; ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY; - /* - * Always OR in the bits from the ili_last_fields field. - * This is to coordinate with the xfs_iflush() and xfs_iflush_done() - * routines in the eventual clearing of the ili_fields bits. - * See the big comment in xfs_iflush() for an explanation of - * this coordination mechanism. - */ - flags |= ip->i_itemp->ili_last_fields; ip->i_itemp->ili_fields |= flags; } _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-28 4:32 ` [PATCH V2] " Dave Chinner @ 2015-08-28 15:11 ` Sage Weil 2015-08-28 22:04 ` Dave Chinner 2015-09-08 14:45 ` Brian Foster 1 sibling, 1 reply; 9+ messages in thread From: Sage Weil @ 2015-08-28 15:11 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs Hi Dave, On Fri, 28 Aug 2015, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > Sage Weil reported that a ceph test workload was writing to the > log on every fdatasync during an overwrite workload. Event tracing > showed that the only metadata modification being made was the > timestamp updates during the write(2) syscall, but fdatasync(2) > is supposed to ignore them. The key observation was that the > transactions in the log all looked like this: > > INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 > > And contained a flags field of 0x45 or 0x85, and had data and > attribute forks following the inode core. This means that the > timestamp updates were triggering dirty relogging of previously > logged parts of the inode that hadn't yet been flushed back to > disk. > > This is caused by xfs_trans_log_inode(), where it folds the dirty > fields that have previously been logged back into the current > transaction dirty fields from the inode item ili_last_fields. The > issue is that ili_last_fields only contains a non-zero value when > the inode is in the process of being flushed. The typical state > progression is this, using a core field update as the modification > occuring: > > state ili_fields ili_last_fields > clean 0 0 > modified, logged XFS_ILOG_CORE 0 > flushed to buffer 0 XFS_ILOG_CORE > <buffer submitted> > buffer IO done (clean) 0 0 > > However, if we run a new transaction after it has been flushed to > buffer but before the buffer IO is done, we don't know if the > modifications in the inode buffer (i.e. what is in ili_last_fields) > will reach the disk before the new transaction reaches the log. > Hence to keep transactional ordering correct in recovery, we need to > ensure the new transaction re-logs the modifications that are being > flushed to disk. > > By relogging, we ensure that if the transaction makes it to disk and > the buffer doesn't, then recovery replays all the changes upto that > point correctly. If the transaction does not make it disk, but the > buffer does, then recovery will see that the inode on disk is more > recent than in the log and won't overwrite it with changes that it > already contains. > > The upshot of this is that while the inode buffer sits in memory > with the inode in the "flushed" state, fdatasync is going to see > the relogged state in the ili_fields. i.e: > > What is happening here is this: > > state ili_fields ili_last_fields > clean 0 0 > modified, logged CORE 0 > flushed to buffer 0 CORE > <write> > timestamp update TIMESTAMP CORE > <xfs_trans_inode_log pulls in ili_last_fields> > CORE|TIMESTAMP CORE > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > > <buffer submittted> > buffer IO done (clean) CORE|TIMESTAMP 0 > ..... > <write> > timestamp update CORE|TIMESTAMP 0 > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > ..... > <write> > timestamp update CORE|TIMESTAMP 0 > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > > So, essentially, once a race condition on the buffer flush has > occurred, the condition is not cleared until the inode is flushed to > the buffer again and it is written without racing against the inode > being dirtied again. > > The behaviour we really want here is to capture the timestamp > update transactionally, but not trigger fdatasync because of the > relogged fields that haven't been modified since the inode was > flushed to the buffer. We still need to relog them, but we don't > need to force the log in the fdatasync case. > > To do this, don't fold the ili_last_fields value into ili_fields > when logging the inode. Instead, fold it into the fields that get > logged during formatting of the inode item. This means that we will > stop logging those fields the moment we know that there is a more > recent version of the inode on disk than we have in the log and so > we don't need to log those fields anymore as the next transaction on > disk doesn't need to replay them. > > Doing this separates changes that are in memory but are not being > flushed from those that have been flushed. Hence we can now look at > ili_fields and hence see what fields have been modified since the > last flush, and hence whether fdatasync needs to force the log or > not. If non-timestamp changes have been made since the inode was > flushed to the backing buffer, then fdatasync will do exactly the > right thing (i.e. force the log). > > Reported-by: Sage Weil <sage@redhat.com> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > Version 2: > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed > when committing the patch locally the first time. I gave this a go on my machine but I'm still seeing the same symptom. I've gathered the trace, strace, and other useful bits at http://newdream.net/~sage/drop/rocksdb.2/ This is pretty easy to reproduce with the ceph_test_keyvaluedb binary (built on fedora 22), also in that dir: rm -rf kv_test_temp_dir/ ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1 Thanks for looking into this! sage > > fs/xfs/xfs_inode_item.c | 161 +++++++++++++++++++++++++++++++++-------------- > fs/xfs/xfs_inode_item.h | 1 + > fs/xfs/xfs_trans_inode.c | 8 --- > 3 files changed, 115 insertions(+), 55 deletions(-) > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 62bd80f..be04eb2 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -29,6 +29,63 @@ > #include "xfs_trans_priv.h" > #include "xfs_log.h" > > +/* > + * Notes on ili_fields, ili_format_fields and ili_last_fields. > + * > + * ili_fields contains the flags that reflect the current changes that are in > + * memory and have been logged, but have not been flushed to the backing buffer > + * for writeback yet. When a transaction is done, the fields that are modified > + * are added to ili_fields and all those fields are logged. This results in > + * repeated transactions correctly relogging all the other changes in memory and > + * allows the inode to be re-ordered (moved forward) in the AIL safely. > + * ili_fields is copied to ili_last_fields when the inode is flushed to the > + * backing buffer and is then cleared, indicating that the inode in memory is > + * now clean from a transactional change point of view and does not need to > + * relog those fields anymore on tranaction commit. > + * > + * ili_last_fields, therefore, is only set while there is an unresolved inode > + * flush being done (i.e. flush lock is held). We need to keep this state > + * available to avoid a transaction recovery vs inode buffer IO completion race > + * if we crash. If the inode is logged again makes it to disk before the inode > + * buffer IO complete, log recovery will replay the latest inode transaction and > + * can lose the changes that were in the inode buffer. Hence we still need to > + * log the changes that have been flushed so that transactions issued during the > + * inode buffer write still contain the modifications being flushed. Once the > + * inode buffer IO completes, we no longer need to log those changes as the > + * inode on disk the same as the inode in memory apart from the changes made > + * since the inode was flushed (i.e. those recorded in ili_fields). > + * > + * ili_format_fields is only used during transaction commit. It is the > + * aggregation of ili_fields and ili_last_fields, as sampled during the sizing > + * of the inode item to be logged. We need this field definition to be constant > + * between sizing and formatting, but inode buffer IO can complete > + * asynchronously with transaction commit and so we must only read it once so > + * that the different stages of the item formatting work correctly. We don't > + * care about the async completion clearing ili_last_fields after we've sampled > + * it - if we log too much then so be it - we won't log it next time. Once we've > + * formatted the inode item, we need to propagate the ili_format_fields value to > + * the on-disk inode log item format field, and then use it to clear all the > + * fields that we marked for logging but were not dirty from ili_fields. > + * > + * This separation of changes allows us to accurately determine what fields in > + * the inode have changed since it was last flushed to disk. This is important > + * for fdatasync() performance as there are certain fields that, if modified, > + * should not trigger log forces because they contain metadata that fdatasync() > + * does not need to guarantee is safe on disk. This is, currently, just > + * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata. > + * > + * If we simply fold ili_last_fields back into ili_fields when we log the inode > + * in a transaction, then if we have a flush/modification race it results in > + * every timestamp update causing fdatasync to flush the log because ili_fields > + * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will > + * persist until the inode is next flushed to it's backing buffer and that > + * buffer is written and completed without another flush/modification race > + * occuring. Hence we keep the change state separate and only combine them when > + * formatting the inode into the log. > + * > + * For more information, there's another big comment in xfs_iflush_int() about > + * this flush/modification race condition. > + */ > > kmem_zone_t *xfs_ili_zone; /* inode log item zone */ > > @@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip) > STATIC void > xfs_inode_item_data_fork_size( > struct xfs_inode_log_item *iip, > + unsigned int fields, > int *nvecs, > int *nbytes) > { > @@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size( > > switch (ip->i_d.di_format) { > case XFS_DINODE_FMT_EXTENTS: > - if ((iip->ili_fields & XFS_ILOG_DEXT) && > + if ((fields & XFS_ILOG_DEXT) && > ip->i_d.di_nextents > 0 && > ip->i_df.if_bytes > 0) { > /* worst case, doesn't subtract delalloc extents */ > @@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size( > } > break; > case XFS_DINODE_FMT_BTREE: > - if ((iip->ili_fields & XFS_ILOG_DBROOT) && > + if ((fields & XFS_ILOG_DBROOT) && > ip->i_df.if_broot_bytes > 0) { > *nbytes += ip->i_df.if_broot_bytes; > *nvecs += 1; > } > break; > case XFS_DINODE_FMT_LOCAL: > - if ((iip->ili_fields & XFS_ILOG_DDATA) && > + if ((fields & XFS_ILOG_DDATA) && > ip->i_df.if_bytes > 0) { > *nbytes += roundup(ip->i_df.if_bytes, 4); > *nvecs += 1; > @@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size( > STATIC void > xfs_inode_item_attr_fork_size( > struct xfs_inode_log_item *iip, > + unsigned int fields, > int *nvecs, > int *nbytes) > { > @@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size( > > switch (ip->i_d.di_aformat) { > case XFS_DINODE_FMT_EXTENTS: > - if ((iip->ili_fields & XFS_ILOG_AEXT) && > + if ((fields & XFS_ILOG_AEXT) && > ip->i_d.di_anextents > 0 && > ip->i_afp->if_bytes > 0) { > /* worst case, doesn't subtract unused space */ > @@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size( > } > break; > case XFS_DINODE_FMT_BTREE: > - if ((iip->ili_fields & XFS_ILOG_ABROOT) && > + if ((fields & XFS_ILOG_ABROOT) && > ip->i_afp->if_broot_bytes > 0) { > *nbytes += ip->i_afp->if_broot_bytes; > *nvecs += 1; > } > break; > case XFS_DINODE_FMT_LOCAL: > - if ((iip->ili_fields & XFS_ILOG_ADATA) && > + if ((fields & XFS_ILOG_ADATA) && > ip->i_afp->if_bytes > 0) { > *nbytes += roundup(ip->i_afp->if_bytes, 4); > *nvecs += 1; > @@ -133,13 +192,17 @@ xfs_inode_item_size( > struct xfs_inode_log_item *iip = INODE_ITEM(lip); > struct xfs_inode *ip = iip->ili_inode; > > + iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields; > + > *nvecs += 2; > *nbytes += sizeof(struct xfs_inode_log_format) + > xfs_icdinode_size(ip->i_d.di_version); > > - xfs_inode_item_data_fork_size(iip, nvecs, nbytes); > + xfs_inode_item_data_fork_size(iip, iip->ili_format_fields, > + nvecs, nbytes); > if (XFS_IFORK_Q(ip)) > - xfs_inode_item_attr_fork_size(iip, nvecs, nbytes); > + xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields, > + nvecs, nbytes); > } > > STATIC void > @@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork( > { > struct xfs_inode *ip = iip->ili_inode; > size_t data_bytes; > + unsigned int fields = iip->ili_format_fields; > > switch (ip->i_d.di_format) { > case XFS_DINODE_FMT_EXTENTS: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > > - if ((iip->ili_fields & XFS_ILOG_DEXT) && > + if ((fields & XFS_ILOG_DEXT) && > ip->i_d.di_nextents > 0 && > ip->i_df.if_bytes > 0) { > struct xfs_bmbt_rec *p; > @@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_DEXT; > + fields &= ~XFS_ILOG_DEXT; > } > break; > case XFS_DINODE_FMT_BTREE: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > > - if ((iip->ili_fields & XFS_ILOG_DBROOT) && > + if ((fields & XFS_ILOG_DBROOT) && > ip->i_df.if_broot_bytes > 0) { > ASSERT(ip->i_df.if_broot != NULL); > xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT, > @@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = ip->i_df.if_broot_bytes; > ilf->ilf_size++; > } else { > - ASSERT(!(iip->ili_fields & > - XFS_ILOG_DBROOT)); > - iip->ili_fields &= ~XFS_ILOG_DBROOT; > + ASSERT(!(fields & XFS_ILOG_DBROOT)); > + fields &= ~XFS_ILOG_DBROOT; > } > break; > case XFS_DINODE_FMT_LOCAL: > - iip->ili_fields &= > - ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > - if ((iip->ili_fields & XFS_ILOG_DDATA) && > + fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > + > + if ((fields & XFS_ILOG_DDATA) && > ip->i_df.if_bytes > 0) { > /* > * Round i_bytes up to a word boundary. > @@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = (unsigned)data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_DDATA; > + fields &= ~XFS_ILOG_DDATA; > } > break; > case XFS_DINODE_FMT_DEV: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEXT | XFS_ILOG_UUID); > - if (iip->ili_fields & XFS_ILOG_DEV) > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEXT | XFS_ILOG_UUID); > + if (fields & XFS_ILOG_DEV) > ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev; > break; > case XFS_DINODE_FMT_UUID: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEXT | XFS_ILOG_DEV); > - if (iip->ili_fields & XFS_ILOG_UUID) > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEXT | XFS_ILOG_DEV); > + if (fields & XFS_ILOG_UUID) > ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid; > break; > default: > ASSERT(0); > break; > } > + > + /* reflect the logged fields back in ili_format_fields */ > + iip->ili_format_fields = fields; > } > > STATIC void > @@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork( > { > struct xfs_inode *ip = iip->ili_inode; > size_t data_bytes; > + unsigned int fields = iip->ili_format_fields; > > switch (ip->i_d.di_aformat) { > case XFS_DINODE_FMT_EXTENTS: > - iip->ili_fields &= > - ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); > + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); > > - if ((iip->ili_fields & XFS_ILOG_AEXT) && > + if ((fields & XFS_ILOG_AEXT) && > ip->i_d.di_anextents > 0 && > ip->i_afp->if_bytes > 0) { > struct xfs_bmbt_rec *p; > @@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_AEXT; > + fields &= ~XFS_ILOG_AEXT; > } > break; > case XFS_DINODE_FMT_BTREE: > - iip->ili_fields &= > - ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); > + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); > > - if ((iip->ili_fields & XFS_ILOG_ABROOT) && > + if ((fields & XFS_ILOG_ABROOT) && > ip->i_afp->if_broot_bytes > 0) { > ASSERT(ip->i_afp->if_broot != NULL); > > @@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = ip->i_afp->if_broot_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_ABROOT; > + fields &= ~XFS_ILOG_ABROOT; > } > break; > case XFS_DINODE_FMT_LOCAL: > - iip->ili_fields &= > - ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); > + fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); > > - if ((iip->ili_fields & XFS_ILOG_ADATA) && > + if ((fields & XFS_ILOG_ADATA) && > ip->i_afp->if_bytes > 0) { > /* > * Round i_bytes up to a word boundary. > @@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = (unsigned)data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_ADATA; > + fields &= ~XFS_ILOG_ADATA; > } > break; > default: > ASSERT(0); > break; > } > + > + /* reflect the logged fields back in ili_format_fields */ > + iip->ili_format_fields = fields; > } > > /* > @@ -359,12 +422,16 @@ xfs_inode_item_format( > if (XFS_IFORK_Q(ip)) { > xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp); > } else { > - iip->ili_fields &= > + iip->ili_format_fields &= > ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT); > } > > /* update the format with the exact fields we actually logged */ > - ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP); > + ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP); > + > + /* clear any fields we didn't log from ili_fields */ > + iip->ili_fields &= iip->ili_format_fields; > + iip->ili_format_fields = 0; > } > > /* > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h > index 488d812..43a9e1c 100644 > --- a/fs/xfs/xfs_inode_item.h > +++ b/fs/xfs/xfs_inode_item.h > @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item { > unsigned short ili_logged; /* flushed logged data */ > unsigned int ili_last_fields; /* fields when flushed */ > unsigned int ili_fields; /* fields to be logged */ > + unsigned int ili_format_fields; /* combined log fields */ > } xfs_inode_log_item_t; > > static inline int xfs_inode_clean(xfs_inode_t *ip) > diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c > index 17280cd..77f1e8a 100644 > --- a/fs/xfs/xfs_trans_inode.c > +++ b/fs/xfs/xfs_trans_inode.c > @@ -123,13 +123,5 @@ xfs_trans_log_inode( > tp->t_flags |= XFS_TRANS_DIRTY; > ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY; > > - /* > - * Always OR in the bits from the ili_last_fields field. > - * This is to coordinate with the xfs_iflush() and xfs_iflush_done() > - * routines in the eventual clearing of the ili_fields bits. > - * See the big comment in xfs_iflush() for an explanation of > - * this coordination mechanism. > - */ > - flags |= ip->i_itemp->ili_last_fields; > ip->i_itemp->ili_fields |= flags; > } > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-28 15:11 ` Sage Weil @ 2015-08-28 22:04 ` Dave Chinner 2015-08-31 2:21 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2015-08-28 22:04 UTC (permalink / raw) To: Sage Weil; +Cc: xfs On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote: > Hi Dave, > > On Fri, 28 Aug 2015, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > Sage Weil reported that a ceph test workload was writing to the > > log on every fdatasync during an overwrite workload. Event tracing > > showed that the only metadata modification being made was the > > timestamp updates during the write(2) syscall, but fdatasync(2) > > is supposed to ignore them. The key observation was that the > > transactions in the log all looked like this: [....] > > --- > > Version 2: > > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed > > when committing the patch locally the first time. > > I gave this a go on my machine but I'm still seeing the same symptom. OK, that implies the inode buffer has not been submitted for IO and so the inode is being held in "flushing" state for an extended period of time. > I've gathered the trace, strace, and other useful bits at > > http://newdream.net/~sage/drop/rocksdb.2/ > > This is pretty easy to reproduce with the ceph_test_keyvaluedb binary > (built on fedora 22), also in that dir: > > rm -rf kv_test_temp_dir/ > ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1 I'll have a deeper look. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-28 22:04 ` Dave Chinner @ 2015-08-31 2:21 ` Dave Chinner 2015-08-31 8:48 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2015-08-31 2:21 UTC (permalink / raw) To: Sage Weil; +Cc: xfs On Sat, Aug 29, 2015 at 08:04:54AM +1000, Dave Chinner wrote: > On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote: > > Hi Dave, > > > > On Fri, 28 Aug 2015, Dave Chinner wrote: > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > Sage Weil reported that a ceph test workload was writing to the > > > log on every fdatasync during an overwrite workload. Event tracing > > > showed that the only metadata modification being made was the > > > timestamp updates during the write(2) syscall, but fdatasync(2) > > > is supposed to ignore them. The key observation was that the > > > transactions in the log all looked like this: > [....] > > > > --- > > > Version 2: > > > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed > > > when committing the patch locally the first time. > > > > I gave this a go on my machine but I'm still seeing the same symptom. > > OK, that implies the inode buffer has not been submitted for IO and > so the inode is being held in "flushing" state for an extended > period of time. > > > I've gathered the trace, strace, and other useful bits at > > > > http://newdream.net/~sage/drop/rocksdb.2/ > > > > This is pretty easy to reproduce with the ceph_test_keyvaluedb binary > > (built on fedora 22), also in that dir: > > > > rm -rf kv_test_temp_dir/ > > ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1 > > I'll have a deeper look. Ok, I was assuming this is a longer running test than it is - it only takes about 2300ms to run on my test box. Hence the problem is that the inode has never been flushed out, and so it's being relogged in full on every fdatasync() operation. Another, similar change is necessary to track the changes since the last time the inode was flushed to the log. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-31 2:21 ` Dave Chinner @ 2015-08-31 8:48 ` Dave Chinner 2015-08-31 12:40 ` Sage Weil 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2015-08-31 8:48 UTC (permalink / raw) To: Sage Weil; +Cc: xfs On Mon, Aug 31, 2015 at 12:21:55PM +1000, Dave Chinner wrote: > On Sat, Aug 29, 2015 at 08:04:54AM +1000, Dave Chinner wrote: > > On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote: > > > Hi Dave, > > > > > > On Fri, 28 Aug 2015, Dave Chinner wrote: > > > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > Sage Weil reported that a ceph test workload was writing to the > > > > log on every fdatasync during an overwrite workload. Event tracing > > > > showed that the only metadata modification being made was the > > > > timestamp updates during the write(2) syscall, but fdatasync(2) > > > > is supposed to ignore them. The key observation was that the > > > > transactions in the log all looked like this: > > [....] > > > > > > --- > > > > Version 2: > > > > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed > > > > when committing the patch locally the first time. > > > > > > I gave this a go on my machine but I'm still seeing the same symptom. > > > > OK, that implies the inode buffer has not been submitted for IO and > > so the inode is being held in "flushing" state for an extended > > period of time. > > > > > I've gathered the trace, strace, and other useful bits at > > > > > > http://newdream.net/~sage/drop/rocksdb.2/ > > > > > > This is pretty easy to reproduce with the ceph_test_keyvaluedb binary > > > (built on fedora 22), also in that dir: > > > > > > rm -rf kv_test_temp_dir/ > > > ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1 > > > > I'll have a deeper look. > > Ok, I was assuming this is a longer running test than it is - it > only takes about 2300ms to run on my test box. Hence the problem is > that the inode has never been flushed out, and so it's being > relogged in full on every fdatasync() operation. Another, similar > change is necessary to track the changes since the last time the > inode was flushed to the log. This is more complex than I expected, and I ended up chasing down a rathole because performance is now *highly variable*. Before making these changes, the performance was consistent at 2.3-2.6s for a test run on my test machine. It's on SSDs, with a 1GB BBWC in front of them, so there's minimal IO latency, and hence the average commit latency was around 2.5ms for the original (bad) fdatasync behaviour. priming now doing small writes 1024 commits in 2.582939, avg latency 0.0025224 [ OK ] KeyValueDB/KVTest.BenchCommit/1 (2600 ms) [----------] 1 test from KeyValueDB/KVTest (2600 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (2600 ms total) [ PASSED ] 1 test. Now I see the test often complete in about 1.1s, or ~1.1ms per write/fdatasync operation. That's definitely an improvement, priming now doing small writes 1024 commits in 1.133530, avg latency 0.00110696 [ OK ] KeyValueDB/KVTest.BenchCommit/1 (1148 ms) [----------] 1 test from KeyValueDB/KVTest (1148 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1149 ms total) [ PASSED ] 1 test. but every 4-5 test runs I'd see a test run that would take 8s to complete: priming now doing small writes 1024 commits in 8.373202, avg latency 0.00817696 [ OK ] KeyValueDB/KVTest.BenchCommit/1 (8429 ms) [----------] 1 test from KeyValueDB/KVTest (8430 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (8430 ms total) [ PASSED ] 1 test. This is completely reproducable, and very annoying as it doesn't happen without these fdatasync optimisation patches. After taking a tangent to find a tracepoint regression that was getting in my way, I found that there was a significant pause between the inode locking calls within xfs_file_fsync and the inode locking calls on the buffered write. Roughly 8ms, in fact, on almost every call. After adding a couple more test trace points into the XFS fsync code, it turns out that a hardware cache flush is causing the delay. That is, because we aren't doing log writes that trigger cache flushes and FUA writes, we have to issue a blkdev_issue_flush() call from xfs_file_fsync and that is taking 8ms to complete. IOWs, on my hardware, making fdatasync work correctly completely destroys workload determinism. The full fsync adds 1ms to each operation, but it is constant and unchanging. The fdatasyncs fast most of the time, but every so often the cache flush blows performance all to hell and so over time (e.g. 100 test runs) fdatasync is little faster than using a full fsync. So, that's the rat hole I've been down today - once again proving that Modern Storage Hardware Still Sucks. Anyway, I'm going to do correctness testing now that I know the performance problems are hardware and not software..... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-31 8:48 ` Dave Chinner @ 2015-08-31 12:40 ` Sage Weil 2015-08-31 21:51 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Sage Weil @ 2015-08-31 12:40 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Mon, 31 Aug 2015, Dave Chinner wrote: > On Mon, Aug 31, 2015 at 12:21:55PM +1000, Dave Chinner wrote: > > On Sat, Aug 29, 2015 at 08:04:54AM +1000, Dave Chinner wrote: > > > On Fri, Aug 28, 2015 at 08:11:20AM -0700, Sage Weil wrote: > > > > Hi Dave, > > > > > > > > On Fri, 28 Aug 2015, Dave Chinner wrote: > > > > > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > Sage Weil reported that a ceph test workload was writing to the > > > > > log on every fdatasync during an overwrite workload. Event tracing > > > > > showed that the only metadata modification being made was the > > > > > timestamp updates during the write(2) syscall, but fdatasync(2) > > > > > is supposed to ignore them. The key observation was that the > > > > > transactions in the log all looked like this: > > > [....] > > > > > > > > --- > > > > > Version 2: > > > > > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed > > > > > when committing the patch locally the first time. > > > > > > > > I gave this a go on my machine but I'm still seeing the same symptom. > > > > > > OK, that implies the inode buffer has not been submitted for IO and > > > so the inode is being held in "flushing" state for an extended > > > period of time. > > > > > > > I've gathered the trace, strace, and other useful bits at > > > > > > > > http://newdream.net/~sage/drop/rocksdb.2/ > > > > > > > > This is pretty easy to reproduce with the ceph_test_keyvaluedb binary > > > > (built on fedora 22), also in that dir: > > > > > > > > rm -rf kv_test_temp_dir/ > > > > ./ceph_test_keyvaluedb --gtest_filter=KeyValueDB/KVTest.BenchCommit/1 > > > > > > I'll have a deeper look. > > > > Ok, I was assuming this is a longer running test than it is - it > > only takes about 2300ms to run on my test box. Hence the problem is > > that the inode has never been flushed out, and so it's being > > relogged in full on every fdatasync() operation. Another, similar > > change is necessary to track the changes since the last time the > > inode was flushed to the log. > > This is more complex than I expected, and I ended up chasing down a > rathole because performance is now *highly variable*. > > Before making these changes, the performance was consistent at > 2.3-2.6s for a test run on my test machine. It's on SSDs, with a 1GB > BBWC in front of them, so there's minimal IO latency, and hence the > average commit latency was around 2.5ms for the original (bad) > fdatasync behaviour. > > priming > now doing small writes > 1024 commits in 2.582939, avg latency 0.0025224 > [ OK ] KeyValueDB/KVTest.BenchCommit/1 (2600 ms) > [----------] 1 test from KeyValueDB/KVTest (2600 ms total) > > [----------] Global test environment tear-down > [==========] 1 test from 1 test case ran. (2600 ms total) > [ PASSED ] 1 test. > > > Now I see the test often complete in about 1.1s, or ~1.1ms per > write/fdatasync operation. That's definitely an improvement, > > priming > now doing small writes > 1024 commits in 1.133530, avg latency 0.00110696 > [ OK ] KeyValueDB/KVTest.BenchCommit/1 (1148 ms) > [----------] 1 test from KeyValueDB/KVTest (1148 ms total) > > [----------] Global test environment tear-down > [==========] 1 test from 1 test case ran. (1149 ms total) > [ PASSED ] 1 test. > > but every 4-5 test runs I'd see a test run that would take 8s to > complete: > > priming > now doing small writes > 1024 commits in 8.373202, avg latency 0.00817696 > [ OK ] KeyValueDB/KVTest.BenchCommit/1 (8429 ms) > [----------] 1 test from KeyValueDB/KVTest (8430 ms total) > > [----------] Global test environment tear-down > [==========] 1 test from 1 test case ran. (8430 ms total) > [ PASSED ] 1 test. > > This is completely reproducable, and very annoying as it doesn't > happen without these fdatasync optimisation patches. Note that rocksdb itself might be doing something nondeterministic too (like writing out 4mb sst files from the priming stage). Though it sounds like that isn't (the only thing) happening here... and it certainly wouldn't account for 7 seconds. > After taking a tangent to find a tracepoint regression that was > getting in my way, I found that there was a significant pause > between the inode locking calls within xfs_file_fsync and the inode > locking calls on the buffered write. Roughly 8ms, in fact, on almost > every call. After adding a couple more test trace points into the > XFS fsync code, it turns out that a hardware cache flush is causing > the delay. That is, because we aren't doing log writes that trigger > cache flushes and FUA writes, we have to issue a > blkdev_issue_flush() call from xfs_file_fsync and that is taking 8ms > to complete. This is where my understanding of block layer flushing really breaks down, but in both cases we're issues flush requests to the hardware, right? Is the difference that the log write is a FUA flush request with data, and blkdev_issue_flush() issues a flush request without associated data? > IOWs, on my hardware, making fdatasync work correctly completely > destroys workload determinism. The full fsync adds 1ms to each > operation, but it is constant and unchanging. The fdatasyncs fast > most of the time, but every so often the cache flush blows > performance all to hell and so over time (e.g. 100 test runs) > fdatasync is little faster than using a full fsync. > > So, that's the rat hole I've been down today - once again proving > that Modern Storage Hardware Still Sucks. > > Anyway, I'm going to do correctness testing now that I know the > performance problems are hardware and not software..... Thanks, Dave! Sorry this snowballed... sage _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-31 12:40 ` Sage Weil @ 2015-08-31 21:51 ` Dave Chinner 0 siblings, 0 replies; 9+ messages in thread From: Dave Chinner @ 2015-08-31 21:51 UTC (permalink / raw) To: Sage Weil; +Cc: xfs On Mon, Aug 31, 2015 at 05:40:04AM -0700, Sage Weil wrote: > On Mon, 31 Aug 2015, Dave Chinner wrote: > > After taking a tangent to find a tracepoint regression that was > > getting in my way, I found that there was a significant pause > > between the inode locking calls within xfs_file_fsync and the inode > > locking calls on the buffered write. Roughly 8ms, in fact, on almost > > every call. After adding a couple more test trace points into the > > XFS fsync code, it turns out that a hardware cache flush is causing > > the delay. That is, because we aren't doing log writes that trigger > > cache flushes and FUA writes, we have to issue a > > blkdev_issue_flush() call from xfs_file_fsync and that is taking 8ms > > to complete. > > This is where my understanding of block layer flushing really breaks down, > but in both cases we're issues flush requests to the hardware, right? Is > the difference that the log write is a FUA flush request with data, and > blkdev_issue_flush() issues a flush request without associated data? Pretty much, though th elog write also does a cache flush before the FUA write. i.e. The log writes consist of a bio with data issued via: submit_bio(REQ_FUA | REQ_FLUSH | WRITE_SYNC, bio); blkdev_issue_flush consists of an empty bio issued via: submit_bio(REQ_FLUSH | WRITE_SYNC, bio); So from a block layer and filesystem point of view there is little difference, and the only difference at the SCSI layer is the WRITE w/ FUA that is issued after the cache flush in the log write case (see https://lwn.net/Articles/400541/ fo a bit more background). I haven't looked any deeper than this so far - I don't have time right now to do so... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2] xfs: timestamp updates cause excessive fdatasync log traffic 2015-08-28 4:32 ` [PATCH V2] " Dave Chinner 2015-08-28 15:11 ` Sage Weil @ 2015-09-08 14:45 ` Brian Foster 1 sibling, 0 replies; 9+ messages in thread From: Brian Foster @ 2015-09-08 14:45 UTC (permalink / raw) To: Dave Chinner; +Cc: sage, xfs On Fri, Aug 28, 2015 at 02:32:53PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > Sage Weil reported that a ceph test workload was writing to the > log on every fdatasync during an overwrite workload. Event tracing > showed that the only metadata modification being made was the > timestamp updates during the write(2) syscall, but fdatasync(2) > is supposed to ignore them. The key observation was that the > transactions in the log all looked like this: > > INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 > > And contained a flags field of 0x45 or 0x85, and had data and > attribute forks following the inode core. This means that the > timestamp updates were triggering dirty relogging of previously > logged parts of the inode that hadn't yet been flushed back to > disk. > > This is caused by xfs_trans_log_inode(), where it folds the dirty > fields that have previously been logged back into the current > transaction dirty fields from the inode item ili_last_fields. The > issue is that ili_last_fields only contains a non-zero value when > the inode is in the process of being flushed. The typical state > progression is this, using a core field update as the modification > occuring: > > state ili_fields ili_last_fields > clean 0 0 > modified, logged XFS_ILOG_CORE 0 > flushed to buffer 0 XFS_ILOG_CORE > <buffer submitted> > buffer IO done (clean) 0 0 > > However, if we run a new transaction after it has been flushed to > buffer but before the buffer IO is done, we don't know if the > modifications in the inode buffer (i.e. what is in ili_last_fields) > will reach the disk before the new transaction reaches the log. > Hence to keep transactional ordering correct in recovery, we need to > ensure the new transaction re-logs the modifications that are being > flushed to disk. > > By relogging, we ensure that if the transaction makes it to disk and > the buffer doesn't, then recovery replays all the changes upto that > point correctly. If the transaction does not make it disk, but the > buffer does, then recovery will see that the inode on disk is more > recent than in the log and won't overwrite it with changes that it > already contains. > > The upshot of this is that while the inode buffer sits in memory > with the inode in the "flushed" state, fdatasync is going to see > the relogged state in the ili_fields. i.e: > > What is happening here is this: > > state ili_fields ili_last_fields > clean 0 0 > modified, logged CORE 0 > flushed to buffer 0 CORE > <write> > timestamp update TIMESTAMP CORE > <xfs_trans_inode_log pulls in ili_last_fields> > CORE|TIMESTAMP CORE > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > > <buffer submittted> > buffer IO done (clean) CORE|TIMESTAMP 0 > ..... Shouldn't both states be cleared here? If so, I don't see how the problem persists unless the race is perpetual... > <write> > timestamp update CORE|TIMESTAMP 0 > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > ..... > <write> > timestamp update CORE|TIMESTAMP 0 > <fdatasync> > sees field other than TIMESTAMP in ili_fields, > triggers xfs_log_force_lsn to flush inode > > So, essentially, once a race condition on the buffer flush has > occurred, the condition is not cleared until the inode is flushed to > the buffer again and it is written without racing against the inode > being dirtied again. > Ok, so we would have to continuously race between the buffer I/O and inode logging events for this behavior to continue. How likely is that to happen on a constant workload? I ask because I see Sage had pointed out that this might not address his originally reported problem, so I'm wondering whether we still want to pursue this change. > The behaviour we really want here is to capture the timestamp > update transactionally, but not trigger fdatasync because of the > relogged fields that haven't been modified since the inode was > flushed to the buffer. We still need to relog them, but we don't > need to force the log in the fdatasync case. > > To do this, don't fold the ili_last_fields value into ili_fields > when logging the inode. Instead, fold it into the fields that get > logged during formatting of the inode item. This means that we will > stop logging those fields the moment we know that there is a more > recent version of the inode on disk than we have in the log and so > we don't need to log those fields anymore as the next transaction on > disk doesn't need to replay them. > > Doing this separates changes that are in memory but are not being > flushed from those that have been flushed. Hence we can now look at > ili_fields and hence see what fields have been modified since the > last flush, and hence whether fdatasync needs to force the log or > not. If non-timestamp changes have been made since the inode was > flushed to the backing buffer, then fdatasync will do exactly the > right thing (i.e. force the log). > > Reported-by: Sage Weil <sage@redhat.com> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > Version 2: > - include the hunk from fs/xfs/xfs_trans_inode.c that I missed > when committing the patch locally the first time. > > fs/xfs/xfs_inode_item.c | 161 +++++++++++++++++++++++++++++++++-------------- > fs/xfs/xfs_inode_item.h | 1 + > fs/xfs/xfs_trans_inode.c | 8 --- > 3 files changed, 115 insertions(+), 55 deletions(-) > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 62bd80f..be04eb2 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -29,6 +29,63 @@ > #include "xfs_trans_priv.h" > #include "xfs_log.h" > > +/* > + * Notes on ili_fields, ili_format_fields and ili_last_fields. > + * > + * ili_fields contains the flags that reflect the current changes that are in > + * memory and have been logged, but have not been flushed to the backing buffer > + * for writeback yet. When a transaction is done, the fields that are modified > + * are added to ili_fields and all those fields are logged. This results in > + * repeated transactions correctly relogging all the other changes in memory and > + * allows the inode to be re-ordered (moved forward) in the AIL safely. > + * ili_fields is copied to ili_last_fields when the inode is flushed to the > + * backing buffer and is then cleared, indicating that the inode in memory is > + * now clean from a transactional change point of view and does not need to > + * relog those fields anymore on tranaction commit. > + * > + * ili_last_fields, therefore, is only set while there is an unresolved inode > + * flush being done (i.e. flush lock is held). We need to keep this state > + * available to avoid a transaction recovery vs inode buffer IO completion race > + * if we crash. If the inode is logged again makes it to disk before the inode ^ and > + * buffer IO complete, log recovery will replay the latest inode transaction and completes > + * can lose the changes that were in the inode buffer. Hence we still need to > + * log the changes that have been flushed so that transactions issued during the > + * inode buffer write still contain the modifications being flushed. Once the > + * inode buffer IO completes, we no longer need to log those changes as the > + * inode on disk the same as the inode in memory apart from the changes made ^ is > + * since the inode was flushed (i.e. those recorded in ili_fields). > + * > + * ili_format_fields is only used during transaction commit. It is the > + * aggregation of ili_fields and ili_last_fields, as sampled during the sizing > + * of the inode item to be logged. We need this field definition to be constant > + * between sizing and formatting, but inode buffer IO can complete > + * asynchronously with transaction commit and so we must only read it once so > + * that the different stages of the item formatting work correctly. We don't >From the comment alone, I'm a little confused as to whether this means we must combine and sample ili_format_fields only once, or just ili_last_fields as aggregrated into the former (and ili_format_fields is safe for the remainder of the transaction sizing/format sequence). > + * care about the async completion clearing ili_last_fields after we've sampled > + * it - if we log too much then so be it - we won't log it next time. Once we've > + * formatted the inode item, we need to propagate the ili_format_fields value to > + * the on-disk inode log item format field, and then use it to clear all the > + * fields that we marked for logging but were not dirty from ili_fields. > + * > + * This separation of changes allows us to accurately determine what fields in > + * the inode have changed since it was last flushed to disk. This is important > + * for fdatasync() performance as there are certain fields that, if modified, > + * should not trigger log forces because they contain metadata that fdatasync() > + * does not need to guarantee is safe on disk. This is, currently, just > + * XFS_ILOG_TIMESTAMP, though may be extended in future to cover other metadata. > + * > + * If we simply fold ili_last_fields back into ili_fields when we log the inode > + * in a transaction, then if we have a flush/modification race it results in > + * every timestamp update causing fdatasync to flush the log because ili_fields > + * has fields other than XFS_ILOG_TIMESTAMP set in it. This behaviour will > + * persist until the inode is next flushed to it's backing buffer and that > + * buffer is written and completed without another flush/modification race > + * occuring. Hence we keep the change state separate and only combine them when > + * formatting the inode into the log. > + * Thanks for the big comment. I think the above two paragraphs could probably be a bit more generic/succinct, however: "This separation of changes is necessary because certain codepaths require precise knowledge about what has been dirtied in the inode since the last transaction commit. For example, fdatasync() is allowed to skip log flushes in the event that only XFS_ILOG_TIMESTAMP is specified in ili_fields. Repopulating ili_fields from ili_last_fields at log time, as was previously done, pollutes this information and can defeat the optimization." > + * For more information, there's another big comment in xfs_iflush_int() about > + * this flush/modification race condition. > + */ > > kmem_zone_t *xfs_ili_zone; /* inode log item zone */ > > @@ -40,6 +97,7 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip) > STATIC void > xfs_inode_item_data_fork_size( > struct xfs_inode_log_item *iip, > + unsigned int fields, > int *nvecs, > int *nbytes) > { > @@ -47,7 +105,7 @@ xfs_inode_item_data_fork_size( > > switch (ip->i_d.di_format) { > case XFS_DINODE_FMT_EXTENTS: > - if ((iip->ili_fields & XFS_ILOG_DEXT) && > + if ((fields & XFS_ILOG_DEXT) && > ip->i_d.di_nextents > 0 && > ip->i_df.if_bytes > 0) { > /* worst case, doesn't subtract delalloc extents */ > @@ -56,14 +114,14 @@ xfs_inode_item_data_fork_size( > } > break; > case XFS_DINODE_FMT_BTREE: > - if ((iip->ili_fields & XFS_ILOG_DBROOT) && > + if ((fields & XFS_ILOG_DBROOT) && > ip->i_df.if_broot_bytes > 0) { > *nbytes += ip->i_df.if_broot_bytes; > *nvecs += 1; > } > break; > case XFS_DINODE_FMT_LOCAL: > - if ((iip->ili_fields & XFS_ILOG_DDATA) && > + if ((fields & XFS_ILOG_DDATA) && > ip->i_df.if_bytes > 0) { > *nbytes += roundup(ip->i_df.if_bytes, 4); > *nvecs += 1; > @@ -82,6 +140,7 @@ xfs_inode_item_data_fork_size( > STATIC void > xfs_inode_item_attr_fork_size( > struct xfs_inode_log_item *iip, > + unsigned int fields, > int *nvecs, > int *nbytes) > { > @@ -89,7 +148,7 @@ xfs_inode_item_attr_fork_size( > > switch (ip->i_d.di_aformat) { > case XFS_DINODE_FMT_EXTENTS: > - if ((iip->ili_fields & XFS_ILOG_AEXT) && > + if ((fields & XFS_ILOG_AEXT) && > ip->i_d.di_anextents > 0 && > ip->i_afp->if_bytes > 0) { > /* worst case, doesn't subtract unused space */ > @@ -98,14 +157,14 @@ xfs_inode_item_attr_fork_size( > } > break; > case XFS_DINODE_FMT_BTREE: > - if ((iip->ili_fields & XFS_ILOG_ABROOT) && > + if ((fields & XFS_ILOG_ABROOT) && > ip->i_afp->if_broot_bytes > 0) { > *nbytes += ip->i_afp->if_broot_bytes; > *nvecs += 1; > } > break; > case XFS_DINODE_FMT_LOCAL: > - if ((iip->ili_fields & XFS_ILOG_ADATA) && > + if ((fields & XFS_ILOG_ADATA) && > ip->i_afp->if_bytes > 0) { > *nbytes += roundup(ip->i_afp->if_bytes, 4); > *nvecs += 1; > @@ -133,13 +192,17 @@ xfs_inode_item_size( > struct xfs_inode_log_item *iip = INODE_ITEM(lip); > struct xfs_inode *ip = iip->ili_inode; > > + iip->ili_format_fields = iip->ili_fields | iip->ili_last_fields; > + > *nvecs += 2; > *nbytes += sizeof(struct xfs_inode_log_format) + > xfs_icdinode_size(ip->i_d.di_version); > > - xfs_inode_item_data_fork_size(iip, nvecs, nbytes); > + xfs_inode_item_data_fork_size(iip, iip->ili_format_fields, > + nvecs, nbytes); > if (XFS_IFORK_Q(ip)) > - xfs_inode_item_attr_fork_size(iip, nvecs, nbytes); > + xfs_inode_item_attr_fork_size(iip, iip->ili_format_fields, > + nvecs, nbytes); We aggregate ili_format_fields and pass the result as a parameter to the sizing helpers, but we just use ili_format_fields directly from the subsequent formatting helpers. The latter seems more clean to me since we pass iip everywhere already. Could we use that same pattern here as well? I suppose this also answers my question above in that ili_last_fields is sampled once and ili_format_fields used thereafter in the commit sequence. Brian > } > > STATIC void > @@ -151,14 +214,14 @@ xfs_inode_item_format_data_fork( > { > struct xfs_inode *ip = iip->ili_inode; > size_t data_bytes; > + unsigned int fields = iip->ili_format_fields; > > switch (ip->i_d.di_format) { > case XFS_DINODE_FMT_EXTENTS: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > > - if ((iip->ili_fields & XFS_ILOG_DEXT) && > + if ((fields & XFS_ILOG_DEXT) && > ip->i_d.di_nextents > 0 && > ip->i_df.if_bytes > 0) { > struct xfs_bmbt_rec *p; > @@ -175,15 +238,14 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_DEXT; > + fields &= ~XFS_ILOG_DEXT; > } > break; > case XFS_DINODE_FMT_BTREE: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DEXT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > > - if ((iip->ili_fields & XFS_ILOG_DBROOT) && > + if ((fields & XFS_ILOG_DBROOT) && > ip->i_df.if_broot_bytes > 0) { > ASSERT(ip->i_df.if_broot != NULL); > xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IBROOT, > @@ -192,16 +254,15 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = ip->i_df.if_broot_bytes; > ilf->ilf_size++; > } else { > - ASSERT(!(iip->ili_fields & > - XFS_ILOG_DBROOT)); > - iip->ili_fields &= ~XFS_ILOG_DBROOT; > + ASSERT(!(fields & XFS_ILOG_DBROOT)); > + fields &= ~XFS_ILOG_DBROOT; > } > break; > case XFS_DINODE_FMT_LOCAL: > - iip->ili_fields &= > - ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | > - XFS_ILOG_DEV | XFS_ILOG_UUID); > - if ((iip->ili_fields & XFS_ILOG_DDATA) && > + fields &= ~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | > + XFS_ILOG_DEV | XFS_ILOG_UUID); > + > + if ((fields & XFS_ILOG_DDATA) && > ip->i_df.if_bytes > 0) { > /* > * Round i_bytes up to a word boundary. > @@ -218,27 +279,28 @@ xfs_inode_item_format_data_fork( > ilf->ilf_dsize = (unsigned)data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_DDATA; > + fields &= ~XFS_ILOG_DDATA; > } > break; > case XFS_DINODE_FMT_DEV: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEXT | XFS_ILOG_UUID); > - if (iip->ili_fields & XFS_ILOG_DEV) > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEXT | XFS_ILOG_UUID); > + if (fields & XFS_ILOG_DEV) > ilf->ilf_u.ilfu_rdev = ip->i_df.if_u2.if_rdev; > break; > case XFS_DINODE_FMT_UUID: > - iip->ili_fields &= > - ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > - XFS_ILOG_DEXT | XFS_ILOG_DEV); > - if (iip->ili_fields & XFS_ILOG_UUID) > + fields &= ~(XFS_ILOG_DDATA | XFS_ILOG_DBROOT | > + XFS_ILOG_DEXT | XFS_ILOG_DEV); > + if (fields & XFS_ILOG_UUID) > ilf->ilf_u.ilfu_uuid = ip->i_df.if_u2.if_uuid; > break; > default: > ASSERT(0); > break; > } > + > + /* reflect the logged fields back in ili_format_fields */ > + iip->ili_format_fields = fields; > } > > STATIC void > @@ -250,13 +312,13 @@ xfs_inode_item_format_attr_fork( > { > struct xfs_inode *ip = iip->ili_inode; > size_t data_bytes; > + unsigned int fields = iip->ili_format_fields; > > switch (ip->i_d.di_aformat) { > case XFS_DINODE_FMT_EXTENTS: > - iip->ili_fields &= > - ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); > + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT); > > - if ((iip->ili_fields & XFS_ILOG_AEXT) && > + if ((fields & XFS_ILOG_AEXT) && > ip->i_d.di_anextents > 0 && > ip->i_afp->if_bytes > 0) { > struct xfs_bmbt_rec *p; > @@ -272,14 +334,13 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_AEXT; > + fields &= ~XFS_ILOG_AEXT; > } > break; > case XFS_DINODE_FMT_BTREE: > - iip->ili_fields &= > - ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); > + fields &= ~(XFS_ILOG_ADATA | XFS_ILOG_AEXT); > > - if ((iip->ili_fields & XFS_ILOG_ABROOT) && > + if ((fields & XFS_ILOG_ABROOT) && > ip->i_afp->if_broot_bytes > 0) { > ASSERT(ip->i_afp->if_broot != NULL); > > @@ -289,14 +350,13 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = ip->i_afp->if_broot_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_ABROOT; > + fields &= ~XFS_ILOG_ABROOT; > } > break; > case XFS_DINODE_FMT_LOCAL: > - iip->ili_fields &= > - ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); > + fields &= ~(XFS_ILOG_AEXT | XFS_ILOG_ABROOT); > > - if ((iip->ili_fields & XFS_ILOG_ADATA) && > + if ((fields & XFS_ILOG_ADATA) && > ip->i_afp->if_bytes > 0) { > /* > * Round i_bytes up to a word boundary. > @@ -313,13 +373,16 @@ xfs_inode_item_format_attr_fork( > ilf->ilf_asize = (unsigned)data_bytes; > ilf->ilf_size++; > } else { > - iip->ili_fields &= ~XFS_ILOG_ADATA; > + fields &= ~XFS_ILOG_ADATA; > } > break; > default: > ASSERT(0); > break; > } > + > + /* reflect the logged fields back in ili_format_fields */ > + iip->ili_format_fields = fields; > } > > /* > @@ -359,12 +422,16 @@ xfs_inode_item_format( > if (XFS_IFORK_Q(ip)) { > xfs_inode_item_format_attr_fork(iip, ilf, lv, &vecp); > } else { > - iip->ili_fields &= > + iip->ili_format_fields &= > ~(XFS_ILOG_ADATA | XFS_ILOG_ABROOT | XFS_ILOG_AEXT); > } > > /* update the format with the exact fields we actually logged */ > - ilf->ilf_fields |= (iip->ili_fields & ~XFS_ILOG_TIMESTAMP); > + ilf->ilf_fields |= (iip->ili_format_fields & ~XFS_ILOG_TIMESTAMP); > + > + /* clear any fields we didn't log from ili_fields */ > + iip->ili_fields &= iip->ili_format_fields; > + iip->ili_format_fields = 0; > } > > /* > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h > index 488d812..43a9e1c 100644 > --- a/fs/xfs/xfs_inode_item.h > +++ b/fs/xfs/xfs_inode_item.h > @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item { > unsigned short ili_logged; /* flushed logged data */ > unsigned int ili_last_fields; /* fields when flushed */ > unsigned int ili_fields; /* fields to be logged */ > + unsigned int ili_format_fields; /* combined log fields */ > } xfs_inode_log_item_t; > > static inline int xfs_inode_clean(xfs_inode_t *ip) > diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c > index 17280cd..77f1e8a 100644 > --- a/fs/xfs/xfs_trans_inode.c > +++ b/fs/xfs/xfs_trans_inode.c > @@ -123,13 +123,5 @@ xfs_trans_log_inode( > tp->t_flags |= XFS_TRANS_DIRTY; > ip->i_itemp->ili_item.li_desc->lid_flags |= XFS_LID_DIRTY; > > - /* > - * Always OR in the bits from the ili_last_fields field. > - * This is to coordinate with the xfs_iflush() and xfs_iflush_done() > - * routines in the eventual clearing of the ili_fields bits. > - * See the big comment in xfs_iflush() for an explanation of > - * this coordination mechanism. > - */ > - flags |= ip->i_itemp->ili_last_fields; > ip->i_itemp->ili_fields |= flags; > } > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2015-09-08 14:45 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-08-28 1:23 [PATCH] xfs: timestamp updates cause excessive fdatasync log traffic Dave Chinner 2015-08-28 4:32 ` [PATCH V2] " Dave Chinner 2015-08-28 15:11 ` Sage Weil 2015-08-28 22:04 ` Dave Chinner 2015-08-31 2:21 ` Dave Chinner 2015-08-31 8:48 ` Dave Chinner 2015-08-31 12:40 ` Sage Weil 2015-08-31 21:51 ` Dave Chinner 2015-09-08 14:45 ` Brian Foster
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox