* [PATCH 0/4] xfs: candidate fixes for 3.12-rc4
@ 2013-09-29 23:37 Dave Chinner
2013-09-29 23:37 ` [PATCH 1/4] xfs: lockdep needs to know about 3 dquot-deep nesting Dave Chinner
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Dave Chinner @ 2013-09-29 23:37 UTC (permalink / raw)
To: xfs
Hi folks,
The first two patches fix regressions and hence are -rc4 candidates -
the first patch fixes a lockdep false positive as a result of adding
the third quota inode, and the second fixes an on-disk format
interpretation problem in the v4 superblock dtype-in-dirent feature.
The third patch fixes the AGF vs AGI lock inversion problem I
reported a short while ago, and the last reduces the overhead of
atomic version change accounting. These two aren't critical fixes,
but would be nice to have out there for 3.12.
Cheers,
Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/4] xfs: lockdep needs to know about 3 dquot-deep nesting
2013-09-29 23:37 [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Dave Chinner
@ 2013-09-29 23:37 ` Dave Chinner
2013-09-30 21:19 ` Ben Myers
2013-09-29 23:37 ` [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers Dave Chinner
` (3 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2013-09-29 23:37 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Michael Semon reported that xfs/299 generated this lockdep warning:
=============================================
[ INFO: possible recursive locking detected ]
3.12.0-rc2+ #2 Not tainted
---------------------------------------------
touch/21072 is trying to acquire lock:
(&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
but task is already holding lock:
(&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&xfs_dquot_other_class);
lock(&xfs_dquot_other_class);
*** DEADLOCK ***
May be due to missing lock nesting notation
7 locks held by touch/21072:
#0: (sb_writers#10){++++.+}, at: [<c11185b6>] mnt_want_write+0x1e/0x3e
#1: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<c11078ee>] do_last+0x245/0xe40
#2: (sb_internal#2){++++.+}, at: [<c122c9e0>] xfs_trans_alloc+0x1f/0x35
#3: (&(&ip->i_lock)->mr_lock/1){+.+...}, at: [<c126cd1b>] xfs_ilock+0x100/0x1f1
#4: (&(&ip->i_lock)->mr_lock){++++-.}, at: [<c126cf52>] xfs_ilock_nowait+0x105/0x22f
#5: (&dqp->q_qlock){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
#6: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
The lockdep annotation for dquot lock nesting only understands
locking for user and "other" dquots, not user, group and project
dquots. Fix the annotations to match the locking hierarchy we now
have.
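The shape of the fix can be sketched in plain C (an illustrative userspace model only - `lockdep_set_class()` and `struct lock_class_key` are kernel API and are not reproduced here): each dquot type gets its own lock class, so lockdep no longer sees the user/group/project nesting as same-class recursion.

```c
#include <assert.h>

/*
 * Illustrative model of the lockdep fix, not kernel code. The patch
 * gives each dquot type (user/group/project) a distinct lock class
 * key via lockdep_set_class(); here the classes are modelled as
 * plain integers.
 */
enum dq_type { DQ_USER, DQ_GROUP, DQ_PROJ };

/* Mirrors the switch added in xfs_qm_dqread(): one class per type. */
static int dq_lock_class(enum dq_type type)
{
	switch (type) {
	case DQ_USER:
		return 0;	/* default lock class */
	case DQ_GROUP:
		return 1;	/* models xfs_dquot_group_class */
	case DQ_PROJ:
		return 2;	/* models xfs_dquot_project_class */
	}
	return -1;		/* unknown type: XFS would ASSERT() */
}
```

With three distinct classes, holding the user dquot lock while acquiring the group and project dquot locks involves three different classes as far as lockdep is concerned, matching the fixed user then group then project locking order.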
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_dquot.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 71520e6..1ee776d 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -64,7 +64,8 @@ int xfs_dqerror_mod = 33;
struct kmem_zone *xfs_qm_dqtrxzone;
static struct kmem_zone *xfs_qm_dqzone;
-static struct lock_class_key xfs_dquot_other_class;
+static struct lock_class_key xfs_dquot_group_class;
+static struct lock_class_key xfs_dquot_project_class;
/*
* This is called to free all the memory associated with a dquot
@@ -703,8 +704,20 @@ xfs_qm_dqread(
* Make sure group quotas have a different lock class than user
* quotas.
*/
- if (!(type & XFS_DQ_USER))
- lockdep_set_class(&dqp->q_qlock, &xfs_dquot_other_class);
+ switch (type) {
+ case XFS_DQ_USER:
+ /* uses the default lock class */
+ break;
+ case XFS_DQ_GROUP:
+ lockdep_set_class(&dqp->q_qlock, &xfs_dquot_group_class);
+ break;
+ case XFS_DQ_PROJ:
+ lockdep_set_class(&dqp->q_qlock, &xfs_dquot_project_class);
+ break;
+ default:
+ ASSERT(0);
+ break;
+ }
XFS_STATS_INC(xs_qm_dquot);
--
1.8.3.2
* [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers
2013-09-29 23:37 [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Dave Chinner
2013-09-29 23:37 ` [PATCH 1/4] xfs: lockdep needs to know about 3 dquot-deep nesting Dave Chinner
@ 2013-09-29 23:37 ` Dave Chinner
2013-09-30 22:02 ` Ben Myers
2013-10-18 16:56 ` Rich Johnston
2013-09-29 23:37 ` [PATCH 3/4] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering Dave Chinner
` (2 subsequent siblings)
4 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2013-09-29 23:37 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The determination of whether a directory entry contains a dtype
field originally was dependent on the filesystem having CRCs
enabled. This meant that the presence of the dtype field could be
determined by checking the directory block magic number rather than
doing a feature bit check. This was useful in that it meant that we
didn't need to pass a struct xfs_mount around to functions that
were already supplied with a directory block header.
Unfortunately, the introduction of dtype fields into the v4
structure via a feature bit meant this "use the directory block
magic number" method of discriminating the dirent entry sizes is
broken. Hence we need to convert the places that use magic number
checks to use feature bit checks so that they work correctly and not
by chance.
The current code works on v4 filesystems only because the dirent
size roundup covers the extra byte needed by the dtype field in the
places where this problem occurs.
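The "works by chance" observation can be checked with a rough userspace model of the dirent size calculation (the fixed-overhead and alignment constants below are assumptions approximating the on-disk layout, not the kernel macros):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Rough model of the dirent size roundup: a dirent carries an
 * 8-byte inumber and a 1-byte namelen before the name, a 2-byte tag
 * after it, plus an optional 1-byte dtype, all rounded up to 8-byte
 * alignment. The constants are illustrative assumptions.
 */
#define DIRENT_FIXED	11	/* inumber(8) + namelen(1) + tag(2) */
#define DIRENT_ALIGN	8

static int data_entsize(int namelen, bool has_dtype)
{
	int len = DIRENT_FIXED + namelen + (has_dtype ? 1 : 0);

	return (len + DIRENT_ALIGN - 1) & ~(DIRENT_ALIGN - 1);
}
```

For "." and ".." (namelen 1 and 2) the extra dtype byte is swallowed by the roundup, so the dot/dotdot offsets computed from the wrong feature state still lined up on v4+dtype filesystems; for longer names the two sizes diverge, which is why the magic-number check has to go.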
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_dir2_block.c | 6 +++---
fs/xfs/xfs_dir2_format.h | 51 +++++++++++++++++++----------------------------
fs/xfs/xfs_dir2_readdir.c | 4 ++--
fs/xfs/xfs_dir2_sf.c | 6 +++---
4 files changed, 28 insertions(+), 39 deletions(-)
diff --git a/fs/xfs/xfs_dir2_block.c b/fs/xfs/xfs_dir2_block.c
index 0957aa9..12dad18 100644
--- a/fs/xfs/xfs_dir2_block.c
+++ b/fs/xfs/xfs_dir2_block.c
@@ -1158,7 +1158,7 @@ xfs_dir2_sf_to_block(
/*
* Create entry for .
*/
- dep = xfs_dir3_data_dot_entry_p(hdr);
+ dep = xfs_dir3_data_dot_entry_p(mp, hdr);
dep->inumber = cpu_to_be64(dp->i_ino);
dep->namelen = 1;
dep->name[0] = '.';
@@ -1172,7 +1172,7 @@ xfs_dir2_sf_to_block(
/*
* Create entry for ..
*/
- dep = xfs_dir3_data_dotdot_entry_p(hdr);
+ dep = xfs_dir3_data_dotdot_entry_p(mp, hdr);
dep->inumber = cpu_to_be64(xfs_dir2_sf_get_parent_ino(sfp));
dep->namelen = 2;
dep->name[0] = dep->name[1] = '.';
@@ -1183,7 +1183,7 @@ xfs_dir2_sf_to_block(
blp[1].hashval = cpu_to_be32(xfs_dir_hash_dotdot);
blp[1].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
(char *)dep - (char *)hdr));
- offset = xfs_dir3_data_first_offset(hdr);
+ offset = xfs_dir3_data_first_offset(mp);
/*
* Loop over existing entries, stuff them in.
*/
diff --git a/fs/xfs/xfs_dir2_format.h b/fs/xfs/xfs_dir2_format.h
index a0961a6..9cf6738 100644
--- a/fs/xfs/xfs_dir2_format.h
+++ b/fs/xfs/xfs_dir2_format.h
@@ -497,69 +497,58 @@ xfs_dir3_data_unused_p(struct xfs_dir2_data_hdr *hdr)
/*
* Offsets of . and .. in data space (always block 0)
*
- * The macros are used for shortform directories as they have no headers to read
- * the magic number out of. Shortform directories need to know the size of the
- * data block header because the sfe embeds the block offset of the entry into
- * it so that it doesn't change when format conversion occurs. Bad Things Happen
- * if we don't follow this rule.
- *
* XXX: there is scope for significant optimisation of the logic here. Right
* now we are checking for "dir3 format" over and over again. Ideally we should
* only do it once for each operation.
*/
-#define XFS_DIR3_DATA_DOT_OFFSET(mp) \
- xfs_dir3_data_hdr_size(xfs_sb_version_hascrc(&(mp)->m_sb))
-#define XFS_DIR3_DATA_DOTDOT_OFFSET(mp) \
- (XFS_DIR3_DATA_DOT_OFFSET(mp) + xfs_dir3_data_entsize(mp, 1))
-#define XFS_DIR3_DATA_FIRST_OFFSET(mp) \
- (XFS_DIR3_DATA_DOTDOT_OFFSET(mp) + xfs_dir3_data_entsize(mp, 2))
-
static inline xfs_dir2_data_aoff_t
-xfs_dir3_data_dot_offset(struct xfs_dir2_data_hdr *hdr)
+xfs_dir3_data_dot_offset(struct xfs_mount *mp)
{
- return xfs_dir3_data_entry_offset(hdr);
+ return xfs_dir3_data_hdr_size(xfs_sb_version_hascrc(&mp->m_sb));
}
static inline xfs_dir2_data_aoff_t
-xfs_dir3_data_dotdot_offset(struct xfs_dir2_data_hdr *hdr)
+xfs_dir3_data_dotdot_offset(struct xfs_mount *mp)
{
- bool dir3 = hdr->magic == cpu_to_be32(XFS_DIR3_DATA_MAGIC) ||
- hdr->magic == cpu_to_be32(XFS_DIR3_BLOCK_MAGIC);
- return xfs_dir3_data_dot_offset(hdr) +
- __xfs_dir3_data_entsize(dir3, 1);
+ return xfs_dir3_data_dot_offset(mp) +
+ xfs_dir3_data_entsize(mp, 1);
}
static inline xfs_dir2_data_aoff_t
-xfs_dir3_data_first_offset(struct xfs_dir2_data_hdr *hdr)
+xfs_dir3_data_first_offset(struct xfs_mount *mp)
{
- bool dir3 = hdr->magic == cpu_to_be32(XFS_DIR3_DATA_MAGIC) ||
- hdr->magic == cpu_to_be32(XFS_DIR3_BLOCK_MAGIC);
- return xfs_dir3_data_dotdot_offset(hdr) +
- __xfs_dir3_data_entsize(dir3, 2);
+ return xfs_dir3_data_dotdot_offset(mp) +
+ xfs_dir3_data_entsize(mp, 2);
}
/*
* location of . and .. in data space (always block 0)
*/
static inline struct xfs_dir2_data_entry *
-xfs_dir3_data_dot_entry_p(struct xfs_dir2_data_hdr *hdr)
+xfs_dir3_data_dot_entry_p(
+ struct xfs_mount *mp,
+ struct xfs_dir2_data_hdr *hdr)
{
return (struct xfs_dir2_data_entry *)
- ((char *)hdr + xfs_dir3_data_dot_offset(hdr));
+ ((char *)hdr + xfs_dir3_data_dot_offset(mp));
}
static inline struct xfs_dir2_data_entry *
-xfs_dir3_data_dotdot_entry_p(struct xfs_dir2_data_hdr *hdr)
+xfs_dir3_data_dotdot_entry_p(
+ struct xfs_mount *mp,
+ struct xfs_dir2_data_hdr *hdr)
{
return (struct xfs_dir2_data_entry *)
- ((char *)hdr + xfs_dir3_data_dotdot_offset(hdr));
+ ((char *)hdr + xfs_dir3_data_dotdot_offset(mp));
}
static inline struct xfs_dir2_data_entry *
-xfs_dir3_data_first_entry_p(struct xfs_dir2_data_hdr *hdr)
+xfs_dir3_data_first_entry_p(
+ struct xfs_mount *mp,
+ struct xfs_dir2_data_hdr *hdr)
{
return (struct xfs_dir2_data_entry *)
- ((char *)hdr + xfs_dir3_data_first_offset(hdr));
+ ((char *)hdr + xfs_dir3_data_first_offset(mp));
}
/*
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 8993ec1..8f84153 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -119,9 +119,9 @@ xfs_dir2_sf_getdents(
* mp->m_dirdatablk.
*/
dot_offset = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
- XFS_DIR3_DATA_DOT_OFFSET(mp));
+ xfs_dir3_data_dot_offset(mp));
dotdot_offset = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
- XFS_DIR3_DATA_DOTDOT_OFFSET(mp));
+ xfs_dir3_data_dotdot_offset(mp));
/*
* Put . entry unless we're starting past it.
diff --git a/fs/xfs/xfs_dir2_sf.c b/fs/xfs/xfs_dir2_sf.c
index bb6e284..3ef6d40 100644
--- a/fs/xfs/xfs_dir2_sf.c
+++ b/fs/xfs/xfs_dir2_sf.c
@@ -557,7 +557,7 @@ xfs_dir2_sf_addname_hard(
* to insert the new entry.
* If it's going to end up at the end then oldsfep will point there.
*/
- for (offset = XFS_DIR3_DATA_FIRST_OFFSET(mp),
+ for (offset = xfs_dir3_data_first_offset(mp),
oldsfep = xfs_dir2_sf_firstentry(oldsfp),
add_datasize = xfs_dir3_data_entsize(mp, args->namelen),
eof = (char *)oldsfep == &buf[old_isize];
@@ -640,7 +640,7 @@ xfs_dir2_sf_addname_pick(
sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
size = xfs_dir3_data_entsize(mp, args->namelen);
- offset = XFS_DIR3_DATA_FIRST_OFFSET(mp);
+ offset = xfs_dir3_data_first_offset(mp);
sfep = xfs_dir2_sf_firstentry(sfp);
holefit = 0;
/*
@@ -713,7 +713,7 @@ xfs_dir2_sf_check(
mp = dp->i_mount;
sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
- offset = XFS_DIR3_DATA_FIRST_OFFSET(mp);
+ offset = xfs_dir3_data_first_offset(mp);
ino = xfs_dir2_sf_get_parent_ino(sfp);
i8count = ino > XFS_DIR2_MAX_SHORT_INUM;
--
1.8.3.2
* [PATCH 3/4] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
2013-09-29 23:37 [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Dave Chinner
2013-09-29 23:37 ` [PATCH 1/4] xfs: lockdep needs to know about 3 dquot-deep nesting Dave Chinner
2013-09-29 23:37 ` [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers Dave Chinner
@ 2013-09-29 23:37 ` Dave Chinner
2013-09-30 22:14 ` Ben Myers
2013-09-29 23:37 ` [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode Dave Chinner
2013-09-30 22:52 ` [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Ben Myers
4 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2013-09-29 23:37 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Removing an inode from the namespace involves removing the directory
entry and dropping the link count on the inode. Removing the
directory entry can result in locking an AGF (directory blocks were
freed) and removing a link count can result in placing the inode on
an unlinked list which results in locking an AGI.
The big problem here is that we have an ordering constraint on AGF
and AGI locking - inode allocation locks the AGI, then can allocate
a new extent for new inodes, locking the AGF after the AGI.
Similarly, freeing the inode removes the inode from the unlinked
list, requiring that we lock the AGI first, and then freeing the
inode can result in an inode chunk being freed and hence freeing
disk space requiring that we lock an AGF.
Hence the ordering that is imposed by other parts of the code is AGI
before AGF. This means we cannot remove the directory entry before
we drop the inode reference count and put it on the unlinked list as
this results in a lock order of AGF then AGI, and this can deadlock
against inode allocation and freeing. Therefore we must drop the
link counts before we remove the directory entry.
This is still safe from a transactional point of view - it is not
until we get to xfs_bmap_finish() that we have the possibility of
multiple transactions in this operation. Hence as long as we remove
the directory entry and drop the link count in the first transaction
of the remove operation, there are no transactional constraints on
the ordering here.
Change the ordering of the operations in the xfs_remove() function
to align the ordering of AGI and AGF locking to match that of the
rest of the code.
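The invariant being restored can be expressed as a tiny checker (illustrative userspace code, not kernel code): every acquisition sequence must take the AGI before the AGF.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the AG header lock-ordering rule: AGI first. */
enum ag_lock { AGI, AGF };

/* Returns true iff the sequence never acquires an AGI after an AGF. */
static bool order_ok(const enum ag_lock *seq, int n)
{
	bool agf_held = false;
	int i;

	for (i = 0; i < n; i++) {
		if (seq[i] == AGF)
			agf_held = true;
		else if (agf_held)
			return false;	/* AGF then AGI: inverted order */
	}
	return true;
}
```

The old xfs_remove() sequence (removename first, which can lock an AGF, then droplink onto the unlinked list, which locks an AGI) fails this check; the reordered sequence (droplink first, removename second) passes, matching the AGI-before-AGF order imposed by inode allocation and freeing.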
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_inode.c | 72 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 44 insertions(+), 28 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e3d7538..7a460d8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2370,6 +2370,33 @@ xfs_iunpin_wait(
__xfs_iunpin_wait(ip);
}
+/*
+ * Removing an inode from the namespace involves removing the directory entry
+ * and dropping the link count on the inode. Removing the directory entry can
+ * result in locking an AGF (directory blocks were freed) and removing a link
+ * count can result in placing the inode on an unlinked list which results in
+ * locking an AGI.
+ *
+ * The big problem here is that we have an ordering constraint on AGF and AGI
+ * locking - inode allocation locks the AGI, then can allocate a new extent for
+ * new inodes, locking the AGF after the AGI. Similarly, freeing the inode
+ * removes the inode from the unlinked list, requiring that we lock the AGI
+ * first, and then freeing the inode can result in an inode chunk being freed
+ * and hence freeing disk space requiring that we lock an AGF.
+ *
+ * Hence the ordering that is imposed by other parts of the code is AGI before
+ * AGF. This means we cannot remove the directory entry before we drop the inode
+ * reference count and put it on the unlinked list as this results in a lock
+ * order of AGF then AGI, and this can deadlock against inode allocation and
+ * freeing. Therefore we must drop the link counts before we remove the
+ * directory entry.
+ *
+ * This is still safe from a transactional point of view - it is not until we
+ * get to xfs_bmap_finish() that we have the possibility of multiple
+ * transactions in this operation. Hence as long as we remove the directory
+ * entry and drop the link count in the first transaction of the remove
+ * operation, there are no transactional constraints on the ordering here.
+ */
int
xfs_remove(
xfs_inode_t *dp,
@@ -2439,6 +2466,7 @@ xfs_remove(
/*
* If we're removing a directory perform some additional validation.
*/
+ cancel_flags |= XFS_TRANS_ABORT;
if (is_dir) {
ASSERT(ip->i_d.di_nlink >= 2);
if (ip->i_d.di_nlink != 2) {
@@ -2449,31 +2477,16 @@ xfs_remove(
error = XFS_ERROR(ENOTEMPTY);
goto out_trans_cancel;
}
- }
- xfs_bmap_init(&free_list, &first_block);
- error = xfs_dir_removename(tp, dp, name, ip->i_ino,
- &first_block, &free_list, resblks);
- if (error) {
- ASSERT(error != ENOENT);
- goto out_bmap_cancel;
- }
- xfs_trans_ichgtime(tp, dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
-
- if (is_dir) {
- /*
- * Drop the link from ip's "..".
- */
+ /* Drop the link from ip's "..". */
error = xfs_droplink(tp, dp);
if (error)
- goto out_bmap_cancel;
+ goto out_trans_cancel;
- /*
- * Drop the "." link from ip to self.
- */
+ /* Drop the "." link from ip to self. */
error = xfs_droplink(tp, ip);
if (error)
- goto out_bmap_cancel;
+ goto out_trans_cancel;
} else {
/*
* When removing a non-directory we need to log the parent
@@ -2482,20 +2495,24 @@ xfs_remove(
*/
xfs_trans_log_inode(tp, dp, XFS_ILOG_CORE);
}
+ xfs_trans_ichgtime(tp, dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
- /*
- * Drop the link from dp to ip.
- */
+ /* Drop the link from dp to ip. */
error = xfs_droplink(tp, ip);
if (error)
- goto out_bmap_cancel;
+ goto out_trans_cancel;
- /*
- * Determine if this is the last link while
- * we are in the transaction.
- */
+ /* Determine if this is the last link while the inode is locked */
link_zero = (ip->i_d.di_nlink == 0);
+ xfs_bmap_init(&free_list, &first_block);
+ error = xfs_dir_removename(tp, dp, name, ip->i_ino,
+ &first_block, &free_list, resblks);
+ if (error) {
+ ASSERT(error != ENOENT);
+ goto out_bmap_cancel;
+ }
+
/*
* If this is a synchronous mount, make sure that the
* remove transaction goes to disk before returning to
@@ -2525,7 +2542,6 @@ xfs_remove(
out_bmap_cancel:
xfs_bmap_cancel(&free_list);
- cancel_flags |= XFS_TRANS_ABORT;
out_trans_cancel:
xfs_trans_cancel(tp, cancel_flags);
std_return:
--
1.8.3.2
* [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode
2013-09-29 23:37 [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Dave Chinner
` (2 preceding siblings ...)
2013-09-29 23:37 ` [PATCH 3/4] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering Dave Chinner
@ 2013-09-29 23:37 ` Dave Chinner
2013-09-30 22:24 ` Eric Sandeen
2013-09-30 22:52 ` [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Ben Myers
4 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2013-09-29 23:37 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Michael L Semon reported that generic/069 runtime increased on v5
superblocks by 100% compared to v4 superblocks. His perf-based
analysis pointed directly at the timestamp updates being done by the
write path in this workload. The append writers are doing 4-byte
writes, so there are lots of timestamp updates occurring.
The thing is, they aren't being triggered by timestamp changes -
they are being triggered by the inode change counter needing to be
updated. That is, every write(2) system call needs to bump the inode
version count, and it does that through the timestamp update
mechanism. Hence test generic/069 runs 3 orders of magnitude more
timestamp update transactions on v5 filesystems due to the fact it
does a huge number of *4 byte* write(2) calls.
This isn't a real world scenario we really need to address - anyone
doing such sequential IO should be using fwrite(3), not write(2).
i.e. fwrite(3) buffers the writes in userspace to minimise the
number of write(2) syscalls, and the problem goes away.
However, there is a small change we can make to improve the
situation - removing the expensive lock operation on the change
counter update. All inode version counter changes in XFS occur
under the ip->i_ilock during a transaction, and therefore we
don't actually need the spin lock that provides exclusive access to
it through inc_inode_iversion().
Hence avoid the lock and just open code the increment ourselves when
logging the inode.
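The locking argument can be sketched in userspace with pthreads (the names and the pthread harness are illustrative; the kernel serializes on the ilock during a transaction, not on a pthread mutex): as long as every increment happens under the same outer lock, the plain ++ cannot race and no inner spinlock is needed.

```c
#include <assert.h>
#include <pthread.h>

/*
 * Illustrative model: i_version is only ever bumped while an outer
 * lock (standing in for the XFS ilock held during a transaction) is
 * held, so no extra spinlock around the increment is required.
 */
static pthread_mutex_t ilock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long i_version;		/* VFS change counter  */
static unsigned long long di_changecount;	/* logged on-disk copy */

static void log_inode_version(void)
{
	pthread_mutex_lock(&ilock);	/* models holding the ilock */
	di_changecount = ++i_version;	/* the open-coded increment */
	pthread_mutex_unlock(&ilock);
}

static void *writer(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < 10000; i++)
		log_inode_version();
	return NULL;
}
```

Two concurrent writers still produce exactly the expected final count, because the outer lock already provides the exclusion that inc_inode_iversion()'s internal locking would otherwise supply.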
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_trans_inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 53dfe46..e6601c1 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -118,8 +118,7 @@ xfs_trans_log_inode(
*/
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
- inode_inc_iversion(VFS_I(ip));
- ip->i_d.di_changecount = VFS_I(ip)->i_version;
+ ip->i_d.di_changecount = ++VFS_I(ip)->i_version;
flags |= XFS_ILOG_CORE;
}
--
1.8.3.2
* Re: [PATCH 1/4] xfs: lockdep needs to know about 3 dquot-deep nesting
2013-09-29 23:37 ` [PATCH 1/4] xfs: lockdep needs to know about 3 dquot-deep nesting Dave Chinner
@ 2013-09-30 21:19 ` Ben Myers
0 siblings, 0 replies; 17+ messages in thread
From: Ben Myers @ 2013-09-30 21:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Sep 30, 2013 at 09:37:03AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Michael Semon reported that xfs/299 generated this lockdep warning:
>
> =============================================
> [ INFO: possible recursive locking detected ]
> 3.12.0-rc2+ #2 Not tainted
> ---------------------------------------------
> touch/21072 is trying to acquire lock:
> (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
>
> but task is already holding lock:
> (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&xfs_dquot_other_class);
> lock(&xfs_dquot_other_class);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 7 locks held by touch/21072:
> #0: (sb_writers#10){++++.+}, at: [<c11185b6>] mnt_want_write+0x1e/0x3e
> #1: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<c11078ee>] do_last+0x245/0xe40
> #2: (sb_internal#2){++++.+}, at: [<c122c9e0>] xfs_trans_alloc+0x1f/0x35
> #3: (&(&ip->i_lock)->mr_lock/1){+.+...}, at: [<c126cd1b>] xfs_ilock+0x100/0x1f1
> #4: (&(&ip->i_lock)->mr_lock){++++-.}, at: [<c126cf52>] xfs_ilock_nowait+0x105/0x22f
> #5: (&dqp->q_qlock){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
> #6: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
>
> The lockdep annotation for dquot lock nesting only understands
> locking for user and "other" dquots, not user, group and project
> dquots. Fix the annotations to match the locking hierarchy we now
> have.
>
> Reported-by: Michael L. Semon <mlsemon35@gmail.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks good.
Reviewed-by: Ben Myers <bpm@sgi.com>
* Re: [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers
2013-09-29 23:37 ` [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers Dave Chinner
@ 2013-09-30 22:02 ` Ben Myers
2013-10-18 16:56 ` Rich Johnston
1 sibling, 0 replies; 17+ messages in thread
From: Ben Myers @ 2013-09-30 22:02 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Sep 30, 2013 at 09:37:04AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The determination of whether a directory entry contains a dtype
> field originally was dependent on the filesystem having CRCs
> enabled. This meant that the presence of the dtype field could be
> determined by checking the directory block magic number rather than
> doing a feature bit check. This was useful in that it meant that we
> didn't need to pass a struct xfs_mount around to functions that
> were already supplied with a directory block header.
>
> Unfortunately, the introduction of dtype fields into the v4
> structure via a feature bit meant this "use the directory block
> magic number" method of discriminating the dirent entry sizes is
> broken. Hence we need to convert the places that use magic number
> checks to use feature bit checks so that they work correctly and not
> by chance.
>
> The current code works on v4 filesystems only because the dirent
> size roundup covers the extra byte needed by the dtype field in the
> places where this problem occurs.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks fine to me.
Reviewed-by: Ben Myers <bpm@sgi.com>
* Re: [PATCH 3/4] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
2013-09-29 23:37 ` [PATCH 3/4] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering Dave Chinner
@ 2013-09-30 22:14 ` Ben Myers
2013-10-01 10:57 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Ben Myers @ 2013-09-30 22:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Hi Dave,
On Mon, Sep 30, 2013 at 09:37:05AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Removing an inode from the namespace involves removing the directory
> entry and dropping the link count on the inode. Removing the
> directory entry can result in locking an AGF (directory blocks were
> freed) and removing a link count can result in placing the inode on
> an unlinked list which results in locking an AGI.
>
> The big problem here is that we have an ordering constraint on AGF
> and AGI locking - inode allocation locks the AGI, then can allocate
> a new extent for new inodes, locking the AGF after the AGI.
> Similarly, freeing the inode removes the inode from the unlinked
> list, requiring that we lock the AGI first, and then freeing the
> inode can result in an inode chunk being freed and hence freeing
> disk space requiring that we lock an AGF.
>
> Hence the ordering that is imposed by other parts of the code is AGI
> before AGF. This means we cannot remove the directory entry before
> we drop the inode reference count and put it on the unlinked list as
> this results in a lock order of AGF then AGI, and this can deadlock
> against inode allocation and freeing. Therefore we must drop the
> link counts before we remove the directory entry.
>
> This is still safe from a transactional point of view - it is not
> until we get to xfs_bmap_finish() that we have the possibility of
> multiple transactions in this operation. Hence as long as we remove
> the directory entry and drop the link count in the first transaction
> of the remove operation, there are no transactional constraints on
> the ordering here.
>
> Change the ordering of the operations in the xfs_remove() function
> to align the ordering of AGI and AGF locking to match that of the
> rest of the code.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Hmmm.. I'm not quite comfortable with this one yet... It'll probably be
better after some more review. Did you happen to have a test case to go with
this?
Thanks,
Ben
* Re: [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode
2013-09-29 23:37 ` [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode Dave Chinner
@ 2013-09-30 22:24 ` Eric Sandeen
2013-09-30 22:39 ` Ben Myers
0 siblings, 1 reply; 17+ messages in thread
From: Eric Sandeen @ 2013-09-30 22:24 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 9/29/13 6:37 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Michael L Semon reported that generic/069 runtime increased on v5
> superblocks by 100% compared to v4 superblocks. His perf-based
> analysis pointed directly at the timestamp updates being done by the
> write path in this workload. The append writers are doing 4-byte
> writes, so there are lots of timestamp updates occurring.
>
> The thing is, they aren't being triggered by timestamp changes -
> they are being triggered by the inode change counter needing to be
> updated. That is, every write(2) system call needs to bump the inode
> version count, and it does that through the timestamp update
> mechanism. Hence test generic/069 runs 3 orders of magnitude more
> timestamp update transactions on v5 filesystems due to the fact it
> does a huge number of *4 byte* write(2) calls.
>
> This isn't a real world scenario we really need to address - anyone
> doing such sequential IO should be using fwrite(3), not write(2).
> i.e. fwrite(3) buffers the writes in userspace to minimise the
> number of write(2) syscalls, and the problem goes away.
>
> However, there is a small change we can make to improve the
> situation - removing the expensive lock operation on the change
> counter update. All inode version counter changes in XFS occur
> under the ip->i_ilock during a transaction, and therefore we
> don't actually need the spin lock that provides exclusive access to
> it through inc_inode_iversion().
>
> Hence avoid the lock and just open code the increment ourselves when
> logging the inode.
>
> Reported-by: Michael L. Semon <mlsemon35@gmail.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_trans_inode.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index 53dfe46..e6601c1 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -118,8 +118,7 @@ xfs_trans_log_inode(
> */
> if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
> IS_I_VERSION(VFS_I(ip))) {
> - inode_inc_iversion(VFS_I(ip));
> - ip->i_d.di_changecount = VFS_I(ip)->i_version;
comment about the reason for the open-code might be good, too?
otherwise some semantic patcher might "fix" it for you again later...
-Eric
> + ip->i_d.di_changecount = ++VFS_I(ip)->i_version;
> flags |= XFS_ILOG_CORE;
> }
>
>
* Re: [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode
2013-09-30 22:24 ` Eric Sandeen
@ 2013-09-30 22:39 ` Ben Myers
2013-10-01 11:12 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Ben Myers @ 2013-09-30 22:39 UTC (permalink / raw)
To: Eric Sandeen, Jean Noel Cordenner, Dave Chinner; +Cc: xfs
Hi Gents,
On Mon, Sep 30, 2013 at 05:24:54PM -0500, Eric Sandeen wrote:
> On 9/29/13 6:37 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Michael L Semon reported that generic/069 runtime increased on v5
> > superblocks by 100% compared to v4 superblocks. His perf-based
> > analysis pointed directly at the timestamp updates being done by the
> > write path in this workload. The append writers are doing 4-byte
> > writes, so there are lots of timestamp updates occurring.
> >
> > The thing is, they aren't being triggered by timestamp changes -
> > they are being triggered by the inode change counter needing to be
> > updated. That is, every write(2) system call needs to bump the inode
> > version count, and it does that through the timestamp update
> > mechanism. Hence for v5 filesystems, test generic/069 is running 3
> > orders of magnitude more timestamp update transactions on v5
> > filesystems due to the fact it does a huge number of *4 byte*
> > write(2) calls.
> >
> > This isn't a real world scenario we really need to address - anyone
> > doing such sequential IO should be using fwrite(3), not write(2).
> > i.e. fwrite(3) buffers the writes in userspace to minimise the
> > number of write(2) syscalls, and the problem goes away.
> >
> > However, there is a small change we can make to improve the
> > situation - removing the expensive lock operation on the change
> > counter update. All inode version counter changes in XFS occur
> > under the ip->i_ilock during a transaction, and therefore we
> > don't actually need the spin lock that provides exclusive access to
> > it through inc_inode_iversion().
> >
> > Hence avoid the lock and just open code the increment ourselves when
> > logging the inode.
> >
> > Reported-by: Michael L. Semon <mlsemon35@gmail.com>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/xfs_trans_inode.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> > index 53dfe46..e6601c1 100644
> > --- a/fs/xfs/xfs_trans_inode.c
> > +++ b/fs/xfs/xfs_trans_inode.c
> > @@ -118,8 +118,7 @@ xfs_trans_log_inode(
> > */
> > if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
> > IS_I_VERSION(VFS_I(ip))) {
> > - inode_inc_iversion(VFS_I(ip));
> > - ip->i_d.di_changecount = VFS_I(ip)->i_version;
>
> comment about the reason for the open-code might be good, too?
>
> otherwise some semantic patcher might "fix" it for you again later...
>
> -Eric
>
> > + ip->i_d.di_changecount = ++VFS_I(ip)->i_version;
> > flags |= XFS_ILOG_CORE;
> > }
> >
> >
Adding a comment strikes me as a good idea too... But isn't that lock there for
a reason? I suspect that will break i_version like i_size on 32 bit systems.
Jean added this function, hopefully he can shed some light.
Thanks,
Ben
* Re: [PATCH 0/4] xfs: candidate fixes for 3.12-rc4
2013-09-29 23:37 [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Dave Chinner
` (3 preceding siblings ...)
2013-09-29 23:37 ` [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode Dave Chinner
@ 2013-09-30 22:52 ` Ben Myers
4 siblings, 0 replies; 17+ messages in thread
From: Ben Myers @ 2013-09-30 22:52 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Sep 30, 2013 at 09:37:02AM +1000, Dave Chinner wrote:
> Hi folks,
>
> The first two patches fix regressions and hence are -rc4 candidates -
> the first patch fixes a lockdep false positive as a result of adding
> the third quota inode, and the second fixes an on-disk format
> interpretation problem in the v4 superblock dtype-in-dirent feature.
Applied 1 and 2.
Thanks,
Ben
* Re: [PATCH 3/4] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
2013-09-30 22:14 ` Ben Myers
@ 2013-10-01 10:57 ` Dave Chinner
0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2013-10-01 10:57 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs
On Mon, Sep 30, 2013 at 05:14:49PM -0500, Ben Myers wrote:
> Hi Dave,
>
> On Mon, Sep 30, 2013 at 09:37:05AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Removing an inode from the namespace involves removing the directory
> > entry and dropping the link count on the inode. Removing the
> > directory entry can result in locking an AGF (directory blocks were
> > freed) and removing a link count can result in placing the inode on
> > an unlinked list which results in locking an AGI.
> >
> > The big problem here is that we have an ordering constraint on AGF
> > and AGI locking - inode allocation locks the AGI, then can allocate
> > a new extent for new inodes, locking the AGF after the AGI.
> > Similarly, freeing the inode removes the inode from the unlinked
> > list, requiring that we lock the AGI first, and then freeing the
> > inode can result in an inode chunk being freed and hence freeing
> > disk space requiring that we lock an AGF.
> >
> > Hence the ordering that is imposed by other parts of the code is AGI
> > before AGF. This means we cannot remove the directory entry before
> > we drop the inode reference count and put it on the unlinked list as
> > this results in a lock order of AGF then AGI, and this can deadlock
> > against inode allocation and freeing. Therefore we must drop the
> > link counts before we remove the directory entry.
> >
> > This is still safe from a transactional point of view - it is not
> > until we get to xfs_bmap_finish() that we have the possibility of
> > multiple transactions in this operation. Hence as long as we remove
> > the directory entry and drop the link count in the first transaction
> > of the remove operation, there are no transactional constraints on
> > the ordering here.
> >
> > Change the ordering of the operations in the xfs_remove() function
> > to align the ordering of AGI and AGF locking to match that of the
> > rest of the code.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> Hmmm.. I'm not quite comfortable with this one yet... It'll probably be
> better after some more review. Did you happen to have a test case to go with
> this?
Two test cases.
The first that I saw involved a 100TB filesystem and 16-way creates
running concurrently with 16-way unlinks, as reported here:
http://oss.sgi.com/archives/xfs/2013-09/msg00359.html
That's not reliable, though - in general it takes hours of running a
couple of hundred threads doing sequential file creates and unlinks
on a 32p VM. i.e. adding a half-hour-long 16-way fsmark workload
every couple of minutes to the load that is already running on the
filesystem. It runs at about 350,000 file creates/s and 150,000
unlinks/s, so it requires a load that is not really feasible for the
majority of QA environments. I haven't had this workload deadlock
since I wrote this patch.
The other system I have that reliably trips over this is at the
other end of the scale, and it's generic/269 w/ MKFS_OPTIONS="-b
size=1024 -m crc=1" that deadlocks on a single CPU VM with 1GB RAM.
This patch does not make generic/269 complete on that machine - it just
changes the deadlock that is being hit.
i.e. it appears that we have introduced an ENOSPC AGF vs AGF lock
ordering deadlock at some point, but I haven't got a handle on that
yet to be able to even speculate on what the cause might be. I'm in
the process of refining tracepoints to narrow down the exact order
of operations that is causing problems...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode
2013-09-30 22:39 ` Ben Myers
@ 2013-10-01 11:12 ` Dave Chinner
2013-10-01 23:04 ` Ben Myers
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2013-10-01 11:12 UTC (permalink / raw)
To: Ben Myers; +Cc: Eric Sandeen, Jean Noel Cordenner, xfs
On Mon, Sep 30, 2013 at 05:39:46PM -0500, Ben Myers wrote:
> On Mon, Sep 30, 2013 at 05:24:54PM -0500, Eric Sandeen wrote:
> > On 9/29/13 6:37 PM, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Michael L Semon reported that generic/069 runtime increased on v5
> > > superblocks by 100% compared to v4 superblocks. His perf-based
> > > analysis pointed directly at the timestamp updates being done by the
> > > write path in this workload. The append writers are doing 4-byte
> > > writes, so there are lots of timestamp updates occurring.
...
> > > diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> > > index 53dfe46..e6601c1 100644
> > > --- a/fs/xfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/xfs_trans_inode.c
> > > @@ -118,8 +118,7 @@ xfs_trans_log_inode(
> > > */
> > > if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
> > > IS_I_VERSION(VFS_I(ip))) {
> > > - inode_inc_iversion(VFS_I(ip));
> > > - ip->i_d.di_changecount = VFS_I(ip)->i_version;
> >
> > comment about the reason for the open-code might be good, too?
Sure, I can add that.
> > otherwise some semantic patcher might "fix" it for you again later...
> >
> > -Eric
> >
> > > + ip->i_d.di_changecount = ++VFS_I(ip)->i_version;
> > > flags |= XFS_ILOG_CORE;
> > > }
> > >
> > >
>
> Adding a comment strikes me as a good idea too... But isn't that lock there for
> a reason? I suspect that will break i_version like i_size on 32 bit systems.
> Jean added this function, hopefully he can shed some light.
I can't see how there's a 32 bit issue here - i_version is always
read unlocked, and so if you're worried about a 32 bit system doing
two 32-bit reads of the 64-bit value and seeing values on
different sides of the increment, then we've already got that
problem *everywhere*. i.e. the only place that i_version is
protected by i_lock is in inode_inc_iversion() - nowhere else is
that lock used at all when reading or writing i_version.
A quick grep points out that ext2/3/4 directory code all update and
read i_version without using the i_lock - they are all serialised by
the directory locks that are held. Ceph, exofs, ocfs2, ecryptfs,
affs, fat, etc all do similar things with inode->i_version.
So if the intention is to make i_version safe on 32 bit systems,
then it's failed. The only thing it does in inode_inc_iversion is
serialise other updates that aren't done under some exclusive inode
locks, and all the XFS updates are done either under the i_mutex
and/or the i_ilock, so I don't think there is any problem with
racing occurring here...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode
2013-10-01 11:12 ` Dave Chinner
@ 2013-10-01 23:04 ` Ben Myers
0 siblings, 0 replies; 17+ messages in thread
From: Ben Myers @ 2013-10-01 23:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: Eric Sandeen, Jean Noel Cordenner, xfs
Hi Dave,
On Tue, Oct 01, 2013 at 09:12:36PM +1000, Dave Chinner wrote:
> On Mon, Sep 30, 2013 at 05:39:46PM -0500, Ben Myers wrote:
> > On Mon, Sep 30, 2013 at 05:24:54PM -0500, Eric Sandeen wrote:
> > > On 9/29/13 6:37 PM, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > Michael L Semon reported that generic/069 runtime increased on v5
> > > > superblocks by 100% compared to v4 superblocks. His perf-based
> > > > analysis pointed directly at the timestamp updates being done by the
> > > > write path in this workload. The append writers are doing 4-byte
> > > > writes, so there are lots of timestamp updates occurring.
> ...
> > > > diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> > > > index 53dfe46..e6601c1 100644
> > > > --- a/fs/xfs/xfs_trans_inode.c
> > > > +++ b/fs/xfs/xfs_trans_inode.c
> > > > @@ -118,8 +118,7 @@ xfs_trans_log_inode(
> > > > */
> > > > if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
> > > > IS_I_VERSION(VFS_I(ip))) {
> > > > - inode_inc_iversion(VFS_I(ip));
> > > > - ip->i_d.di_changecount = VFS_I(ip)->i_version;
> > >
> > > comment about the reason for the open-code might be good, too?
>
> Sure, I can add that.
>
> > > otherwise some semantic patcher might "fix" it for you again later...
> > >
> > > -Eric
> > >
> > > > + ip->i_d.di_changecount = ++VFS_I(ip)->i_version;
> > > > flags |= XFS_ILOG_CORE;
> > > > }
> > > >
> > > >
> >
> > Adding a comment strikes me as a good idea too... But isn't that lock there for
> > a reason? I suspect that will break i_version like i_size on 32 bit systems.
> > Jean added this function, hopefully he can shed some light.
>
> I can't see how there's a 32 bit issue here - i_version is always
> read unlocked, and so if you're worried about a 32 bit system doing
> 2 32 bit reads to read the 64 bit value and seeing values on
> different sides of the increment, then we've already got that
> problem *everywhere*.
I think if we had the 32 bit issues with i_size, the same is likely true here.
You're not making it any worse, AFAICT.
> i.e. the only place that i_version is
> protected by i_lock is in inode_inc_iversion() - nowhere else is
> that lock used at all when reading or writing i_version.
Seems like if nobody is taking the i_lock when reading i_version, it's not
really providing the protection that was intended. Weird.
> A quick grep points out that ext2/3/4 directory code all update and
> read i_version without using the i_lock - they are all serialised by
> the directory locks that are held. Ceph, exofs, ocfs2, ecryptfs,
> affs, fat, etc all do similar things with inode->i_version.
>
> So if the intention is to make i_version safe on 32 bit systems,
> then it's failed.
Agreed.
> The only thing it does in inode_inc_iversion is
> serialise other updates that aren't done under some exclusive inode
> locks, and all the XFS updates are done either under the i_mutex
> and/or the i_ilock, so I don't think there is any problem with
> racing occurring here...
I'll take another look at it with that in mind.
Thanks,
Ben
* Re: [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers
2013-09-29 23:37 ` [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers Dave Chinner
2013-09-30 22:02 ` Ben Myers
@ 2013-10-18 16:56 ` Rich Johnston
2013-10-18 22:30 ` Dave Chinner
1 sibling, 1 reply; 17+ messages in thread
From: Rich Johnston @ 2013-10-18 16:56 UTC (permalink / raw)
To: Dave Chinner, xfs
This has been committed.
Thanks
--Rich
commit 41315687d9db9b50876401e7b0ee20dd77cfc712
Author: Dave Chinner <dchinner@redhat.com>
Date: Mon Sep 30 03:15:19 2013 +0000
xfs: dirent dtype presence is dependent on directory magic numbers
* Re: [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers
2013-10-18 16:56 ` Rich Johnston
@ 2013-10-18 22:30 ` Dave Chinner
2013-10-18 22:41 ` Rich Johnston
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2013-10-18 22:30 UTC (permalink / raw)
To: Rich Johnston; +Cc: xfs
On Fri, Oct 18, 2013 at 11:56:00AM -0500, Rich Johnston wrote:
> This has been committed.
>
> Thanks
> --Rich
>
> commit 41315687d9db9b50876401e7b0ee20dd77cfc712
> Author: Dave Chinner <dchinner@redhat.com>
> Date: Mon Sep 30 03:15:19 2013 +0000
>
> xfs: dirent dtype presence is dependent on directory magic numbers
Committed where, exactly?
This is a kernel patch that has already been committed and been
pushed upstream to linus as commit 6d31349....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers
2013-10-18 22:30 ` Dave Chinner
@ 2013-10-18 22:41 ` Rich Johnston
0 siblings, 0 replies; 17+ messages in thread
From: Rich Johnston @ 2013-10-18 22:41 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Sorry, an oversight on my reply - the patch had a similar name:
Re: [PATCH 07/32] xfs: dirent dtype presence is dependent on directory
magic numbers
On 10/18/2013 05:30 PM, Dave Chinner wrote:
> On Fri, Oct 18, 2013 at 11:56:00AM -0500, Rich Johnston wrote:
>> This has been committed.
>>
>> Thanks
>> --Rich
>>
>> commit 41315687d9db9b50876401e7b0ee20dd77cfc712
>> Author: Dave Chinner <dchinner@redhat.com>
>> Date: Mon Sep 30 03:15:19 2013 +0000
>>
>> xfs: dirent dtype presence is dependent on directory magic numbers
>
> Committed where, exactly?
>
> This is a kernel patch that has already been committed and been
> pushed upstream to linus as commit 6d31349....
>
> Cheers,
>
> Dave.
>
end of thread, other threads: [~2013-10-18 22:41 UTC | newest]
Thread overview: 17+ messages
2013-09-29 23:37 [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Dave Chinner
2013-09-29 23:37 ` [PATCH 1/4] xfs: lockdep needs to know about 3 dquot-deep nesting Dave Chinner
2013-09-30 21:19 ` Ben Myers
2013-09-29 23:37 ` [PATCH 2/4] xfs: dirent dtype presence is dependent on directory magic numbers Dave Chinner
2013-09-30 22:02 ` Ben Myers
2013-10-18 16:56 ` Rich Johnston
2013-10-18 22:30 ` Dave Chinner
2013-10-18 22:41 ` Rich Johnston
2013-09-29 23:37 ` [PATCH 3/4] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering Dave Chinner
2013-09-30 22:14 ` Ben Myers
2013-10-01 10:57 ` Dave Chinner
2013-09-29 23:37 ` [PATCH 4/4] xfs: open code inc_inode_iversion when logging an inode Dave Chinner
2013-09-30 22:24 ` Eric Sandeen
2013-09-30 22:39 ` Ben Myers
2013-10-01 11:12 ` Dave Chinner
2013-10-01 23:04 ` Ben Myers
2013-09-30 22:52 ` [PATCH 0/4] xfs: candidate fixes for 3.12-rc4 Ben Myers