[Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c
       [not found] <655197227.7744188.1565376946008.JavaMail.zimbra@redhat.com>
@ 2019-08-09 18:58 ` Bob Peterson
  2019-08-12  9:07   ` Steven Whitehouse
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Peterson @ 2019-08-09 18:58 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

This patch fixes problems caused by regressions from patch
"GFS2: rm on multiple nodes causes panic" from 2008,
72dbf4790fc6736f9cb54424245114acf0b0038c, which was an earlier
attempt to fix very similar problems.

The original problem for which it was written had to do with
simultaneous link, unlink, rmdir and rename operations on
multiple nodes that interfered with one another, due to the
lock ordering. The problem was that the lock ordering was
not consistent between the operations.

The defective patch put in place to solve it (and hey, it
worked for more than 10 years) changed the lock ordering so
that the parent directory glock was always locked before the
child. This almost always worked. Almost. The rmdir version
was still wrong because the rgrp glock was added to the holder
array, which was sorted, and the locks were acquired in sorted
order. That is counter to the locking requirements documented
in: Documentation/filesystems/gfs2-glocks.txt which states the
rgrp glock glock must always be locked after the inode glocks.

The real problem came with renames, though. Function
gfs2_rename(), which locked a series of inode glocks, did so
in parent-child order due to that patch. But it was still
possible to create circular lock dependencies just by doing the
wrong combination of renames on different nodes. For example:

Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name)

a1. Same directory, so rename glock is NOT held
a2. /mnt/gfs2 is locked
a3. Tries to lock sub for rename, but it is locked on node b

Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1...
        mv /mnt/gfs2/dir1/sub /mnt/gfs2/  ...then move it back)

b1. Different directory, so rename glock IS held
b2. /mnt/gfs2 is locked
b3. dir1 is locked
b4. sub is moved to dir1 and everything is unlocked
b5. Different directory, so rename glock IS held again
b6. dir1 is locked
b7. Lock for /mnt/gfs2 is requested, but cannot be granted because
    node 1 locked it in step a2.

(Note that the nodes must be different, otherwise the vfs inode
level locking prevents the problem on a single node).

Thus, we get into a glock deadlock that looks like this:

host-018:
G:  s:EX n:2/3347 f:DyIqob t:EX d:UN/2368172000 a:0 v:0 r:3 m:150 <--------root
 H: s:EX f:H e:0 p:6414 [renameat_frenzy] gfs2_rename+0x32a/0x680 [gfs2]
 I: n:20/13127 t:4 f:0x00 d:0x00000001 s:3864
G:  s:UN n:2/3348 f:ldIqob t:EX d:UN/0 a:0 v:0 r:3 m:150 <-----------------dir6
 H: s:EX f:W e:0 p:6414 [renameat_frenzy] gfs2_rename+0x18f/0x680 [gfs2]
[root@host-018 ~]# cat /proc/6414/stack
[<ffffffffc04b621e>] gfs2_glock_wait+0x3e/0x80 [gfs2]
[<ffffffffc04b7d00>] gfs2_glock_nq+0x250/0x420 [gfs2]
[<ffffffffc04c7908>] gfs2_rename+0x228/0x680 [gfs2]
[<ffffffffc04c7d7d>] gfs2_rename2+0x1d/0x40 [gfs2]
[<ffffffff81e57245>] vfs_rename+0x575/0x880

So host-018 is holding root, and waiting for dir6

host-019:
G:  s:EX n:1/3 f:Iqb t:EX d:EX/0 a:0 v:0 r:3 m:200 <---------------------rename
 H: s:EX f:H e:0 p:6414 [renameat_frenzy] gfs2_rename+0xe7/0x680 [gfs2]
G:  s:UN n:2/3347 f:lIqob t:EX d:EX/0 a:0 v:0 r:3 m:10 <-------------------root
 H: s:EX f:W e:0 p:6414 [renameat_frenzy] gfs2_rename+0x162/0x680 [gfs2]
G:  s:EX n:2/3348 f:DIqob t:EX d:UN/2377079000 a:0 v:0 r:3 m:200 <---------dir6
 H: s:EX f:H e:0 p:6414 [renameat_frenzy] gfs2_rename+0x140/0x680 [gfs2]
 I: n:1/13128 t:4 f:0x00 d:0x00000001 s:3864
[root@host-019 ~]# cat /proc/6414/stack
[<ffffffffc05e621e>] gfs2_glock_wait+0x3e/0x80 [gfs2]
[<ffffffffc05e7d00>] gfs2_glock_nq+0x250/0x420 [gfs2]
[<ffffffffc05f7908>] gfs2_rename+0x228/0x680 [gfs2]
[<ffffffffc05f7d7d>] gfs2_rename2+0x1d/0x40 [gfs2]
[<ffffffff90e57245>] vfs_rename+0x575/0x880

So host-019 is holding dir6, waiting for root

Before the defective patch, many of the functions like gfs2_rename
used gfs2_glock_nq_m to acquire their inode glocks, which acquires
them in sorted order. That means dependencies should always be
heirarchical and never circular, even across multiple nodes.

This patch reintroduces the use of gfs2_glock_nq_m for many of
the inode operations in inode.c. It also separates the rgrp glocks
from the others so the correct locking order is properly enforced.
This consistency in locking should prevent any deadlocks.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/inode.c | 114 ++++++++++++++++++++++++--------------------------------
 1 file changed, 49 insertions(+), 65 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 2e2a8a2fb51d..d7a80b35910f 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -912,13 +912,9 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
 	gfs2_holder_init(dip->i_gl, LM_ST_EXCLUSIVE, 0, ghs);
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, ghs + 1);
 
-	error = gfs2_glock_nq(ghs); /* parent */
+	error = gfs2_glock_nq_m(2, ghs); /* inodes */
 	if (error)
-		goto out_parent;
-
-	error = gfs2_glock_nq(ghs + 1); /* child */
-	if (error)
-		goto out_child;
+		goto out_uninit;
 
 	error = -ENOENT;
 	if (inode->i_nlink == 0)
@@ -1004,10 +1000,8 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
 		gfs2_quota_unlock(dip);
 out_gunlock:
 	gfs2_dir_no_add(&da);
-	gfs2_glock_dq(ghs + 1);
-out_child:
-	gfs2_glock_dq(ghs);
-out_parent:
+	gfs2_glock_dq_m(2, ghs);
+out_uninit:
 	gfs2_holder_uninit(ghs);
 	gfs2_holder_uninit(ghs + 1);
 	return error;
@@ -1100,7 +1094,7 @@ static int gfs2_unlink(struct inode *dir, struct dentry *dentry)
 	struct gfs2_sbd *sdp = GFS2_SB(dir);
 	struct inode *inode = d_inode(dentry);
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_holder ghs[3];
+	struct gfs2_holder ghs[2], rd_gh;
 	struct gfs2_rgrpd *rgd;
 	int error;
 
@@ -1108,60 +1102,48 @@ static int gfs2_unlink(struct inode *dir, struct dentry *dentry)
 	if (error)
 		return error;
 
-	error = -EROFS;
-
-	gfs2_holder_init(dip->i_gl, LM_ST_EXCLUSIVE, 0, ghs);
-	gfs2_holder_init(ip->i_gl,  LM_ST_EXCLUSIVE, 0, ghs + 1);
-
 	rgd = gfs2_blk2rgrpd(sdp, ip->i_no_addr, 1);
 	if (!rgd)
-		goto out_inodes;
-
-	gfs2_holder_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + 2);
+		return -EROFS;
 
+	gfs2_holder_init(dip->i_gl, LM_ST_EXCLUSIVE, 0, ghs);
+	gfs2_holder_init(ip->i_gl,  LM_ST_EXCLUSIVE, 0, ghs + 1);
 
-	error = gfs2_glock_nq(ghs); /* parent */
+	error = gfs2_glock_nq_m(2, ghs); /* inodes */
 	if (error)
-		goto out_parent;
+		goto out_uninit;
 
-	error = gfs2_glock_nq(ghs + 1); /* child */
+	error = gfs2_glock_nq_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, &rd_gh);
 	if (error)
-		goto out_child;
+		goto out_gunlock;
 
-	error = -ENOENT;
-	if (inode->i_nlink == 0)
-		goto out_rgrp;
+	if (inode->i_nlink == 0) {
+		error = -ENOENT;
+		goto out_rgunlock;
+	}
 
-	if (S_ISDIR(inode->i_mode)) {
+	if (S_ISDIR(inode->i_mode) &&
+	    (ip->i_entries > 2 || inode->i_nlink > 2)) {
 		error = -ENOTEMPTY;
-		if (ip->i_entries > 2 || inode->i_nlink > 2)
-			goto out_rgrp;
+		goto out_rgunlock;
 	}
 
-	error = gfs2_glock_nq(ghs + 2); /* rgrp */
-	if (error)
-		goto out_rgrp;
-
 	error = gfs2_unlink_ok(dip, &dentry->d_name, ip);
 	if (error)
-		goto out_gunlock;
+		goto out_rgunlock;
 
 	error = gfs2_trans_begin(sdp, 2*RES_DINODE + 3*RES_LEAF + RES_RG_BIT, 0);
 	if (error)
-		goto out_gunlock;
+		goto out_rgunlock;
 
 	error = gfs2_unlink_inode(dip, dentry);
 	gfs2_trans_end(sdp);
 
+out_rgunlock:
+	gfs2_glock_dq_uninit(&rd_gh);
 out_gunlock:
-	gfs2_glock_dq(ghs + 2);
-out_rgrp:
-	gfs2_glock_dq(ghs + 1);
-out_child:
-	gfs2_glock_dq(ghs);
-out_parent:
-	gfs2_holder_uninit(ghs + 2);
-out_inodes:
+	gfs2_glock_dq_m(2, ghs);
+out_uninit:
 	gfs2_holder_uninit(ghs + 1);
 	gfs2_holder_uninit(ghs);
 	return error;
@@ -1348,15 +1330,15 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
 	struct gfs2_inode *ip = GFS2_I(d_inode(odentry));
 	struct gfs2_inode *nip = NULL;
 	struct gfs2_sbd *sdp = GFS2_SB(odir);
-	struct gfs2_holder ghs[5], r_gh;
-	struct gfs2_rgrpd *nrgd;
+	struct gfs2_holder ghs[4], r_gh, rd_gh;
+	struct gfs2_rgrpd *nrgd = NULL;
 	unsigned int num_gh;
 	int dir_rename = 0;
 	struct gfs2_diradd da = { .nr_blocks = 0, .save_loc = 0, };
-	unsigned int x;
 	int error;
 
 	gfs2_holder_mark_uninitialized(&r_gh);
+	gfs2_holder_mark_uninitialized(&rd_gh);
 	if (d_really_is_positive(ndentry)) {
 		nip = GFS2_I(d_inode(ndentry));
 		if (ip == nip)
@@ -1403,12 +1385,15 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
 		 * so we unlink before doing the rename
 		 */
 		nrgd = gfs2_blk2rgrpd(sdp, nip->i_no_addr, 1);
-		if (nrgd)
-			gfs2_holder_init(nrgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + num_gh++);
 	}
 
-	for (x = 0; x < num_gh; x++) {
-		error = gfs2_glock_nq(ghs + x);
+	error = gfs2_glock_nq_m(num_gh, ghs);
+	if (error)
+		goto out_uninit;
+
+	if (nrgd) {
+		error = gfs2_glock_nq_init(nrgd->rd_gl, LM_ST_EXCLUSIVE, 0,
+					   &rd_gh);
 		if (error)
 			goto out_gunlock;
 	}
@@ -1541,10 +1526,12 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
 		gfs2_quota_unlock(ndip);
 out_gunlock:
 	gfs2_dir_no_add(&da);
-	while (x--) {
-		gfs2_glock_dq(ghs + x);
-		gfs2_holder_uninit(ghs + x);
-	}
+	if (gfs2_holder_initialized(&rd_gh))
+		gfs2_glock_dq_uninit(&rd_gh);
+	gfs2_glock_dq_m(num_gh, ghs);
+out_uninit:
+	while (num_gh--)
+		gfs2_holder_uninit(ghs + num_gh);
 out_gunlock_r:
 	if (gfs2_holder_initialized(&r_gh))
 		gfs2_glock_dq_uninit(&r_gh);
@@ -1572,9 +1559,8 @@ static int gfs2_exchange(struct inode *odir, struct dentry *odentry,
 	struct gfs2_inode *oip = GFS2_I(odentry->d_inode);
 	struct gfs2_inode *nip = GFS2_I(ndentry->d_inode);
 	struct gfs2_sbd *sdp = GFS2_SB(odir);
-	struct gfs2_holder ghs[5], r_gh;
+	struct gfs2_holder ghs[4], r_gh;
 	unsigned int num_gh;
-	unsigned int x;
 	umode_t old_mode = oip->i_inode.i_mode;
 	umode_t new_mode = nip->i_inode.i_mode;
 	int error;
@@ -1617,11 +1603,9 @@ static int gfs2_exchange(struct inode *odir, struct dentry *odentry,
 	gfs2_holder_init(nip->i_gl, LM_ST_EXCLUSIVE, 0, ghs + num_gh);
 	num_gh++;
 
-	for (x = 0; x < num_gh; x++) {
-		error = gfs2_glock_nq(ghs + x);
-		if (error)
-			goto out_gunlock;
-	}
+	error = gfs2_glock_nq_m(num_gh, ghs);
+	if (error)
+		goto out_guninit;
 
 	error = -ENOENT;
 	if (oip->i_inode.i_nlink == 0 || nip->i_inode.i_nlink == 0)
@@ -1682,10 +1666,10 @@ static int gfs2_exchange(struct inode *odir, struct dentry *odentry,
 out_end_trans:
 	gfs2_trans_end(sdp);
 out_gunlock:
-	while (x--) {
-		gfs2_glock_dq(ghs + x);
-		gfs2_holder_uninit(ghs + x);
-	}
+	gfs2_glock_dq_m(num_gh, ghs);
+out_guninit:
+	while (num_gh--)
+		gfs2_holder_uninit(ghs + num_gh);
 out_gunlock_r:
 	if (gfs2_holder_initialized(&r_gh))
 		gfs2_glock_dq_uninit(&r_gh);



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c
  2019-08-09 18:58 ` [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c Bob Peterson
@ 2019-08-12  9:07   ` Steven Whitehouse
  2019-08-12 13:43     ` Bob Peterson
  0 siblings, 1 reply; 4+ messages in thread
From: Steven Whitehouse @ 2019-08-12  9:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 09/08/2019 19:58, Bob Peterson wrote:
> Hi,
>
> This patch fixes problems caused by regressions from patch
> "GFS2: rm on multiple nodes causes panic" from 2008,
> 72dbf4790fc6736f9cb54424245114acf0b0038c, which was an earlier
> attempt to fix very similar problems.
>
> The original problem for which it was written had to do with
> simultaneous link, unlink, rmdir and rename operations on
> multiple nodes that interfered with one another, due to the
> lock ordering. The problem was that the lock ordering was
> not consistent between the operations.
>
> The defective patch put in place to solve it (and hey, it
> worked for more than 10 years) changed the lock ordering so
> that the parent directory glock was always locked before the
> child. This almost always worked. Almost. The rmdir version
> was still wrong because the rgrp glock was added to the holder
> array, which was sorted, and the locks were acquired in sorted
> order. That is counter to the locking requirements documented
> in: Documentation/filesystems/gfs2-glocks.txt which states the
> rgrp glock glock must always be locked after the inode glocks.

Yes, that does need fixing, however it also doesn't entirely make sense, 
because the parent in that case is locked, but is not being removed, so 
it's rgrp would not need to be added to the transaction. anyway.


>
> The real problem came with renames, though. Function
> gfs2_rename(), which locked a series of inode glocks, did so
> in parent-child order due to that patch. But it was still
> possible to create circular lock dependencies just by doing the
> wrong combination of renames on different nodes. For example:
>
> Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name)
>
> a1. Same directory, so rename glock is NOT held
> a2. /mnt/gfs2 is locked
> a3. Tries to lock sub for rename, but it is locked on node b
>
> Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1...
>          mv /mnt/gfs2/dir1/sub /mnt/gfs2/  ...then move it back)
>
> b1. Different directory, so rename glock IS held
> b2. /mnt/gfs2 is locked
> b3. dir1 is locked
> b4. sub is moved to dir1 and everything is unlocked
> b5. Different directory, so rename glock IS held again
> b6. dir1 is locked
> b7. Lock for /mnt/gfs2 is requested, but cannot be granted because
>      node 1 locked it in step a2.

If the parents are being locked before the child, as per the correct 
locking order, then this cannot happen. The directory in which the child 
is located should always be locked first, before the child, so that is 
what protects the operation on a from whatever might be going on, on node b.

When you get to step b7, sub is not locked (since it was unlocked in b4) 
and not locked again. Thus a3 can complete. So this doesn't look like it 
is the right explanation.


>
> (Note that the nodes must be different, otherwise the vfs inode
> level locking prevents the problem on a single node).
>
> Thus, we get into a glock deadlock that looks like this:
>
> host-018:
> G:  s:EX n:2/3347 f:DyIqob t:EX d:UN/2368172000 a:0 v:0 r:3 m:150 <--------root
>   H: s:EX f:H e:0 p:6414 [renameat_frenzy] gfs2_rename+0x32a/0x680 [gfs2]
>   I: n:20/13127 t:4 f:0x00 d:0x00000001 s:3864
> G:  s:UN n:2/3348 f:ldIqob t:EX d:UN/0 a:0 v:0 r:3 m:150 <-----------------dir6
>   H: s:EX f:W e:0 p:6414 [renameat_frenzy] gfs2_rename+0x18f/0x680 [gfs2]
> [root at host-018 ~]# cat /proc/6414/stack
> [<ffffffffc04b621e>] gfs2_glock_wait+0x3e/0x80 [gfs2]
> [<ffffffffc04b7d00>] gfs2_glock_nq+0x250/0x420 [gfs2]
> [<ffffffffc04c7908>] gfs2_rename+0x228/0x680 [gfs2]
> [<ffffffffc04c7d7d>] gfs2_rename2+0x1d/0x40 [gfs2]
> [<ffffffff81e57245>] vfs_rename+0x575/0x880
>
> So host-018 is holding root, and waiting for dir6
>
> host-019:
> G:  s:EX n:1/3 f:Iqb t:EX d:EX/0 a:0 v:0 r:3 m:200 <---------------------rename
>   H: s:EX f:H e:0 p:6414 [renameat_frenzy] gfs2_rename+0xe7/0x680 [gfs2]
> G:  s:UN n:2/3347 f:lIqob t:EX d:EX/0 a:0 v:0 r:3 m:10 <-------------------root
>   H: s:EX f:W e:0 p:6414 [renameat_frenzy] gfs2_rename+0x162/0x680 [gfs2]
> G:  s:EX n:2/3348 f:DIqob t:EX d:UN/2377079000 a:0 v:0 r:3 m:200 <---------dir6
>   H: s:EX f:H e:0 p:6414 [renameat_frenzy] gfs2_rename+0x140/0x680 [gfs2]
>   I: n:1/13128 t:4 f:0x00 d:0x00000001 s:3864
> [root at host-019 ~]# cat /proc/6414/stack
> [<ffffffffc05e621e>] gfs2_glock_wait+0x3e/0x80 [gfs2]
> [<ffffffffc05e7d00>] gfs2_glock_nq+0x250/0x420 [gfs2]
> [<ffffffffc05f7908>] gfs2_rename+0x228/0x680 [gfs2]
> [<ffffffffc05f7d7d>] gfs2_rename2+0x1d/0x40 [gfs2]
> [<ffffffff90e57245>] vfs_rename+0x575/0x880

That seems to be the real issue then - how did we manage to try and 
acquire the locks in the wrong order here? The root dir should always be 
locked first in this case.


>
> So host-019 is holding dir6, waiting for root
>
> Before the defective patch, many of the functions like gfs2_rename
> used gfs2_glock_nq_m to acquire their inode glocks, which acquires
> them in sorted order. That means dependencies should always be
> heirarchical and never circular, even across multiple nodes.
>
> This patch reintroduces the use of gfs2_glock_nq_m for many of
> the inode operations in inode.c. It also separates the rgrp glocks
> from the others so the correct locking order is properly enforced.
> This consistency in locking should prevent any deadlocks.

We definitely don't want to add that back, it has taken a long time to 
get rid of it. Lets fix the actual issue here rather than try and change 
everything back to an incompatible locking system that is rather 
inefficient,

Steve.


>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/inode.c | 114 ++++++++++++++++++++++++--------------------------------
>   1 file changed, 49 insertions(+), 65 deletions(-)
>
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 2e2a8a2fb51d..d7a80b35910f 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -912,13 +912,9 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
>   	gfs2_holder_init(dip->i_gl, LM_ST_EXCLUSIVE, 0, ghs);
>   	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, ghs + 1);
>   
> -	error = gfs2_glock_nq(ghs); /* parent */
> +	error = gfs2_glock_nq_m(2, ghs); /* inodes */
>   	if (error)
> -		goto out_parent;
> -
> -	error = gfs2_glock_nq(ghs + 1); /* child */
> -	if (error)
> -		goto out_child;
> +		goto out_uninit;
>   
>   	error = -ENOENT;
>   	if (inode->i_nlink == 0)
> @@ -1004,10 +1000,8 @@ static int gfs2_link(struct dentry *old_dentry, struct inode *dir,
>   		gfs2_quota_unlock(dip);
>   out_gunlock:
>   	gfs2_dir_no_add(&da);
> -	gfs2_glock_dq(ghs + 1);
> -out_child:
> -	gfs2_glock_dq(ghs);
> -out_parent:
> +	gfs2_glock_dq_m(2, ghs);
> +out_uninit:
>   	gfs2_holder_uninit(ghs);
>   	gfs2_holder_uninit(ghs + 1);
>   	return error;
> @@ -1100,7 +1094,7 @@ static int gfs2_unlink(struct inode *dir, struct dentry *dentry)
>   	struct gfs2_sbd *sdp = GFS2_SB(dir);
>   	struct inode *inode = d_inode(dentry);
>   	struct gfs2_inode *ip = GFS2_I(inode);
> -	struct gfs2_holder ghs[3];
> +	struct gfs2_holder ghs[2], rd_gh;
>   	struct gfs2_rgrpd *rgd;
>   	int error;
>   
> @@ -1108,60 +1102,48 @@ static int gfs2_unlink(struct inode *dir, struct dentry *dentry)
>   	if (error)
>   		return error;
>   
> -	error = -EROFS;
> -
> -	gfs2_holder_init(dip->i_gl, LM_ST_EXCLUSIVE, 0, ghs);
> -	gfs2_holder_init(ip->i_gl,  LM_ST_EXCLUSIVE, 0, ghs + 1);
> -
>   	rgd = gfs2_blk2rgrpd(sdp, ip->i_no_addr, 1);
>   	if (!rgd)
> -		goto out_inodes;
> -
> -	gfs2_holder_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + 2);
> +		return -EROFS;
>   
> +	gfs2_holder_init(dip->i_gl, LM_ST_EXCLUSIVE, 0, ghs);
> +	gfs2_holder_init(ip->i_gl,  LM_ST_EXCLUSIVE, 0, ghs + 1);
>   
> -	error = gfs2_glock_nq(ghs); /* parent */
> +	error = gfs2_glock_nq_m(2, ghs); /* inodes */
>   	if (error)
> -		goto out_parent;
> +		goto out_uninit;
>   
> -	error = gfs2_glock_nq(ghs + 1); /* child */
> +	error = gfs2_glock_nq_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, &rd_gh);
>   	if (error)
> -		goto out_child;
> +		goto out_gunlock;
>   
> -	error = -ENOENT;
> -	if (inode->i_nlink == 0)
> -		goto out_rgrp;
> +	if (inode->i_nlink == 0) {
> +		error = -ENOENT;
> +		goto out_rgunlock;
> +	}
>   
> -	if (S_ISDIR(inode->i_mode)) {
> +	if (S_ISDIR(inode->i_mode) &&
> +	    (ip->i_entries > 2 || inode->i_nlink > 2)) {
>   		error = -ENOTEMPTY;
> -		if (ip->i_entries > 2 || inode->i_nlink > 2)
> -			goto out_rgrp;
> +		goto out_rgunlock;
>   	}
>   
> -	error = gfs2_glock_nq(ghs + 2); /* rgrp */
> -	if (error)
> -		goto out_rgrp;
> -
>   	error = gfs2_unlink_ok(dip, &dentry->d_name, ip);
>   	if (error)
> -		goto out_gunlock;
> +		goto out_rgunlock;
>   
>   	error = gfs2_trans_begin(sdp, 2*RES_DINODE + 3*RES_LEAF + RES_RG_BIT, 0);
>   	if (error)
> -		goto out_gunlock;
> +		goto out_rgunlock;
>   
>   	error = gfs2_unlink_inode(dip, dentry);
>   	gfs2_trans_end(sdp);
>   
> +out_rgunlock:
> +	gfs2_glock_dq_uninit(&rd_gh);
>   out_gunlock:
> -	gfs2_glock_dq(ghs + 2);
> -out_rgrp:
> -	gfs2_glock_dq(ghs + 1);
> -out_child:
> -	gfs2_glock_dq(ghs);
> -out_parent:
> -	gfs2_holder_uninit(ghs + 2);
> -out_inodes:
> +	gfs2_glock_dq_m(2, ghs);
> +out_uninit:
>   	gfs2_holder_uninit(ghs + 1);
>   	gfs2_holder_uninit(ghs);
>   	return error;
> @@ -1348,15 +1330,15 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
>   	struct gfs2_inode *ip = GFS2_I(d_inode(odentry));
>   	struct gfs2_inode *nip = NULL;
>   	struct gfs2_sbd *sdp = GFS2_SB(odir);
> -	struct gfs2_holder ghs[5], r_gh;
> -	struct gfs2_rgrpd *nrgd;
> +	struct gfs2_holder ghs[4], r_gh, rd_gh;
> +	struct gfs2_rgrpd *nrgd = NULL;
>   	unsigned int num_gh;
>   	int dir_rename = 0;
>   	struct gfs2_diradd da = { .nr_blocks = 0, .save_loc = 0, };
> -	unsigned int x;
>   	int error;
>   
>   	gfs2_holder_mark_uninitialized(&r_gh);
> +	gfs2_holder_mark_uninitialized(&rd_gh);
>   	if (d_really_is_positive(ndentry)) {
>   		nip = GFS2_I(d_inode(ndentry));
>   		if (ip == nip)
> @@ -1403,12 +1385,15 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
>   		 * so we unlink before doing the rename
>   		 */
>   		nrgd = gfs2_blk2rgrpd(sdp, nip->i_no_addr, 1);
> -		if (nrgd)
> -			gfs2_holder_init(nrgd->rd_gl, LM_ST_EXCLUSIVE, 0, ghs + num_gh++);
>   	}
>   
> -	for (x = 0; x < num_gh; x++) {
> -		error = gfs2_glock_nq(ghs + x);
> +	error = gfs2_glock_nq_m(num_gh, ghs);
> +	if (error)
> +		goto out_uninit;
> +
> +	if (nrgd) {
> +		error = gfs2_glock_nq_init(nrgd->rd_gl, LM_ST_EXCLUSIVE, 0,
> +					   &rd_gh);
>   		if (error)
>   			goto out_gunlock;
>   	}
> @@ -1541,10 +1526,12 @@ static int gfs2_rename(struct inode *odir, struct dentry *odentry,
>   		gfs2_quota_unlock(ndip);
>   out_gunlock:
>   	gfs2_dir_no_add(&da);
> -	while (x--) {
> -		gfs2_glock_dq(ghs + x);
> -		gfs2_holder_uninit(ghs + x);
> -	}
> +	if (gfs2_holder_initialized(&rd_gh))
> +		gfs2_glock_dq_uninit(&rd_gh);
> +	gfs2_glock_dq_m(num_gh, ghs);
> +out_uninit:
> +	while (num_gh--)
> +		gfs2_holder_uninit(ghs + num_gh);
>   out_gunlock_r:
>   	if (gfs2_holder_initialized(&r_gh))
>   		gfs2_glock_dq_uninit(&r_gh);
> @@ -1572,9 +1559,8 @@ static int gfs2_exchange(struct inode *odir, struct dentry *odentry,
>   	struct gfs2_inode *oip = GFS2_I(odentry->d_inode);
>   	struct gfs2_inode *nip = GFS2_I(ndentry->d_inode);
>   	struct gfs2_sbd *sdp = GFS2_SB(odir);
> -	struct gfs2_holder ghs[5], r_gh;
> +	struct gfs2_holder ghs[4], r_gh;
>   	unsigned int num_gh;
> -	unsigned int x;
>   	umode_t old_mode = oip->i_inode.i_mode;
>   	umode_t new_mode = nip->i_inode.i_mode;
>   	int error;
> @@ -1617,11 +1603,9 @@ static int gfs2_exchange(struct inode *odir, struct dentry *odentry,
>   	gfs2_holder_init(nip->i_gl, LM_ST_EXCLUSIVE, 0, ghs + num_gh);
>   	num_gh++;
>   
> -	for (x = 0; x < num_gh; x++) {
> -		error = gfs2_glock_nq(ghs + x);
> -		if (error)
> -			goto out_gunlock;
> -	}
> +	error = gfs2_glock_nq_m(num_gh, ghs);
> +	if (error)
> +		goto out_guninit;
>   
>   	error = -ENOENT;
>   	if (oip->i_inode.i_nlink == 0 || nip->i_inode.i_nlink == 0)
> @@ -1682,10 +1666,10 @@ static int gfs2_exchange(struct inode *odir, struct dentry *odentry,
>   out_end_trans:
>   	gfs2_trans_end(sdp);
>   out_gunlock:
> -	while (x--) {
> -		gfs2_glock_dq(ghs + x);
> -		gfs2_holder_uninit(ghs + x);
> -	}
> +	gfs2_glock_dq_m(num_gh, ghs);
> +out_guninit:
> +	while (num_gh--)
> +		gfs2_holder_uninit(ghs + num_gh);
>   out_gunlock_r:
>   	if (gfs2_holder_initialized(&r_gh))
>   		gfs2_glock_dq_uninit(&r_gh);
>



^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c
  2019-08-12  9:07   ` Steven Whitehouse
@ 2019-08-12 13:43     ` Bob Peterson
  2019-08-12 13:56       ` Steven Whitehouse
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Peterson @ 2019-08-12 13:43 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> > The real problem came with renames, though. Function
> > gfs2_rename(), which locked a series of inode glocks, did so
> > in parent-child order due to that patch. But it was still
> > possible to create circular lock dependencies just by doing the
> > wrong combination of renames on different nodes. For example:
> >
> > Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name)
> >
> > a1. Same directory, so rename glock is NOT held
> > a2. /mnt/gfs2 is locked
> > a3. Tries to lock sub for rename, but it is locked on node b
> >
> > Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1...
> >          mv /mnt/gfs2/dir1/sub /mnt/gfs2/  ...then move it back)
> >
> > b1. Different directory, so rename glock IS held
> > b2. /mnt/gfs2 is locked
> > b3. dir1 is locked
> > b4. sub is moved to dir1 and everything is unlocked
> > b5. Different directory, so rename glock IS held again
> > b6. dir1 is locked
> > b7. Lock for /mnt/gfs2 is requested, but cannot be granted because
> >      node 1 locked it in step a2.
> 
> If the parents are being locked before the child, as per the correct
> locking order, then this cannot happen. The directory in which the child
> is located should always be locked first, before the child, so that is
> what protects the operation on a from whatever might be going on, on node b.
> 
> When you get to step b7, sub is not locked (since it was unlocked in b4)
> and not locked again. Thus a3 can complete. So this doesn't look like it
> is the right explanation.

Hi,

I guess maybe my explanation is lacking.
It's not so much a relationship between "parent" and "child"
directories as it is "old" and "new" directories.

The comments for function vfs_rename() explain the situations in which
this can happen, and have been prevented on a single node through the
use of s_vfs_rename_mutex. However, that mutex is not cluster-wide,
which means the relationship of which inode is the "old" and which
inode is the "new" can change indiscriminately without notice and
without cluster-wide locking. The whole point of the "a" and "b"
scenarios was to illustrate that one node can lock "old", then "new",
but the other node can reverse the roles of those same inodes (which
is the "old" and which is the "new") and therefore reverse the lock
order without notice.

Since the old-new relationship itself is not protected, we need
some other way to get the lock order correct.

My first attempt to fix this was to extend the "rename" glock to have
a rename-wide reach so it affected both types of renames rather than
today's code which only locks old and new if they're different.
I implemented this with a new i_op called by vfs (vfs_rename) to make
the rename glock serve as a kind of cluster-wide version vfs's
s_vfs_rename_mutex. However, this ended up having a huge performance
penalty for my test.

My second attempt (the patch I posted) was to lock the inodes in
block-number-sort order, because the block number relationships
will never change, regardless of which is old and which is new.
It made no sense to me to reinvent the wheel wrt locking them in
sorted order, so I used gfs2_glock_nq_m which already does that.

Regards,

Bob Peterson

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c
  2019-08-12 13:43     ` Bob Peterson
@ 2019-08-12 13:56       ` Steven Whitehouse
  0 siblings, 0 replies; 4+ messages in thread
From: Steven Whitehouse @ 2019-08-12 13:56 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 12/08/2019 14:43, Bob Peterson wrote:
> ----- Original Message -----
>>> The real problem came with renames, though. Function
>>> gfs2_rename(), which locked a series of inode glocks, did so
>>> in parent-child order due to that patch. But it was still
>>> possible to create circular lock dependencies just by doing the
>>> wrong combination of renames on different nodes. For example:
>>>
>>> Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name)
>>>
>>> a1. Same directory, so rename glock is NOT held
>>> a2. /mnt/gfs2 is locked
>>> a3. Tries to lock sub for rename, but it is locked on node b
>>>
>>> Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1...
>>>           mv /mnt/gfs2/dir1/sub /mnt/gfs2/  ...then move it back)
>>>
>>> b1. Different directory, so rename glock IS held
>>> b2. /mnt/gfs2 is locked
>>> b3. dir1 is locked
>>> b4. sub is moved to dir1 and everything is unlocked
>>> b5. Different directory, so rename glock IS held again
>>> b6. dir1 is locked
>>> b7. Lock for /mnt/gfs2 is requested, but cannot be granted because
>>>       node 1 locked it in step a2.
>> If the parents are being locked before the child, as per the correct
>> locking order, then this cannot happen. The directory in which the child
>> is located should always be locked first, before the child, so that is
>> what protects the operation on a from whatever might be going on, on node b.
>>
>> When you get to step b7, sub is not locked (since it was unlocked in b4)
>> and not locked again. Thus a3 can complete. So this doesn't look like it
>> is the right explanation.
> Hi,
>
> I guess maybe my explanation is lacking.
> It's not so much a relationship between "parent" and "child"
> directories as it is "old" and "new" directories.
>
> The comments for function vfs_rename() explain the situations in which
> this can happen, and have been prevented on a single node through the
> use of s_vfs_rename_mutex. However, that mutex is not cluster-wide,
> which means the relationship of which inode is the "old" and which
> inode is the "new" can change indiscriminately without notice and
> without cluster-wide locking. The whole point of the "a" and "b"
> scenarios was to illustrate that one node can lock "old", then "new",
> but the other node can reverse the roles of those same inodes (which
> is the "old" and which is the "new") and therefore reverse the lock
> order without notice.
>
> Since the old-new relationship itself is not protected, we need
> some other way to get the lock order correct.
>
> My first attempt to fix this was to extend the "rename" glock to have
> a rename-wide reach so it affected both types of renames rather than
> today's code which only locks old and new if they're different.
> I implemented this with a new i_op called by vfs (vfs_rename) to make
> the rename glock serve as a kind of cluster-wide version vfs's
> s_vfs_rename_mutex. However, this ended up having a huge performance
> penalty for my test.
>
> My second attempt (the patch I posted) was to lock the inodes in
> block-number-sort order, because the block number relationships
> will never change, regardless of which is old and which is new.
> It made no sense to me to reinvent the wheel wrt locking them in
> sorted order, so I used gfs2_glock_nq_m which already does that.
>
> Regards,
>
> Bob Peterson

We are doing our best to get rid of the _m glock functions. Sorting 
things in block number order is a bit of a hack and it would be better 
to spend the time reducing the number of glocks involved in each 
operation overall.

I have wondered about the performance issues on the rename glock. Simply 
using that for everything is the obvious easy fix, but perhaps not 
surprising that you've seen some performance issues with that approach. 
I wonder if we can come up with a solution to break up the single rename 
glock into separate glocks using a hashing scheme, or some similar 
system. That way we might get the advantages of both improved speed and 
to retain the parent/child locking.

Either way, changing the lock ordering of lots of other bits of code is 
a non-starter, since then it will be incompatible with the way gfs2 has 
worked since it was created, and also incompatible with the vfs's own 
locking order that is used for local locks too.

Lets see if we can figure out a solution that will just address this 
particular issue on its own,

Steve.




^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-08-12 13:56 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <655197227.7744188.1565376946008.JavaMail.zimbra@redhat.com>
2019-08-09 18:58 ` [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c Bob Peterson
2019-08-12  9:07   ` Steven Whitehouse
2019-08-12 13:43     ` Bob Peterson
2019-08-12 13:56       ` Steven Whitehouse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.