* [Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation
@ 2014-10-20 16:37 Bob Peterson
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps Bob Peterson
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-20 16:37 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On October 8, I posted a GFS2 patch that greatly reduced inter-node
contention for resource group glocks. The patch was called:
"GFS2: Set of distributed preferences for rgrps". It implemented a
new scheme whereby each node in a cluster tries to "keep to itself"
for allocations. This is similar in spirit to GFS1, which used a different scheme to achieve the same goal.
Although the patch sped up GFS2 performance in general, it also caused
more file fragmentation, because each node tended to focus on a smaller
subset of resource groups.
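As a rough illustration of the idea (this is just a sketch; the real selection
logic is in patch 1 below), with N nodes the node that holds journal ID jid
prefers every Nth resource group starting at rgrp number jid. Ignoring the
wrap-around at the end of the rgrp list, that boils down to:

/* Illustration only: is this rgrp one that the node holding journal ID
 * "jid" prefers? num_nodes is the node count reported by the DLM. The
 * kernel code in patch 1 walks the rgrp list instead of using an index,
 * but the resulting distribution is the same. */
static int rgrp_is_preferred(unsigned int rgrp_index,
                             unsigned int jid, unsigned int num_nodes)
{
        return (rgrp_index % num_nodes) == (jid % num_nodes);
}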
Here are run times and file fragmentation extent counts for my favorite
customer application, using a STOCK RHEL7 kernel (no patches):
Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec
Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 744708
EXTENT COUNT FOR OUTPUT FILES = 749868
EXTENT COUNT FOR OUTPUT FILES = 721862
EXTENT COUNT FOR OUTPUT FILES = 635301
EXTENT COUNT FOR OUTPUT FILES = 689263
The times are bad and the fragmentation level is also bad. If I add
just the first patch, "GFS2: Set of distributed preferences for rgrps",
you can see that the performance improves but the fragmentation gets
worse (I only did three iterations this time):
Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec
Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 1011217
EXTENT COUNT FOR OUTPUT FILES = 1025973
EXTENT COUNT FOR OUTPUT FILES = 1070163
So the first patch improves performance by about 25 percent, but file
fragmentation is about 30 percent worse. Some of that performance is
undoubtedly due to the SAN array's buffering hiding our multitude of sins,
and not every customer will have a SAN of that quality. So it's important
to reduce the fragmentation as well, so that the patch doesn't help some
people while hurting others.
Toward this end, I devised three relatively simple patches that greatly
reduce file fragmentation. With all four patches, the numbers are as follows:
Run times:
Run 1 time: 2hr 5min 46sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 4sec
Run 4 time: 2hr 9min 27sec
Run 5 time: 2hr 6min 15sec
Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 330276
EXTENT COUNT FOR OUTPUT FILES = 358939
EXTENT COUNT FOR OUTPUT FILES = 375374
EXTENT COUNT FOR OUTPUT FILES = 383071
EXTENT COUNT FOR OUTPUT FILES = 369269
As you can see, with this combination of four patches, the run times remain
good and so do the file fragmentation levels. The extent counts are roughly
half of what the stock kernel produces, and almost three times better than
with the first patch alone.
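(For reference, the extent counts above are one total per run over the
application's output files. The exact tool isn't shown here; filefrag(8),
or a small FIEMAP-based counter along the lines of the sketch below, would
produce the same kind of numbers. The output line format in the sketch
simply mirrors the lines above.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Count the extents of one file via the FIEMAP ioctl. With
 * fm_extent_count == 0 the kernel only reports how many extents it
 * found, which is all we need here. */
static int count_extents(const char *path)
{
        struct fiemap fm = {
                .fm_length = ~0ULL,             /* map the whole file */
                .fm_flags = FIEMAP_FLAG_SYNC,   /* flush delayed allocations first */
        };
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
                close(fd);
                return -1;
        }
        close(fd);
        return fm.fm_mapped_extents;
}

int main(int argc, char **argv)
{
        int i, n, total = 0;

        for (i = 1; i < argc; i++) {
                n = count_extents(argv[i]);
                if (n < 0) {
                        perror(argv[i]);
                        continue;
                }
                total += n;
        }
        printf("EXTENT COUNT FOR OUTPUT FILES = %d\n", total);
        return 0;
}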
This patch set includes all four patches.
Bob Peterson (4):
GFS2: Set of distributed preferences for rgrps
GFS2: Make block reservations more persistent
GFS2: Only increase rs_sizehint
GFS2: If we use up our block reservation, request more next time
fs/gfs2/file.c | 10 ++------
fs/gfs2/incore.h | 2 ++
fs/gfs2/lock_dlm.c | 2 ++
fs/gfs2/ops_fstype.c | 1 +
fs/gfs2/rgrp.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++----
5 files changed, 71 insertions(+), 13 deletions(-)
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps
2014-10-20 16:37 [Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation Bob Peterson
@ 2014-10-20 16:37 ` Bob Peterson
2014-10-21 9:30 ` Steven Whitehouse
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent Bob Peterson
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Bob Peterson @ 2014-10-20 16:37 UTC (permalink / raw)
To: cluster-devel.redhat.com
This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.
---
fs/gfs2/incore.h | 2 ++
fs/gfs2/lock_dlm.c | 2 ++
fs/gfs2/ops_fstype.c | 1 +
fs/gfs2/rgrp.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++----
4 files changed, 66 insertions(+), 5 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..618d20a 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
#define GFS2_RDF_CHECK 0x10000000 /* check for unlinked inodes */
#define GFS2_RDF_UPTODATE 0x20000000 /* rg is up to date */
#define GFS2_RDF_ERROR 0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED 0x80000000 /* This rgrp is preferred */
#define GFS2_RDF_MASK 0xf0000000 /* mask for internal flags */
spinlock_t rd_rsspin; /* protects reservation related vars */
struct rb_root rd_rstree; /* multi-block reservation tree */
@@ -808,6 +809,7 @@ struct gfs2_sbd {
char sd_table_name[GFS2_FSNAME_LEN];
char sd_proto_name[GFS2_FSNAME_LEN];
+ int sd_nodes;
/* Debugging crud */
unsigned long sd_last_warning;
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 641383a..5aeb03a 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
+ BUG_ON(num_slots == 0);
+ sdp->sd_nodes = num_slots;
/* ensure the ls jid arrays are large enough */
set_recover_size(sdp, slots, num_slots);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index d3eae24..bf3193f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
atomic_set(&sdp->sd_log_freeze, 0);
atomic_set(&sdp->sd_frozen_root, 0);
init_waitqueue_head(&sdp->sd_frozen_root_wait);
+ sdp->sd_nodes = 1;
return sdp;
}
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..50cdba2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
- rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+ rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
if (rgd->rd_data > sdp->sd_max_rg_data)
sdp->sd_max_rg_data = rgd->rd_data;
spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
}
/**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+ struct gfs2_rgrpd *rgd, *first;
+ int i;
+
+ /* Skip an initial number of rgrps, based on this node's journal ID.
+ That should start each node out on its own set. */
+ rgd = gfs2_rgrpd_get_first(sdp);
+ for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+ rgd = gfs2_rgrpd_get_next(rgd);
+ first = rgd;
+
+ do {
+ rgd->rd_flags |= GFS2_RDF_PREFERRED;
+ for (i = 0; i < sdp->sd_nodes; i++) {
+ rgd = gfs2_rgrpd_get_next(rgd);
+ if (rgd == first)
+ break;
+ }
+ } while (rgd != first);
+}
+
+/**
* gfs2_ri_update - Pull in a new resource index from the disk
* @ip: pointer to the rindex inode
*
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
if (error < 0)
return error;
+ set_rgrp_preferences(sdp);
+
sdp->sd_rindex_uptodate = 1;
return 0;
}
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
}
/**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+ struct gfs2_glock *gl = rgd->rd_gl;
+
+ if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+ !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+ !test_bit(GLF_DEMOTE, &gl->gl_flags))
+ return 1;
+ if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+ return 1;
+ return 0;
+}
+
+/**
* gfs2_inplace_reserve - Reserve space in the filesystem
* @ip: the inode to reserve space for
* @ap: the allocation parameters
@@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
rg_locked = 0;
if (skip && skip--)
goto next_rgrp;
- if (!gfs2_rs_active(rs) && (loops < 2) &&
- gfs2_rgrp_used_recently(rs, 1000) &&
- gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
- goto next_rgrp;
+ if (!gfs2_rs_active(rs)) {
+ if (loops == 0 &&
+ !fast_to_acquire(rs->rs_rbm.rgd))
+ goto next_rgrp;
+ if ((loops < 3) &&
+ gfs2_rgrp_used_recently(rs, 1000) &&
+ gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+ goto next_rgrp;
+ }
error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
LM_ST_EXCLUSIVE, flags,
&rs->rs_rgd_gh);
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent
2014-10-20 16:37 [Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation Bob Peterson
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps Bob Peterson
@ 2014-10-20 16:37 ` Bob Peterson
2014-10-21 9:24 ` Steven Whitehouse
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 3/4] GFS2: Only increase rs_sizehint Bob Peterson
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 4/4] GFS2: If we use up our block reservation, request more next time Bob Peterson
3 siblings, 1 reply; 8+ messages in thread
From: Bob Peterson @ 2014-10-20 16:37 UTC (permalink / raw)
To: cluster-devel.redhat.com
Before this patch, whenever a struct file (opened to allow writes) was
closed, the multi-block reservation structure associated with the inode
was deleted. That's a problem, especially when there are multiple writers.
Applications that do open-write-close will suffer from greater levels
of fragmentation and need to re-do work to perform write operations.
This patch removes the reservation deletion from the file close code so
that reservations persist until the inode is deleted.
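For reference, the pattern this targets looks roughly like the sketch below
(the path and sizes are made up; it is just the open-write-close loop, not
the actual customer application):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char record[4096];
        int i, fd;

        memset(record, 'x', sizeof(record));

        for (i = 0; i < 10000; i++) {
                fd = open("/mnt/gfs2/output.dat",
                          O_WRONLY | O_CREAT | O_APPEND, 0644);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (write(fd, record, sizeof(record)) != (ssize_t)sizeof(record)) {
                        perror("write");
                        close(fd);
                        return 1;
                }
                /* Before this patch, this close threw away the inode's
                 * multi-block reservation, so the next pass restarted its
                 * allocation search from scratch. */
                close(fd);
        }
        return 0;
}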
---
fs/gfs2/file.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..2976019 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -616,15 +616,8 @@ static int gfs2_open(struct inode *inode, struct file *file)
static int gfs2_release(struct inode *inode, struct file *file)
{
- struct gfs2_inode *ip = GFS2_I(inode);
-
kfree(file->private_data);
file->private_data = NULL;
-
- if (!(file->f_mode & FMODE_WRITE))
- return 0;
-
- gfs2_rs_delete(ip, &inode->i_writecount);
return 0;
}
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 3/4] GFS2: Only increase rs_sizehint
2014-10-20 16:37 [Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation Bob Peterson
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps Bob Peterson
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent Bob Peterson
@ 2014-10-20 16:37 ` Bob Peterson
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 4/4] GFS2: If we use up our block reservation, request more next time Bob Peterson
3 siblings, 0 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-20 16:37 UTC (permalink / raw)
To: cluster-devel.redhat.com
If an application does a sequence of (1) big write, (2) little write,
we don't necessarily want to reset the size hint based on the smaller
size. The fact that it did any big writes implies it may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
---
fs/gfs2/file.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2976019..5c7a9c1 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
int hint = min_t(size_t, INT_MAX, blks);
- atomic_set(&ip->i_res->rs_sizehint, hint);
+ if (hint > atomic_read(&ip->i_res->rs_sizehint))
+ atomic_set(&ip->i_res->rs_sizehint, hint);
}
/**
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 4/4] GFS2: If we use up our block reservation, request more next time
2014-10-20 16:37 [Cluster-devel] [GFS2 PATCH 0/4] Patches to reduce GFS2 fragmentation Bob Peterson
` (2 preceding siblings ...)
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 3/4] GFS2: Only increase rs_sizehint Bob Peterson
@ 2014-10-20 16:37 ` Bob Peterson
3 siblings, 0 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-20 16:37 UTC (permalink / raw)
To: cluster-devel.redhat.com
If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.
---
fs/gfs2/rgrp.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 50cdba2..265fbab 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2251,6 +2251,9 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
trace_gfs2_rs(rs, TRACE_RS_CLAIM);
if (rs->rs_free && !ret)
goto out;
+ /* We used up our block reservation, so we should
+ reserve more blocks next time. */
+ atomic_add(RGRP_RSRV_MINBLKS, &rs->rs_sizehint);
}
__rs_deltree(rs);
}
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 2/4] GFS2: Make block reservations more persistent Bob Peterson
@ 2014-10-21 9:24 ` Steven Whitehouse
0 siblings, 0 replies; 8+ messages in thread
From: Steven Whitehouse @ 2014-10-21 9:24 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 20/10/14 17:37, Bob Peterson wrote:
> Before this patch, whenever a struct file (opened to allow writes) was
> closed, the multi-block reservation structure associated with the inode
> was deleted. That's a problem, especially when there are multiple writers.
> Applications that do open-write-close will suffer from greater levels
> of fragmentation and need to re-do work to perform write operations.
> This patch removes the reservation deletion from the file close code so
> that reservations persist until the inode is deleted.
> ---
> fs/gfs2/file.c | 7 -------
> 1 file changed, 7 deletions(-)
This doesn't seem like a good plan. If you run something like untar,
does that now leave gaps in the allocations? If there are applications
which are doing open/write/close in a loop, then it seems like it is the
application that needs to be changed, rather than the filesystem,
Steve.
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 7f4ed3d..2976019 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -616,15 +616,8 @@ static int gfs2_open(struct inode *inode, struct file *file)
>
> static int gfs2_release(struct inode *inode, struct file *file)
> {
> - struct gfs2_inode *ip = GFS2_I(inode);
> -
> kfree(file->private_data);
> file->private_data = NULL;
> -
> - if (!(file->f_mode & FMODE_WRITE))
> - return 0;
> -
> - gfs2_rs_delete(ip, &inode->i_writecount);
> return 0;
> }
>
* [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps
2014-10-20 16:37 ` [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps Bob Peterson
@ 2014-10-21 9:30 ` Steven Whitehouse
2014-10-21 12:30 ` Bob Peterson
0 siblings, 1 reply; 8+ messages in thread
From: Steven Whitehouse @ 2014-10-21 9:30 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 20/10/14 17:37, Bob Peterson wrote:
> This patch tries to use the journal numbers to evenly distribute
> which node prefers which resource group for block allocations. This
> is to help performance.
> ---
> fs/gfs2/incore.h | 2 ++
> fs/gfs2/lock_dlm.c | 2 ++
> fs/gfs2/ops_fstype.c | 1 +
> fs/gfs2/rgrp.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 4 files changed, 66 insertions(+), 5 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 39e7e99..618d20a 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -97,6 +97,7 @@ struct gfs2_rgrpd {
> #define GFS2_RDF_CHECK 0x10000000 /* check for unlinked inodes */
> #define GFS2_RDF_UPTODATE 0x20000000 /* rg is up to date */
> #define GFS2_RDF_ERROR 0x40000000 /* error in rg */
> +#define GFS2_RDF_PREFERRED 0x80000000 /* This rgrp is preferred */
> #define GFS2_RDF_MASK 0xf0000000 /* mask for internal flags */
> spinlock_t rd_rsspin; /* protects reservation related vars */
> struct rb_root rd_rstree; /* multi-block reservation tree */
> @@ -808,6 +809,7 @@ struct gfs2_sbd {
> char sd_table_name[GFS2_FSNAME_LEN];
> char sd_proto_name[GFS2_FSNAME_LEN];
>
> + int sd_nodes;
> /* Debugging crud */
>
> unsigned long sd_last_warning;
> diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
> index 641383a..5aeb03a 100644
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -1113,6 +1113,8 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
> struct gfs2_sbd *sdp = arg;
> struct lm_lockstruct *ls = &sdp->sd_lockstruct;
>
> + BUG_ON(num_slots == 0);
> + sdp->sd_nodes = num_slots;
> /* ensure the ls jid arrays are large enough */
> set_recover_size(sdp, slots, num_slots);
>
I assume that you are trying to get the number of nodes here? I'm not
sure that this is a good way to do that. I would expect that with older
userspace, num_slots might indeed be 0 so that is something that needs
to be checked. Also, I suspect that you want to know how many nodes
could be in the cluster, rather than how many there are now, otherwise
there will be odd results when mounting the cluster.
Counting the number of journals would be simpler I think, and less
likely to give odd results.
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index d3eae24..bf3193f 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -134,6 +134,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
> atomic_set(&sdp->sd_log_freeze, 0);
> atomic_set(&sdp->sd_frozen_root, 0);
> init_waitqueue_head(&sdp->sd_frozen_root_wait);
> + sdp->sd_nodes = 1;
>
> return sdp;
> }
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 7474c41..50cdba2 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
> rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
> rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
> rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
> - rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
> + rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
> if (rgd->rd_data > sdp->sd_max_rg_data)
> sdp->sd_max_rg_data = rgd->rd_data;
> spin_lock(&sdp->sd_rindex_spin);
> @@ -955,6 +955,36 @@ fail:
> }
>
> /**
> + * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
> + * @sdp: the GFS2 superblock
> + *
> + * The purpose of this function is to select a subset of the resource groups
> + * and mark them as PREFERRED. We do it in such a way that each node prefers
> + * to use a unique set of rgrps to minimize glock contention.
> + */
> +static void set_rgrp_preferences(struct gfs2_sbd *sdp)
> +{
> + struct gfs2_rgrpd *rgd, *first;
> + int i;
> +
> + /* Skip an initial number of rgrps, based on this node's journal ID.
> + That should start each node out on its own set. */
> + rgd = gfs2_rgrpd_get_first(sdp);
> + for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
> + rgd = gfs2_rgrpd_get_next(rgd);
> + first = rgd;
> +
> + do {
> + rgd->rd_flags |= GFS2_RDF_PREFERRED;
> + for (i = 0; i < sdp->sd_nodes; i++) {
> + rgd = gfs2_rgrpd_get_next(rgd);
> + if (rgd == first)
> + break;
> + }
> + } while (rgd != first);
> +}
> +
> +/**
> * gfs2_ri_update - Pull in a new resource index from the disk
> * @ip: pointer to the rindex inode
> *
> @@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
> if (error < 0)
> return error;
>
> + set_rgrp_preferences(sdp);
> +
> sdp->sd_rindex_uptodate = 1;
> return 0;
> }
> @@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
> }
>
> /**
> + * fast_to_acquire - determine if a resource group will be fast to acquire
> + *
> + * If this is one of our preferred rgrps, it should be quicker to acquire,
> + * because we tried to set ourselves up as dlm lock master.
> + */
> +static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
> +{
> + struct gfs2_glock *gl = rgd->rd_gl;
> +
> + if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
> + !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
> + !test_bit(GLF_DEMOTE, &gl->gl_flags))
> + return 1;
> + if (rgd->rd_flags & GFS2_RDF_PREFERRED)
> + return 1;
> + return 0;
> +}
> +
> +/**
> * gfs2_inplace_reserve - Reserve space in the filesystem
> * @ip: the inode to reserve space for
> * @ap: the allocation parameters
> @@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
> rg_locked = 0;
> if (skip && skip--)
> goto next_rgrp;
> - if (!gfs2_rs_active(rs) && (loops < 2) &&
> - gfs2_rgrp_used_recently(rs, 1000) &&
> - gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> - goto next_rgrp;
> + if (!gfs2_rs_active(rs)) {
> + if (loops == 0 &&
> + !fast_to_acquire(rs->rs_rbm.rgd))
> + goto next_rgrp;
> + if ((loops < 3) &&
> + gfs2_rgrp_used_recently(rs, 1000) &&
> + gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> + goto next_rgrp;
> + }
This existing gfs2_rgrp_congested() function should be giving the answer
as to which rgrp should be preferred, so the question is whether that is
giving the wrong answer for some reason? I think that needs to be looked
into and fixed if required,
Steve.
> error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
> LM_ST_EXCLUSIVE, flags,
> &rs->rs_rgd_gh);
* [Cluster-devel] [GFS2 PATCH 1/4] GFS2: Set of distributed preferences for rgrps
2014-10-21 9:30 ` Steven Whitehouse
@ 2014-10-21 12:30 ` Bob Peterson
0 siblings, 0 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-21 12:30 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> I assume that you are trying to get the number of nodes here? I'm not
> sure that this is a good way to do that. I would expect that with older
> userspace, num_slots might indeed be 0 so that is something that needs
> to be checked. Also, I suspect that you want to know how many nodes
> could be in the cluster, rather than how many there are now, otherwise
> there will be odd results when mounting the cluster.
>
> Counting the number of journals would be simpler I think, and less
> likely to give odd results.
(snip)
My original version used the number of journals, which is fairly easy.
The problem is, customers often allocate extra journals to their file
system, anticipating that they will add more nodes in the future. Case
in point: our own performance group has a four-node cluster but
allocated 5 journals at mkfs time. Doing so tends to leave large gaps that
will never be used until space gets low, and then it's chaos, with all
the nodes trying to use those shunned rgrps at the same time.
I know people don't need to do that anymore, and it's a carry-over from
the GFS1 days, but people still do it.
I don't know of a better way to determine the number of nodes. The DLM
would know, but it doesn't share that information in any way other
than the recovery code that I'm currently using with this patch.
I'm open to suggestions if there's a better way.
> This existing gfs2_rgrp_congested() function should be giving the answer
> as to which rgrp should be preferred, so the question is whether that is
> giving the wrong answer for some reason? I think that needs to be looked
> into and fixed if required,
The trouble is this:
The gfs2_rgrp_congested() function tells you if the rgrp is congested
at any given moment in time, and that's highly variable. What tends to
happen is that all the nodes create a bunch of files in a haphazard fashion,
as part of initialization. At the time, each node (accurately) sees that
there is _currently_ no congestion, so they all decide to use rgrp X.
They all make big multi-block reservations in rgrp X. Then they all
proceed to fight over who has the lock for rgrp X. Two reasons:
(a) when the initial files are set up, there are too few samples to get
any degree of accuracy with regard to congestion, and (b) there really
ISN'T any contention during setup because no one has begun to do any
serious writing: there's a trickle-in effect. The problem is that once
you've chosen a rgrp, you tend to stick with it, due to reservations and
due to the way "goal blocks" work, both of which preempt searching for
a different rgrp.
Ordinarily, you would think the problem would get better (and therefore
faster) with time because there are more samples, and better information
regarding which rgrps really are congested, but in actual practice, it
doesn't work like that. All the nodes continue to fight over the same
rgrps. I suspect this is because in many use cases, workloads are evenly
distributed to the worker nodes, so they all go through phases of
(1) setup, (2) analysis of data, (3) writing, and they often hit the
same phases at roughly the same times (because of the even distribution
of the workload).
Experience has shown (both in GFS1 from prior years and GFS2) that
letting each node pick a unique subset of rgrps results in the least
amount of contention.
Regards,
Bob Peterson
Red Hat File Systems