* [Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation
@ 2014-10-24 17:49 Bob Peterson
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps Bob Peterson
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-24 17:49 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
This is a revised version of my patches to improve GFS2 performance and
reduce GFS2 fragmentation. As suggested by Steve Whitehouse, the first patch
has been modified to base its distribution of resource groups on the number
of journals.
Also as suggested, the patch that prevented file close from deleting rgrp
reservations has been removed from the set, leaving only three patches.
Removing that patch caused fragmentation to climb 15 percent, but it is
still much better (41 percent) than the stock kernel. I agree that
use cases like untar would suffer much worse fragmentation if that patch
were to remain.
I've re-tested with these three patches and found that performance and
fragmentation are still good. Here are the numbers:
STOCK RHEL7 kernel (none of these patches):
Run times:
Run 1 time: 2hr 40min 33sec
Run 2 time: 2hr 39min 52sec
Run 3 time: 2hr 39min 31sec
Run 4 time: 2hr 33min 57sec
Run 5 time: 2hr 41min 6sec
Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 744708
EXTENT COUNT FOR OUTPUT FILES = 749868
EXTENT COUNT FOR OUTPUT FILES = 721862
EXTENT COUNT FOR OUTPUT FILES = 635301
EXTENT COUNT FOR OUTPUT FILES = 689263 (Average 708200)
Both the run times and the fragmentation level are bad. If I add
just the first patch, "GFS2: Set of distributed preferences for rgrps",
performance improves but fragmentation gets worse (I only ran three
iterations this time):
Run times:
Run 1 time: 2hr 2min 47sec
Run 2 time: 2hr 8min 37sec
Run 3 time: 2hr 10min 0sec
Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 1011217
EXTENT COUNT FOR OUTPUT FILES = 1025973
EXTENT COUNT FOR OUTPUT FILES = 1070163
With these three patches, both performance and fragmentation are good:
Run times:
Run 1 time: 2hr 9min 49sec
Run 2 time: 2hr 10min 15sec
Run 3 time: 2hr 8min 35sec
Run 4 time: 2hr 15min 15sec
Run 5 time: 2hr 7min 4sec
Run 6 time: 2hr 8min 8sec
Run 7 time: 2hr 7min 54sec
Total file extents (File fragmentation):
EXTENT COUNT FOR OUTPUT FILES = 406366
EXTENT COUNT FOR OUTPUT FILES = 432978
EXTENT COUNT FOR OUTPUT FILES = 428736
EXTENT COUNT FOR OUTPUT FILES = 419736
EXTENT COUNT FOR OUTPUT FILES = 419040
EXTENT COUNT FOR OUTPUT FILES = 422774
EXTENT COUNT FOR OUTPUT FILES = 409281 (Average 419844)
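For anyone who wants to check the arithmetic behind the 41 percent figure,
here is a tiny stand-alone program (not part of the patch set; it just
re-uses the extent counts quoted above) that reproduces the averages and
the reduction:

#include <stdio.h>

int main(void)
{
	/* extent counts copied from the runs listed above */
	long stock[] = { 744708, 749868, 721862, 635301, 689263 };
	long patched[] = { 406366, 432978, 428736, 419736,
			   419040, 422774, 409281 };
	double sum_stock = 0, sum_patched = 0;
	unsigned int i;

	for (i = 0; i < sizeof(stock) / sizeof(stock[0]); i++)
		sum_stock += stock[i];
	for (i = 0; i < sizeof(patched) / sizeof(patched[0]); i++)
		sum_patched += patched[i];

	double avg_stock = sum_stock / 5;	/* ~708200 */
	double avg_patched = sum_patched / 7;	/* ~419844 */

	printf("stock avg %.0f, patched avg %.0f, reduction %.0f%%\n",
	       avg_stock, avg_patched,
	       100.0 * (1.0 - avg_patched / avg_stock));	/* ~41% */
	return 0;
}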
Regards,
Bob Peterson
Red Hat File Systems
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
Bob Peterson (3):
GFS2: Set of distributed preferences for rgrps
GFS2: Only increase rs_sizehint
GFS2: If we use up our block reservation, request more next time
fs/gfs2/file.c | 3 ++-
fs/gfs2/incore.h | 1 +
fs/gfs2/rgrp.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
fs/gfs2/rgrp.h | 1 +
4 files changed, 68 insertions(+), 6 deletions(-)
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
2014-10-24 17:49 [Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation Bob Peterson
@ 2014-10-24 17:49 ` Bob Peterson
2014-10-27 10:27 ` Steven Whitehouse
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 2/3] GFS2: Only increase rs_sizehint Bob Peterson
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 3/3] GFS2: If we use up our block reservation, request more next time Bob Peterson
2 siblings, 1 reply; 8+ messages in thread
From: Bob Peterson @ 2014-10-24 17:49 UTC (permalink / raw)
To: cluster-devel.redhat.com
This patch uses the journal numbers to distribute resource group
preferences evenly across the nodes, so that each node prefers a
different set of rgrps for its block allocations. This is to help
performance.
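To make the intended distribution easier to picture, here is a small
user-space sketch of the striding that set_rgrp_preferences() performs.
The rgrp and journal counts are made-up example values, and the real code
walks rgrps with gfs2_rgrpd_get_first()/gfs2_rgrpd_get_next() rather than
by array index:

#include <stdio.h>

int main(void)
{
	const int num_rgrps = 16;	/* assumed rgrp count   */
	const int num_journals = 4;	/* assumed sd_journals  */
	int jid, i;

	for (jid = 0; jid < num_journals; jid++) {
		printf("jid %d prefers rgrps:", jid);
		/* start at our journal id, then take every Nth rgrp */
		for (i = jid; i < num_rgrps; i += num_journals)
			printf(" %d", i);
		printf("\n");
	}
	return 0;
}

With 4 journals, node 0 prefers rgrps 0, 4, 8, ..., node 1 prefers
1, 5, 9, ..., and so on, so each node ends up with its own subset.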
---
fs/gfs2/incore.h | 1 +
fs/gfs2/rgrp.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..1b89918 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
#define GFS2_RDF_CHECK 0x10000000 /* check for unlinked inodes */
#define GFS2_RDF_UPTODATE 0x20000000 /* rg is up to date */
#define GFS2_RDF_ERROR 0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED 0x80000000 /* This rgrp is preferred */
#define GFS2_RDF_MASK 0xf0000000 /* mask for internal flags */
spinlock_t rd_rsspin; /* protects reservation related vars */
struct rb_root rd_rstree; /* multi-block reservation tree */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..f65e56b 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
- rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+ rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
if (rgd->rd_data > sdp->sd_max_rg_data)
sdp->sd_max_rg_data = rgd->rd_data;
spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
}
/**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+ struct gfs2_rgrpd *rgd, *first;
+ int i;
+
+ /* Skip an initial number of rgrps, based on this node's journal ID.
+ That should start each node out on its own set. */
+ rgd = gfs2_rgrpd_get_first(sdp);
+ for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+ rgd = gfs2_rgrpd_get_next(rgd);
+ first = rgd;
+
+ do {
+ rgd->rd_flags |= GFS2_RDF_PREFERRED;
+ for (i = 0; i < sdp->sd_journals; i++) {
+ rgd = gfs2_rgrpd_get_next(rgd);
+ if (rgd == first)
+ break;
+ }
+ } while (rgd != first);
+}
+
+/**
* gfs2_ri_update - Pull in a new resource index from the disk
* @ip: pointer to the rindex inode
*
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
if (error < 0)
return error;
+ set_rgrp_preferences(sdp);
+
sdp->sd_rindex_uptodate = 1;
return 0;
}
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
}
/**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+ struct gfs2_glock *gl = rgd->rd_gl;
+
+ if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+ !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+ !test_bit(GLF_DEMOTE, &gl->gl_flags))
+ return 1;
+ if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+ return 1;
+ return 0;
+}
+
+/**
* gfs2_inplace_reserve - Reserve space in the filesystem
* @ip: the inode to reserve space for
* @ap: the allocation parameters
@@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
rg_locked = 0;
if (skip && skip--)
goto next_rgrp;
- if (!gfs2_rs_active(rs) && (loops < 2) &&
- gfs2_rgrp_used_recently(rs, 1000) &&
- gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
- goto next_rgrp;
+ if (!gfs2_rs_active(rs)) {
+ if (loops == 0 &&
+ !fast_to_acquire(rs->rs_rbm.rgd))
+ goto next_rgrp;
+ if ((loops < 3) &&
+ gfs2_rgrp_used_recently(rs, 1000) &&
+ gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+ goto next_rgrp;
+ }
error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
LM_ST_EXCLUSIVE, flags,
&rs->rs_rgd_gh);
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 2/3] GFS2: Only increase rs_sizehint
2014-10-24 17:49 [Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation Bob Peterson
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps Bob Peterson
@ 2014-10-24 17:49 ` Bob Peterson
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 3/3] GFS2: If we use up our block reservation, request more next time Bob Peterson
2 siblings, 0 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-24 17:49 UTC (permalink / raw)
To: cluster-devel.redhat.com
If an application does a sequence of (1) big write, (2) little write,
we don't necessarily want to reset the size hint based on the smaller
size. The fact that it did any big writes implies it may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few writes were small. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
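As a rough illustration of the effect, here is a user-space toy, not the
kernel code; the 4K block size and the write sizes are just example
values, and the real rs_sizehint is an atomic_t:

#include <stdio.h>

static int rs_sizehint;	/* stands in for ip->i_res->rs_sizehint */

static void size_hint(size_t size)
{
	const size_t bsize = 4096;			/* assumed block size */
	size_t blks = (size + bsize - 1) / bsize;	/* round up to blocks */

	if ((int)blks > rs_sizehint)	/* grow only, never shrink */
		rs_sizehint = blks;
}

int main(void)
{
	size_t writes[] = { 8 << 20, 4096, 64 << 10 };	/* 8MB, 4KB, 64KB */
	unsigned int i;

	for (i = 0; i < sizeof(writes) / sizeof(writes[0]); i++) {
		size_hint(writes[i]);
		printf("after %zu byte write, hint = %d blocks\n",
		       writes[i], rs_sizehint);
	}
	return 0;
}

The 8MB write sets the hint to 2048 blocks, and the later small writes
leave it there instead of shrinking it back down.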
---
fs/gfs2/file.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7f4ed3d..f64084b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -337,7 +337,8 @@ static void gfs2_size_hint(struct file *filep, loff_t offset, size_t size)
size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> sdp->sd_sb.sb_bsize_shift;
int hint = min_t(size_t, INT_MAX, blks);
- atomic_set(&ip->i_res->rs_sizehint, hint);
+ if (hint > atomic_read(&ip->i_res->rs_sizehint))
+ atomic_set(&ip->i_res->rs_sizehint, hint);
}
/**
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 3/3] GFS2: If we use up our block reservation, request more next time
2014-10-24 17:49 [Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation Bob Peterson
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps Bob Peterson
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 2/3] GFS2: Only increase rs_sizehint Bob Peterson
@ 2014-10-24 17:49 ` Bob Peterson
2 siblings, 0 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-24 17:49 UTC (permalink / raw)
To: cluster-devel.redhat.com
If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.
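A toy model of the ratchet, with made-up starting values (the real code
does atomic_add(RGRP_RSRV_ADDBLKS, &rs->rs_sizehint) when the reservation
is emptied):

#include <stdio.h>

#define RGRP_RSRV_ADDBLKS 64	/* as defined in rgrp.h below */

int main(void)
{
	int sizehint = 32;	/* assumed starting hint, in blocks */
	int i;

	for (i = 0; i < 3; i++) {
		/* pretend the whole reservation was consumed ... */
		sizehint += RGRP_RSRV_ADDBLKS;
		printf("reservation %d exhausted, next hint = %d blocks\n",
		       i + 1, sizehint);
	}
	return 0;
}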
---
fs/gfs2/rgrp.c | 3 +++
fs/gfs2/rgrp.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index f65e56b..d4b9d93 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2251,6 +2251,9 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
trace_gfs2_rs(rs, TRACE_RS_CLAIM);
if (rs->rs_free && !ret)
goto out;
+ /* We used up our block reservation, so we should
+ reserve more blocks next time. */
+ atomic_add(RGRP_RSRV_ADDBLKS, &rs->rs_sizehint);
}
__rs_deltree(rs);
}
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..b104f4a 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -20,6 +20,7 @@
*/
#define RGRP_RSRV_MINBYTES 8
#define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_ADDBLKS 64
struct gfs2_rgrpd;
struct gfs2_sbd;
--
1.9.3
* [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
2014-10-24 17:49 ` [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps Bob Peterson
@ 2014-10-27 10:27 ` Steven Whitehouse
2014-10-27 14:07 ` Bob Peterson
0 siblings, 1 reply; 8+ messages in thread
From: Steven Whitehouse @ 2014-10-27 10:27 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 24/10/14 18:49, Bob Peterson wrote:
> This patch uses the journal numbers to distribute resource group
> preferences evenly across the nodes, so that each node prefers a
> different set of rgrps for its block allocations. This is to help
> performance.
> ---
> fs/gfs2/incore.h | 1 +
> fs/gfs2/rgrp.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 62 insertions(+), 5 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 39e7e99..1b89918 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -97,6 +97,7 @@ struct gfs2_rgrpd {
> #define GFS2_RDF_CHECK 0x10000000 /* check for unlinked inodes */
> #define GFS2_RDF_UPTODATE 0x20000000 /* rg is up to date */
> #define GFS2_RDF_ERROR 0x40000000 /* error in rg */
> +#define GFS2_RDF_PREFERRED 0x80000000 /* This rgrp is preferred */
> #define GFS2_RDF_MASK 0xf0000000 /* mask for internal flags */
> spinlock_t rd_rsspin; /* protects reservation related vars */
> struct rb_root rd_rstree; /* multi-block reservation tree */
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 7474c41..f65e56b 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
> rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
> rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
> rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
> - rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
> + rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
> if (rgd->rd_data > sdp->sd_max_rg_data)
> sdp->sd_max_rg_data = rgd->rd_data;
> spin_lock(&sdp->sd_rindex_spin);
> @@ -955,6 +955,36 @@ fail:
> }
>
> /**
> + * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
> + * @sdp: the GFS2 superblock
> + *
> + * The purpose of this function is to select a subset of the resource groups
> + * and mark them as PREFERRED. We do it in such a way that each node prefers
> + * to use a unique set of rgrps to minimize glock contention.
> + */
> +static void set_rgrp_preferences(struct gfs2_sbd *sdp)
> +{
> + struct gfs2_rgrpd *rgd, *first;
> + int i;
> +
> + /* Skip an initial number of rgrps, based on this node's journal ID.
> + That should start each node out on its own set. */
> + rgd = gfs2_rgrpd_get_first(sdp);
> + for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
> + rgd = gfs2_rgrpd_get_next(rgd);
> + first = rgd;
> +
> + do {
> + rgd->rd_flags |= GFS2_RDF_PREFERRED;
> + for (i = 0; i < sdp->sd_journals; i++) {
> + rgd = gfs2_rgrpd_get_next(rgd);
> + if (rgd == first)
> + break;
> + }
> + } while (rgd != first);
> +}
> +
> +/**
> * gfs2_ri_update - Pull in a new resource index from the disk
> * @ip: pointer to the rindex inode
> *
> @@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
> if (error < 0)
> return error;
>
> + set_rgrp_preferences(sdp);
> +
> sdp->sd_rindex_uptodate = 1;
> return 0;
> }
> @@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
> }
>
> /**
> + * fast_to_acquire - determine if a resource group will be fast to acquire
> + *
> + * If this is one of our preferred rgrps, it should be quicker to acquire,
> + * because we tried to set ourselves up as dlm lock master.
> + */
> +static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
> +{
> + struct gfs2_glock *gl = rgd->rd_gl;
> +
> + if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
> + !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
> + !test_bit(GLF_DEMOTE, &gl->gl_flags))
> + return 1;
> + if (rgd->rd_flags & GFS2_RDF_PREFERRED)
> + return 1;
> + return 0;
> +}
> +
> +/**
> * gfs2_inplace_reserve - Reserve space in the filesystem
> * @ip: the inode to reserve space for
> * @ap: the allocation parameters
> @@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
> rg_locked = 0;
> if (skip && skip--)
> goto next_rgrp;
> - if (!gfs2_rs_active(rs) && (loops < 2) &&
> - gfs2_rgrp_used_recently(rs, 1000) &&
> - gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> - goto next_rgrp;
> + if (!gfs2_rs_active(rs)) {
> + if (loops == 0 &&
> + !fast_to_acquire(rs->rs_rbm.rgd))
> + goto next_rgrp;
> + if ((loops < 3) &&
> + gfs2_rgrp_used_recently(rs, 1000) &&
> + gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> + goto next_rgrp;
> + }
This makes no sense: we end the loop when loops == 3, so these
conditions will be applied in every case, which is not what we want. We
must always end up doing a search of every rgrp in the worst case, in
order that if there is some space left somewhere, we will eventually
find it.
Definitely better wrt figuring out which rgrps to prefer, but I'm not
yet convinced about this logic. The whole point of the congestion logic
is to figure out ahead of time whether it will take a long time to
access that rgrp, so it seems that this is not quite right; otherwise
there should be no need to bypass it like this. The fast_to_acquire
logic should at least be merged into the rgrp_congested logic, possibly
by just reducing the threshold at which congestion is measured.
It might be useful to introduce a tracepoint for when we reject an rgrp
during allocation, with a reason as to why it was rejected, so that it
is easier to see what's going on here,
Steve.
> error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
> LM_ST_EXCLUSIVE, flags,
> &rs->rs_rgd_gh);
* [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
2014-10-27 10:27 ` Steven Whitehouse
@ 2014-10-27 14:07 ` Bob Peterson
2014-10-28 13:34 ` Steven Whitehouse
0 siblings, 1 reply; 8+ messages in thread
From: Bob Peterson @ 2014-10-27 14:07 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> > + if ((loops < 3) &&
> > + gfs2_rgrp_used_recently(rs, 1000) &&
> > + gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> > + goto next_rgrp;
> > + }
> This makes no sense, we end the loop when loops == 3 so that these
> conditions will be applied in every case which is not what we want. We
> must always end up doing a search of every rgrp in the worst case, in
> order that if there is some space left somewhere, we will eventually
> find it.
>
> Definitely better wrt figuring out which rgrps to prefer, but I'm not
> yet convinced about this logic. The whole point of the congestion logic
> is to figure out ahead of time, whether it will take a long time to
> access that rgrp, so it seems that this is not quite right, otherwise
> there should be no need to bypass it like this. The fast_to_acquire
> logic should at least by merged into the rgrp_congested logic, possibly
> by just reducing the threshold at which congestion is measured.
>
> It might be useful to introduce a tracepoint for when we reject and rgrp
> during allocation, with a reason as to why it was rejected, so that it
> is easier to see whats going on here,
>
> Steve.
Hi,
Sigh. You're right; good catch. The problem is that I've made more than 30
attempts at getting this right in my git tree, each of which has
up to 15 patches for various things. Somewhere around iteration 20, I
dropped an important change. My intent was always to add another layer
of rgrp criteria, so loops would be 4 rather than 3. I had done this
in a different patch, but it got dropped by accident. The three original
layers are as follows:
loop 0: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion.
Only accept rgrps for which we can get a multi-block reservation.
loop 1: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion. Accept rgrps that have
enough free space, even if we can't get a reservation.
loop 2: Don't ever reject rgrps because we're out of ideal conditions.
The new scheme was intended to add a new layer 0 which only accepts rgrps
within a preferred subset of rgrps. In other words:
loop 0: Reject rgrps that aren't in our preferred subset of rgrps.
Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion.
Only accept rgrps for which we can get a multi-block reservation.
loop 1: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion.
Only accept rgrps for which we can get a multi-block reservation.
loop 2: Reject rgrps that are likely congested (based on past history)
and rgrps where we just found congestion. Accept any rgrp that has
enough free space, even if we can't get a reservation.
loop 3: Don't ever reject rgrps because we're out of ideal conditions.
But is 4 loops too many? I could combine 0 and 1, and in fact, today's code
accidentally does just that. The mistake was probably that I had been
experimenting with 3 versus 4 layers and had switched them back and forth
a few times for various tests.
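For clarity, here is a stand-alone toy model of the intended four-pass
scheme (it assumes the loop limit is raised so that loops reaches 3; the
struct fields and helper names are invented for the illustration and are
not the kernel structures):

#include <stdio.h>
#include <stdbool.h>

struct toy_rgrp {
	bool preferred;		/* in this node's preferred subset     */
	bool congested;		/* likely or recently found congested  */
	bool reservation_ok;	/* a multi-block reservation would fit */
	bool has_space;		/* enough free blocks at all           */
};

static bool acceptable(const struct toy_rgrp *r, int loops)
{
	if (loops == 0 && !r->preferred)
		return false;		/* pass 0: preferred subset only   */
	if (loops < 3 && r->congested)
		return false;		/* passes 0-2: avoid congestion    */
	if (loops < 2 && !r->reservation_ok)
		return false;		/* passes 0-1: need a reservation  */
	return r->has_space;		/* pass 3: anything with space     */
}

int main(void)
{
	struct toy_rgrp r = { .preferred = false, .congested = false,
			      .reservation_ok = false, .has_space = true };
	int loops;

	for (loops = 0; loops <= 3; loops++)
		printf("pass %d: %s\n", loops,
		       acceptable(&r, loops) ? "accept" : "reject");
	return 0;
}

For the example rgrp (not preferred, no reservation possible, but with
free space), it rejects on passes 0 and 1 and accepts on pass 2.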
Regards,
Bob Peterson
Red Hat File Systems
* [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
2014-10-27 14:07 ` Bob Peterson
@ 2014-10-28 13:34 ` Steven Whitehouse
0 siblings, 0 replies; 8+ messages in thread
From: Steven Whitehouse @ 2014-10-28 13:34 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 27/10/14 14:07, Bob Peterson wrote:
> ----- Original Message -----
>>> + if ((loops < 3) &&
>>> + gfs2_rgrp_used_recently(rs, 1000) &&
>>> + gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
>>> + goto next_rgrp;
>>> + }
>> This makes no sense: we end the loop when loops == 3, so these
>> conditions will be applied in every case, which is not what we want. We
>> must always end up doing a search of every rgrp in the worst case, in
>> order that if there is some space left somewhere, we will eventually
>> find it.
>>
>> Definitely better wrt figuring out which rgrps to prefer, but I'm not
>> yet convinced about this logic. The whole point of the congestion logic
>> is to figure out ahead of time whether it will take a long time to
>> access that rgrp, so it seems that this is not quite right; otherwise
>> there should be no need to bypass it like this. The fast_to_acquire
>> logic should at least be merged into the rgrp_congested logic, possibly
>> by just reducing the threshold at which congestion is measured.
>>
>> It might be useful to introduce a tracepoint for when we reject an rgrp
>> during allocation, with a reason as to why it was rejected, so that it
>> is easier to see what's going on here,
>>
>> Steve.
> Hi,
>
> Sigh. You're right: Good catch. The problem is that I've done more than 30
> attempts at trying to get this right in my git tree, each of which has
> up to 15 patches for various things. Somewhere around iteration 20, I
> dropped an important change. My intent was always to add another layer
> of rgrp criteria, so loops would be 4 rather than 3. I had done this
> with a different patch, but it got dropped by accident. The 3 original
> layers are as follows:
>
> loop 0: Reject rgrps that are likely congested (based on past history)
> and rgrps where we just found congestion.
> Only accept rgrps for which we can get a multi-block reservation.
> loop 1: Reject rgrps that are likely congested (based on past history)
> and rgrps where we just found congestion. Accept rgrps that have
> enough free space, even if we can't get a reservation.
> loop 2: Don't ever reject rgrps because we're out of ideal conditions.
That is not how it is supposed to work. Loop 0 is the one where we try
to avoid rgrps which are congested. Loops 1 and 2 are the same in that
both are supposed to do a full scan of the rgrps. The only reason for
loop 2 is that we flush out any unlinked inodes that may have
accumulated between loop 1 and loop 2, but otherwise they should be
identical.
> The new scheme was intended to add a new layer 0 which only accepts rgrps
> within a preferred subset of rgrps. In other words:
>
> loop 0: Reject rgrps that aren't in our preferred subset of rgrps.
> Reject rgrps that are likely congested (based on past history)
> and rgrps where we just found congestion.
> Only accept rgrps for which we can get a multi-block reservation.
> loop 1: Reject rgrps that are likely congested (based on past history)
> and rgrps where we just found congestion.
> Only accept rgrps for which we can get a multi-block reservation.
> loop 2: Reject rgrps that are likely congested (based on past history)
> and rgrps where we just found congestion. Accept any rgrp that has
> enough free space, even if we can't get a reservation.
> loop 3: Don't ever reject rgrps because we're out of ideal conditions.
>
> But is 4 loops too many? I could combine 0 and 1, and in fact, today's code
> accidentally does just that. The mistake was probably that I had been
> experimenting with 3 versus 4 layers and had switched them back and forth
> a few times for various tests.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
Yes, definitely too many. If we are looping that many times, it suggests
that something is wrong with the way in which we are searching for
rgrps. It would be better to use fewer loops if at all possible, rather
than more, since this looping will be very slow,
Steve.
* [Cluster-devel] [GFS2 PATCH 1/3] GFS2: Set of distributed preferences for rgrps
2014-10-29 13:02 [Cluster-devel] [GFS2 PATCH 0/3] Patches to reduce GFS2 fragmentation Bob Peterson
@ 2014-10-29 13:02 ` Bob Peterson
0 siblings, 0 replies; 8+ messages in thread
From: Bob Peterson @ 2014-10-29 13:02 UTC (permalink / raw)
To: cluster-devel.redhat.com
This patch uses the journal numbers to distribute resource group
preferences evenly across the nodes, so that each node prefers a
different set of rgrps for its block allocations. This is to help
performance.
---
fs/gfs2/incore.h | 1 +
fs/gfs2/rgrp.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..1b89918 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -97,6 +97,7 @@ struct gfs2_rgrpd {
#define GFS2_RDF_CHECK 0x10000000 /* check for unlinked inodes */
#define GFS2_RDF_UPTODATE 0x20000000 /* rg is up to date */
#define GFS2_RDF_ERROR 0x40000000 /* error in rg */
+#define GFS2_RDF_PREFERRED 0x80000000 /* This rgrp is preferred */
#define GFS2_RDF_MASK 0xf0000000 /* mask for internal flags */
spinlock_t rd_rsspin; /* protects reservation related vars */
struct rb_root rd_rstree; /* multi-block reservation tree */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..f4e4a0c 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -936,7 +936,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd->rd_gl->gl_vm.start = rgd->rd_addr * bsize;
rgd->rd_gl->gl_vm.end = rgd->rd_gl->gl_vm.start + (rgd->rd_length * bsize) - 1;
rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;
- rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
+ rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
if (rgd->rd_data > sdp->sd_max_rg_data)
sdp->sd_max_rg_data = rgd->rd_data;
spin_lock(&sdp->sd_rindex_spin);
@@ -955,6 +955,36 @@ fail:
}
/**
+ * set_rgrp_preferences - Run all the rgrps, selecting some we prefer to use
+ * @sdp: the GFS2 superblock
+ *
+ * The purpose of this function is to select a subset of the resource groups
+ * and mark them as PREFERRED. We do it in such a way that each node prefers
+ * to use a unique set of rgrps to minimize glock contention.
+ */
+static void set_rgrp_preferences(struct gfs2_sbd *sdp)
+{
+ struct gfs2_rgrpd *rgd, *first;
+ int i;
+
+ /* Skip an initial number of rgrps, based on this node's journal ID.
+ That should start each node out on its own set. */
+ rgd = gfs2_rgrpd_get_first(sdp);
+ for (i = 0; i < sdp->sd_lockstruct.ls_jid; i++)
+ rgd = gfs2_rgrpd_get_next(rgd);
+ first = rgd;
+
+ do {
+ rgd->rd_flags |= GFS2_RDF_PREFERRED;
+ for (i = 0; i < sdp->sd_journals; i++) {
+ rgd = gfs2_rgrpd_get_next(rgd);
+ if (rgd == first)
+ break;
+ }
+ } while (rgd != first);
+}
+
+/**
* gfs2_ri_update - Pull in a new resource index from the disk
* @ip: pointer to the rindex inode
*
@@ -973,6 +1003,8 @@ static int gfs2_ri_update(struct gfs2_inode *ip)
if (error < 0)
return error;
+ set_rgrp_preferences(sdp);
+
sdp->sd_rindex_uptodate = 1;
return 0;
}
@@ -1891,6 +1923,25 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
}
/**
+ * fast_to_acquire - determine if a resource group will be fast to acquire
+ *
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
+{
+ struct gfs2_glock *gl = rgd->rd_gl;
+
+ if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
+ !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
+ !test_bit(GLF_DEMOTE, &gl->gl_flags))
+ return 1;
+ if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+ return 1;
+ return 0;
+}
+
+/**
* gfs2_inplace_reserve - Reserve space in the filesystem
* @ip: the inode to reserve space for
* @ap: the allocation parameters
@@ -1932,10 +1983,15 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, const struct gfs2_alloc_parms *a
rg_locked = 0;
if (skip && skip--)
goto next_rgrp;
- if (!gfs2_rs_active(rs) && (loops < 2) &&
- gfs2_rgrp_used_recently(rs, 1000) &&
- gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
- goto next_rgrp;
+ if (!gfs2_rs_active(rs)) {
+ if (loops == 0 &&
+ !fast_to_acquire(rs->rs_rbm.rgd))
+ goto next_rgrp;
+ if ((loops < 2) &&
+ gfs2_rgrp_used_recently(rs, 1000) &&
+ gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+ goto next_rgrp;
+ }
error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
LM_ST_EXCLUSIVE, flags,
&rs->rs_rgd_gh);
--
1.9.3