cluster-devel.redhat.com archive mirror
* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
@ 2018-09-20 14:52 Mark Syms
  2018-09-20 14:52 ` [Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator Mark Syms
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mark Syms @ 2018-09-20 14:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

While testing GFS2 as a storage repository for virtual machines we
discovered a number of scenarios where the performance was being
pathologically poor.

The scenarios are simplified to the following -

  * On a single host in the cluster grow a number of files to a
    significant proportion of the filesystem's LUN size, exceeding the
    host's preferred resource group allocation. This can be replicated
    by using fio to write to 20 different files with a job file like

[test-files]
directory=gfs2/a:gfs2/b:gfs2/c:gfs2/d:gfs2/e:gfs2/f:gfs2/g:gfs2/h:gfs2/i:gfs2/j:gfs2/k:gfs2/l:gfs2/m:gfs2/n:gfs2/o:gfs2/p:gfs2/q:gfs2/r:gfs2/s:gfs2/t
nrfiles=1
size=20G
bs=512k
rw=write
buffered=0
ioengine=libaio
fallocate=none
numjobs=20

    After starting off at network wire speed this will rapidly degrade,
    with the fio processes reporting a large amount of system time.

    This was diagnosed as all of the processes contending on the glock
    in gfs2_inplace_reserve, having all selected the same resource
    group. Patch 1 addresses this with an optional module parameter
    which enables behaviour to "randomly" skip a selected resource
    group in the first two passes of gfs2_inplace_reserve in order to
    spread the processes out (a simplified sketch of the skip logic
    follows this list).

    It is worth noting that this would probably also be addressed if
    the comment in Documentation/gfs2-glocks.txt about eventually
    making glock EX locally shared were acted upon. However, that
    looks like it would require quite a bit of coordination and
    design, so this stop-gap helps in the meantime.

  * With two or more hosts growing files at high data rates the
    throughput drops to a small proportion of the maximum storage
    I/O rate. This is the "several VMs all writing to the filesystem"
    scenario. Sometimes this test would run through cleanly at 80-90%
    of storage wire speed, but at other times the performance would
    drop on one or more hosts to a small number of KiB/s.

    This was diagnosed as the different hosts repeatedly bouncing
    resource group glocks between them as they selected the same
    resource group (having exhausted their preferred groups).

    Patch 2 addresses this by -
      * adding a hold delay to the resource group glock if there are
        local waiters, following the pattern already in place for
        inodes; this should also provide more data for
        gfs2_rgrp_congested to work on
      * remembering when we were last asked to demote the lock on a
        resource group
      * in the first two passes in gfs2_inplace_reserve, avoiding
        resource groups where we have been asked to demote the glock
        within the last second
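
For illustration, here is a minimal userspace sketch of the Patch 1 skip
logic (a simulation of the idea only, not the kernel change itself; the
function name and the 16-rgrp layout are made up for the example). Each
allocating process skips the resource group it is looking at with
probability 1/2, and every skip makes a further skip progressively less
likely, so contending processes fan out across neighbouring resource
groups instead of all piling onto the same one.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Simulate one allocation attempt: starting from rgrp 'start', roll a
 * dice of 'randskip' sides and move on to the next rgrp whenever it
 * comes up 0, bumping randskip after each skip so that skipping again
 * becomes successively less likely (mirroring the patch's behaviour).
 */
static int pick_rgrp(int start, int nr_rgrps)
{
	int rgrp = start;
	int randskip = 2;	/* initial skippiness, as in the patch */

	while ((rand() % randskip) == 0) {
		rgrp = (rgrp + 1) % nr_rgrps;
		randskip++;
	}
	return rgrp;
}

int main(void)
{
	srand(time(NULL));
	/* 20 writers that would otherwise all pick resource group 0 */
	for (int job = 0; job < 20; job++)
		printf("job %2d -> rgrp %d\n", job, pick_rgrp(0, 16));
	return 0;
}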

Mark Syms (1):
  GFS2: Avoid recently demoted rgrps.

Tim Smith (1):
  Add some randomisation to the GFS2 resource group allocator

 fs/gfs2/glock.c      |  7 +++++--
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/main.c       |  1 +
 fs/gfs2/rgrp.c       | 49 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/gfs2/trace_gfs2.h | 12 +++++++++---
 5 files changed, 62 insertions(+), 9 deletions(-)

-- 
1.8.3.1




* [Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator
  2018-09-20 14:52 [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Mark Syms
@ 2018-09-20 14:52 ` Mark Syms
  2018-09-20 14:52 ` [Cluster-devel] [PATCH 2/2] GFS2: Avoid recently demoted rgrps Mark Syms
  2018-09-20 17:17 ` [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Bob Peterson
  2 siblings, 0 replies; 18+ messages in thread
From: Mark Syms @ 2018-09-20 14:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Tim Smith <tim.smith@citrix.com>

When growing a number of files on the same cluster node from different
threads (e.g. fio with 20 or so jobs), all those threads pile into
gfs2_inplace_reserve() independently looking to claim a new resource
group and after a while they all synchronise, getting through the
gfs2_rgrp_used_recently()/gfs2_rgrp_congested() check together.

When this happens, write performance drops to about 1/5 on a single
node cluster, and on multi-node clusters it drops to near zero on
some nodes. The output from "glocktop -r -H -d 1" when this happens
begins to show many processes stuck in gfs2_inplace_reserve(), waiting
on a resource group lock.

This commit introduces a module parameter which, when set to a value
of 1, will introduce some random jitter into the first two passes of
gfs2_inplace_reserve() when trying to lock a new resource group,
skipping to the next one half of the time initially, with
progressively lower probability on each subsequent attempt.

Signed-off-by: Tim Smith <tim.smith@citrix.com>
---
 fs/gfs2/rgrp.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 1ad3256..994eb7f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -19,6 +19,7 @@
 #include <linux/blkdev.h>
 #include <linux/rbtree.h>
 #include <linux/random.h>
+#include <linux/module.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -49,6 +50,11 @@
 #define LBITSKIP00 (0x0000000000000000UL)
 #endif
 
+static int gfs2_skippy_rgrp_alloc;
+
+module_param_named(skippy_rgrp_alloc, gfs2_skippy_rgrp_alloc, int, 0644);
+MODULE_PARM_DESC(skippy_rgrp_alloc, "Set skippiness of resource group allocator, 0|1, where 1 will cause resource groups to be randomly skipped, with the likelihood of skipping progressively decreasing after a skip has occurred.");
+
 /*
  * These routines are used by the resource group routines (rgrp.c)
  * to keep track of block allocation.  Each block is represented by two
@@ -2016,6 +2022,11 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 	u64 last_unlinked = NO_BLOCK;
 	int loops = 0;
 	u32 free_blocks, skip = 0;
+	/*
+	 * gfs2_skippy_rgrp_alloc provides our initial skippiness.
+	 * randskip will thus be 2-255 if we want it to do anything.
+	 */
+	u8 randskip = gfs2_skippy_rgrp_alloc + 1;
 
 	if (sdp->sd_args.ar_rgrplvb)
 		flags |= GL_SKIP;
@@ -2046,10 +2057,30 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 				if (loops == 0 &&
 				    !fast_to_acquire(rs->rs_rbm.rgd))
 					goto next_rgrp;
-				if ((loops < 2) &&
-				    gfs2_rgrp_used_recently(rs, 1000) &&
-				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-					goto next_rgrp;
+				if (loops < 2) {
+					/*
+					 * If resource group allocation is requested to be skippy,
+					 * roll a hypothetical dice of <randskip> sides and skip
+					 * straight to the next resource group anyway if it comes
+					 * up 1.
+					 */
+					if (gfs2_skippy_rgrp_alloc) {
+						u8 jitter;
+
+						prandom_bytes(&jitter, sizeof(jitter));
+						if ((jitter % randskip) == 0) {
+							/*
+							 * If we are choosing to skip, bump randskip to make it
+							 * successively less likely that we will skip again
+							 */
+							randskip++;
+							goto next_rgrp;
+						}
+					}
+					if (gfs2_rgrp_used_recently(rs, 1000) &&
+						gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+						goto next_rgrp;
+				}
 			}
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
 						   LM_ST_EXCLUSIVE, flags,
-- 
1.8.3.1




* [Cluster-devel] [PATCH 2/2] GFS2: Avoid recently demoted rgrps.
  2018-09-20 14:52 [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Mark Syms
  2018-09-20 14:52 ` [Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator Mark Syms
@ 2018-09-20 14:52 ` Mark Syms
  2018-09-20 17:17 ` [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Bob Peterson
  2 siblings, 0 replies; 18+ messages in thread
From: Mark Syms @ 2018-09-20 14:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When under heavy I/O load from two or more hosts, resource group
allocation can result in glocks being bounced around between hosts.
Follow the example of inodes: if we have local waiters when asked to
demote the glock on a resource group, add a delay. Additionally,
track when we were last asked to demote a lock and, when assessing
resource groups in the allocator, prefer in the first two loop
iterations not to use resource groups where we've been asked to
demote the glock within the last second.

Signed-off-by: Mark Syms <mark.syms@citrix.com>
---
 fs/gfs2/glock.c      |  7 +++++--
 fs/gfs2/incore.h     |  2 ++
 fs/gfs2/main.c       |  1 +
 fs/gfs2/rgrp.c       | 10 ++++++++++
 fs/gfs2/trace_gfs2.h | 12 +++++++++---
 5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4614ee2..94ef947 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -973,7 +973,9 @@ static void handle_callback(struct gfs2_glock *gl, unsigned int state,
 	}
 	if (gl->gl_ops->go_callback)
 		gl->gl_ops->go_callback(gl, remote);
-	trace_gfs2_demote_rq(gl, remote);
+	trace_gfs2_demote_rq(gl, remote, delay);
+	if (remote && !delay)
+		gl->gl_last_demote = jiffies;
 }
 
 void gfs2_print_dbg(struct seq_file *seq, const char *fmt, ...)
@@ -1339,7 +1341,8 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
 	gfs2_glock_hold(gl);
 	holdtime = gl->gl_tchange + gl->gl_hold_time;
 	if (test_bit(GLF_QUEUED, &gl->gl_flags) &&
-	    gl->gl_name.ln_type == LM_TYPE_INODE) {
+	    (gl->gl_name.ln_type == LM_TYPE_INODE ||
+	     gl->gl_name.ln_type == LM_TYPE_RGRP)) {
 		if (time_before(now, holdtime))
 			delay = holdtime - now;
 		if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b96d39c..e3d5b10 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -366,6 +366,8 @@ struct gfs2_glock {
 		     gl_reply:8;	/* Last reply from the dlm */
 
 	unsigned long gl_demote_time; /* time of first demote request */
+	unsigned long gl_last_demote; /* jiffies at last demote transition */
+
 	long gl_hold_time;
 	struct list_head gl_holders;
 
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 2d55e2c..2183c73 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -58,6 +58,7 @@ static void gfs2_init_glock_once(void *foo)
 	INIT_LIST_HEAD(&gl->gl_ail_list);
 	atomic_set(&gl->gl_ail_count, 0);
 	atomic_set(&gl->gl_revokes, 0);
+	gl->gl_last_demote = jiffies - (2 * HZ);
 }
 
 static void gfs2_init_gl_aspace_once(void *foo)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 994eb7f..7b77bb2 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1955,6 +1955,12 @@ static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
 	return tdiff > (msecs * 1000 * 1000);
 }
 
+static bool gfs2_rgrp_demoted_recently(const struct gfs2_blkreserv *rs,
+				       u32 max_age_jiffies, u32 loop)
+{
+	return time_before(jiffies, rs->rs_rbm.rgd->rd_gl->gl_last_demote + max_age_jiffies);
+}
+
 static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
 {
 	const struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
@@ -2077,6 +2083,10 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 							goto next_rgrp;
 						}
 					}
+
+					if (gfs2_rgrp_demoted_recently(rs, HZ, loops))
+						goto next_rgrp;
+
 					if (gfs2_rgrp_used_recently(rs, 1000) &&
 						gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
 						goto next_rgrp;
diff --git a/fs/gfs2/trace_gfs2.h b/fs/gfs2/trace_gfs2.h
index e002525..79935dc 100644
--- a/fs/gfs2/trace_gfs2.h
+++ b/fs/gfs2/trace_gfs2.h
@@ -161,9 +161,9 @@ static inline u8 glock_trace_state(unsigned int state)
 /* Callback (local or remote) requesting lock demotion */
 TRACE_EVENT(gfs2_demote_rq,
 
-	TP_PROTO(const struct gfs2_glock *gl, bool remote),
+	TP_PROTO(const struct gfs2_glock *gl, bool remote, unsigned long delay),
 
-	TP_ARGS(gl, remote),
+	TP_ARGS(gl, remote, delay),
 
 	TP_STRUCT__entry(
 		__field(        dev_t,  dev                     )
@@ -173,6 +173,8 @@ static inline u8 glock_trace_state(unsigned int state)
 		__field(	u8,	dmt_state		)
 		__field(	unsigned long,	flags		)
 		__field(	bool,	remote			)
+		__field(	unsigned long,  gl_last_demote	)
+		__field(	unsigned long,  delay		)
 	),
 
 	TP_fast_assign(
@@ -182,15 +184,19 @@ static inline u8 glock_trace_state(unsigned int state)
 		__entry->cur_state	= glock_trace_state(gl->gl_state);
 		__entry->dmt_state	= glock_trace_state(gl->gl_demote_state);
 		__entry->flags		= gl->gl_flags  | (gl->gl_object ? (1UL<<GLF_OBJECT) : 0);
+		__entry->gl_last_demote	= jiffies - gl->gl_last_demote;
 		__entry->remote		= remote;
+		__entry->delay		= delay;
 	),
 
-	TP_printk("%u,%u glock %d:%lld demote %s to %s flags:%s %s",
+	TP_printk("%u,%u glock %d:%lld demote %s to %s flags:%s %lu delay %lu %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->gltype,
 		  (unsigned long long)__entry->glnum,
                   glock_trace_name(__entry->cur_state),
                   glock_trace_name(__entry->dmt_state),
 		  show_glock_flags(__entry->flags),
+		  __entry->gl_last_demote,
+		  __entry->delay,
 		  __entry->remote ? "remote" : "local")
 
 );
-- 
1.8.3.1




* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-20 14:52 [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Mark Syms
  2018-09-20 14:52 ` [Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator Mark Syms
  2018-09-20 14:52 ` [Cluster-devel] [PATCH 2/2] GFS2: Avoid recently demoted rgrps Mark Syms
@ 2018-09-20 17:17 ` Bob Peterson
  2018-09-20 17:47   ` Mark Syms
  2 siblings, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2018-09-20 17:17 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> While testing GFS2 as a storage repository for virtual machines we
> discovered a number of scenarios where the performance was being
> pathologically poor.
> 
> The scenarios are simplfied to the following -
> 
>   * On a single host in the cluster grow a number of files to a
>     significant proportion of the filesystems LUN size, exceeding the
>     hosts preferred resource group allocation. This can be replicated
>     by using fio and writing to 20 different files with a script like

Hi Mark, Tim and all,

The performance problems with rgrp contention are well known, and have
been for a very long time.

In rhel6 it's not as big of a problem because rhel6 gfs2 uses "try locks"
which distributes different processes to unique rgrps, thus keeping
them from contending. However, it results in file system fragmentation
that tends to catch up with you later.

I posted a different patch set to solve the problem a different way
by trying to keep track of both inter-node and intra-node contention,
and redistributing rgrps accordingly. It was similar to your first patch,
but used a more predictable distribution, whereas yours is random.
It worked very well, but it ultimately got rejected by Steve Whitehouse
in favor of a better approach:

Our current plan is to allow rgrps to be shared among many processes on
a single node. This alleviates the contention, improves throughput
and performance, and fixes the "favoritism" problems gfs2 has today.
In other words, it's better than just redistributing the rgrps.

I did a proof-of-concept set of patches and saw pretty good performance
numbers and "fairness" among simultaneous writers. I posted that a few
months ago.

Your patch would certainly work, and random distribution of rgrps
would definitely gain performance, just as the Orlov algorithm does;
however, I still want to pursue what Steve suggested.

My patch set for this still needs some work because I found some
bugs with how things are done, so it'll take time to get working
properly.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-20 17:17 ` [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Bob Peterson
@ 2018-09-20 17:47   ` Mark Syms
  2018-09-20 18:16     ` Steven Whitehouse
  2018-09-28 12:23     ` Bob Peterson
  0 siblings, 2 replies; 18+ messages in thread
From: Mark Syms @ 2018-09-20 17:47 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Thanks for that, Bob. We've been watching the changes going in upstream with interest, but at the moment we're not really in a position to take advantage of them.

Due to hardware vendor support certification requirements XenServer can only very occasionally make big kernel bumps that would affect the ABI that the driver would see, as that would require our hardware partners to recertify. So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat newer as it is essentially self-contained and we can therefore backport changes more easily. We currently have most of the GFS2 and DLM changes that are in 4.15 backported into the XenServer 7.6 kernel, but we can't take the ones related to iomap as they are more invasive, and it looks like a number of the more recent performance-targeted changes are also predicated on the iomap framework.

As I mentioned in the covering letter, the intra-host problem would largely be a non-issue if EX glocks were actually a host-wide thing with local mutexes used to share them within the host. I don't know if this is what your patch set is trying to achieve or not. It's not so much that the selection of resource group is "random", just that there is a random chance that we won't select the first RG that we test; it probably does work out much the same though.

The inter-host problem addressed by the second patch seems to be less amenable to avoidance as the hosts don't seem to have a synchronous view of the state of the resource group locks (for understandable reasons, as I'd expect this to be very expensive to keep sync'd). So it seemed reasonable to try to make it "expensive" to request a resource that someone else is using and also to avoid immediately grabbing it back if we've been asked to relinquish it. It does seem to give a fairer balance to the usage without being massively invasive.

We thought we should share these with the community anyway, even if they only serve as inspiration for more detailed changes, and also to describe the scenarios where we're seeing issues now that we have completed implementing the XenServer support for GFS2 that we discussed back in Nuremberg last year. In our testing they certainly make things better. They probably aren't fully optimal as we can't maintain 10G wire speed consistently across the full LUN, but we're getting about 75%, which is certainly better than we were seeing before we started looking at this.

Thanks,

	Mark.

-----Original Message-----
From: Bob Peterson <rpeterso@redhat.com> 
Sent: 20 September 2018 18:18
To: Mark Syms <Mark.Syms@citrix.com>
Cc: cluster-devel at redhat.com; Ross Lagerwall <ross.lagerwall@citrix.com>; Tim Smith <tim.smith@citrix.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> While testing GFS2 as a storage repository for virtual machines we 
> discovered a number of scenarios where the performance was being 
> pathologically poor.
> 
> The scenarios are simplfied to the following -
> 
>   * On a single host in the cluster grow a number of files to a
>     significant proportion of the filesystems LUN size, exceeding the
>     hosts preferred resource group allocation. This can be replicated
>     by using fio and writing to 20 different files with a script like

Hi Mark, Tim and all,

The performance problems with rgrp contention are well known, and have been for a very long time.

In rhel6 it's not as big of a problem because rhel6 gfs2 uses "try locks"
which distributes different processes to unique rgrps, thus keeping them from contending. However, it results in file system fragmentation that tends to catch up with you later.

I posted a different patch set to solve the problem a different way by trying to keep track of both Inter-node and Intra-node contention, and redistributed rgrps accordingly. It was similar to your first patch, but used a more predictable distribution, whereas yours is random.
It worked very well, but it ultimately got rejected by Steve Whitehouse in favor of a better approach:

Our current plan is to allow rgrps to be shared among many processes on a single node. This alleviates the contention, improves throughput and performance, and fixes the "favoritism" problems gfs2 has today.
In other words, it's better than just redistributing the rgrps.

I did a proof-of-concept set of patches and saw pretty good performance numbers and "fairness" among simultaneous writers. I posted that a few months ago.

Your patch would certainly work, and random distribution of rgrps would definitely gain performance, just as the Orlov algorithm does, however, I still want to pursue what Steve suggested.

My patch set for this still needs some work because I found some bugs with how things are done, so it'll take time to get working properly.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-20 17:47   ` Mark Syms
@ 2018-09-20 18:16     ` Steven Whitehouse
  2018-09-28 12:23     ` Bob Peterson
  1 sibling, 0 replies; 18+ messages in thread
From: Steven Whitehouse @ 2018-09-20 18:16 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 20/09/18 18:47, Mark Syms wrote:
> Thanks for that Bob, we've been watching with interest the changes going in upstream but at the moment we're not really in a position to take advantage of them.
>
> Due to hardware vendor support certification requirements XenServer can only very occasionally make big kernel bumps that would affect the ABI that the driver would see as that would require our hardware partners to recertify. So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat newer as it is essentially self-contained and therefore we can backport change more easily. We currently have most of the GFS2 and DLM changes that are in 4.15 backported into the XenServer 7.6 kernel, but we can't take the ones related to iomap as they are more invasive and it looks like a number of the more recent performance targeting changes are also predicated on the iomap framework.
>
> As I mentioned in the covering letter, the intra host problem would largely be a non-issue if EX glocks were actually a host wide thing with local mutexes used to share them within the host. I don't know if this is what your patch set is trying to achieve or not. It's not so much that that selection of resource group is "random", just that there is a random chance that we won't select the first RG that we test, it probably does work out much the same though.
Yes, that is the goal. Those patches shouldn't depend directly on the 
iomap work, but there is likely to be some overlap there.

> The inter host problem addressed by the second patch seems to be less amenable to avoidance as the hosts don't seem to have a synchronous view of the state of the resource group locks (for understandable reasons as I'd expect thisto be very expensive to keep sync'd). So it seemed reasonable to try to make it "expensive" to request a resource that someone else is using and also to avoid immediately grabbing it back if we've been asked to relinquish it. It does seem to give a fairer balance to the usage without being massively invasive.
>
> We thought we should share these with the community anyway even if they only serve as inspiration for more detailed changes and also to describe the scenarios where we're seeing issues now that we have completed implementing the XenServer support for GFS2 that we discussed back in Nuremburg last year. In our testing they certainly make things better. They probably aren't fully optimal as we can't maintain 10g wire speed consistently across the full LUN but we're getting about 75% which is certainly better than we were seeing before we started looking at this.
>
> Thanks,
>
> 	Mark.
We are very much open to improvements and we'll definitely take a more 
detailed look at your patches in due course. We are always very happy to 
have more people working on GFS2,

Steve.

> -----Original Message-----
> From: Bob Peterson <rpeterso@redhat.com>
> Sent: 20 September 2018 18:18
> To: Mark Syms <Mark.Syms@citrix.com>
> Cc: cluster-devel at redhat.com; Ross Lagerwall <ross.lagerwall@citrix.com>; Tim Smith <tim.smith@citrix.com>
> Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
>
> ----- Original Message -----
>> While testing GFS2 as a storage repository for virtual machines we
>> discovered a number of scenarios where the performance was being
>> pathologically poor.
>>
>> The scenarios are simplfied to the following -
>>
>>    * On a single host in the cluster grow a number of files to a
>>      significant proportion of the filesystems LUN size, exceeding the
>>      hosts preferred resource group allocation. This can be replicated
>>      by using fio and writing to 20 different files with a script like
> Hi Mark, Tim and all,
>
> The performance problems with rgrp contention are well known, and have been for a very long time.
>
> In rhel6 it's not as big of a problem because rhel6 gfs2 uses "try locks"
> which distributes different processes to unique rgrps, thus keeping them from contending. However, it results in file system fragmentation that tends to catch up with you later.
>
> I posted a different patch set to solve the problem a different way by trying to keep track of both Inter-node and Intra-node contention, and redistributed rgrps accordingly. It was similar to your first patch, but used a more predictable distribution, whereas yours is random.
> It worked very well, but it ultimately got rejected by Steve Whitehouse in favor of a better approach:
>
> Our current plan is to allow rgrps to be shared among many processes on a single node. This alleviates the contention, improves throughput and performance, and fixes the "favoritism" problems gfs2 has today.
> In other words, it's better than just redistributing the rgrps.
>
> I did a proof-of-concept set of patches and saw pretty good performance numbers and "fairness" among simultaneous writers. I posted that a few months ago.
>
> Your patch would certainly work, and random distribution of rgrps would definitely gain performance, just as the Orlov algorithm does, however, I still want to pursue what Steve suggested.
>
> My patch set for this still needs some work because I found some bugs with how things are done, so it'll take time to get working properly.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>






* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-20 17:47   ` Mark Syms
  2018-09-20 18:16     ` Steven Whitehouse
@ 2018-09-28 12:23     ` Bob Peterson
  2018-09-28 12:36       ` Mark Syms
  1 sibling, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2018-09-28 12:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Thanks for that Bob, we've been watching with interest the changes going in
> upstream but at the moment we're not really in a position to take advantage
> of them.
> 
> Due to hardware vendor support certification requirements XenServer can only
> very occasionally make big kernel bumps that would affect the ABI that the
> driver would see as that would require our hardware partners to recertify.
> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat newer
> as it is essentially self-contained and therefore we can backport change
> more easily. We currently have most of the GFS2 and DLM changes that are in
> 4.15 backported into the XenServer 7.6 kernel, but we can't take the ones
> related to iomap as they are more invasive and it looks like a number of the
> more recent performance targeting changes are also predicated on the iomap
> framework.
> 
> As I mentioned in the covering letter, the intra host problem would largely
> be a non-issue if EX glocks were actually a host wide thing with local
> mutexes used to share them within the host. I don't know if this is what
> your patch set is trying to achieve or not. It's not so much that that
> selection of resource group is "random", just that there is a random chance
> that we won't select the first RG that we test, it probably does work out
> much the same though.
> 
> The inter host problem addressed by the second patch seems to be less
> amenable to avoidance as the hosts don't seem to have a synchronous view of
> the state of the resource group locks (for understandable reasons as I'd
> expect thisto be very expensive to keep sync'd). So it seemed reasonable to
> try to make it "expensive" to request a resource that someone else is using
> and also to avoid immediately grabbing it back if we've been asked to
> relinquish it. It does seem to give a fairer balance to the usage without
> being massively invasive.
> 
> We thought we should share these with the community anyway even if they only
> serve as inspiration for more detailed changes and also to describe the
> scenarios where we're seeing issues now that we have completed implementing
> the XenServer support for GFS2 that we discussed back in Nuremburg last
> year. In our testing they certainly make things better. They probably aren't
> fully optimal as we can't maintain 10g wire speed consistently across the
> full LUN but we're getting about 75% which is certainly better than we were
> seeing before we started looking at this.
> 
> Thanks,
> 
> 	Mark.

Hi Mark,

I'm really curious if you guys tried the two patches I posted here from
17 January 2018 in place of the two patches you posted. We see much better
throughput with those over stock.

I know Steve wants a different solution, and in the long run it will be a
better one, but I've been trying to convince him we should use them as a
stop-gap measure to mitigate this problem until we get a more proper solution
in place (which is obviously taking some time, due to unforeseen circumstances).

Regards,

Bob Peterson




* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 12:23     ` Bob Peterson
@ 2018-09-28 12:36       ` Mark Syms
  2018-09-28 12:50         ` Mark Syms
  2018-09-28 12:55         ` Bob Peterson
  0 siblings, 2 replies; 18+ messages in thread
From: Mark Syms @ 2018-09-28 12:36 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Bob,

No, we haven't, but it wouldn't be hard for us to replace our patches in our internal patchqueue with these and try them. We will let you know what we find.

We have also seen what we think is an unrelated issue, where we get the following backtrace in kern.log and our system stalls:

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O    4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D ffff88019628bc90     0 15480      1 0x00000000
Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90 ffff880198f11c00 ffff88005a509c00 ffff88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  ffffc90040226000 ffff88019628bd80 fffffffffffffe58 ffff8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  ffff88019628bca8 ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered if a glock is being destroyed, that there is a time-of-check/time-of-use issue here, whereby by the time schedule gets called the thing we expect to wake us up has already finished dying and therefore won't trigger a wakeup for us. We have only seen this a couple of times, in fairly intensive VM stress tests where a lot of flocks get used on a small number of lock files (we use them to ensure consistent behaviour of disk activation/deactivation and also access to the database with the system state), but it's concerning nonetheless. We're looking at replacing the call to schedule with schedule_timeout with a timeout of maybe HZ to ensure that we will always get out of the schedule operation and retry. Is this something you think you may have seen, or do you have any ideas on it?
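
For illustration, the kind of change we're considering amounts to roughly the following (a sketch against the wait loop in the upstream find_insert_glock(); the exact context in our 4.4-based backport may differ slightly):

	if (gl && !lockref_get_not_dead(&gl->gl_lockref)) {
		rcu_read_unlock();
		/*
		 * Bounded sleep: if the wakeup from the dying glock is
		 * missed, retry the lookup after at most a second rather
		 * than sleeping forever.
		 */
		schedule_timeout(HZ);
		goto again;
	}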

Thanks,

	Mark.

-----Original Message-----
From: Bob Peterson <rpeterso@redhat.com> 
Sent: 28 September 2018 13:24
To: Mark Syms <Mark.Syms@citrix.com>
Cc: cluster-devel at redhat.com; Ross Lagerwall <ross.lagerwall@citrix.com>; Tim Smith <tim.smith@citrix.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Thanks for that Bob, we've been watching with interest the changes 
> going in upstream but at the moment we're not really in a position to 
> take advantage of them.
> 
> Due to hardware vendor support certification requirements XenServer 
> can only very occasionally make big kernel bumps that would affect the 
> ABI that the driver would see as that would require our hardware partners to recertify.
> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat 
> newer as it is essentially self-contained and therefore we can 
> backport change more easily. We currently have most of the GFS2 and 
> DLM changes that are in
> 4.15 backported into the XenServer 7.6 kernel, but we can't take the 
> ones related to iomap as they are more invasive and it looks like a 
> number of the more recent performance targeting changes are also 
> predicated on the iomap framework.
> 
> As I mentioned in the covering letter, the intra host problem would 
> largely be a non-issue if EX glocks were actually a host wide thing 
> with local mutexes used to share them within the host. I don't know if 
> this is what your patch set is trying to achieve or not. It's not so 
> much that that selection of resource group is "random", just that 
> there is a random chance that we won't select the first RG that we 
> test, it probably does work out much the same though.
> 
> The inter host problem addressed by the second patch seems to be less 
> amenable to avoidance as the hosts don't seem to have a synchronous 
> view of the state of the resource group locks (for understandable 
> reasons as I'd expect thisto be very expensive to keep sync'd). So it 
> seemed reasonable to try to make it "expensive" to request a resource 
> that someone else is using and also to avoid immediately grabbing it 
> back if we've been asked to relinquish it. It does seem to give a 
> fairer balance to the usage without being massively invasive.
> 
> We thought we should share these with the community anyway even if 
> they only serve as inspiration for more detailed changes and also to 
> describe the scenarios where we're seeing issues now that we have 
> completed implementing the XenServer support for GFS2 that we 
> discussed back in Nuremburg last year. In our testing they certainly 
> make things better. They probably aren't fully optimal as we can't 
> maintain 10g wire speed consistently across the full LUN but we're 
> getting about 75% which is certainly better than we were seeing before we started looking at this.
> 
> Thanks,
> 
> 	Mark.

Hi Mark,

I'm really curious if you guys tried the two patches I posted here from
17 January 2018 in place of the two patches you posted. We see much better throughput with those over stock.

I know Steve wants a different solution, and in the long run it will be a better one, but I've been trying to convince him we should use them as a stop-gap measure to mitigate this problem until we get a more proper solution in place (which is obviously taking some time, due to unforeseen circumstances).

Regards,

Bob Peterson




* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 12:36       ` Mark Syms
@ 2018-09-28 12:50         ` Mark Syms
  2018-09-28 13:18           ` Steven Whitehouse
  2018-09-28 12:55         ` Bob Peterson
  1 sibling, 1 reply; 18+ messages in thread
From: Mark Syms @ 2018-09-28 12:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Bob,

The patches look quite good and would seem to help in the intra-node congestion case, which is what our first patch was trying to address. We haven't tried them yet, but I'll pull a build together and try to run it over the weekend.

We don't, however, see that they would help in the situation we saw for the second patch, where rgrp glocks would get bounced around between hosts at high speed and cause lots of state flushing to occur in the process; the stats don't take account of anything other than network latency, whereas there is more involved with an rgrp glock when state needs to be flushed.

Any thoughts on this?

Thanks,

	Mark.

-----Original Message-----
From: Mark Syms 
Sent: 28 September 2018 13:37
To: 'Bob Peterson' <rpeterso@redhat.com>
Cc: cluster-devel at redhat.com; Tim Smith <tim.smith@citrix.com>; Ross Lagerwall <ross.lagerwall@citrix.com>
Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

Hi Bob,

No, we haven't but it wouldn't be hard for us to replace our patches in our internal patchqueue with these and try them. Will let you know what we find.

We have also seen, what we think is an unrelated issue where we get the following backtrace in kern.log and our system stalls

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O    4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D ffff88019628bc90     0 15480      1 0x00000000
Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90 ffff880198f11c00 ffff88005a509c00 ffff88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  ffffc90040226000 ffff88019628bd80 fffffffffffffe58 ffff8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  ffff88019628bca8 ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered if a glock is being destroyed, that there is a time of check, time of use issue here where by the time that schedule gets called the thing which we expect to be waking us up has completed dying and therefore won't trigger a wakeup for us. We only seen this a couple of times in fairly intensive VM stress tests where a lot of flocks get used on a small number of lock files (we use them to ensure consistent behaviour of disk activation/deactivation and also access to the database with the system state) but it's concerning nonetheless. We're looking at replacing the call to schedule with schedule_timeout with a timeout of maybe HZ to ensure that we will always get out of the schedule operation and retry. Is this something you think you may have seen or have any ideas on?

Thanks,

	Mark.

-----Original Message-----
From: Bob Peterson <rpeterso@redhat.com>
Sent: 28 September 2018 13:24
To: Mark Syms <Mark.Syms@citrix.com>
Cc: cluster-devel at redhat.com; Ross Lagerwall <ross.lagerwall@citrix.com>; Tim Smith <tim.smith@citrix.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Thanks for that Bob, we've been watching with interest the changes 
> going in upstream but at the moment we're not really in a position to 
> take advantage of them.
> 
> Due to hardware vendor support certification requirements XenServer 
> can only very occasionally make big kernel bumps that would affect the 
> ABI that the driver would see as that would require our hardware partners to recertify.
> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat 
> newer as it is essentially self-contained and therefore we can 
> backport change more easily. We currently have most of the GFS2 and 
> DLM changes that are in
> 4.15 backported into the XenServer 7.6 kernel, but we can't take the 
> ones related to iomap as they are more invasive and it looks like a 
> number of the more recent performance targeting changes are also 
> predicated on the iomap framework.
> 
> As I mentioned in the covering letter, the intra host problem would 
> largely be a non-issue if EX glocks were actually a host wide thing 
> with local mutexes used to share them within the host. I don't know if 
> this is what your patch set is trying to achieve or not. It's not so 
> much that that selection of resource group is "random", just that 
> there is a random chance that we won't select the first RG that we 
> test, it probably does work out much the same though.
> 
> The inter host problem addressed by the second patch seems to be less 
> amenable to avoidance as the hosts don't seem to have a synchronous 
> view of the state of the resource group locks (for understandable 
> reasons as I'd expect thisto be very expensive to keep sync'd). So it 
> seemed reasonable to try to make it "expensive" to request a resource 
> that someone else is using and also to avoid immediately grabbing it 
> back if we've been asked to relinquish it. It does seem to give a 
> fairer balance to the usage without being massively invasive.
> 
> We thought we should share these with the community anyway even if 
> they only serve as inspiration for more detailed changes and also to 
> describe the scenarios where we're seeing issues now that we have 
> completed implementing the XenServer support for GFS2 that we 
> discussed back in Nuremburg last year. In our testing they certainly 
> make things better. They probably aren't fully optimal as we can't 
> maintain 10g wire speed consistently across the full LUN but we're 
> getting about 75% which is certainly better than we were seeing before we started looking at this.
> 
> Thanks,
> 
> 	Mark.

Hi Mark,

I'm really curious if you guys tried the two patches I posted here from
17 January 2018 in place of the two patches you posted. We see much better throughput with those over stock.

I know Steve wants a different solution, and in the long run it will be a better one, but I've been trying to convince him we should use them as a stop-gap measure to mitigate this problem until we get a more proper solution in place (which is obviously taking some time, due to unforeseen circumstances).

Regards,

Bob Peterson




* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 12:36       ` Mark Syms
  2018-09-28 12:50         ` Mark Syms
@ 2018-09-28 12:55         ` Bob Peterson
  2018-09-28 13:56           ` Mark Syms
  1 sibling, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2018-09-28 12:55 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Hi Bob,
> 
> No, we haven't but it wouldn't be hard for us to replace our patches in our
> internal patchqueue with these and try them. Will let you know what we find.
> 
> We have also seen, what we think is an unrelated issue where we get the
> following backtrace in kern.log and our system stalls
> 
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480
> blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O
> 4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D
> ffff88019628bc90     0 15480      1 0x00000000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90
> ffff880198f11c00 ffff88005a509c00 ffff88019628c000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  ffffc90040226000
> ffff88019628bd80 fffffffffffffe58 ffff8801818da418
> Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  ffff88019628bca8
> ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68
> Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>]
> schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>]
> find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ?
> gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>]
> gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>]
> do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ?
> gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ?
> cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>]
> gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>]
> SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>]
> entry_SYSCALL_64_fastpath+0x12/0x71
> 
> We think there is a possibility, given that this code path only gets entered
> if a glock is being destroyed, that there is a time of check, time of use
> issue here where by the time that schedule gets called the thing which we
> expect to be waking us up has completed dying and therefore won't trigger a
> wakeup for us. We only seen this a couple of times in fairly intensive VM
> stress tests where a lot of flocks get used on a small number of lock files
> (we use them to ensure consistent behaviour of disk activation/deactivation
> and also access to the database with the system state) but it's concerning
> nonetheless. We're looking at replacing the call to schedule with
> schedule_timeout with a timeout of maybe HZ to ensure that we will always
> get out of the schedule operation and retry. Is this something you think you
> may have seen or have any ideas on?
> 
> Thanks,
> 
> 	Mark.

Hi Mark,

It's very common to get call traces like that when one of the nodes
in a cluster goes down and the other nodes all wait for the failed node
to be fenced, etc. The node failure causes dlm to temporarily stop
granting locks until the issue is resolved. This is expected behavior,
and dlm recovery should eventually grant the lock once the node is
properly removed from the cluster. I haven't seen it on a flock glock,
because I personally don't often run with flocks, but it often happens
to me with other glocks when I do recovery testing (which I've been
doing a lot of lately).

So is there a node failure in your case? If there's a node failure, dlm
should recover the locks and allow the waiter to continue normally. If
it's not a node failure, it's hard to say... I know Andreas fixed some
problems with the rcu locking we do to protect the glock rhashtable.
Perhaps the kernel you're using is missing one of his patches? Or maybe
it's a new bug. Adding Andreas to the cc.

Regards,

Bob Peterson




* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 12:50         ` Mark Syms
@ 2018-09-28 13:18           ` Steven Whitehouse
  2018-09-28 13:43             ` Tim Smith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2018-09-28 13:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 28/09/18 13:50, Mark Syms wrote:
> Hi Bon,
>
> The patches look quite good and would seem to help in the intra-node congestion case, which our first patch was trying to do. We haven't tried them yet but I'll pull a build together and try to run it over the weekend.
>
> We don't however, see that they would help in the situation we saw for the second patch where rgrp glocks would get bounced around between hosts at high speed and cause lots of state flushing to occur in the process as the stats don't take any account of anything other than network latency whereas there is more involved with a rgrp glock when state needs to be flushed.
>
> Any thoughts on this?
>
> Thanks,
>
> 	Mark.
There are a few points here... the stats measure the latency of the DLM 
requests. Since in order to release a lock, some work has to be done, 
and the lock is not released until that work is complete, the stats do 
include that in their timings.

There are several parts to the complete picture here:

1. Resource group selection for allocation (which is what the current 
stats based solution tries to do)
   - Note this will not help deallocation, as then there is no choice in 
which resource group we use! So the following two items should address 
deallocation too...
2. Parallelism of resource group usage within a single node (currently 
missing, but we hope to add this feature shortly)
3. Reduction in latency when glocks need to be demoted for use on 
another node (something we plan to address in due course)

All these things are a part of the overall picture, and we need to be 
careful not to try and optimise one at the expense of others. It is 
actually quite easy to get a big improvement in one particular workload, 
but if we are not careful, it may well be at the expense of another that 
we've not taken into account. There will always be a trade off between 
locality and parallelism of course, but we do have to be fairly cautious 
here too.

We are of course very happy to encourage work in this area, since it 
should help us gain a greater insight into the various dependencies 
between these parts, and result in a better overall solution. I hope 
that helps to give a rough idea of our current thoughts and where we 
hope to get to in due course,

Steve.

> -----Original Message-----
> From: Mark Syms
> Sent: 28 September 2018 13:37
> To: 'Bob Peterson' <rpeterso@redhat.com>
> Cc: cluster-devel at redhat.com; Tim Smith <tim.smith@citrix.com>; Ross Lagerwall <ross.lagerwall@citrix.com>
> Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
>
> Hi Bob,
>
> No, we haven't but it wouldn't be hard for us to replace our patches in our internal patchqueue with these and try them. Will let you know what we find.
>
> We have also seen, what we think is an unrelated issue where we get the following backtrace in kern.log and our system stalls
>
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O    4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D ffff88019628bc90     0 15480      1 0x00000000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90 ffff880198f11c00 ffff88005a509c00 ffff88019628c000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  ffffc90040226000 ffff88019628bd80 fffffffffffffe58 ffff8801818da418
> Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  ffff88019628bca8 ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68
> Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>] schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>] find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ? gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>] gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>] do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ? gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ? cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>] gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>] SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>] entry_SYSCALL_64_fastpath+0x12/0x71
>
> We think there is a possibility, given that this code path only gets entered if a glock is being destroyed, that there is a time of check, time of use issue here where by the time that schedule gets called the thing which we expect to be waking us up has completed dying and therefore won't trigger a wakeup for us. We only seen this a couple of times in fairly intensive VM stress tests where a lot of flocks get used on a small number of lock files (we use them to ensure consistent behaviour of disk activation/deactivation and also access to the database with the system state) but it's concerning nonetheless. We're looking at replacing the call to schedule with schedule_timeout with a timeout of maybe HZ to ensure that we will always get out of the schedule operation and retry. Is this something you think you may have seen or have any ideas on?
>
> Thanks,
>
> 	Mark.
>
> -----Original Message-----
> From: Bob Peterson <rpeterso@redhat.com>
> Sent: 28 September 2018 13:24
> To: Mark Syms <Mark.Syms@citrix.com>
> Cc: cluster-devel at redhat.com; Ross Lagerwall <ross.lagerwall@citrix.com>; Tim Smith <tim.smith@citrix.com>
> Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
>
> ----- Original Message -----
>> Thanks for that Bob, we've been watching with interest the changes
>> going in upstream but at the moment we're not really in a position to
>> take advantage of them.
>>
>> Due to hardware vendor support certification requirements, XenServer
>> can only very occasionally make big kernel bumps that would affect the
>> ABI that the driver would see, as that would require our hardware partners to recertify.
>> So, we're currently on a 4.4.52 base but the gfs2 driver is somewhat
>> newer as it is essentially self-contained and therefore we can
>> backport changes more easily. We currently have most of the GFS2 and
>> DLM changes that are in
>> 4.15 backported into the XenServer 7.6 kernel, but we can't take the
>> ones related to iomap as they are more invasive and it looks like a
>> number of the more recent performance targeting changes are also
>> predicated on the iomap framework.
>>
>> As I mentioned in the covering letter, the intra host problem would
>> largely be a non-issue if EX glocks were actually a host wide thing
>> with local mutexes used to share them within the host. I don't know if
>> this is what your patch set is trying to achieve or not. It's not so
>> much that the selection of a resource group is "random", just that
>> there is a random chance that we won't select the first RG that we
>> test, it probably does work out much the same though.
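
A purely illustrative userspace sketch of that idea (a random chance, on the early allocation passes only, of skipping the resource group just tested); the skip probability and all names are hypothetical rather than taken from the patch:

/*
 * Illustrative sketch, not the actual GFS2 code: on the first two
 * allocation passes, skip the currently selected resource group with
 * some probability so that concurrent writers spread across rgrps
 * instead of all contending on the same glock.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define SKIP_DENOM 2	/* roughly a 1-in-2 chance of skipping */

static bool should_skip_rgrp(int pass)
{
	/* Only perturb the cheap passes; later passes must succeed. */
	if (pass >= 2)
		return false;
	return (rand() % SKIP_DENOM) == 0;
}

int main(void)
{
	srand(1234);
	for (int pass = 0; pass < 3; pass++)
		for (int rgrp = 0; rgrp < 4; rgrp++)
			printf("pass %d rgrp %d: %s\n", pass, rgrp,
			       should_skip_rgrp(pass) ? "skip" : "try");
	return 0;
}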
>>
>> The inter host problem addressed by the second patch seems to be less
>> amenable to avoidance as the hosts don't seem to have a synchronous
>> view of the state of the resource group locks (for understandable
>> reasons as I'd expect this to be very expensive to keep sync'd). So it
>> seemed reasonable to try to make it "expensive" to request a resource
>> that someone else is using and also to avoid immediately grabbing it
>> back if we've been asked to relinquish it. It does seem to give a
>> fairer balance to the usage without being massively invasive.
>>
>> We thought we should share these with the community anyway even if
>> they only serve as inspiration for more detailed changes and also to
>> describe the scenarios where we're seeing issues now that we have
>> completed implementing the XenServer support for GFS2 that we
>> discussed back in Nuremberg last year. In our testing they certainly
>> make things better. They probably aren't fully optimal as we can't
>> maintain 10g wire speed consistently across the full LUN, but we're
>> getting about 75%, which is certainly better than we were seeing before we started looking at this.
>>
>> Thanks,
>>
>> 	Mark.
> Hi Mark,
>
> I'm really curious if you guys tried the two patches I posted here from
> 17 January 2018 in place of the two patches you posted. We see much better throughput with those over stock.
>
> I know Steve wants a different solution, and in the long run it will be a better one, but I've been trying to convince him we should use them as a stop-gap measure to mitigate this problem until we get a more proper solution in place (which is obviously taking some time, due to unforeseen circumstances).
>
> Regards,
>
> Bob Peterson
>





^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 13:18           ` Steven Whitehouse
@ 2018-09-28 13:43             ` Tim Smith
  2018-09-28 13:59               ` Bob Peterson
  2018-09-28 15:09               ` Steven Whitehouse
  0 siblings, 2 replies; 18+ messages in thread
From: Tim Smith @ 2018-09-28 13:43 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Friday, 28 September 2018 14:18:59 BST Steven Whitehouse wrote:
> Hi,
> 
> On 28/09/18 13:50, Mark Syms wrote:
> > Hi Bob,
> > 
> > The patches look quite good and would seem to help in the intra-node
> > congestion case, which our first patch was trying to do. We haven't tried
> > them yet but I'll pull a build together and try to run it over the
> > weekend.
> > 
> > We don't, however, see that they would help in the situation we saw for the
> > second patch where rgrp glocks would get bounced around between hosts at
> > high speed and cause lots of state flushing to occur in the process as
> > the stats don't take any account of anything other than network latency
> > whereas there is more involved with a rgrp glock when state needs to be
> > flushed.
> > 
> > Any thoughts on this?
> > 
> > Thanks,
> > 
> > 	Mark.
> 
> There are a few points here... the stats measure the latency of the DLM
> requests. Since in order to release a lock, some work has to be done,
> and the lock is not released until that work is complete, the stats do
> include that in their timings.

I think what's happening for us is that the work that needs to be done to 
release an rgrp lock is happening pretty fast and is about the same in all 
cases, so the stats are not providing a meaningful distinction. We see the 
same lock (or small number of locks) bouncing back and forth between nodes 
with neither node seeming to consider them congested enough to avoid, even 
though the FS is <50% full and there must be plenty of other non-full rgrps.
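
A rough userspace sketch of the kind of stats-based congestion test being discussed here: a smoothed per-rgrp DLM latency compared against a filesystem-wide figure. It also illustrates why the test gives little signal when every release is about equally cheap; the field names and the 2x threshold are hypothetical:

/*
 * Keep an exponentially weighted round-trip time per resource group,
 * compare it against the filesystem-wide average, and treat the rgrp
 * as congested when it is significantly worse.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct lock_stats {
	uint64_t srtt_ns;	/* smoothed DLM round-trip time */
};

static void stats_update(struct lock_stats *s, uint64_t sample_ns)
{
	/* EWMA with a 1/8 gain, similar in spirit to TCP SRTT. */
	s->srtt_ns += ((int64_t)sample_ns - (int64_t)s->srtt_ns) / 8;
}

static bool rgrp_congested(const struct lock_stats *rgrp,
			   const struct lock_stats *global)
{
	/* "Congested" = this rgrp's lock latency is >2x the global average. */
	return rgrp->srtt_ns > 2 * global->srtt_ns;
}

int main(void)
{
	struct lock_stats global = { .srtt_ns = 200000 };	/* 200us */
	struct lock_stats rg = { .srtt_ns = 200000 };

	stats_update(&rg, 2000000);	/* one very slow 2ms acquisition */
	printf("rgrp srtt=%llu ns congested=%d\n",
	       (unsigned long long)rg.srtt_ns, rgrp_congested(&rg, &global));
	return 0;
}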

-- 
Tim Smith <tim.smith@citrix.com>




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 12:55         ` Bob Peterson
@ 2018-09-28 13:56           ` Mark Syms
  2018-10-02 13:50             ` Mark Syms
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Syms @ 2018-09-28 13:56 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The hosts stayed up, and in the first occurrence of this (one that we caught in flight as opposed to only seeing the aftermath in automation logs) the locks all actually unwedged after about 2 hours.

	Mark.

-----Original Message-----
From: Bob Peterson <rpeterso@redhat.com> 
Sent: 28 September 2018 13:56
To: Mark Syms <Mark.Syms@citrix.com>
Cc: cluster-devel at redhat.com; Tim Smith <tim.smith@citrix.com>; Ross Lagerwall <ross.lagerwall@citrix.com>; Andreas Gruenbacher <agruenba@redhat.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Hi Bob,
> 
> No, we haven't but it wouldn't be hard for us to replace our patches 
> in our internal patchqueue with these and try them. Will let you know what we find.
> 
> We have also seen, what we think is an unrelated issue where we get 
> the following backtrace in kern.log and our system stalls
> 
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 
> blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O
> 4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D
> ffff88019628bc90     0 15480      1 0x00000000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90
> ffff880198f11c00 ffff88005a509c00 ffff88019628c000 Sep 21 21:19:09 
> cl15-05 kernel: [21389.462795]  ffffc90040226000
> ffff88019628bd80 fffffffffffffe58 ffff8801818da418 Sep 21 21:19:09 
> cl15-05 kernel: [21389.462799]  ffff88019628bca8
> ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68 Sep 21 21:19:09 
> cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>]
> schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>]
> find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ?
> gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>]
> gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>]
> do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ?
> gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ?
> cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>]
> gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>]
> SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>]
> entry_SYSCALL_64_fastpath+0x12/0x71
> 
> We think there is a possibility, given that this code path only gets 
> entered if a glock is being destroyed, that there is a time of check, 
> time of use issue here where by the time that schedule gets called the 
> thing which we expect to be waking us up has completed dying and 
> therefore won't trigger a wakeup for us. We have only seen this a couple of
> times in fairly intensive VM stress tests where a lot of flocks get 
> used on a small number of lock files (we use them to ensure consistent 
> behaviour of disk activation/deactivation and also access to the 
> database with the system state) but it's concerning nonetheless. We're 
> looking at replacing the call to schedule with schedule_timeout with a 
> timeout of maybe HZ to ensure that we will always get out of the 
> schedule operation and retry. Is this something you think you may have seen or have any ideas on?
> 
> Thanks,
> 
> 	Mark.

Hi Mark,

It's very common to get call traces like that when one of the nodes in a cluster goes down and the other nodes all wait for the failed node to be fenced, etc. The node failure causes dlm to temporarily stop granting locks until the issue is resolved. This is expected behavior, and dlm recovery should eventually grant the lock once the node is properly removed from the cluster. I haven't seen it on a flock glock, because I personally don't often run with flocks, but it often happens to me with other glocks when I do recovery testing (which I've been doing a lot of lately).

So is there a node failure in your case? If there's a node failure, dlm should recover the locks and allow the waiter to continue normally. If it's not a node failure, it's hard to say... I know Andreas fixed some problems with the rcu locking we do to protect the glock rhashtable.
Perhaps the kernel you're using is missing one of his patches? Or maybe it's a new bug. Adding Andreas to the cc.

Regards,

Bob Peterson



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 13:43             ` Tim Smith
@ 2018-09-28 13:59               ` Bob Peterson
  2018-09-28 14:11                 ` Mark Syms
  2018-09-28 15:09                 ` Tim Smith
  2018-09-28 15:09               ` Steven Whitehouse
  1 sibling, 2 replies; 18+ messages in thread
From: Bob Peterson @ 2018-09-28 13:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> I think what's happening for us is that the work that needs to be done to
> release an rgrp lock is happening pretty fast and is about the same in all
> cases, so the stats are not providing a meaningful distinction. We see the
> same lock (or small number of locks) bouncing back and forth between nodes
> with neither node seeming to consider them congested enough to avoid, even
> though the FS is <50% full and there must be plenty of other non-full rgrps.
> 
> --
> Tim Smith <tim.smith@citrix.com>

Hi Tim,

Interesting.
I've done experiments in the past where I allowed resource group glocks
to take advantage of the "minimum hold time" which is today only used for
inode glocks. In my experiments it's made no appreciable difference that I
can recall, but it might be an interesting experiment for you to try.
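
For illustration, a userspace sketch of a minimum-hold-time policy of the sort described above, where a remote demote request is deferred until the lock has been held for at least some minimum interval. The names and the 100ms figure are made up for the example and are not the kernel's values:

/*
 * When a remote demote request arrives, report how long the demote
 * should be deferred so the lock is held for at least MIN_HOLD_NS.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define MIN_HOLD_NS (100 * 1000 * 1000ULL)	/* 100ms */

struct glock {
	uint64_t acquired_ns;	/* when we last got the lock */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Returns how long to defer a remote demote request; 0 = demote now. */
static uint64_t demote_delay_ns(const struct glock *gl)
{
	uint64_t held = now_ns() - gl->acquired_ns;

	return held >= MIN_HOLD_NS ? 0 : MIN_HOLD_NS - held;
}

int main(void)
{
	struct glock gl = { .acquired_ns = now_ns() };

	printf("defer demote by %llu ns\n",
	       (unsigned long long)demote_delay_ns(&gl));
	return 0;
}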

Steve's right that we need to be careful not to improve one aspect of
performance while causing another aspect's downfall, like improving intra-node
congestion problems at the expense of inter-node congestion.

Regards,

Bob Peterson



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 13:59               ` Bob Peterson
@ 2018-09-28 14:11                 ` Mark Syms
  2018-09-28 15:09                 ` Tim Smith
  1 sibling, 0 replies; 18+ messages in thread
From: Mark Syms @ 2018-09-28 14:11 UTC (permalink / raw)
  To: cluster-devel.redhat.com

To give some context here, the environment we were testing this in looks like this:

* 2 x XenServer hosts, Dell R430s with Xeon E5-2630 v3 CPUs and Intel X520 10g NICs dedicated to the iSCSI traffic for GFS2 (only using one per host)
* Dedicated Linux filer packed with SSDs and 128GB of RAM. The native storage can sustainably support > 5GB/s write throughput and the host (currently) has a bonded pair of X710 10g NICs to serve the hosts.

So basically the storage is significantly faster than the network and will not be the bottleneck in these tests.

Whether what we observe here will change when we update the filer to have 6 10g NICs (planned in the next few weeks) remains to be seen; obviously we'll need to add some more hosts to the cluster, but we have another 10 in the rack so that isn't an issue.

Mark.

-----Original Message-----
From: Bob Peterson <rpeterso@redhat.com> 
Sent: 28 September 2018 15:00
To: Tim Smith <tim.smith@citrix.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>; Mark Syms <Mark.Syms@citrix.com>; cluster-devel at redhat.com; Ross Lagerwall <ross.lagerwall@citrix.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> I think what's happening for us is that the work that needs to be done 
> to release an rgrp lock is happening pretty fast and is about the same 
> in all cases, so the stats are not providing a meaningful distinction. 
> We see the same lock (or small number of locks) bouncing back and 
> forth between nodes with neither node seeming to consider them 
> congested enough to avoid, even though the FS is <50% full and there must be plenty of other non-full rgrps.
> 
> --
> Tim Smith <tim.smith@citrix.com>

Hi Tim,

Interesting.
I've done experiments in the past where I allowed resource group glocks to take advantage of the "minimum hold time" which is today only used for inode glocks. In my experiments it's made no appreciable difference that I can recall, but it might be an interesting experiment for you to try.

Steve's right that we need to be careful not to improve one aspect of performance while causing another aspect's downfall, like improving intra-node congestion problems at the expense of inter-node congestion.

Regards,

Bob Peterson



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 13:59               ` Bob Peterson
  2018-09-28 14:11                 ` Mark Syms
@ 2018-09-28 15:09                 ` Tim Smith
  1 sibling, 0 replies; 18+ messages in thread
From: Tim Smith @ 2018-09-28 15:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Friday, 28 September 2018 14:59:48 BST Bob Peterson wrote:
> ----- Original Message -----
> 
> > I think what's happening for us is that the work that needs to be done to
> > release an rgrp lock is happening pretty fast and is about the same in all
> > cases, so the stats are not providing a meaningful distinction. We see the
> > same lock (or small number of locks) bouncing back and forth between nodes
> > with neither node seeming to consider them congested enough to avoid, even
> > though the FS is <50% full and there must be plenty of other non-full
> > rgrps.
> > 
> > --
> > Tim Smith <tim.smith@citrix.com>
> 
> Hi Tim,
> 
> Interesting.
> I've done experiments in the past where I allowed resource group glocks
> to take advantage of the "minimum hold time" which is today only used for
> inode glocks. In my experiments it's made no appreciable difference that I
> can recall, but it might be an interesting experiment for you to try.

Our second patch does that, which should in theory give the stats calculation 
more to go on, but was mostly to allow a bit more work on a resource group 
when we do get it. It helps a bit, but doesn't really seem to keep us away 
from contended locks very well, though we do get to hold on to them longer. I 
speculate that it will improve things like delete operations, but we haven't 
measured that specifically.

We also add a timestamp when we are asked to demote a lock, and then pay 
attention to it only for rgrp locks in inplace_reserve, trying to stay away 
from rgrps we've been asked to demote recently unless we're desperate. That 
helps a *lot*; we see two nodes fight a bit, learn to stay clear of each 
other, and not fight again until the FS is ~80% full.
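
A rough userspace sketch of that logic (all names hypothetical, not lifted from the patch): stamp the rgrp when a demote request arrives, and skip recently demoted rgrps on the early allocation passes only:

/*
 * Record when a demote request was last seen for a resource group and,
 * on the early allocation passes, avoid rgrps demoted within the last
 * second. Later passes ignore the stamp so allocation cannot fail
 * outright just because everything looks contended.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct rgrp {
	uint64_t last_demote_ms;	/* 0 = never asked to demote */
};

static void note_demote_request(struct rgrp *rg, uint64_t now_ms)
{
	rg->last_demote_ms = now_ms;
}

static bool skip_recently_demoted(const struct rgrp *rg, uint64_t now_ms,
				  int pass)
{
	if (pass >= 2)		/* desperate: take whatever we can get */
		return false;
	return rg->last_demote_ms && now_ms - rg->last_demote_ms < 1000;
}

int main(void)
{
	struct rgrp rg = { 0 };

	note_demote_request(&rg, 5000);
	printf("pass 0 @5200ms: skip=%d\n", skip_recently_demoted(&rg, 5200, 0));
	printf("pass 0 @6500ms: skip=%d\n", skip_recently_demoted(&rg, 6500, 0));
	printf("pass 2 @5200ms: skip=%d\n", skip_recently_demoted(&rg, 5200, 2));
	return 0;
}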

All our testing is done with multiple fio jobs per node, usually filling the 
FS from empty, but we occasionally run one with randwrite on the files we just 
laid out, just to make sure we didn't break the steady-state case.

I like the idea of your intra-node patches more than my coin-tossing approach, 
so it'll be interesting to see what results we get when Mark runs them.

> Steve's right that we need to be careful not to improve one aspect of
> performance while causing another aspect's downfall, like improving
> intra-node congestion problems at the expense of inter-node congestion.

We're also rather keen on keeping multi-node performance high. Our initial 
problem was that a single node was going so slowly even without competition 
that we couldn't reason about multiple nodes.

-- 
Tim Smith <tim.smith@citrix.com>




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 13:43             ` Tim Smith
  2018-09-28 13:59               ` Bob Peterson
@ 2018-09-28 15:09               ` Steven Whitehouse
  1 sibling, 0 replies; 18+ messages in thread
From: Steven Whitehouse @ 2018-09-28 15:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 28/09/18 14:43, Tim Smith wrote:
> On Friday, 28 September 2018 14:18:59 BST Steven Whitehouse wrote:
>> Hi,
>>
>> On 28/09/18 13:50, Mark Syms wrote:
>>> Hi Bob,
>>>
>>> The patches look quite good and would seem to help in the intra-node
>>> congestion case, which our first patch was trying to do. We haven't tried
>>> them yet but I'll pull a build together and try to run it over the
>>> weekend.
>>>
>>> We don't, however, see that they would help in the situation we saw for the
>>> second patch where rgrp glocks would get bounced around between hosts at
>>> high speed and cause lots of state flushing to occur in the process as
>>> the stats don't take any account of anything other than network latency
>>> whereas there is more involved with a rgrp glock when state needs to be
>>> flushed.
>>>
>>> Any thoughts on this?
>>>
>>> Thanks,
>>>
>>> 	Mark.
>> There are a few points here... the stats measure the latency of the DLM
>> requests. Since in order to release a lock, some work has to be done,
>> and the lock is not released until that work is complete, the stats do
>> include that in their timings.
> I think what's happening for us is that the work that needs to be done to
> release an rgrp lock is happening pretty fast and is about the same in all
> cases, so the stats are not providing a meaningful distinction. We see the
> same lock (or small number of locks) bouncing back and forth between nodes
> with neither node seeming to consider them congested enough to avoid, even
> though the FS is <50% full and there must be plenty of other non-full rgrps.
>

It could well be that this is the case. The system was designed to deal with 
inter-node contention on resource group locks. If there is no inter-node 
contention then the times should be similar and the system should have 
little effect. If the contention is all intra-node then we'd prefer a 
solution which increases the parallelism there - it covers more use 
cases than just allocation. Also it will help to keep related blocks 
closer to each other, particularly as the filesystem ages.

It might also be that there is a bug too - so it's worth looking closely at 
the numbers just to make sure that it is working as intended.

Steve.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
  2018-09-28 13:56           ` Mark Syms
@ 2018-10-02 13:50             ` Mark Syms
  0 siblings, 0 replies; 18+ messages in thread
From: Mark Syms @ 2018-10-02 13:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Just to follow up on this further. We made a change to swap the call to schedule for schedule_timeout and added a kprintf in the case where schedule_timeout actually timed out rather than being woken. Running in our multi-VM, multi-host stress test we saw one instance of the kprintf message and no stuck tasks at any point within a 24-hour stress run, so it looks like this is beneficial, at least in our kernel.

I'll clean the change up and send it on, but it's likely that it's only required due to a race elsewhere, possibly one that has already been fixed in your latest codebase. Our thoughts revolve around the code performing cleanup on the hashtable checking whether there are any waiters that will need to be notified on completion of the cleanup immediately before we start to wait, and then not notifying as it's not aware of the waiter; we have no concrete proof of that, just the outcome that we get into schedule and stick there.
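
For illustration, a userspace sketch of the bounded-wait pattern described above, using a condition variable waited on with a one second timeout in a re-check loop, analogous to swapping schedule() for schedule_timeout(HZ); all names are hypothetical:

/*
 * Instead of sleeping indefinitely for a wakeup that may have been
 * lost, wait with a ~1s timeout and re-check the condition, logging
 * when the timeout (rather than a wakeup) got us moving again.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool glock_dead;		/* condition we are waiting for */

static void wait_for_dead_glock(void)
{
	struct timespec deadline;

	pthread_mutex_lock(&lock);
	while (!glock_dead) {
		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += 1;	/* analogue of schedule_timeout(HZ) */
		if (pthread_cond_timedwait(&cond, &lock, &deadline) != 0)
			fprintf(stderr, "timed out, re-checking\n");
	}
	pthread_mutex_unlock(&lock);
}

static void *killer(void *arg)
{
	struct timespec half = { 0, 500 * 1000 * 1000 };

	(void)arg;
	nanosleep(&half, NULL);		/* simulate the glock finishing dying */
	pthread_mutex_lock(&lock);
	glock_dead = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, killer, NULL);
	wait_for_dead_glock();
	pthread_join(t, NULL);
	puts("glock gone, retrying lookup");
	return 0;
}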

Mark.	

-----Original Message-----
From: Mark Syms 
Sent: 28 September 2018 14:57
To: 'Bob Peterson' <rpeterso@redhat.com>
Cc: cluster-devel at redhat.com; Tim Smith <tim.smith@citrix.com>; Ross Lagerwall <ross.lagerwall@citrix.com>; Andreas Gruenbacher <agruenba@redhat.com>
Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

The hosts stayed up, and in the first occurrence of this (one that we caught in flight as opposed to only seeing the aftermath in automation logs) the locks all actually unwedged after about 2 hours.

	Mark.

-----Original Message-----
From: Bob Peterson <rpeterso@redhat.com>
Sent: 28 September 2018 13:56
To: Mark Syms <Mark.Syms@citrix.com>
Cc: cluster-devel at redhat.com; Tim Smith <tim.smith@citrix.com>; Ross Lagerwall <ross.lagerwall@citrix.com>; Andreas Gruenbacher <agruenba@redhat.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Hi Bob,
> 
> No, we haven't but it wouldn't be hard for us to replace our patches 
> in our internal patchqueue with these and try them. Will let you know what we find.
> 
> We have also seen, what we think is an unrelated issue where we get 
> the following backtrace in kern.log and our system stalls
> 
> Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 
> blocked for more than 120 seconds.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O
> 4.4.0+10 #1
> Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D
> ffff88019628bc90     0 15480      1 0x00000000
> Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90
> ffff880198f11c00 ffff88005a509c00 ffff88019628c000 Sep 21 21:19:09
> cl15-05 kernel: [21389.462795]  ffffc90040226000
> ffff88019628bd80 fffffffffffffe58 ffff8801818da418 Sep 21 21:19:09
> cl15-05 kernel: [21389.462799]  ffff88019628bca8
> ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68 Sep 21 21:19:09
> cl15-05 kernel: [21389.462803] Call Trace:
> Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>]
> schedule+0x64/0x80
> Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>]
> find_insert_glock+0x4a4/0x530 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ?
> gfs2_holder_wake+0x20/0x20 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>]
> gfs2_glock_get+0x3d/0x330 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>]
> do_flock+0xf2/0x210 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ?
> gfs2_getattr+0xe0/0xf0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ?
> cp_new_stat+0x10b/0x120
> Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>]
> gfs2_flock+0x78/0xa0 [gfs2]
> Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>]
> SyS_flock+0x129/0x170
> Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>]
> entry_SYSCALL_64_fastpath+0x12/0x71
> 
> We think there is a possibility, given that this code path only gets 
> entered if a glock is being destroyed, that there is a time of check, 
> time of use issue here where by the time that schedule gets called the 
> thing which we expect to be waking us up has completed dying and 
> therefore won't trigger a wakeup for us. We have only seen this a couple of 
> times in fairly intensive VM stress tests where a lot of flocks get 
> used on a small number of lock files (we use them to ensure consistent 
> behaviour of disk activation/deactivation and also access to the 
> database with the system state) but it's concerning nonetheless. We're 
> looking at replacing the call to schedule with schedule_timeout with a 
> timeout of maybe HZ to ensure that we will always get out of the 
> schedule operation and retry. Is this something you think you may have seen or have any ideas on?
> 
> Thanks,
> 
> 	Mark.

Hi Mark,

It's very common to get call traces like that when one of the nodes in a cluster goes down and the other nodes all wait for the failed node to be fenced, etc. The node failure causes dlm to temporarily stop granting locks until the issue is resolved. This is expected behavior, and dlm recovery should eventually grant the lock once the node is properly removed from the cluster. I haven't seen it on a flock glock, because I personally don't often run with flocks, but it often happens to me with other glocks when I do recovery testing (which I've been doing a lot of lately).

So is there a node failure in your case? If there's a node failure, dlm should recover the locks and allow the waiter to continue normally. If it's not a node failure, it's hard to say... I know Andreas fixed some problems with the rcu locking we do to protect the glock rhashtable.
Perhaps the kernel you're using is missing one of his patches? Or maybe it's a new bug. Adding Andreas to the cc.

Regards,

Bob Peterson



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-10-02 13:50 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-20 14:52 [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Mark Syms
2018-09-20 14:52 ` [Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator Mark Syms
2018-09-20 14:52 ` [Cluster-devel] [PATCH 2/2] GFS2: Avoid recently demoted rgrps Mark Syms
2018-09-20 17:17 ` [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Bob Peterson
2018-09-20 17:47   ` Mark Syms
2018-09-20 18:16     ` Steven Whitehouse
2018-09-28 12:23     ` Bob Peterson
2018-09-28 12:36       ` Mark Syms
2018-09-28 12:50         ` Mark Syms
2018-09-28 13:18           ` Steven Whitehouse
2018-09-28 13:43             ` Tim Smith
2018-09-28 13:59               ` Bob Peterson
2018-09-28 14:11                 ` Mark Syms
2018-09-28 15:09                 ` Tim Smith
2018-09-28 15:09               ` Steven Whitehouse
2018-09-28 12:55         ` Bob Peterson
2018-09-28 13:56           ` Mark Syms
2018-10-02 13:50             ` Mark Syms
