[Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator

From: Mark Syms <mark.syms@citrix.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator
Date: Thu, 20 Sep 2018 15:52:12 +0100
Message-ID: <1537455133-48589-2-git-send-email-mark.syms@citrix.com>
In-Reply-To: <1537455133-48589-1-git-send-email-mark.syms@citrix.com>

From: Tim Smith <tim.smith@citrix.com>

When growing a number of files on the same cluster node from different
threads (e.g. fio with 20 or so jobs), all those threads pile into
gfs2_inplace_reserve() independently, looking to claim a new resource
group. After a while they all synchronise, getting through the
gfs2_rgrp_used_recently()/gfs2_rgrp_congested() check together.

When this happens, write performance drops to about 1/5 on a
single-node cluster, and on multi-node clusters it drops to near zero
on some nodes. The output from "glocktop -r -H -d 1" then begins to
show many processes stuck in gfs2_inplace_reserve(), waiting on a
resource group lock.

This commit introduces a module parameter, skippy_rgrp_alloc, which,
when set to 1, adds random jitter to the first two passes of
gfs2_inplace_reserve() when trying to lock a new resource group. The
allocator skips to the next resource group 1/2 of the time at first,
and each skip makes a further skip progressively less likely.

Signed-off-by: Tim Smith <tim.smith@citrix.com>
---
 fs/gfs2/rgrp.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 1ad3256..994eb7f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -19,6 +19,7 @@
 #include <linux/blkdev.h>
 #include <linux/rbtree.h>
 #include <linux/random.h>
+#include <linux/module.h>
 
 #include "gfs2.h"
 #include "incore.h"
@@ -49,6 +50,11 @@
 #define LBITSKIP00 (0x0000000000000000UL)
 #endif
 
+static int gfs2_skippy_rgrp_alloc;
+
+module_param_named(skippy_rgrp_alloc, gfs2_skippy_rgrp_alloc, int, 0644);
+MODULE_PARM_DESC(skippy_rgrp_alloc, "Set skippiness of resource group allocator, 0|1, where 1 causes resource groups to be randomly skipped, with the likelihood of skipping progressively decreasing after a skip has occurred.");
+
 /*
  * These routines are used by the resource group routines (rgrp.c)
  * to keep track of block allocation.  Each block is represented by two
@@ -2016,6 +2022,11 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 	u64 last_unlinked = NO_BLOCK;
 	int loops = 0;
 	u32 free_blocks, skip = 0;
+	/*
+	 * gfs2_skippy_rgrp_alloc provides our initial skippiness.
+	 * randskip will thus be 2-255 if we want it to do anything.
+	 */
+	u8 randskip = gfs2_skippy_rgrp_alloc + 1;
 
 	if (sdp->sd_args.ar_rgrplvb)
 		flags |= GL_SKIP;
@@ -2046,10 +2057,30 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 				if (loops == 0 &&
 				    !fast_to_acquire(rs->rs_rbm.rgd))
 					goto next_rgrp;
-				if ((loops < 2) &&
-				    gfs2_rgrp_used_recently(rs, 1000) &&
-				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-					goto next_rgrp;
+				if (loops < 2) {
+					/*
+					 * If resource group allocation is requested to be skippy,
+					 * roll a hypothetical die of <randskip> sides and skip
+					 * straight to the next resource group anyway if it comes
+					 * up 1.
+					 */
+					if (gfs2_skippy_rgrp_alloc) {
+						u8 jitter;
+
+						prandom_bytes(&jitter, sizeof(jitter));
+						if ((jitter % randskip) == 0) {
+							/*
+							 * If we are choosing to skip, bump randskip to make it
+							 * successively less likely that we will skip again
+							 */
+							randskip++;
+							goto next_rgrp;
+						}
+					}
+					if (gfs2_rgrp_used_recently(rs, 1000) &&
+						gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+						goto next_rgrp;
+				}
 			}
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
 						   LM_ST_EXCLUSIVE, flags,
-- 
1.8.3.1
