* [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Rework rgrp glock congestion functions for intra-node
@ 2018-01-17 14:23 Bob Peterson
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently Bob Peterson
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases Bob Peterson
  0 siblings, 2 replies; 9+ messages in thread
From: Bob Peterson @ 2018-01-17 14:23 UTC (permalink / raw)
To: cluster-devel.redhat.com

I've gone through a few iterations of this patch set now with mixed reviews,
the previous attempt being a resend on 10 January. Based on feedback, I've
greatly simplified the patch set and broken it down into the two patches
given here.

The first patch reworks function gfs2_rgrp_used_recently() to simplify it
and use better calculations. The second patch adds an "intra-node"
congestion check to the existing function, which previously checked only
inter-node (DLM) congestion; that omission ended up severely hurting
single-node scalability. Together, these patches greatly improve the
scalability of the GFS2 block allocator in cases where more than one
process wants to allocate blocks.
Here are the results of a test I ran on a test machine using iozone with a
growing number of concurrent processes.

Before the two patches were applied:

	Children see throughput for  1 initial writers	=  543094.50 kB/sec
	Children see throughput for  2 initial writers	=  631279.31 kB/sec
	Children see throughput for  4 initial writers	=  618569.31 kB/sec
	Children see throughput for  6 initial writers	=  672926.77 kB/sec
	Children see throughput for  8 initial writers	=  620530.25 kB/sec
	Children see throughput for 10 initial writers	=  637743.89 kB/sec
	Children see throughput for 12 initial writers	=  625197.03 kB/sec
	Children see throughput for 14 initial writers	=  627233.04 kB/sec
	Children see throughput for 16 initial writers	=  346880.52 kB/sec

After the two patches are applied:

	Children see throughput for  1 initial writers	=  539514.88 kB/sec
	Children see throughput for  2 initial writers	=  630325.97 kB/sec
	Children see throughput for  4 initial writers	=  820960.05 kB/sec
	Children see throughput for  6 initial writers	=  773291.00 kB/sec
	Children see throughput for  8 initial writers	=  764553.85 kB/sec
	Children see throughput for 10 initial writers	=  837788.38 kB/sec
	Children see throughput for 12 initial writers	=  752443.34 kB/sec
	Children see throughput for 14 initial writers	=  781917.29 kB/sec
	Children see throughput for 16 initial writers	=  816540.99 kB/sec

With one or two processes running, the difference amounts to noise. But even
with only 4 concurrent processes, the block allocator has 25 percent better
throughput with the patches. At 16 concurrent processes, the overall
throughput is more than double, although that number may be skewed by the
fact that I've got 2 sockets, each of which has 8 cores, and some of the
cores are used by the driver and its monitoring.
---

Bob Peterson (2):
  GFS2: Simplify gfs2_rgrp_used_recently
  GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases

 fs/gfs2/rgrp.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 14 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently
  2018-01-17 14:23 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Rework rgrp glock congestion functions for intra-node Bob Peterson
@ 2018-01-17 14:23 ` Bob Peterson
  2018-01-18  9:04   ` Steven Whitehouse
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases Bob Peterson
  1 sibling, 1 reply; 9+ messages in thread
From: Bob Peterson @ 2018-01-17 14:23 UTC (permalink / raw)
To: cluster-devel.redhat.com

This patch simplifies function gfs2_rgrp_used_recently. Now it uses
ktime functions to do its calculations. Also, it's called with the
rgrp rather than the reservation, and the constant HZ rather than
the hard-coded value 1000.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/rgrp.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 6dea72f49316..4c9467bf9ab6 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1900,21 +1900,18 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 }
 
 /**
- * gfs2_rgrp_used_recently
- * @rs: The block reservation with the rgrp to test
+ * gfs2_rgrp_used_recently - check if a rgrp has been used recently
+ * @rgd: The rgrp to test
  * @msecs: The time limit in milliseconds
  *
  * Returns: True if the rgrp glock has been used within the time limit
  */
-static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
-				    u64 msecs)
+static inline bool gfs2_rgrp_used_recently(const struct gfs2_rgrpd *rgd,
+					   u64 msecs)
 {
-	u64 tdiff;
-
-	tdiff = ktime_to_ns(ktime_sub(ktime_get_real(),
-				      rs->rs_rbm.rgd->rd_gl->gl_dstamp));
-
-	return tdiff > (msecs * 1000 * 1000);
+	return (ktime_before(ktime_get_real(),
+			     ktime_add(rgd->rd_gl->gl_dstamp,
+				       ms_to_ktime(msecs))));
 }
 
 static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
@@ -2014,7 +2011,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 			    !fast_to_acquire(rs->rs_rbm.rgd))
 				goto next_rgrp;
 			if ((loops < 2) &&
-			    gfs2_rgrp_used_recently(rs, 1000) &&
+			    gfs2_rgrp_used_recently(rs->rs_rbm.rgd, HZ) &&
 			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
 				goto next_rgrp;
 		}
-- 
2.14.3
* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently Bob Peterson
@ 2018-01-18  9:04   ` Steven Whitehouse
  0 siblings, 0 replies; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-18 9:04 UTC (permalink / raw)
To: cluster-devel.redhat.com

Hi,

On 17/01/18 14:23, Bob Peterson wrote:
> This patch simplifies function gfs2_rgrp_used_recently. Now it uses
> ktime functions to do its calculations. Also, it's called with the
> rgrp rather than the reservation, and the constant HZ rather than
> the hard-coded value 1000.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>  fs/gfs2/rgrp.c | 19 ++++++++-----------
>  1 file changed, 8 insertions(+), 11 deletions(-)

I'm not sure that this does simplify it. In any case, the argument is a
time specified in msec, and HZ is a constant whose value is the number of
clock ticks in a second. It makes no sense to assign one to the other.

It would be good to get away from the constant of 1 sec that is used here,
but that requires some analysis in order to figure out how to create a
suitable algorithm. If, as a result of testing, we can work out a more
suitable constant to use here, then that is fine as a stop-gap measure
too. HZ, though, makes no sense at all in this case,

Steve.
* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-17 14:23 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Rework rgrp glock congestion functions for intra-node Bob Peterson
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently Bob Peterson
@ 2018-01-17 14:23 ` Bob Peterson
  2018-01-18  9:08   ` Steven Whitehouse
  1 sibling, 1 reply; 9+ messages in thread
From: Bob Peterson @ 2018-01-17 14:23 UTC (permalink / raw)
To: cluster-devel.redhat.com

Before this patch, function gfs2_rgrp_congested only checked the
inter-node (dlm) case. But it only makes sense to check the rgrp
glock statistics if we're using a real locking protocol like dlm.
For lock_nolock, this is just a waste of time and can give false
positives.

This patch adds logic to function gfs2_rgrp_congested to check
inter-node and intra-node congestion. It checks for intra-node
(local node process) congestion first, since that's faster.

If there are other processes actively using this rgrp (i.e. if
they have reservations or are holding its glock) it's considered
congested in the intra-node sense.

If there is not intra-node congestion, and the locking protocol
is lock_dlm, we need to check the locking statistics to see if
there is any inter-node congestion.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/rgrp.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 4c9467bf9ab6..97db13a841f4 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1826,10 +1826,57 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
 	return;
 }
 
+/**
+ * other_rgrp_users - figure out if this rgrp has other local users
+ * @rgd: The resource group
+ * @locked: true if we've already held the glock
+ *
+ * We're trying to figure out if the given rgrp has anybody competing for
+ * its free space. If other processes have enqueued its glock, there's a
+ * good chance of competition.
+ *
+ * If there are multi-block reservations for this rgrp, there's a good
+ * chance another process will lock the rgrp for block allocations soon.
+ *
+ * If we've already held the glock, we no longer care if there are holders
+ * because that's now a given (rgrp glocks are never shared).
+ */
+static inline bool other_rgrp_users(const struct gfs2_rgrpd *rgd, bool locked)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (!locked && !list_empty(&gl->gl_holders))
+		return true;
+	if (!RB_EMPTY_ROOT(&rgd->rd_rstree))
+		return true;
+	return false;
+}
+
 /**
  * gfs2_rgrp_congested - Use stats to figure out whether an rgrp is congested
  * @rgd: The rgrp in question
  * @loops: An indication of how picky we can be (0=very, 1=less so)
+ * @locked: Indication of whether we've already held the rgrp glock
+ *
+ * There are two kinds of congestion: inter-node and intra-node.
+ *
+ * Inter-node congestion is where multiple nodes all want to allocate blocks
+ * inside the same rgrp, which means they need to trade the rgrp glock back
+ * and forth, which is a slow process. To mitigate this, we use glock
+ * statistics to predict whether the glock is historically fast to acquire.
+ *
+ * Intra-node congestion is where you have multiple processes on the same
+ * node, all trying to allocate blocks in the same rgrp. There's nothing wrong
+ * with doing so, but each process needs to wait for the other to release the
+ * rgrp glock before it may proceed.
+ *
+ * Both kinds of congestion can hurt performance, but it's faster to check
+ * intra-node, so we do that first. After all, why bother to check if we can
+ * get the glock quickly from DLM if other processes have also used that
+ * same reasoning.
+ *
+ * If we're not using inter-node locking (dlm) it doesn't make sense to check
+ * the glock statistics.
  *
  * This function uses the recently added glock statistics in order to
  * figure out whether a parciular resource group is suffering from
@@ -1853,7 +1900,8 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
  * Returns: A boolean verdict on the congestion status
  */
 
-static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
+static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops,
+				bool locked)
 {
 	const struct gfs2_glock *gl = rgd->rd_gl;
 	const struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
@@ -1865,6 +1913,15 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	u64 var;
 	int cpu, nonzero = 0;
 
+	/* Intra-node (local) lock congestion checks: */
+	if (loops == 0 && other_rgrp_users(rgd, locked))
+		return true;
+
+	/* If not inter-node (DLM) locking, we don't care about the stats */
+	if (rgd->rd_sbd->sd_lockstruct.ls_ops->lm_lock == NULL)
+		return false;
+
+	/* Inter-node lock congestion checks: */
 	preempt_disable();
 	for_each_present_cpu(cpu) {
 		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
@@ -2012,7 +2069,8 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 				goto next_rgrp;
 			if ((loops < 2) &&
 			    gfs2_rgrp_used_recently(rs->rs_rbm.rgd, HZ) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops,
+						false))
 				goto next_rgrp;
 		}
 		error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
@@ -2021,7 +2079,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 		if (unlikely(error))
 			return error;
 		if (!gfs2_rs_active(rs) && (loops < 2) &&
-		    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+		    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops, true))
 			goto skip_rgrp;
 		if (sdp->sd_args.ar_rgrplvb) {
 			error = update_rgrp_lvb(rs->rs_rbm.rgd);
-- 
2.14.3
* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases Bob Peterson
@ 2018-01-18  9:08   ` Steven Whitehouse
  2018-01-19 14:43     ` Bob Peterson
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-18 9:08 UTC (permalink / raw)
To: cluster-devel.redhat.com

Hi,

On 17/01/18 14:23, Bob Peterson wrote:
> Before this patch, function gfs2_rgrp_congested only checked the
> inter-node (dlm) case. But it only makes sense to check the rgrp
> glock statistics if we're using a real locking protocol like dlm.
> For lock_nolock, this is just a waste of time and can give false
> positives.
>
> This patch adds logic to function gfs2_rgrp_congested to check
> inter-node and intra-node congestion. It checks for intra-node
> (local node process) congestion first, since that's faster.
>
> If there are other processes actively using this rgrp (i.e. if
> they have reservations or are holding its glock) it's considered
> congested in the intra-node sense.
>
> If there is not intra-node congestion, and the locking protocol
> is lock_dlm, we need to check the locking statistics to see if
> there is any inter-node congestion.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>

My comments in relation to the previous patch still stand. This is still
the wrong answer to the problem. If we have lock contention here, then we
should be looking to fix the locking and not trying to avoid the issue by
choosing a different rgrp.

If we are deallocating blocks, we have no choice over which rgrp we lock,
since we must lock the rgrp that contains the blocks we want to
deallocate. The same problem exists there too, so let's fix this properly
as per my previous comments,

Steve.
* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-18  9:08 ` Steven Whitehouse
@ 2018-01-19 14:43   ` Bob Peterson
  2018-01-24 17:04     ` Bob Peterson
  0 siblings, 1 reply; 9+ messages in thread
From: Bob Peterson @ 2018-01-19 14:43 UTC (permalink / raw)
To: cluster-devel.redhat.com

Hi,

----- Original Message -----
| > This patch adds logic to function gfs2_rgrp_congested to check
| > inter-node and intra-node congestion. It checks for intra-node
| > (local node process) congestion first, since that's faster.
| >
| > If there are other processes actively using this rgrp (i.e. if
| > they have reservations or are holding its glock) it's considered
| > congested in the intra-node sense.
| My comments in relation to the previous patch still stand. This is still
| the wrong answer to the problem. If we have lock contention here, then
| we should be looking to fix the locking and not trying to avoid the
| issue by choosing a different rgrp.
|
| If we are deallocating blocks, we have no choice over which rgrp we
| lock, since we must lock the rgrp that contains the blocks we want to
| deallocate. The same problem exists there too, so let's fix this properly
| as per my previous comments,
|
| Steve.

Some of the relevant comments from the previous iteration were:

| This is the wrong solution I think. We should not be choosing other
| rgrps for allocation due to local contention. That is just likely to
| land up creating more fragmentation by spreading things out across the
| disk. The issue is that we can't currently separate the locking into dlm
| (exclusive) and local (shared) but important data structures spinlock
| protected, so that we land up with lock contention on the rgrp and
| everything serialized via the glock. At the moment the glock state
| machine only supports everything (dlm & local) shared or everything
| exclusive.
|
| If we want to resolve this issue, then we need to deal with that issue
| first and allow multiple local users of the same rgrp to co-exist with
| minimal lock contention. That requires the glock state machine to
| support that particular lock mode first though,
|
| Steve.

I'd like to hear more about how you envision this to work (see below).

If you study the problem with glocktop, you will see that in the
intra-node congestion case (for example, ten processes on one node all
doing iozone, dd, or other heavy users of the block allocator) most of the
processes are stuck waiting for various rgrp glocks. That's the problem.
My proposed solution was to reduce or eliminate that contention by letting
each process stick to a unique rgrp for block assignments. This is pretty
much what RHEL5 and RHEL6 did when they were using the "try locks" for
rgrps, and that's exactly why those older releases performed significantly
faster with those tests than RHEL7 and upstream.

It's true that the block deallocator suffers from the same issue, and my
proposed solution alleviates that too by spreading the blocks to multiple
rgrps. Blocks may be both allocated and deallocated simultaneously because
the glocks are unique. This is also not too different from how we gained
performance boosts with the Orlov algorithm to assign unique rgrps to
groups of files, which also spread the distance of the writes.

This will not increase file fragmentation because that is all mitigated by
the multi-block allocation algorithms we have in place. The file system
itself may incur more fragmentation, since files may now span a greater
distance with writes to unique rgrps, but that is no different from RHEL5,
RHEL6, and older releases, which acquired their rgrp glocks with "try
locks" and yielded basically the same results.

I'm not at all concerned about "head bouncing" performance issues because
those are pretty much absorbed by almost all modern storage arrays. That's
what the performance guys, and my own testing, tell me. Head-bounce
problems are also mitigated by many modern hard drives as well. And let's
face it: very few people are going to use GFS2 with hard drives that have
head-bounce performance problems; GFS2 is mostly used on storage arrays.

That's not to say there isn't a better solution. We've talked in the past
about possibly allowing multiple processes to share the same rgrp for
block allocations while the rgrp is exclusively locked (held) on that
node. We can, for example, have unique reservations, and as long as the
reservations don't conflict, we can safely assign blocks from them
simultaneously. However, we will still get into sticky situations and need
some kind of process-exclusive lock on the rgrp itself. For example, only
one process may assign fields like rg_free, rg_inodes, and so forth. Maybe
we can mitigate that with atomic variables and cpu boundaries.

For the deallocation path, we can, for example, pre-construct a
"multi-block reservation" structure for an existing file, then deallocate
from that with the knowledge that other processes will avoid blocks
touched by the reservations. But at some point, we will need a
process-exclusive lock (spin_lock, rwsem, mutex, etc.) and that may prove
to have a similar contention problem. This may require a fair amount of
design work to achieve both correctness and decent performance.

I've also thought about assigning unique glocks to each bitmap of the rgrp
rather than to the entire rgrp. We could assign each writer a unique
bitmap, but that's probably not good enough for two reasons: (1) writers
to non-zero-index bitmaps still need to write to the rgrp block itself to
adjust its used, free, and dinode counts; and (2) this would only scale to
the number of bitmaps, which is typically 5 or so, so the ten-writer tests
would still contend in a lot of cases. Still, it would help "large (e.g.
2GB) rgrp" performance where you typically have 13 or so bitmaps.

Since the glocks are kind of like a layer of caching for dlm locks, the
idea of possibly having more than one "EX" state for a glock, where both
map to the same DLM state, is intriguing. After all, rgrp glocks (like all
glocks) have unique "glops" that we might be able to leverage for that
purpose. I'll have to give that some thought, although the "multi-block
reservations" concept still seems like a better fit.

Perhaps we should do this in two phases:

(1) Implement my current design as a short-term solution. It has been
proven to increase performance by 20-50 percent for some use cases and
gets the block allocator performance back to where it was in RHEL6. (And
by the same token, it helps the deallocator for the same reasons.) This is
a stop-gap measure while we work on a longer-term solution that's likely
to involve some design work and therefore take more time to perfect.

(2) Design and implement a longer-term solution that allows multiple
processes to share an rgrp while its glock is held in EXclusive mode.

As always, I'm open to more ideas on how to improve this.

Regards,

Bob Peterson
* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases 2018-01-19 14:43 ` Bob Peterson @ 2018-01-24 17:04 ` Bob Peterson 2018-01-24 20:46 ` Steven Whitehouse 2018-01-25 11:47 ` Steven Whitehouse 0 siblings, 2 replies; 9+ messages in thread From: Bob Peterson @ 2018-01-24 17:04 UTC (permalink / raw) To: cluster-devel.redhat.com Hi, ----- Original Message ----- | | My comments in relation to the previous patch still stand. This is still | | the wrong answer to the problem. If we have lock contention here, then | | we should be looking to fix the locking and not trying to avoid the | | issue by choosing a different rgrp. | This is also not too different from how we gained performance | boosts with the Orlov algorithm to assign unique rgrps to groups | of files, which also spread the distance of the writes. | | This will not increase file fragmentation because that is all | mitigated by the multi-block allocation algorithms we have | in place. I've done some more analysis of the scale-out performance with and without these two patches, with help from Barry in the performance group. Attached to this email is an output file that compares performance results from a stock test kernel to the same test kernel plus my previously posted patches for intra-node congestion. 
You can see from this summary grep that the scalability is vastly better with the patch: [root at perf82 bin]# pr -m -T -w 160 stock829.log iozone16.log | grep 'Parent' Parent sees throughput for 1 initial writers = 543094.03 kB/sec Parent sees throughput for 1 initial writers = 539514.33 kB/sec Parent sees throughput for 2 initial writers = 621523.39 kB/sec Parent sees throughput for 2 initial writers = 625554.26 kB/sec Parent sees throughput for 4 initial writers = 244177.34 kB/sec Parent sees throughput for 4 initial writers = 803140.31 kB/sec Parent sees throughput for 6 initial writers = 196014.09 kB/sec Parent sees throughput for 6 initial writers = 765439.86 kB/sec Parent sees throughput for 8 initial writers = 157997.14 kB/sec Parent sees throughput for 8 initial writers = 760862.85 kB/sec Parent sees throughput for 10 initial writers = 155714.45 kB/sec Parent sees throughput for 10 initial writers = 829598.76 kB/sec Parent sees throughput for 12 initial writers = 133166.40 kB/sec Parent sees throughput for 12 initial writers = 744604.51 kB/sec Parent sees throughput for 14 initial writers = 146353.44 kB/sec Parent sees throughput for 14 initial writers = 776214.99 kB/sec Parent sees throughput for 16 initial writers = 84699.82 kB/sec Parent sees throughput for 16 initial writers = 796142.16 kB/sec What this means is that performance of 16 iozone threads was 97.5% parallel with my patches compared to 24.4% on the stock kernel. 
Looking at the file fragmentation data in the same file, you can see that file fragmentation is also noticeably improved in all runs with the patch in the 16-process case: Fragmentation Summary Fragmentation Summary /saswork/iozonetest1: 8622 extents found /saswork/iozonetest1: 8435 extents found /saswork/iozonetest10: 8614 extents found /saswork/iozonetest10: 8322 extents found /saswork/iozonetest11: 8617 extents found /saswork/iozonetest11: 8312 extents found /saswork/iozonetest12: 8658 extents found /saswork/iozonetest12: 8319 extents found /saswork/iozonetest13: 8469 extents found /saswork/iozonetest13: 8435 extents found /saswork/iozonetest14: 8650 extents found /saswork/iozonetest14: 8315 extents found /saswork/iozonetest15: 8649 extents found /saswork/iozonetest15: 8489 extents found /saswork/iozonetest16: 8653 extents found /saswork/iozonetest16: 8436 extents found /saswork/iozonetest2: 8464 extents found /saswork/iozonetest2: 8509 extents found /saswork/iozonetest3: 8635 extents found /saswork/iozonetest3: 8448 extents found /saswork/iozonetest4: 8659 extents found /saswork/iozonetest4: 9203 extents found /saswork/iozonetest5: 8647 extents found /saswork/iozonetest5: 8434 extents found /saswork/iozonetest6: 8669 extents found /saswork/iozonetest6: 8435 extents found /saswork/iozonetest7: 8656 extents found /saswork/iozonetest7: 8309 extents found /saswork/iozonetest8: 8660 extents found /saswork/iozonetest8: 8410 extents found /saswork/iozonetest9: 8607 extents found /saswork/iozonetest9: 8435 extents found See the attachment for more details. And as I said in my previous email, overall file _system_ fragmentation is, IMHO, not a big concern in the light of these performance gains. I'll remove the reference to HZ from the first patch: good call. I still see no reason to throw away the second patch, at least while we work on a longer-term solution. 
Regards,

Bob Peterson
Red Hat File Systems
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bob.iozone16.pr.txt
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20180124/fc2b6568/attachment.txt>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-24 17:04         ` Bob Peterson
@ 2018-01-24 20:46           ` Steven Whitehouse
  2018-01-25 11:47           ` Steven Whitehouse
  1 sibling, 0 replies; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-24 20:46 UTC (permalink / raw)
To: cluster-devel.redhat.com

On 24/01/18 17:04, Bob Peterson wrote:
> Hi,
>
> ----- Original Message -----
> | | My comments in relation to the previous patch still stand. This is still
> | | the wrong answer to the problem. If we have lock contention here, then
> | | we should be looking to fix the locking and not trying to avoid the
> | | issue by choosing a different rgrp.
> | This is also not too different from how we gained performance
> | boosts with the Orlov algorithm to assign unique rgrps to groups
> | of files, which also spread the distance of the writes.
> |
> | This will not increase file fragmentation because that is all
> | mitigated by the multi-block allocation algorithms we have
> | in place.
>
> I've done some more analysis of the scale-out performance with and
> without these two patches, with help from Barry in the performance group.
>
> Attached to this email is an output file that compares performance results
> from a stock test kernel to the same test kernel plus my previously posted
> patches for intra-node congestion.
> You can see from this summary grep
> that the scalability is vastly better with the patch:
>
> [root@perf82 bin]# pr -m -T -w 160 stock829.log iozone16.log | grep 'Parent'
> Parent sees throughput for  1 initial writers =  543094.03 kB/sec     Parent sees throughput for  1 initial writers =  539514.33 kB/sec
> Parent sees throughput for  2 initial writers =  621523.39 kB/sec     Parent sees throughput for  2 initial writers =  625554.26 kB/sec
> Parent sees throughput for  4 initial writers =  244177.34 kB/sec     Parent sees throughput for  4 initial writers =  803140.31 kB/sec
> Parent sees throughput for  6 initial writers =  196014.09 kB/sec     Parent sees throughput for  6 initial writers =  765439.86 kB/sec
> Parent sees throughput for  8 initial writers =  157997.14 kB/sec     Parent sees throughput for  8 initial writers =  760862.85 kB/sec
> Parent sees throughput for 10 initial writers =  155714.45 kB/sec     Parent sees throughput for 10 initial writers =  829598.76 kB/sec
> Parent sees throughput for 12 initial writers =  133166.40 kB/sec     Parent sees throughput for 12 initial writers =  744604.51 kB/sec
> Parent sees throughput for 14 initial writers =  146353.44 kB/sec     Parent sees throughput for 14 initial writers =  776214.99 kB/sec
> Parent sees throughput for 16 initial writers =   84699.82 kB/sec     Parent sees throughput for 16 initial writers =  796142.16 kB/sec
>
> What this means is that performance of 16 iozone threads was 97.5% parallel
> with my patches compared to 24.4% on the stock kernel.
>
> Looking at the file fragmentation data in the same file, you can see that
> file fragmentation is also noticeably improved in all runs with the patch
> in the 16-process case:
>
> Fragmentation Summary                           Fragmentation Summary
> /saswork/iozonetest1: 8622 extents found        /saswork/iozonetest1: 8435 extents found
> /saswork/iozonetest10: 8614 extents found       /saswork/iozonetest10: 8322 extents found
> /saswork/iozonetest11: 8617 extents found       /saswork/iozonetest11: 8312 extents found
> /saswork/iozonetest12: 8658 extents found       /saswork/iozonetest12: 8319 extents found
> /saswork/iozonetest13: 8469 extents found       /saswork/iozonetest13: 8435 extents found
> /saswork/iozonetest14: 8650 extents found       /saswork/iozonetest14: 8315 extents found
> /saswork/iozonetest15: 8649 extents found       /saswork/iozonetest15: 8489 extents found
> /saswork/iozonetest16: 8653 extents found       /saswork/iozonetest16: 8436 extents found
> /saswork/iozonetest2: 8464 extents found        /saswork/iozonetest2: 8509 extents found
> /saswork/iozonetest3: 8635 extents found       /saswork/iozonetest3: 8448 extents found
> /saswork/iozonetest4: 8659 extents found        /saswork/iozonetest4: 9203 extents found
> /saswork/iozonetest5: 8647 extents found        /saswork/iozonetest5: 8434 extents found
> /saswork/iozonetest6: 8669 extents found        /saswork/iozonetest6: 8435 extents found
> /saswork/iozonetest7: 8656 extents found        /saswork/iozonetest7: 8309 extents found
> /saswork/iozonetest8: 8660 extents found        /saswork/iozonetest8: 8410 extents found
> /saswork/iozonetest9: 8607 extents found        /saswork/iozonetest9: 8435 extents found
>
> See the attachment for more details.
>
> And as I said in my previous email, overall file _system_ fragmentation
> is, IMHO, not a big concern in the light of these performance gains.
>
> I'll remove the reference to HZ from the first patch: good call.
>
> I still see no reason to throw away the second patch, at least while we work
> on a longer-term solution.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

I've no doubt that this is an issue that we need to address. The question is how to do it, rather than if. However, iozone is not a great workload for testing this kind of thing, and if that test was run on a brand new filesystem, then that will also have a significant effect on the fragmentation.

Another issue is whether (number of nodes x number of I/O threads per node) <= (number of rgrps), since in the case of a new filesystem, that will make a significant difference in terms of whether contention is seen or not. The proposed new algorithm seems to assume that this will always be the case. If it is not the case, then we will see significant contention on the rgrps. If we resolve this by doing more of the allocation/deallocation in parallel on a single rgrp, then we only require (number of rgrps) > (number of nodes), which is a less onerous requirement.

I'm not proposing that we should throw out the patch either, but we should try to do things in a different order. Since the problem is really a single-node issue, in that we want to be able to better share rgrps between threads on a single node, it should be addressed as such. Adding a new flag to allow us to share an exclusive cluster lock among local processes should not be tricky to do. I've attached an (untested) patch that should allow that. The new semantics are that exclusive locks which have the GL_SHARED_EX flag on them will be shared locally. They will not be shared with exclusive locks that do not have that flag, so that gives us a way to introduce the change gradually, as each glock can be audited for safety individually and the flag added when we know it is safe. It will likely be easier to start with the deallocation paths, since the critical paths are shorter and easier to identify.
We can probably just add a suitable lock to cover the changes to the rgrp as a wrapper, and we might also be able to shrink the coverage of that lock over time, too. It needs to cover the bitmap changes and the summary info changes so that the two are still atomic with respect to other local threads working on the same rgrp. If we are careful we might even be able to make the lock a spinlock eventually, but it will probably need to be a per-rgrp mutex or rwsem to start with.

There will need to be some careful audit in order to ensure that the new per-rgrp lock covers the correct parts of the code, and that we minimise those parts, in order to allow maximum parallelism. However, if we can achieve that, then we should get much greater gains in performance overall, not least because this applies equally to allocation and deallocation, and your proposed new algorithm only addresses the allocation side of things.

Once we have resolved the intra-node issues, then we can go back and take a second look at whether we can get even more gain from inter-node changes. The points I made in the previous emails still stand, though, in that we should not be trying to solve an intra-node issue with an inter-node solution. The iozone test is not a very good test case for this: the worst case would be where we had large numbers of threads on each node, all writing out very small files. In that case we definitely want them grouped together for future reads and not spread over the disk in many different rgrps. Imagine someone running a find over such a filesystem, for example. We don't want that to degenerate into random reads, but to retain at least some locality.

There will never be an algorithm that will be able to cope with every workload, but we should be able to get the same result (or better!) in terms of performance without having to change the allocation algorithm so drastically,

Steve.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: my.diff
Type: text/x-patch
Size: 1329 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20180124/39bee092/attachment.bin>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-24 17:04         ` Bob Peterson
  2018-01-24 20:46           ` Steven Whitehouse
@ 2018-01-25 11:47           ` Steven Whitehouse
  1 sibling, 0 replies; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-25 11:47 UTC (permalink / raw)
To: cluster-devel.redhat.com

Hi,

Some further thoughts...

Whenever we find a problem related to a lock, it is a good plan to understand where the problem actually lies: in other words, whether the locking itself is slow, or whether it is some action being performed under the lock that is the issue.

We have the ability to easily create histograms of DLM lock times, and almost as easily create histograms of the glock times (gfs2_glock_queue -> gfs2_promote). We can easily filter on glock type (rgrp) and the lock transitions that we care about (any -> EX) too. So it would be interesting to look at this in order to get more of an insight into what is really going on. Taking the raw histogram and multiplying the count by the centre of each bucket gives us the total time taken at each different lock latency. Then it is easy to see which latencies are the ones causing the most delay.

It would also be interesting to know how long it takes to allocate and deallocate a block. What are the operations that take the most time? Unfortunately our block allocation tracepoint doesn't give us that info, but it is probably not that tricky to alter it so that it does. That would give us a much more detailed picture of what is going on, I think,

Steve.

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2018-01-25 11:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-17 14:23 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Rework rgrp glock congestion functions for intra-node Bob Peterson
2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently Bob Peterson
2018-01-18  9:04   ` Steven Whitehouse
2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases Bob Peterson
2018-01-18  9:08   ` Steven Whitehouse
2018-01-19 14:43     ` Bob Peterson
2018-01-24 17:04       ` Bob Peterson
2018-01-24 20:46         ` Steven Whitehouse
2018-01-25 11:47         ` Steven Whitehouse