From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Syms
Date: Thu, 20 Sep 2018 17:47:23 +0000
Subject: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
In-Reply-To: <77971123.14918571.1537463869960.JavaMail.zimbra@redhat.com>
References: <1537455133-48589-1-git-send-email-mark.syms@citrix.com> <77971123.14918571.1537463869960.JavaMail.zimbra@redhat.com>
Message-ID:
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Thanks for that, Bob. We've been watching the changes going in upstream with interest, but at the moment we're not really in a position to take advantage of them. Due to hardware vendor support certification requirements, XenServer can only very occasionally make big kernel bumps that would affect the ABI the drivers see, as that would require our hardware partners to recertify. So we're currently on a 4.4.52 base, but the gfs2 driver is somewhat newer: it is essentially self-contained, so we can backport changes more easily. We currently have most of the GFS2 and DLM changes that are in 4.15 backported into the XenServer 7.6 kernel, but we can't take the ones related to iomap as they are more invasive, and it looks like a number of the more recent performance-related changes are also predicated on the iomap framework.

As I mentioned in the covering letter, the intra-host problem would largely be a non-issue if EX glocks were actually a host-wide thing, with local mutexes used to share them within the host. I don't know if this is what your patch set is trying to achieve or not.

It's not so much that the selection of resource group is "random", just that there is a random chance that we won't select the first RG that we test. It probably does work out much the same, though.

The inter-host problem addressed by the second patch seems less amenable to avoidance, as the hosts don't seem to have a synchronous view of the state of the resource group locks (for understandable reasons, as I'd expect this to be very expensive to keep sync'd). So it seemed reasonable to try to make it "expensive" to request a resource that someone else is using, and also to avoid immediately grabbing it back if we've been asked to relinquish it. It does seem to give a fairer balance to the usage without being massively invasive.

We thought we should share these with the community anyway, even if they only serve as inspiration for more detailed changes, and also to describe the scenarios where we're seeing issues now that we have completed implementing the XenServer support for GFS2 that we discussed back in Nuremberg last year. In our testing they certainly make things better. They probably aren't fully optimal, as we can't maintain 10Gb wire speed consistently across the full LUN, but we're getting about 75%, which is certainly better than we were seeing before we started looking at this.

Thanks,

Mark.
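As a concrete picture of the two heuristics described above, here is a minimal user-space sketch: with some probability we pass over the first resource group that would do, and a group we were recently asked to relinquish stays off limits for a short hold-off. This is illustrative only, not the posted patches; the names (struct rgrp_model, pick_rgrp, HOLDOFF_SECS, SKIP_CHANCE) and the specific numbers are made up for the sketch.

/*
 * Illustrative user-space model only, not the actual GFS2 patches.
 * Heuristic 1: sometimes skip the first resource group we test, so
 *              writers on one host spread across different groups.
 * Heuristic 2: after being asked to relinquish a group, hold off
 *              before taking it back, so other hosts get a fair turn.
 */
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

#define NR_RGRPS      8
#define HOLDOFF_SECS  2      /* back-off after a remote demote request */
#define SKIP_CHANCE   50     /* % chance to pass over the first candidate */

struct rgrp_model {
	bool   contended;        /* another host currently wants this group */
	time_t relinquished_at;  /* when we last gave it up, 0 if never */
};

static struct rgrp_model rgrps[NR_RGRPS];

/* Heuristic 2: a group we recently gave up stays off limits for a while. */
static bool in_holdoff(const struct rgrp_model *rg, time_t now)
{
	return rg->relinquished_at != 0 &&
	       now - rg->relinquished_at < HOLDOFF_SECS;
}

/* Pick a resource group for the next allocation, or -1 if none is usable. */
static int pick_rgrp(int preferred)
{
	time_t now = time(NULL);
	bool skipped_one = false;

	for (int i = 0; i < NR_RGRPS; i++) {
		int idx = (preferred + i) % NR_RGRPS;
		struct rgrp_model *rg = &rgrps[idx];

		if (rg->contended || in_holdoff(rg, now))
			continue;

		/* Heuristic 1: sometimes skip the first group that would do. */
		if (!skipped_one && (rand() % 100) < SKIP_CHANCE) {
			skipped_one = true;
			continue;
		}
		return idx;
	}
	return -1;
}

int main(void)
{
	srand((unsigned)time(NULL));
	rgrps[3].contended = true;          /* pretend another host holds #3 */
	return pick_rgrp(3) >= 0 ? 0 : 1;   /* start searching at group 3 */
}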
-----Original Message-----
From: Bob Peterson
Sent: 20 September 2018 18:18
To: Mark Syms
Cc: cluster-devel@redhat.com; Ross Lagerwall; Tim Smith
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> While testing GFS2 as a storage repository for virtual machines we
> discovered a number of scenarios where the performance was being
> pathologically poor.
>
> The scenarios are simplified to the following -
>
>  * On a single host in the cluster grow a number of files to a
>    significant proportion of the filesystem's LUN size, exceeding the
>    host's preferred resource group allocation. This can be replicated
>    by using fio and writing to 20 different files with a script like

Hi Mark, Tim and all,

The performance problems with rgrp contention are well known, and have been for a very long time. In rhel6 it's not as big a problem because rhel6 gfs2 uses "try locks", which distribute different processes to unique rgrps and thus keep them from contending. However, that results in file system fragmentation which tends to catch up with you later.

I posted a different patch set that solved the problem another way, by keeping track of both inter-node and intra-node contention and redistributing rgrps accordingly. It was similar to your first patch, but used a more predictable distribution, whereas yours is random. It worked very well, but it was ultimately rejected by Steve Whitehouse in favor of a better approach: our current plan is to allow rgrps to be shared among many processes on a single node. This alleviates the contention, improves throughput and performance, and fixes the "favoritism" problems gfs2 has today. In other words, it's better than just redistributing the rgrps. I did a proof-of-concept set of patches and saw pretty good performance numbers and "fairness" among simultaneous writers; I posted that a few months ago.

Your patch would certainly work, and random distribution of rgrps would definitely gain performance, just as the Orlov algorithm does. However, I still want to pursue what Steve suggested. My patch set for this still needs some work because I found some bugs with how things are done, so it'll take time to get working properly.

Regards,

Bob Peterson
Red Hat File Systems
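For comparison, the rhel6-style "try lock" selection mentioned above can be pictured with a similarly minimal user-space model, assuming a flat array of per-rgrp locks: each writer claims the first resource group whose trylock succeeds, so simultaneous writers land on different groups instead of queueing on one. The names here (rgrp_lock, claim_rgrp, writer) are invented for the sketch, not taken from gfs2.

/*
 * Illustrative user-space model only, not rhel6 gfs2 code.
 * Build with: cc -pthread trylock_model.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_RGRPS   8
#define NR_WRITERS 4

static pthread_mutex_t rgrp_lock[NR_RGRPS];

/* Take the first resource group whose trylock succeeds, or -1 if all busy. */
static int claim_rgrp(void)
{
	for (int i = 0; i < NR_RGRPS; i++) {
		if (pthread_mutex_trylock(&rgrp_lock[i]) == 0)
			return i;   /* caller unlocks when the allocation is done */
	}
	return -1;
}

static void *writer(void *arg)
{
	int id = *(int *)arg;
	int rg = claim_rgrp();

	printf("writer %d allocates from resource group %d\n", id, rg);
	if (rg >= 0) {
		usleep(1000);   /* pretend to allocate some blocks */
		pthread_mutex_unlock(&rgrp_lock[rg]);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_WRITERS];
	int ids[NR_WRITERS];

	for (int i = 0; i < NR_RGRPS; i++)
		pthread_mutex_init(&rgrp_lock[i], NULL);

	for (int i = 0; i < NR_WRITERS; i++) {
		ids[i] = i;
		pthread_create(&threads[i], NULL, writer, &ids[i]);
	}
	for (int i = 0; i < NR_WRITERS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}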