cluster-devel.redhat.com archive mirror
From: Mark Syms <Mark.Syms@citrix.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements
Date: Fri, 28 Sep 2018 12:50:06 +0000	[thread overview]
Message-ID: <c3fa4d50629c49f5b7d312108effe962@AMSPEX02CL02.citrite.net> (raw)
In-Reply-To: <06359a927ea74713adaa1d0a5aec74f6@AMSPEX02CL02.citrite.net>

Hi Bob,

The patches look quite good and would seem to help in the intra-node congestion case, which our first patch was trying to address. We haven't tried them yet, but I'll pull a build together and try to run it over the weekend.

We don't, however, see that they would help in the situation we saw for the second patch, where rgrp glocks would get bounced around between hosts at high speed and cause lots of state flushing in the process: the stats take no account of anything other than network latency, whereas there is more involved with an rgrp glock when state needs to be flushed.
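
To make the concern concrete, the kind of cost model we think is missing would look something like the sketch below. This is purely illustrative plain C, not GFS2 code: every name in it is made up, and the real per-glock statistics are structured quite differently.

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative model only (none of this is gfs2 code): the inputs we
 * think rgrp selection ought to weigh.  Today's heuristic effectively
 * sees only mean_rtt_ns.
 */
struct rgrp_cost_inputs {
	uint64_t mean_rtt_ns;      /* smoothed DLM round-trip time */
	uint64_t flush_bytes;      /* dirty state the current holder
	                            * must flush before the glock can
	                            * move to another node */
	uint64_t flush_ns_per_kib; /* observed cost of flushing, per KiB */
	bool held_elsewhere;       /* another node holds the glock EX */
};

/*
 * Estimated cost of taking this rgrp's glock EX: the network round
 * trip plus, when another node holds it, the log flush and page
 * writeback its demote forces on that node.
 */
static uint64_t rgrp_glock_cost(const struct rgrp_cost_inputs *in)
{
	uint64_t cost = in->mean_rtt_ns;

	if (in->held_elsewhere)
		cost += (in->flush_bytes / 1024) * in->flush_ns_per_kib;
	return cost;
}

With something like that, an rgrp whose glock is hot on another node would look much more expensive than one that is merely far away, which is the distinction we think the current stats can't make.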

Any thoughts on this?

Thanks,

	Mark.

-----Original Message-----
From: Mark Syms 
Sent: 28 September 2018 13:37
To: 'Bob Peterson' <rpeterso@redhat.com>
Cc: cluster-devel@redhat.com; Tim Smith <tim.smith@citrix.com>; Ross Lagerwall <ross.lagerwall@citrix.com>
Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

Hi Bob,

No, we haven't, but it wouldn't be hard for us to replace our patches in our internal patch queue with these and try them. We'll let you know what we find.

We have also seen what we think is an unrelated issue, where we get the following backtrace in kern.log and our system stalls:

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749]       Tainted: G           O    4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python          D ffff88019628bc90     0 15480      1 0x00000000
Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  ffff88019628bc90 ffff880198f11c00 ffff88005a509c00 ffff88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  ffffc90040226000 ffff88019628bd80 fffffffffffffe58 ffff8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  ffff88019628bca8 ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [<ffffffff815a1cd4>] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [<ffffffffa0663624>] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [<ffffffffa0660c20>] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [<ffffffffa06639ed>] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [<ffffffffa066cff2>] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [<ffffffffa0671ad0>] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [<ffffffff811ba2fb>] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [<ffffffffa066d188>] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [<ffffffff812021e9>] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [<ffffffff815a57ee>] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path is only entered when a glock is being destroyed, that there is a time-of-check/time-of-use issue here: by the time schedule() gets called, the thing we expect to wake us up has finished dying and therefore won't trigger a wakeup for us. We have only seen this a couple of times, in fairly intensive VM stress tests where a lot of flocks get used on a small number of lock files (we use them to ensure consistent behaviour of disk activation/deactivation, and also to serialise access to the database holding the system state), but it's concerning nonetheless. We're looking at replacing the call to schedule() with schedule_timeout() with a timeout of maybe HZ, to ensure that we always get out of the schedule operation and retry; a sketch of the shape of that change is below. Is this something you think you may have seen, or do you have any ideas on it?
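
For illustration, the shape of the change we have in mind is roughly the following (sketched from our reading of find_insert_glock() in fs/gfs2/glock.c upstream; our 4.4-based backport differs slightly, so please treat this as the idea rather than a tested patch):

again:
	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
	...
	if (gl && !lockref_get_not_dead(&gl->gl_lockref)) {
		rcu_read_unlock();
-		schedule();
+		/*
+		 * Bound the sleep: if the dying glock completed its
+		 * teardown before we queued ourselves, the wakeup we
+		 * are waiting for will never arrive, so fall out
+		 * after at most HZ jiffies and retry the lookup
+		 * rather than blocking forever.
+		 */
+		schedule_timeout(HZ);
		goto again;
	}

Since the task is in TASK_UNINTERRUPTIBLE at this point, schedule_timeout(HZ) behaves just like schedule() whenever the wakeup does arrive, and only caps the wait when it doesn't.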

Thanks,

	Mark.

-----Original Message-----
From: Bob Peterson <rpeterso@redhat.com>
Sent: 28 September 2018 13:24
To: Mark Syms <Mark.Syms@citrix.com>
Cc: cluster-devel@redhat.com; Ross Lagerwall <ross.lagerwall@citrix.com>; Tim Smith <tim.smith@citrix.com>
Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

----- Original Message -----
> Thanks for that Bob, we've been watching with interest the changes 
> going in upstream but at the moment we're not really in a position to 
> take advantage of them.
> 
> Due to hardware vendor support certification requirements, XenServer
> can only very occasionally make big kernel bumps that would affect the
> ABI the drivers see, as that would require our hardware partners to recertify.
> So we're currently on a 4.4.52 base, but the gfs2 driver is somewhat
> newer, as it is essentially self-contained and we can therefore
> backport changes more easily. We currently have most of the GFS2 and
> DLM changes that are in 4.15 backported into the XenServer 7.6 kernel,
> but we can't take the ones related to iomap, as they are more invasive,
> and it looks like a number of the more recent performance-targeting
> changes are also predicated on the iomap framework.
> 
> As I mentioned in the covering letter, the intra-host problem would
> largely be a non-issue if EX glocks were actually a host-wide thing,
> with local mutexes used to share them within the host. I don't know
> whether this is what your patch set is trying to achieve. It's not so
> much that the selection of resource group is "random", just that there
> is a random chance that we won't select the first RG that we test; it
> probably works out much the same though.
> 
> The inter-host problem addressed by the second patch seems less
> amenable to avoidance, as the hosts don't seem to have a synchronous
> view of the state of the resource group locks (for understandable
> reasons, as I'd expect this to be very expensive to keep synced). So it
> seemed reasonable to try to make it "expensive" to request a resource
> group that someone else is using, and also to avoid immediately
> grabbing it back if we've been asked to relinquish it. It does seem to
> give a fairer balance to the usage without being massively invasive.
> 
> We thought we should share these with the community anyway, even if
> they only serve as inspiration for more detailed changes, and also to
> describe the scenarios where we're seeing issues now that we have
> completed implementing the XenServer support for GFS2 that we
> discussed back in Nuremberg last year. In our testing they certainly
> make things better. They probably aren't fully optimal, as we can't
> maintain 10G wire speed consistently across the full LUN, but we're
> getting about 75%, which is certainly better than what we were seeing
> before we started looking at this.
> 
> Thanks,
> 
> 	Mark.

Hi Mark,

I'm really curious whether you guys have tried the two patches I posted here on
17 January 2018 in place of the two patches you posted. We see much better throughput with those than with stock.

I know Steve wants a different solution, and in the long run it will be a better one, but I've been trying to convince him we should use them as a stop-gap measure to mitigate this problem until we get a more proper solution in place (which is obviously taking some time, due to unforeseen circumstances).

Regards,

Bob Peterson




Thread overview: 18+ messages
2018-09-20 14:52 [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Mark Syms
2018-09-20 14:52 ` [Cluster-devel] [PATCH 1/2] Add some randomisation to the GFS2 resource group allocator Mark Syms
2018-09-20 14:52 ` [Cluster-devel] [PATCH 2/2] GFS2: Avoid recently demoted rgrps Mark Syms
2018-09-20 17:17 ` [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements Bob Peterson
2018-09-20 17:47   ` Mark Syms
2018-09-20 18:16     ` Steven Whitehouse
2018-09-28 12:23     ` Bob Peterson
2018-09-28 12:36       ` Mark Syms
2018-09-28 12:50         ` Mark Syms [this message]
2018-09-28 13:18           ` Steven Whitehouse
2018-09-28 13:43             ` Tim Smith
2018-09-28 13:59               ` Bob Peterson
2018-09-28 14:11                 ` Mark Syms
2018-09-28 15:09                 ` Tim Smith
2018-09-28 15:09               ` Steven Whitehouse
2018-09-28 12:55         ` Bob Peterson
2018-09-28 13:56           ` Mark Syms
2018-10-02 13:50             ` Mark Syms
