From mboxrd@z Thu Jan 1 00:00:00 1970 From: Steven Whitehouse Date: Tue, 9 Oct 2018 09:41:18 +0100 Subject: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock In-Reply-To: <8c41a29508574143827309528c392350@AMSPEX02CL02.citrite.net> References: <1539002191-40831-1-git-send-email-mark.syms@citrix.com> <2889867.YGlmsYasr8@dhcp-3-135.uk.xensource.com> <2d468838-ea79-9c89-ae02-89b18f4bda37@redhat.com> <21999012.kCAuXqbhYe@dhcp-3-135.uk.xensource.com> <8c41a29508574143827309528c392350@AMSPEX02CL02.citrite.net> Message-ID: List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit On 09/10/18 09:13, Mark Syms wrote: > Having swapped the line below around we still see the timeout on schedule fire, but only once in > a fairly mega stress test. This is why we weren't worried about the timeout being HZ, the situation > is hardly ever hit as having to wait is rare and normally we are woken from schedule and without > a timeout on schedule we never wake up so a rare occurrence of waiting a second really doesn't > seem too bad. > > Mark. We should still get to the bottom of why the wake up is missing though, since without that fix we won't know if there is something else wrong somewhere, Steve. > > -----Original Message----- > From: Tim Smith > Sent: 08 October 2018 14:27 > To: Steven Whitehouse > Cc: Mark Syms ; cluster-devel at redhat.com; Ross Lagerwall > Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock > > On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote: >> Hi, >> >> On 08/10/18 14:10, Tim Smith wrote: >>> On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote: >>>> On 08/10/18 13:59, Mark Syms wrote: >>>>> That sounds entirely reasonable so long as you are absolutely sure >>>>> that nothing is ever going to mess with that glock, we erred on >>>>> the side of more caution not knowing whether it would be guaranteed safe or not. >>>>> >>>>> Thanks, >>>>> >>>>> Mark >>>> We should have a look at the history to see how that wait got added. >>>> However the "dead" flag here means "don't touch this glock" and is >>>> there so that we can separate the marking dead from the actual >>>> removal from the list (which simplifies the locking during the >>>> scanning procedures) >>> You beat me to it :-) >>> >>> I think there might be a bit of a problem inserting a new entry with >>> the same name before the old entry has been fully destroyed (or at >>> least removed), which would be why the schedule() is there. >> If the old entry is marked dead, all future lookups should ignore it. >> We should only have a single non-dead entry at a time, but that >> doesn't seem like it should need us to wait for it. > On the second call we do have the new glock to insert as arg2, so we could try to swap them cleanly, yeah. > >> If we do discover that the wait is really required, then it sounds >> like as you mentioned above there is a lost wakeup, and that must >> presumably be on a code path that sets the dead flag and then fails to >> send a wake up later on. If we can drop the wait in the first place, >> that seems like a better plan, > Ooooh, I wonder if these two lines: > > wake_up_glock(gl); > call_rcu(&gl->gl_rcu, gfs2_glock_dealloc); > > in gfs2_glock_free() are the wrong way round? > > -- > Tim Smith > >