From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Peterson
Date: Mon, 4 Jan 2021 11:09:24 -0500 (EST)
Subject: [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
In-Reply-To: <51252ca2-fa56-acb8-24cf-fb2e992f76de@redhat.com>
References: <2125295377.38904313.1608669538740.JavaMail.zimbra@redhat.com>
 <51252ca2-fa56-acb8-24cf-fb2e992f76de@redhat.com>
Message-ID: <561946972.42407585.1609776564024.JavaMail.zimbra@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,

----- Original Message -----
> Hi,
>
> On 22/12/2020 20:38, Bob Peterson wrote:
> > Hi,
> >
> > Before this patch, journal recovery was done by a workqueue function
> > that operated on a per-journal basis. The problem is, these could run
> > simultaneously, which meant that they could all use the same bio,
> > sd_log_bio, to do their writing to all the various journals. These
> > operations overwrote one another, eventually causing memory corruption.
>
> Why not just add more bios so that this issue goes away? It would make
> more sense than preventing recovery from running in parallel. In general,
> recovery should be spread among nodes anyway, so the case of having
> multiple recoveries running on the same node in parallel should be
> fairly rare too.
>
> Steve.

As I understand it, if we allocate a bio from the same bio_set (as
bio_alloc does), we need to submit the previous bio before getting the
next one, which means recovery processes cannot work in parallel even if
they use different bio pointers.

We could, of course, allocate several bio_sets, one for each journal, but
I remember Jeff Moyer telling me a bio_set would use 1MB of memory each,
which seems high. (I've not verified that.) I'm testing up to 60 mounts
times 5 cluster nodes (5 journals), which would add up to 300MB of memory.
That's not horrible, but I remember we decided not to allocate separate
per-mount rb_trees for glock indexing because of the memory needed, and
that memory seems much smaller by comparison.

We could also introduce new locking (and multiple bio pointers) to prevent
the bio from being used by multiple recoveries at the same time. I
actually tried that in an earlier attempt and immediately ran into
deadlock issues, probably because our journal writes also use the same
bio.

This way is pretty simple, and there are fewer recovery processes to
worry about when analyzing vmcores.

Bob