From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Teigland
Date: Mon, 17 Dec 2018 10:46:58 -0600
Subject: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing to the journal
In-Reply-To: <1890286629.55916662.1545058727858.JavaMail.zimbra@redhat.com>
References: <1033351102.55836224.1545054857301.JavaMail.zimbra@redhat.com>
 <90e95a6b-5893-d26e-95d4-e73680e0326b@citrix.com>
 <1bdf5580-76ca-c2b7-5d2f-8d780b15a06e@redhat.com>
 <1890286629.55916662.1545058727858.JavaMail.zimbra@redhat.com>
Message-ID: <20181217164658.GA13933@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Mon, Dec 17, 2018 at 09:58:47AM -0500, Bob Peterson wrote:
> Dave Teigland recommended. Unless I'm mistaken, Dave has said that GFS2
> should never withdraw; it should always just kernel panic (Dave, correct
> me if I'm wrong). At least this patch confines that behavior to a small
> subset of withdraws.

The basic idea is that you want to get a malfunctioning node out of the
way as quickly as possible so others can recover and carry on. Escalating
a partial failure into a total node failure is the best way to do that in
this case. Specialized recovery paths run from a partially failed node
won't be as reliable, and are prone to blocking all the nodes.

I think a reasonable alternative to this is to just sit in an infinite
retry loop until the i/o succeeds.

Dave