From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Teigland
Date: Tue, 30 Nov 2010 12:30:51 -0500
Subject: [Cluster-devel] Patch: making DLM more robust
In-Reply-To: <4CF52D0E.2020800@bull.net>
References: <4CEA9ADD.2050109@bull.net> <20101122173442.GA21879@redhat.com> <4CEBD6A2.8090005@bull.net> <20101123171508.GC30147@redhat.com> <4CF52D0E.2020800@bull.net>
Message-ID: <20101130173051.GB27123@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Tue, Nov 30, 2010 at 05:57:50PM +0100, Menyhart Zoltan wrote:
> Hi,
>
> An easy first step to make DLM more robust can be adding a time out protection
> to the lock space creation operation, while waiting for a "dlm_controld" action.
> A new member "ci_dlm_controld_secs" is added to "dlm_config" to set up time out
> in seconds, DEFAULT_DLM_CTRL_SECS is 5 seconds.
>
> At the same time, signals can be enabled and handled, too.
>
> DLM_USER_CREATE_LOCKSPACE will be able to return new error codes:
> -EINTR or -ETIMEDOUT.
>
> Could you please tell me why the signals are blocked within "device_write()"?
> I think it is safe to allow signals, surely in your original code sequences
> waiting in an uninterruptible way.

Thanks, I'll take a look; as long as it's disabled by default I don't
expect I'd object much. There are two main problems with this idea,
though, that need to be handled before it's generally usable:

1. The kernel can wait on user space indefinitely in completely normal
   situations; for example, loss of quorum or a fencing failure can delay
   completion indefinitely. This means a timeout can easily introduce
   false failures. EINTR is a better idea, since it's driven by user
   intervention, e.g. killing a mount process.

2. The difficulty, even with EINTR, is correctly and cleanly unwinding the
   dlm_controld state.

Dave
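
The false-failure concern in point 1 can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not kernel or dlm_controld code: the names wait_for_controld and simulate are hypothetical, a threading.Event stands in for the completion that dlm_controld would eventually signal, and a timeout of None models the "disabled by default" case.

```python
import errno
import threading
from typing import Optional


def wait_for_controld(done: threading.Event, timeout_secs: Optional[float]) -> int:
    """Hypothetical stand-in for the lockspace-creation wait.

    Returns 0 if the completion arrives, -ETIMEDOUT if the timeout
    expires first. timeout_secs=None means no timeout (wait indefinitely).
    """
    if done.wait(timeout=timeout_secs):
        return 0
    return -errno.ETIMEDOUT


def simulate(delay_secs: float, timeout_secs: Optional[float]) -> int:
    """Model a user-space daemon that completes after delay_secs.

    A long delay is not an error here; it models a normal situation
    such as waiting for quorum to return.
    """
    done = threading.Event()
    timer = threading.Timer(delay_secs, done.set)
    timer.start()
    try:
        return wait_for_controld(done, timeout_secs)
    finally:
        # Clean up the helper thread whether or not the wait timed out.
        timer.cancel()
        timer.join()


if __name__ == "__main__":
    # Completion is merely slow, yet the short timeout reports a failure.
    print(simulate(delay_secs=0.3, timeout_secs=0.05))
    # With the timeout disabled, the same slow completion succeeds.
    print(simulate(delay_secs=0.3, timeout_secs=None))
```

In this model the first call reports -ETIMEDOUT although nothing actually went wrong, which is the kind of false failure described above. The model says nothing about point 2: after a timeout or EINTR, whatever state the daemon holds still has to be unwound.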