From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:47130) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aY37A-0001vg-12 for qemu-devel@nongnu.org; Mon, 22 Feb 2016 21:54:17 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aY375-00061W-Bb for qemu-devel@nongnu.org; Mon, 22 Feb 2016 21:54:15 -0500 Received: from [59.151.112.132] (port=30271 helo=heian.cn.fujitsu.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aY372-00060o-Fj for qemu-devel@nongnu.org; Mon, 22 Feb 2016 21:54:11 -0500 Message-ID: <56CBCA25.7090009@cn.fujitsu.com> Date: Tue, 23 Feb 2016 10:55:33 +0800 From: Changlong Xie MIME-Version: 1.0 References: <1455588944-29799-1-git-send-email-xiecl.fnst@cn.fujitsu.com> <1455588944-29799-2-git-send-email-xiecl.fnst@cn.fujitsu.com> <56C6D1CD.7010800@cn.fujitsu.com> <56C877F3.3070401@redhat.com> <56CA7DD8.9000403@cn.fujitsu.com> <20160222090254.GA2940@work-vm> <20160222103455.GC5387@noname.str.redhat.com> In-Reply-To: <20160222103455.GC5387@noname.str.redhat.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf , "Dr. David Alan Gilbert" Cc: Alberto Garcia , qemu devel , Max Reitz On 02/22/2016 06:34 PM, Kevin Wolf wrote: > Am 22.02.2016 um 10:02 hat Dr. David Alan Gilbert geschrieben: >> * Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote: >>> On 02/20/2016 10:28 PM, Max Reitz wrote: >>>> On 19.02.2016 12:24, Alberto Garcia wrote: >>>>> On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang wrote: >>>>> >>>>>>>> If quorum has two children(A, B). A do flush sucessfully, but B >>>>>>>> flush failed. We MUST choice A as winner rather than just pick >>>>>>>> anyone of them. Otherwise the filesystem of guest will become >>>>>>>> read-only with following errors: >>>>>>>> >>>>>>>> end_request: I/O error, dev vda, sector 11159960 >>>>>>>> Aborting journal on device vda3-8 >>>>>>>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal >>>>>>>> EXT4-fs (vda3): Remounting filesystem read-only >>>>>>> >>>>>>> Hi Xie, >>>>>>> >>>>>>> Let's see if I'm getting this right: >>>>>>> >>>>>>> - When Quorum flushes to disk, there's a vote among the return values of >>>>>>> the flush operations of its members, and the one that wins is the one >>>>>>> that Quorum returns. >>>>>>> >>>>>>> - If there's a tie then Quorum choses the first result from the list of >>>>>>> winners. >>>>>>> >>>>>>> - With your patch you want to give priority to the vote with result == 0 >>>>>>> if there's any, so Quorum would return 0 (and succeed). >>>>>>> >>>>>>> This seems to me like an ad-hoc fix for a particular use case. What >>>>>>> if you have 3 members and two of them fail with the same error code? >>>>>>> Would you still return 0 or the error code from the other two? >>>>>> >>>>>> For example: >>>>>> children.0 returns 0 >>>>>> children.1 returns -EIO >>>>>> children.2 returns -EPIPE >>>>>> >>>>>> In this case, quorum returns -EPIPE now(without this patch). >>>>>> >>>>>> For example: >>>>>> children.0 returns -EPIPE >>>>>> children.1 returns -EIO >>>>>> children.2 returns 0 >>>>>> In this case, quorum returns 0 now. >>>>> >>>>> My question is: what's the rationale for returning 0 in case a) but not >>>>> in case b)? >>>>> >>>>> a) >>>>> children.0 returns -EPIPE >>>>> children.1 returns -EIO >>>>> children.2 returns 0 >>>>> >>>>> b) >>>>> children.0 returns -EIO >>>>> children.1 returns -EIO >>>>> children.2 returns 0 >>>>> >>>>> In both cases you have one successful flush and two errors. You want to >>>>> return always 0 in case a) and always -EIO in case b). But the only >>>>> difference is that in case b) the errors happen to be the same, so why >>>>> does that matter? >>>>> >>>>> That said, I'm not very convinced of the current logics of the Quorum >>>>> flush code either, so it's not even a problem with your patch... it >>>>> seems to me that the code should follow the same logics as in the >>>>> read/write case: if the number of correct flushes >= threshold then >>>>> return 0, else select the most common error code. >>>> >>>> I'm not convinced of the logic either, which is why I waited for you to >>>> respond to this patch. :-) >>>> >>>> Intuitively, I'd expect Quorum to return an error if flushing failed for >>>> any of the children, because, well, flushing failed. I somehow feel like >>>> flushing is different from a read or write operation and therefore >>>> ignoring the threshold would be fine here. However, maybe my intuition >>>> is just off. >>>> >>>> Anyway, regardless of that, if we do take the threshold into account, we >>>> should not use the exact error value for voting but just whether an >>>> error occurred or not. If all but one children fail to flush (all for >>>> different reasons), I find it totally wrong to return success. We should >>>> then just return -EIO or something. >>>> >>> Hi Berto & Max >>> >>> Thanks for your comments, i'd like to have a summary here. For flush cases: >>> >>> 1) if flush successfully(result >= 0), result = 0; else if result < 0, >>> result = -EIO. then invoke quorum_count_vote >>> 2) if correct flushes >= threshold, mark correct flushes as winner directly. > > Please try to return one of the real error codes instead of -EIO. > Essentially we should use the same logic as for writes (like Berto > suggested above). Yes, i correct myself. It seems everyone reaches an agreement in this thread, and will use the same logic as writes in next version. Thanks -Xie > >> I find it difficult to understand how this corresponds to the behaviour needed >> in COLO, where we have the NBD and the real storage on the primary; in that >> case the failure of the real storage should give an error to the guest, but the >> failure of the NBD shouldn't produce a guest visible failure. > > That's probably because you're abusing quorum as an active mirroring > filter, which it really isn't. This is okay for now so you have > something to work with, but I expect that eventually you'll need a > different driver. (Well, or maybe I'm mistaken and you actually do need > to read back data from NBD to compare it to the real storage - do you?) > > Anyway, I'm curious how you would handle a failed write/flush request to > the NBD target. Simply ignoring it doesn't feel right; in case of a > failover, wouldn't you switch to a potentially corrupted image then? > > Kevin > > > . >