From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jason Dillaman
Subject: Re: RBD journal draft design
Date: Thu, 4 Jun 2015 20:36:13 -0400 (EDT)
Message-ID: <810657134.11416115.1433464573115.JavaMail.zimbra@redhat.com>
References: <1574383603.9391063.1433257824183.JavaMail.zimbra@redhat.com>
 <1346174854.9391994.1433257911870.JavaMail.zimbra@redhat.com>
 <1679134333.10270211.1433348013379.JavaMail.zimbra@redhat.com>
 <1628237419.11058538.1433430488520.JavaMail.zimbra@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mx4-phx2.redhat.com ([209.132.183.25]:46863 "EHLO
 mx4-phx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1752849AbbFEAgO (ORCPT ); Thu, 4 Jun 2015 20:36:14 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Gregory Farnum
Cc: Ceph Development

> >> ...Actually, doesn't *not* forcing a coordinated move from one object
> >> set to another mean that you don't actually have an ordering guarantee
> >> across tags if you replay the journal objects in order?
> >
> > The ordering between tags was meant to be a soft ordering guarantee (since
> > any number of delays could throw off the actual order as delivered from
> > the OS). In the case of a VM using multiple RBD images sharing the same
> > journal, this provides an ordering guarantee per device but not between
> > devices.
> >
> > This is no worse than the case of each RBD image using its own journal
> > instead of sharing a journal and the behavior doesn't seem too different
> > from a non-RBD case when submitting requests to two different physical
> > devices (e.g. a SSD device and a NAS device will commit data at different
> > latencies).
>
> Yes, it's exactly the same. But I thought the point was that if you
> commingle the journals then you actually have the appropriate ordering
> across clients/disks (if there's enough ordering and synchronization)
> that you can stream the journal off-site and know that if there's any
> kind of disaster you are always at least crash-consistent. If there's
> arbitrary re-ordering of different volume writes at object boundaries
> then I don't see what benefit there is to having a commingled journal
> at all.
>
> I think there's a thing called a "consistency group" in various
> storage platforms that is sort of similar to this, where you can take
> a snapshot of a related group of volumes at once. I presume the
> commingled journal is an attempt at basically having an ongoing
> snapshot of the whole consistency group.

Seems like even with a SAN-type consistency group, you could still have
temporal ordering issues between volume writes unless it synchronized with
the client OSes to flush out all volumes at a consistent place so that the
snapshot could take place.

I suppose you could provide much tighter QEMU inter-volume ordering
guarantees if you modified the RBD block device so that each individual RBD
image instance was provided a mechanism to coordinate the allocation of the
sequence number between the images. Right now, each image is opened in its
own context w/ no knowledge of one another and no way to coordinate. The
current proposed tag + sequence number approach could be used to provide the
soft inter-volume ordering guarantees until QEMU / librbd could be modified
to support volume groupings.
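
To make the difference concrete, here is a rough Python sketch of the two cases. It is not librbd code: Entry, SeqAllocator, write_all and the "vda"/"vdb" tags are all made-up names for illustration, and the reversed arrival order simply stands in for entries landing in different journal objects. With one allocator per image, replay can recover the order within each tag but not between tags; with a single allocator shared by both images, the sequence number alone gives a total order.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    tag: str      # hypothetical: identifies the image that wrote the entry
    seq: int      # sequence number assigned when the entry is appended
    payload: str


class SeqAllocator:
    """Thread-safe, monotonically increasing sequence counter (illustrative)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0

    def allocate(self):
        with self._lock:
            seq = self._next
            self._next += 1
            return seq


# Order in which the guest actually issued the writes.
ISSUE_ORDER = [("vda", "a0"), ("vdb", "b0"), ("vda", "a1")]


def write_all(allocators):
    """Append ISSUE_ORDER using a tag -> allocator mapping.

    Returns the entries in reversed order to mimic them being read back
    from the journal in a different order than they were issued.
    """
    entries = [Entry(tag, allocators[tag].allocate(), payload)
               for tag, payload in ISSUE_ORDER]
    return entries[::-1]


def per_tag_order(entries):
    """Soft guarantee: recover the write order within each tag only."""
    by_tag = {}
    for e in sorted(entries, key=lambda e: e.seq):
        by_tag.setdefault(e.tag, []).append(e.payload)
    return by_tag


def global_order(entries):
    """Total order by seq; only meaningful if all tags share one allocator."""
    return [e.payload for e in sorted(entries, key=lambda e: e.seq)]


def demo():
    # Current proposal: each image allocates its own sequence numbers.
    indep = write_all({"vda": SeqAllocator(), "vdb": SeqAllocator()})
    # Hypothetical grouping: both images draw from one shared allocator.
    shared_alloc = SeqAllocator()
    shared = write_all({"vda": shared_alloc, "vdb": shared_alloc})
    return indep, shared


if __name__ == "__main__":
    indep, shared = demo()
    print("per-tag order (independent):", per_tag_order(indep))
    print("global order (shared):", global_order(shared))
```

In the independent case a0 and b0 both get sequence number 0, so nothing in the journal says which one came first. That tie is the "soft" part of the guarantee, and it is what a shared allocator between grouped images would remove.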