From: Morey Roof
Date: Sat, 20 Sep 2008 17:04:44 -0600
To: drbd-dev@lists.linbit.com
Subject: Re: [Drbd-dev] New Features
Message-ID: <48D5818C.1030703@gmail.com>
In-Reply-To: <20080920134827.GD16149@racke>

This is pretty much what I was thinking. For the btree, generations,
and ref-counting, a good example to look at is how btrfs works (that is
the other project I have started to mess with). The design is very
efficient, and I think we could use a very close match for our setup.

I haven't read the paper you sent yet, but I will get to it today. Let
me know how you would like to start; I can begin working on a proof of
concept and we can see how to go from there.

-Morey

Lars Ellenberg wrote:
> On Sat, Sep 20, 2008 at 03:18:21PM +0200, Lars Ellenberg wrote:
>
> "Write-Back" cache:
>
> some things to think of when introducing a write back cache,
>  * need to do some cache coherency protocol
>  * need to track which block is where, so we can read the correct
>    version in case it has not yet been committed to final location
>  * if using a ram buffer as log disk, we need to track the latest
>    position for overwrites.
>  * if we have stages, i.e. ram buffer first, then log disk, then real
>    storage, we are the most flexible.
>    if peers are sufficiently close, we can send_page from the ram
>    buffer (and calculate checksums there, for data integrity).
>    if we use it as ring buffer, we'd not have to worry about
>    inconsistencies resulting from changes to in-flight buffers,
>    as they are all private.
>  * if we can use some efficient combination of digital tree,
>    btree and hash table to track which block is where,
>    we might be able to track a large, staged log device
>    as a sort of log-structured block device, making snapshots after the
>    fact for data generations still covered by the log very easy.
>  * we need a good refcount scheme on the ram buffers.
>
> of course we can start out "simple", and just provide a static cache, no
> ring buffer or anything.
>
> this should probably be implemented as a generic device-mapper target,
> which also makes testing much easier.
>
> which would make it possible to even add it to the current drbd
> by just stacking it in front of the "lower-level device".
>
> for the "write-back" to "write-through" change,
> we only need a minimal change in the current drbd module, which we can
> enable based on the type of the device directly below us.
> we could detect whether it's a device-mapper target,
> if so, which one, and access its special methods if any.
>
> this still sounds a little quirky, so I'd suggest to introduce a special
> BIO_RW_WRITE_THROUGH (to be defined) bit for the bi_flags.
>
> when not using it, that is write back.
> when using it, it would trigger a flush of any pending requests,
> and a direct remapping to the lower level device.
>
> BIO_RW_BARRIER requests would still need to trigger a flush as well,
> and to go straight through.
>
> in the new architecture, where "drbd" probably becomes just a special
> implementation and collection of device-mapper targets, communicating
> with other device mapper targets becomes easier (I hope).
>
> does that make sense?
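
To make the write-back versus write-through behaviour from the quoted
mail concrete, here is a minimal user-space sketch in Python. It is only
a toy model, not DRBD or device-mapper code: the class WriteBackCache,
its methods, and the write_through / barrier keyword arguments are all
hypothetical names standing in for the proposed (and still undefined)
BIO_RW_WRITE_THROUGH bit and for BIO_RW_BARRIER. A plain dict plays the
role of the lower-level device, and a second dict tracks "which block is
where" for data that has not yet been committed.

```python
class WriteBackCache:
    """Toy model of a write-back cache in front of a lower-level device.

    Assumption: 'backing' is a dict mapping block number -> data and
    stands in for the real storage. Nothing here is a kernel API.
    """

    def __init__(self, backing):
        self.backing = backing
        # Block number -> newest data not yet written back. Keeping only
        # the latest entry per block handles overwrites.
        self.pending = {}

    def write(self, block, data, write_through=False, barrier=False):
        if write_through or barrier:
            # Flush everything queued so far, then pass this request
            # straight through to the lower-level device.
            self.flush()
            self.backing[block] = data
        else:
            # Write-back: hold the data in the cache for now.
            self.pending[block] = data

    def read(self, block):
        # Return the newest version, wherever it currently lives.
        if block in self.pending:
            return self.pending[block]
        return self.backing.get(block)

    def flush(self):
        # Commit all pending blocks to their final location.
        for block, data in self.pending.items():
            self.backing[block] = data
        self.pending.clear()


# Usage: ordinary writes stay in the cache until a write-through
# (or barrier) request forces a flush.
disk = {}
cache = WriteBackCache(disk)
cache.write(1, b"old")
cache.write(1, b"new")                      # overwrite while still cached
cache.write(2, b"sync", write_through=True)
```

A real implementation would also need the pieces this sketch leaves out:
the cache coherency protocol between peers, the staged ram buffer and
log disk, and reference counting on the buffers.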