From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeff Garzik <jeff@garzik.org>
Subject: Re: HAIL volunteer Rick Peralta
Date: Wed, 29 Jul 2009 15:14:45 -0400
Message-ID: <4A709FA5.50804@garzik.org>
References: <29025029.1248785350151.JavaMail.root@mswamui-andean.atl.sa.earthlink.net>	 <4A6F5A3E.1070907@garzik.org>	<4A704376.6000303@tiac.net>	 <4A705D5C.9050909@garzik.org> <20090729105202.7d0410de.zaitcev@redhat.com>	 <4A70842E.8020908@garzik.org>  <4A708E00.5000304@garzik.org> <1248892516.4526.15.camel@decade.local>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path: <hail-devel-owner@vger.kernel.org>
In-Reply-To: <1248892516.4526.15.camel@decade.local>
Sender: hail-devel-owner@vger.kernel.org
List-ID: <hail-devel.vger.kernel.org>
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Fabian Deutsch <fabian.deutsch@gmx.de>
Cc: Pete Zaitcev <zaitcev@redhat.com>, Rick Peralta <fbp@tiac.net>, Project Hail <hail-devel@vger.kernel.org>, zaitcev@redat.com

Fabian Deutsch wrote:
> On Wednesday, 29.07.2009 at 13:59 -0400, Jeff Garzik wrote:
>>> Or to take converse logic -- is it likely that service->service 
>>> replication is SLOWER than client->service replication?
>>>
>>> Every way I look at it, client->{service,service,service}
>>> replication seems both easy... and potentially slower than
>>> alternatives :)
>> To elaborate a bit more...  there obviously are cases where you want
>> the client to be the genesis of parallel data streams into the cloud.
>>
>> My point was more that there are real world situations where multiple
>> outgoing streams from the client are significantly slower than a
>> single stream into the cloud, plus asking the cloud to perform
>> further copies.
> 
> Yes, I agree that we should have just one stream per BLOB to one chunkd,
> but we might attach replication destinations when streaming this blob to
> that chunkd.
> The result is that we have just one stream to a chunkd instance,
> including some replication destinations, and chunkd will happily spread
> the replicas.
> So we are just keeping the logic of where to replicate to away from
> chunkd, and leaving it to the client (which can ask a third daemon) to
> decide where to store the replicas.

I think we all agree on keeping the logic of where to replicate to, away 
from chunkd.

chunkd should be as dumb^H^H^Hsimple as possible, to permit maximum 
flexibility of chunkd-based applications.

chunkd-based applications will be the ones making chunk load balancing 
decisions, for example.
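
A minimal sketch of what such an application-side decision might look
like, assuming the application tracks a simple per-node load figure. The
names ChunkdNode, used_bytes and pick_destinations are hypothetical and
not part of any existing chunkd or tabled API; the point is only that
the policy lives outside chunkd.

```python
from dataclasses import dataclass


@dataclass
class ChunkdNode:
    host: str
    used_bytes: int  # hypothetical load metric kept by the application


def pick_destinations(nodes, replicas):
    """Return the `replicas` least-loaded nodes.

    The placement policy is entirely the application's; chunkd is
    only told where the data goes.
    """
    if replicas > len(nodes):
        raise ValueError("not enough chunkd nodes for requested replicas")
    return sorted(nodes, key=lambda n: n.used_bytes)[:replicas]


nodes = [
    ChunkdNode("chunk1", 700),
    ChunkdNode("chunk2", 100),
    ChunkdNode("chunk3", 400),
]
dsts = pick_destinations(nodes, 2)
```

A different application could swap in any other policy (rack
awareness, round robin) without touching chunkd.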


> dsts[] = logic->getDstsFor(blob)
> chunk->put(blob, dsts) /* Will return after successful replication. */
> 
> The local in-cloud replication strategy, like chaining or parallel, could
> be passed too, but might not be as relevant as the destinations themselves.

The more I think about this, the more I think this will simply become a 
configuration setting of the storage pool[1], i.e. inside tabled or 
nfs4d configuration.

That would permit local administrators to make a decision whether 
chaining (from the client!) or parallel should be used.
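
As a rough illustration only, assuming a hypothetical pool setting
named "replication" with the values "parallel" and "chain", the choice
could boil down to how the list of transfers is planned. Nothing here
is an existing tabled or nfs4d option; plan_transfers and the setting
name are made up for the example.

```python
def plan_transfers(client, dsts, strategy):
    """Return (source, destination) hops for one blob.

    "parallel": the client streams to every destination itself.
    "chain":    the client streams once; each node forwards to the next.
    """
    if strategy == "parallel":
        return [(client, d) for d in dsts]
    if strategy == "chain":
        hops = []
        src = client
        for d in dsts:
            hops.append((src, d))
            src = d  # the node that just received the blob forwards it
        return hops
    raise ValueError("unknown replication strategy: %r" % (strategy,))


# Hypothetical storage pool configuration chosen by the administrator.
pool_config = {"replication": "chain"}
hops = plan_transfers("client", ["chunk1", "chunk2", "chunk3"],
                      pool_config["replication"])
```

With "chain" the client opens a single outgoing stream; with
"parallel" it opens one per destination.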

All of this, it must be noted, is a long-term discussion.

As of today, chunkd is de facto coded to be parallel-from-client 
because that's the only method possible today :)

	Jeff


[1] Or perhaps the concept of a storage pool -- a collection of chunkd 
instances shared by multiple applications -- will have its own 
configuration.  Another long-term discussion for another day...