From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeff Garzik
Subject: Re: HAIL volunteer Rick Peralta
Date: Wed, 29 Jul 2009 15:14:45 -0400
Message-ID: <4A709FA5.50804@garzik.org>
References: <29025029.1248785350151.JavaMail.root@mswamui-andean.atl.sa.earthlink.net>
 <4A6F5A3E.1070907@garzik.org> <4A704376.6000303@tiac.net>
 <4A705D5C.9050909@garzik.org> <20090729105202.7d0410de.zaitcev@redhat.com>
 <4A70842E.8020908@garzik.org> <4A708E00.5000304@garzik.org>
 <1248892516.4526.15.camel@decade.local>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1248892516.4526.15.camel@decade.local>
Sender: hail-devel-owner@vger.kernel.org
List-ID:
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Fabian Deutsch
Cc: Pete Zaitcev, Rick Peralta, Project Hail, zaitcev@redat.com

Fabian Deutsch wrote:
> On Wednesday, 29.07.2009, at 13:59 -0400, Jeff Garzik wrote:
>>> Or to take converse logic -- is it likely that service->service
>>> replication is SLOWER than client->service replication?
>>>
>>> Every way I look at it, client->{service,service,service} replication
>>> seems both easy... and potentially slower than alternatives :)
>>
>> To elaborate a bit more... there obviously are cases where you want the
>> client to be the genesis of parallel data streams into the cloud.
>>
>> My point was more that there are real world situations where multiple
>> outgoing streams from the client are significantly slower than a single
>> stream into the cloud, plus asking the cloud to perform further copies.
>
> Yes, I agree that we should have just one stream per BLOB to one chunkd,
> but we might attach replication destinations when streaming this blob to
> one chunkd.
> The result is that we have just one stream to a chunkd instance,
> including some replication destinations, and chunkd will happily spread
> the replicates.
> So we are just keeping the logic of where to replicate to away from
> chunkd and leaving it to the client (which can ask a third daemon) where
> to store the replicates.

I think we all agree on keeping the logic of where to replicate to
away from chunkd.  chunkd should be as dumb^H^H^Hsimple as possible, to
permit maximum flexibility of chunkd-based applications.

chunkd-based applications will be the ones making chunk load balancing
decisions, for example.

> dsts[] = logic->getDstsFor(blob)
> chunk->put(blob, dsts) /* Will return after successful replication */
>
> The local in-cloud replication strategy, like chaining or parallel, could
> be passed too, but might not be as relevant as the destinations themselves.

The more I think about this, the more I think this will simply become a
configuration setting of the storage pool[1], i.e. inside tabled or
nfs4d configuration.  That would permit local administrators to decide
whether chaining (from the client!) or parallel should be used.

All of this, it must be noted, is long term discussion.  As of today,
chunkd is "de facto" coded to be parallel-from-client because that's the
only method possible today :)

	Jeff


[1] Or perhaps the concept of a storage pool -- a collection of chunkd's
shared by multiple applications -- will have its own configuration.
Another long term discussion for another day...
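
To make the proposal above concrete, here is a minimal in-memory sketch of
"one stream from the client, destinations attached, strategy chosen by
configuration".  None of this is chunkd's actual API: ChunkNode,
get_dsts_for, client_put and the "chain"/"parallel" strategy names are
hypothetical stand-ins, and the placement policy is a toy.  It only
illustrates that the placement logic can live outside the storage node
while the node just stores and forwards.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChunkNode:
    """Hypothetical stand-in for one chunkd instance (not the real daemon)."""
    name: str
    store: Dict[str, bytes] = field(default_factory=dict)
    bytes_received: int = 0

    def put(self, key: str, blob: bytes, dsts: List["ChunkNode"],
            strategy: str = "parallel") -> List[str]:
        """Store blob locally, replicate to dsts, return names holding a copy."""
        self.store[key] = blob
        self.bytes_received += len(blob)
        holders = [self.name]
        if not dsts:
            return holders
        if strategy == "chain":
            # Forward once; the next node forwards to the remaining ones.
            holders += dsts[0].put(key, blob, dsts[1:], "chain")
        elif strategy == "parallel":
            # This node sends one copy to every remaining destination.
            for node in dsts:
                holders += node.put(key, blob, [], "parallel")
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        return holders


def get_dsts_for(key: str, pool: List[ChunkNode], copies: int) -> List[ChunkNode]:
    """Toy placement policy kept outside the storage node (an assumption)."""
    copies = min(copies, len(pool))
    start = sum(key.encode()) % len(pool)
    return [pool[(start + i) % len(pool)] for i in range(copies)]


def client_put(key: str, blob: bytes, pool: List[ChunkNode],
               copies: int = 3, strategy: str = "chain") -> List[str]:
    """The client opens a single stream to the first node and attaches the rest."""
    targets = get_dsts_for(key, pool, copies)
    return targets[0].put(key, blob, targets[1:], strategy)
```

In this sketch the strategy argument plays the role of the storage-pool
configuration setting: the client uploads the blob once either way, and
only the in-cloud fan-out differs between "chain" and "parallel".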