* Re: HAIL volunteer Rick Peralta
       [not found]   ` <4A704376.6000303@tiac.net>
@ 2009-07-29 14:31     ` Jeff Garzik
  2009-07-29 16:52       ` Pete Zaitcev
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 14:31 UTC
  To: Rick Peralta; +Cc: Project Hail


(added hail-devel to CC, with permission)



Rick Peralta wrote:
> Hi Jeff,
> 
> Is there someone taking the point for the chunkd development?

That's me, for the moment.  :)

In general, since Pete and I are developers on the Linux kernel -- one of
the largest open source projects in the world -- I think importing the
tried-and-true kernel requirements for code maintainership makes sense:

Once you are intimately familiar with a codebase, and can answer
others' questions about design and code, you have reached the level 
where you could be a maintainer.

But more importantly -- that is not important!

Just contribute code, because that is how open source projects
advance in a particular direction anyway.  Linus Torvalds does not take
point of the Linux kernel -- the people who actually write code do that. 
  Linus writes maybe 0.01% of the code these days.  The people who code, 
make the roadmap.


> To get to 10 Gbe there may be a variety of problems:
> 1) If there are data copies, the multiplicative effect can saturate the 
> memory system.
> 2) If the primary storage is rotating media (disk), it would take 40+
> devices to keep up (assuming large stripes).
> 3) There are artifacts of the VM system that show up at high bandwidth, 
> especially if there is a lot of RAM.
> 4) The transaction frequency can become problematic, depending on the 
> application.
> 5) Running a single network transport at 10 Gbe can be challenging.

Indeed, scaling up the networking and storage can have plenty of 
implications.
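
As a rough back-of-the-envelope check of point 2 above, the disk count can be estimated by dividing the link rate by a per-disk streaming rate. This is only a sketch: the per-disk rates below are assumed, illustrative figures rather than measurements from this thread, and `disks_needed` is a hypothetical helper name.

```python
import math

def disks_needed(link_gbit_per_s: float, disk_mb_per_s: float) -> int:
    """Number of disks whose combined streaming rate matches the link rate."""
    # 10 Gbit/s is 1250 MB/s when using decimal units (1 MB = 10^6 bytes).
    link_mb_per_s = link_gbit_per_s * 1000 / 8
    return math.ceil(link_mb_per_s / disk_mb_per_s)

# Assumed sustained per-disk rates in MB/s (illustrative only).
for rate in (25, 30, 60):
    print(f"{rate} MB/s per disk -> {disks_needed(10, rate)} disks")
```

At an assumed 25 to 30 MB/s per disk, the estimate lands in the same range as the "40+ devices" figure quoted above; real numbers depend on stripe size and access pattern.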

We saw some of this when I was first working with NIC hardware 
manufacturers to add the first 10 Gbe NIC drivers to the Linux kernel.

chunkd is intentionally message-based, which implies that non-TCP 
protocols could readily be bolted on, for use in data centers with 10 
Gbe networks (AMQP? RDMA?).


> Is there an application profile that might be used as a performance 
> metric?  Something like total volume size (all media), total number of 
> chunks, distribution of chunk sizes, aggregate bandwidth, et cetera.

This is unfortunately going to vary wildly depending on the application 
using chunkd (and the application using that, in turn).

To take tabled as an example, and assuming a "standard cloud node" 
hardware setup,

* total single-node volume size:  one cheap SATA hard drive
* total number of chunks:  ==
	total number of tabled objects / number of storage nodes
* distribution of chunk sizes:  dependent upon the application using tabled
* aggregate bandwidth:  dependent upon the application using tabled


> If it matters I investigated building something very much like chunkd a 
> while back and had some stringent performance criteria.  It was not
> clear what general application demand there is.  Is there a resource to 
> get a sense of where there is real need?

You should read the GoogleFS paper referenced on the chunkd wiki page: 
http://labs.google.com/papers/gfs-sosp2003.pdf  It describes the purpose 
and use of a chunk server, in the context of distributed cloud storage.

The demand is largely internal -- other Project Hail projects and
outside distributed storage applications should use chunkd in the
creation of their own cloud-based service.

The intent is for tabled, nfs4d, and other distributed-storage projects 
to communicate with multiple chunkd's on multiple nodes, to accomplish 
replicated, highly available distributed storage.

If there is some specialized use of chunkd that you have in mind, we're 
interested in hearing that, too...   I certainly want to enable as many 
applications as possible with these projects.

Regards,

	Jeff





* Re: HAIL volunteer Rick Peralta
  2009-07-29 14:31     ` HAIL volunteer Rick Peralta Jeff Garzik
@ 2009-07-29 16:52       ` Pete Zaitcev
  2009-07-29 16:58         ` Fabian Deutsch
  2009-07-29 17:17         ` Jeff Garzik
  0 siblings, 2 replies; 10+ messages in thread
From: Pete Zaitcev @ 2009-07-29 16:52 UTC
  To: Jeff Garzik; +Cc: Rick Peralta, Project Hail, zaitcev

On Wed, 29 Jul 2009 10:31:56 -0400, Jeff Garzik <jeff@garzik.org> wrote:

> > Is there someone taking the point for the chunkd development?
> 
> That's me, for the moment.  :)

I have a short todo list for Chunk, after which I don't have
any particular plans:
 * Exit if CLD registration fails (maybe!).
 * Put ourhost into the CLD record, and the port.
 * Use base directory instead of Cell.
 * Switch to asprintf for CLD filenames, Geo.

So far we have managed to hack on the same codebase with relative ease.
Just make sure to post patches early.

> You should read the GoogleFS paper referenced on the chunkd wiki page: 
> http://labs.google.com/papers/gfs-sosp2003.pdf  It describes the purpose 
> and use of a chunk server, in the context of distributed cloud storage.

I think we're at a point where we have our own base of knowledge
and have evolved an overall architecture to the point we don't have to
ape every little detail of Google's architecture. In particular I'm
going to fight hard against any talk of Chunk doing its own replication,
for now at least.

-- Pete


* Re: HAIL volunteer Rick Peralta
  2009-07-29 16:52       ` Pete Zaitcev
@ 2009-07-29 16:58         ` Fabian Deutsch
  2009-07-29 17:17         ` Jeff Garzik
  1 sibling, 0 replies; 10+ messages in thread
From: Fabian Deutsch @ 2009-07-29 16:58 UTC
  To: Pete Zaitcev; +Cc: Jeff Garzik, Rick Peralta, Project Hail, zaitcev

On Wednesday, 29.07.2009 at 10:52 -0600, Pete Zaitcev wrote:
> I think we're at a point where we have our own base of knowledge
> and evolved an overall architecture to the point we don't have to
> ape every little detail of Google architecture. In particular I'm
> going to fight hard any talk of Chunk doing its own replication,
> for now at least.

Yes. I also think that chunkd should not do its own replication, as the
strategy may be domain/application dependent. Therefore I'd appreciate it
if chunkd provided some kind of "copy(dst,sha)" function, to be able
to copy directly to another chunkd instance.
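
A minimal sketch of what such a "copy(dst,sha)" call could look like, using a toy in-memory stand-in for chunkd. Everything here is hypothetical: the `ChunkServer` class, its methods, and the choice of SHA-1 hex digests as keys are assumptions made for illustration, not chunkd's actual API.

```python
import hashlib

class ChunkServer:
    """Toy in-memory stand-in for one chunkd instance (hypothetical API)."""

    def __init__(self, name: str):
        self.name = name
        self.blobs = {}  # hex digest -> blob bytes

    def put(self, data: bytes) -> str:
        """Store a blob and return its key (assumed here: SHA-1 hex digest)."""
        sha = hashlib.sha1(data).hexdigest()
        self.blobs[sha] = data
        return sha

    def copy(self, dst: "ChunkServer", sha: str) -> None:
        """Server-to-server copy: the caller only names the destination
        and the key; the blob itself never passes through the caller."""
        dst.blobs[sha] = self.blobs[sha]  # KeyError if we do not hold the blob

a, b = ChunkServer("a"), ChunkServer("b")
key = a.put(b"hello")
a.copy(b, key)  # the caller decides where the replica goes; chunkd just copies
```

The point of the design is that the replication decision stays with the caller, while the data movement happens between the two servers.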

- fabian



* Re: HAIL volunteer Rick Peralta
  2009-07-29 16:52       ` Pete Zaitcev
  2009-07-29 16:58         ` Fabian Deutsch
@ 2009-07-29 17:17         ` Jeff Garzik
  2009-07-29 17:59           ` Jeff Garzik
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 17:17 UTC
  To: Pete Zaitcev; +Cc: Rick Peralta, Project Hail, zaitcev

Pete Zaitcev wrote:
> On Wed, 29 Jul 2009 10:31:56 -0400, Jeff Garzik <jeff@garzik.org> wrote:
> 
>>> Is there someone taking the point for the chunkd development?
>> That's me, for the moment.  :)
> 
> I have a short todo list for Chunk, after which I don't have
> any particular plans:
>  * Exit if CLD registration fails (maybe!).

Hopefully all this is wrapped up into libcldc, such that an application
only needs to worry about major, abstracted events after calling
new-session:

* no master, after defined "hunt" procedure.

	This includes both init and master failure (as distinguished
	from fail-over).

	The application will need to be in the "no CLD session"
	state in both cases.

	And indeed, exit() might be the best way to do that.

* master fail-over

	Flush our [currently non-existent] CLD cache.

etc.
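
The two events above could be handled along the following lines. This is a hypothetical sketch only: `CldEvent`, `CldAwareApp` and `handle` are invented names for illustration and do not reflect the real libcldc interface.

```python
from enum import Enum, auto

class CldEvent(Enum):
    """Abstracted events an application might receive (hypothetical names)."""
    NO_MASTER = auto()        # hunt procedure finished without finding a master
    MASTER_FAILOVER = auto()  # session continues on a new master

class CldAwareApp:
    def __init__(self):
        self.has_session = True
        self.cache = {}            # stand-in for a (future) CLD cache
        self.exit_requested = False

    def handle(self, event: CldEvent) -> None:
        if event is CldEvent.NO_MASTER:
            # Covers both failed init and master failure: no CLD session.
            self.has_session = False
            self.exit_requested = True  # exiting may be the simplest response
        elif event is CldEvent.MASTER_FAILOVER:
            # Session survives, but cached CLD data may be stale.
            self.cache.clear()
```

A real application would call exit() or re-run session setup where this sketch merely sets `exit_requested`.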


>  * Put ourhost into the CLD record, and the port.
>  * Use base directory instead of Cell.
>  * Switch to asprintf for CLD filenames, Geo.

agreed


> So far we have managed to hack on the same codebase with relative ease.
> Just make sure to post patches early.
> 
>> You should read the GoogleFS paper referenced on the chunkd wiki page: 
>> http://labs.google.com/papers/gfs-sosp2003.pdf  It describes the purpose 
>> and use of a chunk server, in the context of distributed cloud storage.
> 
> I think we're at a point where we have our own base of knowledge
> and evolved an overall architecture to the point we don't have to
> ape every little detail of Google architecture.

Well, until the wiki has a description of the basic idea of a chunk 
server, the Google paper will have to do.

The point is not that we are aping Google, but more to describe the 
general concept to someone who does not know what a chunk server is, and 
how a chunk server fits into the "grand design."


> In particular I'm
> going to fight hard any talk of Chunk doing its own replication,
> for now at least.

WRT chunkd and replication, yes, that's fine for version 1.0.

But consider which is more likely to have bandwidth to spare:

	a) client -> service
		or
	b) service -> service

Of the two, I'd say "a" is a bit more likely to be remote (WAN) and have 
a slow-upload situation like my home cable modem (1 mbps down, 50 kbps 
up), and "b" is more likely to be LAN.

Or to take converse logic -- is it likely that service->service 
replication is SLOWER than client->service replication?

Every way I look at it, client->{service,service,service} replication 
seems both easy... and potentially slower than alternatives :)

	Jeff




* Re: HAIL volunteer Rick Peralta
  2009-07-29 17:17         ` Jeff Garzik
@ 2009-07-29 17:59           ` Jeff Garzik
  2009-07-29 18:19             ` Pete Zaitcev
  2009-07-29 18:35             ` Fabian Deutsch
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 17:59 UTC
  To: Pete Zaitcev; +Cc: Rick Peralta, Project Hail, zaitcev

Jeff Garzik wrote:
> Pete Zaitcev wrote:
>> In particular I'm
>> going to fight hard any talk of Chunk doing its own replication,
>> for now at least.
> 
> WRT chunkd and replication, yes, that's fine for version 1.0.
> 
> But consider which is more likely to have bandwidth to spare:
> 
>     a) client -> service
>         or
>     b) service -> service
> 
> Of the two, I'd say "a" is a bit more likely to be remote (WAN) and have 
> a slow-upload situation like my home cable modem (1 mbps down, 50 kbps 
> up), and "b" is more likely to be LAN.
> 
> Or to take converse logic -- is it likely that service->service 
> replication is SLOWER than client->service replication?
> 
> Every way I look at it, client->{service,service,service} replication 
> seems both easy... and potentially slower than alternatives :)

To elaborate a bit more...  there obviously are cases where you want the 
client to be the genesis of parallel data streams into the cloud.

My point was more that there are real world situations where multiple
outgoing streams from the client are significantly slower than a single
stream into the cloud, plus asking the cloud to perform further copies.
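
To put rough numbers on that comparison, here is a small sketch. It uses the 50 kbps uplink from the cable-modem example; the 1 Gbit/s in-cloud LAN rate, the 1 MB blob size and the function names are assumptions chosen for illustration, and the in-cloud copies are assumed to run one after another.

```python
def client_fanout_seconds(size_bits: float, uplink_bps: float, replicas: int) -> float:
    """Client uploads every replica itself over its (slow) uplink."""
    return replicas * size_bits / uplink_bps

def single_stream_seconds(size_bits: float, uplink_bps: float,
                          lan_bps: float, replicas: int) -> float:
    """Client uploads once; the cloud makes the remaining copies over the LAN."""
    return size_bits / uplink_bps + (replicas - 1) * size_bits / lan_bps

blob_bits = 8_000_000     # 1 MB blob (assumed size)
uplink = 50_000           # 50 kbps up, as in the cable-modem example
lan = 1_000_000_000       # 1 Gbit/s LAN (assumed)

print(client_fanout_seconds(blob_bits, uplink, 3))       # three uploads
print(single_stream_seconds(blob_bits, uplink, lan, 3))  # one upload + LAN copies
```

Under these assumptions the fan-out from the client takes roughly three times as long, because the uplink dominates and the LAN copies are nearly free by comparison.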

	Jeff




* Re: HAIL volunteer Rick Peralta
  2009-07-29 17:59           ` Jeff Garzik
@ 2009-07-29 18:19             ` Pete Zaitcev
  2009-07-29 18:30               ` Jeff Garzik
  2009-07-29 18:35             ` Fabian Deutsch
  1 sibling, 1 reply; 10+ messages in thread
From: Pete Zaitcev @ 2009-07-29 18:19 UTC
  To: Jeff Garzik; +Cc: Rick Peralta, Project Hail, zaitcev

On Wed, 29 Jul 2009 13:59:28 -0400, Jeff Garzik <jeff@garzik.org> wrote:

> My point was more that there are real world situations where multiple
> outgoing streams from the client are significantly slower than a single
> stream into the cloud, plus asking the cloud to perform further copies.

We aren't exposing Chunk to clients outside the cloud, at least at present.
Let's wait and see if applications exist that require it, and how they
compete, say, with a bunch of NFS servers. The biggest difference about
Chunk is its ability to plug into CLD. Drop that and it's not that
special. But you can only use CLD if you are inside the cloud; in fact
it works best when you're inside the same data center with the CLD cell.
So I don't see the bandwidth argument having much weight.

I'm going to remember Fabian's idea of third-party transfers, of course.
That potentially offers a significant reduction of load on tabled.
But mythical outside clients of Chunk are yet to be demonstrated.

-- Pete


* Re: HAIL volunteer Rick Peralta
  2009-07-29 18:19             ` Pete Zaitcev
@ 2009-07-29 18:30               ` Jeff Garzik
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 18:30 UTC
  To: Pete Zaitcev; +Cc: Rick Peralta, Project Hail

Pete Zaitcev wrote:
> On Wed, 29 Jul 2009 13:59:28 -0400, Jeff Garzik <jeff@garzik.org> wrote:
> 
>> My point was more that there are real world situations where multiple
>> outgoing streams from the client are significantly slower than a single
>> stream into the cloud, plus asking the cloud to perform further copies.
> 
> We aren't exposing Chunk to clients outside the cloud, at least at present.
> Let's wait and see if applications exist that require it, and how they
> compete, say, with a bunch of NFS servers. The biggest difference about
> Chunk is its ability to plug into CLD. Drop that and it's not that
> special. But you can only use CLD if you are inside the cloud; in fact
> it works best when you're inside the same data center with the CLD cell.
> So I don't see the bandwidth argument having much weight.
> 
> I'm going to remember Fabian's idea of third-party transfers, of course.
> That potentially offers a significant reduction of load on tabled.
> But mythical outside clients of Chunk are yet to be demonstrated.

Note that "mythical outside clients" is a key design element in NFS 
v4.1, which is a parallel distributed filesystem technology.  chunkd 
clients are quite a bit different from CLD clients.

Similarly, Hadoop DFS, CloudStore and GoogleFS clients [analogously]
talk directly to chunkd rather than going through tabled.

	Jeff





* Re: HAIL volunteer Rick Peralta
  2009-07-29 17:59           ` Jeff Garzik
  2009-07-29 18:19             ` Pete Zaitcev
@ 2009-07-29 18:35             ` Fabian Deutsch
  2009-07-29 19:14               ` Jeff Garzik
  1 sibling, 1 reply; 10+ messages in thread
From: Fabian Deutsch @ 2009-07-29 18:35 UTC
  To: Jeff Garzik; +Cc: Pete Zaitcev, Rick Peralta, Project Hail, zaitcev

On Wednesday, 29.07.2009 at 13:59 -0400, Jeff Garzik wrote:
> > Or to take converse logic -- is it likely that service->service
> > replication is SLOWER than client->service replication?
> >
> > Every way I look at it, client->{service,service,service} replication
> > seems both easy... and potentially slower than alternatives :)
>
> To elaborate a bit more...  there obviously are cases where you want the
> client to be the genesis of parallel data streams into the cloud.
>
> My point was more that there are real world situations where multiple
> outgoing streams from the client are significantly slower than a single
> stream into the cloud, plus asking the cloud to perform further copies.

Yes, I agree that we should have just one stream per blob to one chunkd,
but we might attach replication destinations when streaming this blob to
that chunkd.
The result is that we have just one stream to a chunkd instance,
including some replication destinations, and chunkd will happily spread
the replicas.
So we keep the logic of where to replicate to away from chunkd, and
leave it to the client (which can ask a third daemon) to decide where
to store the replicas.

dsts[] = logic->getDstsFor(blob)
chunk->put(blob, dsts) /* Will return after successful replication. */

The local in-cloud replication strategy, such as chaining or parallel,
could be passed too, but might not be as relevant as the destinations
themselves.
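
The two pseudocode lines above can be fleshed out into a runnable sketch. All names here (`Chunkd`, `put`, `get_dsts_for`) are hypothetical, and the round-robin placement rule is just an assumed stand-in for whatever logic the client or a third daemon would really use.

```python
class Chunkd:
    """Toy in-memory stand-in for a chunkd instance (hypothetical API)."""

    def __init__(self, name: str):
        self.name = name
        self.store = {}

    def put(self, key: str, blob: bytes, dsts=()) -> bool:
        """Store locally, then forward to each destination.
        Returns only after every requested copy exists."""
        self.store[key] = blob
        for dst in dsts:
            dst.put(key, blob)  # forwarded without destinations: no loops
        return True

def get_dsts_for(primary: Chunkd, nodes: list, copies: int = 2) -> list:
    """Placement logic kept outside chunkd (assumed rule: the next
    `copies` nodes after the primary, round-robin)."""
    start = nodes.index(primary)
    return [nodes[(start + i) % len(nodes)] for i in range(1, copies + 1)]

nodes = [Chunkd(n) for n in "abcd"]
primary = nodes[0]
dsts = get_dsts_for(primary, nodes)
primary.put("blob-1", b"data", dsts)  # one client stream, two in-cloud copies
```

Here chunkd only follows the destination list it is handed; swapping in a different `get_dsts_for` changes the placement policy without touching the server.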

- fabian



* Re: HAIL volunteer Rick Peralta
  2009-07-29 18:35             ` Fabian Deutsch
@ 2009-07-29 19:14               ` Jeff Garzik
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 19:14 UTC
  To: Fabian Deutsch; +Cc: Pete Zaitcev, Rick Peralta, Project Hail, zaitcev

Fabian Deutsch wrote:
> On Wednesday, 29.07.2009 at 13:59 -0400, Jeff Garzik wrote:
>>> Or to take converse logic -- is it likely that service->service
>>> replication is SLOWER than client->service replication?
>>>
>>> Every way I look at it, client->{service,service,service} replication
>>> seems both easy... and potentially slower than alternatives :)
>> To elaborate a bit more...  there obviously are cases where you want the
>> client to be the genesis of parallel data streams into the cloud.
>>
>> My point was more that there are real world situations where multiple
>> outgoing streams from the client are significantly slower than a single
>> stream into the cloud, plus asking the cloud to perform further copies.
> 
> Yes, I agree that we should have just one stream per blob to one chunkd,
> but we might attach replication destinations when streaming this blob to
> that chunkd.
> The result is that we have just one stream to a chunkd instance,
> including some replication destinations, and chunkd will happily spread
> the replicas.
> So we keep the logic of where to replicate to away from chunkd, and
> leave it to the client (which can ask a third daemon) to decide where
> to store the replicas.

I think we all agree on keeping the logic of where to replicate to, away 
from chunkd.

chunkd should be as dumb^H^H^Hsimple as possible, to permit maximum 
flexibility of chunkd-based applications.

chunkd-based applications will be the ones making chunk load balancing 
decisions, for example.


> dsts[] = logic->getDstsFor(blob)
> chunk->put(blob, dsts) /* Will return after successful replication. */
> 
> The local in-cloud replication strategy, such as chaining or parallel,
> could be passed too, but might not be as relevant as the destinations
> themselves.

The more I think about this, the more I think this will simply become a 
configuration setting of the storage pool[1], i.e. inside tabled or 
nfs4d configuration.

That would permit local administrators to make a decision whether 
chaining (from the client!) or parallel should be used.

All of this, it must be noted, is long term discussion.

As of today, chunkd is de facto coded to be parallel-from-client
because that's the only method possible today :)

	Jeff


[1] Or perhaps the concept of a storage pool -- a collection of chunkd's 
shared by multiple applications -- will have its own configuration. 
Another long term discussion for another day...
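
As an illustration of the chaining-versus-parallel choice as a storage pool setting, the sketch below turns a strategy name into a list of (source, destination) transfers. The configuration keys, strategy names and `transfer_plan` helper are all hypothetical; nothing like this exists in tabled or nfs4d configuration as far as this thread states.

```python
def transfer_plan(strategy: str, client: str, nodes: list) -> list:
    """Return the (source, destination) hops for one replicated write."""
    if strategy == "parallel":
        # The client opens one stream to every chunkd node.
        return [(client, node) for node in nodes]
    if strategy == "chain":
        # The client writes to the first node; each node feeds the next.
        hops = [client] + list(nodes)
        return list(zip(hops, hops[1:]))
    raise ValueError(f"unknown replication strategy: {strategy!r}")

# Hypothetical storage pool configuration chosen by a local administrator.
pool = {"replication": "chain", "nodes": ["chunkd1", "chunkd2", "chunkd3"]}
for src, dst in transfer_plan(pool["replication"], "client", pool["nodes"]):
    print(src, "->", dst)
```

With "parallel" the client carries every copy itself, which matches what chunkd supports today; "chain" starts at the client but shifts the later hops onto the in-cloud network.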






* Re: HAIL volunteer Rick Peralta
@ 2009-07-31 16:37 Rick Peralta
  0 siblings, 0 replies; 10+ messages in thread
From: Rick Peralta @ 2009-07-31 16:37 UTC
  To: Project Hail; +Cc: Pete Zaitcev, Rick Peralta, jeff

Hi All,

Thanks for inviting me to the forum and thanks to you all for making things happen!

My father said, "don't change anything unless you know why".  Those words ring in my ears more and more after decades of System development.  It is my intention and hope to respect the wisdom of those words and be clear about what the objectives of any endeavor are (including sloth ;^).

The chunkd effort caught my eye for a variety of reasons.  It is functionally very much like something I advocated for a long time ago, it is a relatively simple yet powerful machine, and it may benefit from some redesign for performance (my personal specialty).

The question at hand is: What truly needs to be done?  Bugs are bugs and one can debate one solution over another, but in the end it's about getting things to work well.  Multithreading the transport layer is probably a good idea, but some diligence should be paid to why.  There are any number of other open issues that also deserve some attention.  Coding is fine, but understanding what and why seems to be a first step.

In order to have a common basis for evaluation, I'd like to suggest a standard platform to use as the context for discussions: the current implementation of chunkd, running on a standard server (probably with a 32-bit address space), with gigabit Ethernet and a single disk (good for about 25 MB/s and 15 ms seek time).  More or different bulk storage, 10 Gbe, IB or other high-bandwidth implementations and so forth can be considered as branches from the core model.

The current implementation of chunkd generally resides in user space, on top of a standard file system (complete with caches, overhead and whatever else comes along).
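
To show why chunk size matters on such a platform, the sketch below estimates effective per-disk throughput from the 25 MB/s and 15 ms figures above. It assumes exactly one seek per chunk and ignores file system caching and overhead, so it is a crude model rather than a measurement; `effective_mb_per_s` is a hypothetical helper name.

```python
def effective_mb_per_s(chunk_bytes: int,
                       disk_mb_per_s: float = 25.0,
                       seek_s: float = 0.015) -> float:
    """Throughput when every chunk costs one seek plus its transfer time."""
    transfer_s = chunk_bytes / (disk_mb_per_s * 1e6)
    return (chunk_bytes / 1e6) / (seek_s + transfer_s)

# Illustrative chunk sizes: 64 KiB, 1 MB, 64 MB.
for size in (64 * 1024, 1_000_000, 64_000_000):
    print(f"{size:>10} bytes -> {effective_mb_per_s(size):.2f} MB/s")
```

Under this model small chunks are dominated by seek time, while large chunks approach the disk's streaming rate, which stays well below what gigabit Ethernet (about 125 MB/s) can carry.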

PZ>
I have a short todo list for Chunk, after which I don't have
any particular plans:
 * Exit if CLD registration fails (maybe!).
 * Put ourhost into the CLD record, and the port.
 * Use base directory instead of Cell.
 * Switch to asprintf for CLD filenames, Geo.

FD>
Yes. I also think that chunkd should not do its own replication, as the
strategy may be domain/application dependent. Therefore I'd appreciate it
if chunkd provided some kind of "copy(dst,sha)" function, to be able
to copy directly to another chunkd instance.

JG>
Hopefully all this is wrapped up into libcldc...

JG>
* total single-node volume size:  one cheap SATA hard drive
* total number of chunks:  ==
	total number of tabled objects / number of storage nodes
* distribution of chunk sizes:  dependent upon the application using tabled
* aggregate bandwidth:  dependent upon the application using tabled

fbp>
Might we put some numbers to this?
Most notable are typical chunk size and number of supported clients.

 - Rick Peralta
    www.linkedin.com/in/rickperalta



end of thread, other threads:[~2009-07-31 16:37 UTC | newest]

Thread overview: 10+ messages
     [not found] <29025029.1248785350151.JavaMail.root@mswamui-andean.atl.sa.earthlink.net>
     [not found] ` <4A6F5A3E.1070907@garzik.org>
     [not found]   ` <4A704376.6000303@tiac.net>
2009-07-29 14:31     ` HAIL volunteer Rick Peralta Jeff Garzik
2009-07-29 16:52       ` Pete Zaitcev
2009-07-29 16:58         ` Fabian Deutsch
2009-07-29 17:17         ` Jeff Garzik
2009-07-29 17:59           ` Jeff Garzik
2009-07-29 18:19             ` Pete Zaitcev
2009-07-29 18:30               ` Jeff Garzik
2009-07-29 18:35             ` Fabian Deutsch
2009-07-29 19:14               ` Jeff Garzik
2009-07-31 16:37 Rick Peralta
