* Re: HAIL volunteer Rick Peralta
[not found] ` <4A704376.6000303@tiac.net>
@ 2009-07-29 14:31 ` Jeff Garzik
2009-07-29 16:52 ` Pete Zaitcev
0 siblings, 1 reply; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 14:31 UTC (permalink / raw)
To: Rick Peralta; +Cc: Project Hail
(added hail-devel to CC, with permission)
Rick Peralta wrote:
> Hi Jeff,
>
> Is there someone taking the point for the chunkd development?
That's me, for the moment. :)
In general, since Pete and I are Linux kernel developers -- one of the
largest open source projects in the world -- I think importing the
tried-and-true kernel requirements for code maintainership makes sense:
Once you are intimately familiar with a codebase, and can answer
others' questions about design and code, you have reached the level
where you could be a maintainer.
But more importantly -- that is not important!
Just contribute code, because that is why open source projects
advance in a particular direction anyway. Linus Torvalds does not take
point of the Linux kernel -- the people who actually write code do that.
Linus writes maybe 0.01% of the code these days. The people who code,
make the roadmap.
> To get to 10 Gbe there may be a variety of problems:
> 1) If there are data copies, the multiplicative effect can saturate the
> memory system.
> 2) If the primary storage is rotating media (disk), it would take 40+
> devices to keep up (assuming large stripes).
> 3) There are artifacts of the VM system that show up at high bandwidth,
> especially if there is a lot of RAM.
> 4) The transaction frequency can become problematic, depending on the
> application.
> 5) Running a single network transport at 10 Gbe can be challenging.
Indeed, scaling up the networking and storage can have plenty of
implications.
We saw some of this when I was first working with NIC hardware
manufacturers to add the first 10 Gbe NIC drivers to the Linux kernel.
chunkd is intentionally message-based, which implies that non-TCP
protocols could readily be bolted on, for use in data centers with 10
Gbe networks (AMQP? RDMA?).
> Is there an application profile that might be used as a performance
> metric? Something like total volume size (all media), total number of
> chunks, distribution of chunk sizes, aggregate bandwidth, et cetera.
This is unfortunately going to vary wildly depending on the application
using chunkd (and the application using that, in turn).
To take tabled as an example, and assuming a "standard cloud node"
hardware setup,
* total single-node volume size: one cheap SATA hard drive
* total number of chunks: ==
total number of tabled objects / number of storage nodes
* distribution of chunk sizes: dependent upon the application using tabled
* aggregate bandwidth: dependent upon the application using tabled
> If it matters I investigated building something very much like chunkd a
> while back and had some stringent performance criteria. It was not
> clear what general application demand there is. Is there a resource to
> get a sense of where there is real need?
You should read the GoogleFS paper referenced on the chunkd wiki page:
http://labs.google.com/papers/gfs-sosp2003.pdf It describes the purpose
and use of a chunk server, in the context of distributed cloud storage.
The demand is largely internal -- other Project Hail projects and
outside distributed storage applications should use chunkd in the
creation of their own cloud-based service.
The intent is for tabled, nfs4d, and other distributed-storage projects
to communicate with multiple chunkd's on multiple nodes, to accomplish
replicated, highly available distributed storage.
If there is some specialized use of chunkd that you have in mind, we're
interested in hearing that, too... I certainly want to enable as many
applications as possible with these projects.
Regards,
Jeff
* Re: HAIL volunteer Rick Peralta
2009-07-29 14:31 ` HAIL volunteer Rick Peralta Jeff Garzik
@ 2009-07-29 16:52 ` Pete Zaitcev
2009-07-29 16:58 ` Fabian Deutsch
2009-07-29 17:17 ` Jeff Garzik
0 siblings, 2 replies; 10+ messages in thread
From: Pete Zaitcev @ 2009-07-29 16:52 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Rick Peralta, Project Hail, zaitcev
On Wed, 29 Jul 2009 10:31:56 -0400, Jeff Garzik <jeff@garzik.org> wrote:
> > Is there someone taking the point for the chunkd development?
>
> That's me, for the moment. :)
I have a short to-do list for Chunk, after which I don't have
any particular plans:
* Exit if CLD registration fails (maybe!).
* Put ourhost into the CLD record, and the port.
* Use base directory instead of Cell.
* Switch to asprintf for CLD filenames, Geo.
So far we have managed to hack on the same codebase with relative ease.
Just make sure to post patches early.
> You should read the GoogleFS paper referenced on the chunkd wiki page:
> http://labs.google.com/papers/gfs-sosp2003.pdf It describes the purpose
> and use of a chunk server, in the context of distributed cloud storage.
I think we're at a point where we have our own base of knowledge
and evolved an overall architecture to the point we don't have to
ape every little detail of Google architecture. In particular I'm
going to fight hard any talk of Chunk doing its own replication,
for now at least.
-- Pete
* Re: HAIL volunteer Rick Peralta
2009-07-29 16:52 ` Pete Zaitcev
@ 2009-07-29 16:58 ` Fabian Deutsch
2009-07-29 17:17 ` Jeff Garzik
1 sibling, 0 replies; 10+ messages in thread
From: Fabian Deutsch @ 2009-07-29 16:58 UTC (permalink / raw)
To: Pete Zaitcev; +Cc: Jeff Garzik, Rick Peralta, Project Hail, zaitcev
On Wednesday, 29.07.2009, at 10:52 -0600, Pete Zaitcev wrote:
> I think we're at a point where we have our own base of knowledge
> and evolved an overall architecture to the point we don't have to
> ape every little detail of Google architecture. In particular I'm
> going to fight hard any talk of Chunk doing its own replication,
> for now at least.
Yes. I also think that chunkd should not do its own replication, as the
strategy may be domain/application dependent. Therefore I'd appreciate it
if chunkd would provide some kind of "copy(dst,sha)" function, to be able
to directly copy to another chunkd instance.
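To make the idea concrete, here is a minimal Python sketch of what such a
function could look like. It assumes blobs are keyed by their SHA-1 digest;
ChunkStore, put, get and copy are hypothetical names used only for
illustration, not the actual chunkd API.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory stand-in for one chunkd instance."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}  # SHA-1 hex digest -> blob bytes

    def put(self, data):
        """Store a blob and return the SHA-1 key it is filed under."""
        sha = hashlib.sha1(data).hexdigest()
        self.blobs[sha] = data
        return sha

    def get(self, sha):
        return self.blobs[sha]

    def copy(self, dst, sha):
        """Push one blob directly to another store; the caller picks dst."""
        dst.blobs[sha] = self.blobs[sha]


node_a = ChunkStore("a")
node_b = ChunkStore("b")
sha = node_a.put(b"hello")
node_a.copy(node_b, sha)  # data moves store-to-store, not via the caller
```

The point of the sketch is that the caller only names the destination and
the key; the decision of where to copy stays outside the store.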
- fabian
* Re: HAIL volunteer Rick Peralta
2009-07-29 16:52 ` Pete Zaitcev
2009-07-29 16:58 ` Fabian Deutsch
@ 2009-07-29 17:17 ` Jeff Garzik
2009-07-29 17:59 ` Jeff Garzik
1 sibling, 1 reply; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 17:17 UTC (permalink / raw)
To: Pete Zaitcev; +Cc: Rick Peralta, Project Hail, zaitcev
Pete Zaitcev wrote:
> On Wed, 29 Jul 2009 10:31:56 -0400, Jeff Garzik <jeff@garzik.org> wrote:
>
>>> Is there someone taking the point for the chunkd development?
>> That's me, for the moment. :)
>
> I have a short to-do list for Chunk, after which I don't have
> any particular plans:
> * Exit if CLD registration fails (maybe!).
Hopefully all this is wrapped up into libcldc, such that an application
only needs to worry about major, abstracted events after calling
new-session:
* no master, after defined "hunt" procedure.
This includes both init and master failure (as distinguished
from fail-over).
The application will need to be in the "no CLD session"
state in both cases.
And indeed, exit() might be the best way to do that.
* master fail-over
Flush our [currently non-existent] CLD cache.
etc.
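As a rough illustration of the two abstracted events above, here is a small
Python sketch. The event names, the CldClientState class and its fields are
all hypothetical; this is not the libcldc interface, only one way an
application might react to such events.

```python
from enum import Enum, auto


class CldEvent(Enum):
    NO_MASTER = auto()        # hunt found no master (at init or after master failure)
    MASTER_FAILOVER = auto()  # a different master has taken over


class CldClientState:
    """Hypothetical application-side view of its CLD session."""

    def __init__(self):
        self.has_session = True
        self.cache = {"example-key": b"cached record"}

    def handle(self, event):
        if event is CldEvent.NO_MASTER:
            # Enter the "no CLD session" state; a real daemon might exit() here.
            self.has_session = False
        elif event is CldEvent.MASTER_FAILOVER:
            # Session survives, but cached CLD data can no longer be trusted.
            self.cache.clear()


state = CldClientState()
state.handle(CldEvent.MASTER_FAILOVER)
after_failover = (state.has_session, len(state.cache))
state.handle(CldEvent.NO_MASTER)
after_no_master = state.has_session
```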
> * Put ourhost into the CLD record, and the port.
> * Use base directory instead of Cell.
> * Switch to asprintf for CLD filenames, Geo.
agreed
> So far we have managed to hack on the same codebase with relative ease.
> Just make sure to post patches early.
>
>> You should read the GoogleFS paper referenced on the chunkd wiki page:
>> http://labs.google.com/papers/gfs-sosp2003.pdf It describes the purpose
>> and use of a chunk server, in the context of distributed cloud storage.
>
> I think we're at a point where we have our own base of knowledge
> and evolved an overall architecture to the point we don't have to
> ape every little detail of Google architecture.
Well, until the wiki has a description of the basic idea of a chunk
server, the Google paper will have to do.
The point is not that we are aping Google, but more to describe the
general concept to someone who does not know what a chunk server is, and
how a chunk server fits into the "grand design."
> In particular I'm
> going to fight hard any talk of Chunk doing its own replication,
> for now at least.
WRT chunkd and replication, yes, that's fine for version 1.0.
But consider which is more likely to have bandwidth to spare:
a) client -> service
or
b) service -> service
Of the two, I'd say "a" is a bit more likely to be remote (WAN) and have
a slow-upload situation like my home cable modem (1 mbps down, 50 kbps
up), and "b" is more likely to be LAN.
Or to take converse logic -- is it likely that service->service
replication is SLOWER than client->service replication?
Every way I look at it, client->{service,service,service} replication
seems both easy... and potentially slower than alternatives :)
Jeff
* Re: HAIL volunteer Rick Peralta
2009-07-29 17:17 ` Jeff Garzik
@ 2009-07-29 17:59 ` Jeff Garzik
2009-07-29 18:19 ` Pete Zaitcev
2009-07-29 18:35 ` Fabian Deutsch
0 siblings, 2 replies; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 17:59 UTC (permalink / raw)
To: Pete Zaitcev; +Cc: Rick Peralta, Project Hail, zaitcev
Jeff Garzik wrote:
> Pete Zaitcev wrote:
>> In particular I'm
>> going to fight hard any talk of Chunk doing its own replication,
>> for now at least.
>
> WRT chunkd and replication, yes, that's fine for version 1.0.
>
> But consider which is more likely to have bandwidth to spare:
>
> a) client -> service
> or
> b) service -> service
>
> Of the two, I'd say "a" is a bit more likely to be remote (WAN) and have
> a slow-upload situation like my home cable modem (1 mbps down, 50 kbps
> up), and "b" is more likely to be LAN.
>
> Or to take converse logic -- is it likely that service->service
> replication is SLOWER than client->service replication?
>
> Every way I look at it, client->{service,service,service} replication
> seems both easy... and potentially slower than alternatives :)
To elaborate a bit more... there obviously are cases where you want the
client to be the genesis of parallel data streams into the cloud.
My point was more that there are real world situations where multiple
outgoing streams from the client are significantly slower than a single
stream into the cloud, plus asking the cloud to perform further copies.
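A back-of-the-envelope Python sketch of that comparison follows. The 50 kbps
uplink is the cable modem figure quoted above; the 1 MB blob, the 1 Gbps LAN
and the three replicas are assumptions chosen only for illustration, and
protocol overhead and latency are ignored.

```python
def transfer_seconds(size_bytes, link_bits_per_s):
    """Time to push size_bytes over a link of the given raw bit rate."""
    return size_bytes * 8 / link_bits_per_s


def client_fanout_seconds(size_bytes, replicas, uplink_bps):
    """Client sends every replica itself; all copies share its uplink."""
    return replicas * transfer_seconds(size_bytes, uplink_bps)


def service_fanout_seconds(size_bytes, replicas, uplink_bps, lan_bps):
    """Client uploads once; the service makes the remaining copies over the
    LAN, pessimistically one after another."""
    upload = transfer_seconds(size_bytes, uplink_bps)
    copies = (replicas - 1) * transfer_seconds(size_bytes, lan_bps)
    return upload + copies


BLOB_BYTES = 1_000_000       # assumed blob size
UPLINK_BPS = 50_000          # 50 kbps upstream
LAN_BPS = 1_000_000_000      # assumed 1 Gbps between services
REPLICAS = 3                 # assumed replica count

from_client = client_fanout_seconds(BLOB_BYTES, REPLICAS, UPLINK_BPS)
via_service = service_fanout_seconds(BLOB_BYTES, REPLICAS, UPLINK_BPS, LAN_BPS)
```

With these assumed numbers the client-side fan-out takes 480 seconds, while
the single upload plus in-cloud copies takes roughly 160 seconds.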
Jeff
* Re: HAIL volunteer Rick Peralta
2009-07-29 17:59 ` Jeff Garzik
@ 2009-07-29 18:19 ` Pete Zaitcev
2009-07-29 18:30 ` Jeff Garzik
2009-07-29 18:35 ` Fabian Deutsch
1 sibling, 1 reply; 10+ messages in thread
From: Pete Zaitcev @ 2009-07-29 18:19 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Rick Peralta, Project Hail, zaitcev
On Wed, 29 Jul 2009 13:59:28 -0400, Jeff Garzik <jeff@garzik.org> wrote:
> My point was more that there are real world situations where multiple
> outgoing streams from the client are significantly slower than a single
> stream into the cloud, plus asking the cloud to perform further copies.
We aren't exposing Chunk to clients outside the cloud, at least at present.
Let's wait and see if applications exist that require it, and how they
compete, say, with a bunch of NFS servers. The biggest difference about
Chunk is its ability to plug into CLD. Drop that and it's not that
special. But you can only use CLD if you are inside the cloud; in fact
it works best when you're inside the same data center with the CLD cell.
So I don't see the bandwidth argument having much weight.
I'm going to remember Fabian's idea of third-party transfers, of course.
That potentially offers a significant reduction of load on tabled.
But mythical outside clients of Chunk are yet to be demonstrated.
-- Pete
* Re: HAIL volunteer Rick Peralta
2009-07-29 18:19 ` Pete Zaitcev
@ 2009-07-29 18:30 ` Jeff Garzik
0 siblings, 0 replies; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 18:30 UTC (permalink / raw)
To: Pete Zaitcev; +Cc: Rick Peralta, Project Hail
Pete Zaitcev wrote:
> On Wed, 29 Jul 2009 13:59:28 -0400, Jeff Garzik <jeff@garzik.org> wrote:
>
>> My point was more that there are real world situations where multiple
>> outgoing streams from the client are significantly slower than a single
>> stream into the cloud, plus asking the cloud to perform further copies.
>
> We aren't exposing Chunk to clients outside the cloud, at least at present.
> Let's wait and see if applications exist that require it, and how they
> compete, say, with a bunch of NFS servers. The biggest difference about
> Chunk is its ability to plug into CLD. Drop that and it's not that
> special. But you can only use CLD if you are inside the cloud; in fact
> it works best when you're inside the same data center with the CLD cell.
> So I don't see the bandwidth argument having much weight.
>
> I'm going to remember Fabian's idea of third-party transfers, of course.
> That potentially offers a significant reduction of load on tabled.
> But mythical outside clients of Chunk are yet to be demonstrated.
Note that "mythical outside clients" is a key design element in NFS
v4.1, which is a parallel distributed filesystem technology. chunkd
clients are quite a bit different from CLD clients.
Similarly, Hadoop DFS, CloudStore and GoogleFS clients [analogously]
talk directly to chunkd rather than going through tabled.
Jeff
* Re: HAIL volunteer Rick Peralta
2009-07-29 17:59 ` Jeff Garzik
2009-07-29 18:19 ` Pete Zaitcev
@ 2009-07-29 18:35 ` Fabian Deutsch
2009-07-29 19:14 ` Jeff Garzik
1 sibling, 1 reply; 10+ messages in thread
From: Fabian Deutsch @ 2009-07-29 18:35 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Pete Zaitcev, Rick Peralta, Project Hail, zaitcev
On Wednesday, 29.07.2009, at 13:59 -0400, Jeff Garzik wrote:
> > Or to take converse logic -- is it likely that service->service
> > replication is SLOWER than client->service replication?
> >
> > Every way I look at it, client->{service,service,service}
> replication
> > seems both easy... and potentially slower than alternatives :)
>
> To elaborate a bit more... there obviously are cases where you want
> the
> client to be the genesis of parallel data streams into the cloud.
>
> My point was more that there are real world situations where multiple
> outgoing streams from the client are significantly slower than a
> single
> stream into the cloud, plus asking the cloud to perform further
> copies.
Yes, I agree that we should have just one stream per blob to one chunkd,
but we might attach replication destinations when streaming this blob to
that chunkd.
The result is that we have just one stream to a chunkd instance,
including some replication destinations, and chunkd will happily spread
the replicas.
So we are just keeping the logic of where to replicate to away from
chunkd, and leaving it to the client (which can ask a third daemon) where
to store the replicas.
dsts[] = logic->getDstsFor(blob)
chunk->put(blob, dsts) /* Will return after successful replication. */
The local in-cloud replication strategy, such as chaining or parallel,
could be passed too, but might not be as relevant as the destinations
themselves.
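The two pseudocode lines above can be fleshed out into a small runnable
Python sketch. Everything here is hypothetical: the Chunkd class is an
in-memory stand-in, the SHA-1 key is an assumption, and get_dsts_for uses a
trivial "next nodes in the pool" policy in place of real placement logic.

```python
import hashlib


class Chunkd:
    """Hypothetical in-memory stand-in for one chunkd instance."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def put(self, blob, dsts=()):
        """Store blob locally, forward it to each destination, then return.

        Returning only after the loop mirrors "will return after
        successful replication".
        """
        key = hashlib.sha1(blob).hexdigest()
        self.blobs[key] = blob
        for dst in dsts:
            dst.put(blob)  # replicas store the blob without forwarding it again
        return key


def get_dsts_for(blob, pool, primary, replicas=2):
    """Placement logic kept outside chunkd.

    blob is unused by this toy policy; a real one might hash it or ask a
    third daemon.
    """
    start = pool.index(primary) + 1
    return [pool[(start + i) % len(pool)] for i in range(replicas)]


pool = [Chunkd(f"node{i}") for i in range(4)]
primary = pool[0]
blob = b"some object data"
dsts = get_dsts_for(blob, pool, primary)
key = primary.put(blob, dsts)  # one stream from the client, copies made in-cloud
holders = [node.name for node in pool if key in node.blobs]
```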
- fabian
* Re: HAIL volunteer Rick Peralta
2009-07-29 18:35 ` Fabian Deutsch
@ 2009-07-29 19:14 ` Jeff Garzik
0 siblings, 0 replies; 10+ messages in thread
From: Jeff Garzik @ 2009-07-29 19:14 UTC (permalink / raw)
To: Fabian Deutsch; +Cc: Pete Zaitcev, Rick Peralta, Project Hail, zaitcev
Fabian Deutsch wrote:
> On Wednesday, 29.07.2009, at 13:59 -0400, Jeff Garzik wrote:
>>> Or to take converse logic -- is it likely that service->service
>>> replication is SLOWER than client->service replication?
>>>
>>> Every way I look at it, client->{service,service,service}
>> replication
>>> seems both easy... and potentially slower than alternatives :)
>> To elaborate a bit more... there obviously are cases where you want
>> the
>> client to be the genesis of parallel data streams into the cloud.
>>
>> My point was more that there are real world situations where multiple
>> outgoing streams from the client are significantly slower than a
>> single
>> stream into the cloud, plus asking the cloud to perform further
>> copies.
>
> Yes, I agree that we should have just one stream per blob to one chunkd,
> but we might attach replication destinations when streaming this blob to
> that chunkd.
> The result is that we have just one stream to a chunkd instance,
> including some replication destinations, and chunkd will happily spread
> the replicas.
> So we are just keeping the logic of where to replicate to away from
> chunkd, and leaving it to the client (which can ask a third daemon) where
> to store the replicas.
I think we all agree on keeping the logic of where to replicate to, away
from chunkd.
chunkd should be as dumb^H^H^Hsimple as possible, to permit maximum
flexibility of chunkd-based applications.
chunkd-based applications will be the ones making chunk load balancing
decisions, for example.
> dsts[] = logic->getDstsFor(blob)
> chunk->put(blob, dsts) /* Will return after successful replication. */
>
> The local in-cloud replication strategy, such as chaining or parallel,
> could be passed too, but might not be as relevant as the destinations
> themselves.
The more I think about this, the more I think this will simply become a
configuration setting of the storage pool[1], i.e. inside tabled or
nfs4d configuration.
That would permit local administrators to make a decision whether
chaining (from the client!) or parallel should be used.
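To sketch what such a setting might drive, here is a small Python example.
The pool_config layout, the ReplicationMode names and plan_transfers are all
hypothetical; no such option exists in tabled or nfs4d configuration today.
It only shows how one setting could select between the two transfer
patterns.

```python
from enum import Enum


class ReplicationMode(Enum):
    CHAIN = "chain"
    PARALLEL = "parallel"


def plan_transfers(mode, client, nodes):
    """Return the (source, destination) hops that place one blob on every node."""
    if mode is ReplicationMode.PARALLEL:
        # The client streams to every node itself.
        return [(client, node) for node in nodes]
    # Chain: the client sends once, then each node forwards to the next.
    sources = [client] + list(nodes[:-1])
    return list(zip(sources, nodes))


# Hypothetical storage pool configuration, as an administrator might set it.
pool_config = {
    "replication": "chain",
    "nodes": ["chunkd-a", "chunkd-b", "chunkd-c"],
}

mode = ReplicationMode(pool_config["replication"])
hops = plan_transfers(mode, "client", pool_config["nodes"])
parallel_hops = plan_transfers(ReplicationMode.PARALLEL, "client", pool_config["nodes"])
```

In chain mode only the first hop leaves the client; in parallel mode every
hop does.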
All of this, it must be noted, is long term discussion.
As of today, chunkd is de facto coded to be parallel-from-client
because that's the only method possible today :)
Jeff
[1] Or perhaps the concept of a storage pool -- a collection of chunkd's
shared by multiple applications -- will have its own configuration.
Another long term discussion for another day...
* Re: HAIL volunteer Rick Peralta
@ 2009-07-31 16:37 Rick Peralta
0 siblings, 0 replies; 10+ messages in thread
From: Rick Peralta @ 2009-07-31 16:37 UTC (permalink / raw)
To: Project Hail; +Cc: Pete Zaitcev, Rick Peralta, jeff
Hi All,
Thanks for inviting me to the forum and thanks to you all for making things happen!
My father said, "don't change anything unless you know why". Those words ring in my ears more and more after decades of system development. It is my intention and hope to respect the wisdom of those words and be clear about what the objectives of any endeavor are (including sloth ;^).
The chunkd effort caught my eye for a variety of reasons. It is functionally very much like something I advocated for a long time ago; it is a relatively simple yet powerful machine; and it may benefit from some redesign for performance (my personal specialty).
The question at hand is: What truly needs to be done? Bugs are bugs and one can debate one solution over another, but in the end it's about getting things to work well. Multithreading the transport layer is probably a good idea, but some diligence should be paid to why. There are any number of other open issues that also deserve some attention. Coding is fine, but understanding what and why seems to be a first step.
In order to have a common basis for evaluation, I'd like to suggest a standard platform to use as the context for discussions: the current implementation of chunkd, running on a standard server (probably with a 32-bit address space), with gigabit Ethernet and a single disk (good for about 25 MB/s and 15 ms seek time). More or different bulk storage, 10 GbE, IB or other high-bandwidth implementations, and so forth can be considered as branches from the core model.
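As a rough check on those figures, the Python sketch below estimates how many such disks it would take to keep a link busy. It assumes decimal units and purely sequential streaming, and it ignores seeks, protocol overhead and file system caching; the function name is made up for illustration.

```python
import math


def disks_to_saturate(link_gbps, disk_mb_per_s):
    """Number of disks, streaming sequentially, needed to match a link's raw rate."""
    link_mb_per_s = link_gbps * 1000 / 8  # Gb/s -> MB/s, decimal units
    return math.ceil(link_mb_per_s / disk_mb_per_s)


DISK_MB_PER_S = 25  # single-disk throughput suggested for the baseline

gige_disks = disks_to_saturate(1, DISK_MB_PER_S)
ten_gbe_disks = disks_to_saturate(10, DISK_MB_PER_S)
```

At 25 MB/s per disk this gives 5 disks for gigabit Ethernet and 50 for 10 GbE, so the single-disk baseline is disk-bound long before the network matters.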
The current implementation of chunkd generally resides in user space, on top of a standard file system (complete with caches, overhead and whatever else comes along).
PZ>
I have a short to-do list for Chunk, after which I don't have
any particular plans:
* Exit if CLD registration fails (maybe!).
* Put ourhost into the CLD record, and the port.
* Use base directory instead of Cell.
* Switch to asprintf for CLD filenames, Geo.
FD>
Yes. I also think that chunkd should not do its own replication, as the
strategy may be domain/application dependent. Therefore I'd appreciate it
if chunkd would provide some kind of "copy(dst,sha)" function, to be able
to directly copy to another chunkd instance.
JG>
Hopefully all this is wrapped up into libcldc...
JG>
* total single-node volume size: one cheap SATA hard drive
* total number of chunks: ==
total number of tabled objects / number of storage nodes
* distribution of chunk sizes: dependent upon the application using tabled
* aggregate bandwidth: dependent upon the application using tabled
fbp>
Might we put some numbers to this?
Most notable is typical chunk size and number of supported clients.
- Rick Peralta
www.linkedin.com/in/rickperalta