public inbox for hail-devel@vger.kernel.org
 help / color / mirror / Atom feed
* Zookeeper instead of CLD in Hail
@ 2010-06-05  3:27 Pete Zaitcev
  2010-06-07 13:21 ` Jeff Darcy
  2010-06-08  2:20 ` Jeff Garzik
  0 siblings, 2 replies; 8+ messages in thread
From: Pete Zaitcev @ 2010-06-05  3:27 UTC (permalink / raw)
  To: hail-devel; +Cc: zaitcev

Hi, guys:

I spent a few days playing with Zookeeper, with an eye on replacing
CLD with it. The short recommendation: don't do it, at least for now,
but reconsider if any sister services make a good use of it (e.g.
if MRG/DC image store does).

The easiest way to replace CLD would be to use Zookeeper as if it
were CLD, so I wrote a test that locked a file like cldu.c does now.
It was not too bad, but I learned two things:
 - what exactly Garzik was saying about "different focus" in Q&As
   after his presentations, and
 - locking anything is a really retarded thing to do in Zookeeper.

About the focus, ZK is just like CLD from a certain angle (it has
the good old files and provides a set of un-posixy operations
on them: watches, uniques, "ephemerals"), but it's also entirely
unlike CLD (e.g. no locks in the protocol). CLD's model is that
clients are daemons, each of which reads a few of its files, maybe
locks one or two at boot, and then nothing happens except keepalives.
Zookeeper's model... honestly I don't know what it is because it's
never explained concisely, but the docs that I saw seem to imply
huge numbers of clients all doing random ops all the time on the
same files, enough to cause a herd concerns. It looks like Yahoo
may be using Zookeeper as a lease manager or something. Crazy.

I heard people say they cribbed from the same Chubby paper, but
it's bollocks. It's absolutely nothing like what Chubby implies.
No locks for one thing. To be sure, Zookeeper provides a canned
piece of code which implements locks, kinda like you can implement
compare-and-swap using Dekker's algorithm on a CPU that doesn't
have it. The canned lock creates "sequenced" files (using a ZK
server call that creates unique filenames), then sets some
"watches" (same as CLD offers), then re-reads the directory to
find the lowest number sequential file, which is the winner of
the lock. Haha, only serious. I tested it, it works, but ewwwww.

They clearly want daemons to approach the whole problem in a
different way. For example, there's a similar canned recipy to
identify a "leader" client.

Overall, ZK seems like a mature, if quirky system. Quirky means that
I made my client OOM hard by using wrong compilation options, and it
took me a while to figure it out (PROTIP: do not use "single-threaded"
mode in Zookeeper, it is not loved and canned recipies may plain not
work with it). There were some other weird stories. But it definitely
works. Unfortunately, with the latest fix for the timer CLD works too:
I've not seen a server crash in a couple of months. So I do not see
an upside for us to switch at this point, and I have better things
to do than learning Zookeeper ropes for weeks.

BTW, Zookeeper is not packaged in Fedora. You have to install it
by hand. Thank heavens for /usr/local.

Dunno what their community is like. I'm going to send a trivial
patch to them and see what happens.

-- Pete

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Zookeeper instead of CLD in Hail
  2010-06-05  3:27 Zookeeper instead of CLD in Hail Pete Zaitcev
@ 2010-06-07 13:21 ` Jeff Darcy
  2010-06-07 18:15   ` Pete Zaitcev
  2010-06-08  2:32   ` Jeff Garzik
  2010-06-08  2:20 ` Jeff Garzik
  1 sibling, 2 replies; 8+ messages in thread
From: Jeff Darcy @ 2010-06-07 13:21 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: hail-devel

On 06/04/2010 11:27 PM, Pete Zaitcev wrote:
> About the focus, ZK is just like CLD from a certain angle (it has
> the good old files and provides a set of un-posixy operations
> on them: watches, uniques, "ephemerals"), but it's also entirely
> unlike CLD (e.g. no locks in the protocol). CLD's model is that
> clients are daemons, each of which reads a few of its files, maybe
> locks one or two at boot, and then nothing happens except keepalives.
> Zookeeper's model... honestly I don't know what it is because it's
> never explained concisely, but the docs that I saw seem to imply
> huge numbers of clients all doing random ops all the time on the
> same files, enough to cause a herd concerns. It looks like Yahoo
> may be using Zookeeper as a lease manager or something. Crazy.

According to the Chubby paper (available at e.g.
http://labs.google.com/papers/chubby.html) Chubby itself is mostly used
as a naming service.  They even say explicitly that few clients hold
locks (page 11, left column, under "use and behavior").

> I heard people say they cribbed from the same Chubby paper, but
> it's bollocks. It's absolutely nothing like what Chubby implies.
> No locks for one thing.

I guess it could be argued that locks are the essential feature of
Chubby - both the name of the service and the title of the paper refer
to it - but I think your dismissal of ZK and Chubby as unrelated is
itself bollocks.  Watches and ephemerals within a hierarchical namespace
of small pseudo-files are also central to how Chubby is actually used
within Google, even without locking.  ZK copied those features, along
with many of the underlying algorithms and protocols.  CLD copied a
slightly different set.  There's still a lot of overlap, even if the one
feature that has been overused in Hail is absent (by design) in ZK.

> They clearly want daemons to approach the whole problem in a
> different way.

That's very definitely true.  The essential feature here is not
synchronization but consensus - not control but data, not "you can't be
the leader" but "the leader is X".  If what someone needs is locking,
and it needs to be done frequently, then I'd say ZK is the wrong tool
but I'd also say that such an application is pretty fundamentally
non-scalable in ways that have little to do with Chubby/ZK/CLD.  People
do use ZK for locking, quite successfully, but generally in situations
where locking is rare enough to make the inefficiency of layered locking
protocols a non-issue (though it still seems they could do better than
the protocol you describe).

> So I do not see
> an upside for us to switch at this point, and I have better things
> to do than learning Zookeeper ropes for weeks.

That's a fair enough evaluation.  I think I'm the one who might have
suggested investigating ZK, but the reasons for that are largely
community-related and not technical.  If the technical barriers are too
high, then it's probably not worth the bother.  That does, however,
leave the door open for someone else to provide more scalable solutions
for the same problems by using more appropriate primitives.  If the goal
of Hail is to provide building blocks for people to use in constructing
their own higher-level service, then it seems to me that the building
blocks should be of a familiar shape and the adoption of ZK is about
100x that of CLD for similar purposes.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Zookeeper instead of CLD in Hail
  2010-06-07 13:21 ` Jeff Darcy
@ 2010-06-07 18:15   ` Pete Zaitcev
  2010-06-08  2:32   ` Jeff Garzik
  1 sibling, 0 replies; 8+ messages in thread
From: Pete Zaitcev @ 2010-06-07 18:15 UTC (permalink / raw)
  To: Jeff Darcy; +Cc: hail-devel

On Mon, 07 Jun 2010 09:21:07 -0400
Jeff Darcy <jdarcy@redhat.com> wrote:

> > So I do not see
> > an upside for us to switch at this point, and I have better things
> > to do than learning Zookeeper ropes for weeks.
> 
> That's a fair enough evaluation.  I think I'm the one who might have
> suggested investigating ZK, but the reasons for that are largely
> community-related and not technical.  If the technical barriers are too
> high, then it's probably not worth the bother. []

I am hoping someone else could do it. Although, "it" includes not
just throwing out cldu.c once, but actively maintaining the replacement.
Ideally even maintaining an RPM for ZK in Fedora.

-- Pete

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Zookeeper instead of CLD in Hail
  2010-06-05  3:27 Zookeeper instead of CLD in Hail Pete Zaitcev
  2010-06-07 13:21 ` Jeff Darcy
@ 2010-06-08  2:20 ` Jeff Garzik
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff Garzik @ 2010-06-08  2:20 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: hail-devel

On 06/04/2010 11:27 PM, Pete Zaitcev wrote:
> I heard people say they cribbed from the same Chubby paper, but
> it's bollocks. It's absolutely nothing like what Chubby implies.
> No locks for one thing. To be sure, Zookeeper provides a canned
> piece of code which implements locks, kinda like you can implement
> compare-and-swap using Dekker's algorithm on a CPU that doesn't
> have it. The canned lock creates "sequenced" files (using a ZK
> server call that creates unique filenames), then sets some
> "watches" (same as CLD offers), then re-reads the directory to
> find the lowest number sequential file, which is the winner of
> the lock. Haha, only serious. I tested it, it works, but ewwwww.

Yeah, the main similarity is...  both ZK and CLD offer some type of 
filesystem (with all that implies).  ZK is IMO not much like Chubby at 
all, in terms of focus / design goals.

	Jeff


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Zookeeper instead of CLD in Hail
  2010-06-07 13:21 ` Jeff Darcy
  2010-06-07 18:15   ` Pete Zaitcev
@ 2010-06-08  2:32   ` Jeff Garzik
  2010-06-08 15:07     ` Jeff Darcy
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff Garzik @ 2010-06-08  2:32 UTC (permalink / raw)
  To: Jeff Darcy; +Cc: Pete Zaitcev, hail-devel

On 06/07/2010 09:21 AM, Jeff Darcy wrote:
> On 06/04/2010 11:27 PM, Pete Zaitcev wrote:
>> About the focus, ZK is just like CLD from a certain angle (it has
>> the good old files and provides a set of un-posixy operations
>> on them: watches, uniques, "ephemerals"), but it's also entirely
>> unlike CLD (e.g. no locks in the protocol). CLD's model is that
>> clients are daemons, each of which reads a few of its files, maybe
>> locks one or two at boot, and then nothing happens except keepalives.
>> Zookeeper's model... honestly I don't know what it is because it's
>> never explained concisely, but the docs that I saw seem to imply
>> huge numbers of clients all doing random ops all the time on the
>> same files, enough to cause a herd concerns. It looks like Yahoo
>> may be using Zookeeper as a lease manager or something. Crazy.
>
> According to the Chubby paper (available at e.g.
> http://labs.google.com/papers/chubby.html) Chubby itself is mostly used
> as a naming service.  They even say explicitly that few clients hold
> locks (page 11, left column, under "use and behavior").
>
>> I heard people say they cribbed from the same Chubby paper, but
>> it's bollocks. It's absolutely nothing like what Chubby implies.
>> No locks for one thing.
>
> I guess it could be argued that locks are the essential feature of
> Chubby - both the name of the service and the title of the paper refer
> to it - but I think your dismissal of ZK and Chubby as unrelated is
> itself bollocks.  Watches and ephemerals within a hierarchical namespace
> of small pseudo-files are also central to how Chubby is actually used
> within Google, even without locking.  ZK copied those features, along
> with many of the underlying algorithms and protocols.  CLD copied a
> slightly different set.  There's still a lot of overlap, even if the one
> feature that has been overused in Hail is absent (by design) in ZK.

I think you're overselling that angle a bit.  Google discourages use of 
Chubby as a strict publish-subscribe mechanism, so watches and 
ephemerals aren't the bread-and-butter of Chubby necessarily.  A lot of 
the supposed commonality comes simply from the attribute of being a 
centralized respository of data for autonomous cloud systems -- a 
shared, highly reliable filesystem -- not any particular attribute 
related to watches or ephemerals.


>> So I do not see
>> an upside for us to switch at this point, and I have better things
>> to do than learning Zookeeper ropes for weeks.
>
> That's a fair enough evaluation.  I think I'm the one who might have
> suggested investigating ZK, but the reasons for that are largely
> community-related and not technical.  If the technical barriers are too
> high, then it's probably not worth the bother.  That does, however,
> leave the door open for someone else to provide more scalable solutions
> for the same problems by using more appropriate primitives.  If the goal
> of Hail is to provide building blocks for people to use in constructing
> their own higher-level service, then it seems to me that the building
> blocks should be of a familiar shape and the adoption of ZK is about
> 100x that of CLD for similar purposes.

Not sure what you mean by familiar shape?  ZK design is too haphazard, 
and developers IMO have an easier time grasping CLD's fundamental 
FS-like API.  The implementation is also 3x or more in terms of size, 
compared to CLD.

I'd rather get the basics right first, then pursue a popularity contest :)

	Jeff



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Zookeeper instead of CLD in Hail
  2010-06-08  2:32   ` Jeff Garzik
@ 2010-06-08 15:07     ` Jeff Darcy
  2010-06-09  4:49       ` Colin McCabe
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Darcy @ 2010-06-08 15:07 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Pete Zaitcev, hail-devel

On 06/07/2010 10:32 PM, Jeff Garzik wrote:
> I think you're overselling that angle a bit.  Google discourages use of 
> Chubby as a strict publish-subscribe mechanism, so watches and 
> ephemerals aren't the bread-and-butter of Chubby necessarily.  A lot of 
> the supposed commonality comes simply from the attribute of being a 
> centralized respository of data for autonomous cloud systems -- a 
> shared, highly reliable filesystem -- not any particular attribute 
> related to watches or ephemerals.

Nonetheless, those are features both share, and many might argue that
they're preferable to locks.  Locking is a fundamentally lousy way to
build scalable and reliable distributed systems, as has been well known
for more than a decade.  That's why databases have trended toward MVCC
or AP/EC, why programming models have migrated toward async queues and
STM or actors, etc.  If using Chubby or its derivatives as a pub/sub hub
is discouraged, then using it as a DLM should be outright condemned.
Somebody who had followed the recommended programming model(s) with
Chubby would have little trouble transitioning to ZK.  That's what I
meant by "familiar shape" - that people who've already learned how to
write distributed applications could still apply that hard-won knowledge
with Hail components.

> Not sure what you mean by familiar shape?  ZK design is too haphazard, 
> and developers IMO have an easier time grasping CLD's fundamental 
> FS-like API.

Only in the sense that shared-state locking can be implemented in
trivial systems more readily than other concurrency models.
Unfortunately, as those systems scale beyond the trivial, it usually
becomes much harder (if not impossible) to keep them working and
performing well.  Rewrites from locking to other models are common in
this space.  People who haven't yet learned these lessons shouldn't be
our target audience, and enabling their mistakes shouldn't be a goal.

> The implementation is also 3x or more in terms of size, 
> compared to CLD.

Small code size is not, in and of itself, a virtue.  It can mean more
efficient implementation, but it can also mean reduced functionality or
unaddressed concerns (especially wrt error conditions or observability).
 I'm sure I could write something vaguely like Chubby that's even
smaller than CLD, but it would be absurd to claim that it must therefore
be better.

> I'd rather get the basics right first, then pursue a popularity contest :)

Popularity is not a goal, but it can be a means.  The world doesn't need
another Walrus, but it doesn't need another SML/Haskell languishing in
obscurity either.  The open-source landscape is littered with
developers' special little flowers that withered away because nobody
ever made them even slightly interesting to a wider audience.  No
attempt to provide building blocks for distributed systems can succeed
while remaining ignorant of (or antagonistic toward) established ways of
building such systems.  Look at how much developers fawn over projects
like Tokyo or Redis, which are fundamentally non-scalable.  Look at how
they use systems like memcached - in production, successfully - despite
it having a clearly ill-thought-out approach to consistency.  If Hail is
supposed to be about building blocks, then those should be about the
building blocks that people need, not the building blocks that are easy
or fun to create.  If somebody who already had ZK running could point
chunkd/tabled at that instead of having to download/install/configure
CLD (and fight with their IT folks about the DNS dependency), then they
might actually be more likely to use chunkd/tabled instead of something
even more broken.  Some few of them might even start to contribute.
Surely that would be a good thing.  If self-imposed technical obstacles
still mean it's too much effort then fine, but I don't think we need
more NIH syndrome.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Zookeeper instead of CLD in Hail
  2010-06-08 15:07     ` Jeff Darcy
@ 2010-06-09  4:49       ` Colin McCabe
  2010-06-09 14:35         ` Jeff Darcy
  0 siblings, 1 reply; 8+ messages in thread
From: Colin McCabe @ 2010-06-09  4:49 UTC (permalink / raw)
  To: Jeff Darcy; +Cc: Jeff Garzik, Pete Zaitcev, hail-devel

Well, Zookeeper is written in Java. Presumably it requires you to put
a JVM on each node. Unfortunately the JVM has kind of a large memory
footprint. That's great if your software is written in Java. If your
company didn't go that route, ZK doesn't seem like such a great
option.

On Tue, Jun 8, 2010 at 8:07 AM, Jeff Darcy <jdarcy@redhat.com> wrote:
> On 06/07/2010 10:32 PM, Jeff Garzik wrote:
>> I think you're overselling that angle a bit.  Google discourages use of
>> Chubby as a strict publish-subscribe mechanism, so watches and
>> ephemerals aren't the bread-and-butter of Chubby necessarily.

The Chubby paper specifically calls out using Chubby for publish /
subscribe as "abusive behavior." But ZooKeeper "is used at Yahoo! as
the coordination and failure recovery service for Yahoo! Message
Broker, which is a highly scalable publish-subscribe system" according
to hadoop.apache.org.

Although they might be equivalent in some theoretical computer science
sense, I get the impression that the two systems are very different
beasts...

>> A lot of
>> the supposed commonality comes simply from the attribute of being a
>> centralized respository of data for autonomous cloud systems -- a
>> shared, highly reliable filesystem -- not any particular attribute
>> related to watches or ephemerals.
>
> Nonetheless, those are features both share, and many might argue that
> they're preferable to locks.  Locking is a fundamentally lousy way to
> build scalable and reliable distributed systems, as has been well known
> for more than a decade.

I think *fine-grained* locking is a fundamentally lousy way to build
distributed systems. I haven't heard anyone argue that coarse-grained
locking is bad.
Have you read any interesting papers or books about this topic?

> If somebody who already had ZK running could point
> chunkd/tabled at that instead of having to download/install/configure
> CLD (and fight with their IT folks about the DNS dependency), then they
> might actually be more likely to use chunkd/tabled instead of something
> even more broken.  Some few of them might even start to contribute.
> Surely that would be a good thing.  If self-imposed technical obstacles
> still mean it's too much effort then fine, but I don't think we need
> more NIH syndrome.

It would be cool to have support another backend, I guess.

However... It seems to me that once you make the decision to use
ZooKeeper and Java, Walrus or Hadoop is probably a more practical
choice for the upper layers.
I've never deployed either though, so I could be wrong.

Colin McCabe

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Zookeeper instead of CLD in Hail
  2010-06-09  4:49       ` Colin McCabe
@ 2010-06-09 14:35         ` Jeff Darcy
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Darcy @ 2010-06-09 14:35 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Jeff Garzik, Pete Zaitcev, hail-devel

On 06/09/2010 12:49 AM, Colin McCabe wrote:
> Well, Zookeeper is written in Java. Presumably it requires you to put
> a JVM on each node. Unfortunately the JVM has kind of a large memory
> footprint. That's great if your software is written in Java. If your
> company didn't go that route, ZK doesn't seem like such a great
> option.

I'm no fan of the Java "ecosystem" myself, and personally I do consider
that a drawback, but the fact remains that most of the people building
these sorts of applications are using Java or other VM-based languages
such as Ruby or Erlang and many do already have ZK running.  Whether
those are good choices or bad choices, they're common choices.

> The Chubby paper specifically calls out using Chubby for publish /
> subscribe as "abusive behavior." But ZooKeeper "is used at Yahoo! as
> the coordination and failure recovery service for Yahoo! Message
> Broker, which is a highly scalable publish-subscribe system" according
> to hadoop.apache.org.
> 
> Although they might be equivalent in some theoretical computer science
> sense, I get the impression that the two systems are very different
> beasts...

They're different, certainly, and CLD is different again, but they're
also all related.  It's also worth pointing out that YMB's use of ZK can
be considered proof that both ZK's semantics and its implementation are
sufficient for experienced domain experts to create production-level
systems (even at Yahoo's scale).  So far, there's no similar proof point
for CLD's semantics or implementation.

>> Nonetheless, those are features both share, and many might argue that
>> they're preferable to locks.  Locking is a fundamentally lousy way to
>> build scalable and reliable distributed systems, as has been well known
>> for more than a decade.
> 
> I think *fine-grained* locking is a fundamentally lousy way to build
> distributed systems. I haven't heard anyone argue that coarse-grained
> locking is bad.
> Have you read any interesting papers or books about this topic?

I think first we'd have to agree on a distinction between fine-grained
and coarse-grained.  ;)  The issue here is that you can do locking in ZK
quite easily.  It's just not very efficient, but you don't need it to be
efficient if the locks are coarse-grained anyway so the criticism about
ZK not having locks becomes quite meaningless.  Ephemeral nodes can be
used to synthesize the only kind of locks applications should be using,
and can also be used in other ways.  If you can do everything (that you
should be doing) in X that you can do in Y but not vice versa, and Y is
not provably faster or more reliable, then most developers would rightly
prefer X.

> However... It seems to me that once you make the decision to use
> ZooKeeper and Java, Walrus or Hadoop is probably a more practical
> choice for the upper layers.

Well, Hadoop is an entirely different kind of beast, and it might even
be worth exploring how Hail components might be used as Hadoop/MR data
sources or sinks.  Walrus is pretty directly comparable to tabled, but
mostly comes off worse in the comparison.  For one thing, it's even less
scalable.  A Java-based program would mostly not care about the
differences between tabled, Walrus, ParkPlace or Amazon S3 since they're
all behind the same HTTP-based protocol anyway.  (I have code that I
regularly test against tabled and Amazon without change.)  Where the
difference really becomes noticeable is not on the development side but
on the deployment side.  Installing chunkd and tabled now always always
always requires installing a third component as well, and since it's
supposed to be a highly available service that means care must be taken
to deploy it on physically separate machines etc. to avoid correlated
failures.  In a "green field" deployment of tabled/ZK there'd still be a
third component to install, but at least there's no DNS wart so there'd
be no need to negotiate with the people (often a separate group) who
control the local DNS.  More importantly, in environments where ZK has
already been deployed - and they're quite common - there'd be no need
for a third component, and no need to re-do all of that planning or
configuration.  Back on the development side, there'd also be no need to
deal with situations where only one of ZK and CLD had failed - and, no
matter how good we all think we are at developing and testing our code,
no responsible app developer can rule out such failures.

That said, Pete has rightly pointed out that switching chunkd/tabled to
ZK would require significant effort.  While I do believe there are
benefits, it's not clear they're great enough to justify that effort.
There are definitely other things - e.g. scalability/authentication
improvements, more testing - that even I would consider higher priorities.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-06-09 14:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-06-05  3:27 Zookeeper instead of CLD in Hail Pete Zaitcev
2010-06-07 13:21 ` Jeff Darcy
2010-06-07 18:15   ` Pete Zaitcev
2010-06-08  2:32   ` Jeff Garzik
2010-06-08 15:07     ` Jeff Darcy
2010-06-09  4:49       ` Colin McCabe
2010-06-09 14:35         ` Jeff Darcy
2010-06-08  2:20 ` Jeff Garzik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox