chunkd design genesis, storage tech, and support for multiple key/value tables

From: Jeff Garzik <jeff@garzik.org>
To: Pete Zaitcev <zaitcev@redhat.com>
Cc: hail-devel@vger.kernel.org
Subject: chunkd design genesis, storage tech, and support for multiple key/value tables
Date: Tue, 10 Nov 2009 15:46:27 -0500
Message-ID: <4AF9D123.4090908@garzik.org>
In-Reply-To: <4AF9C3F3.8050508@garzik.org>

You wrote this insightful and pointed comment on IRC...
> Comparing with "every k/v service out there" assumes that you're
> growing a generic key/value service out of Chunk. You're essentially
> admitting it openly.

This is an excellent point to raise.  So let me "begin at the 
beginning", cover the chunkd design thought process, and hope to explain 
how this matches up.

Let us consider storage technology, at the level I'm used to:  ATA, 
SCSI, and nbd protocols.

For decades, storage has been a run of fixed-length records (sectors and 
blocks), with the following API:

	key = offset + data length
		<-- "key" is minimum amount of data required to
		    uniquely describe a run of data
	PUT key, data
	data = GET key
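
To make the sector model concrete, here is a minimal Python sketch of
such a store.  It is a toy for illustration only, not ATA, SCSI, or nbd
code: the BlockDevice class, its method names, and the 512-byte sector
size are all assumptions.

```python
SECTOR_SIZE = 512  # assumed sector size, for illustration only


class BlockDevice:
    """Toy fixed-length-record store: the key is (offset, length)."""

    def __init__(self, n_sectors):
        # The whole device is one flat run of zero-filled sectors.
        self.data = bytearray(n_sectors * SECTOR_SIZE)

    def put(self, offset, payload):
        # Writes must cover whole sectors; there is no notion of an object.
        if offset < 0 or len(payload) % SECTOR_SIZE != 0:
            raise ValueError("offset/payload must be sector-aligned")
        start = offset * SECTOR_SIZE
        if start + len(payload) > len(self.data):
            raise ValueError("write past end of device")
        self.data[start:start + len(payload)] = payload

    def get(self, offset, n_sectors):
        # The caller must remember both where the data is and how long it is.
        if offset < 0 or n_sectors < 0:
            raise ValueError("offset/length must be non-negative")
        start = offset * SECTOR_SIZE
        end = start + n_sectors * SECTOR_SIZE
        if end > len(self.data):
            raise ValueError("read past end of device")
        return bytes(self.data[start:end])
```

Note that the caller, not the device, has to track which sectors belong
to which piece of application data.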

Now the world has figured out that giving a storage device the 
flexibility to manage data at per-object granularity simplifies 
applications, and gives the underlying storage more room to optimize.  
Thus was born the object-based storage device (SCSI OSD), with the API

	key = 64-bit object id
	PUT key, data, data length
	data, data length = GET key
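
The object model can be sketched the same way.  Again this is a toy,
not the SCSI OSD command set: the ObjectStore class and its method
names are hypothetical, and only the 64-bit object id and the
variable-length data come from the API outlined above.

```python
class ObjectStore:
    """Toy object store keyed by a 64-bit object id."""

    MAX_ID = 2**64 - 1

    def __init__(self):
        self.objects = {}

    def put(self, object_id, data):
        # The id is an opaque 64-bit number; the data may be any length.
        if not 0 <= object_id <= self.MAX_ID:
            raise ValueError("object id must fit in 64 bits")
        self.objects[object_id] = bytes(data)

    def get(self, object_id):
        # The store, not the caller, remembers how long each object is.
        data = self.objects[object_id]
        return data, len(data)
```

The difference from the sector model is that the length travels with
the object, so the application no longer tracks extents itself.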

A key design decision of Project Hail was to follow this object-based 
storage model, when considering the two alternatives:

1) Build cloud apps on top of multiple block devices.  My conclusion: 
this is undesirable for the same reason why sector-based storage is 
undesirable:  applications want more granularity, and with sector-based 
systems, must build their own filesystem-like data structures just to 
keep their own objects separated from one another.

2) Build cloud apps on top of filesystems.  I think(?) GlusterFS is 
taking this route.  This approach is workable, but may create a lot of 
unnecessary overhead.  Filesystem protocols are much more complicated 
than storage protocols, in particular.

Object-based storage devices sit in the middle:  not as complex as 
filesystems, but more useful than sector-based storage.

chunkd is thus designed to be a simple, straightforward, easy-to-use 
replacement for SCSI OSD, which has already been proven useful in 
distributed storage (Lustre, pNFS).

That is why chunkd originally used fixed-length hexadecimal keys:  It 
was modelled on the SCSI OSD object id.  However, it quickly became 
evident in practice that EVERY chunkd application would create its own 
scheme to map internal_object_id to chunkd_object_id.

Thus, moving to generic key/value storage actually simplified 
applications, by eliminating that mapping.
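
A rough Python sketch of the difference, under stated assumptions:
neither class is chunkd code, the 16-character key length is invented
for the example, and to_fixed_key() stands in for whatever mapping
scheme a given application might have written.

```python
import hashlib


class FixedKeyStore:
    """Toy store that accepts only fixed-length hexadecimal keys."""

    KEY_LEN = 16  # assumed length, for illustration only

    def __init__(self):
        self._data = {}

    def _check(self, key):
        if len(key) != self.KEY_LEN or any(c not in "0123456789abcdef" for c in key):
            raise ValueError("key must be %d lowercase hex digits" % self.KEY_LEN)

    def put(self, key, data):
        self._check(key)
        self._data[key] = bytes(data)

    def get(self, key):
        self._check(key)
        return self._data[key]


def to_fixed_key(internal_id):
    # Hypothetical per-application mapping layer.  Truncating a hash can
    # collide, which is one more thing each application had to think about.
    digest = hashlib.sha256(internal_id.encode("utf-8")).hexdigest()
    return digest[:FixedKeyStore.KEY_LEN]


class GenericKeyStore:
    """Toy store that accepts arbitrary byte-string keys."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[bytes(key)] = bytes(data)

    def get(self, key):
        return self._data[bytes(key)]
```

With the fixed-key store the application calls
put(to_fixed_key("user/42/avatar"), ...); with the generic store it
passes its own identifier directly and the mapping function disappears.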

However, one glaring difference from SCSI OSD was chunkd's lack of 
administrative partitions.  SCSI OSDs provide "partitions" within each 
logical unit (LUN), each of which contains a set of objects within a single 
object id namespace.  Therefore, if you consider SCSI OSD object id as 
the key, then SCSI OSD definitely has multiple key/value tables.

As you pointed out on IRC, it is possible to create administrative 
partitioning by running multiple chunkd instances.

But I think the Real World(tm) has shown that in-protocol partitioning 
of object namespace is the way to go.  Being able to create and destroy 
partitions within the protocol, on-demand, has a lot of value.

So, just as SCSI OSD has

	[ target + logical unit + ] partition + object

with chunkd we can have

	[ host + port + ] table + object
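
As a minimal sketch of what in-protocol tables buy you, assuming a
hypothetical TableStore class (this is not the chunkd wire protocol or
its client API; every name here is made up for illustration):

```python
class TableStore:
    """Toy key/value store with multiple tables on one server instance."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name):
        # Tables are created on demand, without starting another server.
        if name in self.tables:
            raise KeyError("table already exists: %r" % (name,))
        self.tables[name] = {}

    def destroy_table(self, name):
        # Dropping a table discards every object inside it.
        del self.tables[name]

    def put(self, table, key, data):
        self.tables[table][key] = bytes(data)

    def get(self, table, key):
        return self.tables[table][key]


store = TableStore()
store.create_table("photos")
store.create_table("logs")
# The same key names a different object in each table.
store.put("photos", b"0001", b"jpeg bytes")
store.put("logs", b"0001", b"log line")
```

Each table is its own key namespace, so two applications sharing one
instance cannot collide on keys, and destroying a table is a single
operation rather than tearing down a daemon.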

Amazon S3 has buckets.  Pretty much every protocol in production tends 
to have some sort of administrative separation ability.

	Jeff
