From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeff Garzik <jeff@garzik.org>
Subject: chunkd design notes (was Re: HAIL volunteer Rick Peralta)
Date: Fri, 31 Jul 2009 17:03:23 -0400
Message-ID: <4A735C1B.1010002@garzik.org>
References: <16237637.1249058268350.JavaMail.root@mswamui-billy.atl.sa.earthlink.net>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path: <hail-devel-owner@vger.kernel.org>
In-Reply-To: <16237637.1249058268350.JavaMail.root@mswamui-billy.atl.sa.earthlink.net>
Sender: hail-devel-owner@vger.kernel.org
List-ID: <hail-devel.vger.kernel.org>
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Rick Peralta <fbp@tiac.net>
Cc: Project Hail <hail-devel@vger.kernel.org>, Pete Zaitcev <zaitcev@redhat.com>

Rick Peralta wrote:
> Hi All,
> 
> Thanks for inviting me to the forum and thanks to you all for making things happen!
> 
> My father said, "don't change anything unless you know why".  Those words ring in my ears more and more after decades of System development.  It is my intention and hope to respect the wisdom of those words and be clear about what the objectives of any endeavor are (including sloth ;^).

Yes, that is pretty much the Linux mantra :)

Well, that, along with "do what you must, and no more" (meaning: don't 
try to predict the future, don't over-design).


> The chunkd effort caught my eye for a variety of reasons.  It is functionally very much like something I advocated for a long time ago; it is a relatively simple, yet powerful machine; and it may benefit from some redesign for performance (my personal specialty).
> 
> The question at hand is: What truly needs to be done?  Bugs are bugs and one can debate one solution over another, but in the end it's about getting things to work well.  Multithreading the transport layer is probably a good idea, but some diligence should be paid to why.  There are any number of other open issues that also deserve some attention.  Coding is fine, but understanding what and why seems to be a first step.

The current (version 1.0) design goals for chunkd are:

* multiple worker threads, because I/O parallelism

	- is the only way to max out storage hardware command and
	  completion queues
	- enables greater optimizations on a TCQ/NCQ-enabled storage
	  device, compared to slower command-at-a-time solutions

* no internal data caching; leverage kernel pagecache

* use POSIX filesystem API for our "database"; avoid sql, db4, sqlite, etc.
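
To illustrate the first goal, here is a minimal sketch of worker threads
keeping several reads in flight against one file.  It is written in
Python for brevity and is not chunkd code: read_chunk, parallel_read,
demo_parallel_read and the pool size are made-up names and values, and
os.pread is assumed to be available (POSIX systems).

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunk(fd, offset, length):
    # os.pread takes an explicit offset, so threads can share one
    # descriptor without racing on a shared file position.
    return os.pread(fd, length, offset)


def parallel_read(path, requests, n_workers=4):
    """Hypothetical helper: issue (offset, length) reads from a pool."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # All requests are submitted before any result is awaited,
            # so several reads can be outstanding at the same time.
            futures = [pool.submit(read_chunk, fd, off, ln)
                       for off, ln in requests]
            return [f.result() for f in futures]
    finally:
        os.close(fd)


def demo_parallel_read():
    # Build a 4096-byte scratch file, then read three small ranges.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 16)
        path = tmp.name
    try:
        return parallel_read(path, [(0, 4), (256, 4), (4092, 4)])
    finally:
        os.unlink(path)
```

Whether the extra outstanding requests actually help depends on the
device and the kernel I/O scheduler; the sketch only shows how a pool
of workers gets more than one command queued at a time.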

But I am very open to other design requirements or suggestions.  Speak 
up!  :)
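
As a rough sketch of the "POSIX filesystem as database" and "no internal
caching" goals above: one file per object, written to a temporary name
and renamed into place, with reads going straight to the file so the
kernel pagecache does the caching.  The hashed file name and the
two-character fan-out directory are assumptions for illustration, not
chunkd's on-disk layout, and all function names are made up.

```python
import hashlib
import os
import tempfile


def object_path(base_dir, key):
    # Hash the key so arbitrary key bytes map to a safe file name.
    # (Assumed layout: <base>/<first 2 hex chars>/<full hex digest>.)
    digest = hashlib.sha1(key).hexdigest()
    return os.path.join(base_dir, digest[:2], digest)


def put_object(base_dir, key, data):
    path = object_path(base_dir, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # push the data to stable storage
        os.replace(tmp_path, path)    # atomic within one POSIX filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


def get_object(base_dir, key):
    # No user-space cache: every read goes through the kernel.
    with open(object_path(base_dir, key), "rb") as f:
        return f.read()


def demo_store():
    with tempfile.TemporaryDirectory() as base:
        put_object(base, b"some-key", b"hello chunk")
        return get_object(base, b"some-key")
```

A real server would also need to fsync the directory and handle
partial writes and deletes; those are left out here.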


> In order to have a common basis for evaluation I'd like to suggest a standard platform to consider in the context of discussions: the current implementation of chunkd, running on a standard server (probably with a 32-bit address space), with gigabit Ethernet and a single disk (good for about 25 MB/s & 15 ms seek time).  More or different bulk storage, 10 GbE, IB or other high-bandwidth implementations and so forth can be considered as branches from the core model.

Yeah, the "standard platform model" is generally a "1U data center 
server", which probably equates to a physical or virtualized instance 
of: single multi-core CPU, 2-4GB RAM, gige, single ATA disk.

That model lends itself to deployments of thousands of such chunkd 
storage nodes.

But it is also a valid minority model to have a handful of _huge_ chunkd 
nodes, perhaps tied to 10gige and SAN networks.


> Given the current implementation of chunkd, it generally resides in user space, over a standard file system (complete with caches, overhead and whatever else comes along).

Correct.


> PZ>
> I have a short to-do list for Chunk, after which I don't have
> any particular plans:
>  * Exit if CLD registration fails (maybe!).
>  * Put ourhost into the CLD record, and the port.
>  * Use base directory instead of Cell.
>  * Switch to asprintf for CLD filenames, Geo.
> 
> FD>
> Yes. I also think that chunkd should not do its own replication, as the
> strategy may be domain/application dependent. Therefore I'd appreciate it if
> chunkd would provide some kind of "copy(dst,sha)" function, to be able
> to copy directly to another chunkd instance.
> 
> JG>
> Hopefully all this is wrapped up into libcldc...
> 
> JG>
> * total single-node volume size:  one cheap SATA hard drive
> * total number of chunks:  ==
> 	total number of tabled objects / number of storage nodes
> * distribution of chunk sizes:  dependent upon the application using tabled
> * aggregate bandwidth:  dependent upon the application using tabled
> 
> fbp>
> Might we put some numbers to this?
> Most notable is typical chunk size and number of supported clients.

You can make up some numbers, but chunk size and client count are two 
things that really will vary _wildly_ from application to application.
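
To show how the quoted estimate (chunks per node == total tabled
objects / number of storage nodes) plays out, here is a tiny
calculation.  Every number in it is made up, it ignores replication,
and the function names are hypothetical.

```python
def chunks_per_node(total_objects, n_nodes):
    # Per-node chunk count from the quoted estimate:
    # total tabled objects / number of storage nodes.
    return total_objects // n_nodes


def max_mean_chunk_size(disk_bytes, n_chunks):
    # Largest average chunk size that still fits on one node's disk.
    return disk_bytes // n_chunks


# Made-up deployment: 100 million objects over 1000 nodes,
# each node backed by a single 1 TB (10**12 bytes) disk.
per_node = chunks_per_node(100_000_000, 1000)        # 100,000 chunks
mean_size = max_mean_chunk_size(10**12, per_node)    # 10,000,000 bytes
```

Change the application and both inputs can move by orders of magnitude,
which is why no single figure is very meaningful.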

A distributed filesystem like Hadoop DFS / GoogleFS / CloudStore could 
have thousands of clients talking to a single chunkd node, because 
clients of those DFSs connect directly to the storage nodes.

NFS v4.1 also specifies a parallel storage model, where clients connect 
directly to the storage node storing the client's desired data.

In contrast, our current tabled design does not permit end-user clients 
to connect directly to chunkd storage.  That implies hundreds or 
thousands of chunkd nodes, each with 0-5 actively connected clients.

Regards,

	Jeff