* Re: [PATCH] Expose Ceph data location information to Hadoop
From: Noah Watkins @ 2010-11-30  6:33 UTC (permalink / raw)
  To: ceph-devel; +Cc: Joe Buck, Carlos Maltzahn

Hi Alex,

I have two pieces of feedback on this patch. The first is a question about the correctness of your method for retrieving block locations, and about what a Hadoop block means in the context of Ceph. The second is a design suggestion.

Correctness of Block Location Retrieval
===============================

The following example refers to the JNI C++ code that builds the list of block locations by querying the ioctl interface of a file in Ceph:

+  jlong loopinit=j_start/blocksize;
+  jlong i=loopinit;
+  for (jlong imax=j_start+j_len; i*blocksize < imax; i++) {
+    //Note <=; we go through the last requested byte.
+    //Set up the data location object
+    curoffset = i*blocksize;
+    dl.file_offset = curoffset;

It appears to me that this code does not fully account for the striping strategy that Ceph implements. More specifically, it appears to work only when the object size and stripe unit are equal for a given file (which is likely the default). The following example covers the case in which the object size is not equal to the stripe unit.

Consider the following contrived setup for a file in Ceph from which Hadoop tries to acquire all object locations (i.e. Hadoop blocks):

Object size: 3 MB
Stripe unit: 1 MB
Stripe count: 3
File size: 18 MB
==> Thus, 6 objects (0, 1, ..., 5)

If j_start = 0 and j_len = 18 MB (with blocksize equal to the 3 MB object size), then the loop above queries Ceph about the objects containing the following offsets:

0 * blocksize = 0 MB
1 * blocksize = 3 MB
2 * blocksize = 6 MB
3 * blocksize = 9 MB
4 * blocksize = 12 MB
5 * blocksize = 15 MB

However, because the object size and stripe unit are not equal, the objects don't fill up sequentially in multiples of the object size: each 1 MB stripe unit is placed round-robin across the 3 objects of a stripe. As a result, Ceph would report the following object numbers, and objects 1, 2, 4, and 5 are never returned:

Offset --> Object Number
0 MB --> 0
3 MB --> 0
6 MB --> 0
9 MB --> 3
12 MB --> 3
15 MB --> 3
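
As a minimal sketch of the mapping behind the table above, the following C++ function computes the object number for a file offset under the round-robin striping the example describes. The `Layout` struct, its field names, and `object_for_offset` are hypothetical and introduced only for illustration; they are not part of the patch or of any Ceph API, and the sketch assumes the object size is a multiple of the stripe unit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout description (illustrative names, not Ceph's own types).
struct Layout {
    uint64_t object_size;   // bytes per object
    uint64_t stripe_unit;   // bytes per stripe unit
    uint64_t stripe_count;  // number of objects a stripe is spread across
};

// Map a file offset to the number of the object that holds it,
// assuming stripe units are placed round-robin across stripe_count objects.
inline uint64_t object_for_offset(const Layout& l, uint64_t off) {
    assert(l.stripe_unit > 0 && l.stripe_count > 0);
    assert(l.object_size % l.stripe_unit == 0);
    const uint64_t units_per_object = l.object_size / l.stripe_unit;
    const uint64_t unit_no    = off / l.stripe_unit;       // index of the stripe unit in the file
    const uint64_t stripe_no  = unit_no / l.stripe_count;  // which stripe (row) it falls in
    const uint64_t stripe_pos = unit_no % l.stripe_count;  // which object (column) within the stripe
    const uint64_t object_set = stripe_no / units_per_object;
    return object_set * l.stripe_count + stripe_pos;
}
```

With object size 3 MB, stripe unit 1 MB, and stripe count 3, offsets 0, 3, and 6 MB all land in object 0, while offsets 1 MB and 2 MB land in objects 1 and 2, which the loop in the patch never queries.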

This is easy to remedy by implementing the striping strategy in your code, but I think it is also an opportunity to clean up the design a bit.

What is a Hadoop Block in Ceph?
==========================

Hadoop considers blocks to be contiguous extents. However, the above example shows that an object can hold data from multiple, non-consecutive, contiguous extents, so the object itself doesn't represent a fully contiguous extent.

The more natural (and general) solution is to consider the stripe unit, not the entire object, to be the _unit_ of Hadoop blocks. When the stripe unit and block size are the same, the result is analogous to HDFS's treatment of blocks.
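
To illustrate treating the stripe unit as the block unit, here is a rough C++ sketch that walks a byte range one stripe unit at a time and records which object holds each piece. `StripeLayout`, `BlockExtent`, and `stripe_unit_blocks` are hypothetical names used only for this example, and the round-robin placement is the same assumption as in the contrived setup above; it is not the patch's code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical layout description (illustrative names only).
struct StripeLayout {
    uint64_t object_size;
    uint64_t stripe_unit;
    uint64_t stripe_count;
};

// One contiguous extent of the file and the object that stores it.
struct BlockExtent {
    uint64_t file_offset;
    uint64_t length;
    uint64_t object_no;
};

// Split [start, start + len) at stripe-unit boundaries, one extent per unit.
inline std::vector<BlockExtent> stripe_unit_blocks(const StripeLayout& l,
                                                   uint64_t start, uint64_t len) {
    std::vector<BlockExtent> out;
    if (len == 0 || l.stripe_unit == 0 || l.stripe_count == 0 ||
        l.object_size < l.stripe_unit) {
        return out;  // nothing to map, or an unusable layout
    }
    const uint64_t units_per_object = l.object_size / l.stripe_unit;
    const uint64_t end = start + len;
    uint64_t off = start;
    while (off < end) {
        const uint64_t unit_no   = off / l.stripe_unit;
        const uint64_t unit_end  = (unit_no + 1) * l.stripe_unit;
        const uint64_t stop      = std::min(unit_end, end);
        const uint64_t stripe_no = unit_no / l.stripe_count;
        const uint64_t object_no =
            (stripe_no / units_per_object) * l.stripe_count + unit_no % l.stripe_count;
        out.push_back({off, stop - off, object_no});
        off = stop;  // advance to the next stripe-unit boundary
    }
    return out;
}
```

For the 18 MB example file this yields 18 extents of 1 MB each, and every one of the 6 objects appears in the result, unlike stepping by object size.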

Design Suggestion
===============

I would propose moving the functionality of mapping offsets to object locations into a library managed in the Ceph tree, and then either 1) using JNI as a thin layer over this library, or 2) scrapping JNI altogether in favor of JNA.

Either way, the motivation for moving this functionality into the Ceph tree is that, from Hadoop's point of view, object/block location is independent of the striping strategy. Future Ceph enhancements and research may use alternative striping strategies, which would otherwise have to be duplicated in the Hadoop code base.

Thanks,
Noah
