Re: [PATCH] Expose Ceph data location information to Hadoop

* Re: [PATCH] Expose Ceph data location information to Hadoop
@ 2010-11-30  6:33 Noah Watkins
  2010-11-30 16:59 ` Sage Weil
  2010-11-30 17:38 ` [PATCH] Expose Ceph data location information to Hadoop Alex‎ Nelson
  0 siblings, 2 replies; 17+ messages in thread
From: Noah Watkins @ 2010-11-30  6:33 UTC (permalink / raw)
  To: ceph-devel; +Cc: Joe Buck, Carlos Maltzahn

Hi Alex,

I have two pieces of feedback on this patch. The first is a question about the correctness of your method of retrieving block locations, and about what a Hadoop block means in the context of Ceph; the second is a design suggestion.

Correctness of Block Location Retrieval
===============================

The following example refers to the JNI C++ code that builds the list of block locations by querying the IOCTL interface of a file in Ceph:

+  jlong loopinit=j_start/blocksize;
+  jlong i=loopinit;
+  for (jlong imax=j_start+j_len; i*blocksize < imax; i++) {
+    //Note <=; we go through the last requested byte.
+    //Set up the data location object
+    curoffset = i*blocksize;
+    dl.file_offset = curoffset;

It appears to me that this code does not fully account for the striping strategy that Ceph implements. More specifically, it appears to work only when the object size and stripe unit are equal for a given file (which is likely the default). The following covers the case in which the object size is not equal to the stripe unit.

Consider the following contrived setup for a file in Ceph from which Hadoop tries to acquire all object locations (i.e. Hadoop blocks):

Object size: 3 MB
Stripe unit: 1 MB
Stripe count: 3
File size: 18 MB
==> Thus, 6 objects (0, 1, ..., 5)

If j_start = 0 and j_len = 18 MB then the loop above queries Ceph about the objects containing the following offsets:

0 * blocksize = 0 MB
1 * blocksize = 3 MB
2 * blocksize = 6 MB
3 * blocksize = 9 MB
4 * blocksize = 12 MB
5 * blocksize = 15 MB

However, because the object size and stripe unit are not equal, the objects are not filled one after another in object-size chunks. Ceph would therefore report the following object numbers for those offsets, and objects 1, 2, 4, and 5 are never seen:

Offset --> Object Number
0 MB --> 0
3 MB --> 0
6 MB --> 0
9 MB --> 3
12 MB --> 3
15 MB --> 3

This is easy to remedy by implementing the striping strategy in your code, but I think it is also an opportunity to clean up the design a bit.
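
The table above can be reproduced with a short sketch of the striping arithmetic. It is a simplified illustration of the offset-to-object mapping, not a copy of the Ceph client code; the `Layout` struct and the functions `object_number` and `probe` are hypothetical names introduced here, and the object size is assumed to be a multiple of the stripe unit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layout description; the field names are illustrative only.
struct Layout {
    std::uint64_t stripe_unit;   // bytes per stripe unit
    std::uint64_t stripe_count;  // number of objects a stripe is spread across
    std::uint64_t object_size;   // bytes per object (assumed multiple of stripe_unit)
};

// Map a file offset to the number of the object that holds that byte.
std::uint64_t object_number(const Layout& l, std::uint64_t offset) {
    const std::uint64_t units_per_object = l.object_size / l.stripe_unit;
    const std::uint64_t unit_index = offset / l.stripe_unit;        // stripe unit within the file
    const std::uint64_t stripe_no  = unit_index / l.stripe_count;   // which stripe (row)
    const std::uint64_t stripe_pos = unit_index % l.stripe_count;   // which object in the set (column)
    const std::uint64_t object_set = stripe_no / units_per_object;  // which group of stripe_count objects
    return object_set * l.stripe_count + stripe_pos;
}

// Object numbers seen when probing [start, start + len) at multiples of `step`,
// which is what the loop in the patch does with step = blocksize.
std::vector<std::uint64_t> probe(const Layout& l, std::uint64_t start,
                                 std::uint64_t len, std::uint64_t step) {
    std::vector<std::uint64_t> seen;
    for (std::uint64_t i = start / step; i * step < start + len; ++i) {
        seen.push_back(object_number(l, i * step));
    }
    return seen;
}
```

With a 1 MB stripe unit, a stripe count of 3, and 3 MB objects, probing 18 MB in 3 MB steps yields objects 0, 0, 0, 3, 3, 3, matching the table, while offsets 1 MB and 2 MB map to objects 1 and 2.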

What is a Hadoop Block in Ceph?
==========================

Hadoop considers blocks to be contiguous extents. However, the example above shows that an object can hold data from multiple non-adjacent contiguous extents of the file, so the object itself doesn't represent a single contiguous extent.

The more natural (and general) solution is to consider the stripe unit to be the _unit_ of Hadoop blocks, not entire objects. When stripe unit and block size are the same the result is analogous to HDFS's treatment of blocks.
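
A minimal sketch of treating the stripe unit as the block: walk the requested byte range one stripe unit at a time and record which object holds each unit. The names `Layout`, `Extent`, `object_number`, and `blocks_by_object` are hypothetical, the mapping arithmetic is a simplified stand-in for the client's real mapping, and the object size is assumed to be a multiple of the stripe unit.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical layout description; the field names are illustrative only.
struct Layout {
    std::uint64_t stripe_unit;   // bytes per stripe unit
    std::uint64_t stripe_count;  // number of objects a stripe is spread across
    std::uint64_t object_size;   // bytes per object (assumed multiple of stripe_unit)
};

// Map a file offset to the number of the object that holds that byte.
std::uint64_t object_number(const Layout& l, std::uint64_t offset) {
    const std::uint64_t units_per_object = l.object_size / l.stripe_unit;
    const std::uint64_t unit_index = offset / l.stripe_unit;
    const std::uint64_t stripe_no  = unit_index / l.stripe_count;
    const std::uint64_t stripe_pos = unit_index % l.stripe_count;
    return (stripe_no / units_per_object) * l.stripe_count + stripe_pos;
}

// A contiguous piece of the file: (file offset, length in bytes).
using Extent = std::pair<std::uint64_t, std::uint64_t>;

// Split [start, start + len) at stripe-unit boundaries and group the
// resulting file extents by the object that stores them.
std::map<std::uint64_t, std::vector<Extent>> blocks_by_object(
        const Layout& l, std::uint64_t start, std::uint64_t len) {
    std::map<std::uint64_t, std::vector<Extent>> out;
    const std::uint64_t end = start + len;
    std::uint64_t off = start;
    while (off < end) {
        // End of the stripe unit containing `off`, clipped to the requested range.
        std::uint64_t unit_end = (off / l.stripe_unit + 1) * l.stripe_unit;
        if (unit_end > end) unit_end = end;
        out[object_number(l, off)].push_back({off, unit_end - off});
        off = unit_end;
    }
    return out;
}
```

For the 18 MB example above, this visits all six objects, and object 0 ends up with three separate 1 MB file extents starting at 0 MB, 3 MB, and 6 MB, which is why a whole object does not behave like one contiguous Hadoop block.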

Design Suggestion
===============

I would propose moving the functionality of mapping offsets to object locations into a library managed in the Ceph tree, and either 1) use JNI as a thin layer to this library, or 2) scrap JNI altogether for JNA.

Either way, moving this functionality into the Ceph tree matters because, from Hadoop's point of view, object/block location is independent of striping strategy. Future Ceph enhancements and research may use alternative striping strategies, which would then have to be duplicated in the Hadoop code base.

Thanks,
Noah

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30  6:33 [PATCH] Expose Ceph data location information to Hadoop Noah Watkins
@ 2010-11-30 16:59 ` Sage Weil
  2010-11-30 17:17   ` Noah Watkins
  2010-12-01  3:50   ` what's the exact meaning of cap? wchen
  2010-11-30 17:38 ` [PATCH] Expose Ceph data location information to Hadoop Alex‎ Nelson
  1 sibling, 2 replies; 17+ messages in thread
From: Sage Weil @ 2010-11-30 16:59 UTC (permalink / raw)
  To: Noah Watkins; +Cc: ceph-devel, Joe Buck, Carlos Maltzahn

On Mon, 29 Nov 2010, Noah Watkins wrote:
> What is a Hadoop Block in Ceph?
> ==========================
> 
> Hadoop considers blocks to be contiguous extents, however, from the 
> above example we can see that an object can have data from multiple, 
> non-consecutive, contiguous extents, thus the object itself doesn't 
> represent a fully contiguous extent.
> 
> The more natural (and general) solution is to consider the stripe unit 
> to be the _unit_ of Hadoop blocks, not entire objects. When stripe unit 
> and block size are the same the result is analogous to HDFS's treatment 
> of blocks.

Yeah, I would lean toward using the stripe unit as the "block" here.

> Design Suggestion
> ===============
> 
> I would propose moving the functionality of mapping offsets to object 
> locations into a library managed in the Ceph tree, and either 1) use JNI 
> as a thin layer to this library, or 2) scrap JNI altogether for JNA.
> 
> Either way, the motivation for moving this functionality into the Ceph 
> tree is important because from the point of view of Hadoop object/block 
> location is independent of striping strategy. Future Ceph enhancements 
> and research may use alternative striping strategies which would thus 
> have to be re-duplicated into the Hadoop code base.

Well, the ioctl interface is fixed (Linux kernel ABI rules), so there is 
no danger in relying on it.  In the end it'll be more work to create a 
separate library that just wraps the ioctls, and any change in the layout 
scheme that would motivate e.g. a new v2 ioctl would also mean updating 
the library, leaving you with the same backward compatibility issues we 
started with.  In the end whether it's an ioctl(2) call or a shared 
library call is mostly a matter of syntax; the underlying data passed by 
the interface is the same.

sage

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 16:59 ` Sage Weil
@ 2010-11-30 17:17   ` Noah Watkins
  2010-11-30 17:28     ` Gregory Farnum
  2010-12-01  3:50   ` what's the exact meaning of cap? wchen
  1 sibling, 1 reply; 17+ messages in thread
From: Noah Watkins @ 2010-11-30 17:17 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Joe Buck, Carlos Maltzahn

> Well, the ioctl interface is fixed (Linux kernel ABI rules), so there is 
> no danger in relying on it.  In the end it'll be more work to create a 
> separate library that just wraps the ioctls, and any change in the layout 
> scheme that would motivate e.g. a new v2 ioctl would also mean updating 
> the library, leaving you with the same backward compatibility issues we 
> started with.  In the end whether it's an ioctl(2) call or a shared 
> library call is mostly a matter of syntax; the underlying data passed by 
> the interface is the same.

I agree that the stability of the ABI isn't an issue, but maybe I wasn't clear enough.

Currently the data location IOCTL is used in Hadoop to map an offset to an object location, so Hadoop must generate a set of offsets that fall in distinct objects in order to collect the list of object locations where map tasks should run.

The problem is that in order to generate this list of offsets, the striping strategy must be taken into consideration in Hadoop (as shown in the original version of this email). Currently this means that the logic of "ceph_calc_file_object_mapping(...)" in the client must be synchronized with the Hadoop code base, something that is unnecessary. Rather than duplicating this logic, a better solution seems to be to expand the Ceph IOCTL interface to include functionality for retrieving object locations or at the very least, a set of candidate offsets. This would allow the client to hide the striping strategy from Hadoop.

Whether or not this is an IOCTL that expands the ABI, or a user-level library that contains the striping strategy logic seems to be a decision for someone else, but the important thing is that striping strategy logic doesn't get distributed into arbitrary code bases such as Hadoop, requiring synchronization.

Thanks,
Noah

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 17:17   ` Noah Watkins
@ 2010-11-30 17:28     ` Gregory Farnum
  2010-11-30 17:31       ` Noah Watkins
  0 siblings, 1 reply; 17+ messages in thread
From: Gregory Farnum @ 2010-11-30 17:28 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Sage Weil, ceph-devel, Joe Buck, Carlos Maltzahn

On Tue, Nov 30, 2010 at 9:17 AM, Noah Watkins <jayhawk@cs.ucsc.edu> wrote:
> Whether or not this is an IOCTL that expands the ABI, or a user-level library that contains the striping strategy logic seems to be a decision for someone else, but the important thing is that striping strategy logic doesn't get distributed into arbitrary code bases such as Hadoop, requiring synchronization.
Based on my (unmerged) work writing a userspace client-based
FileSystem for Hadoop, I don't think any JNI code is going to go into
the Hadoop tree. So the striping strategy could easily go into the JNI
code and be maintained separately from the Java code. :)
-Greg

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 17:28     ` Gregory Farnum
@ 2010-11-30 17:31       ` Noah Watkins
  0 siblings, 0 replies; 17+ messages in thread
From: Noah Watkins @ 2010-11-30 17:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel, Joe Buck, Carlos Maltzahn

> Based on my (unmerged) work writing a userspace client-based
> FileSystem for Hadoop, I don't think any JNI code is going to go into
> the Hadoop tree. So the striping strategy could easily go into the JNI
> code and be maintained separately from the Java code. :)

Ahh, excellent. That seems to resolve the issue :)

Thanks,
Noah

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30  6:33 [PATCH] Expose Ceph data location information to Hadoop Noah Watkins
  2010-11-30 16:59 ` Sage Weil
@ 2010-11-30 17:38 ` Alex‎ Nelson
  2010-11-30 17:50   ` Noah Watkins
  1 sibling, 1 reply; 17+ messages in thread
From: Alex‎ Nelson @ 2010-11-30 17:38 UTC (permalink / raw)
  To: Noah Watkins; +Cc: ceph-devel, Joe Buck, Carlos Maltzahn

Noah,

There shouldn't be too much redundancy between here and the forked thread.


On Nov 29, 2010, at 22:33 , Noah Watkins wrote:

> Hi Alex,
> 
> I have some feedback for this patch. The first is a question about the correctness of your method of retrieving block locations, and what the notion of a Hadoop block means in the context of Ceph, and the second is a design suggestion. 
> 
> Correctness of Block Location Retrieval
> ===============================
> 
> The following example is in relation to the JNI c++ code that creates the list of block locations by querying the IOCTL interface of a file in Ceph:
> 
> +  jlong loopinit=j_start/blocksize;
> +  jlong i=loopinit;
> +  for (jlong imax=j_start+j_len; i*blocksize < imax; i++) {
> +    //Note <=; we go through the last requested byte.
(As always, the code evolves past the (inline) documentation.)
> +    //Set up the data location object
> +    curoffset = i*blocksize;
> +    dl.file_offset = curoffset;
> 
> It appears to me that this code does not fully consider the striping strategy that Ceph implements. More specifically this code appears to only work when the object size and striping unit are equal for a given file (something that is likely set by default).
You're correct, I didn't put in any assumptions for fl_stripe_unit and fl_object_size being different.  The matching defaults are in <ceph>/src/config.cc, in the struct ceph_file_layout g_default_file_layout.

> The following is for the case in which object size is not equal to the stripe unit.
> 
> Consider the following contrived setup for a file in Ceph from which Hadoop tries to acquire all object locations (i.e. Hadoop blocks):
> 
> Object size: 3 MB
> Stripe unit: 1 MB
> Stripe count: 3
> File size: 18 MB
> ==> Thus, 6 objects (0, 1, ..., 5)
> 
> If j_start = 0 and j_len = 18 MB then the loop above queries Ceph about the objects containing the following offsets:
> 
> 0 * blocksize = 0 MB
> 1 * blocksize = 3 MB
> 2 * blocksize = 6 MB
> 3 * blocksize = 9 MB
> 4 * blocksize = 12 MB
> 5 * blocksize = 15 MB
> 
> However, given that the object size and stripe unit are not equal, the objects don't fill up uniformly as a multiple of object size:
> 
> The above would result in Ceph reporting the following object numbers, missing objects (1, 2, 4, 5):
> 
> Offset --> Object Number
> 0 MB --> 0
> 3 MB --> 0
> 6 MB --> 0
> 9 MB --> 3
> 12 MB --> 3
> 15 MB  --> 3
> 
> This is easy to remedy by implementing the striping strategy in your code, but I think is also an opportunity for cleaning up the design a bit.
> 
> What is a Hadoop Block in Ceph?
> ==========================
> 
> Hadoop considers blocks to be contiguous extents, however, from the above example we can see that an object can have data from multiple, non-consecutive, contiguous extents, thus the object itself doesn't represent a fully contiguous extent.
I didn't follow your numeric examples above---I missed how you mapped Offsets to Object Numbers---but I follow you on striping meaning different data locations for what Hadoop would think would be one Ceph object in one place.
> 
> The more natural (and general) solution is to consider the stripe unit to be the _unit_ of Hadoop blocks, not entire objects. When stripe unit and block size are the same the result is analogous to HDFS's treatment of blocks.
I agree with you, and push forward one more step:  Ceph and Hadoop should just think of a block/object as the same size.  One of the TODO's is exposing Ceph's object size to Hadoop, and that "read" interface for block size will probably need to expand to a "write" interface to reduce confusion with folks configuring Hadoop to use a block size of N bytes.  (This train of thought spawned the resolved Ceph tracker item here: http://tracker.newdream.net/issues/185 .)  The Hadoop configuration block size would have to propagate down to the Ceph file layout configuration.  It may be that the functionality's already there in Hadoop and Ceph and just needs glue, I'm not sure.

> 
> Design Suggestion
> ===============
> 
> I would propose moving the functionality of mapping offsets to object locations into a library managed in the Ceph tree, and either 1) use JNI as a thin layer to this library, or 2) scrap JNI altogether for JNA.
After writing this code, I do like seeing the words "scrap" and "JNI" so close in the same sentence.  That's more up to the Hadoop community, though; I don't know how well-accepted JNA is in their code base.

The ioctl struct ceph_ioctl_dataloc already returns the primary copy's object offset for an input file offset, though I think it would be a little more useful if it included replica offsets.  Since that interface won't change (referencing the forked message thread), getting a single offset shouldn't mean any new code for Ceph.  If it looks like it's clumsy to actually _get_ the block offset judging from my code, just consider that code inflation from all my safety/debug JNI checks.  I'm not calling it pretty or elegant by any stretch.
> 
> Either way, the motivation for moving this functionality into the Ceph tree is important because from the point of view of Hadoop object/block location is independent of striping strategy.
I wasn't aware Hadoop had a built-in consideration for striping strategy.  grep'ing over the Hadoop code for "stripe" and "stripi" returns 0 hits.

> Future Ceph enhancements and research may use alternative striping strategies which would thus have to be re-duplicated into the Hadoop code base.
It sounds to me like Hadoop needs some code to determine or set striping strategy (independent of Ceph's logistics), if this technical point is going to come up repeatedly in future research.  Maybe the Ceph file system class would be a good place to try this out?  Going up higher in the class hierarchy may mean confusion for HDFS.

--Alex


> 
> Thanks,
> Noah


* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 17:38 ` [PATCH] Expose Ceph data location information to Hadoop Alex‎ Nelson
@ 2010-11-30 17:50   ` Noah Watkins
  2010-11-30 18:10     ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Noah Watkins @ 2010-11-30 17:50 UTC (permalink / raw)
  To: Alex‎ Nelson; +Cc: Noah Watkins, ceph-devel, Joe Buck, Carlos Maltzahn

> I didn't follow your numeric examples above---I missed how you mapped Offsets to Object Numbers---but I follow you on striping meaning different data locations for what Hadoop would think would be one Ceph object in one place.
I didn't explicitly describe the mapping algorithm, but it can be found in the function "ceph_calc_file_object_mapping(...)" in the kernel client. If you execute the algorithm with the parameters in my example you can reproduce the mapping I presented.

>> 
>> The more natural (and general) solution is to consider the stripe unit to be the _unit_ of Hadoop blocks, not entire objects. When stripe unit and block size are the same the result is analogous to HDFS's treatment of blocks.
> I agree with you, and push forward one more step:  Ceph and Hadoop should just think of a block/object as the same size.
Per Sage's response, a Hadoop block can be equal to a Ceph stripe unit.


> One of the TODO's is exposing Ceph's object size to Hadoop, and that "read" interface for block size will probably need to expand to a "write" interface to reduce confusion with folks configuring Hadoop to use a block size of N bytes.
How is configured block size relevant in Hadoop? This seems to me to be specific to HDFS. The analogy would be to configure the file layout parameters in Ceph.

> After writing this code, I do like seeing the words "scrap" and "JNI" so close in the same sentence.  That's more up to the Hadoop community, though; I don't know how well-accepted JNA is in their code base.
One solution might be a lazily populated sysfs interface for retrieving object information for a given file, circumventing the problem Java has calling IOCTLs. But that's another conversation.

> The ioctl struct ceph_ioctl_dataloc already returns the primary copy's object offset for an input file offset, though I think it would be a little more useful if it included replica offsets.
I can submit a patch for this. Sage, I remember you mentioning that reading from replicas might pose (scalability?) problems. Any thoughts on this?

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 17:50   ` Noah Watkins
@ 2010-11-30 18:10     ` Sage Weil
  2010-11-30 18:32       ` Noah Watkins
  2010-11-30 21:39       ` Noah Watkins
  0 siblings, 2 replies; 17+ messages in thread
From: Sage Weil @ 2010-11-30 18:10 UTC (permalink / raw)
  To: Noah Watkins
  Cc: Alex‎ Nelson, Noah Watkins, ceph-devel, Joe Buck,
	Carlos Maltzahn

On Tue, 30 Nov 2010, Noah Watkins wrote:
> > After writing this code, I do like seeing the words "scrap" and "JNI" 
> > so close in the same sentence.  That's more up to the Hadoop 
> > community, though; I don't know how well-accepted JNA is in their code 
> > base.
> One solution might be a lazily populated sysfs interface for retrieving 
> object information for a given file, circumventing the problem Java has 
> calling IOCTLs. But that's another conversation.

Yeah, not pretty.  A shared library wouldn't really help here either, 
right?  And a command-line tool means additional overhead.  JNI (or 
equivalent) calling an ioctl seems like the most appropriate tool.

> > The ioctl struct ceph_ioctl_dataloc already returns the primary copy's 
> > object offset for an input file offset, though I think it would be a 
> > little more useful if it included replica offsets.
> I can submit a patch for this. Sage, I remember you mentioning that 
> reading from replicas might pose (scalability?) problems. Any thoughts 
> on this?

There are two things.  First, we'd need a DATALOC_V2 ioctl that would 
return locations for all replicas of the object.  Is Hadoop smart about 
scheduling jobs on the best replica?

The second part is being able to read from them.  In general, sending any/all
reads to a random replica does bad things to your cache.  In principle, 
it's possible, though, at least when a file is only opened for read (at 
that point all replicas are known consistent on disk).  Someone suggested 
on IRC a while back that in such a case we have a check to read from a 
non-primary replica if that replica happens to be on the local node.  That 
sort of optimization would work in this case.  A number of changes in the 
OSD and client will be needed, but nothing too invasive (I think!).
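
The check described above can be sketched as a small decision function. This is only an illustration of the rule as stated in this message, not existing Ceph client code; `choose_read_replica` and its parameters are hypothetical, and index 0 of the host list is assumed to be the primary.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the index of the replica to read from.
// hosts[0] is assumed to be the primary; the rest are non-primary replicas.
std::size_t choose_read_replica(const std::vector<std::string>& hosts,
                                const std::string& local_host,
                                bool opened_read_only) {
    // Only consider a non-primary replica when the file is open read-only,
    // the case in which all replicas are known to be consistent on disk.
    if (opened_read_only) {
        for (std::size_t i = 0; i < hosts.size(); ++i) {
            if (hosts[i] == local_host) {
                return i;  // a copy lives on this node: read it locally
            }
        }
    }
    return 0;  // otherwise keep sending reads to the primary
}
```

Falling back to the primary in every other case keeps reads for an object concentrated on one node, which avoids the cache problem of spreading reads across random replicas.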

Before going to the trouble, though, I want to make sure we'll really 
benefit from all of that...

sage

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 18:10     ` Sage Weil
@ 2010-11-30 18:32       ` Noah Watkins
  2010-11-30 19:01         ` Sage Weil
  2010-11-30 21:39       ` Noah Watkins
  1 sibling, 1 reply; 17+ messages in thread
From: Noah Watkins @ 2010-11-30 18:32 UTC (permalink / raw)
  To: Sage Weil
  Cc: Alex‎ Nelson, Noah Watkins, ceph-devel, Joe Buck,
	Carlos Maltzahn

> Yeah, not pretty.  A shared library wouldn't really help here either, 
> right?  And a command-line tool means additional overhead.  JNI (or 
> equivalent) calling an ioctl seems like the most appropriate tool.

I like the motivation behind the command-line tool, but agree with you on the overhead issues. The only other method for arbitrary communication between processes that comes to mind for this situation is a socket-based approach.

This could take two forms:

1) A user-space daemon to service requests from Hadoop
2) A socket between kernel and user-space to service requests.

The former is unattractive because it requires additional client setup, while the latter also poses challenges. However, if this approach seems attractive, we could begin to experiment with the second option in debugfs to avoid ABI lock-in?

One thing all solutions have in common is that the cost on the Hadoop end is a one-time cost. While overhead is important, a number of inefficient lookups may easily be masked by the start-up costs associated with Hadoop's infrastructure.

> 
>>> The ioctl struct ceph_ioctl_dataloc already returns the primary copy's 
>>> object offset for an input file offset, though I think it would be a 
>>> little more useful if it included replica offsets.
>> I can submit a patch for this. Sage, I remember you mentioning that 
>> reading from replicas might pose (scalability?) problems. Any thoughts 
>> on this?
> 
> There are two things.  First, we'd need a DATALOC_V2 ioctl that would 
> return locations for all replicas of the object.  Is Hadoop smart about 
> scheduling jobs on the best replica?

Good question. I'm not sure what its scheduling policy is, but replica location is a key component of the Hadoop API, providing the information to the scheduler by default.

> Before going to the trouble, though, I want to make sure we'll really 
> benefit from all of that...

I agree. This enhancement is orthogonal to the overall design.

Thanks,
Noah

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 18:32       ` Noah Watkins
@ 2010-11-30 19:01         ` Sage Weil
  2010-11-30 19:04           ` Noah Watkins
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2010-11-30 19:01 UTC (permalink / raw)
  To: Noah Watkins
  Cc: Alex‎ Nelson, Noah Watkins, ceph-devel, Joe Buck,
	Carlos Maltzahn

On Tue, 30 Nov 2010, Noah Watkins wrote:
> > Yeah, not pretty.  A shared library wouldn't really help here either, 
> > right?  And a command-line tool means additional overhead.  JNI (or 
> > equivalent) calling an ioctl seems like the most appropriate tool.
> 
> I like the motivation behind the command-line tool, but agree with you 
> on the overhead issues. The only other method for arbitrary 
> communication between processes that comes to mind for this situation is 
> a socket-based approach.
> 
> This could take two forms:
> 
> 1) A user-space daemon to service requests from Hadoop
> 2) A socket between kernel and user-space to service requests.
> 
> The former is unattractive because it requires additional client setup, 
> while the latter also poses challenges. However, if this approach seems 
> attractive, we could begin to experiment with the second option in 
> debugfs to avoid ABI lock-in?

Well, if the goal is just to get something working to test, then I would 
use JNI and use the ioctl; whether it can go upstream easily isn't 
relevant.  If the goal is something that can go upstream, then it 
needs a stable ABI, and debugfs isn't really a solution there either.  

Are we sure JNI is a real problem?  It really seems like the right tool 
for the job.  Greg seems to remember them asking who would maintain the 
(non-java) JNI bits, but even if that's us and not them (which is probably 
the way to go anyway), I don't see that that's a problem.

> >>> The ioctl struct ceph_ioctl_dataloc already returns the primary copy's 
> >>> object offset for an input file offset, though I think it would be a 
> >>> little more useful if it included replica offsets.
> >> I can submit a patch for this. Sage, I remember you mentioning that 
> >> reading from replicas might pose (scalability?) problems. Any thoughts 
> >> on this?
> > 
> > There are two things.  First, we'd need a DATALOC_V2 ioctl that would 
> > return locations for all replicas of the object.  Is Hadoop smart about 
> > scheduling jobs on the best replica?
> 
> Good question. I'm not sure what its scheduling policy is, but replica 
> location is a key component of the Hadoop API, providing the information 
> to the scheduler by default.

Let's start with just providing the primary replica, at least until we 
find out whether hadoop takes advantage of additional ones (does HDFS read 
from the local non-primary replica?).

sage


> 
> > Before going to the trouble, though, I want to make sure we'll really 
> > benefit from all of that...
> 
> I agree. This enhancement is orthogonal to the overall design.
> 
> Thanks,
> Noah
> 
> 


* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 19:01         ` Sage Weil
@ 2010-11-30 19:04           ` Noah Watkins
  2010-11-30 19:41             ` Alex‎ Nelson
       [not found]             ` <9DEABEC1-48A1-466D-9942-C0D8A199EF96@soe.ucsc.edu>
  0 siblings, 2 replies; 17+ messages in thread
From: Noah Watkins @ 2010-11-30 19:04 UTC (permalink / raw)
  To: Sage Weil
  Cc: Alex‎ Nelson, Noah Watkins, ceph-devel, Joe Buck,
	Carlos Maltzahn

> Are we sure JNI is a real problem?  It really seems like the right tool 
> for the job.  Greg seems to remember them asking who would maintain the 
> (non-java) JNI bits, but even if that's us and not them (which is probably 
> the way to go anyway), I don't see that that's a problem.

Yeh, it's sort of a wash. A nice goal would be to have a patch that allowed Hadoop to not require any additional components (i.e. JNI packages) from the Ceph repository. Given that the Ceph infrastructure will be installed anyway in the case of Hadoop, it's a bit of a toss up.

-n

> Let's start with just providing the primary replica, at least until we 
> find out whether hadoop takes advantage of additional ones (does HDFS read 
> from the local non-primary replica?).

I believe that Hadoop will schedule a map job on a local replica for load balancing, or to duplicate the work when a map is running slowly. Joe, can you confirm this?

* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 19:04           ` Noah Watkins
@ 2010-11-30 19:41             ` Alex‎ Nelson
       [not found]             ` <9DEABEC1-48A1-466D-9942-C0D8A199EF96@soe.ucsc.edu>
  1 sibling, 0 replies; 17+ messages in thread
From: Alex‎ Nelson @ 2010-11-30 19:41 UTC (permalink / raw)
  To: Noah Watkins
  Cc: Sage Weil, Noah Watkins, ceph-devel, Joe Buck, Carlos Maltzahn

>> Are we sure JNI is a real problem?  It really seems like the right tool 
>> for the job.  Greg seems to remember them asking who would maintain the 
>> (non-java) JNI bits, but even if that's us and not them (which is probably 
>> the way to go anyway), I don't see that that's a problem.
> 
> Yeh, it's sort of a wash. A nice goal would be to have a patch that allowed Hadoop to not require any additional components (i.e. JNI packages) from the Ceph repository. Given that the Ceph infrastructure will be installed anyway in the case of Hadoop, it's a bit of a toss up.

The JNI isn't very _fun_ to develop, but it does do the job just fine and with the expected pattern of using a stable interface, with nothing extravagant needed for either Hadoop or Ceph.  Hadoop already has JNI pieces, so adding more shouldn't be a problem (though I do wish the automake part wasn't so awkward to approach).

I suppose there will need to be some automated check for Ceph as part of the ant build process.

> 
> -n
> 
>> Let's start with just providing the primary replica, at least until we 
>> find out whether hadoop takes advantage of additional ones (does HDFS read 
>> from the local non-primary replica?).
> 
> I believe that Hadoop will schedule a map job on a local replica for load balancing, or to duplicate the work when a map is running slowly. Joe, can you confirm this?
> 
When I ran my basic evaluation, Hadoop was reporting its locality results as about 75% of jobs being run on the same node as the data.  This seemed to be a result of overloading nodes.  Someone will need to run a proper evaluation, as my experiment was small and blew up when I expanded my test cluster.  It was probably a misconfigured kernel upgrade or something else uninteresting that's irrelevant here.

--Alex

* Re: [PATCH] Expose Ceph data location information to Hadoop
       [not found]             ` <9DEABEC1-48A1-466D-9942-C0D8A199EF96@soe.ucsc.edu>
@ 2010-11-30 20:25               ` Joe Buck
  0 siblings, 0 replies; 17+ messages in thread
From: Joe Buck @ 2010-11-30 20:25 UTC (permalink / raw)
  To: Carlos Maltzahn
  Cc: Noah Watkins, Sage Weil, Alex‎ Nelson, Noah Watkins,
	ceph-devel

From what I've seen in the Hadoop code, each input split (a chunk of data to be processed) carries a set of hosts to run on. That list is not a hard limit, though; it's merely a strong suggestion. Hadoop will work with no locality information, in the sense that the map() and reduce() functions will execute and results will be produced. But the whole point of Hadoop is to do in-situ processing, so we want to give Hadoop as many opportunities as we can to place the job on a host that has the data locally.
With only one host specified per block of data, as I assume from this discussion Alex's implementation did, it's a lot more likely that a non-local host would read the data. Letting Hadoop read from replicas would make the load balancing a lot more efficient in terms of reads being local.
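
To illustrate the point, here is a minimal sketch (not Hadoop's actual scheduler) of why a longer host list helps. InputSplit, pick_host, and the node names are made up for the example; the only thing carried over from the discussion is that the host list is a preference rather than a requirement.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of an input split: the hosts believed to hold the
// split's data locally. The list is a hint, not a hard requirement.
struct InputSplit {
    std::vector<std::string> preferred_hosts;
};

// Pick a host for the split. Prefer an idle host that holds the data;
// otherwise fall back to any idle host, which means a non-local read.
// *local is set to whether the chosen host holds the data.
std::string pick_host(const InputSplit& split,
                      const std::set<std::string>& idle_hosts,
                      bool* local) {
    assert(!idle_hosts.empty());  // the sketch assumes some host is free
    for (const std::string& host : split.preferred_hosts) {
        if (idle_hosts.count(host) != 0) {
            *local = true;
            return host;
        }
    }
    *local = false;
    return *idle_hosts.begin();
}
```

With idle nodes {node1, node3}, a split that lists only node0 falls back to a non-local read, while a split that lists node0, node1 and node2 can be placed on node1 and read locally.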

-Joe Buck

On Nov 30, 2010, at 11:48 AM, Carlos Maltzahn wrote:

> 
> On Nov 30, 2010, at 11:04 AM, Noah Watkins wrote:
> 
>>> Are we sure JNI is a real problem?  It really seems like the right tool 
>>> for the job.  Greg seems to remember them asking who would maintain the 
>>> (non-java) JNI bits, but even if that's us and not them (which is probably 
>>> the way to go anyway), I don't see that that's a problem.
>> 
>> Yeh, it's sort of a wash. A nice goal would be to have a patch that allowed Hadoop to not require any additional components (i.e. JNI packages) from the Ceph repository. Given that the Ceph infrastructure will be installed anyway in the case of Hadoop, it's a bit of a toss up.
>> 
>> -n
>> 
>>> Let's start with just providing the primary replica, at least until we 
>>> find out whether hadoop takes advantage of additional ones (does HDFS read 
>>> from the local non-primary replica?).
>> 
>> I believe that Hadoop will schedule a map job on a local replica for load balancing, or to duplicate the work when a map is running slowly. Joe, can you confirm this?
> 
> Let me chime in: I believe Hadoop is scheduling by a combination of load and locality (and maybe other variables). I think the more choices we give to the task manager to satisfy both load and locality requirements, the better it will perform. Allowing Hadoop to see all replicas will allow us to verify the tradeoff between mapper performance and cost of additional replication.


* Re: [PATCH] Expose Ceph data location information to Hadoop
  2010-11-30 18:10     ` Sage Weil
  2010-11-30 18:32       ` Noah Watkins
@ 2010-11-30 21:39       ` Noah Watkins
  1 sibling, 0 replies; 17+ messages in thread
From: Noah Watkins @ 2010-11-30 21:39 UTC (permalink / raw)
  To: Sage Weil
  Cc: Alex Nelson, Noah Watkins, ceph-devel, Joe Buck,
	Carlos Maltzahn

> Yeah, not pretty.  A shared library wouldn't really help here either, 
> right?  And a command-line tool means additional overhead.  JNI (or 
> equivalent) calling an ioctl seems like the most appropriate tool.

Not to continually dog on JNI, but I'd like to point out another downside: portability. Using JNI to connect to the IOCTL requires that the JNI-based Java code be built on all platforms. As a result, each per-node deployment of Hadoop must ensure its version of the JNI connector was built for that particular architecture. I'm wondering how this will affect the sales pitch for Hadoop over Ceph, given that it could mean a configuration headache for large-scale deployments.

Thanks,
Noah


* what's the exact meaning of cap?
  2010-11-30 16:59 ` Sage Weil
  2010-11-30 17:17   ` Noah Watkins
@ 2010-12-01  3:50   ` wchen
  2010-12-01 16:11     ` Gregory Farnum
  1 sibling, 1 reply; 17+ messages in thread
From: wchen @ 2010-12-01  3:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

hi, Sage,

I'm reading the source code of ceph, and I'm very confused by struct
InodeCap's issued, implemented, and wanted fields. I have only a
preliminary understanding of them; what's the exact difference
between them? There is a lot of code that reflects the relation between
them. Can you explain it to me? Or are there any documents about it?

Thanks!

-- 
best regards,
Kevin Chen


* Re: what's the exact meaning of cap?
  2010-12-01  3:50   ` what's the exact meaning of cap? wchen
@ 2010-12-01 16:11     ` Gregory Farnum
  2010-12-02  1:41       ` wchen
  0 siblings, 1 reply; 17+ messages in thread
From: Gregory Farnum @ 2010-12-01 16:11 UTC (permalink / raw)
  To: wchen; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On Tue, Nov 30, 2010 at 7:50 PM, wchen <wchen@tnsoft.com.cn> wrote:
> hi, Sage,
>
> I'm reading the source code of ceph. Recently I'm very confused with
> struct InodeCap's issued, implemented, wanted fields. I just have a
> preliminary understanding of them, but what's the exact difference
> between them? there are lots of code to reflect the relation between
> them. Can you explain it for me? or any documents about it?
Unfortunately we don't have a lot of documentation about the source
code itself. Sage's thesis (which is available on the website) is the
best resource, and does describe caps some.

Briefly, caps are short for capabilities. They are issued by the MDS
to clients to describe what the client is allowed to do with an inode
and its associated file. There are capabilities to, for instance, let
the client buffer writes, cache reads, and adjust certain kinds of
metadata (mtime, et al).
The wanted capabilities are caps the client wants but doesn't have.
Issued caps are ones the client has been granted, and (IIRC)
implemented caps are a subset of the issued caps describing which caps
the client has actually made use of. It's useful for determining
whether the client has dirty metadata and whatnot. :)
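
As a rough illustration of the description above (and subject to the same IIRC caveat), the three fields can be thought of as bitmasks. Everything named below is hypothetical: these are not the actual Ceph constants, and CapState is only a stand-in for the fields being discussed, not the real struct layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical capability bits, one per permission mentioned above.
constexpr std::uint32_t CAP_CACHE_READS   = 1u << 0;
constexpr std::uint32_t CAP_BUFFER_WRITES = 1u << 1;
constexpr std::uint32_t CAP_SET_MTIME     = 1u << 2;

// Hypothetical stand-in for the three fields.
struct CapState {
    std::uint32_t wanted;       // caps the client is asking for
    std::uint32_t issued;       // caps the MDS has granted
    std::uint32_t implemented;  // issued caps the client has made use of
};

// Caps the client wants but has not been granted.
std::uint32_t missing(const CapState& c) { return c.wanted & ~c.issued; }

// True if the client currently holds the given cap.
bool may(const CapState& c, std::uint32_t bit) {
    return (c.issued & bit) != 0;
}

// Record that the client exercised a cap. In this model a cap must be
// issued before it can be used, so implemented stays a subset of issued.
void mark_used(CapState& c, std::uint32_t bit) {
    assert(may(c, bit));
    c.implemented |= bit;
}
```

Under this model, a non-zero result from missing() means the client is still waiting on the MDS for something, and the implemented mask tells you which granted caps might have left state behind on the client.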
-Greg


* Re: what's the exact meaning of cap?
  2010-12-01 16:11     ` Gregory Farnum
@ 2010-12-02  1:41       ` wchen
  0 siblings, 0 replies; 17+ messages in thread
From: wchen @ 2010-12-02  1:41 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org

hi, Farnum,

Thank you very much! :-)


On Thu, 2010-12-02 at 00:11 +0800, Gregory Farnum wrote:
> On Tue, Nov 30, 2010 at 7:50 PM, wchen <wchen@tnsoft.com.cn> wrote:
> > hi, Sage,
> >
> > I'm reading the source code of ceph. Recently I'm very confused with
> > struct InodeCap's issued, implemented, wanted fields. I just have a
> > preliminary understanding of them, but what's the exact difference
> > between them? there are lots of code to reflect the relation between
> > them. Can you explain it for me? or any documents about it?
> Unfortunately we don't have a lot of documentation about the source
> code itself. Sage's thesis (which is available on the website) is the
> best resource, and does describe caps some.
> 
> Briefly, caps are short for capabilities. They are issued by the MDS
> to clients to describe what the client is allowed to do with an inode
> and its associated file. There are capabilities to, for instance, let
> the client buffer writes, cache reads, and adjust certain kinds of
> metadata (mtime, et al).
> The wanted capabilities are caps the client wants but doesn't have.
> Issued caps are ones the client has been granted, and (IIRC)
> implemented caps are a subset of the issued caps describing which caps
> the client has actually made use of. It's useful for determining
> whether the client has dirty metadata and whatnot. :)
> -Greg

-- 
best regards,
Kevin Chen (陈巍)



Thread overview: 17+ messages
2010-11-30  6:33 [PATCH] Expose Ceph data location information to Hadoop Noah Watkins
2010-11-30 16:59 ` Sage Weil
2010-11-30 17:17   ` Noah Watkins
2010-11-30 17:28     ` Gregory Farnum
2010-11-30 17:31       ` Noah Watkins
2010-12-01  3:50   ` what's the exact meaning of cap? wchen
2010-12-01 16:11     ` Gregory Farnum
2010-12-02  1:41       ` wchen
2010-11-30 17:38 ` [PATCH] Expose Ceph data location information to Hadoop Alex Nelson
2010-11-30 17:50   ` Noah Watkins
2010-11-30 18:10     ` Sage Weil
2010-11-30 18:32       ` Noah Watkins
2010-11-30 19:01         ` Sage Weil
2010-11-30 19:04           ` Noah Watkins
2010-11-30 19:41             ` Alex Nelson
     [not found]             ` <9DEABEC1-48A1-466D-9942-C0D8A199EF96@soe.ucsc.edu>
2010-11-30 20:25               ` Joe Buck
2010-11-30 21:39       ` Noah Watkins
