* Re: RBD/OSD questions
@ 2010-05-06 21:02 Martin Fick
  2010-05-06 21:22 ` Sage Weil
  2010-05-06 21:24 ` Cláudio Martins
  0 siblings, 2 replies; 14+ messages in thread
From: Martin Fick @ 2010-05-06 21:02 UTC
  To: Sage Weil; +Cc: ceph-devel

--- On Thu, 5/6/10, Sage Weil <sage@newdream.net> wrote:

> > -Also, can reads be spread out over replicas?
> > 
> > This might be a nice optimization to reduce seek
> > times under certain conditions, when there are no
> > writers or the writer is the only reader (and thus
> > is aware of all the writes even before they complete).  
> > Under these conditions it seems like it
> > would be possible to not enforce the "tail reading"
> > order of replicas and thus additionally benefit
> > from "read striping" across the replicas the way
> > many raid implementations do with RAID1.
> > 
> > I thought that this might be particularly useful
> > for RBD when it is used exclusively (say by mounting
> > a local FS)  since even with replicas, it seems like
> > it could then relax the replica tail reading 
> > constraint.
> 
> The idea certainly has its appeal, and I played with it
> for a while a few years back.  At that time I had a
> _really_ hard time trying to manufacture a workload
> scenario where it actually made things faster
> and not slower.  In general, spreading out reads will
> pollute caches (e.g., spreading across two replicas means
> caches are half as effective).  

Hmm, I wonder whether a local FS on top of RBD is a
different enough use case from ceph that such a
workload would not be very difficult to produce.
With a local FS on RBD I would expect massive local
kernel level caching.  With this in mind I wonder how
effective OSD level caching would actually be.

I am particularly thinking of heavy seeky workloads
which perhaps are already somewhat spread out due to
striping.  In other words, RAID1 (mirroring) can
decrease latencies over a non-RAID setup locally even
though that is not the objective of RAID1, but does
RAID01 decrease latencies much over RAID0?  Maybe not.
That might explain the difficulty in creating such
a scenario.

To put this in the perspective of OSD setups, if you
already have striping, using the replicas also may
not make much of a difference, but I wonder how a two
node OSD setup with double redundancy would fare?
With such a setup there will not really be any
striping, will there?  With such a setup (one that I
can easily see being popular for simple/minimal RBD
redundancy setups), perhaps replica "striping"
would help.  A 'smart' RBD could detect non
contiguous reads and spread the reads out in that
case.
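
To make the 'smart' RBD idea a bit more concrete, here
is a minimal sketch of how a client might tell
contiguous reads from seeky ones and only fan out the
seeky ones.  ReplicaPicker and its fields are
hypothetical names, not part of any Ceph or RBD API,
and the sketch assumes the exclusive-use case described
above (a single client that is the only writer).

```python
from dataclasses import dataclass


@dataclass
class ReplicaPicker:
    """Hypothetical client-side helper; not an existing Ceph/RBD interface."""

    num_replicas: int
    next_expected: int = -1  # offset just past the previous read
    cursor: int = 0          # round-robin position, 0 is the primary

    def pick(self, offset: int, length: int) -> int:
        """Return the index of the replica to read from (0 is the primary)."""
        sequential = offset == self.next_expected
        self.next_expected = offset + length
        if sequential:
            # Contiguous reads stay on the primary so its cache stays warm.
            return 0
        # Non-contiguous read: rotate across all copies to spread the seeks.
        self.cursor = (self.cursor + 1) % self.num_replicas
        return self.cursor
```

With two replicas, a stream of contiguous reads keeps
returning 0, while scattered offsets alternate between
the two copies.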

All theory I know, but it seems worth investigating 
various RBD specific workloads, at least for the 
RBD users/developers. :)


Also, with ceph many seeky workloads (small multi
file writes) might additionally already be spread
out (and thus "striped") due to CRUSH since they
are in different files.  But with RBD, it is all
one file so CRUSH will not help as much in this
respect.



> What I tried to do was use fast heartbeats between OSDs to
> share average request queue lengths, so that the primary 
> could 'shed' a read request to a replica if its queue 
> length/request latency was significantly shorter.  
> I wasn't really able to make it work.

This sounds more 'intelligent' than what I was 
suggesting since it would take the status of the 
entire OSD cluster into account, not just the 
single RBD reads.
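
For readers following along, the shedding decision
described in the quote might look roughly like the
sketch below.  It is only an illustration of the idea,
not the code that was actually tried: shed_target,
queue_len and min_gap are hypothetical names, and
min_gap stands in for "significantly shorter".

```python
from typing import Dict, Optional


def shed_target(primary: str,
                queue_len: Dict[str, int],
                min_gap: int = 4) -> Optional[str]:
    """Return the replica to forward a read to, or None to serve it locally.

    queue_len maps OSD name to its last reported request queue length
    (assumed to arrive via the heartbeats mentioned in the quote).
    """
    replicas = {osd: q for osd, q in queue_len.items() if osd != primary}
    if not replicas:
        return None
    best = min(replicas, key=replicas.get)
    # Shed only when the replica's queue is shorter by at least min_gap,
    # so the gain has a chance of covering the cost of forwarding.
    if queue_len[primary] - replicas[best] >= min_gap:
        return best
    return None
```

For example, shed_target("osd0", {"osd0": 10, "osd1": 2})
picks "osd1", while a primary queue of 3 against a
replica queue of 2 keeps the read local.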
 
> For cold objects, shedding could help, but only if 
> there is a sufficient load disparity between replicas to 
> compensate for the overhead of shedding.

I could see how "shedding" as you mean it would
add some overhead, but a simple client based
fanout shouldn't really add much overhead.  You
have designed CRUSH to allow fast direct IO with
the OSDs, so shedding seems to be a step backwards
performance wise from this design.  Client
fanout to replicas directly, on the other hand, is
really not much different than striping using
CRUSH, so it should be fast!

If this client fanout does help, one way to make
it smarter, or more cluster responsive, would be
to expose some OSD queue length info via the
client APIs, allowing clients themselves to do some
smart load balancing in these situations.  This
could be applicable not just for seeky workloads,
but also for unusual workloads which for some
reason might bog down a particular OSD.  CRUSH
should normally prevent this from happening in
a well balanced cluster, but if a cluster is not
very homogeneous and has many OSD nodes with
varying latencies and perhaps other external
(non OSD) loads on them, your queue length idea
with smart clients could help balance such a
cluster on the clients themselves.
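
A rough sketch of what such a smart client could do
with exposed queue lengths follows.  Everything here
is an assumption for illustration: no such client API
is described in this thread, and the weighted random
choice is just one possible policy, picked so that
many clients acting on the same stale numbers do not
all pile onto one OSD.

```python
import random
from typing import Dict, Optional


def replica_weights(queue_len: Dict[str, int]) -> Dict[str, float]:
    """Turn queue lengths into selection probabilities (shorter is likelier)."""
    raw = {osd: 1.0 / (1 + q) for osd, q in queue_len.items()}
    total = sum(raw.values())
    return {osd: w / total for osd, w in raw.items()}


def choose_replica(queue_len: Dict[str, int],
                   rng: Optional[random.Random] = None) -> str:
    """Pick one OSD holding a copy, biased toward the less loaded ones."""
    rng = rng or random.Random()
    weights = replica_weights(queue_len)
    osds = list(weights)
    return rng.choices(osds, weights=[weights[o] for o in osds], k=1)[0]
```

With queue lengths of 0 and 1, the idle OSD would be
chosen about two thirds of the time.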

That's a lot of armchair talking I know, 
sorry. ;)  Thanks for listening...

-Martin




      

* Re: RBD/OSD questions
@ 2010-05-06 22:28 Martin Fick
  0 siblings, 0 replies; 14+ messages in thread
From: Martin Fick @ 2010-05-06 22:28 UTC
  To: Sage Weil; +Cc: Cláudio Martins, ceph-devel

--- On Thu, 5/6/10, Sage Weil <sage@newdream.net> wrote:
> The image is striped over objects, _then_ the objects are
> replicated across OSDs.  Objects themselves aren't striped.
> 
> For example, if an image is striped over objects A B C D E
> F, each 4MB, you might end up with
> 
> osd0: A  B' C  D  E' F'
> osd1: A' B  C' D' E  F
> 
> where A is the primary copy, A' is the replica, etc.

Yes, I see now, much clearer, thanks.

So, as long as each object has one copy on each OSD it 
should be safe.  And there might be some hash based 
extra non-perfect striping as a benefit.  Then, yeah,
it does seem like it would be hard to find a very 
unbalanced workload on a heterogeneous OSD (even with 
a 2 node cluster).  Cool.
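
As a rough illustration of the quoted layout, the
sketch below maps an image byte offset to its object
and primary OSD.  The PRIMARY table is copied from the
example (unprimed letters are primaries).  The locate
helper is a hypothetical name, and it assumes the
objects A..F are simply laid end to end across the
image, which the example implies but does not state.

```python
from collections import Counter

OBJECT_SIZE = 4 * 1024 * 1024  # 4MB objects, as in the quoted example
OBJECTS = "ABCDEF"

# Primary OSD per object, read off the quoted layout.
PRIMARY = {"A": "osd0", "B": "osd1", "C": "osd0",
           "D": "osd0", "E": "osd1", "F": "osd1"}


def locate(offset: int):
    """Map an image offset to (object name, offset in object, primary OSD)."""
    index, within = divmod(offset, OBJECT_SIZE)
    name = OBJECTS[index]
    return name, within, PRIMARY[name]


# Count how many primaries each OSD holds in the example.
primaries_per_osd = Counter(PRIMARY.values())
```

In this example each OSD ends up primary for three of
the six objects, so reads that go to primaries are
already split evenly across the two nodes.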

Thanks,

-Martin



      
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* RBD/OSD questions
@ 2010-05-06 16:07 Martin Fick
  2010-05-06 17:14 ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Fick @ 2010-05-06 16:07 UTC
  To: ceph-devel

I have a few more questions.

-Can files stored in the OSD heal "incrementally"?

Suppose there are 3 replicas for a large file and that
a small byte range change occurs while replica 3 is 
down.  Will replica 3 heal efficiently when it 
returns?  Will only the small changed byte range 
be transferred?


-Also, can reads be spread out over replicas?

This might be a nice optimization to reduce seek
times under certain conditions, when there are no
writers or the writer is the only reader (and thus
is aware of all the writes even before they 
complete).  Under these conditions it seems like it
would be possible to not enforce the "tail reading"
order of replicas and thus additionally benefit
from "read striping" across the replicas the way
many raid implementations do with RAID1.

I thought that this might be particularly useful
for RBD when it is used exclusively (say by mounting
a local FS)  since even with replicas, it seems like
it could then relax the replica tail reading 
constraint.

Any thoughts?  Thanks,

-Martin



      


end of thread, other threads:[~2010-05-11 16:23 UTC | newest]

Thread overview: 14+ messages
2010-05-06 21:02 RBD/OSD questions Martin Fick
2010-05-06 21:22 ` Sage Weil
2010-05-06 21:24 ` Cláudio Martins
2010-05-06 21:31   ` Sage Weil
2010-05-06 21:41     ` Martin Fick
2010-05-06 21:54       ` Gregory Farnum
2010-05-06 22:20       ` Sage Weil
2010-05-07 16:38         ` Andreas Grimm
2010-05-07 16:43           ` Sage Weil
2010-05-11  8:39             ` Anton
2010-05-11 16:26               ` Sage Weil
  -- strict thread matches above, loose matches on Subject: below --
2010-05-06 22:28 Martin Fick
2010-05-06 16:07 Martin Fick
2010-05-06 17:14 ` Sage Weil
