OSD weighting

* OSD weighting
@ 2012-04-20  7:23 Vladimir Bashkirtsev
  2012-04-20 17:15 ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: Vladimir Bashkirtsev @ 2012-04-20  7:23 UTC (permalink / raw)
  To: ceph-devel

Dear devs,

Playing around with ceph and gradually moving it from a toy into
production, I wanted to give ceph a run for its money (so to speak).
I have assembled a number of OSDs built on very different hardware:
from an old P4 with 512MB of RAM up to a high-end Dell server, on a
mixture of 100 and 1000 Mbit networks. I will not say much about the
performance of MONs and MDSes, as they do fairly well no matter what I
throw at them. But with OSDs it is a different story. Even one full OSD
will stall the whole of ceph. I've read that this is normal and that a
good way of fighting it is a periodic health check to see that no OSD is
approaching full status. However, I believe it would be better if ceph
reduced the weighting of OSDs approaching full status, which would
effectively prevent an OSD from getting full. It should be reasonably
simple to implement and would avoid major grief if some OSD goes from
near-full to full quickly and unnoticed. I guess reweight-by-utilization
is an attempt to address the issue based on CPU performance.

In the meantime I have reverted to manual weighting of OSDs, and I found
that there is no clear explanation of how weights are actually applied.
I've seen the suggestion to keep the weight equal to the number of TBs on
the OSD. Doing so in a single rack achieved the expected result: data
spread itself proportionally to OSD sizes. But when I started to move
OSDs from the toy rack into the production rack, I also changed the
weights of the racks in the pool. I had 6 OSDs and moved 2 of them. I
changed the toyrack weight to 4.000 and productionrack to 2.000, then
waited for data to settle, only to find that disk use was no longer
proportional. Then I changed the rack weights to the total number of TBs
in each rack; data reshuffled and settled, but again I did not achieve
the expected result. So I guess the function of the weights for racks,
hosts and devices is not as straightforward as I originally thought. This
calls for a clear explanation of how weights are used in the case of the
straw algo.

Regards,
Vladimir

* Re: OSD weighting
  2012-04-20  7:23 OSD weighting Vladimir Bashkirtsev
@ 2012-04-20 17:15 ` Sage Weil
  2012-04-20 17:34   ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2012-04-20 17:15 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On Fri, 20 Apr 2012, Vladimir Bashkirtsev wrote:
> Dear devs,
> 
> Playing around with ceph and gradually moving it from a toy into
> production, I wanted to give ceph a run for its money (so to speak).
> I have assembled a number of OSDs built on very different hardware:
> from an old P4 with 512MB of RAM up to a high-end Dell server, on a
> mixture of 100 and 1000 Mbit networks. I will not say much about the
> performance of MONs and MDSes, as they do fairly well no matter what I
> throw at them. But with OSDs it is a different story. Even one full OSD
> will stall the whole of ceph. I've read that this is normal and that a
> good way of fighting it is a periodic health check to see that no OSD is
> approaching full status. However, I believe it would be better if ceph
> reduced the weighting of OSDs approaching full status, which would
> effectively prevent an OSD from getting full. It should be reasonably
> simple to implement and would avoid major grief if some OSD goes from
> near-full to full quickly and unnoticed. I guess reweight-by-utilization
> is an attempt to address the issue based on CPU performance.

The intention is to reweight based on disk utilization.  Actually, it 
assumes that the CRUSH weights are "correct", but it will make minor 
corrections in the osdmap (post-crush) to prevent statistical outliers 
from filling up.  CPU performance isn't considered at all...
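
The idea of a post-CRUSH correction for outliers can be illustrated with a
small sketch. This is not the ceph implementation: the function name, the
1.2 threshold, and the rule of scaling an outlier by average/utilization
are all assumptions made only for illustration. The CRUSH weights are left
untouched, and only OSDs well above the average utilization get an
override weight below 1.0.

```python
def reweight_outliers(utilization, threshold=1.2):
    """Return override weights (0..1) for OSDs that are statistical outliers.

    utilization: dict of OSD name -> fraction of disk used (0.0 to 1.0).
    threshold:   hypothetical cutoff; an OSD is an outlier when its
                 utilization exceeds threshold * average utilization.
    """
    average = sum(utilization.values()) / len(utilization)
    overrides = {}
    for osd, used in utilization.items():
        if used > average * threshold:
            # Assumed rule: scale the OSD down in proportion to how far
            # it sits above the average. CPU load plays no part here.
            overrides[osd] = round(average / used, 4)
    return overrides


# Hypothetical utilizations: osd.3 is much fuller than the others.
utilization = {"osd.0": 0.50, "osd.1": 0.52, "osd.2": 0.48, "osd.3": 0.90}
overrides = reweight_outliers(utilization)
print(overrides)
```

In this made-up data set only osd.3 exceeds the cutoff, so it is the only
OSD that receives an override; the others keep their full weight.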

> In the meantime I have reverted to manual weighting of OSDs, and I found
> that there is no clear explanation of how weights are actually applied.
> I've seen the suggestion to keep the weight equal to the number of TBs on
> the OSD. Doing so in a single rack achieved the expected result: data
> spread itself proportionally to OSD sizes. But when I started to move
> OSDs from the toy rack into the production rack, I also changed the
> weights of the racks in the pool. I had 6 OSDs and moved 2 of them. I
> changed the toyrack weight to 4.000 and productionrack to 2.000, then
> waited for data to settle, only to find that disk use was no longer
> proportional. Then I changed the rack weights to the total number of TBs
> in each rack; data reshuffled and settled, but again I did not achieve
> the expected result. So I guess the function of the weights for racks,
> hosts and devices is not as straightforward as I originally thought. This
> calls for a clear explanation of how weights are used in the case of the
> straw algo.

Can you share the output of 'ceph osd tree' and the last bit of 'ceph pg 
dump' (which shows actual utilizations) in that case?  There are some 
problems with CRUSH when buckets are small that we will be addressing 
soon.  I'd like to confirm that is what is going on.

Thanks!
sage

* Re: OSD weighting
  2012-04-20 17:15 ` Sage Weil
@ 2012-04-20 17:34   ` Sage Weil
  0 siblings, 0 replies; 3+ messages in thread
From: Sage Weil @ 2012-04-20 17:34 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On Fri, 20 Apr 2012, Sage Weil wrote:
> On Fri, 20 Apr 2012, Vladimir Bashkirtsev wrote:
> > In the meantime I have reverted to manual weighting of OSDs, and I found
> > that there is no clear explanation of how weights are actually applied.
> > I've seen the suggestion to keep the weight equal to the number of TBs on
> > the OSD. Doing so in a single rack achieved the expected result: data
> > spread itself proportionally to OSD sizes. But when I started to move
> > OSDs from the toy rack into the production rack, I also changed the
> > weights of the racks in the pool. I had 6
Greg just pointed this out to me: are you explicitly setting the rack
weights, so that they are no longer the sum of the items inside the rack?
That might explain what you are seeing as well.  Strictly speaking that's
okay: every decision when descending the tree looks at the relative
weights of the children of each node/bucket.  But then the end result
won't necessarily be proportional to the leaf weights.
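
The arithmetic behind this can be sketched as follows. This is a simplified
model, not CRUSH itself: it ignores replication, placement rules and the
straw bucket selection, and simply multiplies the relative weights along
the path from the root to each leaf. The rack weights 4.000 and 2.000 come
from the report above; the individual OSD sizes (1, 1, 2, 2 TB in toyrack
and 2, 2 TB in productionrack) are invented for the example.

```python
def leaf_shares(node, share=1.0):
    """Expected share of data per leaf for a (name, weight, children) tree.

    At every bucket the incoming share is split among the children in
    proportion to their weights relative to their siblings.
    """
    name, _weight, children = node
    if not children:
        return {name: share}
    total = sum(weight for _name, weight, _kids in children)
    shares = {}
    for child in children:
        shares.update(leaf_shares(child, share * child[1] / total))
    return shares


def build_tree(toy_weight, production_weight):
    """Two racks with hypothetical OSD sizes in TB as leaf weights."""
    toyrack = ("toyrack", toy_weight, [
        ("osd.0", 1.0, []), ("osd.1", 1.0, []),
        ("osd.2", 2.0, []), ("osd.3", 2.0, []),
    ])
    productionrack = ("productionrack", production_weight, [
        ("osd.4", 2.0, []), ("osd.5", 2.0, []),
    ])
    return ("root", toy_weight + production_weight, [toyrack, productionrack])


# Rack weights set by hand (4.000 / 2.000) versus the sum of the leaves (6 / 4).
explicit = leaf_shares(build_tree(4.0, 2.0))
summed = leaf_shares(build_tree(6.0, 4.0))

for osd in sorted(explicit):
    print(f"{osd}: explicit={explicit[osd]:.3f} summed={summed[osd]:.3f}")
```

With the hand-set rack weights, a 2 TB OSD in toyrack expects about 22% of
the data while an identical 2 TB OSD in productionrack expects about 17%.
When each rack weight equals the sum of its leaves, every 2 TB OSD expects
20%, which is proportional to the leaf weights.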

> > OSDs and moved 2 of them. I changed the toyrack weight to 4.000 and
> > productionrack to 2.000, then waited for data to settle, only to find
> > that disk use was no longer proportional. Then I changed the rack
> > weights to the total number of TBs in each rack; data reshuffled and
> > settled, but again I did not achieve the expected result. So I guess the
> > function of the weights for racks, hosts and devices is not as
> > straightforward as I originally thought. This calls for a clear
> > explanation of how weights are used in the case of the straw algo.
