Very unbalanced storage


* Very unbalanced storage
@ 2012-08-31 11:11 Xiaopong Tran
  2012-08-31 15:00 ` Andrew Thompson
  2012-08-31 16:05 ` Sage Weil
  0 siblings, 2 replies; 17+ messages in thread
From: Xiaopong Tran @ 2012-08-31 11:11 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi,

Ceph storage on each disk in the cluster is very unbalanced. On each
node, the data seems to go to one or two disks, while other disks
are almost empty.

I can't find anything wrong in the crush map; it's just the
default for now. The crush map is attached.

Here is the current situation on node s100001:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       932G  4.3G  927G   1% /disk1
/dev/sdc1       932G  4.3G  927G   1% /disk2
/dev/sdd1       932G  4.3G  927G   1% /disk3
/dev/sde1       932G  4.3G  927G   1% /disk4
/dev/sdf1       932G  4.3G  927G   1% /disk5
/dev/sdg1       932G  4.3G  927G   1% /disk6
/dev/sdh1       932G  4.3G  927G   1% /disk7
/dev/sdi1       932G  4.3G  927G   1% /disk8
/dev/sdj1       932G  4.3G  927G   1% /disk9
/dev/sdk1       932G  445G  487G  48% /disk10

Here, we can see that all data seem to go to one osd only, while others
are almost empty.

And here's the situation on node s200001:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       932G  443G  489G  48% /disk1
/dev/sdc1       932G  4.3G  927G   1% /disk2
/dev/sdd1       932G  4.3G  927G   1% /disk3
/dev/sde1       932G  4.3G  927G   1% /disk4
/dev/sdf1       932G  4.3G  927G   1% /disk5
/dev/sdg1       932G  4.3G  927G   1% /disk6
/dev/sdh1       932G  4.3G  927G   1% /disk7
/dev/sdi1       932G  4.3G  927G   1% /disk8
/dev/sdj1       932G  449G  483G  49% /disk9
/dev/sdk1       932G  4.3G  927G   1% /disk10

The situation is a bit better, but not by much: the data are stored
mainly on two disks.

Here is a better situation, on node s100002:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       1.9T  453G  1.4T  25% /disk1
/dev/sdc1       1.9T  4.3G  1.9T   1% /disk2
/dev/sdd1       1.9T  4.4G  1.9T   1% /disk3
/dev/sde1       1.9T  4.3G  1.9T   1% /disk4
/dev/sdf1       1.9T  457G  1.4T  25% /disk5
/dev/sdg1       1.9T  443G  1.4T  24% /disk6
/dev/sdh1       1.9T  4.4G  1.9T   1% /disk7
/dev/sdi1       1.9T  4.4G  1.9T   1% /disk8
/dev/sdj1       1.9T  427G  1.5T  23% /disk9
/dev/sdk1       1.9T  4.4G  1.9T   1% /disk10

It's better than the other two, but still not what I expected. I
expected the data to be spread out according to the weight of each
osd, as defined in the crush map. Or at least, as close to that
as possible. It might be just some obviously stupid config error,
but I don't know. This can't be normal, can it?

Thanks for any hint.

Xiaopong

[-- Attachment #2: crush.txt --]
[-- Type: text/plain, Size: 5097 bytes --]

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70
device 71 osd.71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host s100001 {
	id -2		# do not change unnecessarily
	# weight 10.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
	item osd.1 weight 1.000
	item osd.2 weight 1.000
	item osd.3 weight 1.000
	item osd.4 weight 1.000
	item osd.5 weight 1.000
	item osd.6 weight 1.000
	item osd.7 weight 1.000
	item osd.8 weight 1.000
	item osd.9 weight 1.000
}
host s200001 {
	id -4		# do not change unnecessarily
	# weight 10.000
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.12 weight 1.000
	item osd.13 weight 1.000
	item osd.14 weight 1.000
	item osd.15 weight 1.000
	item osd.16 weight 1.000
	item osd.17 weight 1.000
	item osd.18 weight 1.000
	item osd.19 weight 1.000
}
host s300001 {
	id -5		# do not change unnecessarily
	# weight 10.000
	alg straw
	hash 0	# rjenkins1
	item osd.20 weight 1.000
	item osd.21 weight 1.000
	item osd.22 weight 1.000
	item osd.23 weight 1.000
	item osd.24 weight 1.000
	item osd.25 weight 1.000
	item osd.26 weight 1.000
	item osd.27 weight 1.000
	item osd.28 weight 1.000
	item osd.29 weight 1.000
}
host s100002 {
	id -6		# do not change unnecessarily
	# weight 20.000
	alg straw
	hash 0	# rjenkins1
	item osd.30 weight 2.000
	item osd.31 weight 2.000
	item osd.32 weight 2.000
	item osd.33 weight 2.000
	item osd.34 weight 2.000
	item osd.35 weight 2.000
	item osd.36 weight 2.000
	item osd.37 weight 2.000
	item osd.38 weight 2.000
	item osd.39 weight 2.000
}
host s200002 {
	id -7		# do not change unnecessarily
	# weight 20.000
	alg straw
	hash 0	# rjenkins1
	item osd.40 weight 2.000
	item osd.41 weight 2.000
	item osd.42 weight 2.000
	item osd.43 weight 2.000
	item osd.44 weight 2.000
	item osd.45 weight 2.000
	item osd.46 weight 2.000
	item osd.47 weight 2.000
	item osd.48 weight 2.000
	item osd.49 weight 2.000
}
host s300002 {
	id -8		# do not change unnecessarily
	# weight 20.000
	alg straw
	hash 0	# rjenkins1
	item osd.50 weight 2.000
	item osd.51 weight 2.000
	item osd.52 weight 2.000
	item osd.53 weight 2.000
	item osd.54 weight 2.000
	item osd.55 weight 2.000
	item osd.56 weight 2.000
	item osd.57 weight 2.000
	item osd.58 weight 2.000
	item osd.59 weight 2.000
}
host s100003 {
	id -9		# do not change unnecessarily
	# weight 16.000
	alg straw
	hash 0	# rjenkins1
	item osd.60 weight 2.000
	item osd.61 weight 2.000
	item osd.62 weight 2.000
	item osd.63 weight 2.000
	item osd.64 weight 2.000
	item osd.65 weight 2.000
	item osd.66 weight 2.000
	item osd.67 weight 2.000
}
host s200003 {
	id -10		# do not change unnecessarily
	# weight 16.000
	alg straw
	hash 0	# rjenkins1
	item osd.68 weight 2.000
	item osd.69 weight 2.000
	item osd.70 weight 2.000
	item osd.71 weight 2.000
	item osd.72 weight 2.000
	item osd.73 weight 2.000
	item osd.74 weight 2.000
	item osd.75 weight 2.000
}
rack unknownrack {
	id -3		# do not change unnecessarily
	# weight 122.000
	alg straw
	hash 0	# rjenkins1
	item s100001 weight 10.000
	item s200001 weight 10.000
	item s300001 weight 10.000
	item s100002 weight 20.000
	item s200002 weight 20.000
	item s300002 weight 20.000
	item s100003 weight 16.000
	item s200003 weight 16.000
}
pool default {
	id -1		# do not change unnecessarily
	# weight 122.000
	alg straw
	hash 0	# rjenkins1
	item unknownrack weight 122.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map

* Re: Very unbalanced storage
  2012-08-31 11:11 Very unbalanced storage Xiaopong Tran
@ 2012-08-31 15:00 ` Andrew Thompson
  2012-08-31 16:10   ` Sage Weil
  2012-08-31 16:05 ` Sage Weil
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Thompson @ 2012-08-31 15:00 UTC (permalink / raw)
  To: Xiaopong Tran; +Cc: ceph-devel@vger.kernel.org

On 8/31/2012 7:11 AM, Xiaopong Tran wrote:
> Hi,
>
> Ceph storage on each disk in the cluster is very unbalanced. On each
> node, the data seems to go to one or two disks, while other disks
> are almost empty.
>
> I can't find anything wrong from the crush map, it's just the
> default for now. Attached is the crush map.

Have you been reweighting osds? I went round and round with my cluster
a few days ago, reloading different crush maps, only to find that
re-injecting a crush map didn't seem to overwrite reweights.

Take a look at `ceph osd tree` to see if the reweight column matches the 
weight column.

Note: I'm new at this. This is my experience only, with 0.48.1, and may 
not be entirely correct.

-- 
Andrew Thompson
http://aktzero.com/

* Re: Very unbalanced storage
  2012-08-31 11:11 Very unbalanced storage Xiaopong Tran
  2012-08-31 15:00 ` Andrew Thompson
@ 2012-08-31 16:05 ` Sage Weil
  2012-09-01  2:52   ` Xiaopong Tran
  1 sibling, 1 reply; 17+ messages in thread
From: Sage Weil @ 2012-08-31 16:05 UTC (permalink / raw)
  To: Xiaopong Tran; +Cc: ceph-devel@vger.kernel.org

On Fri, 31 Aug 2012, Xiaopong Tran wrote:
> Hi,
> 
> Ceph storage on each disk in the cluster is very unbalanced. On each
> node, the data seems to go to one or two disks, while other disks
> are almost empty.
> 
> I can't find anything wrong from the crush map, it's just the
> default for now. Attached is the crush map.

This is usually a problem with the pg_num for the pool you are using.  Can 
you include the output from 'ceph osd dump | grep ^pool'?  By default, 
pools get 8 pgs, which will distribute poorly.

sage



* Re: Very unbalanced storage
  2012-08-31 15:00 ` Andrew Thompson
@ 2012-08-31 16:10   ` Sage Weil
  2012-08-31 16:24     ` Andrew Thompson
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2012-08-31 16:10 UTC (permalink / raw)
  To: Andrew Thompson; +Cc: Xiaopong Tran, ceph-devel@vger.kernel.org

On Fri, 31 Aug 2012, Andrew Thompson wrote:
> On 8/31/2012 7:11 AM, Xiaopong Tran wrote:
> > Hi,
> > 
> > Ceph storage on each disk in the cluster is very unbalanced. On each
> > node, the data seems to go to one or two disks, while other disks
> > are almost empty.
> > 
> > I can't find anything wrong from the crush map, it's just the
> > default for now. Attached is the crush map.
> 
> Have you been reweight-ing osds? I went round and round with my cluster a few
> days ago reloading different crush maps only to find that re-injecting a
> crush map didn't seem to overwrite reweights.
> 
> Take a look at `ceph osd tree` to see if the reweight column matches the
> weight column.

Note that the ideal situation is for reweight to be 1, regardless of what 
the crush weight is.  If you find the utilizations are skewed, I would 
look for other causes before resorting to reweight-by-utilization; it is 
meant to adjust the normal statistical variation you expect from a 
(pseudo)random placement, but if the variance is high there is likely 
another cause.

sage


* Re: Very unbalanced storage
  2012-08-31 16:10   ` Sage Weil
@ 2012-08-31 16:24     ` Andrew Thompson
  2012-08-31 16:39       ` Gregory Farnum
  2012-08-31 16:43       ` Sage Weil
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Thompson @ 2012-08-31 16:24 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

On 8/31/2012 12:10 PM, Sage Weil wrote:
> On Fri, 31 Aug 2012, Andrew Thompson wrote:
>> Have you been reweight-ing osds? I went round and round with my 
>> cluster a few days ago reloading different crush maps only to find 
>> that re-injecting a crush map didn't seem to overwrite reweights.
>> Take a look at `ceph osd tree` to see if the reweight column matches 
>> the weight column. 
> Note that the ideal situation is for reweight to be 1, regardless of what
> the crush weight is.  If you find the utilizations are skewed, I would
> look for other causes before resorting to reweight-by-utilization; it is
> meant to adjust the normal statistical variation you expect from a
> (pseudo)random placement, but if the variance is high there is likely
> another cause.

So if someone (me, guilty) had been messing with reweight, will setting
them all to 1 return the cluster to a normal, un-reweighted state?

-- 
Andrew Thompson
http://aktzero.com/


* Re: Very unbalanced storage
  2012-08-31 16:24     ` Andrew Thompson
@ 2012-08-31 16:39       ` Gregory Farnum
  2012-09-01  2:33         ` Xiaopong Tran
  2012-08-31 16:43       ` Sage Weil
  1 sibling, 1 reply; 17+ messages in thread
From: Gregory Farnum @ 2012-08-31 16:39 UTC (permalink / raw)
  To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson <andrewkt@aktzero.com> wrote:
> On 8/31/2012 12:10 PM, Sage Weil wrote:
>>
>> On Fri, 31 Aug 2012, Andrew Thompson wrote:
>>>
>>> Have you been reweight-ing osds? I went round and round with my cluster a
>>> few days ago reloading different crush maps only to find that
>>> re-injecting a crush map didn't seem to overwrite reweights. Take a look at
>>> `ceph osd tree` to see if the reweight column matches the weight column.
>>
>> Note that the ideal situation is for reweight to be 1, regardless of what
>> the crush weight is.  If you find the utilizations are skewed, I would
>> look for other causes before resorting to reweight-by-utilization; it is
>> meant to adjust the normal statistical variation you expect from a
>> (pseudo)random placement, but if the variance is high there is likely
>> another cause.
>
>
> So if someone(me, guilty) had been messing with reweight, will setting them
> all to 1 return it to a normal un-reweight-ed state?

Yep!
If you have OSDs with different sizes you'll want to adjust the CRUSH
weights, not the reweight values:
http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight

* Re: Very unbalanced storage
  2012-08-31 16:24     ` Andrew Thompson
  2012-08-31 16:39       ` Gregory Farnum
@ 2012-08-31 16:43       ` Sage Weil
  1 sibling, 0 replies; 17+ messages in thread
From: Sage Weil @ 2012-08-31 16:43 UTC (permalink / raw)
  To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Fri, 31 Aug 2012, Andrew Thompson wrote:
> On 8/31/2012 12:10 PM, Sage Weil wrote:
> > On Fri, 31 Aug 2012, Andrew Thompson wrote:
> > > Have you been reweight-ing osds? I went round and round with my cluster a
> > > few days ago reloading different crush maps only to find that
> > > re-injecting a crush map didn't seem to overwrite reweights. Take a look
> > > at `ceph osd tree` to see if the reweight column matches the weight
> > > column. 
> > Note that the ideal situation is for reweight to be 1, regardless of what
> > the crush weight is.  If you find the utilizations are skewed, I would
> > look for other causes before resorting to reweight-by-utilization; it is
> > meant to adjust the normal statistical variation you expect from a
> > (pseudo)random placement, but if the variance is high there is likely
> > another cause.
> 
> So if someone(me, guilty) had been messing with reweight, will setting them
> all to 1 return it to a normal un-reweight-ed state?

Yep!  :)

sage


* Re: Very unbalanced storage
  2012-08-31 16:39       ` Gregory Farnum
@ 2012-09-01  2:33         ` Xiaopong Tran
  2012-09-01  3:07           ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Xiaopong Tran @ 2012-09-01  2:33 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Andrew Thompson, ceph-devel@vger.kernel.org

On 09/01/2012 12:39 AM, Gregory Farnum wrote:
> On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson <andrewkt@aktzero.com> wrote:
>> On 8/31/2012 12:10 PM, Sage Weil wrote:
>>>
>>> On Fri, 31 Aug 2012, Andrew Thompson wrote:
>>>>
>>>> Have you been reweight-ing osds? I went round and round with my cluster a
>>>> few days ago reloading different crush maps only to find that
>>>> re-injecting a crush map didn't seem to overwrite reweights. Take a look at
>>>> `ceph osd tree` to see if the reweight column matches the weight column.
>>>
>>> Note that the ideal situation is for reweight to be 1, regardless of what
>>> the crush weight is.  If you find the utilizations are skewed, I would
>>> look for other causes before resorting to reweight-by-utilization; it is
>>> meant to adjust the normal statistical variation you expect from a
>>> (pseudo)random placement, but if the variance is high there is likely
>>> another cause.
>>
>>
>> So if someone(me, guilty) had been messing with reweight, will setting them
>> all to 1 return it to a normal un-reweight-ed state?
>
> Yep!
> If you have OSDs with different sizes you'll want to adjust the CRUSH
> weights, not the reweight values:
> http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight

Thanks for the reply. Yes, that is what I did. We have 1TB and 2TB disks,
so I used 1TB as the baseline with a weight of 1.0, and I want each 2TB
disk to store twice as much data, so that all disks stay at roughly the
same relative fill level.
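
To make the intent of those weights concrete, here is a minimal sketch of
the idealized outcome, assuming data is spread exactly in proportion to
CRUSH weight. The weights are copied from the crush map attached to the
first message; the helper name expected_share is only illustrative, and
real placement is pseudo-random, so it will only approximate these
numbers.

```python
# Weights taken from the crush map attached earlier in the thread:
# osd.0-29 sit on 1TB disks (weight 1.0), osd.30-75 on 2TB disks (weight 2.0).
weights = {f"osd.{i}": 1.0 for i in range(30)}
weights.update({f"osd.{i}": 2.0 for i in range(30, 76)})

# Should match the "# weight 122.000" comment on the top-level bucket.
total = sum(weights.values())


def expected_share(osd: str) -> float:
    """Idealized fraction of all stored data that lands on one OSD."""
    return weights[osd] / total


share_1tb = expected_share("osd.0")
share_2tb = expected_share("osd.30")

print(f"total weight: {total}")
print(f"1TB disk share: {share_1tb:.2%}")
print(f"2TB disk share: {share_2tb:.2%}")

# Relative fill is share divided by capacity (in TB). With weights that are
# proportional to capacity, it comes out the same for both disk sizes.
print(f"relative fill, 1TB vs 2TB: {share_1tb / 1.0:.5f} vs {share_2tb / 2.0:.5f}")
```

In this idealized model a 2TB disk receives twice the data of a 1TB disk,
which is the behaviour described above.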

Originally, every osd had a weight of 1.0, and I ran:

ceph osd crush reweight osd.30 2.0

and did the same for all the other 2TB disks.

And that's probably what caused the skew afterward. The crush map
attached in my last message was fetched from the cluster, and

ceph osd tree

does show the weight of the 2TB disks as 2, but the reweight is 1.

Now I'm getting confused by the meaning of crush weight :)

Best,

Xiaopong






* Re: Very unbalanced storage
  2012-08-31 16:05 ` Sage Weil
@ 2012-09-01  2:52   ` Xiaopong Tran
  2012-09-01  3:05     ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Xiaopong Tran @ 2012-09-01  2:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

On 09/01/2012 12:05 AM, Sage Weil wrote:
> On Fri, 31 Aug 2012, Xiaopong Tran wrote:
>> Hi,
>>
>> Ceph storage on each disk in the cluster is very unbalanced. On each
>> node, the data seems to go to one or two disks, while other disks
>> are almost empty.
>>
>> I can't find anything wrong from the crush map, it's just the
>> default for now. Attached is the crush map.
>
> This is usually a problem with the pg_num for the pool you are using.  Can
> you include the output from 'ceph osd dump | grep ^pool'?  By default,
> pools get 8 pgs, which will distribute poorly.
>
> sage
>
>
Here is the pool I'm interested in:

pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 216 owner 0

So, ok, by default, the pg_num is really small. That's a very dumb
mistake I made. Is there any easy way to change this?
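
As a rough sanity check of why pg_num 8 produces the df output from the
first message, here is a back-of-the-envelope sketch. It assumes PG
copies are spread over hosts in proportion to host weight, which only
approximates what the chooseleaf rule does, and it uses the pool line
above plus the host weights from the attached crush map.

```python
pg_num = 8            # from the 'ceph osd dump' line above
replicas = 3          # "rep size 3" in the same line
num_osds = 76         # osd.0 .. osd.75 in the attached crush map
total_weight = 122.0  # weight of the top-level bucket in the crush map

# Each placement group is stored 'replicas' times, so this is the total
# number of PG copies that exist anywhere in the cluster for this pool.
pg_copies = pg_num * replicas
print(f"PG copies to place: {pg_copies} across {num_osds} OSDs")
print(f"OSDs that cannot hold any of this pool: at least {num_osds - pg_copies}")

# Approximate expectation per host, proportional to host weight.
host_weights = {"s100001": 10.0, "s200001": 10.0, "s100002": 20.0}
for host, weight in host_weights.items():
    expected = pg_copies * weight / total_weight
    print(f"{host}: about {expected:.1f} PG copies expected")
```

Under these assumptions a weight-10 host expects roughly two PG copies and
a weight-20 host roughly four, which is in line with the one, two, and
four busy disks reported for s100001, s200001, and s100002.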

I looked at the tunables. If I upgrade to v0.48.1 or v0.49,
would I then be able to tune the pg_num value?

Thanks for any help, this is quite a serious issue.

Xiaopong

* Re: Very unbalanced storage
  2012-09-01  2:52   ` Xiaopong Tran
@ 2012-09-01  3:05     ` Sage Weil
  2012-09-01  3:15       ` Xiaopong Tran
  2012-09-01  6:58       ` Andrew Thompson
  0 siblings, 2 replies; 17+ messages in thread
From: Sage Weil @ 2012-09-01  3:05 UTC (permalink / raw)
  To: Xiaopong Tran; +Cc: ceph-devel@vger.kernel.org

On Sat, 1 Sep 2012, Xiaopong Tran wrote:
> On 09/01/2012 12:05 AM, Sage Weil wrote:
> > On Fri, 31 Aug 2012, Xiaopong Tran wrote:
> > > Hi,
> > > 
> > > Ceph storage on each disk in the cluster is very unbalanced. On each
> > > node, the data seems to go to one or two disks, while other disks
> > > are almost empty.
> > > 
> > > I can't find anything wrong from the crush map, it's just the
> > > default for now. Attached is the crush map.
> > 
> > This is usually a problem with the pg_num for the pool you are using.  Can
> > you include the output from 'ceph osd dump | grep ^pool'?  By default,
> > pools get 8 pgs, which will distribute poorly.
> > 
> > sage
> > 
> > 
> Here is the pool I'm interested in:
> 
> pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
> pgp_num 8 last_change 216 owner 0
> 
> So, ok, by default, the pg_num is really small. That's a very dumb
> mistake I made. Is there any easy way to change this?

I think me choosing 8 as the default was the dumb thing :)
 
> I looked at the tunables, if I upgrade to v0.48.1 or v0.49,
> then would I be able to tune the pg_num value?

Sadly you can't yet adjust pg_num for an active pool.  You can create a 
new pool,

	ceph osd pool create <name> <pg_num>

I would aim for 20 * num_osd, or thereabouts; see

	http://ceph.com/docs/master/ops/manage/grow/placement-groups/
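
For this cluster, the rule of thumb above works out as in the sketch
below. The OSD count of 76 comes from the attached crush map. Rounding up
to a power of two is an extra assumption that is not part of this thread;
both helper names are illustrative only.

```python
def suggested_pg_num(num_osds: int, pgs_per_osd: int = 20) -> int:
    """Rule of thumb from this message: roughly 20 PGs per OSD."""
    return num_osds * pgs_per_osd


def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (assumed convention, not from the thread)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


num_osds = 76  # osd.0 .. osd.75 in the attached crush map
target = suggested_pg_num(num_osds)

print(f"rule-of-thumb pg_num: {target}")
print(f"rounded up to a power of two: {next_power_of_two(target)}")
```

Either value is far larger than the default of 8, which is the point: with
many more PGs than OSDs, each OSD ends up holding a share of the pool.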

Then you can copy the data from the old pool to the new one with

	rados cppool yunio2 yunio3

This won't be particularly fast, but it will work.  You can also do

	ceph osd pool rename <oldname> <newname>
	ceph osd pool delete <name>
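
Put together, the steps above could be written down as a dry-run plan like
the sketch below. It only builds and prints command strings and runs
nothing. The command forms are the ones quoted in this message; the
ordering of the delete and rename steps, the function name, and the pool
name yunio3 as the temporary target are assumptions, and the sketch says
nothing about whether clients may keep writing during the copy.

```python
def migration_plan(old_pool: str, new_pool: str, pg_num: int) -> list:
    """Return the command lines, in order, without executing anything."""
    return [
        # 1. Create the replacement pool with a sensible number of PGs.
        f"ceph osd pool create {new_pool} {pg_num}",
        # 2. Copy every object from the old pool into the new one.
        f"rados cppool {old_pool} {new_pool}",
        # 3. Assumed order: drop the old pool only after checking the copy.
        f"ceph osd pool delete {old_pool}",
        # 4. Give the new pool the old name so clients need no changes.
        f"ceph osd pool rename {new_pool} {old_pool}",
    ]


plan = migration_plan("yunio2", "yunio3", 1520)
for step in plan:
    print(step)
```

Printing the plan first makes it easy to review each step before running
anything by hand, which matters here because the delete step is not
reversible.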

I hope this solves your problem!
sage


* Re: Very unbalanced storage
  2012-09-01  2:33         ` Xiaopong Tran
@ 2012-09-01  3:07           ` Sage Weil
  0 siblings, 0 replies; 17+ messages in thread
From: Sage Weil @ 2012-09-01  3:07 UTC (permalink / raw)
  To: Xiaopong Tran; +Cc: Gregory Farnum, Andrew Thompson, ceph-devel@vger.kernel.org

On Sat, 1 Sep 2012, Xiaopong Tran wrote:
> On 09/01/2012 12:39 AM, Gregory Farnum wrote:
> > On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson <andrewkt@aktzero.com>
> > wrote:
> > > On 8/31/2012 12:10 PM, Sage Weil wrote:
> > > > 
> > > > On Fri, 31 Aug 2012, Andrew Thompson wrote:
> > > > > 
> > > > > Have you been reweight-ing osds? I went round and round with my
> > > > > cluster a
> > > > > > few days ago reloading different crush maps only to find that
> > > > > re-injecting a crush map didn't seem to overwrite reweights. Take a
> > > > > look at
> > > > > `ceph osd tree` to see if the reweight column matches the weight
> > > > > column.
> > > > 
> > > > Note that the ideal situation is for reweight to be 1, regardless of
> > > > what
> > > > the crush weight is.  If you find the utilizations are skewed, I would
> > > > look for other causes before resorting to reweight-by-utilization; it is
> > > > meant to adjust the normal statistical variation you expect from a
> > > > (pseudo)random placement, but if the variance is high there is likely
> > > > another cause.
> > > 
> > > 
> > > So if someone(me, guilty) had been messing with reweight, will setting
> > > them
> > > all to 1 return it to a normal un-reweight-ed state?
> > 
> > Yep!
> > If you have OSDs with different sizes you'll want to adjust the CRUSH
> > weights, not the reweight values:
> > http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight
> 
> Thanks for the reply. Yes, this was what I did, we had 1TB and 2TB HD,
> so using 1TB as the base line, with weight being 1.0, then I'd like that
> the 2TB HD store 2x amount of data, so that the disks always have
> roughly same relative amount of data.
> 
> Originally, every osd has weight of 1.0, and I did:
> 
> ceph osd crush reweight osd.30 2.0
> 
> and all the 2TB disks.
> 
> And that's probably what caused the skew afterward. The crush map
> attached in my last message was fetched from the cluster, and
> 
> ceph osd tree
> 
> does show that the weight of the 2TB disks as 2, but reweight is 1.
> 
> Now I'm getting confused by the meaning of crush weight :)

Yes, sorry.  The left one (in osd tree) is the 'crush weight', and the 
right one is the 'reweight', which you can think of as a non-binary state 
of failure.  0 == failed (and everything remapped elsewhere), 1 == normal, 
and anything in between meaning that some fraction of the content is 
remapped elsewhere.
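
One way to picture that "non-binary state of failure" is the toy model
below. It is not Ceph's implementation, just a sketch that assumes each
PG mapped to an OSD is kept with probability equal to the reweight value,
decided by a deterministic hash so the same PG always gets the same
answer. The function names and the use of SHA-256 are illustrative
choices.

```python
import hashlib


def keeps_pg(osd_id: int, pg_id: int, reweight: float) -> bool:
    """Toy model: keep the PG if a hash-derived draw in [0, 1) is below reweight."""
    digest = hashlib.sha256(f"{osd_id}:{pg_id}".encode()).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64
    return draw < reweight


def kept_fraction(osd_id: int, reweight: float, num_pgs: int = 10_000) -> float:
    """Fraction of PGs that stay on the OSD; the rest would be remapped."""
    kept = sum(keeps_pg(osd_id, pg, reweight) for pg in range(num_pgs))
    return kept / num_pgs


for value in (0.0, 0.5, 1.0):
    print(f"reweight {value}: keeps {kept_fraction(0, value):.1%} of its PGs")
```

In this model 0 behaves like a failed OSD, 1 like a normal one, and values
in between shed the corresponding fraction of content, independently of
the CRUSH weight that decided how much was mapped there in the first
place.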

sage

* Re: Very unbalanced storage
  2012-09-01  3:05     ` Sage Weil
@ 2012-09-01  3:15       ` Xiaopong Tran
  2012-09-01  6:58       ` Andrew Thompson
  1 sibling, 0 replies; 17+ messages in thread
From: Xiaopong Tran @ 2012-09-01  3:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

On 09/01/2012 11:05 AM, Sage Weil wrote:
> On Sat, 1 Sep 2012, Xiaopong Tran wrote:
>> On 09/01/2012 12:05 AM, Sage Weil wrote:
>>> On Fri, 31 Aug 2012, Xiaopong Tran wrote:
>>>> Hi,
>>>>
>>>> Ceph storage on each disk in the cluster is very unbalanced. On each
>>>> node, the data seems to go to one or two disks, while other disks
>>>> are almost empty.
>>>>
>>>> I can't find anything wrong from the crush map, it's just the
>>>> default for now. Attached is the crush map.
>>>
>>> This is usually a problem with the pg_num for the pool you are using.  Can
>>> you include the output from 'ceph osd dump | grep ^pool'?  By default,
>>> pools get 8 pgs, which will distribute poorly.
>>>
>>> sage
>>>
>>>
>> Here is the pool I'm interested in:
>>
>> pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
>> pgp_num 8 last_change 216 owner 0
>>
>> So, ok, by default, the pg_num is really small. That's a very dumb
>> mistake I made. Is there any easy way to change this?
>
> I think me choosing 8 as the default was the dumb thing :)
>
>> I looked at the tunables, if I upgrade to v0.48.1 or v0.49,
>> then would I be able to tune the pg_num value?
>
> Sadly you can't yet adjust pg_num for an active pool.  You can create a
> new pool,
>
> 	ceph osd pool create <name> <pg_num>
>
> I would aim for 20 * num_osd, or thereabouts.. see
>
> 	http://ceph.com/docs/master/ops/manage/grow/placement-groups/
>
> Then you can copy the data from the old pool to the new one with
>
> 	rados cppool yunio2 yunio3
>
> This won't be particularly fast, but it will work.  You can also do
>
> 	ceph osd pool rename <oldname> <newname>
> 	ceph osd pool delete <name>
>
> I hope this solves your problem!
> sage
>

Ok, this is going to be painful. But do I have to stop using
the current pool completely while I do

rados cppool yunio2 yunio3

? This is not something I can do now :)

But this wiki describes a nice way to increase the number of PGs:

http://ceph.com/wiki/Changing_the_number_of_PGs

Even if I upgrade to v0.48.1, can this command only change the number
of PGs when the pool is empty?

Thanks

Xiaopong


* Re: Very unbalanced storage
  2012-09-01  3:05     ` Sage Weil
  2012-09-01  3:15       ` Xiaopong Tran
@ 2012-09-01  6:58       ` Andrew Thompson
  2012-09-04 15:59         ` Tommi Virtanen
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Thompson @ 2012-09-01  6:58 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

On 8/31/2012 11:05 PM, Sage Weil wrote:
> Sadly you can't yet adjust pg_num for an active pool.  You can create a
> new pool,
>
> 	ceph osd pool create <name> <pg_num>
>
> I would aim for 20 * num_osd, or thereabouts.. see
>
> 	http://ceph.com/docs/master/ops/manage/grow/placement-groups/
>
> Then you can copy the data from the old pool to the new one with
>
> 	rados cppool yunio2 yunio3
>
> This won't be particularly fast, but it will work.  You can also do
>
> 	ceph osd pool rename <oldname> <newname>
> 	ceph osd pool delete <name>
>
> I hope this solves your problem!

Looking at old archives, I found this thread which shows that to mount a 
pool as cephfs, it needs to be added to mds:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685

I started a `rados cppool data tempstore` a couple hours ago. When it 
finishes, will I need to remove the current pool from mds somehow (other 
than just deleting the pool)?

Is `ceph mds add_data_pool <poolname>` still required? (It's not listed 
in `ceph --help`.)

Thanks.

-- 
Andrew Thompson
http://aktzero.com/



* Re: Very unbalanced storage
  2012-09-01  6:58       ` Andrew Thompson
@ 2012-09-04 15:59         ` Tommi Virtanen
  2012-09-04 16:19           ` Andrew Thompson
  0 siblings, 1 reply; 17+ messages in thread
From: Tommi Virtanen @ 2012-09-04 15:59 UTC (permalink / raw)
  To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson <andrewkt@aktzero.com> wrote:
> Looking at old archives, I found this thread which shows that to mount a
> pool as cephfs, it needs to be added to mds:
>
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
>
> I started a `rados cppool data tempstore` a couple hours ago. When it
> finishes, will I need to remove the current pool from mds somehow (other than
> just deleting the pool)?
>
> Is `ceph mds add_data_pool <poolname>` still required? (It's not listed in
> `ceph --help`.)

If the pool you are trying to grow pg_num for really is a CephFS data
pool, I fear a "rados cppool" is nowhere near enough to perform a
migration. My understanding is that each of the inodes stored in
cephfs/on ceph-mds'es knows what pool the file data resides in; you
shoveling the objects into another pool with "rados cppool" doesn't
change these pointers, removing the old pool will just break the
filesystem.

Before we go too far down this road: is your problem pool *really*
being used as a cephfs data pool? Based on how it's not named "data"
and you're just now asking about "ceph mds add_data_pool", it seems
that's not likely..


* Re: Very unbalanced storage
  2012-09-04 15:59         ` Tommi Virtanen
@ 2012-09-04 16:19           ` Andrew Thompson
  2012-09-04 16:43             ` Sage Weil
  2012-09-04 16:48             ` Tommi Virtanen
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Thompson @ 2012-09-04 16:19 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel@vger.kernel.org

On 9/4/2012 11:59 AM, Tommi Virtanen wrote:
> On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson <andrewkt@aktzero.com> wrote:
>> Looking at old archives, I found this thread which shows that to mount a
>> pool as cephfs, it needs to be added to mds:
>>
>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
>>
>> I started a `rados cppool data tempstore` a couple hours ago. When it
>> finishes, will I need to remove the current pool from mds somehow (other than
>> just deleting the pool)?
>>
>> Is `ceph mds add_data_pool <poolname>` still required? (It's not listed in
>> `ceph --help`.)
> If the pool you are trying to grow pg_num for really is a CephFS data
> pool, I fear a "rados cppool" is nowhere near enough to perform a
> migration. My understanding is that each of the inodes stored in
> cephfs/on ceph-mds'es knows what pool the file data resides in; you
> shoveling the objects into another pool with "rados cppool" doesn't
> change these pointers, removing the old pool will just break the
> filesystem.
>
> Before we go too far down this road: is your problem pool *really*
> being used as a cephfs data pool? Based on how it's not named "data"
> and you're just now asking about "ceph mds add_data_pool", it seems
> that's not likely..

Well, I guess it's time to wipe this cluster and start over.

Yes, it was my `data` pool I was trying to grow. After renaming and 
removing the original data pool, I can `ls` my folders/files, but not 
access them.

I attempted a tar backup beforehand, so unless it flaked out, I should 
be able to recover data.

I was concerned the small number of PGs created by default by mkcephfs 
would be an issue, so I was trying to up it a bit. I'm not going to have 
100+ OSDs or petabytes of data. I just want a relatively safe place to 
store my files that I can easily extend as needed.

So far, I'm 0 and 5... I keep blowing up the filesystem, one way or another.

-- 
Andrew Thompson
http://aktzero.com/



* Re: Very unbalanced storage
  2012-09-04 16:19           ` Andrew Thompson
@ 2012-09-04 16:43             ` Sage Weil
  2012-09-04 16:48             ` Tommi Virtanen
  1 sibling, 0 replies; 17+ messages in thread
From: Sage Weil @ 2012-09-04 16:43 UTC (permalink / raw)
  To: Andrew Thompson; +Cc: Tommi Virtanen, ceph-devel@vger.kernel.org

On Tue, 4 Sep 2012, Andrew Thompson wrote:
> On 9/4/2012 11:59 AM, Tommi Virtanen wrote:
> > On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson <andrewkt@aktzero.com>
> > wrote:
> > > Looking at old archives, I found this thread which shows that to mount a
> > > pool as cephfs, it needs to be added to mds:
> > > 
> > > http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
> > > 
> > > I started a `rados cppool data tempstore` a couple hours ago. When it
> > > > finishes, will I need to remove the current pool from mds somehow (other
> > > > than just deleting the pool)?
> > > 
> > > Is `ceph mds add_data_pool <poolname>` still required? (It's not listed in
> > > `ceph --help`.)
> > If the pool you are trying to grow pg_num for really is a CephFS data
> > pool, I fear a "rados cppool" is nowhere near enough to perform a
> > migration. My understanding is that each of the inodes stored in
> > cephfs/on ceph-mds'es knows what pool the file data resides in; you
> > shoveling the objects into another pool with "rados cppool" doesn't
> > change these pointers, removing the old pool will just break the
> > filesystem.
> > 
> > Before we go too far down this road: is your problem pool *really*
> > being used as a cephfs data pool? Based on how it's not named "data"
> > and you're just now asking about "ceph mds add_data_pool", it seems
> > that's not likely..
> 
> Well, I guess it's time to wipe this cluster and start over.
> 
> Yes, it was my `data` pool I was trying to grow. After renaming and removing
> the original data pool, I can `ls` my folders/files, but not access them.

Yeah.  Sorry I didn't catch this earlier, but TV is right: the ceph fs 
inodes refer to the data pool by id #, not by name, so the cppool 
trick won't work in the fs case.

> I attempted a tar backup beforehand, so unless it flaked out, I should be able
> to recover data.
> 
> I was concerned the small number of PGs created by default by mkcephfs would
> be an issue, so I was trying to up it a bit. I'm not going to have 100+ OSDs
> or petabytes of data. I just want a relatively safe place to store my files
> that I can easily extend as needed.

mkcephfs picks the pg_num by taking the initial osd count and shifting it 
left by 'osd pg bits' bits.  Adjusting that (by default it is 6) in 
ceph.conf should give you larger initial pools.
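
As a quick illustration of that arithmetic, a minimal sketch (the function name and the example OSD counts are made up here; only the shift and the default of 6 come from the description above):

```python
def initial_pg_num(num_osds, osd_pg_bits=6):
    """pg_num as described above: the initial osd count shifted left
    by 'osd pg bits' bits (default 6)."""
    return num_osds << osd_pg_bits


# Hypothetical cluster sizes, to show how the setting scales the result.
for num_osds in (3, 10):
    for bits in (6, 7, 8):
        print(num_osds, "osds,", bits, "bits ->",
              initial_pg_num(num_osds, bits), "pgs")
```

Each extra bit doubles the initial pg_num, so raising the value by one or two before running mkcephfs is usually enough headroom for a small cluster that may grow.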

> So far, I'm 0 and 5... I keep blowing up the filesystem, one way or another.

Sorry to hear that!  The pg splitting (i.e., online pg_num adjustment) is 
still the next major osd project on the roadmap, but we've been a bit 
sidetracked with performance these past few weeks.

sage



* Re: Very unbalanced storage
  2012-09-04 16:19           ` Andrew Thompson
  2012-09-04 16:43             ` Sage Weil
@ 2012-09-04 16:48             ` Tommi Virtanen
  1 sibling, 0 replies; 17+ messages in thread
From: Tommi Virtanen @ 2012-09-04 16:48 UTC (permalink / raw)
  To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Tue, Sep 4, 2012 at 9:19 AM, Andrew Thompson <andrewkt@aktzero.com> wrote:
> Yes, it was my `data` pool I was trying to grow. After renaming and removing
> the original data pool, I can `ls` my folders/files, but not access them.

Yup, you're seeing ceph-mds being able to access the "metadata" pool,
but all the directory entries pointing at file data that resides in a
pool_id that no longer exists.
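
A toy model of why that happens, as a sketch only: none of these classes or field names exist in Ceph, and the real inode layout is more involved. The one property taken from this thread is that file data is located by pool id, not by pool name.

```python
class ToyCluster:
    """Toy stand-in for a cluster: pools are keyed by numeric id."""

    def __init__(self):
        self.pools = {}   # pool_id -> {"name": str, "objects": dict}
        self.next_id = 0

    def create_pool(self, name):
        pool_id = self.next_id
        self.next_id += 1
        self.pools[pool_id] = {"name": name, "objects": {}}
        return pool_id

    def pool_id(self, name):
        for pid, pool in self.pools.items():
            if pool["name"] == name:
                return pid
        raise KeyError(name)

    def cppool(self, src, dst):
        # Copies objects only; nothing that points at src is updated.
        src_objs = self.pools[self.pool_id(src)]["objects"]
        self.pools[self.pool_id(dst)]["objects"].update(src_objs)

    def rename_pool(self, old, new):
        self.pools[self.pool_id(old)]["name"] = new

    def delete_pool(self, name):
        del self.pools[self.pool_id(name)]


def read_file(cluster, inode):
    """Follow the inode's pool id; return None if that pool is gone."""
    pool = cluster.pools.get(inode["pool_id"])
    if pool is None:
        return None
    return pool["objects"].get(inode["object"])


cluster = ToyCluster()
old_id = cluster.create_pool("data")
cluster.pools[old_id]["objects"]["obj-1"] = b"hello"
# Hypothetical inode: it remembers the pool id, not the pool name.
inode = {"path": "/file.txt", "pool_id": old_id, "object": "obj-1"}

cluster.create_pool("tempstore")
cluster.cppool("data", "tempstore")
cluster.delete_pool("data")
cluster.rename_pool("tempstore", "data")

print("listing still works:", inode["path"])
print("read through inode:", read_file(cluster, inode))
new_id = cluster.pool_id("data")
print("object in renamed pool:", cluster.pools[new_id]["objects"]["obj-1"])
```

The copied object is intact in the pool now called "data", but the inode still carries the old id, which is why `ls` works and reads do not.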

While this would be recoverable by rewriting all the directory entries
etc, the simple answer is that your file data is not easily accessible
anymore. If this is just a test filesystem, and you have a recent
backup anyway, you might just go forward with restoring that. If there
is any doubt about that, you can leave the existing content around
until you're sure you can restore the backup successfully, and you
don't really need to re-create the cluster either. If this sounds
necessary, let me know and I'll walk you through the process; but the
simple answer really is recreating the cluster from scratch, so if
this is just test data, go with that.


end of thread

Thread overview: 17+ messages
2012-08-31 11:11 Very unbalanced storage Xiaopong Tran
2012-08-31 15:00 ` Andrew Thompson
2012-08-31 16:10   ` Sage Weil
2012-08-31 16:24     ` Andrew Thompson
2012-08-31 16:39       ` Gregory Farnum
2012-09-01  2:33         ` Xiaopong Tran
2012-09-01  3:07           ` Sage Weil
2012-08-31 16:43       ` Sage Weil
2012-08-31 16:05 ` Sage Weil
2012-09-01  2:52   ` Xiaopong Tran
2012-09-01  3:05     ` Sage Weil
2012-09-01  3:15       ` Xiaopong Tran
2012-09-01  6:58       ` Andrew Thompson
2012-09-04 15:59         ` Tommi Virtanen
2012-09-04 16:19           ` Andrew Thompson
2012-09-04 16:43             ` Sage Weil
2012-09-04 16:48             ` Tommi Virtanen
