* Very unbalanced storage
@ 2012-08-31 11:11 Xiaopong Tran
2012-08-31 15:00 ` Andrew Thompson
2012-08-31 16:05 ` Sage Weil
0 siblings, 2 replies; 17+ messages in thread
From: Xiaopong Tran @ 2012-08-31 11:11 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 3809 bytes --]
Hi,
Ceph storage on each disk in the cluster is very unbalanced. On each
node, the data seems to go to one or two disks, while other disks
are almost empty.
I can't find anything wrong from the crush map, it's just the
default for now. Attached is the crush map.
Here is the current situation on node s100001:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       932G  4.3G  927G   1% /disk1
/dev/sdc1       932G  4.3G  927G   1% /disk2
/dev/sdd1       932G  4.3G  927G   1% /disk3
/dev/sde1       932G  4.3G  927G   1% /disk4
/dev/sdf1       932G  4.3G  927G   1% /disk5
/dev/sdg1       932G  4.3G  927G   1% /disk6
/dev/sdh1       932G  4.3G  927G   1% /disk7
/dev/sdi1       932G  4.3G  927G   1% /disk8
/dev/sdj1       932G  4.3G  927G   1% /disk9
/dev/sdk1       932G  445G  487G  48% /disk10
Here, we can see that all data seem to go to one osd only, while others
are almost empty.
And here's the situation on node s200001:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       932G  443G  489G  48% /disk1
/dev/sdc1       932G  4.3G  927G   1% /disk2
/dev/sdd1       932G  4.3G  927G   1% /disk3
/dev/sde1       932G  4.3G  927G   1% /disk4
/dev/sdf1       932G  4.3G  927G   1% /disk5
/dev/sdg1       932G  4.3G  927G   1% /disk6
/dev/sdh1       932G  4.3G  927G   1% /disk7
/dev/sdi1       932G  4.3G  927G   1% /disk8
/dev/sdj1       932G  449G  483G  49% /disk9
/dev/sdk1       932G  4.3G  927G   1% /disk10
The situation is a bit better, but not much, the data are stored on two
disks mainly.
Here is a better situation, on node s100002:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       1.9T  453G  1.4T  25% /disk1
/dev/sdc1       1.9T  4.3G  1.9T   1% /disk2
/dev/sdd1       1.9T  4.4G  1.9T   1% /disk3
/dev/sde1       1.9T  4.3G  1.9T   1% /disk4
/dev/sdf1       1.9T  457G  1.4T  25% /disk5
/dev/sdg1       1.9T  443G  1.4T  24% /disk6
/dev/sdh1       1.9T  4.4G  1.9T   1% /disk7
/dev/sdi1       1.9T  4.4G  1.9T   1% /disk8
/dev/sdj1       1.9T  427G  1.5T  23% /disk9
/dev/sdk1       1.9T  4.4G  1.9T   1% /disk10
It's better than the other two, but still not what I expected. I
expected the data to be spread out according to the weight of each
osd, as defined in the crush map. Or at least, as close to that
as possible. It might be just some obviously stupid config error,
but I don't know. This can't be normal, can it?
Thanks for any hint.
Xiaopong
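
The expectation stated above (data spread according to each osd's weight) can be put into numbers. The following is a minimal sketch, assuming an idealized placement that is exactly proportional to CRUSH weight; the host table is copied from the attached crush map, while the function names are made up for illustration and are not part of any Ceph tool.

```python
# Host name -> (number of OSDs, CRUSH weight per OSD), as listed in the
# attached crush map. The helper names below are hypothetical.
HOSTS = {
    "s100001": (10, 1.0),
    "s200001": (10, 1.0),
    "s300001": (10, 1.0),
    "s100002": (10, 2.0),
    "s200002": (10, 2.0),
    "s300002": (10, 2.0),
    "s100003": (8, 2.0),
    "s200003": (8, 2.0),
}


def total_weight(hosts):
    """Sum of the CRUSH weights of every OSD in the cluster."""
    return sum(count * weight for count, weight in hosts.values())


def expected_share(osd_weight, hosts):
    """Ideal fraction of all stored data held by one OSD of this weight."""
    return osd_weight / total_weight(hosts)


if __name__ == "__main__":
    print(f"total weight: {total_weight(HOSTS):.1f}")
    for weight in (1.0, 2.0):
        share = expected_share(weight, HOSTS)
        print(f"OSD with weight {weight}: {share:.2%} of the data")
```

Under this assumption each 1TB disk would hold a bit under 1% of the data and each 2TB disk twice that, which is far from the 1%/48% split shown in the df output.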
[-- Attachment #2: crush.txt --]
[-- Type: text/plain, Size: 5097 bytes --]
# begin crush map
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70
device 71 osd.71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool
# buckets
host s100001 {
id -2 # do not change unnecessarily
# weight 10.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
item osd.1 weight 1.000
item osd.2 weight 1.000
item osd.3 weight 1.000
item osd.4 weight 1.000
item osd.5 weight 1.000
item osd.6 weight 1.000
item osd.7 weight 1.000
item osd.8 weight 1.000
item osd.9 weight 1.000
}
host s200001 {
id -4 # do not change unnecessarily
# weight 10.000
alg straw
hash 0 # rjenkins1
item osd.10 weight 1.000
item osd.11 weight 1.000
item osd.12 weight 1.000
item osd.13 weight 1.000
item osd.14 weight 1.000
item osd.15 weight 1.000
item osd.16 weight 1.000
item osd.17 weight 1.000
item osd.18 weight 1.000
item osd.19 weight 1.000
}
host s300001 {
id -5 # do not change unnecessarily
# weight 10.000
alg straw
hash 0 # rjenkins1
item osd.20 weight 1.000
item osd.21 weight 1.000
item osd.22 weight 1.000
item osd.23 weight 1.000
item osd.24 weight 1.000
item osd.25 weight 1.000
item osd.26 weight 1.000
item osd.27 weight 1.000
item osd.28 weight 1.000
item osd.29 weight 1.000
}
host s100002 {
id -6 # do not change unnecessarily
# weight 20.000
alg straw
hash 0 # rjenkins1
item osd.30 weight 2.000
item osd.31 weight 2.000
item osd.32 weight 2.000
item osd.33 weight 2.000
item osd.34 weight 2.000
item osd.35 weight 2.000
item osd.36 weight 2.000
item osd.37 weight 2.000
item osd.38 weight 2.000
item osd.39 weight 2.000
}
host s200002 {
id -7 # do not change unnecessarily
# weight 20.000
alg straw
hash 0 # rjenkins1
item osd.40 weight 2.000
item osd.41 weight 2.000
item osd.42 weight 2.000
item osd.43 weight 2.000
item osd.44 weight 2.000
item osd.45 weight 2.000
item osd.46 weight 2.000
item osd.47 weight 2.000
item osd.48 weight 2.000
item osd.49 weight 2.000
}
host s300002 {
id -8 # do not change unnecessarily
# weight 20.000
alg straw
hash 0 # rjenkins1
item osd.50 weight 2.000
item osd.51 weight 2.000
item osd.52 weight 2.000
item osd.53 weight 2.000
item osd.54 weight 2.000
item osd.55 weight 2.000
item osd.56 weight 2.000
item osd.57 weight 2.000
item osd.58 weight 2.000
item osd.59 weight 2.000
}
host s100003 {
id -9 # do not change unnecessarily
# weight 16.000
alg straw
hash 0 # rjenkins1
item osd.60 weight 2.000
item osd.61 weight 2.000
item osd.62 weight 2.000
item osd.63 weight 2.000
item osd.64 weight 2.000
item osd.65 weight 2.000
item osd.66 weight 2.000
item osd.67 weight 2.000
}
host s200003 {
id -10 # do not change unnecessarily
# weight 16.000
alg straw
hash 0 # rjenkins1
item osd.68 weight 2.000
item osd.69 weight 2.000
item osd.70 weight 2.000
item osd.71 weight 2.000
item osd.72 weight 2.000
item osd.73 weight 2.000
item osd.74 weight 2.000
item osd.75 weight 2.000
}
rack unknownrack {
id -3 # do not change unnecessarily
# weight 122.000
alg straw
hash 0 # rjenkins1
item s100001 weight 10.000
item s200001 weight 10.000
item s300001 weight 10.000
item s100002 weight 20.000
item s200002 weight 20.000
item s300002 weight 20.000
item s100003 weight 16.000
item s200003 weight 16.000
}
pool default {
id -1 # do not change unnecessarily
# weight 122.000
alg straw
hash 0 # rjenkins1
item unknownrack weight 122.000
}
# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-08-31 11:11 Very unbalanced storage Xiaopong Tran
@ 2012-08-31 15:00 ` Andrew Thompson
2012-08-31 16:10 ` Sage Weil
2012-08-31 16:05 ` Sage Weil
1 sibling, 1 reply; 17+ messages in thread
From: Andrew Thompson @ 2012-08-31 15:00 UTC (permalink / raw)
To: Xiaopong Tran; +Cc: ceph-devel@vger.kernel.org

On 8/31/2012 7:11 AM, Xiaopong Tran wrote:
> Hi,
>
> Ceph storage on each disk in the cluster is very unbalanced. On each
> node, the data seems to go to one or two disks, while other disks
> are almost empty.
>
> I can't find anything wrong from the crush map, it's just the
> default for now. Attached is the crush map.

Have you been reweight-ing osds? I went round and round with my cluster a few
days ago reloading different crush maps only to find that it re-injecting a
crush map didn't seem to overwrite reweights.

Take a look at `ceph osd tree` to see if the reweight column matches the
weight column.

Note: I'm new at this. This is my experience only, with 0.48.1, and may not be
entirely correct.

--
Andrew Thompson
http://aktzero.com/

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-08-31 15:00 ` Andrew Thompson
@ 2012-08-31 16:10 ` Sage Weil
2012-08-31 16:24 ` Andrew Thompson
0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2012-08-31 16:10 UTC (permalink / raw)
To: Andrew Thompson; +Cc: Xiaopong Tran, ceph-devel@vger.kernel.org

On Fri, 31 Aug 2012, Andrew Thompson wrote:
> On 8/31/2012 7:11 AM, Xiaopong Tran wrote:
> > Hi,
> >
> > Ceph storage on each disk in the cluster is very unbalanced. On each
> > node, the data seems to go to one or two disks, while other disks
> > are almost empty.
> >
> > I can't find anything wrong from the crush map, it's just the
> > default for now. Attached is the crush map.
>
> Have you been reweight-ing osds? I went round and round with my cluster a few
> days ago reloading different crush maps only to find that it re-injecting a
> crush map didn't seem to overwrite reweights.
>
> Take a look at `ceph osd tree` to see if the reweight column matches the
> weight column.

Note that the ideal situation is for reweight to be 1, regardless of what
the crush weight is. If you find the utilizations are skewed, I would
look for other causes before resorting to reweight-by-utilization; it is
meant to adjust the normal statistical variation you expect from a
(pseudo)random placement, but if the variance is high there is likely
another cause.

sage

>
> Note: I'm new at this. This is my experience only, with 0.48.1, and may not be
> entirely correct.
>
> --
> Andrew Thompson
> http://aktzero.com/
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-08-31 16:10 ` Sage Weil
@ 2012-08-31 16:24 ` Andrew Thompson
2012-08-31 16:39 ` Gregory Farnum
2012-08-31 16:43 ` Sage Weil
0 siblings, 2 replies; 17+ messages in thread
From: Andrew Thompson @ 2012-08-31 16:24 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org

On 8/31/2012 12:10 PM, Sage Weil wrote:
> On Fri, 31 Aug 2012, Andrew Thompson wrote:
>> Have you been reweight-ing osds? I went round and round with my
>> cluster a few days ago reloading different crush maps only to find
>> that it re-injecting a crush map didn't seem to overwrite reweights.
>> Take a look at `ceph osd tree` to see if the reweight column matches
>> the weight column.
> Note that the ideal situation is for reweight to be 1, regardless of what
> the crush weight is. If you find the utilizations are skewed, I would
> look for other causes before resorting to reweight-by-utilization; it is
> meant to adjust the normal statistical variation you expect from a
> (pseudo)random placement, but if the variance is high there is likely
> another cause.

So if someone(me, guilty) had been messing with reweight, will setting them
all to 1 return it to a normal un-reweight-ed state?

--
Andrew Thompson
http://aktzero.com/

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-08-31 16:24 ` Andrew Thompson
@ 2012-08-31 16:39 ` Gregory Farnum
2012-09-01 2:33 ` Xiaopong Tran
2012-08-31 16:43 ` Sage Weil
1 sibling, 1 reply; 17+ messages in thread
From: Gregory Farnum @ 2012-08-31 16:39 UTC (permalink / raw)
To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson <andrewkt@aktzero.com> wrote:
> On 8/31/2012 12:10 PM, Sage Weil wrote:
>>
>> On Fri, 31 Aug 2012, Andrew Thompson wrote:
>>>
>>> Have you been reweight-ing osds? I went round and round with my cluster a
>>> few days ago reloading different crush maps only to find that it
>>> re-injecting a crush map didn't seem to overwrite reweights. Take a look at
>>> `ceph osd tree` to see if the reweight column matches the weight column.
>>
>> Note that the ideal situation is for reweight to be 1, regardless of what
>> the crush weight is. If you find the utilizations are skewed, I would
>> look for other causes before resorting to reweight-by-utilization; it is
>> meant to adjust the normal statistical variation you expect from a
>> (pseudo)random placement, but if the variance is high there is likely
>> another cause.
>
>
> So if someone(me, guilty) had been messing with reweight, will setting them
> all to 1 return it to a normal un-reweight-ed state?

Yep!
If you have OSDs with different sizes you'll want to adjust the CRUSH
weights, not the reweight values:
http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-08-31 16:39 ` Gregory Farnum
@ 2012-09-01 2:33 ` Xiaopong Tran
2012-09-01 3:07 ` Sage Weil
0 siblings, 1 reply; 17+ messages in thread
From: Xiaopong Tran @ 2012-09-01 2:33 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Andrew Thompson, ceph-devel@vger.kernel.org

On 09/01/2012 12:39 AM, Gregory Farnum wrote:
> On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson <andrewkt@aktzero.com> wrote:
>> On 8/31/2012 12:10 PM, Sage Weil wrote:
>>>
>>> On Fri, 31 Aug 2012, Andrew Thompson wrote:
>>>>
>>>> Have you been reweight-ing osds? I went round and round with my cluster a
>>>> few days ago reloading different crush maps only to find that it
>>>> re-injecting a crush map didn't seem to overwrite reweights. Take a look at
>>>> `ceph osd tree` to see if the reweight column matches the weight column.
>>>
>>> Note that the ideal situation is for reweight to be 1, regardless of what
>>> the crush weight is. If you find the utilizations are skewed, I would
>>> look for other causes before resorting to reweight-by-utilization; it is
>>> meant to adjust the normal statistical variation you expect from a
>>> (pseudo)random placement, but if the variance is high there is likely
>>> another cause.
>>
>>
>> So if someone(me, guilty) had been messing with reweight, will setting them
>> all to 1 return it to a normal un-reweight-ed state?
>
> Yep!
> If you have OSDs with different sizes you'll want to adjust the CRUSH
> weights, not the reweight values:
> http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight

Thanks for the reply. Yes, this was what I did, we had 1TB and 2TB HD,
so using 1TB as the base line, with weight being 1.0, then I'd like that
the 2TB HD store 2x amount of data, so that the disks always have
roughly same relative amount of data.

Originally, every osd has weight of 1.0, and I did:

    ceph osd crush reweight osd.30 2.0

and all the 2TB disks.

And that's probably what caused the skew afterward. The crush map
attached in my last message was fetched from the cluster, and

    ceph osd tree

does show that the weight of the 2TB disks as 2, but reweight is 1.

Now I'm getting confused by the meaning of crush weight :)

Best,

Xiaopong

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-09-01 2:33 ` Xiaopong Tran
@ 2012-09-01 3:07 ` Sage Weil
0 siblings, 0 replies; 17+ messages in thread
From: Sage Weil @ 2012-09-01 3:07 UTC (permalink / raw)
To: Xiaopong Tran; +Cc: Gregory Farnum, Andrew Thompson, ceph-devel@vger.kernel.org

On Sat, 1 Sep 2012, Xiaopong Tran wrote:
> On 09/01/2012 12:39 AM, Gregory Farnum wrote:
> > On Fri, Aug 31, 2012 at 9:24 AM, Andrew Thompson <andrewkt@aktzero.com>
> > wrote:
> > > On 8/31/2012 12:10 PM, Sage Weil wrote:
> > > >
> > > > On Fri, 31 Aug 2012, Andrew Thompson wrote:
> > > > >
> > > > > Have you been reweight-ing osds? I went round and round with my
> > > > > cluster a
> > > > > few days ago reloading different crush maps only to find that it
> > > > > re-injecting a crush map didn't seem to overwrite reweights. Take a
> > > > > look at
> > > > > `ceph osd tree` to see if the reweight column matches the weight
> > > > > column.
> > > >
> > > > Note that the ideal situation is for reweight to be 1, regardless of
> > > > what
> > > > the crush weight is. If you find the utilizations are skewed, I would
> > > > look for other causes before resorting to reweight-by-utilization; it is
> > > > meant to adjust the normal statistical variation you expect from a
> > > > (pseudo)random placement, but if the variance is high there is likely
> > > > another cause.
> > >
> > >
> > > So if someone(me, guilty) had been messing with reweight, will setting
> > > them
> > > all to 1 return it to a normal un-reweight-ed state?
> >
> > Yep!
> > If you have OSDs with different sizes you'll want to adjust the CRUSH
> > weights, not the reweight values:
> > http://ceph.com/docs/master/ops/manage/crush/#adjusting-the-crush-weight
>
> Thanks for the reply. Yes, this was what I did, we had 1TB and 2TB HD,
> so using 1TB as the base line, with weight being 1.0, then I'd like that
> the 2TB HD store 2x amount of data, so that the disks always have
> roughly same relative amount of data.
>
> Originally, every osd has weight of 1.0, and I did:
>
>     ceph osd crush reweight osd.30 2.0
>
> and all the 2TB disks.
>
> And that's probably what caused the skew afterward. The crush map
> attached in my last message was fetched from the cluster, and
>
>     ceph osd tree
>
> does show that the weight of the 2TB disks as 2, but reweight is 1.
>
> Now I'm getting confused by the meaning of crush weight :)

Yes, sorry. The left one (in osd tree) is the 'crush weight', and the
right one is the 'reweight', which you can think of as a non-binary state
of failure. 0 == failed (and everything remapped elsewhere), 1 == normal,
and anything in between meaning that some fraction of the content is
remapped elsewhere.

sage

^ permalink raw reply [flat|nested] 17+ messages in thread
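
The "non-binary state of failure" description of reweight can be illustrated with a toy model. This is only a sketch under loud assumptions: it is not Ceph's actual placement code, the hash is an arbitrary stand-in, and `keeps_pg` and `kept_fraction` are hypothetical names. It shows a per-OSD value between 0 and 1 acting as the fraction of that OSD's placement groups that stay put, with the rest remapped elsewhere.

```python
import hashlib


def keeps_pg(osd_id: int, pg_id: int, reweight: float) -> bool:
    """Toy model (an assumption, not Ceph's algorithm): hash the (osd, pg)
    pair to a number in [0, 1) and keep the PG on this OSD only if that
    number is below the OSD's reweight value."""
    digest = hashlib.sha256(f"{osd_id}:{pg_id}".encode()).digest()
    draw = int.from_bytes(digest[:4], "big") / 2**32  # always < 1.0
    return draw < reweight


def kept_fraction(osd_id: int, reweight: float, num_pgs: int = 10000) -> float:
    """Fraction of num_pgs candidate PGs that remain on the OSD."""
    kept = sum(keeps_pg(osd_id, pg, reweight) for pg in range(num_pgs))
    return kept / num_pgs


if __name__ == "__main__":
    for value in (0.0, 0.5, 1.0):
        print(f"reweight {value}: keeps {kept_fraction(30, value):.1%} of PGs")
```

In this model a reweight of 1 keeps everything and 0 keeps nothing, while the crush weight (not modelled here) is what decides how much an OSD is offered in the first place.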
* Re: Very unbalanced storage
2012-08-31 16:24 ` Andrew Thompson
2012-08-31 16:39 ` Gregory Farnum
@ 2012-08-31 16:43 ` Sage Weil
1 sibling, 0 replies; 17+ messages in thread
From: Sage Weil @ 2012-08-31 16:43 UTC (permalink / raw)
To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Fri, 31 Aug 2012, Andrew Thompson wrote:
> On 8/31/2012 12:10 PM, Sage Weil wrote:
> > On Fri, 31 Aug 2012, Andrew Thompson wrote:
> > > Have you been reweight-ing osds? I went round and round with my cluster a
> > > few days ago reloading different crush maps only to find that it
> > > re-injecting a crush map didn't seem to overwrite reweights. Take a look
> > > at `ceph osd tree` to see if the reweight column matches the weight
> > > column.
> > Note that the ideal situation is for reweight to be 1, regardless of what
> > the crush weight is. If you find the utilizations are skewed, I would
> > look for other causes before resorting to reweight-by-utilization; it is
> > meant to adjust the normal statistical variation you expect from a
> > (pseudo)random placement, but if the variance is high there is likely
> > another cause.
>
> So if someone(me, guilty) had been messing with reweight, will setting them
> all to 1 return it to a normal un-reweight-ed state?

Yep! :)

sage

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-08-31 11:11 Very unbalanced storage Xiaopong Tran
2012-08-31 15:00 ` Andrew Thompson
@ 2012-08-31 16:05 ` Sage Weil
2012-09-01 2:52 ` Xiaopong Tran
1 sibling, 1 reply; 17+ messages in thread
From: Sage Weil @ 2012-08-31 16:05 UTC (permalink / raw)
To: Xiaopong Tran; +Cc: ceph-devel@vger.kernel.org

On Fri, 31 Aug 2012, Xiaopong Tran wrote:
> Hi,
>
> Ceph storage on each disk in the cluster is very unbalanced. On each
> node, the data seems to go to one or two disks, while other disks
> are almost empty.
>
> I can't find anything wrong from the crush map, it's just the
> default for now. Attached is the crush map.

This is usually a problem with the pg_num for the pool you are using. Can
you include the output from 'ceph osd dump | grep ^pool'? By default,
pools get 8 pgs, which will distribute poorly.

sage

>
> Here is the current situation on node s100001:
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1       932G  4.3G  927G   1% /disk1
> /dev/sdc1       932G  4.3G  927G   1% /disk2
> /dev/sdd1       932G  4.3G  927G   1% /disk3
> /dev/sde1       932G  4.3G  927G   1% /disk4
> /dev/sdf1       932G  4.3G  927G   1% /disk5
> /dev/sdg1       932G  4.3G  927G   1% /disk6
> /dev/sdh1       932G  4.3G  927G   1% /disk7
> /dev/sdi1       932G  4.3G  927G   1% /disk8
> /dev/sdj1       932G  4.3G  927G   1% /disk9
> /dev/sdk1       932G  445G  487G  48% /disk10
>
> Here, we can see that all data seem to go to one osd only, while others
> are almost empty.
>
> And here's the situation on node s200001:
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1       932G  443G  489G  48% /disk1
> /dev/sdc1       932G  4.3G  927G   1% /disk2
> /dev/sdd1       932G  4.3G  927G   1% /disk3
> /dev/sde1       932G  4.3G  927G   1% /disk4
> /dev/sdf1       932G  4.3G  927G   1% /disk5
> /dev/sdg1       932G  4.3G  927G   1% /disk6
> /dev/sdh1       932G  4.3G  927G   1% /disk7
> /dev/sdi1       932G  4.3G  927G   1% /disk8
> /dev/sdj1       932G  449G  483G  49% /disk9
> /dev/sdk1       932G  4.3G  927G   1% /disk10
>
> The situation is a bit better, but not much, the data are stored on two
> disks mainly.
>
> Here is a better situation, on node s100002:
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1       1.9T  453G  1.4T  25% /disk1
> /dev/sdc1       1.9T  4.3G  1.9T   1% /disk2
> /dev/sdd1       1.9T  4.4G  1.9T   1% /disk3
> /dev/sde1       1.9T  4.3G  1.9T   1% /disk4
> /dev/sdf1       1.9T  457G  1.4T  25% /disk5
> /dev/sdg1       1.9T  443G  1.4T  24% /disk6
> /dev/sdh1       1.9T  4.4G  1.9T   1% /disk7
> /dev/sdi1       1.9T  4.4G  1.9T   1% /disk8
> /dev/sdj1       1.9T  427G  1.5T  23% /disk9
> /dev/sdk1       1.9T  4.4G  1.9T   1% /disk10
>
> It's better than the other two, but still not what I expected. I
> expected the data to be spread out according to the weight of each
> osd, as defined in the crush map. Or at least, as close to that
> as possible. It might be just some obviously stupid config error,
> but I don't know. This can't be normal, can it?
>
> Thanks for any hint.
>
> Xiaopong
>

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-08-31 16:05 ` Sage Weil
@ 2012-09-01 2:52 ` Xiaopong Tran
2012-09-01 3:05 ` Sage Weil
0 siblings, 1 reply; 17+ messages in thread
From: Xiaopong Tran @ 2012-09-01 2:52 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

On 09/01/2012 12:05 AM, Sage Weil wrote:
> On Fri, 31 Aug 2012, Xiaopong Tran wrote:
>> Hi,
>>
>> Ceph storage on each disk in the cluster is very unbalanced. On each
>> node, the data seems to go to one or two disks, while other disks
>> are almost empty.
>>
>> I can't find anything wrong from the crush map, it's just the
>> default for now. Attached is the crush map.
>
> This is usually a problem with the pg_num for the pool you are using. Can
> you include the output from 'ceph osd dump | grep ^pool'? By default,
> pools get 8 pgs, which will distribute poorly.
>
> sage
>
>

Here is the pool I'm interested in:

pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
pgp_num 8 last_change 216 owner 0

So, ok, by default, the pg_num is really small. That's a very dumb
mistake I made. Is there any easy way to change this?

I looked at the tunables, if I upgrade to v0.48.1 or v0.49,
then would I be able to tune the pg_num value?

Thanks for any help, this is quite a serious issue.

Xiaopong

^ permalink raw reply [flat|nested] 17+ messages in thread
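
Why pg_num 8 distributes so poorly follows from simple counting: with rep size 3, the pool has at most 8 * 3 = 24 placement-group replicas, so at most 24 of the 76 osds can hold any of its data. The sketch below makes that concrete under a simplifying assumption: placement is uniformly random over the osds, ignoring CRUSH weights and the one-replica-per-host rule, and `osds_holding_data` is a made-up name for illustration.

```python
import random


def osds_holding_data(pg_num: int, replicas: int, num_osds: int, seed: int = 0) -> int:
    """Count the distinct OSDs that receive at least one PG replica when
    every PG picks `replicas` distinct OSDs uniformly at random.
    This is a simplification of CRUSH, used only to show the counting."""
    rng = random.Random(seed)
    used = set()
    for _ in range(pg_num):
        used.update(rng.sample(range(num_osds), replicas))
    return len(used)


if __name__ == "__main__":
    num_osds = 76  # osd.0 through osd.75 in the attached crush map
    for pg_num in (8, 1520):
        holding = osds_holding_data(pg_num, replicas=3, num_osds=num_osds)
        print(f"pg_num={pg_num}: {holding} of {num_osds} OSDs hold data")
```

The bound of 24 is consistent with the df listings earlier in the thread, where only one, two, and four disks on the three nodes shown carry hundreds of gigabytes while the rest sit at about 4.3G.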
* Re: Very unbalanced storage
2012-09-01 2:52 ` Xiaopong Tran
@ 2012-09-01 3:05 ` Sage Weil
2012-09-01 3:15 ` Xiaopong Tran
2012-09-01 6:58 ` Andrew Thompson
0 siblings, 2 replies; 17+ messages in thread
From: Sage Weil @ 2012-09-01 3:05 UTC (permalink / raw)
To: Xiaopong Tran; +Cc: ceph-devel@vger.kernel.org

On Sat, 1 Sep 2012, Xiaopong Tran wrote:
> On 09/01/2012 12:05 AM, Sage Weil wrote:
> > On Fri, 31 Aug 2012, Xiaopong Tran wrote:
> > > Hi,
> > >
> > > Ceph storage on each disk in the cluster is very unbalanced. On each
> > > node, the data seems to go to one or two disks, while other disks
> > > are almost empty.
> > >
> > > I can't find anything wrong from the crush map, it's just the
> > > default for now. Attached is the crush map.
> >
> > This is usually a problem with the pg_num for the pool you are using. Can
> > you include the output from 'ceph osd dump | grep ^pool'? By default,
> > pools get 8 pgs, which will distribute poorly.
> >
> > sage
> >
> >
> Here is the pool I'm interested in:
>
> pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
> pgp_num 8 last_change 216 owner 0
>
> So, ok, by default, the pg_num is really small. That's a very dumb
> mistake I made. Is there any easy way to change this?

I think me choosing 8 as the default was the dumb thing :)

> I looked at the tunables, if I upgrade to v0.48.1 or v0.49,
> then would I be able to tune the pg_num value?

Sadly you can't yet adjust pg_num for an active pool. You can create a
new pool,

    ceph osd pool create <name> <pg_num>

I would aim for 20 * num_osd, or thereabouts.. see

    http://ceph.com/docs/master/ops/manage/grow/placement-groups/

Then you can copy the data from the old pool to the new one with

    rados cppool yunio2 yunio3

This won't be particularly fast, but it will work. You can also do

    ceph osd pool rename <oldname> <newname>
    ceph osd pool delete <name>

I hope this solves your problem!
sage

^ permalink raw reply [flat|nested] 17+ messages in thread
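
The "20 * num_osd" rule of thumb works out as follows for the 76-osd cluster in this thread. This is a minimal sketch that only does the arithmetic and prints the two commands quoted in the message above without running them; the helper names are hypothetical, and the exact command syntax may differ between Ceph versions.

```python
def suggested_pg_num(num_osds: int, pgs_per_osd: int = 20) -> int:
    """Rule of thumb from the message above: about 20 PGs per OSD."""
    return num_osds * pgs_per_osd


def migration_commands(old_pool: str, new_pool: str, num_osds: int) -> list:
    """Build (but do not execute) the create and copy commands quoted in
    the thread, filling in the suggested pg_num."""
    pg_num = suggested_pg_num(num_osds)
    return [
        f"ceph osd pool create {new_pool} {pg_num}",
        f"rados cppool {old_pool} {new_pool}",
    ]


if __name__ == "__main__":
    # 76 OSDs: osd.0 through osd.75 in the attached crush map.
    for command in migration_commands("yunio2", "yunio3", num_osds=76):
        print(command)
```

Later replies in the thread caution that copying objects this way is not enough when the pool is a CephFS data pool, so the rename and delete steps are deliberately left out of the sketch.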
* Re: Very unbalanced storage
2012-09-01 3:05 ` Sage Weil
@ 2012-09-01 3:15 ` Xiaopong Tran
2012-09-01 6:58 ` Andrew Thompson
1 sibling, 0 replies; 17+ messages in thread
From: Xiaopong Tran @ 2012-09-01 3:15 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

On 09/01/2012 11:05 AM, Sage Weil wrote:
> On Sat, 1 Sep 2012, Xiaopong Tran wrote:
>> On 09/01/2012 12:05 AM, Sage Weil wrote:
>>> On Fri, 31 Aug 2012, Xiaopong Tran wrote:
>>>> Hi,
>>>>
>>>> Ceph storage on each disk in the cluster is very unbalanced. On each
>>>> node, the data seems to go to one or two disks, while other disks
>>>> are almost empty.
>>>>
>>>> I can't find anything wrong from the crush map, it's just the
>>>> default for now. Attached is the crush map.
>>>
>>> This is usually a problem with the pg_num for the pool you are using. Can
>>> you include the output from 'ceph osd dump | grep ^pool'? By default,
>>> pools get 8 pgs, which will distribute poorly.
>>>
>>> sage
>>>
>>>
>> Here is the pool I'm interested in:
>>
>> pool 9 'yunio2' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 8
>> pgp_num 8 last_change 216 owner 0
>>
>> So, ok, by default, the pg_num is really small. That's a very dumb
>> mistake I made. Is there any easy way to change this?
>
> I think me choosing 8 as the default was the dumb thing :)
>
>> I looked at the tunables, if I upgrade to v0.48.1 or v0.49,
>> then would I be able to tune the pg_num value?
>
> Sadly you can't yet adjust pg_num for an active pool. You can create a
> new pool,
>
>     ceph osd pool create <name> <pg_num>
>
> I would aim for 20 * num_osd, or thereabouts.. see
>
>     http://ceph.com/docs/master/ops/manage/grow/placement-groups/
>
> Then you can copy the data from the old pool to the new one with
>
>     rados cppool yunio2 yunio3
>
> This won't be particularly fast, but it will work. You can also do
>
>     ceph osd pool rename <oldname> <newname>
>     ceph osd pool delete <name>
>
> I hope this solves your problem!
> sage
>

Ok, this is going to be painful. But do I have to stop using the
current pool completely while I do

    rados cppool yunio2 yunio3

? This is not something I can do now :)

But this wiki describes a nice way to increase the number of PGs:

http://ceph.com/wiki/Changing_the_number_of_PGs

Even if I upgrade to v0.48.1, this command can only change the PG
size when the pool is empty?

Thanks

Xiaopong

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-09-01 3:05 ` Sage Weil
2012-09-01 3:15 ` Xiaopong Tran
@ 2012-09-01 6:58 ` Andrew Thompson
2012-09-04 15:59 ` Tommi Virtanen
1 sibling, 1 reply; 17+ messages in thread
From: Andrew Thompson @ 2012-09-01 6:58 UTC (permalink / raw)
To: ceph-devel@vger.kernel.org

On 8/31/2012 11:05 PM, Sage Weil wrote:
> Sadly you can't yet adjust pg_num for an active pool. You can create a
> new pool,
>
>     ceph osd pool create <name> <pg_num>
>
> I would aim for 20 * num_osd, or thereabouts.. see
>
>     http://ceph.com/docs/master/ops/manage/grow/placement-groups/
>
> Then you can copy the data from the old pool to the new one with
>
>     rados cppool yunio2 yunio3
>
> This won't be particularly fast, but it will work. You can also do
>
>     ceph osd pool rename <oldname> <newname>
>     ceph osd pool delete <name>
>
> I hope this solves your problem!

Looking at old archives, I found this thread which shows that to mount a
pool as cephfs, it needs to be added to mds:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685

I started a `rados cppool data tempstore` a couple hours ago. When it
finishes, will I need to remove the current pool from mds somehow(other than
just deleting the pool)?

Is `ceph mds add_data_pool <poolname>` still required? (It's not listed in
`ceph --help`.)

Thanks.

--
Andrew Thompson
http://aktzero.com/

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-09-01 6:58 ` Andrew Thompson
@ 2012-09-04 15:59 ` Tommi Virtanen
2012-09-04 16:19 ` Andrew Thompson
0 siblings, 1 reply; 17+ messages in thread
From: Tommi Virtanen @ 2012-09-04 15:59 UTC (permalink / raw)
To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson <andrewkt@aktzero.com> wrote:
> Looking at old archives, I found this thread which shows that to mount a
> pool as cephfs, it needs to be added to mds:
>
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
>
> I started a `rados cppool data tempstore` a couple hours ago. When it
> finishes, will I need to remove the current pool from mds somehow(other than
> just deleting the pool)?
>
> Is `ceph mds add_data_pool <poolname>` still required? (It's not listed in
> `ceph --help`.)

If the pool you are trying to grow pg_num for really is a CephFS data
pool, I fear a "rados cppool" is nowhere near enough to perform a
migration. My understanding is that each of the inodes stored in
cephfs/on ceph-mds'es knows what pool the file data resides in; you
shoveling the objects into another pool with "rados cppool" doesn't
change these pointers, removing the old pool will just break the
filesystem.

Before we go too far down this road: is your problem pool *really*
being use as a cephfs data pool? Based on how it's not named "data"
and you're just now asking about "ceph mds add_data_pool", it seems
that's not likely..

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-09-04 15:59 ` Tommi Virtanen
@ 2012-09-04 16:19 ` Andrew Thompson
2012-09-04 16:43 ` Sage Weil
2012-09-04 16:48 ` Tommi Virtanen
0 siblings, 2 replies; 17+ messages in thread
From: Andrew Thompson @ 2012-09-04 16:19 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: ceph-devel@vger.kernel.org

On 9/4/2012 11:59 AM, Tommi Virtanen wrote:
> On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson <andrewkt@aktzero.com> wrote:
>> Looking at old archives, I found this thread which shows that to mount a
>> pool as cephfs, it needs to be added to mds:
>>
>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
>>
>> I started a `rados cppool data tempstore` a couple hours ago. When it
>> finishes, will I need to remove the current pool from mds somehow(other than
>> just deleting the pool)?
>>
>> Is `ceph mds add_data_pool <poolname>` still required? (It's not listed in
>> `ceph --help`.)
> If the pool you are trying to grow pg_num for really is a CephFS data
> pool, I fear a "rados cppool" is nowhere near enough to perform a
> migration. My understanding is that each of the inodes stored in
> cephfs/on ceph-mds'es knows what pool the file data resides in; you
> shoveling the objects into another pool with "rados cppool" doesn't
> change these pointers, removing the old pool will just break the
> filesystem.
>
> Before we go too far down this road: is your problem pool *really*
> being use as a cephfs data pool? Based on how it's not named "data"
> and you're just now asking about "ceph mds add_data_pool", it seems
> that's not likely..

Well, I guess it's time to wipe this cluster and start over.

Yes, it was my `data` pool I was trying to grow. After renaming and removing
the original data pool, I can `ls` my folders/files, but not access them.

I attempted a tar backup beforehand, so unless it flaked out, I should be
able to recover data.

I was concerned the small number of PGs created by default by mkcephfs
would be an issue, so I was trying to up it a bit. I'm not going to have
100+ OSDs or petabytes of data. I just want a relatively safe place to
store my files that I can easily extend as needed.

So far, I'm 0 and 5... I keep blowing up the filesystem, one way or another.

--
Andrew Thompson
http://aktzero.com/

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Very unbalanced storage
2012-09-04 16:19 ` Andrew Thompson
@ 2012-09-04 16:43 ` Sage Weil
2012-09-04 16:48 ` Tommi Virtanen
1 sibling, 0 replies; 17+ messages in thread
From: Sage Weil @ 2012-09-04 16:43 UTC (permalink / raw)
To: Andrew Thompson; +Cc: Tommi Virtanen, ceph-devel@vger.kernel.org

On Tue, 4 Sep 2012, Andrew Thompson wrote:
> On 9/4/2012 11:59 AM, Tommi Virtanen wrote:
> > On Fri, Aug 31, 2012 at 11:58 PM, Andrew Thompson <andrewkt@aktzero.com>
> > wrote:
> > > Looking at old archives, I found this thread which shows that to mount a
> > > pool as cephfs, it needs to be added to mds:
> > >
> > > http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/5685
> > >
> > > I started a `rados cppool data tempstore` a couple hours ago. When it
> > > finishes, will I need to remove the current pool from mds somehow (other
> > > than just deleting the pool)?
> > >
> > > Is `ceph mds add_data_pool <poolname>` still required? (It's not listed in
> > > `ceph --help`.)
> > If the pool you are trying to grow pg_num for really is a CephFS data
> > pool, I fear a "rados cppool" is nowhere near enough to perform a
> > migration. My understanding is that each of the inodes stored in
> > cephfs/on ceph-mds'es knows what pool the file data resides in; you
> > shoveling the objects into another pool with "rados cppool" doesn't
> > change these pointers, so removing the old pool will just break the
> > filesystem.
> >
> > Before we go too far down this road: is your problem pool *really*
> > being used as a cephfs data pool? Based on how it's not named "data"
> > and you're just now asking about "ceph mds add_data_pool", it seems
> > that's not likely..
>
> Well, I guess it's time to wipe this cluster and start over.
>
> Yes, it was my `data` pool I was trying to grow. After renaming and removing
> the original data pool, I can `ls` my folders/files, but not access them.

Yeah. Sorry I didn't catch this earlier, but TV is right: the ceph fs
inodes refer to the data pool by id #, not by name, so the cppool trick
won't work in the fs case.

> I attempted a tar backup beforehand, so unless it flaked out, I should be able
> to recover data.
>
> I was concerned the small number of PGs created by default by mkcephfs would
> be an issue, so I was trying to up it a bit. I'm not going to have 100+ OSDs
> or petabytes of data. I just want a relatively safe place to store my files
> that I can easily extend as needed.

mkcephfs picks the pg_num by taking the initial osd count and shifting
it 'osd pg bits' bits to the left. Adjusting that (by default it is 6)
in ceph.conf should give you larger initial pools.

> So far, I'm 0 and 5... I keep blowing up the filesystem, one way or another.

Sorry to hear that! The pg splitting (i.e., online pg_num adjustment) is
still the next major osd project on the roadmap, but we've been a bit
sidetracked with performance these past few weeks.

sage
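The pg_num rule described above can be sketched as a plain left shift. This is only an illustration of the arithmetic, not the mkcephfs code itself: the function name `initial_pg_num` and the example OSD count of 3 are assumptions made for this sketch; only the default of 6 for 'osd pg bits' comes from the message.

```python
def initial_pg_num(osd_count: int, osd_pg_bits: int = 6) -> int:
    """Hypothetical helper: initial OSD count shifted left by 'osd pg bits'."""
    if osd_count < 1:
        raise ValueError("osd_count must be at least 1")
    if osd_pg_bits < 0:
        raise ValueError("osd_pg_bits must not be negative")
    # Shifting left by n bits is the same as multiplying by 2**n.
    return osd_count << osd_pg_bits


if __name__ == "__main__":
    # Example values only: a 3-OSD cluster with increasing 'osd pg bits'.
    for bits in (6, 7, 8):
        print(f"osd pg bits = {bits}: pg_num = {initial_pg_num(3, bits)}")
```

Under this reading, each extra bit doubles the initial pg_num, so the value has to be chosen before the pools are created.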
* Re: Very unbalanced storage
2012-09-04 16:19 ` Andrew Thompson
2012-09-04 16:43 ` Sage Weil
@ 2012-09-04 16:48 ` Tommi Virtanen
1 sibling, 0 replies; 17+ messages in thread
From: Tommi Virtanen @ 2012-09-04 16:48 UTC (permalink / raw)
To: Andrew Thompson; +Cc: ceph-devel@vger.kernel.org

On Tue, Sep 4, 2012 at 9:19 AM, Andrew Thompson <andrewkt@aktzero.com> wrote:
> Yes, it was my `data` pool I was trying to grow. After renaming and removing
> the original data pool, I can `ls` my folders/files, but not access them.

Yup, you're seeing ceph-mds being able to access the "metadata" pool,
but all the directory entries point at file data that resides in a
pool_id that no longer exists. While this would be recoverable by
rewriting all the directory entries etc, the simple answer is that your
file data is not easily accessible anymore.

If this is just a test filesystem, and you have a recent backup anyway,
you might just go forward with restoring that. If there is any doubt
about that, you can leave the existing content around until you're sure
you can restore the backup successfully, and you don't really need to
re-create the cluster either. If this sounds necessary, let me know and
I'll walk you through the process; but the simple answer really is
recreating the cluster from scratch, so if this is just test data, go
with that.
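A toy model of the failure discussed in this thread, for illustration only. It assumes nothing about the real ceph-mds or RADOS data structures: the dicts, the function names `cppool` and `read_file`, and the pool ids 0 and 3 are all hypothetical. The one point taken from the thread is that file metadata refers to its data pool by numeric id rather than by name.

```python
# Toy model only: these dicts are NOT the real ceph-mds or RADOS structures.
pools = {0: {"name": "data", "objects": {"file1": b"hello"}}}
# Each file record remembers the numeric id of the pool holding its data.
inodes = {"file1": {"pool_id": 0}}


def cppool(src_id, dst_id, dst_name):
    """Copy every object into a new pool; the new pool gets a new id."""
    pools[dst_id] = {"name": dst_name,
                     "objects": dict(pools[src_id]["objects"])}


def read_file(name):
    """Follow the pool id stored in the file record to find the data."""
    pool_id = inodes[name]["pool_id"]
    if pool_id not in pools:
        raise LookupError(f"{name}: pool id {pool_id} no longer exists")
    return pools[pool_id]["objects"][name]


cppool(0, 3, "tempstore")   # copy the objects into a new pool (id 3)
del pools[0]                # remove the original pool
pools[3]["name"] = "data"   # renaming changes the name, not the id

print(sorted(inodes))       # listing still works: the metadata is intact
try:
    read_file("file1")
except LookupError as err:
    print("read failed:", err)
```

In this sketch the copied bytes still exist in pool 3, but nothing in the file records points there, which matches the symptom of being able to `ls` files without being able to read them.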