From mboxrd@z Thu Jan 1 00:00:00 1970
From: "caleb.miles"
Subject: Re: Crush not deliverying data uniformly -> HEALTH_ERR full osd
Date: Wed, 08 Aug 2012 15:57:28 -0700
Message-ID: <5022EED8.1070107@inktank.com>
References: <81C477727102DA4E9B2605AC748C495419104F5549@exch10>
 <81C477727102DA4E9B2605AC748C495419104F5553@exch10>
 <81C477727102DA4E9B2605AC748C495419104F5579@exch10>
 <5022EE4D.1090604@inktank.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-pb0-f46.google.com ([209.85.160.46]:59942 "EHLO
 mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1757803Ab2HHW5c (ORCPT ); Wed, 8 Aug 2012 18:57:32 -0400
Received: by pbbrr13 with SMTP id rr13so2167859pbb.19 for ;
 Wed, 08 Aug 2012 15:57:31 -0700 (PDT)
In-Reply-To: <5022EE4D.1090604@inktank.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: ceph-devel@vger.kernel.org

On 08/08/2012 03:55 PM, caleb.miles wrote:
> Hi Paul,
>
> Sorry to take so long to get back to you. Could you add the following
> lines to the top of your CRUSH map
>
> # tunables
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
>
> and compile with
>
> crushtool --enable-unsafe-tunables -c
>
> Caleb
>
> On 08/06/2012 03:40 PM, Paul Pettigrew wrote:
>> Hi Caleb
>> Crushmap below, thanks!
>> Paul
>>
>>
>>
>> root@dsanb1-coy:~# cat crushfile.txt
>> # begin crush map
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>> device 6 osd.6
>> device 7 osd.7
>> device 8 osd.8
>> device 9 osd.9
>> device 10 osd.10
>> device 11 osd.11
>> device 12 osd.12
>> device 13 osd.13
>> device 14 osd.14
>> device 15 osd.15
>> device 16 osd.16
>> device 17 osd.17
>> device 18 osd.18
>> device 19 osd.19
>> device 20 osd.20
>> device 21 osd.21
>> device 22 osd.22
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 rack
>> type 3 zone
>>
>> # buckets
>> host dsanb1-coy {
>>     id -2    # do not change unnecessarily
>>     # weight 11.000
>>     alg straw
>>     hash 0   # rjenkins1
>>     item osd.0 weight 2.000
>>     item osd.1 weight 2.000
>>     item osd.10 weight 2.000
>>     item osd.2 weight 2.000
>>     item osd.3 weight 2.000
>>     item osd.4 weight 2.000
>>     item osd.5 weight 2.000
>>     item osd.6 weight 2.000
>>     item osd.7 weight 2.000
>>     item osd.8 weight 2.000
>>     item osd.9 weight 2.000
>> }
>> host dsanb2-coy {
>>     id -4    # do not change unnecessarily
>>     # weight 6.000
>>     alg straw
>>     hash 0   # rjenkins1
>>     item osd.11 weight 1.000
>>     item osd.12 weight 1.000
>>     item osd.13 weight 1.000
>>     item osd.14 weight 1.000
>>     item osd.15 weight 1.000
>>     item osd.16 weight 1.000
>> }
>> host dsanb3-coy {
>>     id -5    # do not change unnecessarily
>>     # weight 6.000
>>     alg straw
>>     hash 0   # rjenkins1
>>     item osd.17 weight 1.000
>>     item osd.18 weight 1.000
>>     item osd.19 weight 1.000
>>     item osd.20 weight 1.000
>>     item osd.21 weight 1.000
>>     item osd.22 weight 1.000
>> }
>> rack 2nrack {
>>     id -3    # do not change unnecessarily
>>     # weight 23.000
>>     alg straw
>>     hash 0   # rjenkins1
>>     item dsanb1-coy weight 11.000
>>     item dsanb2-coy weight 6.000
>>     item dsanb3-coy weight 6.000
>> }
>> zone default {
>>     id -1    # do not change unnecessarily
>>     # weight 23.000
>>     alg straw
>>     hash 0   # rjenkins1
>>     item 2nrack weight 23.000
>> }
>> rack 1nrack {
>>     id -6    # do not change unnecessarily
>>     # weight 11.000
>>     alg straw
>>     hash 0   # rjenkins1
>>     item weight 11.000
>> }
>> zone bak {
>>     id -7    # do not change unnecessarily
>>     # weight 23.000
>>     alg straw
>>     hash 0   # rjenkins1
>>     item 1nrack weight 23.000
>> }
>>
>> # rules
>> rule data {
>>     ruleset 0
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take default
>>     step chooseleaf firstn 0 type host
>>     step emit
>> }
>> rule metadata {
>>     ruleset 1
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take default
>>     step chooseleaf firstn 0 type host
>>     step emit
>> }
>> rule rbd {
>>     ruleset 2
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take default
>>     step chooseleaf firstn 0 type host
>>     step emit
>> }
>> rule backup {
>>     ruleset 3
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take bak
>>     step chooseleaf firstn 0 type host
>>     step emit
>> }
>>
>> # end crush map
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Caleb Miles
>> Sent: Tuesday, 7 August 2012 6:09 AM
>> To: ceph-devel@vger.kernel.org
>> Subject: Re: Crush not deliverying data uniformly -> HEALTH_ERR full osd
>>
>> Hello Paul,
>>
>> Could you post your CRUSH map, crushtool -d
>>
>> caleb
>>
>> On Mon, Aug 6, 2012 at 1:01 PM, Yehuda Sadeh wrote:
>>> ---------- Forwarded message ----------
>>> From: Paul Pettigrew
>>> Date: Sun, Aug 5, 2012 at 8:08 PM
>>> Subject: RE: Crush not deliverying data uniformly -> HEALTH_ERR full
>>> osd
>>> To: Yehuda Sadeh
>>> Cc: "ceph-devel@vger.kernel.org"
>>>
>>>
>>> Hi Yehuda, we have:
>>>
>>> root@dsanb1-coy:/mnt/ceph# ceph osd dump | grep ^pool
>>> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins
>>>     pg_num 1472 pgp_num 1472 last_change 1 owner 0 crash_replay_interval 45
>>> pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
>>>     pg_num 1472 pgp_num 1472 last_change 1 owner 0
>>> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins
>>>     pg_num 1472 pgp_num 1472 last_change 1 owner 0
>>> pool 3 'backup' rep size 1 crush_ruleset 3 object_hash rjenkins
>>>     pg_num 1472 pgp_num 1472 last_change 1 owner 0
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Yehuda Sadeh
>>> Sent: Monday, 6 August 2012 11:16 AM
>>> To: Paul Pettigrew
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: Crush not deliverying data uniformly -> HEALTH_ERR full
>>> osd
>>>
>>> On Sun, Aug 5, 2012 at 5:16 PM, Paul Pettigrew
>>> wrote:
>>>> Hi Ceph community
>>>>
>>>> We are at the stage of performance capacity testing, where
>>>> significant amounts of backup data are being written to Ceph.
>>>>
>>>> The issue we have is that the underlying HDDs are not being
>>>> populated (roughly) uniformly, and our Ceph system hits a brick
>>>> wall: after a couple of days, our 30TB storage system is no longer
>>>> able to operate after having stored only ~7TB.
>>>>
>>>> Basically, despite the HDDs (1:1 ratio between OSD and HDD) all
>>>> being the same storage size and weighting in the Crushmap, we have
>>>> disks either:
>>>> a) using 1% space;
>>>> b) using 48%; or
>>>> c) using 96%
>>>> Too precise a split to be an accident.
>>>> See below for more detail
>>>> (osd11-22 not expected to get data, per our crushmap):
>>>>
>>>>
>>>> ceph pg dump
>>>>
>>>> pool 0  2442     0  0  0  10240000000    7302520   7302520
>>>> pool 1  57       0  0  0  127824767      5603518   5603518
>>>> pool 2  0        0  0  0  0              0         0
>>>> pool 3  1808757  0  0  0  7584377697985  1104048   1104048
>>>> sum     1811256  0  0  0  7594745522752  14010086  14010086
>>>> osdstat  kbused      kbavail     kb          hb in  hb out
>>>> 0        930606904   1021178408  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 1        1874428     1949525164  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 2        928811428   1022963676  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 3        929733676   1022051996  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 4        1719124     1949678844  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 5        1853452     1949545892  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 6        930979476   1020807132  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 7        1808968     1949590496  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 8        934035924   1017759100  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 9        1855955384  94927432    1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 10       933572004   1018232340  1953514584
>>>>          [11,12,13,14,15,16,17,18,19,20,21,22] []
>>>> 11       2057096     953060760   957230808
>>>>          [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22] []
>>>> 12       2053512     953064656   957230808
>>>>          [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22] []
>>>> 13       2148732     972501316   976762584
>>>>          [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22] []
>>>> 14       2064640     972585104   976762584
>>>>          [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22] []
>>>> 15       1945388     972703468   976762584
>>>>          [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21] []
>>>> 16       2051708     972599412   976762584
>>>>          [0,1,2,3,4,6,7,8,9,10,17,18,19,20,21] []
>>>> 17       2137632     952980216   957230808
>>>>          [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] []
>>>> 18       2000124     953117508   957230808
>>>>          [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] []
>>>> 19       2095124     972554492   976762584
>>>>          [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] []
>>>> 20       1986800     972662640   976762584
>>>>          [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] []
>>>> 21       2035204     972615332   976762584
>>>>          [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] []
>>>> 22       1961412     972687788   976762584
>>>>          [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] []
>>>> sum      7475488140  25609393172 33131684328
>>>>
>>>> 2012-08-06 10:03:58.964716 7f06783bb700 0 -- 10.32.0.10:0/15147
>>>> send_keepalive con 0x223f690, no pipe.
>>>>
>>>>
>>>> root@dsanb1-coy:~# df -h
>>>> Filesystem                               Size  Used Avail Use% Mounted on
>>>> /dev/md0                                 462G   12G  446G   3% /
>>>> udev                                      12G  4.0K   12G   1% /dev
>>>> tmpfs                                    4.8G  448K  4.8G   1% /run
>>>> none                                     5.0M     0  5.0M   0% /run/lock
>>>> none                                      12G     0   12G   0% /run/shm
>>>> /dev/sdc                                 1.9T  888G  974G  48% /ceph-data/osd.0
>>>> /dev/sdd                                 1.9T  1.8G  1.9T   1% /ceph-data/osd.1
>>>> /dev/sdp                                 1.9T  891G  972G  48% /ceph-data/osd.10
>>>> /dev/sde                                 1.9T  886G  976G  48% /ceph-data/osd.2
>>>> /dev/sdf                                 1.9T  887G  975G  48% /ceph-data/osd.3
>>>> /dev/sdg                                 1.9T  1.7G  1.9T   1% /ceph-data/osd.4
>>>> /dev/sdh                                 1.9T  1.8G  1.9T   1% /ceph-data/osd.5
>>>> /dev/sdi                                 1.9T  888G  974G  48% /ceph-data/osd.6
>>>> /dev/sdm                                 1.9T  1.8G  1.9T   1% /ceph-data/osd.7
>>>> /dev/sdn                                 1.9T  891G  971G  48% /ceph-data/osd.8
>>>> /dev/sdo                                 1.9T  1.8T   91G  96% /ceph-data/osd.9
>>>> 10.32.0.10,10.32.0.25,10.32.0.11:6789:/   31T  7.1T   24T  23% /mnt/ceph
>>>>
>>>>
>>>> We are writing via fstab based cephfs mounts, and the above is going
>>>> to pool3, which is a "backup" pool where we are testing replication
>>>> level of 1x only. This should not have any effect though?
>>>> Below will illustrate the layout we are using (above data writing
>>>> issue is only going to the first node per our testing design):
>>>>
>>>> root@dsanb1-coy:~# ceph osd tree
>>>> dumped osdmap tree epoch 136
>>>> # id    weight  type name               up/down reweight
>>>> -7      23      zone bak
>>>> -6      23          rack 1nrack
>>>> -2      11              host dsanb1-coy
>>>> 0       2                   osd.0       up      1
>>>> 1       2                   osd.1       up      1
>>>> 10      2                   osd.10      up      1
>>>> 2       2                   osd.2       up      1
>>>> 3       2                   osd.3       up      1
>>>> 4       2                   osd.4       up      1
>>>> 5       2                   osd.5       up      1
>>>> 6       2                   osd.6       up      1
>>>> 7       2                   osd.7       up      1
>>>> 8       2                   osd.8       up      1
>>>> 9       2                   osd.9       up      1
>>>> -1      23      zone default
>>>> -3      23          rack 2nrack
>>>> -2      11              host dsanb1-coy
>>>> 0       2                   osd.0       up      1
>>>> 1       2                   osd.1       up      1
>>>> 10      2                   osd.10      up      1
>>>> 2       2                   osd.2       up      1
>>>> 3       2                   osd.3       up      1
>>>> 4       2                   osd.4       up      1
>>>> 5       2                   osd.5       up      1
>>>> 6       2                   osd.6       up      1
>>>> 7       2                   osd.7       up      1
>>>> 8       2                   osd.8       up      1
>>>> 9       2                   osd.9       up      1
>>>> -4      6               host dsanb2-coy
>>>> 11      1                   osd.11      up      1
>>>> 12      1                   osd.12      up      1
>>>> 13      1                   osd.13      up      1
>>>> 14      1                   osd.14      up      1
>>>> 15      1                   osd.15      up      1
>>>> 16      1                   osd.16      up      1
>>>> -5      6               host dsanb3-coy
>>>> 17      1                   osd.17      up      1
>>>> 18      1                   osd.18      up      1
>>>> 19      1                   osd.19      up      1
>>>> 20      1                   osd.20      up      1
>>>> 21      1                   osd.21      up      1
>>>> 22      1                   osd.22      up      1
>>>>
>>>>
>>>> Has anybody got any suggestions?
>>>>
>>> How many pgs per pool do you have? Specifically:
>>> $ ceph osd dump | grep ^pool
>>>
>>> Thanks,
>>> Yehuda
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>
>
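
As a rough cross-check of the 1% / 48% / 96% split Paul reports, the per-OSD utilization of osd.0 through osd.10 can be recomputed from the kbused and kb columns of the quoted "ceph pg dump" output. This is only an illustrative sketch: the helper name utilization_pct and the dictionary layout are assumptions made for the example, not anything from Ceph itself, and the numbers are copied from the dump in this thread.

```python
# kb total reported for each of osd.0..osd.10 in the quoted "ceph pg dump"
# output (all eleven OSDs on dsanb1-coy report the same value).
KB_TOTAL = 1953514584

# kbused per OSD id, copied from the same dump.
KB_USED = {
    0: 930606904,
    1: 1874428,
    2: 928811428,
    3: 929733676,
    4: 1719124,
    5: 1853452,
    6: 930979476,
    7: 1808968,
    8: 934035924,
    9: 1855955384,
    10: 933572004,
}


def utilization_pct(kbused, kb):
    """Return used space as a percentage of total space (hypothetical helper)."""
    return 100.0 * kbused / kb


if __name__ == "__main__":
    # Print one line per OSD so the three usage groups are easy to see.
    for osd in sorted(KB_USED):
        pct = utilization_pct(KB_USED[osd], KB_TOTAL)
        print("osd.%-2d  %5.1f%%" % (osd, pct))
```

The percentages may differ slightly from the Use% column of the df output above, since df rounds its figures.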