From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mark Nelson Subject: Re: Potential OSD deadlock? Date: Wed, 23 Sep 2015 14:10:25 -0500 Message-ID: <5602F921.3010204@redhat.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org Sender: "ceph-users" To: Robert LeBlanc , Sage Weil Cc: "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org" , ceph-devel List-Id: ceph-devel.vger.kernel.org FWIW, we've got some 40GbE Intel cards in the community performance cluster on a Mellanox 40GbE switch that appear (knock on wood) to be running fine with 3.10.0-229.7.2.el7.x86_64. We did get feedback from Intel that older drivers might cause problems though. Here's ifconfig from one of the nodes: ens513f1: flags=4163 mtu 1500 inet 10.0.10.101 netmask 255.255.255.0 broadcast 10.0.10.255 inet6 fe80::6a05:caff:fe2b:7ea1 prefixlen 64 scopeid 0x20 ether 68:05:ca:2b:7e:a1 txqueuelen 1000 (Ethernet) RX packets 169232242875 bytes 229346261232279 (208.5 TiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 153491686361 bytes 203976410836881 (185.5 TiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 Mark On 09/23/2015 01:48 PM, Robert LeBlanc wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA256 > > OK, here is the update on the saga... > > I traced some more of blocked I/Os and it seems that communication > between two hosts seemed worse than others. I did a two way ping flood > between the two hosts using max packet sizes (1500). After 1.5M > packets, no lost pings. Then then had the ping flood running while I > put Ceph load on the cluster and the dropped pings started increasing > after stopping the Ceph workload the pings stopped dropping. > > I then ran iperf between all the nodes with the same results, so that > ruled out Ceph to a large degree. I then booted in the the > 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there > hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really > need the network enhancements in the 4.x series to work well. > > Does this sound familiar to anyone? I'll probably start bisecting the > kernel to see where this issue in introduced. Both of the clusters > with this issue are running 4.x, other than that, they are pretty > differing hardware and network configs. > > Thanks, > -----BEGIN PGP SIGNATURE----- > Version: Mailvelope v1.1.0 > Comment: https://www.mailvelope.com > > wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr > RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l > AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V > 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j > cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv > F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW > byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE > /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe > LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok > UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98 > sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX > KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X > 4OEo > =P33I > -----END PGP SIGNATURE----- > ---------------- > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA256 >> >> This is IPoIB and we have the MTU set to 64K. There was some issues >> pinging hosts with "No buffer space available" (hosts are currently >> configured for 4GB to test SSD caching rather than page cache). I >> found that MTU under 32K worked reliable for ping, but still had the >> blocked I/O. >> >> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing >> the blocked I/O. >> - ---------------- >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> >> >> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil wrote: >>> On Tue, 22 Sep 2015, Samuel Just wrote: >>>> I looked at the logs, it looks like there was a 53 second delay >>>> between when osd.17 started sending the osd_repop message and when >>>> osd.13 started reading it, which is pretty weird. Sage, didn't we >>>> once see a kernel issue which caused some messages to be mysteriously >>>> delayed for many 10s of seconds? >>> >>> Every time we have seen this behavior and diagnosed it in the wild it has >>> been a network misconfiguration. Usually related to jumbo frames. >>> >>> sage >>> >>> >>>> >>>> What kernel are you running? >>>> -Sam >>>> >>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc wrote: >>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>> Hash: SHA256 >>>>> >>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've >>>>> extracted what I think are important entries from the logs for the >>>>> first blocked request. NTP is running all the servers so the logs >>>>> should be close in terms of time. Logs for 12:50 to 13:00 are >>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz >>>>> >>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client >>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13 >>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16 >>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17 >>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16 >>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0 >>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec >>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17 >>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13 >>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0 >>>>> >>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and >>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away, >>>>> but for some reason osd.13 doesn't get the message until 53 seconds >>>>> later. osd.17 seems happy to just wait and doesn't resend the data >>>>> (well, I'm not 100% sure how to tell which entries are the actual data >>>>> transfer). >>>>> >>>>> It looks like osd.17 is receiving responses to start the communication >>>>> with osd.13, but the op is not acknowledged until almost a minute >>>>> later. To me it seems that the message is getting received but not >>>>> passed to another thread right away or something. This test was done >>>>> with an idle cluster, a single fio client (rbd engine) with a single >>>>> thread. >>>>> >>>>> The OSD servers are almost 100% idle during these blocked I/O >>>>> requests. I think I'm at the end of my troubleshooting, so I can use >>>>> some help. >>>>> >>>>> Single Test started about >>>>> 2015-09-22 12:52:36 >>>>> >>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 : >>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > >>>>> 30.439150 secs >>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 : >>>>> cluster [WRN] slow request 30.439150 seconds old, received at >>>>> 2015-09-22 12:55:06.487451: >>>>> osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545 >>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785) >>>>> currently waiting for subops from 13,16 >>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster >>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for > >>>>> 30.379680 secs >>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster >>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22 >>>>> 12:55:06.406303: >>>>> osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541 >>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785) >>>>> currently waiting for subops from 13,17 >>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster >>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22 >>>>> 12:55:06.318144: >>>>> osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f >>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785) >>>>> currently waiting for subops from 13,14 >>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 : >>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > >>>>> 30.954212 secs >>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 : >>>>> cluster [WRN] slow request 30.954212 seconds old, received at >>>>> 2015-09-22 12:57:33.044003: >>>>> osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d >>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785) >>>>> currently waiting for subops from 16,17 >>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 : >>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > >>>>> 30.704367 secs >>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 : >>>>> cluster [WRN] slow request 30.704367 seconds old, received at >>>>> 2015-09-22 12:57:33.055404: >>>>> osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e >>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write >>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785) >>>>> currently waiting for subops from 13,17 >>>>> >>>>> Server IP addr OSD >>>>> nodev - 192.168.55.11 - 12 >>>>> nodew - 192.168.55.12 - 13 >>>>> nodex - 192.168.55.13 - 16 >>>>> nodey - 192.168.55.14 - 17 >>>>> nodez - 192.168.55.15 - 14 >>>>> nodezz - 192.168.55.16 - 15 >>>>> >>>>> fio job: >>>>> [rbd-test] >>>>> readwrite=write >>>>> blocksize=4M >>>>> #runtime=60 >>>>> name=rbd-test >>>>> #readwrite=randwrite >>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1 >>>>> #rwmixread=72 >>>>> #norandommap >>>>> #size=1T >>>>> #blocksize=4k >>>>> ioengine=rbd >>>>> rbdname=test2 >>>>> pool=rbd >>>>> clientname=admin >>>>> iodepth=8 >>>>> #numjobs=4 >>>>> #thread >>>>> #group_reporting >>>>> #time_based >>>>> #direct=1 >>>>> #ramp_time=60 >>>>> >>>>> >>>>> Thanks, >>>>> -----BEGIN PGP SIGNATURE----- >>>>> Version: Mailvelope v1.1.0 >>>>> Comment: https://www.mailvelope.com >>>>> >>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z >>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu >>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl >>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB >>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7 >>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF >>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI >>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn >>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq >>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL >>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO >>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB >>>>> J3hS >>>>> =0J7F >>>>> -----END PGP SIGNATURE----- >>>>> ---------------- >>>>> Robert LeBlanc >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>>>> >>>>> >>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum wrote: >>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc wrote: >>>>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>>>> Hash: SHA256 >>>>>>> >>>>>>> Is there some way to tell in the logs that this is happening? >>>>>> >>>>>> You can search for the (mangled) name _split_collection >>>>>>> I'm not >>>>>>> seeing much I/O, CPU usage during these times. Is there some way to >>>>>>> prevent the splitting? Is there a negative side effect to doing so? >>>>>> >>>>>> Bump up the split and merge thresholds. You can search the list for >>>>>> this, it was discussed not too long ago. >>>>>> >>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions >>>>>>> are aborted, they are reestablished and complete immediately. >>>>>>> >>>>>>> The fio test is just a seq write, starting it over (rewriting from the >>>>>>> beginning) is still causing the issue. I was suspect that it is not >>>>>>> having to create new file and therefore split collections. This is on >>>>>>> my test cluster with no other load. >>>>>> >>>>>> Hmm, that does make it seem less likely if you're really not creating >>>>>> new objects, if you're actually running fio in such a way that it's >>>>>> not allocating new FS blocks (this is probably hard to set up?). >>>>>> >>>>>>> >>>>>>> I'll be doing a lot of testing today. Which log options and depths >>>>>>> would be the most helpful for tracking this issue down? >>>>>> >>>>>> If you want to go log diving "debug osd = 20", "debug filestore = 20", >>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit out >>>>>> everything you need to track exactly what each Op is doing. >>>>>> -Greg >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> >> >> -----BEGIN PGP SIGNATURE----- >> Version: Mailvelope v1.1.0 >> Comment: https://www.mailvelope.com >> >> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG >> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf >> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV >> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc >> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B >> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz >> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj >> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx >> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7 >> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H >> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg >> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp >> gcZm >> =CjwB >> -----END PGP SIGNATURE----- > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >