From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: ceph stability
Date: Wed, 19 Dec 2012 07:26:54 -0600
Message-ID: <50D1C09E.9000504@inktank.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-ie0-f171.google.com ([209.85.223.171]:34365 "EHLO mail-ie0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753379Ab2LSN0x (ORCPT ); Wed, 19 Dec 2012 08:26:53 -0500
Received: by mail-ie0-f171.google.com with SMTP id 17so2709946iea.2 for ; Wed, 19 Dec 2012 05:26:53 -0800 (PST)
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Roman Hlynovskiy
Cc: ceph-devel@vger.kernel.org

On 12/19/2012 03:03 AM, Roman Hlynovskiy wrote:
> Hello,
>
> I have 2 issues with ceph stability and looking for help to resolve them.
> My setup is pretty simple - 3 debian 32bit stable systems each running
> osd, mon and mds.
> the conf is the following:
> --------------------
> [global]
> auth cluster required = none
> auth service required = none
> auth client required = none
>
> [osd]
> osd journal size = 1000
> filestore xattr use omap = true
>
> [mon.a]
> host = ceph-node01
> mon addr = 192.168.7.11:6789
>
> [mon.b]
> host = ceph-node02
> mon addr = 192.168.7.12:6789
>
> [mon.c]
> host = ceph-node03
> mon addr = 192.168.7.13:6789
>
> [mds.a]
> host = ceph-node01
>
> [mds.b]
> host = ceph-node02
>
> [mds.c]
> host = ceph-node03

A quick side note: multi-mds setups aren't supported in production
right now. I'm not sure whether your stat problems below are related,
but you may want to try starting out with a single mds and see if the
problem goes away. If so, there may be some hints in the mds logs
regarding what's going on. Bug reports are welcome!
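
For illustration only, here is a rough sketch of the mds part of the conf
quoted above, trimmed to a single daemon. This is an assumption about what
a single-mds test could look like, not a tested config: the [mds.b] and
[mds.c] sections are dropped and all other sections stay unchanged.

```ini
; Sketch only (assumption, untested): keep one mds definition and
; drop the [mds.b] and [mds.c] sections from the quoted conf.
[mds.a]
host = ceph-node01
```

The ceph-mds daemons on ceph-node02 and ceph-node03 would presumably also
need to be stopped for this to have any effect.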

>
> [osd.0]
> host = ceph-node01
>
> [osd.1]
> host = ceph-node02
>
> [osd.2]
> host = ceph-node03
> --------------------
> ceph -s is:
> health HEALTH_OK
> monmap e4: 3 mons at
> {a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0},
> election epoch 118, quorum 0,1,2 a,b,c
> osdmap e197: 3 osds: 3 up, 3 in
> pgmap v43305: 384 pgs: 384 active+clean; 72351 MB data, 144 GB
> used, 105 GB / 249 GB avail
> mdsmap e4439: 1/1/1 up {0=a=up:active}, 2 up:standby
> --------------------
>
> My first problem - I am getting spurious mon's deaths, which usually
> looks like this:
>
> --- begin dump of recent events ---
> 0> 2012-12-19 10:35:58.912119 b41eab70 -1 *** Caught signal (Aborted) **
> in thread b41eab70
>
> ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
> 1: /usr/bin/ceph-mon() [0x8183a11]
> 2: [0xb7714400]
> 3: (gsignal()+0x47) [0xb7337577]
> 4: (abort()+0x182) [0xb733a962]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x14f) [0xb755653f]
> 6: (()+0xbd405) [0xb7554405]
> 7: (()+0xbd442) [0xb7554442]
> 8: (()+0xbd581) [0xb7554581]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x80f) [0x824cabf]
> 10: /usr/bin/ceph-mon() [0x80e3c1d]
> 11: (MDSMonitor::tick()+0x1e3b) [0x811ea0b]
> 12: (MDSMonitor::on_active()+0x1d) [0x81188dd]
> 13: (PaxosService::_active()+0x212) [0x80e4b02]
> 14: (Context::complete(int)+0x19) [0x80c4cf9]
> 15: (finish_contexts(CephContext*, std::list std::allocator >&, int)+0x13f) [0x80d208f]
> 16: (Monitor::recovered_leader(int)+0x3ac) [0x80ac5ac]
> 17: (Paxos::handle_last(MMonPaxos*)+0xb02) [0x80e0572]
> 18: (Paxos::dispatch(PaxosServiceMessage*)+0x2c4) [0x80e0e94]
> 19: (Monitor::_ms_dispatch(Message*)+0x1181) [0x80c3b11]
> 20: (Monitor::ms_dispatch(Message*)+0x31) [0x80d5021]
> 21: (DispatchQueue::entry()+0x337) [0x82afa47]
> 22: (DispatchQueue::DispatchThread::entry()+0x20) [0x823eec0]
> 23: (Thread::_entry_func(void*)+0x11) [0x824be41]
> 24: (()+0x57b0) [0xb75ef7b0]
> 25: (clone()+0x5e) [0xb73d8cde]
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 0/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/ 5 hadoop
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 100000
> max_new 1000
> log_file /var/log/ceph/ceph-mon.a.log
> --- end dump of recent events ---
>
> the binaries are coming from ceph.com debian-testing repo.
>
> My second problem - I have 2 systems which mount ceph. Whenever I
> mount ceph on any other system it usually mounts but get stuck on
> stat* operations (i.e. simple ls -al will hang with read( from the
> ceph-mounted directory for ages). This kind of client stuck also
> affects two working clients. they also start to stuck on the stat*
> even after shutdown of the third client. so usually umount/mount or
> even reboot for existing clients solves the issue)
>
>
> --
> ...WBR, Roman Hlynovskiy
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>