From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kevin Decherf
Subject: Crash and strange things on MDS
Date: Mon, 4 Feb 2013 19:01:54 +0100
Message-ID: <20130204180154.GO3286@kdecherf.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Received: from mail-wg0-f54.google.com ([74.125.82.54]:58324 "EHLO
    mail-wg0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1753711Ab3BDSB7 (ORCPT ); Mon, 4 Feb 2013 13:01:59 -0500
Received: by mail-wg0-f54.google.com with SMTP id fm10so5072459wgb.9
    for ; Mon, 04 Feb 2013 10:01:57 -0800 (PST)
Content-Disposition: inline
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: ceph-devel@vger.kernel.org
Cc: support@clever-cloud.com

Hey everyone,

This is my first post here. I would like to report a potential issue I
found today using Ceph 0.56.1.

The cluster configuration is, briefly: 27 OSDs of ~900GB each and 3
MON/MDS nodes. All nodes are running Exherbo (a source-based
distribution) with Ceph 0.56.1 and Linux 3.7.0. We are only using CephFS
on this cluster, which is mounted on ~60 clients (increasing each day).
Objects are replicated three times and the cluster currently holds only
7GB of data in 350k objects.

Under certain conditions (I don't know which ones yet), some clients
hang, generate CPU overload (kworker) and are unable to perform any I/O
on Ceph. The active MDS sees ~20Mbps in/out during the issue (less than
2Mbps during normal activity). I don't know if it's directly linked, but
we also observe a lot of missing files at the same time.

The problem is similar to this one [1].

Until today, restarting the client or the MDS was enough, but we have
now found a new behavior: the active MDS consumes a lot of CPU for 3 to
5 hours, with ~25% of the clients hanging.

In the logs I found a segfault with this backtrace [2] and 100,000
dumped events during the first hang.

We observed another hang which produced a lot of these events (in debug
mode):

 - "mds.0.server FAIL on ESTALE but attempting recovery"
 - "mds.0.server reply_request -116 (Stale NFS file handle)
   client_request(client.10991:1031 getattr As #1000004bab0 RETRY=132)"

We have no profiling tools available on these nodes, and I don't know
what I should search for in the 35 GB log file.

Note: the segmentation fault occurred only once, but the problem was
observed four times on this cluster.

Any help would be appreciated.

References:

[1] http://www.spinics.net/lists/ceph-devel/msg04903.html
[2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
    1: /usr/bin/ceph-mds() [0x817e82]
    2: (()+0xf140) [0x7f9091d30140]
    3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
    4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
    5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
    6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
    7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
    8: (Server::kill_session(Session*)+0x137) [0x549c67]
    9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
    10: (MDS::tick()+0x338) [0x4da928]
    11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
    12: (SafeTimerThread::entry()+0xd) [0x782bad]
    13: (()+0x7ddf) [0x7f9091d28ddf]
    14: (clone()+0x6d) [0x7f90909cc24d]

Cheers,
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com
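
One possible starting point for the 35 GB log file, offered as a minimal sketch and not as an established Ceph procedure: stream the file once and count the failed reply_request lines per client and per inode, to see whether the -116 replies concentrate on a few clients or files. The regular expression is modeled only on the single sample line quoted in this report, so other Ceph versions or other request types may need a different pattern. The names REPLY_RE, summarize_failed_replies and summarize_log_file are hypothetical.

```python
import re
from collections import Counter

# Assumption: pattern derived from the one sample line in this report:
#   reply_request -116 (Stale NFS file handle)
#   client_request(client.10991:1031 getattr As #1000004bab0 RETRY=132)
REPLY_RE = re.compile(
    r"reply_request (?P<errno>-\d+) \((?P<msg>[^)]*)\) "
    r"client_request\(client\.(?P<client>\d+):(?P<tid>\d+) "
    r"(?P<op>\w+)[^#)]*#(?P<ino>[0-9a-fA-F]+)"
    r"(?:[^)]*?RETRY=(?P<retry>\d+))?"
)


def summarize_failed_replies(lines, errno=-116):
    """Tally matching reply_request lines by client id and by inode.

    `lines` is any iterable of strings, so a file object can be passed
    directly and the log is never loaded into memory as a whole.
    Returns (by_client, by_inode, max_retry).
    """
    by_client = Counter()
    by_inode = Counter()
    max_retry = 0
    for line in lines:
        m = REPLY_RE.search(line)
        if m is None or int(m.group("errno")) != errno:
            continue
        by_client[m.group("client")] += 1
        by_inode[m.group("ino")] += 1
        if m.group("retry"):
            max_retry = max(max_retry, int(m.group("retry")))
    return by_client, by_inode, max_retry


def summarize_log_file(path, errno=-116):
    """Convenience wrapper: stream a log file from disk line by line."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, errors="replace") as log:
        return summarize_failed_replies(log, errno)
```

Calling summarize_log_file with the path of the MDS log and then inspecting by_client.most_common(10) and by_inode.most_common(10) would show which clients and inodes account for most of the stale-handle replies. This only narrows down where to look and does not explain the segfault in [2].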