Message-ID: <503C8762.5070607@itechnical.de>
Date: Tue, 28 Aug 2012 10:54:58 +0200
From: Heiko Nardmann
To: linux-kernel@vger.kernel.org
Subject: Q: dlm_recoverd takes 100%

Hi all,

maybe someone can give me a hint as to which mailing list I should
contact (if this is the wrong one)?

In a two-node cluster system I see 'dlm_recoverd' using 100% of one CPU
for around 6 minutes.
Here is a small excerpt from the 'top' output during that period:

top - 10:51:01 up 3 days, 17:21,  5 users,  load average: 10.19, 5.39, 2.76
Tasks: 536 total,   3 running, 533 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.2%us,  6.6%sy,  0.0%ni, 92.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12183344k total, 11827540k used,   355804k free,   160332k buffers
Swap: 14417912k total,        0k used, 14417912k free,  8299364k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 3121 root      20   0     0    0    0 R 100.0  0.0   3:36.15 dlm_recoverd

The cluster nodes use a shared SAN (GFS2). The second node was being
rebooted while I observed this behaviour.

The real problem is that my application is unable to open a file on the
SAN for these 6 minutes. After the reboot of the second node, all is fine
again and the application succeeds in opening the file.

I am not sure what could cause these two symptoms.

Thanks in advance for any hint!

Kind regards,

Heiko