From: Wido den Hollander
Subject: Re: Hit suicide timeout after adding new osd
Date: Wed, 23 Jan 2013 13:26:33 +0100
Message-ID: <50FFD6F9.1010705@widodh.nl>
In-Reply-To: <50FFD420.7000604@mermaidconsulting.dk>
To: Jens Kristian Søgaard
Cc: Sage Weil, Stefan Priebe, "ceph-devel@vger.kernel.org"

On 01/23/2013 01:14 PM, Jens Kristian Søgaard wrote:
> Hi Sage,
>
>> I think the problem now is just that 'osd target transaction size' is
>> too big (default is 300). Recommended 50.. let's see how that goes.
>> Even smaller (20 or 25) would probably be fine.
>

Going through the code and reading that this solved it for Jens, could
this issue be traced back to less powerful CPUs?

I've seen this on Atom and Fusion platforms, neither of which excels in
computing power.

From what I read, the OSD by default does 300 transactions and then
commits them? If the CPU is too slow to handle all the work, timeouts
can occur because it can't do all the transactions inside the set window?
By lowering the number of transactions it sends out a heartbeat more
often thus keeping itself alive. Correct?

Wido

> I set it to 50, and that seems to have solved all my problems.
>
> After a day or so my cluster got to a HEALTH_OK state again. It has been
> running for a few days now without any crashes!
>
> Thanks for all your help!
>
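
To make the question above concrete, here is a toy model of the reasoning. It is not Ceph code: it simply assumes a worker that resets its heartbeat only after finishing a whole batch of transactions, so the longest silence is roughly batch size times per-transaction cost. The function names, the 0.5 s per-transaction cost, and the 100 s timeout are all made-up values for illustration; only the batch sizes 300 and 50 come from the thread.

```python
# Toy model only, NOT Ceph code. All names and numbers are illustrative
# assumptions, except the batch sizes 300 and 50 mentioned in the thread.

def longest_gap_between_heartbeats(total_txns, batch_size, secs_per_txn):
    """Longest time (in seconds) the worker stays silent, assuming it
    resets its heartbeat only after completing each batch."""
    longest = 0.0
    remaining = total_txns
    while remaining > 0:
        batch = min(batch_size, remaining)
        # The worker is silent for the whole duration of this batch.
        longest = max(longest, batch * secs_per_txn)
        remaining -= batch
    return longest


def hits_timeout(total_txns, batch_size, secs_per_txn, timeout_secs):
    """True if any single batch takes longer than the timeout window."""
    gap = longest_gap_between_heartbeats(total_txns, batch_size, secs_per_txn)
    return gap > timeout_secs


if __name__ == "__main__":
    SLOW_CPU_SECS_PER_TXN = 0.5   # hypothetical cost on a weak CPU
    TIMEOUT_SECS = 100            # hypothetical timeout window

    for batch_size in (300, 50):
        print(batch_size,
              hits_timeout(1000, batch_size, SLOW_CPU_SECS_PER_TXN, TIMEOUT_SECS))
```

Under these assumed numbers a batch of 300 keeps the worker silent for longer than the window, while a batch of 50 does not. Whether the OSD actually behaves this way is exactly the question being asked here.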