From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: Large numbers of OSD per node
Date: Mon, 05 Nov 2012 06:45:25 -0600
Message-ID: <5097B4E5.8070706@inktank.com>
References: <5097676B.5020200@gmail.com> <50979CA3.3060005@widodh.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-ia0-f174.google.com ([209.85.210.174]:51357 "EHLO
    mail-ia0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1752066Ab2KEMqK (ORCPT ); Mon, 5 Nov 2012 07:46:10 -0500
Received: by mail-ia0-f174.google.com with SMTP id y32so4228762iag.19
    for ; Mon, 05 Nov 2012 04:46:09 -0800 (PST)
In-Reply-To: <50979CA3.3060005@widodh.nl>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Wido den Hollander
Cc: Andrew Thrift, "ceph-devel@vger.kernel.org"

On 11/05/2012 05:01 AM, Wido den Hollander wrote:
> Hi,
>
> On 05-11-12 08:14, Andrew Thrift wrote:
>> Hi,
>>
>> We are evaluating CEPH for deployment.
>>
>> I was wondering if there are any current "best practices" around the
>> number of OSD's per node ?
>>
>>
>> e.g. We are looking at deploying 3 nodes, each with 72x SAS disks, and
>> 2x 10gigabit Ethernet bonded.
>>
>> Would this best be configured as 72 OSD's per node.
>>
>> Or would we be better off using raid5 to have 18 OSD's per node ?
>>
>
> You should be aware of a large data movement when using 3 nodes.
>
> I myself am a fan of going with a lot of smaller nodes instead of
> building big nodes.
>
> With 3 such nodes you'd probably be going 2x replication? Otherwise you
> can never recover when one of the 3 nodes completely burns down to the
> ground.
>
> If you have 72 1TB disks in such a node you could in theory be moving
> 72TB, that would put a lot of stress on the other two nodes and you
> would need a lot of memory and CPU power.
>
> You might be better off going for 27 nodes with 8 disks each, or have
> 18 nodes with 12 disks?
>
> When a node fails the recovery will be much easier on your cluster.
>
> You can also take out a node for maintenance when needed.
>
> Another thing you should be aware of is status "D". What if a filesystem
> inside one of your big machines hangs and one of the OSDs hangs in
> status "D", waiting for I/O which will never come?
>
> You'd be forced to reboot that node and that would again take 72TB of
> data offline.
>
> I am not aware of anybody using such big nodes in production. It could
> work, but you will need a lot of memory and a lot of CPU.
>
> The recommendation is 1GB/1GHz per OSD, so you'd be looking at at least
> 72GB of memory and 72GHz of CPU power.
>
> Wido

To echo what Wido is saying here, we've not really extensively tested
configurations with nodes that big at Inktank either. The biggest test
node we have in-house is a 36-drive SC847a, and that was a pretty recent
acquisition. Nodes that large are definitely bigger than what most
people are looking at right now.

For a deployment of the size you are talking about, I think you'd
probably be better served by nodes with 24 disks or fewer, and picking
up more of them. You'll likely have better performance and fewer
problems if a node goes down. It is lower density, but I think in this
case using up a few extra U will be worth it.

Having said that, my guess is that if you were to use 72-drive nodes,
you'd probably be best off doing a raid-5 or raid-6 and doing something
like 12 6-drive OSDs. Be mindful of what drives, expanders, and
controllers you pick.

-- 
Mark Nelson
Performance Engineer
Inktank
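
The sizing arithmetic discussed in this thread can be roughly sketched as
below. This is only an illustration, not an official sizing tool: the
Layout class and its field names are hypothetical, the 1TB disk size comes
from Wido's example, and the 1GB/1GHz-per-OSD figure is the rule of thumb
he quotes. Raw capacity ignores RAID parity overhead.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    """Hypothetical description of one cluster layout from the thread."""
    nodes: int
    disks_per_node: int
    disk_tb: float = 1.0    # assumption: 1TB disks, as in Wido's example
    disks_per_osd: int = 1  # >1 models a RAID group behind a single OSD

    @property
    def osds_per_node(self) -> int:
        return self.disks_per_node // self.disks_per_osd

    def ram_gb_per_node(self) -> float:
        # Rule of thumb quoted in the thread: 1GB of memory per OSD.
        return self.osds_per_node * 1.0

    def cpu_ghz_per_node(self) -> float:
        # Rule of thumb quoted in the thread: 1GHz of CPU per OSD.
        return self.osds_per_node * 1.0

    def raw_tb_per_node(self) -> float:
        # Raw data that goes offline (or has to move) if one node is lost.
        return self.disks_per_node * self.disk_tb

    def share_lost_per_node(self) -> float:
        # Fraction of the cluster's raw capacity held by a single node.
        return 1.0 / self.nodes


layouts = {
    "3 x 72, one OSD per disk": Layout(3, 72),
    "3 x 72, 6-drive RAID OSDs": Layout(3, 72, disks_per_osd=6),
    "18 x 12, one OSD per disk": Layout(18, 12),
    "27 x 8, one OSD per disk": Layout(27, 8),
}

for name, lay in layouts.items():
    print(
        f"{name}: {lay.osds_per_node} OSDs/node, "
        f"{lay.ram_gb_per_node():.0f}GB RAM, "
        f"{lay.cpu_ghz_per_node():.0f}GHz CPU, "
        f"{lay.raw_tb_per_node():.0f}TB raw per node, "
        f"{lay.share_lost_per_node():.1%} of cluster per node"
    )
```

All four layouts use the same 216 disks; what changes is how much data a
single node failure or reboot touches, and how much memory and CPU each
node needs under the per-OSD rule of thumb.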