From mboxrd@z Thu Jan 1 00:00:00 1970
From: Josh Durgin
Subject: Re: High-availability testing of ceph
Date: Mon, 30 Jul 2012 22:55:41 -0700
Message-ID: <5017735D.3060206@inktank.com>
References: <60E83269D669544E8069A09CB69135EA011444@GDC-CLDMBX-P02.whq.wistron>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-pb0-f46.google.com ([209.85.160.46]:64246 "EHLO
	mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752014Ab2GaFzo (ORCPT );
	Tue, 31 Jul 2012 01:55:44 -0400
Received: by pbbrp8 with SMTP id rp8so11111343pbb.19
	for ; Mon, 30 Jul 2012 22:55:43 -0700 (PDT)
In-Reply-To: <60E83269D669544E8069A09CB69135EA011444@GDC-CLDMBX-P02.whq.wistron>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Eric_YH_Chen@wiwynn.com
Cc: ceph-devel@vger.kernel.org, Chris_YT_Huang@wiwynn.com, Victor_CY_Chang@wiwynn.com

On 07/30/2012 07:46 PM, Eric_YH_Chen@wiwynn.com wrote:
> Hi, all:
>
> I am testing the high availability of ceph.
>
> Environment: two servers, with 12 hard disks on each server.
> Version: Ceph 0.48
> Kernel: 3.2.0-27
>
> We created a ceph cluster with 24 osds.
> osd.0 ~ osd.11 are on server1
> osd.12 ~ osd.23 are on server2
>
> The crush rule is the default rule.
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 1536 pgp_num 1536 last_change 1172 owner 0
>
> Test case 1:
> 1. Create an rbd device and read/write to it
> 2. Randomly turn off one osd on server1 (service ceph stop osd.0)
> 3. Check the read/write of the rbd device
>
> Test case 2:
> 1. Create an rbd device and read/write to it
> 2. Randomly turn off one osd on server1 (service ceph stop osd.0)
> 3. Randomly turn off one osd on server2 (service ceph stop osd.12)
> 4. Check the read/write of the rbd device
>
> In test case 1, we can access the rbd device as normal. But in test case 2, it hangs with no response.
> Is this the expected behavior?
>
> I imagined that we could turn off any two osds when we set the replication to 2,
> because apart from the master data, we would have two other copies on two different osds.
> Even when we turn off two osds, we could find the data on a third osd.
> Any misunderstanding? Thanks!

rep size is the total number of copies, not the number of extra copies,
so with rep size 2 there is no third osd to fall back on. Stopping two
osds with rep size 2 may therefore cause you to lose access to some
objects: with the default rule the two copies of each PG sit on
different hosts, so any PG that was mapped to both osd.0 and osd.12 has
no running copy left, and I/O to objects in it blocks until one of
those osds is back.

Josh
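
A rough way to see why test case 1 keeps working while test case 2 can hang is to count the PGs that have no copy left after the osds are stopped. The sketch below is only a toy model, not Ceph code: it assumes a uniform random placement in place of CRUSH, and the names place_pgs and unavailable_pgs are made up for illustration. The host layout, pg_num and rep size are taken from the setup quoted above.

```python
import random

# Toy model only: random choice stands in for CRUSH, which is NOT how
# Ceph actually maps placement groups to osds.
HOSTS = {
    "server1": list(range(0, 12)),   # osd.0 .. osd.11
    "server2": list(range(12, 24)),  # osd.12 .. osd.23
}
PG_NUM = 1536
REP_SIZE = 2  # total number of copies, not "primary plus 2 extra"


def place_pgs(seed=0):
    """Give each PG REP_SIZE osds, each one on a different host."""
    rng = random.Random(seed)
    placement = {}
    for pg in range(PG_NUM):
        hosts = rng.sample(sorted(HOSTS), REP_SIZE)
        placement[pg] = [rng.choice(HOSTS[h]) for h in hosts]
    return placement


def unavailable_pgs(placement, stopped):
    """Return the PGs whose every replica sits on a stopped osd."""
    stopped = set(stopped)
    return [pg for pg, osds in placement.items() if set(osds) <= stopped]


placement = place_pgs()
case1 = unavailable_pgs(placement, [0])       # test case 1
case2 = unavailable_pgs(placement, [0, 12])   # test case 2
print("test case 1, PGs with no copy left:", len(case1))
print("test case 2, PGs with no copy left:", len(case2))
```

In this model test case 1 always reports zero, because every PG keeps its second copy on the other server. Test case 2 reports the PGs that happened to be mapped to both osd.0 and osd.12; under the uniform assumption that is on the order of 1536 / (12 * 12), or about 11 PGs, and any object in one of them is unreachable until one of the two osds is back.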