From mboxrd@z Thu Jan 1 00:00:00 1970
From: Svein-Erik Lund
Subject: Feature request regarding size and min_size on pools
Date: Tue, 10 Sep 2013 16:21:27 +0400 (MSD)
Message-ID: <750758449.922.1378815687370.JavaMail.root@mail>
References: <1712504592.851.1378813557601.JavaMail.root@mail>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Return-path:
Received: from smtp-gw11.han.skanova.net ([81.236.55.20]:35012 "EHLO smtp-gw11.han.skanova.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751374Ab3IJM1m (ORCPT ); Tue, 10 Sep 2013 08:27:42 -0400
Received: from mail.home.org (90.231.29.72) by smtp-gw11.han.skanova.net (8.5.133) id 516D05D203249616 for ceph-devel@vger.kernel.org; Tue, 10 Sep 2013 14:21:29 +0200
Received: from mail.home.org (mail.home.org [192.168.9.3]) by mail.home.org (Postfix) with ESMTP id A0137BBD21 for ; Tue, 10 Sep 2013 16:21:27 +0400 (MSK)
In-Reply-To: <1712504592.851.1378813557601.JavaMail.root@mail>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: ceph-devel@vger.kernel.org

Hello,

We are implementing Ceph as the storage backend for some systems. Unfortunately, we have to use a POSIX filesystem for storing the data. To accomplish this we have implemented a solution quite similar to the one Sebastien Han describes on his blog:
http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/

Now to our problem. We want to be sure that a write is replicated before we get an ack. Therefore we have set size to 2 and min_size to 2 on the pool, as we have seen that a sudden removal of one OSD can lead to data loss with min_size set to 1. The problem now is that if one OSD goes down, some PGs end up incomplete, and no I/O operations are allowed to the RBD.

This problem could be solved in a couple of ways:

1) An option could be added so that a write is always made to as many replicas as size specifies before the write is acknowledged.
2) When a PG ends up in an incomplete state, Ceph could try to resolve the situation by recovering the PGs in question.

For us, adding a third replica isn't a feasible solution: 1) we have our data in two locations, and 2) the cost would be too high.
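
To make the trade-off concrete, here is a rough sketch of the behaviour described above. It is a toy model only, not Ceph's actual peering or recovery logic: the names PoolPolicy, pg_accepts_io and copies_at_ack are made up for illustration, and the only rule it assumes is that a PG serves I/O while at least min_size replicas are up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPolicy:
    """Hypothetical stand-in for a pool's replication settings."""
    size: int      # desired number of replicas
    min_size: int  # replicas that must be up for the PG to serve I/O


def pg_accepts_io(policy: PoolPolicy, replicas_up: int) -> bool:
    # Simplified rule: I/O is allowed only with at least min_size replicas up.
    return replicas_up >= policy.min_size


def copies_at_ack(policy: PoolPolicy, replicas_up: int) -> int:
    # Number of copies holding a write when it is acknowledged.
    # 0 means the write is not acknowledged at all (I/O blocks).
    if not pg_accepts_io(policy, replicas_up):
        return 0
    return min(replicas_up, policy.size)


strict = PoolPolicy(size=2, min_size=2)   # the setup described in this mail
relaxed = PoolPolicy(size=2, min_size=1)  # the setup we moved away from

for name, policy in (("min_size=2", strict), ("min_size=1", relaxed)):
    for up in (2, 1):
        print(name, "replicas up:", up,
              "copies at ack:", copies_at_ack(policy, up))
```

With one replica down, the min_size=2 policy yields no acknowledged writes (I/O blocks), while min_size=1 acknowledges writes that exist in only one copy. That is the choice between unavailability and possible data loss that the two proposals above are meant to avoid.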