From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance
Date: Wed, 17 Sep 2014 10:01:06 -0500
Message-ID: <5419A232.8080508@redhat.com>
References: <872acc52-387a-4ca1-bd43-a3825a2746cc@mailpro>
In-Reply-To: <872acc52-387a-4ca1-bd43-a3825a2746cc@mailpro>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Sender: ceph-devel-owner@vger.kernel.org
To: Alexandre DERUMIER, Xiaoxi Chen
Cc: Somnath Roy, ??, ceph-devel@vger.kernel.org

On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote:
>>> 2. Do you have any data showing that O_DSYNC or fdatasync kills
>>> journal performance? In our previous test, the journal SSD (a
>>> partition of an SSD used as the journal for a particular OSD, with
>>> 4 OSDs sharing the same SSD) could reach its peak performance
>>> (300-400MB/s).
>
> Hi,
>
> I have done some benchmarks here:
>
> http://www.mail-archive.com/ceph-users@lists.ceph.com/msg12950.html
>
> Some SSD models have really bad performance with O_DSYNC (Crucial M550:
> 312 IOPS on 4k blocks).
> Benchmarking 1 OSD, I can see big latencies for several seconds when
> O_DSYNC occurs.

FWIW, the journal will coalesce writes quickly when there are many
concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the
journal will start coalescing. For, say, 100-150 IOPS (what a spinning
disk can handle), expect around 9ish 100KB journal writes (with padding
and header/footer for each client IO).
What we've seen is that some drives that aren't that great at 4K
O_DSYNC writes are still reasonably good with 8+ concurrent larger
O_DSYNC writes.

>
> crucial m550
> ------------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> --group_reporting --invalidate=0 --name=ab --sync=1
> bw=1249.9KB/s, iops=312
>
> intel s3500
> -----------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
> --group_reporting --invalidate=0 --name=ab --sync=1
> bw=41794KB/s, iops=10448
>
> ----- Original message -----
>
> From: "Xiaoxi Chen"
> To: "Somnath Roy", "??", ceph-devel@vger.kernel.org
> Sent: Wednesday, 17 September 2014 09:59:37
> Subject: RE: puzzled with the design pattern of ceph journal, really
> ruining performance
>
> Hi Nicheal,
>
> 1. The main purpose of the journal is to provide transaction semantics
> (preventing partial updates). Peers alone are not enough for this,
> because Ceph writes all replicas at the same time, so after a crash you
> have no idea which replica has the right data. For example, say we have
> 2 replicas and a user updates a 4M object: the primary OSD crashes when
> the first 2M has been written, and the secondary OSD may also fail when
> the first 3MB has been written. Then the versions on the primary and
> the secondary are neither the new value nor the old value, and there is
> no way to recover. So, sharing the same idea as a database, we need a
> journal to support transactions and prevent this from happening. For
> backends that support transactions, BTRFS for instance, we don't need a
> journal; we can write the journal and the data disk at the same time.
> The journal there just tries to help performance, since it only does
> sequential writes and we suspect it should be faster than the backend
> OSD.
>
> 2. Do you have any data showing that O_DSYNC or fdatasync kills journal
> performance?
> In our previous test, the journal SSD (a partition of an SSD used as
> the journal for a particular OSD, with 4 OSDs sharing the same SSD)
> could reach its peak performance (300-400MB/s).
>
> Xiaoxi
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Wednesday, September 17, 2014 3:30 PM
> To: 姚宁; ceph-devel@vger.kernel.org
> Subject: RE: puzzled with the design pattern of ceph journal, really
> ruining performance
>
> Hi Nicheal,
> Not only recovery; IMHO the main purpose of the Ceph journal is to
> support transaction semantics, since XFS doesn't have that. I guess it
> can't be achieved with pg_log/pg_info.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of ??
> Sent: Tuesday, September 16, 2014 11:29 PM
> To: ceph-devel@vger.kernel.org
> Subject: puzzled with the design pattern of ceph journal, really
> ruining performance
>
> Hi, guys
>
> I have been analyzing the architecture of the Ceph source code.
>
> I know that, in order to keep the journal atomic and consistent, the
> journal must either be opened with O_DSYNC or have fdatasync() called
> after every write operation. However, this kind of operation really
> kills performance and causes high commit latency, even if an SSD is
> used as the journal disk. If the SSD has a capacitor to keep the data
> safe when the system crashes, we can set the mount option nobarrier, or
> the SSD itself will ignore the FLUSH request, so performance would be
> better.
>
> So can it be replaced by other strategies?
> As far as I am concerned, the most important part is pg_log and
> pg_info. They guide a crashed OSD in recovering its objects from its
> peers. Therefore, if we can keep pg_log at a consistent point, we can
> recover data without the journal.
> So can we just use an "undo" strategy on pg_log and do without the
> Ceph journal? It would save a lot of bandwidth, and based on the
> consistent pg_log epoch we can always recover data from the peer OSDs,
> right? But this would mean recovering more objects if the OSD crashes.
>
> Nicheal
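
For anyone who wants a rough feel for the O_DSYNC cost discussed in this thread without running fio, here is a minimal Python sketch. It is not what the fio runs above measure: it writes to a scratch file on a filesystem rather than to a raw device, it does not use O_DIRECT, and it is single-threaded, so the numbers are not comparable to the quoted bw/iops figures. The function names (`timed_writes`, `compare`) are made up for illustration, and the scratch directory must be on a real disk (not tmpfs) for the sync flag to mean anything.

```python
import os
import tempfile
import time

BLOCK_SIZE = 4096  # 4k blocks, matching --bs=4k in the fio runs quoted above


def timed_writes(path, count, extra_flags=0):
    """Write `count` 4k blocks sequentially to `path`.

    Returns (bytes_written, writes_per_second). Pass os.O_DSYNC in
    `extra_flags` so each write returns only after the data is durable.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o600)
    block = b"\0" * BLOCK_SIZE
    written = 0
    start = time.perf_counter()
    try:
        for _ in range(count):
            written += os.write(fd, block)
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return written, rate


def compare(directory, count=256):
    """Time buffered vs. O_DSYNC writes on scratch files in `directory`."""
    # Assumption: fall back to O_SYNC on platforms that lack O_DSYNC.
    dsync = getattr(os, "O_DSYNC", os.O_SYNC)
    results = {}
    for name, extra in (("buffered", 0), ("dsync", dsync)):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            _, results[name] = timed_writes(path, count, extra)
        finally:
            os.unlink(path)
    return results


if __name__ == "__main__":
    for mode, rate in compare(tempfile.gettempdir()).items():
        print(f"{mode}: {rate:.0f} writes/s")
```

Pointing `compare()` at a directory on the journal SSD gives a quick sanity check of how far synchronous 4k writes fall below buffered ones on that drive; use fio against the device for figures you intend to compare with the ones in this thread.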