From: Mark Nelson
Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance
Date: Wed, 17 Sep 2014 20:23:28 -0500
Message-ID: <541A3410.6040504@redhat.com>
References: <5419A232.8080508@redhat.com> <6F3FA899187F0043BA1827A69DA2F7CC020BD19A@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <6F3FA899187F0043BA1827A69DA2F7CC020BD19A@shsmsx102.ccr.corp.intel.com>
To: "Chen, Xiaoxi", Alexandre DERUMIER, Mark Nelson
Cc: Somnath Roy, ??, "ceph-devel@vger.kernel.org"

On 09/17/2014 08:05 PM, Chen, Xiaoxi wrote:
>> When benching the crucial m550, I only see, from time to time (maybe
>> every 30s, I don't remember exactly), IOs slowing down to 200 for 1 or
>> 2 seconds, then going back up to the normal speed of around 4000 iops
>
> Wow, that indicates the m550 is busy with garbage collection. Maybe
> just try to overprovision a bit (say, if you have a 400G SSD, only
> partition ~300G); overprovisioning an SSD generally helps both
> performance and durability. Actually, if you look at the spec
> differences between the Intel S3500 and S3700, the root cause is a
> different overprovisioning ratio :)

Hrm, I thought the S3700 uses MLC-HET cells while the S3500 uses regular
MLC?
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER
> Sent: Thursday, September 18, 2014 5:13 AM
> To: Mark Nelson
> Cc: Somnath Roy; ??; ceph-devel@vger.kernel.org; Chen, Xiaoxi
> Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance
>
>>> FWIW, the journal will coalesce writes quickly when there are many
>>> concurrent 4k client writes. Once you hit around 8 4k IOs per OSD,
>>> the journal will start coalescing. For say 100-150 IOPs (what a
>>> spinning disk can handle), expect around 9ish 100KB journal writes
>>> (with padding and header/footer for each client IO). What we've seen
>>> is that some drives that aren't that great at 4K O_DSYNC writes are
>>> still reasonably good with 8+ concurrent larger O_DSYNC writes.
>
> Yes, indeed, it's not that bad. (hopefully ;)
>
> When benching the crucial m550, I only see, from time to time (maybe
> every 30s, I don't remember exactly), IOs slowing down to 200 for 1 or
> 2 seconds, then going back up to the normal speed of around 4000 iops.
>
> With the intel s3500, I have constant writes at 5000 iops.
>
> BTW, does the rbd client cache also help coalesce writes (client
> side), and thereby also help the journal?
>
> ----- Original Message -----
>
> From: "Mark Nelson"
> To: "Alexandre DERUMIER", "Xiaoxi Chen"
> Cc: "Somnath Roy", "??", ceph-devel@vger.kernel.org
> Sent: Wednesday, September 17, 2014 17:01:06
> Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance
>
> On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote:
>>>> 2. Have you got any data to prove that O_DSYNC or fdatasync kills
>>>> the performance of the journal?
>>>> In our previous test, the journal SSD (using a partition of an SSD
>>>> as the journal for a particular OSD, with 4 OSDs sharing the same
>>>> SSD) could reach its peak performance (300-400MB/s)
>>
>> Hi,
>>
>> I have done some benchmarks here:
>>
>> http://www.mail-archive.com/ceph-users@lists.ceph.com/msg12950.html
>>
>> Some SSD models have really bad performance with O_DSYNC (crucial
>> m550 - 312 iops on 4k blocks).
>> Benching 1 OSD, I can see big latencies for some seconds when O_DSYNC
>> occurs
>
> FWIW, the journal will coalesce writes quickly when there are many
> concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the
> journal will start coalescing. For say 100-150 IOPs (what a spinning
> disk can handle), expect around 9ish 100KB journal writes (with padding
> and header/footer for each client IO). What we've seen is that some
> drives that aren't that great at 4K O_DSYNC writes are still reasonably
> good with 8+ concurrent larger O_DSYNC writes.
>
>> crucial m550
>> ------------
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=1249.9KB/s, iops=312
>>
>> intel s3500
>> -----------
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=41794KB/s, iops=10448
>>
>> ----- Original Message -----
>>
>> From: "Xiaoxi Chen"
>> To: "Somnath Roy", "??", ceph-devel@vger.kernel.org
>> Sent: Wednesday, September 17, 2014 09:59:37
>> Subject: RE: puzzled with the design pattern of ceph journal, really ruining performance
>>
>> Hi Nicheal,
>>
>> 1. The main purpose of the journal is to provide transaction
>> semantics (prevent partial updates). Peers are not enough for this
>> need because ceph writes all replicas at the same time, so after a
>> crash you have no idea which replica has the right data.
>> For example, say we have 2 replicas and a user updates a 4M object:
>> the primary OSD crashes when the first 2M has been written, and the
>> secondary OSD may also fail when the first 3MB has been written. Then
>> the versions on the primary and the secondary are neither the new
>> value nor the old value, and there is no way to recover. So, sharing
>> the same idea as databases, we need a journal to support transactions
>> and prevent this from happening. For backends that support
>> transactions, BTRFS for instance, we don't strictly need a journal: we
>> can write the journal and the data disk at the same time, and the
>> journal here just tries to help performance, since it only does
>> sequential writes and we expect it to be faster than the backend OSD.
>>
>> 2. Have you got any data to prove that O_DSYNC or fdatasync kills the
>> performance of the journal? In our previous test, the journal SSD
>> (using a partition of an SSD as the journal for a particular OSD, with
>> 4 OSDs sharing the same SSD) could reach its peak performance
>> (300-400MB/s)
>>
>> Xiaoxi
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Wednesday, September 17, 2014 3:30 PM
>> To: 姚宁; ceph-devel@vger.kernel.org
>> Subject: RE: puzzled with the design pattern of ceph journal, really ruining performance
>>
>> Hi Nicheal,
>> Not only recovery; IMHO the main purpose of the ceph journal is to
>> support transaction semantics, since XFS doesn't have that. I guess it
>> can't be achieved with pg_log/pg_info.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of ??
>> Sent: Tuesday, September 16, 2014 11:29 PM
>> To: ceph-devel@vger.kernel.org
>> Subject: puzzled with the design pattern of ceph journal, really ruining performance
>>
>> Hi, guys
>>
>> I have been analyzing the architecture of the ceph source code.
>>
>> I know that, in order to keep the journal atomic and consistent, the
>> journal must be opened with O_DSYNC, or the fdatasync() system call
>> must be called after every write operation. However, this kind of
>> operation really kills performance and results in high commit latency,
>> even if an SSD is used as the journal disk. If the SSD has a capacitor
>> to keep the data safe when the system crashes, we can set the mount
>> option nobarrier, or the SSD itself will ignore the FLUSH request, so
>> the performance would be better.
>>
>> So can it be replaced by other strategies?
>> As far as I am concerned, the most important parts are pg_log and
>> pg_info. They guide a crashed OSD in recovering its objects from its
>> peers. Therefore, if we can keep pg_log at a consistent point, we can
>> recover data without the journal. So can we just use an "undo"
>> strategy on pg_log and drop the ceph journal? It would save lots of
>> bandwidth, and based on the consistent pg_log epoch, we can always
>> recover data from the peering OSDs, right? But this would lead to
>> recovering more objects if the OSD crashes.
>>
>> Nicheal
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
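
For anyone who wants a feel for the 4k synchronous-write numbers quoted in this thread without pointing fio at a raw device, here is a rough Python sketch. It is not from the thread and is not equivalent to the fio runs: it writes to a scratch file through the filesystem, from a single thread, without O_DIRECT, so its results are not comparable with the /dev/sdb figures. The function names iops_to_kb_per_s and sync_write_iops are made up for illustration. The first function only checks that the quoted bw and iops figures agree with each other at a 4k block size.

```python
import os
import tempfile
import time

BLOCK_SIZE = 4096  # 4k blocks, matching --bs=4k in the quoted fio runs


def iops_to_kb_per_s(iops, block_size=BLOCK_SIZE):
    """Bandwidth in KB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_size / 1024


def sync_write_iops(path, count=200, block_size=BLOCK_SIZE):
    """Time `count` sequential synchronous writes to `path` and return IOPS.

    Hypothetical helper: a scratch-file approximation, not a substitute
    for benchmarking the real journal device.
    """
    # With O_DSYNC each write() returns only after the data has reached
    # stable storage. Fall back to O_SYNC (or no flag) where it is missing.
    sync_flag = getattr(os, "O_DSYNC", None) or getattr(os, "O_SYNC", 0)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | sync_flag, 0o600)
    block = b"\0" * block_size
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / max(elapsed, 1e-9)


if __name__ == "__main__":
    # 312 iops * 4 KB = 1248.0 KB/s, close to the 1249.9KB/s fio reported.
    print(iops_to_kb_per_s(312))
    # 10448 iops * 4 KB = 41792.0 KB/s, close to the reported 41794KB/s.
    print(iops_to_kb_per_s(10448))

    with tempfile.TemporaryDirectory() as scratch_dir:
        scratch = os.path.join(scratch_dir, "journal-test.bin")
        print("sync write iops:", round(sync_write_iops(scratch)))
```

Run it on a filesystem that lives on the SSD in question. A scratch directory on tmpfs never touches the device and will give meaninglessly high numbers.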