From: Mark Nelson
Subject: Re: newstore performance update
Date: Wed, 29 Apr 2015 08:20:18 -0500
Message-ID: <5540DA92.3070505@redhat.com>
References: <554016E2.3000104@redhat.com> <6F3FA899187F0043BA1827A69DA2F7CC021E4894@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <6F3FA899187F0043BA1827A69DA2F7CC021E4894@shsmsx102.ccr.corp.intel.com>
Sender: ceph-devel-owner@vger.kernel.org
To: "Chen, Xiaoxi"
Cc: "ceph-devel@vger.kernel.org"

On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> Hi Mark,
> Really good test :) I only played a bit on SSD; the parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case.
> I tried this https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515 by hacking rocksdb, but the performance difference is negligible.
>
> The rocksdb digest speed should be the problem, I believe. I had planned to prove this by skipping all db transactions, but failed after hitting another deadlock bug in newstore.

I think Sage has worked through all of the deadlock bugs I was seeing,
short of possibly something going on with the overlay code. That
probably shouldn't matter on SSD though, as it's probably best to leave
overlay off.

>
> Below are a few more comments.
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance. Specifically we've been focused on write
>> performance as newstore was lagging filestore by quite a bit previously.
>> A lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
> SSD DB is still better than SSD WAL with request size > 128KB, which indicates some WALs are actually written to Level0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with high WA cost and latency.

Seems like it could work, but I wish we didn't have to add a workaround.
It'd be nice if we could just tell rocksdb not to propagate that data.
I don't remember, can we use column families for this?

>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
> I think sequential writes being slower than random is by design in Newstore, because for every object we can only have one WAL, which means no concurrent IO if req_size * QD < 4MB. Not sure what QD you have in the test? I suspect 64, since there is a boost in seq write performance with req size > 64 (64KB * 64 = 4MB).

You nailed it, 64.

>
> In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to FS -> sync. We do everything synchronously, which is essentially expensive.

Will you be on the performance call this morning? Perhaps we can talk
about it more there?

>
> Xiaoxi.
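
A rough sketch of the arithmetic behind the 4MB observation above. It assumes 4 MB objects, at most one in-flight WAL per object (as Xiaoxi describes), and a window of QD sequential requests that starts on an object boundary. The names objects_in_flight and OBJECT_SIZE are made up for illustration and are not newstore code.

```python
# Assumption: 4 MB objects, taken from the "req_size * QD < 4MB" remark.
OBJECT_SIZE = 4 * 1024 * 1024


def objects_in_flight(req_size, queue_depth, object_size=OBJECT_SIZE):
    """Number of distinct objects covered by queue_depth back-to-back
    sequential requests of req_size bytes, starting on an object boundary.

    If only one WAL can be in flight per object (assumption from the
    thread), this is also an upper bound on concurrent writes.
    """
    window = req_size * queue_depth
    # Integer ceiling division; at least one object is always touched.
    return max(1, -(-window // object_size))


if __name__ == "__main__":
    for kb in (4, 32, 64, 128, 256):
        print(kb, "KB ->", objects_in_flight(kb * 1024, 64), "object(s)")
```

With QD=64, every request size up to 64KB keeps the whole window inside one object, so under the stated assumption the writes serialize; at 128KB the window spans two objects and some concurrency becomes possible.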
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, April 29, 2015 7:25 AM
>> To: ceph-devel
>> Subject: newstore performance update
>>
>> Hi Guys,
>>
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance. Specifically we've been focused on write
>> performance as newstore was lagging filestore by quite a bit previously. A
>> lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
>> In this situation newstore does better with random writes and sometimes
>> beats filestore (such as in the everything-on-spinning disk tests, and when IO
>> sizes are small in the everything-on-ssd tests).
>>
>> Newstore is changing daily so keep in mind that these results are almost
>> assuredly going to change. An interesting area of investigation will be why
>> sequential writes are slower than random writes, and whether or not we are
>> being limited by rocksdb ingest speed and how.
>
>>
>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
>> sequential write test to see if rocksdb was starving one of the cores, but
>> found something that looks quite a bit different:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html