From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: Initial newstore vs filestore results
Date: Fri, 10 Apr 2015 18:58:42 -0500
Message-ID: <552863B2.8080700@redhat.com>
References: <5523F069.3000400@redhat.com> <55242D15.8080800@redhat.com> <55248856.1010808@redhat.com> <5525EFCC.3070607@redhat.com> <5526B044.2090002@redhat.com> <5527F204.3090108@redhat.com> <55282785.8040008@redhat.com> <55282CDB.9090608@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:50966 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753337AbbDJX6q (ORCPT ); Fri, 10 Apr 2015 19:58:46 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: "Duan, Jiangang" , Sage Weil
Cc: Ning Yao , ceph-devel

I have some test results with universal compaction that we gathered with
joao's modbstore benchmark a while back:

http://www.spinics.net/lists/ceph-devel/msg19685.html

More specifically, this pdf has data for universal compaction:

http://nhm.ceph.com/mon-store-stress/Monitor_Store_Stress_Medium_Tests.pdf

Mark

On 04/10/2015 06:44 PM, Duan, Jiangang wrote:
> You can try Universal Compaction
> https://github.com/facebook/rocksdb/wiki/Universal-Compaction
>
>
>
> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Saturday, April 11, 2015 7:24 AM
> To: Mark Nelson
> Cc: Ning Yao; Duan, Jiangang; ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> On Fri, 10 Apr 2015, Mark Nelson wrote:
>> Notice for instance a comparison of random 512k writes between
>> filestore, newstore with no overlay, and newstore with 8m overlay:
>>
>> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
>> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
>> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png
>>
>> The client rbd throughput as reported by fio is:
>>
>> filestore: 20.44MB/s
>> newstore+no_overlay: 4.35MB/s
>> newstore+8m_overlay: 3.86MB/s
>>
>> But notice that in the graphs, we see very different behaviors on disk.
>>
>> Filestore does a lot of reads and writes to a couple of specific
>> portions of the device and has peaks/valleys when data gets written
>> out in bulk. I would have expected to see more sequential looking
>> writes during the peaks due to journal writes and no reads to that
>> portion of the disk, but it seems murkier to me than that.
>>
>> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>>
>> newstore+no_overlay does kind of a flurry of random IO and looks like
>> it's somewhat seek bound. It's very consistent but actual write
>> performance is low compared to what blktrace reports as the data
>> hitting the disk. Something is happening toward the beginning of the
>> drive too.
>>
>> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> Yeah, looks like a bunch of write amplification... the disk bw used is
> really high. I think we need to look at what rocksdb is doing here. A
> couple things:
>
> - Make the log bigger, if we can, so that short-lived WAL keys don't get
>   amplified. We'd rather eat memory than rewrite them in an sst since the
>   number of them in flight is pretty well bounded.
>
> - The rocksdb log as it stands isn't ever going to perform as well as the
>   FileJournal currently does. The FileJournal uses a fixed-size device or
>   file that's preallocated with no 'size' associated with it, so that when
>   there is a write we only have to push down the data blocks (one seek),
>   and on replay can identify valid records with a seq # and checksum.
>   Rocksdb's log is a .log file that grows and gets fsync(2)'d, which means
>   that the data blocks have to hit the disk *and* the inode (size) needs
>   to get updated for the commit to happen.
>   We could improve this by doing a fallocate and turning it into a
>   circular buffer. I'm not sure XFS will let us fallocate a fresh file of
>   0's though and avoid a second seek, because it'll still need to flip the
>   extent bits when the data blocks are written... or we prefill the file
>   with 0's before using it. :/
>
> sage
>
>
>>
>> newstore+8m overlay is interesting. Lots of data gets written out to
>> the disk in seemingly large chunks but the actual throughput as
>> reported by the client is very slow. I assume there's tons of write
>> amplification happening as rocksdb moves the 512k objects around into
>> different levels.
>>
>> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>>
>> Mark
>>
>> On 04/10/2015 02:41 PM, Mark Nelson wrote:
>>> Seekwatcher movies and graphs finally finished generating for all of
>>> the tests:
>>>
>>> http://nhm.ceph.com/newstore/20150409/
>>>
>>> Mark
>>>
>>> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>>>> Test results attached for different overlay settings at various IO
>>>> sizes for writes and random writes. Basically it looks like as we
>>>> increase the overlay size it changes the curve. So far we're
>>>> still not doing as well as the filestore (co-located journal) though.
>>>>
>>>> I imagine the WAL probably does play a big part here.
>>>>
>>>> Mark
>>>>
>>>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>>>> The KV store introduces too much write amplification; we may
>>>>>> need a self-implemented WAL?
>>>>>
>>>>> What we really want is to hint to the kv store that these keys
>>>>> (or this key range) are short-lived and should never get
>>>>> compacted. And/or, we need to just make sure the wal is
>>>>> sufficiently large so that in practice that never happens to
>>>>> those keys.
>>>>>
>>>>> Putting them outside the kv store means an additional seek/sync
>>>>> for disks, which defeats most of the purpose. Maybe it makes
>>>>> sense for flash...
>>>>> but the above avoids the problem in either case.
>>>>>
>>>>> I think we should target rocksdb for our initial tuning
>>>>> attempts. So far all I've done is played a bit with the file
>>>>> size (1mb -> 4mb -> 8mb) but my ad hoc tests didn't see much
>>>>> difference.
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>
>>>>>> Regards
>>>>>> Ning Yao
>>>>>>
>>>>>>
>>>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang :
>>>>>>> IMHO, the newstore performance depends so much on KV store
>>>>>>> performance due to the WAL - so picking the right KV store or
>>>>>>> tuning it will be the 1st step to do.
>>>>>>>
>>>>>>> -jiangang
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
>>>>>>> Nelson
>>>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>>>> To: Sage Weil
>>>>>>> Cc: ceph-devel
>>>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>>>
>>>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>>>> What would be very interesting would be to see the 4KB
>>>>>>>>> performance with the defaults (newstore overlay max =
>>>>>>>>> 32) vs overlays disabled (newstore overlay max = 0) and
>>>>>>>>> see if/how much it is helping.
>>>>>>>>
>>>>>>>> And here we go. 1 OSD, 1X replication. 16GB RBD volume.
>>>>>>>>
>>>>>>>> 4MB               write    read     randw    randr
>>>>>>>> default overlay   36.13    106.61   34.49    92.69
>>>>>>>> no overlay        36.29    105.61   34.49    93.55
>>>>>>>>
>>>>>>>> 128KB             write    read     randw    randr
>>>>>>>> default overlay   1.71     97.90    1.65     25.79
>>>>>>>> no overlay        1.72     97.80    1.66     25.78
>>>>>>>>
>>>>>>>> 4KB               write    read     randw    randr
>>>>>>>> default overlay   0.40     61.88    1.29     1.11
>>>>>>>> no overlay        0.05     61.26    0.05     1.10
>>>>>>>>
>>>>>>>
>>>>>>> Update this morning. Also ran filestore tests for
>>>>>>> comparison. Next we'll look at how tweaking the overlay for
>>>>>>> different IO sizes affects things.
>>>>>>> I.e., the overlay threshold is 64k right now and it appears
>>>>>>> that 128K write IOs for instance are quite a bit worse with
>>>>>>> newstore currently than with filestore. Sage also just
>>>>>>> committed changes that will allow overlay writes during
>>>>>>> append/create, which may help improve small IO write
>>>>>>> performance as well in some cases.
>>>>>>>
>>>>>>> 4MB               write    read     randw    randr
>>>>>>> default overlay   36.13    106.61   34.49    92.69
>>>>>>> no overlay        36.29    105.61   34.49    93.55
>>>>>>> filestore         36.17    84.59    34.11    79.85
>>>>>>>
>>>>>>> 128KB             write    read     randw    randr
>>>>>>> default overlay   1.71     97.90    1.65     25.79
>>>>>>> no overlay        1.72     97.80    1.66     25.78
>>>>>>> filestore         27.15    79.91    8.77     19.00
>>>>>>>
>>>>>>> 4KB               write    read     randw    randr
>>>>>>> default overlay   0.40     61.88    1.29     1.11
>>>>>>> no overlay        0.05     61.26    0.05     1.10
>>>>>>> filestore         4.14     56.30    0.42     0.76
>>>>>>>
>>>>>>> Seekwatcher movies and graphs available here:
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>>>
>>>>>>> Note for instance the very interesting blktrace patterns for
>>>>>>> 4K random writes on the OSD in each case:
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>>>>>>>
>>>>>>>
>>>>>>> Mark
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>> ceph-devel" in the body of a message to
>>>>>>> majordomo@vger.kernel.org More majordomo info at
>>>>>>> http://vger.kernel.org/majordomo-info.html
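
For reference, a minimal sketch of the journal layout Sage describes in the quoted text: a fixed-size file that is zero-filled once up front, so a commit only overwrites already-allocated blocks and the file size never changes, with replay accepting records only while the seq # and checksum check out. The record header, the names `PreallocJournal` and `replay`, and the sizes are assumptions for illustration, not Ceph's FileJournal format. Wrap-around (the circular part) is left out to keep it short.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical record header: sequence number, payload length, CRC32 of payload.
HEADER = struct.Struct("<QII")


class PreallocJournal:
    """Toy fixed-size journal (illustrative only, no wrap-around)."""

    def __init__(self, path, size):
        self.size = size
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
        # Prefill with zeros once; later appends overwrite these blocks in
        # place, so the file size stays constant.
        os.pwrite(self.fd, b"\0" * size, 0)
        os.fsync(self.fd)
        self.pos = 0
        self.seq = 1  # seq 0 is reserved to mean "never written"

    def append(self, payload):
        record = HEADER.pack(self.seq, len(payload), zlib.crc32(payload)) + payload
        if self.pos + len(record) > self.size:
            raise IOError("journal full (wrap-around omitted in this sketch)")
        os.pwrite(self.fd, record, self.pos)
        os.fsync(self.fd)  # commit point
        self.pos += len(record)
        self.seq += 1
        return self.seq - 1

    def close(self):
        os.close(self.fd)


def replay(path, size):
    """Return (seq, payload) pairs, stopping at the first invalid record."""
    with open(path, "rb") as f:
        data = f.read(size)
    records, pos, expect = [], 0, None
    while pos + HEADER.size <= len(data):
        seq, length, crc = HEADER.unpack_from(data, pos)
        body = data[pos + HEADER.size:pos + HEADER.size + length]
        if seq == 0 or len(body) != length or zlib.crc32(body) != crc:
            break  # zero-filled space, torn write, or corruption
        if expect is not None and seq != expect:
            break  # stale record left over from an earlier use
        records.append((seq, body))
        expect = seq + 1
        pos += HEADER.size + length
    return records


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "journal")
    j = PreallocJournal(path, 4096)
    j.append(b"txn one")
    j.append(b"txn two")
    j.close()
    print(os.path.getsize(path), replay(path, 4096))
```

The replay side needs no record count or file length: it walks forward from the start and stops at the first record that fails validation, which is what lets the writer skip any size update on commit.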