Re: newstore performance update

CEPH filesystem development

From: Mark Nelson <mnelson@redhat.com>
To: "Chen, Xiaoxi" <xiaoxi.chen@intel.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: newstore performance update
Date: Wed, 29 Apr 2015 08:20:18 -0500
Message-ID: <5540DA92.3070505@redhat.com>
In-Reply-To: <6F3FA899187F0043BA1827A69DA2F7CC021E4894@shsmsx102.ccr.corp.intel.com>



On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> Hi Mark,
> Really good test :) I have only played with it a bit on SSD. The parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case.
> I tried this https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515 by hacking rocksdb, but the performance difference is negligible.
>
> I believe rocksdb ingest speed is the problem. I had planned to prove this by skipping all DB transactions, but failed after hitting another deadlock bug in newstore.

I think Sage has worked through all of the deadlock bugs I was seeing
short of possibly something going on with the overlay code.  That 
probably shouldn't matter on SSD though as it's probably best to leave 
overlay off.

>
> Below are a bit more comments.
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance.  Specifically we've been focused on write
>> performance as newstore was lagging filestore by quite a bit previously.  A
>> lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
> SSD DB is still better than SSD WAL with request sizes > 128KB, which indicates that some WAL entries are actually being written to Level 0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with high WA (write amplification) cost and latency.

Seems like it could work, but I wish we didn't have to add a workaround.
It'd be nice if we could just tell rocksdb not to propagate that data.
I don't remember, can we use column families for this?
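
The cap Xiaoxi proposes can be pictured as a simple admission throttle. What follows is only a minimal sketch under stated assumptions, not newstore code: the class name WalThrottle and its methods are hypothetical, and only the option names newstore_wal_max_ops/bytes come from the thread. A writer blocks until the WAL backlog (entries written but not yet applied to the backend FS) is below both limits.

```python
import threading


class WalThrottle:
    """Hypothetical cap on WAL entries that are queued but not yet applied.

    max_ops and max_bytes stand in for the proposed
    newstore_wal_max_ops / newstore_wal_max_bytes options.
    """

    def __init__(self, max_ops, max_bytes):
        self.max_ops = max_ops
        self.max_bytes = max_bytes
        self.ops = 0
        self.bytes = 0
        self._cond = threading.Condition()

    def _has_room(self, nbytes):
        # Always admit one entry when the WAL is empty, so a single
        # oversized write cannot block forever.
        if self.ops == 0:
            return True
        return self.ops < self.max_ops and self.bytes + nbytes <= self.max_bytes

    def acquire(self, nbytes, timeout=None):
        """Block until the entry fits under both caps; False on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._has_room(nbytes), timeout):
                return False
            self.ops += 1
            self.bytes += nbytes
            return True

    def release(self, nbytes):
        """Call once the entry has been applied to the backend FS."""
        with self._cond:
            self.ops -= 1
            self.bytes -= nbytes
            self._cond.notify_all()
```

A submitter would call acquire() before queuing a WAL entry and the apply thread would call release() afterwards. Whether bounding the backlog this way actually keeps WAL data out of Level 0 is the open question in the thread.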

>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
> I think sequential writes being slower than random is by design in newstore: every object can have only one WAL, which means no concurrent IO if req_size * QD < 4MB. What queue depth (QD) did you use in the test? I suspect 64, since there is a boost in sequential write performance with request sizes > 64KB (64KB * 64 = 4MB).

You nailed it, 64.
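
The arithmetic behind that guess can be written out as a small sketch. It assumes 4 MB objects and an object-aligned window of in-flight sequential requests; the function name objects_in_flight is made up for illustration. With QD = 64, requests of 64KB or less keep the whole window inside one object (and so behind one WAL), while larger requests spread it over several.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumption: 4 MB objects
KB = 1024


def objects_in_flight(req_size, queue_depth, object_size=OBJECT_SIZE):
    """Distinct objects covered by an object-aligned window of
    queue_depth sequential requests of req_size bytes each."""
    window = req_size * queue_depth
    # Ceiling division; at least one object is always touched.
    return max(1, -(-window // object_size))


if __name__ == "__main__":
    for size_kb in (4, 32, 64, 128, 512):
        n = objects_in_flight(size_kb * KB, queue_depth=64)
        print(f"{size_kb:>4} KB x QD 64 -> {n} object(s) in flight")
```

An unaligned window can straddle two objects even at 64KB, so the real boundary is softer than this sketch suggests.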

>
> In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to FS -> sync. We do everything synchronously, which is inherently expensive.

Will you be on the performance call this morning?  Perhaps we can talk 
about it more there?
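
For reference, the synchronous pattern Xiaoxi describes (write to WAL, sync, write to FS, sync) looks roughly like the sketch below. It is an illustration with plain POSIX file calls on temporary files, not newstore's actual code path; synchronous_write and demo are hypothetical names. Each logical write pays for two fsync calls, and nothing overlaps.

```python
import os
import tempfile


def synchronous_write(wal_fd, data_fd, payload, offset):
    """One logical write done fully synchronously; returns syncs issued."""
    os.write(wal_fd, payload)            # 1. append the payload to the WAL
    os.fsync(wal_fd)                     # 2. wait until the WAL entry is durable
    os.pwrite(data_fd, payload, offset)  # 3. apply the payload to the backing file
    os.fsync(data_fd)                    # 4. wait again before the next write
    return 2


def demo():
    """Two back-to-back writes against temporary files (POSIX only)."""
    with tempfile.TemporaryDirectory() as tmp:
        wal_fd = os.open(os.path.join(tmp, "wal"),
                         os.O_CREAT | os.O_WRONLY | os.O_APPEND)
        data_fd = os.open(os.path.join(tmp, "object"),
                          os.O_CREAT | os.O_RDWR)
        try:
            syncs = synchronous_write(wal_fd, data_fd, b"aaaa", 0)
            syncs += synchronous_write(wal_fd, data_fd, b"bbbb", 4)
            contents = os.pread(data_fd, 8, 0)
        finally:
            os.close(wal_fd)
            os.close(data_fd)
    return contents, syncs
```

With only one WAL per object, sequential writes to the same object cannot overlap any of these four steps, which matches the slowdown discussed above.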

>
> Xiaoxi.
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, April 29, 2015 7:25 AM
>> To: ceph-devel
>> Subject: newstore performance update
>>
>> Hi Guys,
>>
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance.  Specifically we've been focused on write
>> performance as newstore was lagging filestore by quite a bit previously.  A
>> lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
>> In this situation newstore does better with random writes and sometimes
>> beats filestore (such as in the everything-on-spinning disk tests, and when IO
>> sizes are small in the everything-on-ssd tests).
>>
>> Newstore is changing daily so keep in mind that these results are almost
>> assuredly going to change.  An interesting area of investigation will be why
>> sequential writes are slower than random writes, and whether or not we are
>> being limited by rocksdb ingest speed and how.
>
>>
>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
>> sequential write test to see if rocksdb was starving one of the cores, but
>> found something that looks quite a bit different:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
