Re: newstore performance update

From: Mark Nelson <mnelson@redhat.com>
To: "Chen, Xiaoxi" <xiaoxi.chen@intel.com>, Sage Weil <sweil@redhat.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: newstore performance update
Date: Thu, 30 Apr 2015 09:11:04 -0500
Message-ID: <554237F8.5070907@redhat.com>
In-Reply-To: <ijupkkkyuvtsofr81j33sd7l.1430402112852@email.android.com>



On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
> I am not sure I really understand the OSD code, but from the OSD log, in the sequential small write case there is only one in-flight op at a time…
>
> And Mark, did you pre-allocate the RBD image before running the sequential test? I believe you did, so both seq and random are in WAL mode.

Yes, the RBD image is pre-allocated.  Maybe Sage can chime in regarding 
the one inflight op.

Mark

>
> ---- Mark Nelson wrote ----
>
>
> On 04/29/2015 11:38 AM, Sage Weil wrote:
>> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>>> Hi Mark,
>>>       Really good test :) I only played a bit on SSD. The parallel WAL
>>> threads really help, but we still have a long way to go, especially in
>>> the all-SSD case. I tried this
>>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>>> by hacking rocksdb, but the performance difference is negligible.
>>
>> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
>> and committed the change to the branch.  Probably not noticeable on the
>> SSD, though it can't hurt.
>>
>>> I believe the rocksdb digest speed is the problem. I planned to prove
>>> this by skipping all db transactions, but failed because I hit another
>>> deadlock bug in newstore.
>>
>> Will look at that next!
>>
>>>
>>> Below are a bit more comments.
>>>> Sage has been furiously working away at fixing bugs in newstore and
>>>> improving performance.  Specifically we've been focused on write
>>>> performance as newstore was lagging filestore by quite a bit previously.  A
>>>> lot of work has gone into implementing libaio behind the scenes and as a
>>>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>>>> has improved pretty dramatically. It's now often beating filestore:
>>>>
>>>
>>> SSD DB is still better than SSD WAL with request size > 128KB, which
>>> indicates some WAL entries are actually written to Level0... Hmm, could
>>> we add newstore_wal_max_ops/bytes to cap the total WAL size (how much
>>> data is in the WAL but not yet applied to the backend FS)? I suspect this
>>> would improve performance by preventing some IO with high WA cost and
>>> latency.
>>>
>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>
>>>> On the other hand, sequential writes are slower than random writes when
>>>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>>
>>> I think sequential writes being slower than random is by design in
>>> Newstore: for every object we can only have one WAL, which means no
>>> concurrent IO if req_size * QD < 4MB. What QD do you use in the test?
>>> I suspect 64, since there is a boost in seq write performance with
>>> req size > 64KB (64KB * 64 = 4MB).
>>>
>>> In this case the IO pattern will be: 1 write to DB WAL -> sync -> 1 write
>>> to FS -> sync. We do everything synchronously, which is inherently
>>> expensive.
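
A minimal sketch of the req_size * QD arithmetic in Xiaoxi's comment above. It assumes 4MB RBD objects and the suspected queue depth of 64, and it only counts how many objects a window of in-flight sequential requests touches; `objects_in_window` is a hypothetical helper, not anything in newstore, and whether writes to one object are really serialized is exactly what is under discussion.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct objects touched by `qd` back-to-back sequential
// requests of `req_size` bytes, starting at byte `offset` of the image.
// All sizes are in bytes. Illustrative only.
inline std::uint64_t objects_in_window(std::uint64_t offset,
                                       std::uint64_t req_size,
                                       std::uint64_t qd,
                                       std::uint64_t object_size) {
  assert(req_size > 0 && qd > 0 && object_size > 0);
  const std::uint64_t first = offset / object_size;
  const std::uint64_t last = (offset + req_size * qd - 1) / object_size;
  return last - first + 1;
}
```

With 4KB requests at QD 64 the window is 256KB, so an aligned window sits inside a single 4MB object; at 128KB the window is 8MB and always spans at least two objects, which would match a boost above 64KB if per-object writes were serialized.
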
>>
>> The number of syncs is the same for appends vs wal... in both cases we
>> fdatasync the file and the db commit, but with WAL the fs sync comes after
>> the commit point instead of before (and we don't double-write the data).
>> Appends should still be pipelined (many in flight for the same object)...
>> and the db syncs will be batched in both cases (submit_transaction for
>> each io, and a single thread doing the submit_transaction_sync in a loop).
>>
>> If that's not the case then it's an accident?
>>
>> sage
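
The batching Sage describes above (one submit_transaction per IO, with a single thread looping on submit_transaction_sync) can be pictured with a toy model. `BatchedDb`, `submit`, and `sync_once` are hypothetical names made up for this sketch; it is not newstore or rocksdb code, and the real sync loop runs in its own thread rather than being called by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model of batched db syncs: many cheap submits, one sync per batch.
class BatchedDb {
 public:
  // Analogue of submit_transaction: queue the write, do not sync yet.
  void submit(std::string txn) {
    assert(!txn.empty());
    pending_.push_back(std::move(txn));
  }

  // Analogue of one pass of the sync loop: everything queued so far
  // becomes durable with a single sync. Returns how many transactions
  // that one sync covered (0 means there was nothing to do).
  std::size_t sync_once() {
    if (pending_.empty()) return 0;
    const std::size_t n = pending_.size();
    for (std::string& t : pending_) durable_.push_back(std::move(t));
    pending_.clear();
    ++syncs_;
    return n;
  }

  std::size_t syncs() const { return syncs_; }
  std::size_t durable() const { return durable_.size(); }

 private:
  std::vector<std::string> pending_;
  std::vector<std::string> durable_;
  std::size_t syncs_ = 0;
};
```

If three IOs are submitted before the loop comes around, one call to `sync_once()` covers all three, so the sync cost is amortized over however many IOs were in flight. With only one op in flight at a time, every sync covers exactly one transaction.
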
>
> So I ran some more tests last night on 2c914df7 to see if any of the new
> changes made much difference for spinning disk small sequential writes,
> and the short answer is no.  Since overlay now works again I also ran
> tests with overlay enabled, and this may have helped marginally (and had
> mixed results for random writes, may need to tweak the default).
>
> After this I got to thinking about how the WAL-on-SSD results were so
> much better that I wanted to confirm that this issue is WAL related.  I
> tried setting DisableWAL. This resulted in about a 90x increase in
> sequential write performance, but only a 2x increase in random write
> performance.  What's more, if you look at the last graph on the pdf
> linked below, you can see that sequential 4K writes with WAL enabled are
> significantly slower than 4K random writes, but sequential 4K writes
> with WAL disabled are significantly faster.
>
> http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
>
> So I guess now I wonder what is happening that is different in each
> case.  I'll probably sit down and start looking through the blktrace
> data and try to get more statistics out of rocksdb for each case.  It
> would be useful if we could tie the rocksdb stats call into an asok command:
>
> DB::GetProperty("rocksdb.stats", &stats)
>
> Mark
>
