Re: puzzled with the design pattern of ceph journal, really ruining performance

From: Mark Nelson <mark.nelson@inktank.com>
To: "Chen, Xiaoxi" <xiaoxi.chen@intel.com>,
	Alexandre DERUMIER <aderumier@odiso.com>,
	Mark Nelson <mark.nelson@inktank.com>
Cc: Somnath Roy <Somnath.Roy@sandisk.com>, 姚宁 <zay11022@gmail.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance
Date: Wed, 17 Sep 2014 20:23:28 -0500
Message-ID: <541A3410.6040504@redhat.com>
In-Reply-To: <6F3FA899187F0043BA1827A69DA2F7CC020BD19A@shsmsx102.ccr.corp.intel.com>

On 09/17/2014 08:05 PM, Chen, Xiaoxi wrote:
>> When benching the Crucial m550, I only see, from time to time (maybe every 30s, I don't remember exactly), IOs slowing down to 200 for 1 or 2 seconds, then going back up to the normal speed of around 4000 IOPS.
>
> Wow, that indicates the m550 is busy with garbage collection. Maybe just try to overprovision a bit (say, if you have a 400G SSD, only partition ~300G); overprovisioning an SSD generally helps both performance and durability. Actually, if you look at the spec differences between the Intel S3500 and S3700, the root cause is the different overprovisioning ratio :)

Hrm, I thought the S3700 uses MLC-HET cells while the S3500 uses regular 
MLC?

>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER
> Sent: Thursday, September 18, 2014 5:13 AM
> To: Mark Nelson
> Cc: Somnath Roy; 姚宁; ceph-devel@vger.kernel.org; Chen, Xiaoxi
> Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance
>
>>> FWIW, the journal will coalesce writes quickly when there are many
>>> concurrent 4k client writes.  Once you hit around 8 4k IOs per OSD,
>>> the journal will start coalescing.  For say 100-150 IOPs (what a
>>> spinning disk can handle), expect around 9ish 100KB journal writes
>>> (with padding and header/footer for each client IO).  What we've seen
>>> is that some drives that aren't that great at 4K O_DSYNC writes are
>>> still reasonably good with 8+ concurrent larger O_DSYNC writes.
>
> Yes, indeed, it's not that bad. (hopefully ;)
>
> When benching the Crucial m550, I only see, from time to time (maybe every 30s, I don't remember exactly), IOs slowing down to 200 for 1 or 2 seconds, then going back up to the normal speed of around 4000 IOPS.
>
> With the Intel S3500, I get constant writes at 5000 IOPS.
>
>
> BTW, does the rbd client cache also help coalesce writes (client side), and so help the journal as well?
>
>
>
>
>
> ----- Original Message -----
>
> From: "Mark Nelson" <mark.nelson@inktank.com>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>, "Xiaoxi Chen" <xiaoxi.chen@intel.com>
> Cc: "Somnath Roy" <Somnath.Roy@sandisk.com>, "姚宁" <zay11022@gmail.com>, ceph-devel@vger.kernel.org
> Sent: Wednesday, 17 September 2014 17:01:06
> Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance
>
> On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote:
>>>> 2. Have you got any data to prove that O_DSYNC or fdatasync kills the
>>>> performance of the journal? In our previous tests, the journal SSD (using a
>>>> partition of an SSD as the journal for a particular OSD, with 4 OSDs
>>>> sharing the same SSD) could reach its peak performance (300-400MB/s)
>>
>> Hi,
>>
>> I have done some bench here:
>>
>> http://www.mail-archive.com/ceph-users@lists.ceph.com/msg12950.html
>>
>> Some SSD models have really bad performance with O_DSYNC (Crucial m550: 312 IOPS with 4k blocks).
>> Benching 1 OSD, I can see big latencies for several seconds when O_DSYNC
>> writes occur.
>
> FWIW, the journal will coalesce writes quickly when there are many concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the journal will start coalescing. For say 100-150 IOPs (what a spinning disk can handle), expect around 9ish 100KB journal writes (with padding and header/footer for each client IO). What we've seen is that some drives that aren't that great at 4K O_DSYNC writes are still reasonably good with 8+ concurrent larger O_DSYNC writes.
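As a rough check on those figures, and assuming both are per-second rates (an assumption; the message does not say so), about nine 100KB journal writes serving 100-150 4k client IOs works out to roughly 6-9KB of journal space per client IO, padding and header/footer included. A minimal Python sketch of that arithmetic; the names are illustrative only:

```python
# Rough arithmetic on the figures quoted above.
# Assumption: "9ish 100KB journal writes" and "100-150 IOPs" both describe
# the same one-second window; the original message does not state this.
JOURNAL_WRITES = 9        # coalesced journal writes
JOURNAL_WRITE_KB = 100    # approximate size of each journal write
CLIENT_IO_KB = 4          # client payload per IO

JOURNAL_KB = JOURNAL_WRITES * JOURNAL_WRITE_KB  # total KB hitting the journal


def journal_kb_per_io(client_iops):
    """Journal KB consumed per client IO, overhead included."""
    return JOURNAL_KB / client_iops


for client_iops in (100, 150):
    per_io = journal_kb_per_io(client_iops)
    ios_per_write = client_iops / JOURNAL_WRITES
    overhead = per_io - CLIENT_IO_KB  # padding plus header/footer
    print(f"{client_iops} IOPS: {per_io:.1f} KB per IO, "
          f"{ios_per_write:.1f} client IOs per journal write, "
          f"{overhead:.1f} KB overhead per IO")
```

Under that reading, each journal write carries on the order of 11 to 17 client IOs, which is in line with coalescing starting at around 8 concurrent IOs.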
>
>>
>>
>>
>> crucial m550
>> ------------
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 \
>>     --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=1249.9KB/s, iops=312
>>
>> intel s3500
>> -----------
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 \
>>     --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=41794KB/s, iops=10448
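The two results can be cross-checked against each other: with a fixed block size, bandwidth should be close to IOPS times block size. A small Python sketch using only the numbers reported above (the dictionary layout and function name are just for illustration):

```python
# Sanity check on the fio numbers: bandwidth ~= IOPS * block size.
BLOCK_KB = 4  # from --bs=4k

results = {
    "crucial m550": {"iops": 312, "bw_kb_s": 1249.9},
    "intel s3500": {"iops": 10448, "bw_kb_s": 41794},
}


def expected_bw_kb_s(iops, block_kb=BLOCK_KB):
    """Bandwidth implied by an IOPS figure at a fixed block size."""
    return iops * block_kb


for name, r in results.items():
    calc = expected_bw_kb_s(r["iops"])
    print(f"{name}: reported {r['bw_kb_s']} KB/s, iops * bs = {calc} KB/s")

# How far apart the two drives are on synchronous 4k writes.
ratio = results["intel s3500"]["iops"] / results["crucial m550"]["iops"]
print(f"s3500 / m550 sync-write IOPS ratio: {ratio:.1f}x")
```

The reported bandwidths agree with the IOPS figures to within rounding, and the gap between the two drives on this test is more than 30x.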
>>
>> ----- Original Message -----
>>
>> From: "Xiaoxi Chen" <xiaoxi.chen@intel.com>
>> To: "Somnath Roy" <Somnath.Roy@sandisk.com>, "姚宁" <zay11022@gmail.com>,
>> ceph-devel@vger.kernel.org
>> Sent: Wednesday, 17 September 2014 09:59:37
>> Subject: RE: puzzled with the design pattern of ceph journal, really
>> ruining performance
>>
>> Hi Nicheal,
>>
>> 1. The main purpose of the journal is to provide transaction semantics (preventing partial updates). Peering is not enough for this, because Ceph writes all replicas at the same time, so after a crash you have no idea which replica has the right data. For example, say we have 2 replicas and a user updates a 4M object: the primary OSD crashes when only the first 2M has been written, and the secondary OSD may also fail when only the first 3M has been written. Then the versions on the primary and the secondary are neither the new value nor the old value, and there is no way to recover. So, following the same idea as databases, we need a journal to support transactions and prevent this from happening. For backends that support transactions, BTRFS for instance, we don't need the journal for this; we can write to the journal and the data disk at the same time, and the journal there just tries to help performance, since it only does sequential writes and we expect it to be faster than the backend OSD.
>>
>> 2. Have you got any data to prove that O_DSYNC or fdatasync kills the
>> performance of the journal? In our previous tests, the journal SSD (using a
>> partition of an SSD as the journal for a particular OSD, with 4 OSDs sharing
>> the same SSD) could reach its peak performance (300-400MB/s)
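The torn-update scenario in point 1 can be sketched with a toy write-ahead journal. This is only an in-memory illustration of the idea, not Ceph's FileStore code; `ToyOSD`, `submit`, `recover` and `Crash` are invented names, and the "journal" here is a plain Python list standing in for a durable log.

```python
# Toy model of a write-ahead journal. NOT Ceph's implementation: every name
# below is hypothetical and the journal is just an in-memory list.
MB = 1024 * 1024


class Crash(Exception):
    """Simulated failure in the middle of an in-place update."""


class ToyOSD:
    def __init__(self, size):
        self.data = bytearray(size)  # the object as stored in the backend
        self.journal = []            # stand-in for a durable (offset, payload) log

    def submit(self, offset, payload, crash_after=None):
        # 1. Record the whole transaction in the journal first.
        self.journal.append((offset, bytes(payload)))
        # 2. Apply it in place; optionally fail part-way through.
        self._apply(offset, payload, crash_after)
        # 3. Only a fully applied transaction is trimmed from the journal.
        self.journal.pop()

    def _apply(self, offset, payload, crash_after=None):
        n = len(payload) if crash_after is None else crash_after
        self.data[offset:offset + n] = payload[:n]
        if crash_after is not None:
            raise Crash()

    def recover(self):
        # Replay journaled transactions; rewriting the same bytes is idempotent.
        for offset, payload in self.journal:
            self._apply(offset, payload)
        self.journal.clear()


new_value = b"N" * (4 * MB)
primary, secondary = ToyOSD(4 * MB), ToyOSD(4 * MB)

# Primary fails after 2M, secondary after 3M, as in the example above.
for osd, written in ((primary, 2 * MB), (secondary, 3 * MB)):
    try:
        osd.submit(0, new_value, crash_after=written)
    except Crash:
        pass

torn = bytes(primary.data) != new_value and bytes(secondary.data) != new_value
primary.recover()
secondary.recover()
consistent = bytes(primary.data) == new_value == bytes(secondary.data)
print("torn after crash:", torn, "| consistent after replay:", consistent)
```

Without the journal entry, neither replica could tell which bytes of the object belong to the new version; with it, each one can finish the update on its own.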
>>
>> Xiaoxi
>>
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Wednesday, September 17, 2014 3:30 PM
>> To: 姚宁; ceph-devel@vger.kernel.org
>> Subject: RE: puzzled with the design pattern of ceph journal, really
>> ruining performance
>>
>> Hi Nicheal,
>> Not only recovery; IMHO the main purpose of the Ceph journal is to support transaction semantics, since XFS doesn't have that. I guess it can't be achieved with pg_log/pg_info.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of 姚宁
>> Sent: Tuesday, September 16, 2014 11:29 PM
>> To: ceph-devel@vger.kernel.org
>> Subject: puzzled with the design pattern of ceph journal, really
>> ruining performance
>>
>> Hi, guys
>>
>> I have been analyzing the architecture of the Ceph source code.
>>
>> I know that, in order to keep the journal atomic and consistent, the journal must be opened with O_DSYNC, or the fdatasync() system call must be issued after every write operation. However, this kind of operation really kills performance and causes high commit latency, even if an SSD is used as the journal disk. If the SSD has a capacitor to keep the data safe when the system crashes, we can set the mount option nobarrier, or the SSD itself will ignore the FLUSH request, so the performance would be better.
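The two durability modes mentioned above can be compared with a small Python sketch. It is not Ceph's journal code; it only uses the standard `os` module on a temporary file, assumes a Linux system where `os.O_DSYNC` and `os.fdatasync` are available, and the function names and record count are made up for the example. A plain buffered writer is included as a baseline with no durability guarantee.

```python
import os
import tempfile
import time

RECORD = b"\0" * 4096  # one 4k write
COUNT = 64             # arbitrary number of records for the comparison


def write_all(fd, buf):
    # os.write may write fewer bytes than requested; loop until done.
    view = memoryview(buf)
    while view:
        view = view[os.write(fd, view):]


def journal_o_dsync(path):
    # Each write() returns only once the data has reached stable storage.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        for _ in range(COUNT):
            write_all(fd, RECORD)
    finally:
        os.close(fd)


def journal_fdatasync(path):
    # Buffered write followed by an explicit flush of the file data.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(COUNT):
            write_all(fd, RECORD)
            os.fdatasync(fd)
    finally:
        os.close(fd)


def journal_buffered(path):
    # Baseline: data may still sit in the page cache when this returns.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(COUNT):
            write_all(fd, RECORD)
    finally:
        os.close(fd)


timings, sizes = {}, {}
with tempfile.TemporaryDirectory() as tmp:
    for fn in (journal_o_dsync, journal_fdatasync, journal_buffered):
        path = os.path.join(tmp, fn.__name__)
        start = time.perf_counter()
        fn(path)
        timings[fn.__name__] = time.perf_counter() - start
        sizes[fn.__name__] = os.path.getsize(path)

for name, secs in timings.items():
    print(f"{name}: {COUNT} x 4k writes in {secs * 1000:.1f} ms")
```

The absolute timings depend entirely on the device and filesystem under the temporary directory, so they only give a feel for the relative cost of synchronous writes on one machine.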
>>
>> So can it be replaced by other strategies?
>> As far as I am concerned, the most important parts are pg_log and pg_info. They guide a crashed OSD in recovering its objects from its peers. Therefore, if we can keep pg_log at a consistent point, we can recover data without the journal. So can we just use an "undo"
>> strategy on pg_log and drop the Ceph journal? It would save a lot of bandwidth, and, based on the consistent pg_log epoch, we can always recover data from the peer OSDs, right? But this would lead to recovering more objects if the OSD crashes.
>>
>> Nicheal
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


Thread overview: 8+ messages
2014-09-17  6:29 puzzled with the design pattern of ceph journal, really ruining performance 姚宁
2014-09-17  7:29 ` Somnath Roy
2014-09-17  7:59   ` Chen, Xiaoxi
2014-09-17 14:20     ` Alexandre DERUMIER
2014-09-17 15:01       ` Mark Nelson
2014-09-17 21:13         ` Alexandre DERUMIER
2014-09-18  1:05           ` Chen, Xiaoxi
2014-09-18  1:23             ` Mark Nelson [this message]
