From: "Jim Schutt" <jaschut@sandia.gov>
To: Sage Weil <sage@newdream.net>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: scaling issues
Date: Fri, 9 Mar 2012 16:21:08 -0700
Message-ID: <4F5A9064.3020400@sandia.gov>
In-Reply-To: <4F5A5C65.6030705@sandia.gov>

On 03/09/2012 12:39 PM, Jim Schutt wrote:
> On 03/08/2012 05:26 PM, Sage Weil wrote:
>> On Thu, 8 Mar 2012, Jim Schutt wrote:
>>> Hi,
>>>
>>> I've been trying to scale up a Ceph filesystem to as big
>>> as I have hardware for - up to 288 OSDs right now.
>>>
>>> (I'm using commit ed0f605365e - tip of master branch from
>>> a few days ago.)
>>>
>>> My problem is that I cannot get a 288 OSD filesystem to go active
>>> (that's with 1 mon and 1 MDS). Pretty quickly I start seeing
>>> "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
>>> Note that as this is happening all the OSDs and the MDS are
>>> essentially idle; only the mon is busy.
>>>
>>> While tailing the mon log I noticed there was a periodic pause;
>>> after adding a little more debug printing, I learned that the
>>> pause was due to encoding pg_stat_t before writing the pg_map to disk.
>>>
>>> Here's the result of a scaling study I did on startup time for
>>> a freshly created filesystem. I normally run 24 OSDs/server on
>>> these machines with no trouble, for small numbers of OSDs.
>>>
>>>                seconds from   seconds from   seconds to
>>>                store() mount  store() mount  encode
>>>  OSD      PG   to up:active   to all PGs     pg_stat_t   Notes
>>>                               active+clean*
>>>
>>>   48    9504        58             63          0.30
>>>   72   14256        70             89          0.65
>>>   96   19008        93            117          1.1
>>>  120   23760       132            138          1.7
>>>  144   28512        92            165          2.3
>>>  168   33264       215            218          3.2       periods of "up:creating(laggy or crashed)"
>>>  192   38016       392            344          4.0       periods of "up:creating(laggy or crashed)"
>>>  240   47520      1189            644          6.3       periods of "up:creating(laggy or crashed)"
>>>  288   57024    >14400         >14400          9.0       never went active; >200 OSDs out,
>>>                                                          reporting "wrongly marked me down"
>>
>> Weird, pg_stat_t really shouldn't be growing quadratically. Can you look
>> at the size of the monitor's pg/latest file, and see if that is growing
>> quadratically as well? I would expect it to be proportional to the
>> encode time.
>
> Here's the size of the last written mon/pgmap/latest for each run,
> and the time it took to encode the pg_stat_t part of that last
> instance of the file:
>
> OSDs  size of  pg_stat_t
>         latest   encode
>                   time
>
>    48   2976397  0.323052
>    72   4472477  0.666633
>    96   5969461  1.159198
>   120   7466021  1.738096
>   144   8963141  2.428229
>   168  10460309  3.203832
>   192  11956709  4.083013
>   240  14950445  6.453171
>   288  17916589  9.462052
>
> Looks like the file size is growing linearly, but the
> encode time is not?

I remembered that I've been compiling with -g since I started working
with Ceph (so I could poke around easily in gdb, and at small sizes
it didn't seem to hurt anything).  That's how I generated the results
above.

I recompiled with -g -O2, and got this:

OSDs  size of  pg_stat_t
        latest   encode
                  time

   48   2976461  0.052731
   72   4472477  0.107187
   96   5969477  0.194690
  120   7466021  0.311586
  144   8963141  0.465111
  168  10460317  0.680222
  192  11956709  0.713398
  240  14950437  1.159426
  288  17944413  1.714004

It seems that encoding time still isn't proportional to the
size of pgmap/latest.  However, things have improved enough
that my 288 OSD filesystem goes active pretty quickly (~90 sec),
so I can continue testing at that scale.
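
One way to put a number on "not proportional" is to fit a power law,
time ~ size^k, by taking a least-squares slope in log-log space: a
slope near 1 would mean the encode time is linear in the file size,
and a slope near 2 would mean it is roughly quadratic.  The following
is only a sketch.  The helper name loglog_slope and the variable names
are made up for illustration, the sizes are assumed to be in bytes and
the times in seconds, and the data are copied from the two tables in
this thread.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    For data that follows y ~ x**k this returns an estimate of k.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# (size of pgmap/latest, pg_stat_t encode time) per run, copied from
# the tables in this thread.  Units are assumed: bytes and seconds.
built_g = [
    (2976397, 0.323052), (4472477, 0.666633), (5969461, 1.159198),
    (7466021, 1.738096), (8963141, 2.428229), (10460309, 3.203832),
    (11956709, 4.083013), (14950445, 6.453171), (17916589, 9.462052),
]
built_g_o2 = [
    (2976461, 0.052731), (4472477, 0.107187), (5969477, 0.194690),
    (7466021, 0.311586), (8963141, 0.465111), (10460317, 0.680222),
    (11956709, 0.713398), (14950437, 1.159426), (17944413, 1.714004),
]

# zip(*pairs) splits the list of pairs into (sizes, times).
slope_g = loglog_slope(*zip(*built_g))
slope_o2 = loglog_slope(*zip(*built_g_o2))

print(f"fitted exponent, -g only: {slope_g:.2f}")
print(f"fitted exponent, -g -O2:  {slope_o2:.2f}")
```

If both fitted exponents come out well above 1, that would support
the reading that -O2 lowers the constant factor but does not change
how the encode time scales with the size of the file.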

-- Jim


>
>>
>> And maybe send us a copy of one of the big ones?
>
> On its way.
>
> -- Jim
>
>>
>> Thanks-
>> sage
>>
>>
>


