From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Jim Schutt" <jaschut@sandia.gov>
Subject: Re: Trouble getting a new file system to start, for v0.59 and
 newer
Date: Thu, 4 Apr 2013 09:52:54 -0600
Message-ID: <515DA1D6.1060607@sandia.gov>
References: <515C4EC4.5040602@sandia.gov>
 <alpine.DEB.2.00.1304030857480.12367@cobra.newdream.net>
 <515C6232.4070204@sandia.gov>
 <CAPYLRzgT47iTKuv-YCQWfU_4O=prrkEByp9KAftj2jFMktve5Q@mail.gmail.com>
 <CAPYLRzhzQp3xsf_QzzYYJo9RAj_sBV1THs2PJUhk1QRe9XB42w@mail.gmail.com>
 <515C6DD1.2050702@sandia.gov>
 <alpine.DEB.2.00.1304031123540.15431@cobra.newdream.net>
 <515CAFED.60705@sandia.gov>
 <CAPYLRzgyxiFQ_VVde9dB6WZhg5eFr-qbRxS6WQC7ewC4FXXWpQ@mail.gmail.com>
 <515D8B07.6050102@sandia.gov>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path: <ceph-devel-owner@vger.kernel.org>
Received: from sentry-two.sandia.gov ([132.175.109.14]:38584 "EHLO
	sentry-two.sandia.gov" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1762366Ab3DDPxR (ORCPT
	<rfc822;ceph-devel@vger.kernel.org>); Thu, 4 Apr 2013 11:53:17 -0400
In-Reply-To: <515D8B07.6050102@sandia.gov>
Sender: ceph-devel-owner@vger.kernel.org
List-ID: <ceph-devel.vger.kernel.org>
To: Gregory Farnum <greg@inktank.com>
Cc: Sage Weil <sage@inktank.com>, Joao Eduardo Luis <joao.luis@inktank.com>, "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>

On 04/04/2013 08:15 AM, Jim Schutt wrote:
> On 04/03/2013 04:51 PM, Gregory Farnum wrote:
>> On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt <jaschut@sandia.gov> wrote:
>>> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>>>>>> Sorry, guess I forgot some of the history since this piece at least is
>>>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>>>>>>> without those overloads you were seeing; have you seen this issue
>>>>>>>> without your high debugging levels and without the bad PG commits (due
>>>>>>>> to debugging)?
>>>>>>
>>>>>> I think so, because that's why I started with higher debugging
>>>>>> levels.
>>>>>>
>>>>>> But, as it turns out, I'm just in the process of returning to my
>>>>>> testing of next, with all my debugging back to 0.  So, I'll try
>>>>>> the default timeout of 30 seconds first.  If I have trouble starting
>>>>>> up a new file system, I'll turn up the timeout and try again, without
>>>>>> any extra debugging.  Either way, I'll let you know what happens.
>>>> I would be curious to hear roughly what value between 30 and 300 is
>>>> sufficient, if you can experiment just a bit.  We probably want to adjust
>>>> the default.
>>>>
>>>> Perhaps more importantly, we'll need to look at the performance of the pg
>>>> stat updates on the mon.  There is a refactor due in that code that should
>>>> improve life, but it's slated for dumpling.
>>>
>>> OK, here's some results, with all debugging at 0, using current next...
>>>
>>> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
>>> use 10 GbE NICs now.  The mon host uses an SSD for the mon data store.
>>> My test procedure is to start 'ceph -w', start all the OSDs, and once
>>> they're all running start the mon.  I report the time from starting
>>> the mon to all PGs active+clean.
>>>
>>> # PGs     osd mon ack    startup    notes
>>>             timeout       time
>>> -------  ------------    --------   -----
>>>  55392      default      >30:00       1
>>>  55392        300         18:36       2
>>>  55392         60        >30:00       3
>>>  55392        150        >30:00       4
>>>  55392        240        >30:00       5
>>>  55392        300        >30:00       2,6
>>>
>>> notes:
>>> 1) lots of PGs marked stale, OSDs wrongly marked down
>>>      before I gave up on this case
>>> 2) OSDs report lots of slow requests for "pg_notify(...) v4
>>>      currently wait for new map"
>>> 3) some OSDs wrongly marked down, OSDs report some slow requests
>>>      for "pg_notify(...) v4 currently wait for new map"
>>>      before I gave up on this case
>>> 4) appeared to be making progress; then an OSD was marked
>>>      out at ~21 minutes; many more marked out before I
>>>      gave up on this case
>>> 5) some OSD reports of slow requests for "pg_notify",
>>>      some OSDs wrongly marked down, appeared to be making
>>>      progress, then stalled; then I gave up on this case
>>> 6) retried this case, appeared to be making progress, but
>>>      after ~18 min stalled at 19701 active+clean, 35691 peering
>>>      until I gave up
>>>
>>> Hmmm, I didn't really expect the above results.  I ran out of
>>> time before attempting an even longer osd mon ack timeout.
>>>
>>> But either we're on the wrong trail, or 300 is not sufficient.
>>> Or, I'm doing something wrong and haven't yet figured out what
>>> it is.
>>>
>>> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
>>> and my memory is a new filesystem started up in ~5 minutes.  For
>>> that testing I had to increase 'paxos propose interval' to two
>>> or three seconds to keep the monitor writeout rate (as measured
>>> by vmstat) down to a sustained 50-70 MB/s during start-up.
>>>
>>> That was with a 1 GbE NIC in the mon; the reason I upgraded
>>> it was a filesystem with 512K PGs was taking too long to start,
>>> and I thought the mon might be network-limited since it had an
>>> SSD for the mon data store.
>>>
>>> For the testing above I used the default 'paxos propose interval'.
>>> I don't know if it matters, but vmstat sees only a little data
>>> being written on the mon system.
>>
>> That's odd; I'd actually expect to see more going to disk with v0.59
>> than previously. Is vmstat actually looking at disk IO, or might it be
>> missing DirectIO or something? (Not that I remember if LevelDB is
>> using those.)
>
> FWIW, 'dd oflag=direct' shows up in vmstat.  But, I don't know if
> that is relevant to what LevelDB might be doing...
>
>> However, I think you might want to increase your paxos propose
>> interval to where it was before -- your OSDs are having trouble keeping
>> up with the number of maps that are being generated, based on the fact
>> that you have a lot of pg notifies stuck waiting for newer maps.
>
> OK, I'll try that.  But to clarify, in the past the default paxos
> propose interval was good up to 128K PGs, or so.

Hmmmph.  With 'paxos propose interval = 3' and 'osd mon ack timeout = 300',
and no other debugging enabled, I still didn't get a new filesystem to
start up in <30 minutes.  I did modify the "mon hasn't acked PGStats"
message to be debug level 0, and saw a few of those, but saw no slow
pg_notify requests reported.
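
For reference, here is a minimal sketch of how those two overrides might be
written out as a ceph.conf-style fragment.  It uses Python's standard
configparser module; the helper name render_overrides and the placement of
the options under [mon] and [osd] (rather than [global]) are assumptions
for illustration, not something stated in this thread.

```python
import configparser
import io


def render_overrides(propose_interval_s: int, ack_timeout_s: int) -> str:
    """Render the two overrides as ceph.conf-style text (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    # Assumption: the monitor option goes under [mon] and the OSD option
    # under [osd]; the thread does not say which sections were used.
    cfg["mon"] = {"paxos propose interval": str(propose_interval_s)}
    cfg["osd"] = {"osd mon ack timeout": str(ack_timeout_s)}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Values from the run described above: 3 s propose interval, 300 s ack timeout.
print(render_overrides(3, 300))
```

The printed text has a [mon] section carrying 'paxos propose interval = 3'
and an [osd] section carrying 'osd mon ack timeout = 300'.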

I'm really puzzled by that 'osd mon ack timeout = 300' case I reported
above, which started in ~18 minutes.  So far that's the only successful
start I've gotten within 30 min since I turned off debugging....
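
The start-up times quoted above are measured until every PG is
active+clean.  A rough sketch of how that check could be scripted is
below.  The summary-line format ("N pgs: count state, ...") and the
function names are assumptions for illustration; the sample counts are
the ones from note 6 above.

```python
import re


def parse_pg_states(summary: str) -> dict:
    """Parse 'N pgs: A state1, B state2' into a total and per-state counts.

    The line format is an assumption, not a documented interface.
    """
    m = re.search(r"(\d+)\s+pgs:\s*([^;]*)", summary)
    if m is None:
        raise ValueError("no PG summary found in: %r" % summary)
    total = int(m.group(1))
    states = {}
    for part in m.group(2).split(","):
        part = part.strip()
        if not part:
            continue
        count, state = part.split(None, 1)
        states[state] = int(count)
    return {"total": total, "states": states}


def all_active_clean(summary: str) -> bool:
    """True only when every PG in the summary is active+clean."""
    parsed = parse_pg_states(summary)
    return parsed["states"].get("active+clean", 0) == parsed["total"]


# Counts from the stalled retry (note 6): the two states add up to all PGs.
stalled = "55392 pgs: 19701 active+clean, 35691 peering"
parsed = parse_pg_states(stalled)
print(sum(parsed["states"].values()) == parsed["total"])
print(all_active_clean(stalled))
```

A timing loop would poll for such a line and stop the clock the first
time all_active_clean() returns True.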

-- Jim

>
> Thanks -- Jim
>=20
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>