Problems after crash yesterday

* Problems after crash yesterday
@ 2012-02-21 10:24 Jens Rehpöhler
  2012-02-22  9:53 ` Jens Rehpöhler
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Rehpöhler @ 2012-02-21 10:24 UTC (permalink / raw)
  To: ceph-devel; +Cc: sage

Hi sage,

sorry ... we have to disturb you again.

After the node crash (oli wrote about that) we have some problems.

The recovery process is stuck at:

2012-02-21 11:20:15.948527    pg v986715: 2046 pgs: 2035 active+clean,
10 active+clean+inconsistent, 1 active+recovering+remapped+backfill;
1988 GB data, 3823 GB used, 25970 GB / 29794 GB avail; 1/1121879
degraded (0.000%)

We also see these messages every few seconds:

2012-02-21 11:20:15.106958   log 2012-02-21 11:20:05.765762 osd.3
10.10.10.8:6803/29916 131581 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-21 11:20:15.106958   log 2012-02-21 11:20:05.765775 osd.3
10.10.10.8:6803/29916 131582 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached
2012-02-21 11:20:15.106958   log 2012-02-21 11:20:06.765912 osd.3
10.10.10.8:6803/29916 131583 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-21 11:20:15.106958   log 2012-02-21 11:20:06.765943 osd.3
10.10.10.8:6803/29916 131584 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached
2012-02-21 11:20:15.106958   log 2012-02-21 11:20:07.766312 osd.3
10.10.10.8:6803/29916 131585 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-21 11:20:15.106958   log 2012-02-21 11:20:07.766324 osd.3
10.10.10.8:6803/29916 131586 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached
2012-02-21 11:20:15.106958   log 2012-02-21 11:20:08.766467 osd.3
10.10.10.8:6803/29916 131587 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started

Any ideas how we can get the cluster back to a consistent state?

Thank you!

Jens
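
The pg status line quoted in the message above can be broken down by state to see which placement groups are not yet clean. A minimal sketch in Python, assuming the line keeps the layout shown in this thread; the function name parse_pg_states is illustrative and not part of Ceph:

```python
import re


def parse_pg_states(status: str) -> dict:
    """Map each PG state in a status line to its count."""
    # The state list sits between "pgs:" and the first ";".
    match = re.search(r"pgs:\s*(.*?);", status, flags=re.S)
    if match is None:
        raise ValueError("no 'pgs:' section found")
    states = {}
    for item in match.group(1).split(","):
        count, state = item.split()
        states[state] = int(count)
    return states


# Status line copied from the message above (re-joined onto one line).
STATUS = (
    "pg v986715: 2046 pgs: 2035 active+clean, "
    "10 active+clean+inconsistent, 1 active+recovering+remapped+backfill; "
    "1988 GB data, 3823 GB used, 25970 GB / 29794 GB avail; "
    "1/1121879 degraded (0.000%)"
)

states = parse_pg_states(STATUS)
# Everything that is not plain active+clean still needs attention.
not_clean = {s: n for s, n in states.items() if s != "active+clean"}
print(not_clean)
```

The per-state counts add up to the 2046 PGs reported at the start of the line, which is a quick sanity check that nothing was dropped while parsing.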


* Re: Problems after crash yesterday
  2012-02-21 10:24 Problems after crash yesterday Jens Rehpöhler
@ 2012-02-22  9:53 ` Jens Rehpöhler
  2012-02-22 17:12   ` Gregory Farnum
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Rehpöhler @ 2012-02-22  9:53 UTC (permalink / raw)
  To: ceph-devel; +Cc: sage

Some additions: meanwhile, we are at this state:

2012-02-22 10:38:49.587403    pg v1044553: 2046 pgs: 2036 active+clean,
10 active+clean+inconsistent; 2110 GB data, 4061 GB used, 25732 GB /
29794 GB avail

The active+recovering+remapped+backfill state disappeared after a restart of a
crashed OSD.

The OSD crashed after we issued the command "ceph pg repair 106.3".

The repeating messages are still there:

2012-02-22 10:52:36.198983   log 2012-02-22 10:52:32.182488 osd.3
10.10.10.8:6803/29916 302906 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:32.182500 osd.3
10.10.10.8:6803/29916 302907 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:33.182615 osd.3
10.10.10.8:6803/29916 302908 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:33.182629 osd.3
10.10.10.8:6803/29916 302909 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:34.182839 osd.3
10.10.10.8:6803/29916 302910 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:34.182853 osd.3
10.10.10.8:6803/29916 302911 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:35.183075 osd.3
10.10.10.8:6803/29916 302912 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:35.183089 osd.3
10.10.10.8:6803/29916 302913 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached

These requests seem to have been hanging since our crash.

Finally, we see some scrub errors like this:

2012-02-22 10:47:35.049386 log 2012-02-22 10:47:25.310571 osd.4
10.10.10.10:6800/17745 34356 : [ERR] 16.4 osd.2: soid
ce7f1004/rb.0.0.00000000001a/headmissing attr _, missing attr snapset

Any advice?

Thanks

Jens
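
The repeated "old request" warnings above can be grouped to check how many distinct requests are actually stuck. A minimal sketch in Python, assuming the log layout quoted in this thread (records wrapped across lines); the names OLD_REQUEST and count_old_requests are illustrative only:

```python
import re
from collections import Counter

# Matches the PG id and the original receive time of an "old request" warning.
OLD_REQUEST = re.compile(
    r"old request pg_log\((?P<pg>\S+) epoch \d+ query_epoch \d+\) v\d+ "
    r"received at (?P<received>\d{4}-\d{2}-\d{2} [\d:.]+)"
)


def count_old_requests(log_text: str) -> Counter:
    """Count warnings per (pg, received-at) pair."""
    # Records are wrapped across lines in the mail, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return Counter((m["pg"], m["received"]) for m in OLD_REQUEST.finditer(flat))


# Three records copied from the message above.
SAMPLE = """\
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:32.182488 osd.3
10.10.10.8:6803/29916 302906 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:32.182500 osd.3
10.10.10.8:6803/29916 302907 : [WRN] old request pg_log(2.e8 epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
flag points reached
2012-02-22 10:52:36.198983   log 2012-02-22 10:52:33.182615 osd.3
10.10.10.8:6803/29916 302908 : [WRN] old request pg_log(0.ea epoch 849
query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
"""

counts = count_old_requests(SAMPLE)
for (pg, received), n in counts.items():
    print(f"pg {pg}: request received {received}, reported {n} times")
```

Grouping by receive time rather than by log timestamp makes it visible that the same two requests (for PGs 0.ea and 2.e8) are being reported over and over, rather than new requests piling up.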





-- 
Kind regards

Jens Rehpöhler

----------------------------------------------------------------------
Filoo GmbH
Moltkestr. 25a
33330 Gütersloh
HRB4355 AG Gütersloh

Managing directors: S.Grewing | J.Rehpöhler | Dr. C.Kunz
Phone: +49 5241 8673012 | Mobile: +49 151 54645798
Hotline: 07000-3378658 (14 Ct/min) Fax: +49 5241 8673020




* Re: Problems after crash yesterday
  2012-02-22  9:53 ` Jens Rehpöhler
@ 2012-02-22 17:12   ` Gregory Farnum
  2012-02-22 20:25     ` Jens Rehpöhler
  0 siblings, 1 reply; 6+ messages in thread
From: Gregory Farnum @ 2012-02-22 17:12 UTC (permalink / raw)
  To: Jens Rehpöhler; +Cc: ceph-devel@vger.kernel.org, sage@newdream.net

On Feb 22, 2012, at 1:53 AM, "Jens Rehpöhler" <jens.rehpoehler@filoo.de> wrote:

> Some additions: meanwhile, we are at this state:
>
> 2012-02-22 10:38:49.587403    pg v1044553: 2046 pgs: 2036 active+clean,
> 10 active+clean+inconsistent; 2110 GB data, 4061 GB used, 25732 GB /
> 29794 GB avail
>
> The active+recovering+remapped+backfill state disappeared after a restart of a
> crashed OSD.
>
> The OSD crashed after we issued the command "ceph pg repair 106.3".
>
> The repeating messages are still there:
Hmm. These messages indicate there are requests that came in that
never got answered -- or else that the tracking code isn't quite right
(it's new functionality). What version are you running?

> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:32.182488 osd.3
> 10.10.10.8:6803/29916 302906 : [WRN] old request pg_log(0.ea epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:32.182500 osd.3
> 10.10.10.8:6803/29916 302907 : [WRN] old request pg_log(2.e8 epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
> flag points reached
> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:33.182615 osd.3
> 10.10.10.8:6803/29916 302908 : [WRN] old request pg_log(0.ea epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:33.182629 osd.3
> 10.10.10.8:6803/29916 302909 : [WRN] old request pg_log(2.e8 epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
> flag points reached
> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:34.182839 osd.3
> 10.10.10.8:6803/29916 302910 : [WRN] old request pg_log(0.ea epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:34.182853 osd.3
> 10.10.10.8:6803/29916 302911 : [WRN] old request pg_log(2.e8 epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
> flag points reached
> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:35.183075 osd.3
> 10.10.10.8:6803/29916 302912 : [WRN] old request pg_log(0.ea epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774507 currently started
> 2012-02-22 10:52:36.198983   log 2012-02-22 10:52:35.183089 osd.3
> 10.10.10.8:6803/29916 302913 : [WRN] old request pg_log(2.e8 epoch 849
> query_epoch 843) v2 received at 2012-02-20 17:39:41.774662 currently no
> flag points reached
>
> These requests seem to have been hanging since our crash.
>
> Finally, we see some scrub errors like this:
>
> 2012-02-22 10:47:35.049386 log 2012-02-22 10:47:25.310571 osd.4
> 10.10.10.10:6800/17745 34356 : [ERR] 16.4 osd.2: soid
> ce7f1004/rb.0.0.00000000001a/headmissing attr _, missing attr
And that's a problem with the xattrs. What filesystem are you using
underneath Ceph?

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Problems after crash yesterday
  2012-02-22 17:12   ` Gregory Farnum
@ 2012-02-22 20:25     ` Jens Rehpöhler
  2012-02-24  5:14       ` Gregory Farnum
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Rehpöhler @ 2012-02-22 20:25 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org, sage@newdream.net

Hi Gregory,


On 22.02.2012 18:12, Gregory Farnum wrote:
> Hmm. These messages indicate there are requests that came in that
> never got answered -- or else that the tracking code isn't quite right
> (it's new functionality). What version are you running?
We use:

root@fcmsnode0:~# ceph -v
ceph version 0.42-62-gd6de0bb
(commit:d6de0bb83bcac238b3a6a376915e06fb7129b2c8)

Kernel is 3.2.1

I accidentally updated one of our OSDs to 0.42, so we updated the whole
cluster.

The OSD crashed repeatedly when we issued the "repair" command. The
inconsistent PGs are all on the same (newly added) node.

> And that's a problem with the xattrs. What filesystem are you using
> underneath Ceph?
XFS. We tried btrfs some weeks ago, but we had some trouble with it under
heavy load.

The messages are repeated every 2 or 3 seconds.

* Re: Problems after crash yesterday
  2012-02-22 20:25     ` Jens Rehpöhler
@ 2012-02-24  5:14       ` Gregory Farnum
  2012-02-27 23:32         ` Gregory Farnum
  0 siblings, 1 reply; 6+ messages in thread
From: Gregory Farnum @ 2012-02-24  5:14 UTC (permalink / raw)
  To: Jens Rehpöhler; +Cc: ceph-devel@vger.kernel.org, sage@newdream.net

On Wed, Feb 22, 2012 at 12:25 PM, Jens Rehpöhler
<jens.rehpoehler@filoo.de> wrote:
> We use:
>
> root@fcmsnode0:~# ceph -v
> ceph version 0.42-62-gd6de0bb
> (commit:d6de0bb83bcac238b3a6a376915e06fb7129b2c8)
>
> Kernel is 3.2.1
>
> I accidentally updated one of our OSDs to 0.42, so we updated the whole
> cluster.
>
> The OSD crashed repeatedly when we issued the "repair" command. The
> inconsistent PGs are all on the same (newly added) node.

Oh, that's interesting. Are all the other nodes in the cluster up and in?

In the next version or two we will have a lot more capability to look
into what's happening with stuck PGs like this, but for the moment we
need a log. If all the other nodes in the system are up, can you
restart this new OSD with "debug osd = 20" and "debug ms = 1" added to
its config?
-Greg
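
The two debug settings requested above could be added to an OSD's config section programmatically. A minimal sketch in Python, assuming the configuration is an INI-style file that the standard configparser module can read. The section name osd.99 is a hypothetical placeholder (the thread does not name the new OSD), and the sample input is illustrative. configparser drops comments when it rewrites a file, so this is an illustration rather than something to run against a production ceph.conf:

```python
import configparser
import io


def add_debug_options(conf_text: str, section: str) -> str:
    """Return conf_text with "debug osd = 20" and "debug ms = 1" set in section."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section(section):
        parser.add_section(section)
    # The two settings quoted in the message above.
    parser.set(section, "debug osd", "20")
    parser.set(section, "debug ms", "1")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical input: a tiny config with one existing section.
sample = "[osd]\nosd journal size = 1000\n"
# "osd.99" is a placeholder id, not the OSD from this thread.
result = add_debug_options(sample, "osd.99")
print(result)
```

Scoping the settings to a single OSD section keeps the verbose logging limited to the daemon under investigation; the OSD still has to be restarted afterwards, as the message says.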

* Re: Problems after crash yesterday
  2012-02-24  5:14       ` Gregory Farnum
@ 2012-02-27 23:32         ` Gregory Farnum
  0 siblings, 0 replies; 6+ messages in thread
From: Gregory Farnum @ 2012-02-27 23:32 UTC (permalink / raw)
  To: Jens Rehpöhler; +Cc: ceph-devel@vger.kernel.org

On Thu, Feb 23, 2012 at 9:14 PM, Gregory Farnum
<gregory.farnum@dreamhost.com> wrote:
> Oh, that's interesting. Are all the other nodes in the cluster up and in?
>
> In the next version or two we will have a lot more capability to look
> into what's happening with stuck PGs like this, but for the moment we
> need a log. If all the other nodes in the system are up, can you
> restart this new OSD with "debug osd = 20" and "debug ms = 1" added to
> its config?
> -Greg

Actually, I suspect this might be related to that bug you reported
with the messenger. If you like you can just cherry-pick
244b70296622906f01cfa3d48c931aa08e663a75 (currently HEAD on the next
branch) onto your current install and see if that fixes things...
-Greg