OSD not coming up after being set down

* OSD not coming up after being set down
From: Willem Jan Withagen @ 2016-03-02 16:03 UTC
  To: Ceph Development

Hi,

Any handholding is welcomed!!

In test/cephtool-mon-test.sh, part of the executed code is:
  ceph osd down 0
  ceph osd dump | grep 'osd.0 down'
  ceph osd unset noup
  for ((i=0; i < 120; i++)); do
    if ! ceph osd dump | grep 'osd.0 up'; then
      echo "waiting for osd.0 to come back up"
      sleep 1
    else
      break
    fi
  done
  ceph osd dump | grep 'osd.0 up'
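
The polling loop above could be wrapped in a reusable helper. The sketch below is an illustration, not part of the Ceph test suite: `wait_for_osd_state` and the `DUMP_CMD` override are hypothetical names, and `DUMP_CMD` exists only so the helper can be exercised without a running cluster. It avoids bash-only syntax, since bash is not part of the FreeBSD base system.

```shell
# Hypothetical helper, not part of cephtool-mon-test.sh.
# Polls the OSD map until osd.<id> shows <state> ("up" or "down"),
# or gives up after <timeout> one-second attempts.
# DUMP_CMD is an assumed override for testing; it defaults to the real CLI.
wait_for_osd_state() {
  local id="$1" state="$2" timeout="${3:-120}" i=0 dump
  while [ "$i" -lt "$timeout" ]; do
    dump=$(${DUMP_CMD:-ceph osd dump} 2>/dev/null) || dump=""
    # Anchor the match and escape the dot so that only the
    # status line of osd.<id> can match.
    if printf '%s\n' "$dump" | grep "^osd\.${id} ${state} " >/dev/null; then
      return 0
    fi
    echo "waiting for osd.${id} to be ${state}" >&2
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

With this in place, `wait_for_osd_state 0 up 120` would stand in for the loop plus the final grep, returning non-zero if osd.0 never comes back.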

But the OSD refused to come back up.
Below is the output of the dump.

How would I start analyzing this issue?
What kind of things would I expect to see in the logfile?
  What if the OSD does come up?
  What if the OSD stays down?

Thanx,
--WjW


*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
epoch 170
fsid 8b5c0b4b-e08c-11e5-8cd4-1c6f6582ec12
created 2016-03-02 16:36:35.001700
modified 2016-03-02 16:45:17.802073
flags sortbitwise
pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 down out weight 0 up_from 4 up_thru 163 down_at 166 last_clean_interval [0,0) 127.0.0.1:6804/2455 127.0.0.1:6805/2455 127.0.0.1:6806/2455 127.0.0.1:6807/2455 autoout,exists 8bc29c74-e08c-11e5-8cd4-1c6f6582ec12
osd.1 up   in  weight 1 up_from 8 up_thru 166 down_at 0 last_clean_interval [0,0) 127.0.0.1:6808/2475 127.0.0.1:6811/2475 127.0.0.1:6813/2475 127.0.0.1:6816/2475 exists,up 8d7a2cb5-e08c-11e5-8cd4-1c6f6582ec12
osd.2 up   in  weight 1 up_from 13 up_thru 166 down_at 0 last_clean_interval [0,0) 127.0.0.1:6817/2495 127.0.0.1:6818/2495 127.0.0.1:6819/2495 127.0.0.1:6820/2495 exists,up 8f46df05-e08c-11e5-8cd4-1c6f6582ec12
pg_temp 0.0 [0,2,1]
pg_temp 0.1 [2,0,1]
pg_temp 0.2 [0,1,2]
pg_temp 0.3 [2,0,1]
pg_temp 0.4 [0,2,1]
pg_temp 0.5 [0,2,1]
pg_temp 0.6 [0,1,2]
pg_temp 0.7 [1,0,2]
2016-03-02 16:56:11.027977 8021d7800  0 lockdep stop
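
To pull just the per-OSD state out of a dump like the one above, a small awk filter is enough. This is a sketch: `osd_states` is a hypothetical helper name, and it assumes each OSD entry starts with `osd.N` followed by the up/down and in/out words, as in the output shown.

```shell
# Hypothetical helper: read `ceph osd dump` output on stdin and print
# "<osd> <up|down> <in|out>" for every OSD entry.
osd_states() {
  awk '/^osd\.[0-9]+ / { print $1, $2, $3 }'
}
```

Used as `ceph osd dump | osd_states`, this would reduce the dump above to `osd.0 down out`, `osd.1 up in` and `osd.2 up in`, which makes it easier to see that osd.0 is not only down but has also been marked out.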

* Re: OSD not coming up after being set down
From: M Ranga Swami Reddy @ 2016-03-02 17:01 UTC
  To: Willem Jan Withagen; +Cc: Ceph Development

Please see the below:
---
If something is causing OSDs to ‘flap’ (repeatedly getting marked
down and then up again), you can force the monitors to stop the
flapping with:

ceph osd set noup      # prevent OSDs from getting marked up
ceph osd set nodown    # prevent OSDs from getting marked down
----
ref: http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-osd/
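
Whether `noup` or `nodown` is actually in effect can be checked from the `flags` line of `ceph osd dump`. The sketch below assumes that line holds a comma-separated list of flags, as in `flags sortbitwise`; `has_osdmap_flag` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if <flag> appears on the cluster-wide
# "flags" line of `ceph osd dump` output read from stdin.
# Per-pool flags (e.g. "flags hashpspool") are ignored because those
# lines do not start with the word "flags".
has_osdmap_flag() {
  awk -v f="$1" '
    $1 == "flags" {
      n = split($2, a, ",")
      for (i = 1; i <= n; i++)
        if (a[i] == f) found = 1
    }
    END { exit !found }'
}
```

For example, `ceph osd dump | has_osdmap_flag noup && echo "noup is set"`. In the dump posted at the start of the thread the line reads `flags sortbitwise`, so neither flag was set at that point.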


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: OSD not coming up after being set down
From: Willem Jan Withagen @ 2016-03-02 19:56 UTC
  To: M Ranga Swami Reddy; +Cc: Ceph Development

On 2-3-2016 18:01, M Ranga Swami Reddy wrote:
> Please see the below:
> ---
> If something is causing OSDs to ‘flap’ (repeatedly getting marked
> down and then up again), you can force the monitors to stop the
> flapping with:
> 
> ceph osd set noup      # prevent OSDs from getting marked up
> ceph osd set nodown    # prevent OSDs from getting marked down
> ----
> ref: http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-osd/

I don't think this is the issue.

The test code should run as is. It runs OK on Linux, but FreeBSD is
giving trouble.
The OSD should come back up, but does not. Possible causes:
- The OSD is not receiving the UP
- The OSD is not able to go UP
- Or the monitors are not picking it up?

--WjW


* Re: OSD not coming up after being set down
From: Samuel Just @ 2016-03-02 20:11 UTC
  To: Willem Jan Withagen; +Cc: M Ranga Swami Reddy, Ceph Development

At this point, you will want to run the script and then dig through
the logs until you find something that doesn't match.
- Was osd.0 up to begin with?
- Is its process running?
- Did it get the map marking it down?
- Did it send a boot message back to the mon requesting that it be
marked back up?
- Did the mon get that message?
- Did the mon create a new map marking it up?
Etc
-Sam
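
Some of these checks can be answered from `ceph osd dump` alone. As a sketch (the helper name `osd_field` is hypothetical), the function below extracts a numeric field such as `up_from` or `down_at` for one OSD. It assumes these fields are the epochs at which the OSD was last marked up and last marked down, so a `down_at` greater than `up_from` would mean no newer map has marked the OSD up again.

```shell
# Hypothetical helper: print the value that follows <field> on the
# osd.<id> line of `ceph osd dump` output read from stdin,
# e.g. up_from, up_thru or down_at.
osd_field() {
  awk -v osd="osd.$1" -v key="$2" '
    $1 == osd {
      for (i = 2; i < NF; i++)
        if ($i == key) print $(i + 1)
    }'
}
```

For the dump at the start of the thread, `osd_field 0 up_from` gives 4 and `osd_field 0 down_at` gives 166, which is consistent with osd.0 still being down at epoch 170.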


* Re: OSD not coming up after being set down
From: Willem Jan Withagen @ 2016-03-02 20:21 UTC
  To: Samuel Just; +Cc: M Ranga Swami Reddy, Ceph Development

On 2-3-2016 21:11, Samuel Just wrote:
> At this point, you will want to run the script and then dig through
> the logs until you find something that doesn't match.
> - Was osd.0 up to begin with?
> - Is its process running?
> - Did it get the map marking it down?
> - Did it send a boot message back to the mon requesting that it be
> marked back up?
> - Did the mon get that message?
> - Did the mon create a new map marking it up?

Right, this is the sort of handholding I was looking for.

The first 2 items are true.
Who sends "the map marking it down"?
	ceph osd down 0 => Mon => Osd
Or does that go directly ceph => Osd?

Are there any state machine pictures of this in the manuals?

--WjW


* Re: OSD not coming up after being set down
From: Samuel Just @ 2016-03-02 23:02 UTC

Maps are created by the mons (that's pretty much what they're for).
The entire Paxos thing is there to make sure that two maps with the
same epoch number are identical and that we produce them in order of
increasing epoch number.  The ceph command therefore causes the mon
cluster to publish a new map at the next epoch number.  That map is
then sent out to some number of osds, which then gossip it out to the
rest.  As osds find out that osd.0 is down, they should start
responding to osd.0's pings with a "you died" message and otherwise
ignoring it.  osd.0 will then contact the mon for a more up-to-date map
(which must be at least as recent as the one which marked it down).
osd.0 will then get that map, find out that it died, kill and reopen
its network connections (so that it's a new instance), and send a boot
to the mons requesting that it be marked back up.
-Sam
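
The sequence described above can be written down as an ordered list to tick off while reading the mon and OSD logs. The sketch below only restates the steps from this message, plus the final "mon marks it up" step from the earlier checklist. The step names and the `next_step` helper are made up for illustration and do not correspond to Ceph source identifiers or log strings.

```shell
# Illustrative only: step names are invented labels for the sequence above.
# Each comment describes the step named on the left of its line.
next_step() {
  case "$1" in
    osd_down_requested)   echo mon_publishes_map ;;     # "ceph osd down 0" reaches the mons
    mon_publishes_map)    echo osds_gossip_map ;;       # new epoch marks osd.0 down
    osds_gossip_map)      echo peers_reply_you_died ;;  # map spreads between osds
    peers_reply_you_died) echo osd_fetches_map ;;       # peers answer osd.0 pings with "you died"
    osd_fetches_map)      echo osd_reopens_network ;;   # osd.0 asks the mon for a newer map
    osd_reopens_network)  echo osd_sends_boot ;;        # osd.0 becomes a new instance
    osd_sends_boot)       echo mon_marks_up ;;          # osd.0 asks to be marked back up
    *)                    return 1 ;;                   # last or unknown step: nothing follows
  esac
}

# Print the whole chain, one step per line.
step=osd_down_requested
while :; do
  echo "$step"
  step=$(next_step "$step") || break
done
```

The first step that cannot be confirmed in the logs is where the FreeBSD run diverges from the Linux one.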

