OSD state flipping when cluster-network in high utilization

All of lore.kernel.org
 help / color / mirror / Atom feed

* OSD state flipping when cluster-network in high utilization
@ 2013-05-14  8:56 Chen, Xiaoxi
  2013-05-14 15:30 ` [ceph-users] " Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Chen, Xiaoxi @ 2013-05-14  8:56 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
  Cc: ceph-users-Qp0mS5GaXlQ@public.gmane.org


[-- Attachment #1.1: Type: text/plain, Size: 649 bytes --]

Hi
  We are suffering our OSD flipping between up and down ( OSD X be voted to down due to 3 missing ping, and after a while it tells the monitor "map xxx wrongly mark me down" ). Because we are running sequential write performance test on top of RBDs, and the cluster network nics is really in high utilization (8Gb/s+ for a 10Gb network).
         Is this a expected behavior ? or how can I prevent this happen?
                                                                                                                                                                                                                                Xiaoxi

[-- Attachment #1.2: Type: text/html, Size: 3956 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [ceph-users] OSD state flipping when cluster-network in high utilization
  2013-05-14  8:56 OSD state flipping when cluster-network in high utilization Chen, Xiaoxi
@ 2013-05-14 15:30 ` Sage Weil
  2013-05-14 15:35   ` Mark Nelson
  2013-05-15  0:52   ` Wales Wang
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2013-05-14 15:30 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1006 bytes --]

On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> 
> Hi
> 
>   We are suffering our OSD flipping between up and down ( OSD X be voted to
> down due to 3 missing ping, and after a while it tells the monitor ?map xxx
> wrongly mark me down? ). Because we are running sequential write performance
> test on top of RBDs, and the cluster network nics is really in high
> utilization (8Gb/s+ for a 10Gb network).
> 
>          Is this a expected behavior ? or how can I prevent this happen?

You an increase the heartbeat grace period.  The pings are handled by a 
separate thread on the backside interface (if there is one).  If you are 
missing pings then the network or scheduler is preventing those (small) 
messages from being processed (there is almost no lock contention in that 
path).  Which means it really is taking ~20 seconds or wahtever to handle 
those messages.  It's really a questin of how unresponsive you want to 
permit the OSDs to be before you consider it a failure..

sage

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [ceph-users] OSD state flipping when cluster-network in high utilization
  2013-05-14 15:30 ` [ceph-users] " Sage Weil
@ 2013-05-14 15:35   ` Mark Nelson
       [not found]     ` <519259C5.3000109-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
  2013-05-15  0:52   ` Wales Wang
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2013-05-14 15:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel@vger.kernel.org, ceph-users@ceph.com

On 05/14/2013 10:30 AM, Sage Weil wrote:
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
>>
>> Hi
>>
>>    We are suffering our OSD flipping between up and down ( OSD X be voted to
>> down due to 3 missing ping, and after a while it tells the monitor ?map xxx
>> wrongly mark me down? ). Because we are running sequential write performance
>> test on top of RBDs, and the cluster network nics is really in high
>> utilization (8Gb/s+ for a 10Gb network).
>>
>>           Is this a expected behavior ? or how can I prevent this happen?
>
> You an increase the heartbeat grace period.  The pings are handled by a
> separate thread on the backside interface (if there is one).  If you are
> missing pings then the network or scheduler is preventing those (small)
> messages from being processed (there is almost no lock contention in that
> path).  Which means it really is taking ~20 seconds or wahtever to handle
> those messages.  It's really a questin of how unresponsive you want to
> permit the OSDs to be before you consider it a failure..
>
> sage
>
>

It might be worth testing out how long pings or other network traffic 
are taking during these tests.  There may be some tcp tunning you can do 
here, or even consider using a separate network for the mons.

Mark

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OSD state flipping when cluster-network in high utilization
       [not found]     ` <519259C5.3000109-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
@ 2013-05-14 23:13       ` Chen, Xiaoxi
       [not found]         ` <6AC1548D-DE2C-4E71-AC73-3903B57C76C6-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Chen, Xiaoxi @ 2013-05-14 23:13 UTC (permalink / raw)
  To: Mark Nelson
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-Qp0mS5GaXlQ@public.gmane.org

I like the idea to leave ping in cluster network because it can help us detect switch／nic failure.

What confuse me is I keep pinging every ceph node's cluster ip，it is OK during the whole run with less than 1 ms latency，why the heartbeat still suffer？ TOP show my cpu not 100% utilized（with ＞30% io wait）.Enabling jumbo frame **seems** make things worth.(just feeling.no data supports)

发自我的 iPhone

在 2013-5-14，23:36，"Mark Nelson" <mark.nelson@inktank.com> 写道：

> On 05/14/2013 10:30 AM, Sage Weil wrote:
>> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
>>> 
>>> Hi
>>> 
>>>   We are suffering our OSD flipping between up and down ( OSD X be voted to
>>> down due to 3 missing ping, and after a while it tells the monitor ?map xxx
>>> wrongly mark me down? ). Because we are running sequential write performance
>>> test on top of RBDs, and the cluster network nics is really in high
>>> utilization (8Gb/s+ for a 10Gb network).
>>> 
>>>          Is this a expected behavior ? or how can I prevent this happen?
>> 
>> You an increase the heartbeat grace period.  The pings are handled by a
>> separate thread on the backside interface (if there is one).  If you are
>> missing pings then the network or scheduler is preventing those (small)
>> messages from being processed (there is almost no lock contention in that
>> path).  Which means it really is taking ~20 seconds or wahtever to handle
>> those messages.  It's really a questin of how unresponsive you want to
>> permit the OSDs to be before you consider it a failure..
>> 
>> sage
>> 
>> 
> 
> It might be worth testing out how long pings or other network traffic are taking during these tests.  There may be some tcp tunning you can do here, or even consider using a separate network for the mons.
> 
> Mark
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OSD state flipping when cluster-network in high utilization
       [not found]         ` <6AC1548D-DE2C-4E71-AC73-3903B57C76C6-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2013-05-14 23:22           ` Sage Weil
       [not found]             ` <alpine.DEB.2.00.1305141620410.12954-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2013-05-14 23:22 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-Qp0mS5GaXlQ@public.gmane.org

On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> I like the idea to leave ping in cluster network because it can help us 
> detect switch?nic failure.
> 
> What confuse me is I keep pinging every ceph node's cluster ip?it is OK 
> during the whole run with less than 1 ms latency?why the heartbeat still 
> suffer? TOP show my cpu not 100% utilized?with ?30% io wait?.Enabling 
> jumbo frame **seems** make things worth.(just feeling.no data supports)

I say "ping" in teh general sense.. it's not using ICMP, but sending 
small messages over a TCP session, doing some minimal processing on the 
other end, and sending them back.  If the machine is heavily loaded and 
that thread doesn't get scheduled or somehow blocks, it may be 
problematic.

How responsive generally is the machine under load?  Is there available 
CPU?

We can try to enable debugging to see what is going on.. 'debug ms = 1' 
and 'debug osd = 20' is everything we would need, but will incur 
additoinal load itself and may spoil the experiment...

sage

> 
> ???? iPhone
> 
> ? 2013-5-14?23:36?"Mark Nelson" <mark.nelson-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> ???
> 
> > On 05/14/2013 10:30 AM, Sage Weil wrote:
> >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> >>> 
> >>> Hi
> >>> 
> >>>   We are suffering our OSD flipping between up and down ( OSD X be voted to
> >>> down due to 3 missing ping, and after a while it tells the monitor ?map xxx
> >>> wrongly mark me down? ). Because we are running sequential write performance
> >>> test on top of RBDs, and the cluster network nics is really in high
> >>> utilization (8Gb/s+ for a 10Gb network).
> >>> 
> >>>          Is this a expected behavior ? or how can I prevent this happen?
> >> 
> >> You an increase the heartbeat grace period.  The pings are handled by a
> >> separate thread on the backside interface (if there is one).  If you are
> >> missing pings then the network or scheduler is preventing those (small)
> >> messages from being processed (there is almost no lock contention in that
> >> path).  Which means it really is taking ~20 seconds or wahtever to handle
> >> those messages.  It's really a questin of how unresponsive you want to
> >> permit the OSDs to be before you consider it a failure..
> >> 
> >> sage
> >> 
> >> 
> > 
> > It might be worth testing out how long pings or other network traffic are taking during these tests.  There may be some tcp tunning you can do here, or even consider using a separate network for the mons.
> > 
> > Mark
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [ceph-users] OSD state flipping when cluster-network in high utilization
  2013-05-14 15:30 ` [ceph-users] " Sage Weil
  2013-05-14 15:35   ` Mark Nelson
@ 2013-05-15  0:52   ` Wales Wang
  1 sibling, 0 replies; 13+ messages in thread
From: Wales Wang @ 2013-05-15  0:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel@vger.kernel.org, ceph-users@ceph.com

My advice is leverage Linux cgroup to isolate some resource for critical part of Ceph.
Such as state of heartbeat, replication of OSDs.
Even the timeout set to many minutes, there always are bigger heavy loading on Ceph.

Wales Wang

在 2013-5-14，下午11:30，Sage Weil <sage@inktank.com> 写道：

> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
>> 
>> Hi
>> 
>>   We are suffering our OSD flipping between up and down ( OSD X be voted to
>> down due to 3 missing ping, and after a while it tells the monitor ?map xxx
>> wrongly mark me down? ). Because we are running sequential write performance
>> test on top of RBDs, and the cluster network nics is really in high
>> utilization (8Gb/s+ for a 10Gb network).
>> 
>>          Is this a expected behavior ? or how can I prevent this happen?
> 
> You an increase the heartbeat grace period.  The pings are handled by a 
> separate thread on the backside interface (if there is one).  If you are 
> missing pings then the network or scheduler is preventing those (small) 
> messages from being processed (there is almost no lock contention in that 
> path).  Which means it really is taking ~20 seconds or wahtever to handle 
> those messages.  It's really a questin of how unresponsive you want to 
> permit the OSDs to be before you consider it a failure..
> 
> sage
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OSD state flipping when cluster-network in high utilization
       [not found]             ` <alpine.DEB.2.00.1305141620410.12954-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2013-05-15  3:23               ` Chen, Xiaoxi
       [not found]                 ` <6F3FA899187F0043BA1827A69DA2F7CC0145F746-0J0gbvR4kTiiAffOGbnezLfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Chen, Xiaoxi @ 2013-05-15  3:23 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-Qp0mS5GaXlQ@public.gmane.org

>How responsive generally is the machine under load?  Is there available CPU?
	The machine works well, and the issued OSDs are likely the same, seems because they have relative slower disk( disk type are the same but the latency is a bit higher ,8ms -> 10ms).
	
	Top show no idle % but still have 30+% of io_wait,  my colleague educate me that io_wait can be treated as free.

Another information is offload the heartbeat to 1Gb nic doesn't solve the problem, what's more, when we doing random write test, we can still see this flipping happen. So I would like to say it may related with CPU scheduler ? The heartbeat thread (in busy OSD ) failed to get enough cpu cycle.


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: 2013年5月15日 7:23
To: Chen, Xiaoxi
Cc: Mark Nelson; ceph-devel@vger.kernel.org; ceph-users@ceph.com
Subject: Re: [ceph-users] OSD state flipping when cluster-network in high utilization

On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> I like the idea to leave ping in cluster network because it can help 
> us detect switch?nic failure.
> 
> What confuse me is I keep pinging every ceph node's cluster ip?it is 
> OK during the whole run with less than 1 ms latency?why the heartbeat 
> still suffer? TOP show my cpu not 100% utilized?with ?30% io 
> wait?.Enabling jumbo frame **seems** make things worth.(just 
> feeling.no data supports)

I say "ping" in teh general sense.. it's not using ICMP, but sending small messages over a TCP session, doing some minimal processing on the other end, and sending them back.  If the machine is heavily loaded and that thread doesn't get scheduled or somehow blocks, it may be problematic.

How responsive generally is the machine under load?  Is there available CPU?

We can try to enable debugging to see what is going on.. 'debug ms = 1' 
and 'debug osd = 20' is everything we would need, but will incur additoinal load itself and may spoil the experiment...

sage

> 
> ???? iPhone
> 
> ? 2013-5-14?23:36?"Mark Nelson" <mark.nelson@inktank.com> ???
> 
> > On 05/14/2013 10:30 AM, Sage Weil wrote:
> >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> >>> 
> >>> Hi
> >>> 
> >>>   We are suffering our OSD flipping between up and down ( OSD X be 
> >>> voted to down due to 3 missing ping, and after a while it tells 
> >>> the monitor ?map xxx wrongly mark me down? ). Because we are 
> >>> running sequential write performance test on top of RBDs, and the 
> >>> cluster network nics is really in high utilization (8Gb/s+ for a 10Gb network).
> >>> 
> >>>          Is this a expected behavior ? or how can I prevent this happen?
> >> 
> >> You an increase the heartbeat grace period.  The pings are handled 
> >> by a separate thread on the backside interface (if there is one).  
> >> If you are missing pings then the network or scheduler is 
> >> preventing those (small) messages from being processed (there is 
> >> almost no lock contention in that path).  Which means it really is 
> >> taking ~20 seconds or wahtever to handle those messages.  It's 
> >> really a questin of how unresponsive you want to permit the OSDs to be before you consider it a failure..
> >> 
> >> sage
> >> 
> >> 
> > 
> > It might be worth testing out how long pings or other network traffic are taking during these tests.  There may be some tcp tunning you can do here, or even consider using a separate network for the mons.
> > 
> > Mark
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OSD state flipping when cluster-network in high utilization
       [not found]                 ` <6F3FA899187F0043BA1827A69DA2F7CC0145F746-0J0gbvR4kTiiAffOGbnezLfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-05-15  3:40                   ` Sage Weil
       [not found]                     ` <alpine.DEB.2.00.1305142039520.23131-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  2013-05-15 15:05                   ` Jim Schutt
  1 sibling, 1 reply; 13+ messages in thread
From: Sage Weil @ 2013-05-15  3:40 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-Qp0mS5GaXlQ@public.gmane.org

On Wed, 15 May 2013, Chen, Xiaoxi wrote:

> >How responsive generally is the machine under load?  Is there available CPU?
> 	The machine works well, and the issued OSDs are likely the same, seems because they have relative slower disk( disk type are the same but the latency is a bit higher ,8ms -> 10ms).
> 	
> 	Top show no idle % but still have 30+% of io_wait,  my colleague educate me that io_wait can be treated as free.
> 
> Another information is offload the heartbeat to 1Gb nic doesn't solve 
> the problem, what's more, when we doing random write test, we can still 
> see this flipping happen. So I would like to say it may related with CPU 
> scheduler ? The heartbeat thread (in busy OSD ) failed to get enough cpu 
> cycle.

Can you try reproducing with logs?  If we can reproduce that will give us 
some clues.

sage

> 
> -----Original Message-----
> From: ceph-devel-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:ceph-devel-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Sage Weil
> Sent: 2013?5?15? 7:23
> To: Chen, Xiaoxi
> Cc: Mark Nelson; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; ceph-users-Qp0mS5GaXlQ@public.gmane.org
> Subject: Re: [ceph-users] OSD state flipping when cluster-network in high utilization
> 
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > I like the idea to leave ping in cluster network because it can help 
> > us detect switch?nic failure.
> > 
> > What confuse me is I keep pinging every ceph node's cluster ip?it is 
> > OK during the whole run with less than 1 ms latency?why the heartbeat 
> > still suffer? TOP show my cpu not 100% utilized?with ?30% io 
> > wait?.Enabling jumbo frame **seems** make things worth.(just 
> > feeling.no data supports)
> 
> I say "ping" in teh general sense.. it's not using ICMP, but sending small messages over a TCP session, doing some minimal processing on the other end, and sending them back.  If the machine is heavily loaded and that thread doesn't get scheduled or somehow blocks, it may be problematic.
> 
> How responsive generally is the machine under load?  Is there available CPU?
> 
> We can try to enable debugging to see what is going on.. 'debug ms = 1' 
> and 'debug osd = 20' is everything we would need, but will incur additoinal load itself and may spoil the experiment...
> 
> sage
> 
> > 
> > ???? iPhone
> > 
> > ? 2013-5-14?23:36?"Mark Nelson" <mark.nelson-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> ???
> > 
> > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > >>> 
> > >>> Hi
> > >>> 
> > >>>   We are suffering our OSD flipping between up and down ( OSD X be 
> > >>> voted to down due to 3 missing ping, and after a while it tells 
> > >>> the monitor ?map xxx wrongly mark me down? ). Because we are 
> > >>> running sequential write performance test on top of RBDs, and the 
> > >>> cluster network nics is really in high utilization (8Gb/s+ for a 10Gb network).
> > >>> 
> > >>>          Is this a expected behavior ? or how can I prevent this happen?
> > >> 
> > >> You an increase the heartbeat grace period.  The pings are handled 
> > >> by a separate thread on the backside interface (if there is one).  
> > >> If you are missing pings then the network or scheduler is 
> > >> preventing those (small) messages from being processed (there is 
> > >> almost no lock contention in that path).  Which means it really is 
> > >> taking ~20 seconds or wahtever to handle those messages.  It's 
> > >> really a questin of how unresponsive you want to permit the OSDs to be before you consider it a failure..
> > >> 
> > >> sage
> > >> 
> > >> 
> > > 
> > > It might be worth testing out how long pings or other network traffic are taking during these tests.  There may be some tcp tunning you can do here, or even consider using a separate network for the mons.
> > > 
> > > Mark
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> N????y????b?????v?????{.n??????z??ay????????j\a???f????????????????:+v????????\a??zZ+??????"?!?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OSD state flipping when cluster-network in high utilization
       [not found]                     ` <alpine.DEB.2.00.1305142039520.23131-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2013-05-15 12:53                       ` Chen, Xiaoxi
       [not found]                         ` <alpine.DEB.2.00.1305150851190.14301@cobra.newdream.net>
  0 siblings, 1 reply; 13+ messages in thread
From: Chen, Xiaoxi @ 2013-05-15 12:53 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-Qp0mS5GaXlQ@public.gmane.org

Hi Sage,
	I can reproduce it with log ,but the log is quite big (30MB+) , where should I put the log?
	From the log and reading the code,seems that with uncertain reason, the OSD_OP_THREAD will be blocked there for more than 15s, which results a time-out. Then in " handle_osd_ping " function, it dropped the ping request because of "is healthy"= false.
	I have a full log with debug osd 20, but still have no idea why op thread will be blocked for that long, part of the log is pasted below:

2013-05-15 15:29:06.870834 7f0253340700 10 osd.319 4113 dequeue_op 0xf4821e0 prio 63 cost 4096 latency 0.000173 osd_op(client.4703.0
:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 pg pg[3.d7( v 4103'5330 (3853'4329,4103'5330] lo
cal-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean]
2013-05-15 15:29:06.870862 7f0253340700 20 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] op_has_sufficient_caps pool=3 (xiaoxi_rbd)
owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes
2013-05-15 15:29:06.870871 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op osd_op(client.4703.0:2114483 rbd_data
.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 may_write
2013-05-15 15:29:22.117110 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.127345 7f0269b6d700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.276280 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.356731 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.359808 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.513247 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_snapset_context rbd_data.11ff6b8b4567.0
000000000000339 0 -> 1
2013-05-15 15:29:22.513270 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] populate_obc_watchers 3e4c40d7/rbd_data.11f
f6b8b4567.0000000000000339/head//3
2013-05-15 15:29:22.513276 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_object_context 0xd896840 3e4c40d7/rbd_d
ata.11ff6b8b4567.0000000000000339/head//3 0 -> 1 read 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3(4103'5267 client.4703.
0:2109755 wrlock_by=unknown.0.0:0)
2013-05-15 15:29:22.513284 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] find_object_context 3e4c40d7/rbd_data.11ff6
b8b4567.0000000000000339/head//3 @head
2013-05-15 15:29:22.513290 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode is idle(wr=0)
2013-05-15 15:29:22.513295 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0)

-----Original Message-----
From: Sage Weil [mailto:sage@inktank.com] 
Sent: 2013年5月15日 11:40
To: Chen, Xiaoxi
Cc: Mark Nelson; ceph-devel@vger.kernel.org; ceph-users@ceph.com
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high utilization

On Wed, 15 May 2013, Chen, Xiaoxi wrote:

> >How responsive generally is the machine under load?  Is there available CPU?
> 	The machine works well, and the issued OSDs are likely the same, seems because they have relative slower disk( disk type are the same but the latency is a bit higher ,8ms -> 10ms).
> 	
> 	Top show no idle % but still have 30+% of io_wait,  my colleague educate me that io_wait can be treated as free.
> 
> Another information is offload the heartbeat to 1Gb nic doesn't solve 
> the problem, what's more, when we doing random write test, we can 
> still see this flipping happen. So I would like to say it may related 
> with CPU scheduler ? The heartbeat thread (in busy OSD ) failed to get 
> enough cpu cycle.

Can you try reproducing with logs?  If we can reproduce that will give us some clues.

sage

> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: 2013?5?15? 7:23
> To: Chen, Xiaoxi
> Cc: Mark Nelson; ceph-devel@vger.kernel.org; ceph-users@ceph.com
> Subject: Re: [ceph-users] OSD state flipping when cluster-network in 
> high utilization
> 
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > I like the idea to leave ping in cluster network because it can help 
> > us detect switch?nic failure.
> > 
> > What confuse me is I keep pinging every ceph node's cluster ip?it is 
> > OK during the whole run with less than 1 ms latency?why the 
> > heartbeat still suffer? TOP show my cpu not 100% utilized?with ?30% 
> > io wait?.Enabling jumbo frame **seems** make things worth.(just 
> > feeling.no data supports)
> 
> I say "ping" in teh general sense.. it's not using ICMP, but sending small messages over a TCP session, doing some minimal processing on the other end, and sending them back.  If the machine is heavily loaded and that thread doesn't get scheduled or somehow blocks, it may be problematic.
> 
> How responsive generally is the machine under load?  Is there available CPU?
> 
> We can try to enable debugging to see what is going on.. 'debug ms = 1' 
> and 'debug osd = 20' is everything we would need, but will incur additoinal load itself and may spoil the experiment...
> 
> sage
> 
> > 
> > ???? iPhone
> > 
> > ? 2013-5-14?23:36?"Mark Nelson" <mark.nelson@inktank.com> ???
> > 
> > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > >>> 
> > >>> Hi
> > >>> 
> > >>>   We are suffering our OSD flipping between up and down ( OSD X 
> > >>> be voted to down due to 3 missing ping, and after a while it 
> > >>> tells the monitor ?map xxx wrongly mark me down? ). Because we 
> > >>> are running sequential write performance test on top of RBDs, 
> > >>> and the cluster network nics is really in high utilization (8Gb/s+ for a 10Gb network).
> > >>> 
> > >>>          Is this a expected behavior ? or how can I prevent this happen?
> > >> 
> > >> You an increase the heartbeat grace period.  The pings are 
> > >> handled by a separate thread on the backside interface (if there is one).
> > >> If you are missing pings then the network or scheduler is 
> > >> preventing those (small) messages from being processed (there is 
> > >> almost no lock contention in that path).  Which means it really 
> > >> is taking ~20 seconds or wahtever to handle those messages.  It's 
> > >> really a questin of how unresponsive you want to permit the OSDs to be before you consider it a failure..
> > >> 
> > >> sage
> > >> 
> > >> 
> > > 
> > > It might be worth testing out how long pings or other network traffic are taking during these tests.  There may be some tcp tunning you can do here, or even consider using a separate network for the mons.
> > > 
> > > Mark
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> N????y????b?????v?????{.n??????z??ay????????j ???f????????????????:+v???????? ??zZ+??????"?!?
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OSD state flipping when cluster-network in high utilization
       [not found]                 ` <6F3FA899187F0043BA1827A69DA2F7CC0145F746-0J0gbvR4kTiiAffOGbnezLfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-05-15  3:40                   ` Sage Weil
@ 2013-05-15 15:05                   ` Jim Schutt
  2013-05-15 15:12                     ` [ceph-users] " Chen, Xiaoxi
  1 sibling, 1 reply; 13+ messages in thread
From: Jim Schutt @ 2013-05-15 15:05 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ceph-users-Qp0mS5GaXlQ@public.gmane.org

On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote:
>> How responsive generally is the machine under load?  Is there available CPU?
> 	The machine works well, and the issued OSDs are likely the same, seems because they have relative slower disk( disk type are the same but the latency is a bit higher ,8ms -> 10ms).
> 	
> 	Top show no idle % but still have 30+% of io_wait,  my colleague educate me that io_wait can be treated as free.
> 
> Another information is offload the heartbeat to 1Gb nic doesn't solve the problem, what's more, when we doing random write test, we can still see this flipping happen. So I would like to say it may related with CPU scheduler ? The heartbeat thread (in busy OSD ) failed to get enough cpu cycle.
> 

FWIW, also take a close look at your monitor daemons, and
whether they show any signs of being overloaded.

I frequently see OSDs wrongly marked down when my
mons cannot keep up with their workload.

-- Jim

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [ceph-users] OSD state flipping when cluster-network in high utilization
  2013-05-15 15:05                   ` Jim Schutt
@ 2013-05-15 15:12                     ` Chen, Xiaoxi
  2013-05-15 15:38                       ` Jim Schutt
  0 siblings, 1 reply; 13+ messages in thread
From: Chen, Xiaoxi @ 2013-05-15 15:12 UTC (permalink / raw)
  To: Jim Schutt
  Cc: Sage Weil, Mark Nelson, ceph-devel@vger.kernel.org,
	ceph-users@ceph.com

Thanks， but i am not quite understand how to determine weather monitor overloaded？ and if yes，will start several monitor help？

发自我的 iPhone

在 2013-5-15，23:07，"Jim Schutt" <jaschut@sandia.gov> 写道：

> On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote:
>>> How responsive generally is the machine under load?  Is there available CPU?
>>    The machine works well, and the issued OSDs are likely the same, seems because they have relative slower disk( disk type are the same but the latency is a bit higher ,8ms -> 10ms).
>>    
>>    Top show no idle % but still have 30+% of io_wait,  my colleague educate me that io_wait can be treated as free.
>> 
>> Another information is offload the heartbeat to 1Gb nic doesn't solve the problem, what's more, when we doing random write test, we can still see this flipping happen. So I would like to say it may related with CPU scheduler ? The heartbeat thread (in busy OSD ) failed to get enough cpu cycle.
>> 
> 
> FWIW, also take a close look at your monitor daemons, and
> whether they show any signs of being overloaded.
> 
> I frequently see OSDs wrongly marked down when my
> mons cannot keep up with their workload.
> 
> -- Jim
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [ceph-users] OSD state flipping when cluster-network in high utilization
  2013-05-15 15:12                     ` [ceph-users] " Chen, Xiaoxi
@ 2013-05-15 15:38                       ` Jim Schutt
  0 siblings, 0 replies; 13+ messages in thread
From: Jim Schutt @ 2013-05-15 15:38 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: Sage Weil, Mark Nelson, ceph-devel@vger.kernel.org,
	ceph-users@ceph.com

On 05/15/2013 09:12 AM, Chen, Xiaoxi wrote:
> Thanks， but i am not quite understand how to determine weather monitor overloaded？

I've also found it difficult, and a process of elimination.

Make sure your mon debugging is minimal.  If it's not,
lower it and see if it helps.

Watch your mon host with vmstat or iostat, to see what
it's disk write rate is during the periods you're having
trouble with OSDs having heartbeat issues.  If it's near
what you think the limiting rate is for whatever device
is holding the mon's store.db directory, that might be
your bottleneck.

> and if yes，will start several monitor help？

Sage or one of the other Ceph developers can speak better
to this.  I will note that with three monitors, if one
is too busy, it'll miss renewing its lease.  The result of
that will be an election to get a new quorum.  If you run
multiple mons and they hold frequent elections where a mon
drops out of quorum, then comes back in on the next election,
I've found that to be a sign that my mons are too busy.

-- Jim

> 
> 发自我的 iPhone
> 
> 在 2013-5-15，23:07，"Jim Schutt" <jaschut@sandia.gov> 写道：
> 
>> On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote:
>>>> How responsive generally is the machine under load?  Is there available CPU?
>>>    The machine works well, and the issued OSDs are likely the same, seems because they have relative slower disk( disk type are the same but the latency is a bit higher ,8ms -> 10ms).
>>>    
>>>    Top show no idle % but still have 30+% of io_wait,  my colleague educate me that io_wait can be treated as free.
>>>
>>> Another information is offload the heartbeat to 1Gb nic doesn't solve the problem, what's more, when we doing random write test, we can still see this flipping happen. So I would like to say it may related with CPU scheduler ? The heartbeat thread (in busy OSD ) failed to get enough cpu cycle.
>>>
>>
>> FWIW, also take a close look at your monitor daemons, and
>> whether they show any signs of being overloaded.
>>
>> I frequently see OSDs wrongly marked down when my
>> mons cannot keep up with their workload.
>>
>> -- Jim
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [ceph-users] OSD state flipping when cluster-network in high utilization
       [not found]                         ` <alpine.DEB.2.00.1305150851190.14301@cobra.newdream.net>
@ 2013-05-23  2:08                           ` Chen, Xiaoxi
  0 siblings, 0 replies; 13+ messages in thread
From: Chen, Xiaoxi @ 2013-05-23  2:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com

Hi, Sage,
	Any news about this issue ? We are keeping reproducing it again and again.
	We plug out the disk and carefully test the raw disk performance, it's pretty normal.
																Xiaoxi

-----Original Message-----
From: Chen, Xiaoxi 
Sent: 2013年5月16日 6:38
To: 'Sage Weil'
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high utilization

Uploaded to /home/cephdrop/xiaoxi_flip_osd/osdlog.tar.gz
Thanks

-----Original Message-----
From: Sage Weil [mailto:sage@inktank.com]
Sent: 2013年5月15日 23:52
To: Chen, Xiaoxi
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high utilization

Can you sftp it to cephdrop@ceph.com, password asdf?  Thanks!

sage

On Wed, 15 May 2013, Chen, Xiaoxi wrote:

> Hi Sage,
> 	I can reproduce it with log ,but the log is quite big (30MB+) , where should I put the log?
> 	From the log and reading the code,seems that with uncertain reason, the OSD_OP_THREAD will be blocked there for more than 15s, which results a time-out. Then in " handle_osd_ping " function, it dropped the ping request because of "is healthy"= false.
> 	I have a full log with debug osd 20, but still have no idea why op thread will be blocked for that long, part of the log is pasted below:
> 
> 2013-05-15 15:29:06.870834 7f0253340700 10 osd.319 4113 dequeue_op
> 0xf4821e0 prio 63 cost 4096 latency 0.000173 osd_op(client.4703.0
> :2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096]
> 3.3e4c40d7) v4 pg pg[3.d7( v 4103'5330 (3853'4329,4103'5330] lo
> cal-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0
> lpr=4091 mlcod 4103'5329 active+clean]
> 2013-05-15 15:29:06.870862 7f0253340700 20 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] op_has_sufficient_caps pool=3 (xiaoxi_rbd)
> owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0
> need_class_write_cap=0 -> yes
> 2013-05-15 15:29:06.870871 7f0253340700 10 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] do_op osd_op(client.4703.0:2114483 rbd_data
> .11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 
> may_write
> 2013-05-15 15:29:22.117110 7f025d354700  1 heartbeat_map is_healthy 
> 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.127345 7f0269b6d700  1 heartbeat_map is_healthy 
> 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.276280 7f025d354700  1 heartbeat_map is_healthy 
> 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.356731 7f025d354700  1 heartbeat_map is_healthy 
> 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.359808 7f025d354700  1 heartbeat_map is_healthy 
> 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.513247 7f0253340700 10 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] get_snapset_context rbd_data.11ff6b8b4567.0
> 000000000000339 0 -> 1
> 2013-05-15 15:29:22.513270 7f0253340700 10 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] populate_obc_watchers 3e4c40d7/rbd_data.11f
> f6b8b4567.0000000000000339/head//3
> 2013-05-15 15:29:22.513276 7f0253340700 10 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] get_object_context 0xd896840 3e4c40d7/rbd_d
> ata.11ff6b8b4567.0000000000000339/head//3 0 -> 1 read 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3(4103'5267 client.4703.
> 0:2109755 wrlock_by=unknown.0.0:0)
> 2013-05-15 15:29:22.513284 7f0253340700 10 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] find_object_context 3e4c40d7/rbd_data.11ff6
> b8b4567.0000000000000339/head//3 @head
> 2013-05-15 15:29:22.513290 7f0253340700 10 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] do_op mode is idle(wr=0)
> 2013-05-15 15:29:22.513295 7f0253340700 10 osd.319 pg_epoch: 4113 
> pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec
> =100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod
> 4103'5329 active+clean] do_op mode now rmw(wr=0)
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage@inktank.com]
> Sent: 2013?5?15? 11:40
> To: Chen, Xiaoxi
> Cc: Mark Nelson; ceph-devel@vger.kernel.org; ceph-users@ceph.com
> Subject: RE: [ceph-users] OSD state flipping when cluster-network in 
> high utilization
> 
> On Wed, 15 May 2013, Chen, Xiaoxi wrote:
> 
> > >How responsive generally is the machine under load?  Is there available CPU?
> > 	The machine works well, and the issued OSDs are likely the same, seems because they have relative slower disk( disk type are the same but the latency is a bit higher ,8ms -> 10ms).
> > 	
> > 	Top show no idle % but still have 30+% of io_wait,  my colleague educate me that io_wait can be treated as free.
> > 
> > Another information is offload the heartbeat to 1Gb nic doesn't 
> > solve the problem, what's more, when we doing random write test, we 
> > can still see this flipping happen. So I would like to say it may 
> > related with CPU scheduler ? The heartbeat thread (in busy OSD ) 
> > failed to get enough cpu cycle.
> 
> Can you try reproducing with logs?  If we can reproduce that will give us some clues.
> 
> sage
> 
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: 2013?5?15? 7:23
> > To: Chen, Xiaoxi
> > Cc: Mark Nelson; ceph-devel@vger.kernel.org; ceph-users@ceph.com
> > Subject: Re: [ceph-users] OSD state flipping when cluster-network in 
> > high utilization
> > 
> > On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > > I like the idea to leave ping in cluster network because it can 
> > > help us detect switch?nic failure.
> > > 
> > > What confuse me is I keep pinging every ceph node's cluster ip?it 
> > > is OK during the whole run with less than 1 ms latency?why the 
> > > heartbeat still suffer? TOP show my cpu not 100% utilized?with 
> > > ?30% io wait?.Enabling jumbo frame **seems** make things 
> > > worth.(just feeling.no data supports)
> > 
> > I say "ping" in teh general sense.. it's not using ICMP, but sending small messages over a TCP session, doing some minimal processing on the other end, and sending them back.  If the machine is heavily loaded and that thread doesn't get scheduled or somehow blocks, it may be problematic.
> > 
> > How responsive generally is the machine under load?  Is there available CPU?
> > 
> > We can try to enable debugging to see what is going on.. 'debug ms = 1' 
> > and 'debug osd = 20' is everything we would need, but will incur additoinal load itself and may spoil the experiment...
> > 
> > sage
> > 
> > > 
> > > ???? iPhone
> > > 
> > > ? 2013-5-14?23:36?"Mark Nelson" <mark.nelson@inktank.com> ???
> > > 
> > > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > > >>> 
> > > >>> Hi
> > > >>> 
> > > >>>   We are suffering our OSD flipping between up and down ( OSD 
> > > >>> X be voted to down due to 3 missing ping, and after a while it 
> > > >>> tells the monitor ?map xxx wrongly mark me down? ). Because we 
> > > >>> are running sequential write performance test on top of RBDs, 
> > > >>> and the cluster network nics is really in high utilization (8Gb/s+ for a 10Gb network).
> > > >>> 
> > > >>>          Is this a expected behavior ? or how can I prevent this happen?
> > > >> 
> > > >> You an increase the heartbeat grace period.  The pings are 
> > > >> handled by a separate thread on the backside interface (if there is one).
> > > >> If you are missing pings then the network or scheduler is 
> > > >> preventing those (small) messages from being processed (there 
> > > >> is almost no lock contention in that path).  Which means it 
> > > >> really is taking ~20 seconds or wahtever to handle those 
> > > >> messages.  It's really a questin of how unresponsive you want to permit the OSDs to be before you consider it a failure..
> > > >> 
> > > >> sage
> > > >> 
> > > >> 
> > > > 
> > > > It might be worth testing out how long pings or other network traffic are taking during these tests.  There may be some tcp tunning you can do here, or even consider using a separate network for the mons.
> > > > 
> > > > Mark
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > N????y????b?????v?????{.n??????z??ay????????j ???f????????????????:+v???????? ??zZ+??????"?!?
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2013-05-23  2:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-14  8:56 OSD state flipping when cluster-network in high utilization Chen, Xiaoxi
2013-05-14 15:30 ` [ceph-users] " Sage Weil
2013-05-14 15:35   ` Mark Nelson
     [not found]     ` <519259C5.3000109-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
2013-05-14 23:13       ` Chen, Xiaoxi
     [not found]         ` <6AC1548D-DE2C-4E71-AC73-3903B57C76C6-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2013-05-14 23:22           ` Sage Weil
     [not found]             ` <alpine.DEB.2.00.1305141620410.12954-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2013-05-15  3:23               ` Chen, Xiaoxi
     [not found]                 ` <6F3FA899187F0043BA1827A69DA2F7CC0145F746-0J0gbvR4kTiiAffOGbnezLfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-05-15  3:40                   ` Sage Weil
     [not found]                     ` <alpine.DEB.2.00.1305142039520.23131-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2013-05-15 12:53                       ` Chen, Xiaoxi
     [not found]                         ` <alpine.DEB.2.00.1305150851190.14301@cobra.newdream.net>
2013-05-23  2:08                           ` [ceph-users] " Chen, Xiaoxi
2013-05-15 15:05                   ` Jim Schutt
2013-05-15 15:12                     ` [ceph-users] " Chen, Xiaoxi
2013-05-15 15:38                       ` Jim Schutt
2013-05-15  0:52   ` Wales Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.