OSD::disk_tp timeout

* OSD::disk_tp timeout
@ 2011-10-08 20:37 Christian Brunner
  2011-10-08 21:04 ` Martin Mailand
  2011-10-08 21:28 ` Sage Weil
  0 siblings, 2 replies; 6+ messages in thread
From: Christian Brunner @ 2011-10-08 20:37 UTC (permalink / raw)
  To: ceph-devel

Hi,

I upgraded Ceph from 0.32 to 0.36 yesterday. Now I have a totally
screwed Ceph cluster. :(

What bugs me most is that the OSDs frequently become unresponsive.
The process eats a lot of CPU, and I see the
following messages in the log:

Oct  8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct  8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct  8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct  8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct  8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct  8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60

Do you have any idea what to do about that?

Regards,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
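
For anyone triaging a log like the one above, a small script can show whether the reports keep naming the same thread (in the excerpt, all six entries name thread 0x7fe0e527e700). This is only a sketch: the function name `count_timeouts` and the regular expression are assumptions based on the message text quoted in this mail, not on any Ceph tool.

```python
import re
from collections import Counter

# Pattern derived from the log excerpt in this thread (an assumption about
# the format, not taken from Ceph's source). \s+ also matches a newline, so
# entries that the mail client wrapped across two lines are still found.
TIMEOUT_RE = re.compile(
    r"heartbeat_map\s+is_healthy\s+'([^']+)'\s+had timed out after (\d+)"
)


def count_timeouts(log_text):
    """Return a Counter mapping each reported thread name to its report count."""
    return Counter(match.group(1) for match in TIMEOUT_RE.finditer(log_text))
```

Calling `count_timeouts(open(path).read())` on a saved log (with `path` pointing at wherever syslog writes on the host) gives one count per thread name. A single name with a steadily growing count points at one stuck worker rather than many slow ones.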

* Re: OSD::disk_tp timeout
  2011-10-08 20:37 OSD::disk_tp timeout Christian Brunner
@ 2011-10-08 21:04 ` Martin Mailand
  2011-10-08 21:28 ` Sage Weil
  1 sibling, 0 replies; 6+ messages in thread
From: Martin Mailand @ 2011-10-08 21:04 UTC (permalink / raw)
  To: chb; +Cc: ceph-devel

Hi Christian,
if I remember correctly you are using ceph with a qemu-kvm setup?

After the last Ceph update, the load average on the OSDs doubled and
the performance of the KVM machines became bad.

The really weird thing is that the cluster "needs" around 30 minutes to
get into this state. After I restart the OSDs everything is fine; then,
after a while, the load on the OSD nodes builds up again. Most of the
load is produced by btrfs kernel processes in the deferred state.

Not sure if I have the same problem as you, as I do not get any timeouts.

Best Regards,
  martin


* Re: OSD::disk_tp timeout
  2011-10-08 20:37 OSD::disk_tp timeout Christian Brunner
  2011-10-08 21:04 ` Martin Mailand
@ 2011-10-08 21:28 ` Sage Weil
  2011-10-08 22:15   ` Martin Mailand
  1 sibling, 1 reply; 6+ messages in thread
From: Sage Weil @ 2011-10-08 21:28 UTC (permalink / raw)
  To: Christian Brunner; +Cc: ceph-devel

Hi Christian,

On Sat, 8 Oct 2011, Christian Brunner wrote:
> Hi,
> 
> I upgraded Ceph from 0.32 to 0.36 yesterday. Now I have a totally
> screwed Ceph cluster. :(
> 
> What bugs me most is that the OSDs frequently become unresponsive.
> The process eats a lot of CPU, and I see the

What version of btrfs are you running?  This sounds a bit like the bug
fixed by this patch:

http://www.spinics.net/lists/linux-btrfs/msg12627.html

(That was just merged into mainline this week.)

> following messages in the log:
> 
> Oct  8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> 
> Do you have any idea what to do about that?

Those messages just mean that a thread in the disk threadpool (which is 
doing all the writes to btrfs) is blocked/stopped.

sage
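
To illustrate the kind of check that produces these messages, here is a toy model of a heartbeat table. It is a sketch under assumptions, not Ceph's implementation: only the name `is_healthy` and the message wording come from the log above, while the class `HeartbeatMap`, the other method names, and the injectable clock are made up for illustration. The idea is that each worker records a deadline whenever it makes progress, and a periodic check reports any worker whose deadline has passed.

```python
import time


class HeartbeatMap:
    """Toy per-thread heartbeat table (illustrative only, not Ceph code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._deadlines = {}  # worker name -> (deadline, grace in seconds)

    def reset_timeout(self, name, grace):
        """Called by a worker each time it makes progress."""
        self._deadlines[name] = (self._clock() + grace, grace)

    def clear_timeout(self, name):
        """Called by a worker that goes idle, so idle workers are not flagged."""
        self._deadlines.pop(name, None)

    def is_healthy(self):
        """Return (healthy, messages), one message per worker past its deadline."""
        now = self._clock()
        messages = [
            f"'{name}' had timed out after {grace}"
            for name, (deadline, grace) in self._deadlines.items()
            if now > deadline
        ]
        return not messages, messages
```

In this model, a worker blocked inside a filesystem call never reaches its next `reset_timeout`, so every later `is_healthy` call reports it again. That matches the same message repeating every five seconds in the log.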

* Re: OSD::disk_tp timeout
  2011-10-08 21:28 ` Sage Weil
@ 2011-10-08 22:15   ` Martin Mailand
  2011-10-08 22:44     ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Mailand @ 2011-10-08 22:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,
I am using v3.1-rc9, so the fix is in there. Maybe I can narrow it down
further.

Best Regards,
  martin


* Re: OSD::disk_tp timeout
  2011-10-08 22:15   ` Martin Mailand
@ 2011-10-08 22:44     ` Sage Weil
  2011-10-09  6:02       ` Christian Brunner
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2011-10-08 22:44 UTC (permalink / raw)
  To: Martin Mailand; +Cc: ceph-devel

On Sun, 9 Oct 2011, Martin Mailand wrote:
> Hi,
> I am using v3.1-rc9, so the fix is in there. Maybe I can narrow it down
> further.

You might try sysrq-t or -w to see what the spinning CPUs are doing.

Thanks!
sage
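
For hosts without a console keyboard, the same dumps can be requested by writing the command key to /proc/sysrq-trigger as root. The helper below is a minimal sketch; the function name `trigger_sysrq` and its `trigger` parameter are illustrative, not part of any existing tool. The key `t` dumps the state of all tasks and `w` dumps only blocked (uninterruptible) tasks, and the output lands in the kernel log.

```python
from pathlib import Path

# Standard location of the SysRq trigger file on Linux.
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def trigger_sysrq(key, trigger=SYSRQ_TRIGGER):
    """Write one SysRq command key to the trigger file (requires root).

    The `trigger` parameter exists only so the function can be pointed
    at an ordinary file when trying it out without root.
    """
    if len(key) != 1:
        raise ValueError("SysRq command must be a single character")
    Path(trigger).write_text(key)
```

After `trigger_sysrq("w")`, the dump can be read with `dmesg` or from the syslog file. On a busy machine the `t` dump can be large, so the kernel log buffer may need to be big enough to hold it.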



* Re: OSD::disk_tp timeout
  2011-10-08 22:44     ` Sage Weil
@ 2011-10-09  6:02       ` Christian Brunner
  0 siblings, 0 replies; 6+ messages in thread
From: Christian Brunner @ 2011-10-09  6:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: Martin Mailand, ceph-devel

Here is a sysrq-t trace.

I'm running 4 OSDs on the server. The one that is causing problems has
pid 31956.

Thanks,
Christian

[-- Attachment #2: sysrq-t.txt.gz --]
[-- Type: application/x-gzip, Size: 35036 bytes --]
