* OSD::disk_tp timeout
@ 2011-10-08 20:37 Christian Brunner
2011-10-08 21:04 ` Martin Mailand
2011-10-08 21:28 ` Sage Weil
0 siblings, 2 replies; 6+ messages in thread
From: Christian Brunner @ 2011-10-08 20:37 UTC (permalink / raw)
To: ceph-devel
Hi,
I upgraded ceph from 0.32 to 0.36 yesterday. Now I have a totally
screwed ceph cluster. :(
What bugs me most is that the OSDs frequently become unresponsive.
The process is eating a lot of CPU and I can see the
following messages in the log:
Oct 8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct 8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct 8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct 8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct 8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct 8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Do you have any idea what to do about that?
Regards,
Christian
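To see how often each thread is reported, log lines like the ones above can be tallied with a short script. This is only a sketch: the regular expression is inferred from the six sample lines in this message, and `count_timeouts` is a hypothetical helper, not part of ceph. Whitespace is flattened first so that log lines wrapped by a mail client still match.

```python
import re
from collections import Counter

# Pattern inferred from the sample syslog lines in this thread (an assumption,
# not an official ceph log format specification).
TIMEOUT_RE = re.compile(
    r"(?P<stamp>\w{3} \d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<daemon>\S+)\[(?P<pid>\d+)\]: \S+ heartbeat_map is_healthy "
    r"'(?P<thread>[^']+)' had timed out after (?P<secs>\d+)"
)


def count_timeouts(log_text):
    """Count heartbeat timeout reports per (daemon, thread) pair."""
    # Collapse all runs of whitespace (including newlines) to single spaces,
    # so entries that were wrapped across two lines become matchable again.
    flat = " ".join(log_text.split())
    return Counter(
        (m["daemon"], m["thread"]) for m in TIMEOUT_RE.finditer(flat)
    )


sample = """Oct 8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
Oct 8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
"""

for (daemon, thread), hits in count_timeouts(sample).items():
    print(daemon, thread, hits)
```

In the sample, the same thread is reported every time, which suggests one stuck worker rather than a pool-wide stall.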
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: OSD::disk_tp timeout
2011-10-08 20:37 OSD::disk_tp timeout Christian Brunner
@ 2011-10-08 21:04 ` Martin Mailand
2011-10-08 21:28 ` Sage Weil
1 sibling, 0 replies; 6+ messages in thread
From: Martin Mailand @ 2011-10-08 21:04 UTC (permalink / raw)
To: chb; +Cc: ceph-devel
Hi Christian,
If I remember correctly, you are using ceph with a qemu-kvm setup?
After the last ceph update, the load average on my OSD nodes doubled
and the performance of the KVM machines became bad.
The really weird thing is that the cluster "needs" around 30 minutes to get
into this state. After I restart the OSDs everything is fine, then
after a while the load on the OSD nodes builds up again. Most of the load
is produced by btrfs kernel processes in the deferred state.
I'm not sure whether I have the same problem as you, as I do not get any timeouts.
Best Regards,
martin
* Re: OSD::disk_tp timeout
2011-10-08 20:37 OSD::disk_tp timeout Christian Brunner
2011-10-08 21:04 ` Martin Mailand
@ 2011-10-08 21:28 ` Sage Weil
2011-10-08 22:15 ` Martin Mailand
1 sibling, 1 reply; 6+ messages in thread
From: Sage Weil @ 2011-10-08 21:28 UTC (permalink / raw)
To: Christian Brunner; +Cc: ceph-devel
Hi Christian,
On Sat, 8 Oct 2011, Christian Brunner wrote:
> Hi,
>
> I've upgraded ceph from 0.32 to 0.36 yesterday. Now I have a totaly
> screwed ceph cluster. :(
>
> What bugs me most is the fact, that OSDs become unresponsive
> frequently. The process is eating a lot of cpu and I can see the
What version of btrfs are you running? This sounds a bit like the bug
fixed by this patch:
http://www.spinics.net/lists/linux-btrfs/msg12627.html
(That was just merged into mainline this week.)
> following messages in the log:
>
> Oct 8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct 8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct 8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct 8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct 8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct 8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>
> Do you have any idea, what to do about that?
Those messages just mean that a thread in the disk threadpool (which is
doing all the writes to btrfs) is blocked/stopped.
sage
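The check behind these messages can be pictured roughly as follows. This is a toy sketch under stated assumptions, not ceph's actual code: the class and method names are illustrative (only `heartbeat_map` and `is_healthy` appear in the log output), and the injectable clock exists purely so the example can run without waiting 60 seconds. Each worker records a deadline before it starts a unit of work; a separate checker reports any worker whose deadline has passed.

```python
import time


class HeartbeatMap:
    """Toy watchdog: workers record deadlines, a checker flags overdue ones."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._deadlines = {}  # worker name -> time by which it must check in

    def reset_timeout(self, name, grace):
        """Called by a worker before each unit of work."""
        self._deadlines[name] = self._clock() + grace

    def clear_timeout(self, name):
        """Called by a worker when it goes idle; idle workers are not flagged."""
        self._deadlines.pop(name, None)

    def unhealthy(self):
        """Return the names of workers whose deadline has already passed."""
        now = self._clock()
        return sorted(n for n, d in self._deadlines.items() if d < now)

    def is_healthy(self):
        return not self.unhealthy()


# Usage with a fake clock, so the timeout can be simulated instantly.
fake_now = [0.0]
hb = HeartbeatMap(clock=lambda: fake_now[0])
hb.reset_timeout("disk_tp thread A", grace=60)
print(hb.is_healthy())  # True: the deadline is at t=60 and the clock reads 0

fake_now[0] = 61.0      # the worker is stuck in a write and never checks in
print(hb.unhealthy())   # ['disk_tp thread A']
```

In this model the checker only observes that a worker stopped making progress. It says nothing about why, which is why the kernel side (btrfs) is the next place to look.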
* Re: OSD::disk_tp timeout
2011-10-08 21:28 ` Sage Weil
@ 2011-10-08 22:15 ` Martin Mailand
2011-10-08 22:44 ` Sage Weil
0 siblings, 1 reply; 6+ messages in thread
From: Martin Mailand @ 2011-10-08 22:15 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Hi,
I am using v3.1-rc9, so the fix is in there. Maybe I can narrow it down a bit
further.
Best Regards,
martin
* Re: OSD::disk_tp timeout
2011-10-08 22:15 ` Martin Mailand
@ 2011-10-08 22:44 ` Sage Weil
2011-10-09 6:02 ` Christian Brunner
0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2011-10-08 22:44 UTC (permalink / raw)
To: Martin Mailand; +Cc: ceph-devel
On Sun, 9 Oct 2011, Martin Mailand wrote:
> Hi,
> I am using v3.1-rc9, so the fix in there. Maybe I can nail it down a bit more
> specific.
You might try sysrq-t or -w to see what the spinning CPUs are doing.
Thanks!
sage
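For machines without console access, the same dumps can be requested by writing the command letter to `/proc/sysrq-trigger`. A minimal sketch, assuming a kernel built with magic SysRq support and a root shell; `trigger_sysrq` is a hypothetical helper, and the `trigger_path` parameter exists only so the function can be tried against an ordinary file. The `t` command dumps all tasks and `w` dumps blocked tasks; the output goes to the kernel log, not to the caller.

```python
SYSRQ_TRIGGER = "/proc/sysrq-trigger"

# Only the two commands suggested in this thread are allowed here.
ALLOWED = {
    "t": "dump the state of all tasks",
    "w": "dump tasks in uninterruptible (blocked) state",
}


def trigger_sysrq(command, trigger_path=SYSRQ_TRIGGER):
    """Write a single SysRq command letter to the trigger file."""
    if command not in ALLOWED:
        raise ValueError(f"unsupported sysrq command: {command!r}")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)
```

Running `trigger_sysrq("w")` as root and then reading the kernel log (for example with `dmesg`) should show the stack of each blocked task. The `t` dump is much larger, so it may overflow a small kernel log buffer.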
* Re: OSD::disk_tp timeout
2011-10-08 22:44 ` Sage Weil
@ 2011-10-09 6:02 ` Christian Brunner
0 siblings, 0 replies; 6+ messages in thread
From: Christian Brunner @ 2011-10-09 6:02 UTC (permalink / raw)
To: Sage Weil; +Cc: Martin Mailand, ceph-devel
Here is a sysrq-t trace.
I'm running 4 OSDs on the server. The one that is causing problems has
pid 31956.
Thanks,
Christian
[-- Attachment #2: sysrq-t.txt.gz --]
[-- Type: application/x-gzip, Size: 35036 bytes --]