Osd load and timeout when recovering

* Osd load and timeout when recovering
@ 2012-10-22 10:27 Yann ROBIN
  2012-10-22 12:02 ` Wido den Hollander
  2012-10-22 14:26 ` Gregory Farnum
  0 siblings, 2 replies; 4+ messages in thread
From: Yann ROBIN @ 2012-10-22 10:27 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi,

We use Ceph to store a large number of small files across several servers, and we access them through the RADOS gateway.
Our data size is 380 GB (very small). We have two hosts with 5 OSDs each.
We use a small configuration for Ceph: servers with 2 GB of RAM and 5 x 2 TB disks (one OSD per disk).
This is a very cheap configuration that keeps our storage costs under control, and it is enough to get the read performance we need.
(We use the same configuration with MogileFS to store 150 TB of data.)

This weekend we had an alert saying Ceph was down.

Looking at the OSDs, we saw a very high load (a load of 450), and some of them were down.
ceph -s showed down PGs, peering+down PGs, remapped PGs, etc.

We noticed that while PGs were peering (and in similar states), the load was very high.
The OSDs stopped responding, and the logs showed messages like:
FileStore timeout and Abort Signal

So basically the cluster was under load because it was recovering, but because it was under load, recovery could not complete.

We changed these parameters to get longer timeouts:
filestore op thread suicide timeout = 360 
filestore op thread timeout = 180 
osd default notify timeout = 360

The cluster was still under heavy load, and the OSDs were still timing out (less often, but still).

So we tried parameters to "throttle" the recovery process:
filestore op threads = 6
filestore queue max ops = 24
osd recovery max active = 1

Load was better, but still very high (30). 

We also tried putting the journal in a tmpfs backed by zram.
We set noout so that Ceph would not copy data to satisfy the replica count because OSDs were out.

We then updated to kernel 3.5 to get the latest XFS optimizations.

In the end nothing worked: we were stuck in the same death loop of recovering => load => timeout => recovering.
So we updated from Ceph 0.48.2 to 0.53; load was better and recovery finally completed.

As we don't want to be in this position again (24 hours of downtime), I have some questions about Ceph/RADOS.

1/ Even after we switched to Ceph 0.53, the RADOS gateway was still not responding; the log showed "Initialization timeout".
Is it normal that the recovery process prevents us from reading data from Ceph?
The data is there, it is just moving, so why can't we access it?

2/ When load is very high because Ceph is moving data, is there a way to tell Ceph to go slowly?

Thanks,

--
Yann ROBIN
Société Publica
www.YouScribe.com

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Osd load and timeout when recovering
  2012-10-22 10:27 Osd load and timeout when recovering Yann ROBIN
@ 2012-10-22 12:02 ` Wido den Hollander
  2012-10-22 13:59   ` Yann ROBIN
  2012-10-22 14:26 ` Gregory Farnum
  1 sibling, 1 reply; 4+ messages in thread
From: Wido den Hollander @ 2012-10-22 12:02 UTC (permalink / raw)
  To: Yann ROBIN; +Cc: ceph-devel@vger.kernel.org

On 10/22/2012 12:27 PM, Yann ROBIN wrote:
> Hi,
>
> We use Ceph to store a large number of small files across several servers, and we access them through the RADOS gateway.
> Our data size is 380 GB (very small). We have two hosts with 5 OSDs each.
> We use a small configuration for Ceph: servers with 2 GB of RAM and 5 x 2 TB disks (one OSD per disk).
> This is a very cheap configuration that keeps our storage costs under control, and it is enough to get the read performance we need.
> (We use the same configuration with MogileFS to store 150 TB of data.)
>
> This weekend we had an alert saying Ceph was down.
>
> Looking at the OSDs, we saw a very high load (a load of 450), and some of them were down.
> ceph -s showed down PGs, peering+down PGs, remapped PGs, etc.
>

Could you tell us a bit more?

When the load was 450, was this mainly due to disk I/O wait?
Did the machines start to swap?

Could it be that the swapping was actually causing the machines to die 
even more?

Although an OSD can run with 100 MB of memory, during recovery its
memory use can grow quite fast.

> We noticed that while PGs were peering (and in similar states), the load was very high.
> The OSDs stopped responding, and the logs showed messages like:
> FileStore timeout and Abort Signal
>
> So basically the cluster was under load because it was recovering, but because it was under load, recovery could not complete.
>

FileStore aborts indicate that it couldn't get the work done quickly 
enough. I've seen this with btrfs, but you say you are using XFS.

You say you are storing small files. What exactly is "small"?

Wido

> We changed these parameters to get longer timeouts:
> filestore op thread suicide timeout = 360
> filestore op thread timeout = 180
> osd default notify timeout = 360
>
> The cluster was still under heavy load, and the OSDs were still timing out (less often, but still).
>
> So we tried parameters to "throttle" the recovery process:
> filestore op threads = 6
> filestore queue max ops = 24
> osd recovery max active = 1
>
> Load was better, but still very high (30).
>
> We also tried putting the journal in a tmpfs backed by zram.
> We set noout so that Ceph would not copy data to satisfy the replica count because OSDs were out.
>
> We then updated to kernel 3.5 to get the latest XFS optimizations.
>
> In the end nothing worked: we were stuck in the same death loop of recovering => load => timeout => recovering.
> So we updated from Ceph 0.48.2 to 0.53; load was better and recovery finally completed.
>
> As we don't want to be in this position again (24 hours of downtime), I have some questions about Ceph/RADOS.
>
> 1/ Even after we switched to Ceph 0.53, the RADOS gateway was still not responding; the log showed "Initialization timeout".
> Is it normal that the recovery process prevents us from reading data from Ceph?
> The data is there, it is just moving, so why can't we access it?
>
> 2/ When load is very high because Ceph is moving data, is there a way to tell Ceph to go slowly?
>
>
> Thanks,
>
> --
> Yann ROBIN
> Société Publica
> www.YouScribe.com
>

* RE: Osd load and timeout when recovering
  2012-10-22 12:02 ` Wido den Hollander
@ 2012-10-22 13:59   ` Yann ROBIN
  0 siblings, 0 replies; 4+ messages in thread
From: Yann ROBIN @ 2012-10-22 13:59 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

>> Looking at the OSDs, we saw a very high load (a load of 450), and some of them were down.
>> ceph -s showed down PGs, peering+down PGs, remapped PGs, etc.
>
> Could you tell us a bit more?
>
> When the load was 450, was this mainly due to disk I/O wait?
> Did the machines start to swap?

All disks were 100% busy, and the servers were swapping.

> Could it be that the swapping was actually causing the machines to die even more?
>
> Although an OSD can run with 100 MB of memory, during recovery its memory use can grow quite fast.

Is there a way to estimate the memory needed?

>> So basically the cluster was under load because it was recovering, but because it was under load, recovery could not complete.
>
> FileStore aborts indicate that it couldn't get the work done quickly enough. I've seen this with btrfs, but you say you are using XFS.
>
> You say you are storing small files. What exactly is "small"?

On average, 120 KB.
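
To give a sense of scale, here is a minimal sketch of the object count these figures imply. It assumes decimal units (380 GB of data, 120 KB average file size), one stored object per file, and an even spread over the 10 OSDs; it ignores replication. The function name is illustrative and not part of any Ceph tool.

```python
# Back-of-the-envelope object count from the figures in this thread.
# Assumptions: decimal units, one object per file, replication ignored.

def estimate_object_count(total_bytes, avg_object_bytes):
    """Return the approximate number of objects, rounded down."""
    return total_bytes // avg_object_bytes

total_bytes = 380 * 10**9       # 380 GB of data
avg_object_bytes = 120 * 10**3  # 120 KB average file size
osd_count = 10                  # two hosts with 5 OSDs each

objects = estimate_object_count(total_bytes, avg_object_bytes)
print(objects)               # 3166666
print(objects // osd_count)  # 316666 objects per OSD, before replication
```

So even a "very small" data set in bytes means millions of small objects to track and move during recovery.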


-- 
Yann ROBIN
www.YouScribe.com

* Re: Osd load and timeout when recovering
  2012-10-22 10:27 Osd load and timeout when recovering Yann ROBIN
  2012-10-22 12:02 ` Wido den Hollander
@ 2012-10-22 14:26 ` Gregory Farnum
  1 sibling, 0 replies; 4+ messages in thread
From: Gregory Farnum @ 2012-10-22 14:26 UTC (permalink / raw)
  To: Yann ROBIN; +Cc: ceph-devel@vger.kernel.org

On Mon, Oct 22, 2012 at 3:27 AM, Yann ROBIN <yann.robin@youscribe.com> wrote:
> Hi,
>
> We use Ceph to store a large number of small files across several servers, and we access them through the RADOS gateway.
> Our data size is 380 GB (very small). We have two hosts with 5 OSDs each.
> We use a small configuration for Ceph: servers with 2 GB of RAM and 5 x 2 TB disks (one OSD per disk).
> This is a very cheap configuration that keeps our storage costs under control, and it is enough to get the read performance we need.
> (We use the same configuration with MogileFS to store 150 TB of data.)

These node sizes are one of your problems — while the OSDs in normal
operation are only using 100-200MBs of memory, they can spike quite a
lot during recovery. We generally recommend 1GB of RAM per daemon.
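
As a rough illustration of that rule of thumb, the sketch below compares the RAM a host would want at 1 GB per OSD daemon with what it actually has. The function name is hypothetical, and the calculation leaves no headroom for the operating system or page cache, so treat the result as a lower bound.

```python
# Rough memory check for an OSD host, using the 1 GB-per-daemon
# rule of thumb quoted above. Names here are illustrative only.

def ram_shortfall_gb(osd_count, host_ram_gb, gb_per_daemon=1.0):
    """Return how many GB of RAM the host is short (0.0 if it has enough)."""
    needed = osd_count * gb_per_daemon
    return max(0.0, needed - host_ram_gb)

# Host described in this thread: 5 OSDs on a server with 2 GB of RAM.
print(ram_shortfall_gb(osd_count=5, host_ram_gb=2))  # 3.0
```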

> This weekend we had an alert saying Ceph was down.
>
> Looking at the OSDs, we saw a very high load (a load of 450), and some of them were down.
> ceph -s showed down PGs, peering+down PGs, remapped PGs, etc.
>
> We noticed that while PGs were peering (and in similar states), the load was very high.
> The OSDs stopped responding, and the logs showed messages like:
> FileStore timeout and Abort Signal

Right. That means the OSD was sending operations down to disk that
were taking so long to complete that it timed them out. The default
timeout on a FileStore operation is 60 seconds; if the OSD was actually
suiciding, that requires the disk to be unresponsive for 180 seconds.
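
The two thresholds can be pictured with a toy model like the one below. It is only a sketch of the idea: the constants are the defaults stated above, the function and its return labels are made up for illustration, and the real OSD thread monitoring is more involved than a single comparison.

```python
# Toy model of the two FileStore op thread thresholds discussed above.
# Values are the defaults as stated in this thread, in seconds.
OP_THREAD_TIMEOUT = 60
OP_THREAD_SUICIDE_TIMEOUT = 180

def classify_op(elapsed_s, timeout=OP_THREAD_TIMEOUT,
                suicide=OP_THREAD_SUICIDE_TIMEOUT):
    """Classify a disk operation by how long it has been running."""
    if elapsed_s >= suicide:
        return "suicide"   # the OSD aborts itself
    if elapsed_s >= timeout:
        return "timeout"   # the op is reported as timed out
    return "ok"

print(classify_op(30))    # ok
print(classify_op(90))    # timeout
print(classify_op(200))   # suicide
# With the raised values from the original message (180 / 360):
print(classify_op(200, timeout=180, suicide=360))  # timeout
```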

> So basically the cluster was under load because it was recovering, but because it was under load, recovery could not complete.
>
> We changed these parameters to get longer timeouts:
> filestore op thread suicide timeout = 360
> filestore op thread timeout = 180
> osd default notify timeout = 360

Okay, the first two of those are fine. The third one is actually
related to the "notify" OSD operation, and isn't helping you here.

> The cluster was still under heavy load, and the OSDs were still timing out (less often, but still).
>
> So we tried parameters to "throttle" the recovery process:
> filestore op threads = 6
> filestore queue max ops = 24
> osd recovery max active = 1

Okay, so now you've increased the number of simultaneous disk
operations the FileStore will dispatch from 2 to 6. That probably
didn't help. Decreasing the "filestore queue max ops" from 500 to
24...probably didn't do anything. But it might have; I'll defer to
others on that.
The one thing here that definitely did help is bringing down the "osd
recovery max active"...that's the number of PGs to try and recover
simultaneously; by dropping it from 5 to 1 you've reduced the total
number of recovery operations going on across the cluster.
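
Putting those comments together, a small sketch that renders only the settings judged useful in this thread (the two longer FileStore timeouts and the lower recovery concurrency) as a config fragment might look like this. The option names and values come from the messages above; placing them under an [osd] section and the helper name are assumptions for illustration, and "filestore op threads" is deliberately left at its default.

```python
# Render the settings this thread found useful as a ceph.conf-style fragment.
# Assumption: the options belong under an [osd] section in your deployment.

def render_osd_section(settings):
    """Format a dict of option -> value as an INI-style [osd] section."""
    lines = ["[osd]"]
    for key, value in settings.items():
        lines.append(f"    {key} = {value}")
    return "\n".join(lines)

settings = {
    "filestore op thread timeout": 180,
    "filestore op thread suicide timeout": 360,
    "osd recovery max active": 1,
}
print(render_osd_section(settings))
```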

> Load was better, but still very high (30).
>
> We also tried putting the journal in a tmpfs backed by zram.

So, you took away RAM from the system? Ouch, but okay, maybe it's
reducing disk usage overall...

> We set noout so that Ceph would not copy data to satisfy the replica count because OSDs were out.
>
> We then updated to kernel 3.5 to get the latest XFS optimizations.
>
> In the end nothing worked: we were stuck in the same death loop of recovering => load => timeout => recovering.
> So we updated from Ceph 0.48.2 to 0.53; load was better and recovery finally completed.

Right. You've run into a problem we call "cluster thrashing", in which
a problem with one OSD causes it to go out, and then the subsequent
map changes and data movement cause other OSDs to fall over as well.
This is a problem in argonaut which has been smoothed out a great deal
in subsequent development releases by greatly reducing the cost of OSD
map updates.

> As we don't want to be in this position again (24 hours of downtime), I have some questions about Ceph/RADOS.
>
> 1/ Even after we switched to Ceph 0.53, the RADOS gateway was still not responding; the log showed "Initialization timeout".
> Is it normal that the recovery process prevents us from reading data from Ceph?
> The data is there, it is just moving, so why can't we access it?

You had PGs that were in a "down" state, meaning that the OSD which is
supposed to be primary for them wasn't servicing requests yet. It
takes some time to establish who has the newest version of data in a
PG and gather up an active set.
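
PG states are reported as '+'-joined flags (for example "peering+down"), so one simple way to see how many PGs are affected is to count flags. The sketch below does that over a hand-written sample list; the sample data and the function name are hypothetical, and it does not parse real ceph command output.

```python
from collections import Counter

def count_pg_flags(pg_states):
    """Count how many PGs carry each flag in '+'-joined state strings."""
    counts = Counter()
    for state in pg_states:
        counts.update(state.split("+"))
    return counts

# Hypothetical sample, using state names mentioned in this thread.
sample = ["active+clean", "peering+down", "down", "active+remapped"]
counts = count_pg_flags(sample)
print(counts["down"])    # 2
print(counts["active"])  # 2
```

Any PG counted under "down" here would not be serving reads until it peers and becomes active again.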

> 2/ When load is very high because Ceph is moving data, is there a way to tell Ceph to go slowly?

There are a lot of switches you can throw to do this. You threw a
number of them. I'm not aware of any others off the top of my head,
but Sam or Sage might have more.
-Greg