Understanding mon space usage during recovery

* Understanding mon space usage during recovery
@ 2016-06-08  9:25 Bartłomiej Święcki
  2016-06-08 15:33 ` Gregory Farnum
  0 siblings, 1 reply; 3+ messages in thread
From: Bartłomiej Święcki @ 2016-06-08  9:25 UTC (permalink / raw)
  To: Ceph Development

Hi,

I was recently trying to understand the growth of mon disk space usage
during recovery in one of our clusters. I wanted to know whether we
could reduce disk usage somehow or whether we just have to provision
more space for our mons. The cluster runs 0.94.6 and has just over 300
OSDs. Leveldb compaction does reduce space usage, but it quickly grows
back to the previous level. What I found is that most of the leveldb
data is osdmap history.

For each osdmap version, leveldb contains both a full and an incremental
entry, so I was wondering whether we really need to store full osdmaps
for all versions. Since we have incremental changes for every version
anyway, wouldn't it be sufficient to keep only the first full version
and then reconstruct any later ones by applying incrementals?

I was also trying to understand how ceph figures out the range of osdmap
versions to keep. After analyzing the code, I thought the obvious answer
was in PGMap::calc_min_last_epoch_clean(). In the case of our production
cluster, the difference between the min and max clean epochs was around
30k during recovery, and the size of one full osdmap blob in leveldb is
around 250k.
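
As a rough sketch of what those two numbers imply together: the snippet
below multiplies the epoch range by the per-map size. It assumes "250k"
means about 250 KiB per full map and ignores incrementals and leveldb
overhead, so it is only a lower-bound estimate (roughly 7 GiB), not a
measurement. The function name is made up for illustration.

```python
# Back-of-the-envelope estimate of the space held by full osdmaps.
# Assumption: "250k" is read as 250 KiB per full osdmap blob.
KIB = 1024


def full_osdmap_bytes(epochs_kept: int, bytes_per_full_map: int) -> int:
    """Space used if one full osdmap is stored for every kept epoch."""
    return epochs_kept * bytes_per_full_map


epochs_kept = 30_000          # min..max clean epoch gap seen during recovery
bytes_per_map = 250 * KIB     # approximate size of one full osdmap blob

total = full_osdmap_bytes(epochs_kept, bytes_per_map)
print(f"{total / KIB**3:.2f} GiB held by full osdmaps alone")
```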

I also tried to test this on my dev cluster, where I could run gdb (15
OSDs, 4 of them nearfull, and lots of misplaced objects).
What I found is that execution in OSDMonitor::get_trim_to() almost never
entered the first 'if': mon->pgmon()->is_readable() returns false. I
debugged it once, and that was the result of Paxos::is_lease_valid()
returning false.
I was able to get into that 'if' only once the cluster got back to a
healthy state. Is this expected behavior?
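
To make the observation concrete, here is a deliberately simplified
model of that decision, written in Python rather than taken from the
Ceph source. The parameter names are hypothetical, and it models only
what is described above: when the PG monitor is not readable, the first
'if' is skipped and nothing is trimmed.

```python
def get_trim_to(pgmon_readable: bool,
                min_last_epoch_clean: int,
                first_committed: int) -> int:
    """Return the epoch up to which old osdmaps may be trimmed.

    Simplified model, NOT the actual Ceph implementation.
    A return value of 0 means "do not trim anything".
    """
    if pgmon_readable:
        # Only inside this branch is a trim floor derived from PG state.
        floor = min_last_epoch_clean
        if floor > first_committed:
            return floor
    # PG monitor not readable (e.g. the lease is not valid): keep everything.
    return 0


# While the lease is invalid, osdmap history keeps accumulating.
print(get_trim_to(False, min_last_epoch_clean=500, first_committed=100))
# Once the PG monitor is readable again, older maps become trimmable.
print(get_trim_to(True, min_last_epoch_clean=500, first_committed=100))
```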

Thanks,
Bartek

* Re: Understanding mon space usage during recovery
  2016-06-08  9:25 Understanding mon space usage during recovery Bartłomiej Święcki
@ 2016-06-08 15:33 ` Gregory Farnum
  2016-06-10  8:58   ` Bartłomiej Święcki
  0 siblings, 1 reply; 3+ messages in thread
From: Gregory Farnum @ 2016-06-08 15:33 UTC (permalink / raw)
  To: Bartłomiej Święcki; +Cc: Ceph Development

On Wed, Jun 8, 2016 at 2:25 AM, Bartłomiej Święcki
<bartlomiej.swiecki@corp.ovh.com> wrote:
> Hi,
>
> I was recently trying to understand the growth of mon disk space usage
> during recovery in one of our clusters,
> wanted to know whether we could reduce disk usage somehow or if we just have
> to prepare more space for our mons.
> Cluster is 0.94.6, just over 300 OSDs. Leveldb compaction does reduce space
> usage but it quickly grows back
> to the previous usage. What I found out is that most of the leveldb data is
> used by osdmap history.
>
> For each osdmap version leveldb contains both full and incremental entry so
> I was thinking if we really need to
> store full osdmaps for all versions? If we're having incremental changes for
> every version anyway, wouldn't it be
> sufficient to keep first full version only and then recover any future ones
> by applying incrementals?

Maybe not; we've gone back and forth on this but I think we ended up
learning that reconstructing them was just annoying in terms of
needing to read all the extra keys.
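
A toy sketch of that cost, assuming a store that keeps one full map
plus one incremental per later epoch. The key names ("full_N", "inc_N")
and the dict-based map are invented for illustration and are not the
monitor's actual key layout or encoding. Rebuilding a map far from the
base means one key read per intervening epoch.

```python
from typing import Dict, Tuple

# Hypothetical store: key name -> map data (a plain dict standing in for
# an encoded osdmap or incremental).
Store = Dict[str, dict]


def rebuild_full_map(store: Store, base_epoch: int,
                     target_epoch: int) -> Tuple[dict, int]:
    """Rebuild the map at target_epoch from one full map plus incrementals.

    Returns the rebuilt map and the number of keys that had to be read.
    """
    full = dict(store[f"full_{base_epoch}"])
    reads = 1
    for epoch in range(base_epoch + 1, target_epoch + 1):
        full.update(store[f"inc_{epoch}"])  # apply one incremental
        reads += 1
    return full, reads


# One full map at epoch 1, then an incremental for each epoch up to 1000.
store: Store = {"full_1": {"osd.0": "up", "osd.1": "up"}}
for e in range(2, 1001):
    store[f"inc_{e}"] = {f"osd.{e % 2}": "down" if e % 3 == 0 else "up"}

full, reads = rebuild_full_map(store, 1, 1000)
print(reads)  # 1000 key reads to serve a single map request
```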

>
> I was also trying to understand how ceph figures out the range of osdmap
> versions to keep. After analyzing the code
> I thought the obvious answer was in PGMap::calc_min_last_epoch_clean() - In
> case of our production cluster,
> the difference between min and max clean epochs was around 30k during
> recovery, size of one full osdmap blob
> in leveldb is around 250k.

Yeah, there's not a lot that can be done about this directly. 30k maps
is an awful lot though; you probably have other issues happening in
your OSDs (or monitors?).

>
> I also tried to test this on my dev cluster where I could run gdb (15 OSD, 4
> OSD nearfull and lots of misplaced objects).
> What I found out is that execution in OSDmonitor::get_trim_to() almost never
> jumped inside the first 'if'.
> mon->pgmon()->is_readable() returns false, I did debug it once and it was a
> result of false returned by Paxos::is_lease_valid().

Okay, that's bad. If your lease isn't valid, then the monitors are
getting so bogged down that they're timing out the leases and
temporarily breaking quorum. You should figure out if this is a load
issue or a result of clock skew issues or something.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Understanding mon space usage during recovery
  2016-06-08 15:33 ` Gregory Farnum
@ 2016-06-10  8:58   ` Bartłomiej Święcki
  0 siblings, 0 replies; 3+ messages in thread
From: Bartłomiej Święcki @ 2016-06-10  8:58 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

Thanks for the hints, replies inline.

On 06/08/2016 05:33 PM, Gregory Farnum wrote:
> On Wed, Jun 8, 2016 at 2:25 AM, Bartłomiej Święcki
> <bartlomiej.swiecki@corp.ovh.com> wrote:
>> Hi,
>>
>> I was recently trying to understand the growth of mon disk space usage
>> during recovery in one of our clusters,
>> wanted to know whether we could reduce disk usage somehow or if we just have
>> to prepare more space for our mons.
>> Cluster is 0.94.6, just over 300 OSDs. Leveldb compaction does reduce space
>> usage but it quickly grows back
>> to the previous usage. What I found out is that most of the leveldb data is
>> used by osdmap history.
>>
>> For each osdmap version leveldb contains both full and incremental entry so
>> I was thinking if we really need to
>> store full osdmaps for all versions? If we're having incremental changes for
>> every version anyway, wouldn't it be
>> sufficient to keep first full version only and then recover any future ones
>> by applying incrementals?
> Maybe not; we've gone back and forth on this but I think we ended up
> learning that reconstructing them was just annoying in terms of
> needing to read all the extra keys.
So I think we can't get away without reserving more space for the mons, then.
>> I was also trying to understand how ceph figures out the range of osdmap
>> versions to keep. After analyzing the code
>> I thought the obvious answer was in PGMap::calc_min_last_epoch_clean() - In
>> case of our production cluster,
>> the difference between min and max clean epochs was around 30k during
>> recovery, size of one full osdmap blob
>> in leveldb is around 250k.
> Yeah, there's not a lot that can be done about this directly. 30k maps
> is an awful lot though; you probably have other issues happening in
> your OSDs (or monitors?).
There were no other issues, but it takes more than a day to finish
rebalancing the cluster. That would mean a new osdmap would have to be
created about once every 3 seconds - is that possible? What events could
cause the osdmap to be updated?
The cluster had noscrub and nodeep-scrub set to speed up the recovery.
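
The "once every 3 seconds" figure comes from simple division. A minimal
check, assuming the roughly 30k epochs accumulated over about 24 hours
(the recovery actually took somewhat longer, which would make the
interval a bit larger); the function name is made up for illustration:

```python
def seconds_per_osdmap(epochs: int, duration_hours: float) -> float:
    """Average interval between new osdmap epochs over a time window."""
    return duration_hours * 3600 / epochs


# ~30k new epochs over ~24 hours of rebalancing (assumed window)
print(f"{seconds_per_osdmap(30_000, 24):.2f} s per osdmap")
```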
>> I also tried to test this on my dev cluster where I could run gdb (15 OSD, 4
>> OSD nearfull and lots of misplaced objects).
>> What I found out is that execution in OSDmonitor::get_trim_to() almost never
>> jumped inside the first 'if'.
>> mon->pgmon()->is_readable() returns false, I did debug it once and it was a
>> result of false returned by Paxos::is_lease_valid().
> Okay, that's bad. If your lease isn't valid, then the monitors are
> getting so bogged down that they're timing out the leases and
> temporarily breaking quorum. You should figure out if this is a load
> issue or a result of clock skew issues or something.
> -Greg
I'll try to get some more details on that. I'm running a dedicated NTP
server to sync all nodes, so I guess clock skew is not an issue here.

Bartek
