* Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
@ 2014-12-31 16:21 Wido den Hollander
2015-01-01 9:30 ` Stefan Priebe
2015-01-07 16:51 ` Dan van der Ster
0 siblings, 2 replies; 7+ messages in thread
From: Wido den Hollander @ 2014-12-31 16:21 UTC (permalink / raw)
To: ceph-devel
Hi,
Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly
0.80.7 and after the upgrade there was a severe performance drop on the
cluster.
It started raining slow requests after the upgrade and most of them
included a 'snapc' in the request.
That led me to investigate the RBD snapshots and I found that a rogue
process had created ~1800 snapshots spread out over 200 volumes.
One image even had 181 snapshots!
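A minimal sketch of how such a snapshot census could be taken, assuming the
Python rados/rbd bindings are installed. The function names, the pool name
"rbd", the config path and the threshold are illustrative assumptions, not
what was actually run on this cluster.

```python
from collections import Counter


def snapshot_census(snaps_by_image, threshold=50):
    """Summarise {image: [snapshot names]}.

    Returns (total snapshot count, [(image, count), ...] for images at or
    above the threshold, largest first).
    """
    counts = Counter({image: len(snaps) for image, snaps in snaps_by_image.items()})
    total = sum(counts.values())
    heavy = [(image, n) for image, n in counts.most_common() if n >= threshold]
    return total, heavy


def collect_snapshots(pool="rbd", conffile="/etc/ceph/ceph.conf"):
    """Gather {image: [snapshot names]} for one pool (hypothetical helper)."""
    # Imported lazily so snapshot_census() is usable without Ceph installed.
    import rados
    import rbd

    result = {}
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            for name in rbd.RBD().list(ioctx):
                image = rbd.Image(ioctx, name)
                try:
                    result[name] = [snap["name"] for snap in image.list_snaps()]
                finally:
                    image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
    return result
```

Something like snapshot_census(collect_snapshots()) would then give the total
and the list of images carrying an unusual number of snapshots.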
As the snapshots weren't used I removed them all and after the snapshots
were removed the performance of the cluster returned to normal.
I'm wondering what changed between Dumpling and Firefly which caused
this? I saw OSDs spiking to 100% disk util constantly under Firefly
where this didn't happen with Dumpling.
Did something change in the way OSDs handle RBD snapshots which causes
them to create more disk I/O?
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
* Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
2014-12-31 16:21 Higher OSD disk util due to RBD snapshots from Dumpling to Firefly Wido den Hollander
@ 2015-01-01 9:30 ` Stefan Priebe
2015-01-02 16:49 ` Samuel Just
2015-01-07 16:51 ` Dan van der Ster
1 sibling, 1 reply; 7+ messages in thread
From: Stefan Priebe @ 2015-01-01 9:30 UTC (permalink / raw)
To: Wido den Hollander, ceph-devel
Hi,
On 31.12.2014 at 17:21, Wido den Hollander wrote:
> Hi,
>
> Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly
> 0.80.7 and after the upgrade there was a severe performance drop on the
> cluster.
>
> It started raining slow requests after the upgrade and most of them
> included a 'snapc' in the request.
>
> That led me to investigate the RBD snapshots and I found that a rogue
> process had created ~1800 snapshots spread out over 200 volumes.
>
> One image even had 181 snapshots!
>
> As the snapshots weren't used I removed them all and after the snapshots
> were removed the performance of the cluster returned to normal.
>
> I'm wondering what changed between Dumpling and Firefly which caused
> this? I saw OSDs spiking to 100% disk util constantly under Firefly
> where this didn't happen with Dumpling.
>
> Did something change in the way OSDs handle RBD snapshots which causes
> them to create more disk I/O?
I saw the same, and additionally a slowdown in librbd too. That's why I'm
still on Dumpling and won't upgrade until Hammer.
Stefan
* Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
2015-01-01 9:30 ` Stefan Priebe
@ 2015-01-02 16:49 ` Samuel Just
2015-01-02 18:43 ` Stefan Priebe
0 siblings, 1 reply; 7+ messages in thread
From: Samuel Just @ 2015-01-02 16:49 UTC (permalink / raw)
To: Stefan Priebe, Josh Durgin; +Cc: Wido den Hollander, ceph-devel
Odd, sounds like it might be rbd client side?
-Sam
* Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
2015-01-02 16:49 ` Samuel Just
@ 2015-01-02 18:43 ` Stefan Priebe
2015-01-02 19:02 ` Samuel Just
0 siblings, 1 reply; 7+ messages in thread
From: Stefan Priebe @ 2015-01-02 18:43 UTC (permalink / raw)
To: sjust, Josh Durgin; +Cc: Wido den Hollander, ceph-devel
On 02.01.2015 at 17:49, Samuel Just wrote:
> Odd, sounds like it might be rbd client side?
> -Sam
That one was already on list:
https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg19091.html
Sadly there was no result, as it went unseen for 2 weeks and I no longer
had the test equipment.
Greets,
Stefan
* Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
2015-01-02 18:43 ` Stefan Priebe
@ 2015-01-02 19:02 ` Samuel Just
0 siblings, 0 replies; 7+ messages in thread
From: Samuel Just @ 2015-01-02 19:02 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Josh Durgin, Wido den Hollander, ceph-devel
That may not be related.
-Sam
* Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
2014-12-31 16:21 Higher OSD disk util due to RBD snapshots from Dumpling to Firefly Wido den Hollander
2015-01-01 9:30 ` Stefan Priebe
@ 2015-01-07 16:51 ` Dan van der Ster
2015-01-08 7:55 ` Wido den Hollander
1 sibling, 1 reply; 7+ messages in thread
From: Dan van der Ster @ 2015-01-07 16:51 UTC (permalink / raw)
To: Wido den Hollander; +Cc: ceph-devel
Hi Wido,
I've been trying to reproduce this but haven't been able to yet.
What I've tried so far is to use fio rbd with a 0.80.7 client connected
to a 0.80.7 cluster. I created a 10GB format 2 block device, then
measured the 4k randwrite iops before and after having snaps. I
measured around 2000 iops to the image before any snapshots, then
created 200 snapshots on the device and ran fio again. Initially the
iops were low (I guess this is from the 4MB CoW resulting from the
first 4k write to each underlying object). But eventually the speed
stabilized to around 2000 iops again. Actually the initial slowdown
was the same whether I created 1 snapshot or 200.
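A rough sketch of the arithmetic behind that guess. It assumes 10GB means
10 GiB, the 4 MiB object size mentioned above, uniformly random aligned 4k
writes, and that the first write to an object after a snapshot copies the
whole object once, however many snapshots exist. The function name and
parameters are illustrative, and no measured numbers are implied.

```python
def cow_estimate(writes, image_bytes=10 * 1024**3,
                 object_bytes=4 * 1024**2, write_bytes=4 * 1024):
    """Estimate copy-on-write cost for random writes after a snapshot.

    Returns (objects in image, bytes copied per first-touch write divided
    by bytes written, expected number of distinct objects touched so far).
    """
    objects = image_bytes // object_bytes
    # A first 4k write that forces a full-object copy moves this many
    # times more data than the write itself.
    amplification = object_bytes // write_bytes
    # Expected distinct objects hit by `writes` uniformly random writes.
    touched = objects * (1 - (1 - 1 / objects) ** writes)
    return objects, amplification, touched


objects, amplification, touched = cow_estimate(writes=20000)
print(objects, amplification, round(touched))
```

Under these assumptions the copy cost is bounded by the number of objects in
the image, not by the number of snapshots. Once most objects have been
touched, further writes no longer trigger copies, which would fit the iops
recovering after the initial dip.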
This was just a quick subjective test so far, since from your report I
was expecting something obvious to stick out. But it appears pretty
OK, no? Would you have expected something different from these tests?
Cheers, Dan
* Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly
2015-01-07 16:51 ` Dan van der Ster
@ 2015-01-08 7:55 ` Wido den Hollander
0 siblings, 0 replies; 7+ messages in thread
From: Wido den Hollander @ 2015-01-08 7:55 UTC (permalink / raw)
To: Dan van der Ster; +Cc: ceph-devel
On 01/07/2015 05:51 PM, Dan van der Ster wrote:
> Hi Wido,
> I've been trying to reproduce this but haven't been able to yet.
>
> What I've tried so far is to use fio rbd with a 0.80.7 client connected
> to a 0.80.7 cluster. I created a 10GB format 2 block device, then
> measured the 4k randwrite iops before and after having snaps. I
> measured around 2000 iops to the image before any snapshots, then
> created 200 snapshots on the device and ran fio again. Initially the
> iops were low (I guess this is from the 4MB CoW resulting from the
> first 4k write to each underlying object). But eventually the speed
> stabilized to around 2000 iops again. Actually the initial slowdown
> was the same whether I created 1 snapshot or 200.
>
> This was just a quick subjective test so far, since from your report I
> was expecting something obvious to stick out. But it appears pretty
> OK, no? Would you have expected something different from these tests?
>
Well, I'm not sure what to expect. But what I noticed is that when I
removed all the snapshots the slow requests were gone and the disk util
dropped on the OSDs.
Wido