long running workloads/rbd_fsx_cache_writeback.yaml on hammer

* long running workloads/rbd_fsx_cache_writeback.yaml on hammer
@ 2015-05-08  9:07 Loic Dachary
  2015-05-08 12:36 ` Jason Dillaman
  0 siblings, 1 reply; 7+ messages in thread
From: Loic Dachary @ 2015-05-08  9:07 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Ceph Development

Hi Jason,

There is a long-running (12h+) job at

http://pulpito.ceph.com/loic-2015-05-07_09:46:27-rbd-hammer-backports---basic-multi/878799/

which corresponds to

 rbd/thrash/{base/install.yaml clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/default.yaml workloads/rbd_fsx_cache_writeback.yaml}

and runs on the current hammer-backports branch, which contains a few librbd backports:

http://tracker.ceph.com/issues/11492#Teuthology-run-commit-commita79146fc3cae28bf4c07478fb4566b06942da60d-hammer-backports-branch-May-2015

I don't see an obvious cause for the error in the logs. Does it ring a bell? Is it supposed to take that long? Note that all the other jobs completed successfully (see http://pulpito.ceph.com/loic-2015-05-07_09:46:27-rbd-hammer-backports---basic-multi/ for details).

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: long running workloads/rbd_fsx_cache_writeback.yaml on hammer
From: Jason Dillaman @ 2015-05-08 12:36 UTC
  To: Loic Dachary; +Cc: Ceph Development

Nice catch.  It is deadlocked in a code path between handling a watch/notify error from librados and flushing the internal cache.  The issue already appears to exist in the hammer branch.

-- 

Jason Dillaman 
Red Hat 
dillaman@redhat.com 
http://www.redhat.com 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: long running workloads/rbd_fsx_cache_writeback.yaml on hammer
From: Loic Dachary @ 2015-05-08 14:34 UTC
  To: Jason Dillaman; +Cc: Ceph Development

Hi,

On 08/05/2015 14:36, Jason Dillaman wrote:
> Nice catch.  It is deadlocked in a code path between handling a watch/notify error from librados and flushing the internal cache.  The issue already appears to exist in the hammer branch.

I'm glad it helped you find a bug :-) With regard to the upcoming v0.94.2, do you think we should wait until this is fixed? Or is it rare enough that the fix can wait until v0.94.3? I'm asking because this bug seems to be the only potential blocker for v0.94.2. It's OK either way; just let me know.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: long running workloads/rbd_fsx_cache_writeback.yaml on hammer
From: Jason Dillaman @ 2015-05-08 15:03 UTC
  To: Loic Dachary; +Cc: Ceph Development, Josh Durgin

Since it sounds like the next hammer release is more or less ready, my vote would be that the fix can wait until the next release.  Since it will only occur if the new exclusive lock feature is enabled (disabled on images by default) and the connection between librbd and the OSD is reset with writeback data waiting in the cache, it sounds like a rare enough issue.

Final decision rests w/ Josh.

-- 

Jason Dillaman 
Red Hat 
dillaman@redhat.com 
http://www.redhat.com 

* Re: long running workloads/rbd_fsx_cache_writeback.yaml on hammer
From: Loic Dachary @ 2015-05-08 15:13 UTC
  To: Jason Dillaman; +Cc: Ceph Development, Josh Durgin

On 08/05/2015 17:03, Jason Dillaman wrote:
> Since it sounds like the next hammer release is more or less ready, my vote would be that the fix can wait until the next release.  Since it will only occur if the new exclusive lock feature is enabled (disabled on images by default) and the connection between librbd and the OSD is reset with writeback data waiting in the cache, it sounds like a rare enough issue.
> 
> Final decision rests w/ Josh.

OK :-) Would you be so kind as to point me to the corresponding issue in the tracker for reference?

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: long running workloads/rbd_fsx_cache_writeback.yaml on hammer
From: Josh Durgin @ 2015-05-08 16:07 UTC
  To: Loic Dachary, Jason Dillaman; +Cc: Ceph Development

On 05/08/2015 08:13 AM, Loic Dachary wrote:
>
>
> On 08/05/2015 17:03, Jason Dillaman wrote:
>> Since it sounds like the next hammer release is more or less ready, my vote would be that the fix can wait until the next release.  Since it will only occur if the new exclusive lock feature is enabled (disabled on images by default) and the connection between librbd and the OSD is reset with writeback data waiting in the cache, it sounds like a rare enough issue.
>>
>> Final decision rests w/ Josh.
>
> OK :-) Would you be so kind as to point me to the corresponding issue in the tracker for reference?

I agree with Jason; this doesn't need to block the release. I think the
issue he is referring to is described in the last sentence of
http://tracker.ceph.com/issues/11537

Josh

* Re: long running workloads/rbd_fsx_cache_writeback.yaml on hammer
From: Yuri Weinstein @ 2015-05-08 16:57 UTC
  To: Josh Durgin; +Cc: Loic Dachary, Jason Dillaman, Ceph Development

In which suite(s) do we anticipate failures for next week's hammer QE validation?

Thx
YuriW

