getting ready for jewel 10.2.1

* getting ready for jewel 10.2.1
@ 2016-03-30 10:30 Loic Dachary
  2016-03-30 10:45 ` Abhishek Varshney
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Loic Dachary @ 2016-03-30 10:30 UTC (permalink / raw)
  To: Abhishek Varshney; +Cc: Ceph Development

Hi,

Now is a good time to get ready for jewel 10.2.1 and I created http://tracker.ceph.com/issues/15317 for that purpose. The goal is to be able to run as many suites as possible on OpenStack, so that we do not have to wait days (sometimes a week) for runs to complete on Sepia. Best case scenario, all OpenStack-specific problems are fixed by the time 10.2.1 is being prepared. Worst case scenario, there is no time to fix issues and we keep using the sepia lab. I guess we'll end up somewhere in the middle: some suites will run fine on OpenStack and we'll use sepia for others.

In a previous mail I voiced my concerns about developers' lack of interest in teuthology job failures that are caused by variations in the infrastructure. I still have no clue how to convey my belief that it is important for teuthology jobs to succeed despite infrastructure variations. But instead of just giving up and doing nothing, I will work on that for the rados suite and hope things will evolve in a good way. To be honest, figuring out http://tracker.ceph.com/issues/15236 and seeing a good run of the rados suite on jewel as a result renewed my motivation in that area :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: getting ready for jewel 10.2.1
  2016-03-30 10:30 getting ready for jewel 10.2.1 Loic Dachary
@ 2016-03-30 10:45 ` Abhishek Varshney
  2016-03-30 18:47 ` Gregory Farnum
  2016-03-31 15:49 ` John Spray
  2 siblings, 0 replies; 9+ messages in thread
From: Abhishek Varshney @ 2016-03-30 10:45 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Hi Loic,

On Wed, Mar 30, 2016 at 4:00 PM, Loic Dachary <loic@dachary.org> wrote:
> Hi,
>
> Now is a good time to get ready for jewel 10.2.1 and I created http://tracker.ceph.com/issues/15317 for that purpose.

Thanks for creating the tracker issue.

> The goal is to be able to run as many suites as possible on OpenStack,
> so that we do not have to wait days (sometimes a week) for runs to
> complete on Sepia. Best case scenario, all OpenStack-specific problems
> are fixed by the time 10.2.1 is being prepared. Worst case scenario,
> there is no time to fix issues and we keep using the sepia lab. I
> guess we'll end up somewhere in the middle: some suites will run fine
> on OpenStack and we'll use sepia for others.

Let's aim for the stars :)

>
> In a previous mail I voiced my concerns regarding the lack of interest of developers regarding teuthology job failures that are cause by variations in the infrastructure. I still have no clue how to convey my belief that it is important for teuthology jobs to succeed despite infrastructure variations. But instead of just giving up and do nothing, I will work on that for the rados suite and hope things will evolve in a good way. To be honest, figuring out http://tracker.ceph.com/issues/15236 and seeing a good run of the rados suite on jewel as a result renewed my motivation in that area :-)
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre

Thanks
Abhishek

* Re: getting ready for jewel 10.2.1
  2016-03-30 10:30 getting ready for jewel 10.2.1 Loic Dachary
  2016-03-30 10:45 ` Abhishek Varshney
@ 2016-03-30 18:47 ` Gregory Farnum
  2016-03-31 14:31   ` Loic Dachary
  2016-03-31 15:49 ` John Spray
  2 siblings, 1 reply; 9+ messages in thread
From: Gregory Farnum @ 2016-03-30 18:47 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Abhishek Varshney, Ceph Development

On Wed, Mar 30, 2016 at 3:30 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi,
>
> Now is a good time to get ready for jewel 10.2.1 and I created http://tracker.ceph.com/issues/15317 for that purpose. The goal is to be able to run as many suites as possible on OpenStack, so that we do not have to wait days (sometime a week) for runs to complete on Sepia. Best case scenario, all OpenStack specific problems are fixed by the time 10.2.1 is being prepared. Worst case scenario there is no time to fix issues and we keep using the sepia lab. I guess we'll end up somewhere in the middle : some suites will run fine on Openstack and we'll use sepia for others.
>
> In a previous mail I voiced my concerns regarding the lack of interest of developers regarding teuthology job failures that are cause by variations in the infrastructure. I still have no clue how to convey my belief that it is important for teuthology jobs to succeed despite infrastructure variations. But instead of just giving up and do nothing, I will work on that for the rados suite and hope things will evolve in a good way. To be honest, figuring out http://tracker.ceph.com/issues/15236 and seeing a good run of the rados suite on jewel as a result renewed my motivation in that area :-)

I think you've convinced us all it's important in the abstract; that's
just very different from putting it on top of our list of priorities,
especially since we alleviated many of our needs in the sepia lab.
Beyond that, a lot of the issues we're seeing have very little to do
with Ceph itself, or even the testing programs, and that can make it
more difficult to get interested as we lack the necessary expertise. I
spent some time trying to get disk sizes and things matched up (and I
suddenly realize that never got merged), but some of the other odder
issues we're having:

http://tracker.ceph.com/issues/13980, in which we are failing to mount
anything with NFS v3. This is a config file that needs to get updated;
we do it for the sepia lab (probably in Ansible?) but somehow that
information isn't getting into the OVH slaves. (Or else it is in
there, and there's something *else* broken.) If we are using a
different setup regimen for OpenStack than for the sepia lab,
there will be persistent breakage as new dependencies and
environmental expectations get added to one and not the other. :/

http://tracker.ceph.com/issues/13876, in which MPI is just failing to
get any connections going. Why? No idea; there's a teuthology commit
from you that's supposed to have opened up all the ports in the
firewall (and it sure *looks* like it does that, but I don't know
how the rules work), but this works in sepia and, inasmuch as we have
debugging info, it sure looks like some kind of network blockage...

So I think this isn't something that's going to get done properly
unless somebody gets assigned to just make everything work in all the
suites, who has the time to learn all the fiddly little bits. (Or we
somehow take a break for it as a project. But I don't see that going
well.) :/
-Greg

* Re: getting ready for jewel 10.2.1
  2016-03-30 18:47 ` Gregory Farnum
@ 2016-03-31 14:31   ` Loic Dachary
  2016-03-31 19:01     ` Gregory Farnum
  0 siblings, 1 reply; 9+ messages in thread
From: Loic Dachary @ 2016-03-31 14:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

Hi Gregory,

On 30/03/2016 20:47, Gregory Farnum wrote:
> On Wed, Mar 30, 2016 at 3:30 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi,
>>
>> Now is a good time to get ready for jewel 10.2.1 and I created http://tracker.ceph.com/issues/15317 for that purpose. The goal is to be able to run as many suites as possible on OpenStack, so that we do not have to wait days (sometime a week) for runs to complete on Sepia. Best case scenario, all OpenStack specific problems are fixed by the time 10.2.1 is being prepared. Worst case scenario there is no time to fix issues and we keep using the sepia lab. I guess we'll end up somewhere in the middle : some suites will run fine on Openstack and we'll use sepia for others.
>>
>> In a previous mail I voiced my concerns regarding the lack of interest of developers regarding teuthology job failures that are cause by variations in the infrastructure. I still have no clue how to convey my belief that it is important for teuthology jobs to succeed despite infrastructure variations. But instead of just giving up and do nothing, I will work on that for the rados suite and hope things will evolve in a good way. To be honest, figuring out http://tracker.ceph.com/issues/15236 and seeing a good run of the rados suite on jewel as a result renewed my motivation in that area :-)
> 
> I think you've convinced us all it's important in the abstract; that's
> just very different from putting it on top of our list of priorities,
> especially since we alleviated many of our needs in the sepia lab.
> Beyond that, a lot of the issues we're seeing have very little to do
> with Ceph itself, or even the testing programs, and that can make it
> more difficult to get interested as we lack the necessary expertise. I
> spent some time trying to get disk sizes and things matched up (and I
> suddenly realize that never got merged), but some of the other odder
> issues we're having:
> 
> http://tracker.ceph.com/issues/13980, in which we are failing to mount
> anything with nfs v3. This is a config file that needs to get updated;
> we do it for the sepia lab (probably in ansible?) but somehow that
> information isn't getting into the ovh slaves. (Or else it is in
> there, and there's something *else* broken.) If we are using a
> separate setup regimen for OpenStack than we are in the sepia lab
> there will be persistent breakage as new dependencies and
> environmental expectations get added to one and not the other. :/

ceph-cm-ansible does not have any OpenStack-specific instructions. It's supposed to work exactly the same on both sepia and OpenStack. When teuthology provisions an OpenStack target, it does so in the same way it provisions a VPS in sepia. The only difference is that OpenStack uses images that come from http://cloud.centos.org/centos/7/images/ etc., unmodified. The VPS images have sometimes been modified. However, this has only been an issue once, over six months ago.

On OVH the UDP ports were firewalled, and that created the problem. I changed the firewall rules and I'm hopeful http://pulpito.ovh.sepia.ceph.com:8081/loic-2016-03-31_14:10:18-knfs-jewel-testing-basic-openstack/ will now pass.
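
For anyone who wants to tell a firewalled UDP port apart from a broken service, a round-trip probe is one option. What follows is only a minimal sketch, assuming you control both ends of the connection. The helper names (start_udp_echo, udp_round_trip) are hypothetical and are not part of teuthology, ceph-cm-ansible or the OVH firewall change:

```python
import socket
import threading


def start_udp_echo(host="127.0.0.1"):
    """Start a one-shot UDP echo server on a free port; return (port, thread)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))  # port 0 lets the OS pick a free port
    port = sock.getsockname()[1]

    def serve():
        with sock:
            sock.settimeout(5.0)
            try:
                data, peer = sock.recvfrom(1024)
                sock.sendto(data, peer)  # echo the datagram back once
            except OSError:
                pass  # nothing arrived in time

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return port, thread


def udp_round_trip(host, port, timeout=2.0):
    """Return True if a datagram sent to host:port is echoed back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(b"ping", (host, port))
            data, _ = sock.recvfrom(1024)
        except OSError:  # covers timeouts and ICMP "port unreachable"
            return False
        return data == b"ping"
```

Run the echo side on one node and the probe on the other. A False result only says that no reply came back; it cannot tell a dropped packet from a missing listener.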

> http://tracker.ceph.com/issues/13876, in which MPI is just failing to
> get any connections going. Why? No idea; there's a teuthology commit
> from you that's supposed to have opened up all the ports in the
> firewall (and it sure *looks* like it does do that, but I don't know
> how the rules work), but this works in sepia and inasmuch as we have
> debugging info sure looks like some kind of network blockage...

I opened the required port on the OVH lab. I don't think there is an Ansible rule that does it, but I'll ask Zack to be sure.
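
A quick way to confirm that a TCP port is reachable between two nodes after a firewall change is to attempt a connection with a timeout. This is a hedged sketch, not what teuthology does; the function name tcp_port_open is hypothetical, and the port MPI actually needs is not stated in this thread:

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

A True result means something accepted the connection on that port; False covers both a closed port and a firewall silently dropping packets.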

> So I think this isn't something that's going to get done properly
> unless somebody gets assigned to just make everything work in all the
> suites, who has the time to learn all the fiddly little bits. (Or we
> somehow take a break for it as a project. But I don't see that going
> well.) :/

If you suspect an OpenStack-specific problem, feel free to ping me. There is a good chance I can help and together we can make teuthology happy with OpenStack :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre

* Re: getting ready for jewel 10.2.1
  2016-03-31 14:31   ` Loic Dachary
@ 2016-03-31 19:01     ` Gregory Farnum
  2016-03-31 22:13       ` Loic Dachary
  0 siblings, 1 reply; 9+ messages in thread
From: Gregory Farnum @ 2016-03-31 19:01 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Thu, Mar 31, 2016 at 7:31 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi Gregory,
>
> On 30/03/2016 20:47, Gregory Farnum wrote:
>> On Wed, Mar 30, 2016 at 3:30 AM, Loic Dachary <loic@dachary.org> wrote:
>>> Hi,
>>>
>>> Now is a good time to get ready for jewel 10.2.1 and I created http://tracker.ceph.com/issues/15317 for that purpose. The goal is to be able to run as many suites as possible on OpenStack, so that we do not have to wait days (sometime a week) for runs to complete on Sepia. Best case scenario, all OpenStack specific problems are fixed by the time 10.2.1 is being prepared. Worst case scenario there is no time to fix issues and we keep using the sepia lab. I guess we'll end up somewhere in the middle : some suites will run fine on Openstack and we'll use sepia for others.
>>>
>>> In a previous mail I voiced my concerns regarding the lack of interest of developers regarding teuthology job failures that are cause by variations in the infrastructure. I still have no clue how to convey my belief that it is important for teuthology jobs to succeed despite infrastructure variations. But instead of just giving up and do nothing, I will work on that for the rados suite and hope things will evolve in a good way. To be honest, figuring out http://tracker.ceph.com/issues/15236 and seeing a good run of the rados suite on jewel as a result renewed my motivation in that area :-)
>>
>> I think you've convinced us all it's important in the abstract; that's
>> just very different from putting it on top of our list of priorities,
>> especially since we alleviated many of our needs in the sepia lab.
>> Beyond that, a lot of the issues we're seeing have very little to do
>> with Ceph itself, or even the testing programs, and that can make it
>> more difficult to get interested as we lack the necessary expertise. I
>> spent some time trying to get disk sizes and things matched up (and I
>> suddenly realize that never got merged), but some of the other odder
>> issues we're having:
>>
>> http://tracker.ceph.com/issues/13980, in which we are failing to mount
>> anything with nfs v3. This is a config file that needs to get updated;
>> we do it for the sepia lab (probably in ansible?) but somehow that
>> information isn't getting into the ovh slaves. (Or else it is in
>> there, and there's something *else* broken.) If we are using a
>> separate setup regimen for OpenStack than we are in the sepia lab
>> there will be persistent breakage as new dependencies and
>> environmental expectations get added to one and not the other. :/
>
> ceph-cm-ansible does not have any OpenStack specific instructions. It's supposed to work exactly the same on both sepia and OpenStack. When teuthology provisions an OpenStack target, it does so in the same way it provisions VPS in sepia. The only difference is that OpenStack uses images that come from http://cloud.centos.org/centos/7/images/ etc., unmodified. The VPS images have sometime been modified. However, this has only been an issue once, over six months ago.
>
> On OVH the UDP ports were firewalled, and that created the problem. I changed the firewall rules and I'm hopefull http://pulpito.ovh.sepia.ceph.com:8081/loic-2016-03-31_14:10:18-knfs-jewel-testing-basic-openstack/ will now pass.
>
>> http://tracker.ceph.com/issues/13876, in which MPI is just failing to
>> get any connections going. Why? No idea; there's a teuthology commit
>> from you that's supposed to have opened up all the ports in the
>> firewall (and it sure *looks* like it does do that, but I don't know
>> how the rules work), but this works in sepia and inasmuch as we have
>> debugging info sure looks like some kind of network blockage...
>
> I opened the required port on the OVH lab. I don't think there is an ansible rule that does it but I'll ask Zack to be sure.
>
>> So I think this isn't something that's going to get done properly
>> unless somebody gets assigned to just make everything work in all the
>> suites, who has the time to learn all the fiddly little bits. (Or we
>> somehow take a break for it as a project. But I don't see that going
>> well.) :/
>
> If you suspect an OpenStack specific problem, feel free to ping me. There is a good chance I can help and together we can make teuthology happy with OpenStack :-)

I really wasn't fishing with those, but hey! thanks so much for those fixes. :)

Do we have any way to automate those kinds of things for external
users? It sounds like right now these are just some random things any
third party needs to know to do, or their tests will mysteriously
fail?
-Greg

* Re: getting ready for jewel 10.2.1
  2016-03-31 19:01     ` Gregory Farnum
@ 2016-03-31 22:13       ` Loic Dachary
  0 siblings, 0 replies; 9+ messages in thread
From: Loic Dachary @ 2016-03-31 22:13 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development



On 31/03/2016 21:01, Gregory Farnum wrote:
> On Thu, Mar 31, 2016 at 7:31 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Gregory,
>>
>> On 30/03/2016 20:47, Gregory Farnum wrote:
>>> On Wed, Mar 30, 2016 at 3:30 AM, Loic Dachary <loic@dachary.org> wrote:
>>>> Hi,
>>>>
>>>> Now is a good time to get ready for jewel 10.2.1 and I created http://tracker.ceph.com/issues/15317 for that purpose. The goal is to be able to run as many suites as possible on OpenStack, so that we do not have to wait days (sometime a week) for runs to complete on Sepia. Best case scenario, all OpenStack specific problems are fixed by the time 10.2.1 is being prepared. Worst case scenario there is no time to fix issues and we keep using the sepia lab. I guess we'll end up somewhere in the middle : some suites will run fine on Openstack and we'll use sepia for others.
>>>>
>>>> In a previous mail I voiced my concerns regarding the lack of interest of developers regarding teuthology job failures that are cause by variations in the infrastructure. I still have no clue how to convey my belief that it is important for teuthology jobs to succeed despite infrastructure variations. But instead of just giving up and do nothing, I will work on that for the rados suite and hope things will evolve in a good way. To be honest, figuring out http://tracker.ceph.com/issues/15236 and seeing a good run of the rados suite on jewel as a result renewed my motivation in that area :-)
>>>
>>> I think you've convinced us all it's important in the abstract; that's
>>> just very different from putting it on top of our list of priorities,
>>> especially since we alleviated many of our needs in the sepia lab.
>>> Beyond that, a lot of the issues we're seeing have very little to do
>>> with Ceph itself, or even the testing programs, and that can make it
>>> more difficult to get interested as we lack the necessary expertise. I
>>> spent some time trying to get disk sizes and things matched up (and I
>>> suddenly realize that never got merged), but some of the other odder
>>> issues we're having:
>>>
>>> http://tracker.ceph.com/issues/13980, in which we are failing to mount
>>> anything with nfs v3. This is a config file that needs to get updated;
>>> we do it for the sepia lab (probably in ansible?) but somehow that
>>> information isn't getting into the ovh slaves. (Or else it is in
>>> there, and there's something *else* broken.) If we are using a
>>> separate setup regimen for OpenStack than we are in the sepia lab
>>> there will be persistent breakage as new dependencies and
>>> environmental expectations get added to one and not the other. :/
>>
>> ceph-cm-ansible does not have any OpenStack specific instructions. It's supposed to work exactly the same on both sepia and OpenStack. When teuthology provisions an OpenStack target, it does so in the same way it provisions VPS in sepia. The only difference is that OpenStack uses images that come from http://cloud.centos.org/centos/7/images/ etc., unmodified. The VPS images have sometime been modified. However, this has only been an issue once, over six months ago.
>>
>> On OVH the UDP ports were firewalled, and that created the problem. I changed the firewall rules and I'm hopefull http://pulpito.ovh.sepia.ceph.com:8081/loic-2016-03-31_14:10:18-knfs-jewel-testing-basic-openstack/ will now pass.
>>
>>> http://tracker.ceph.com/issues/13876, in which MPI is just failing to
>>> get any connections going. Why? No idea; there's a teuthology commit
>>> from you that's supposed to have opened up all the ports in the
>>> firewall (and it sure *looks* like it does do that, but I don't know
>>> how the rules work), but this works in sepia and inasmuch as we have
>>> debugging info sure looks like some kind of network blockage...
>>
>> I opened the required port on the OVH lab. I don't think there is an ansible rule that does it but I'll ask Zack to be sure.
>>
>>> So I think this isn't something that's going to get done properly
>>> unless somebody gets assigned to just make everything work in all the
>>> suites, who has the time to learn all the fiddly little bits. (Or we
>>> somehow take a break for it as a project. But I don't see that going
>>> well.) :/
>>
>> If you suspect an OpenStack specific problem, feel free to ping me. There is a good chance I can help and together we can make teuthology happy with OpenStack :-)
> 
> I really wasn't fishing with those, but hey! thanks so much for those fixes. :)
> 
> Do we have any way to automate those kinds of things for external
> users? It sounds like right now these are just some random things any
> third party needs to know to do, or their tests will mysteriously
> fail?

https://github.com/ceph/teuthology/pull/834/files automates that.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre

* Re: getting ready for jewel 10.2.1
  2016-03-30 10:30 getting ready for jewel 10.2.1 Loic Dachary
  2016-03-30 10:45 ` Abhishek Varshney
  2016-03-30 18:47 ` Gregory Farnum
@ 2016-03-31 15:49 ` John Spray
  2016-04-01 15:18   ` Sage Weil
  2 siblings, 1 reply; 9+ messages in thread
From: John Spray @ 2016-03-31 15:49 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Abhishek Varshney, Ceph Development

On Wed, Mar 30, 2016 at 11:30 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi,
>
> Now is a good time to get ready for jewel 10.2.1 and I created http://tracker.ceph.com/issues/15317 for that purpose. The goal is to be able to run as many suites as possible on OpenStack, so that we do not have to wait days (sometime a week) for runs to complete on Sepia. Best case scenario, all OpenStack specific problems are fixed by the time 10.2.1 is being prepared. Worst case scenario there is no time to fix issues and we keep using the sepia lab. I guess we'll end up somewhere in the middle : some suites will run fine on Openstack and we'll use sepia for others.
>
> In a previous mail I voiced my concerns regarding the lack of interest of developers regarding teuthology job failures that are cause by variations in the infrastructure. I still have no clue how to convey my belief that it is important for teuthology jobs to succeed despite infrastructure variations. But instead of just giving up and do nothing, I will work on that for the rados suite and hope things will evolve in a good way. To be honest, figuring out http://tracker.ceph.com/issues/15236 and seeing a good run of the rados suite on jewel as a result renewed my motivation in that area :-)

If I was dedicating time to working on lab infrastructure, I think I
would prioritise stabilising the existing sepia lab.  I still see
infrastructure issues (these days usually package install failures)
sprinkled all over the place, so I have to question the value of
spreading ourselves even more thinly by trying to handle multiple
environments with their different quirks.

I have nothing against the openstack work, it is a good tool, but I
don't think it was wise to just deploy it and expect other developers
to handle the issues.  I would have liked to see at least one passing
filesystem run on openstack before regular nightlies were scheduled on
it.  Maybe now that we have fixes for #13980 and #13876 we will see a
passing run, and can get more of a sense of how stable/unstable these
tests are in the openstack environment: I think it's likely that we
will continue to see timeouts/instability from the comparatively
underpowered nodes.

John

> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre

* Re: getting ready for jewel 10.2.1
  2016-03-31 15:49 ` John Spray
@ 2016-04-01 15:18   ` Sage Weil
  2016-04-01 16:19     ` John Spray
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2016-04-01 15:18 UTC (permalink / raw)
  To: John Spray; +Cc: Loic Dachary, Abhishek Varshney, Ceph Development

On Thu, 31 Mar 2016, John Spray wrote:
> On Wed, Mar 30, 2016 at 11:30 AM, Loic Dachary <loic@dachary.org> wrote:
> > Hi,
> >
> > Now is a good time to get ready for jewel 10.2.1 and I created 
> > http://tracker.ceph.com/issues/15317 for that purpose. The goal is to 
> > be able to run as many suites as possible on OpenStack, so that we do 
> > not have to wait days (sometime a week) for runs to complete on Sepia. 
> > Best case scenario, all OpenStack specific problems are fixed by the 
> > time 10.2.1 is being prepared. Worst case scenario there is no time to 
> > fix issues and we keep using the sepia lab. I guess we'll end up 
> > somewhere in the middle : some suites will run fine on Openstack and 
> > we'll use sepia for others.
> >
> > In a previous mail I voiced my concerns regarding the lack of interest 
> > of developers regarding teuthology job failures that are cause by 
> > variations in the infrastructure. I still have no clue how to convey 
> > my belief that it is important for teuthology jobs to succeed despite 
> > infrastructure variations. But instead of just giving up and do 
> > nothing, I will work on that for the rados suite and hope things will 
> > evolve in a good way. To be honest, figuring out 
> > http://tracker.ceph.com/issues/15236 and seeing a good run of the 
> > rados suite on jewel as a result renewed my motivation in that area 
> > :-)
> 
> If I was dedicating time to working on lab infrastructure, I think I
> would prioritise stabilising the existing sepia lab.  I still see
> infrastructure issues (these day usually package install failures)
> sprinkled all over the place, so I have to question the value of
> spreading ourselves even more thinly by trying to handle multiple
> environments with their different quirks.

I think we can't afford not to do both.  The problem with focusing only on 
sepia is that it prevents new contributors from testing their 
code, and testing is one of the key pieces preventing us from scaling 
our overall development velocity.

Also, FWIW, Sam sank a couple days this week into improvements on the 
sepia side that have eliminated almost all of the sepia package install 
noise we've been seeing (at least on the rados suite).  With Jewel 
stabilizing, now is a good time to do the same with openstack.

> I have nothing against the openstack work, it is a good tool, but I
> don't think it was wise to just deploy it and expect other developers
> to handle the issues.  I would have liked to see at least one passing
> filesystem run on openstack before regular nightlies were scheduled on
> it.  Maybe now that we have fixes for #13980 and #13876 we will see a
> passing run, and can get more of a sense of how stable/unstable these
> tests are in the openstack environment: I think it's likely that we
> will continue to see timeouts/instability from the comparatively
> underpowered nodes.

The earlier transition to openstack left much to be desired, although to 
be fair it would have been hard to do it all that differently given we 
were forced out of Irvine by the sepia lab move.  In my view the main 
lesson learned was that without everyone feeling invested in fixing the 
issues to make the tests pass, the issues won't get fixed.  The lab folks 
don't understand all of the tests and their weird issues, and the 
developers are too busy with code to help debug them.

I'd like to convince everyone that making openstack a reliable testing 
environment is an important strategic goal for the project as a whole, and 
in everyone's best interests, and that right now (while we're focusing on 
teuthology tests and waiting for the final blocking jewel bugs to be 
squashed) is as good a time as any to dig into the remaining issues... 
both with sepia *and* openstack.

Is that reasonable?
sage

* Re: getting ready for jewel 10.2.1
  2016-04-01 15:18   ` Sage Weil
@ 2016-04-01 16:19     ` John Spray
  0 siblings, 0 replies; 9+ messages in thread
From: John Spray @ 2016-04-01 16:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, Abhishek Varshney, Ceph Development

On Fri, Apr 1, 2016 at 4:18 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 31 Mar 2016, John Spray wrote:
>> On Wed, Mar 30, 2016 at 11:30 AM, Loic Dachary <loic@dachary.org> wrote:
>> > Hi,
>> >
>> > Now is a good time to get ready for jewel 10.2.1 and I created
>> > http://tracker.ceph.com/issues/15317 for that purpose. The goal is to
>> > be able to run as many suites as possible on OpenStack, so that we do
>> > not have to wait days (sometime a week) for runs to complete on Sepia.
>> > Best case scenario, all OpenStack specific problems are fixed by the
>> > time 10.2.1 is being prepared. Worst case scenario there is no time to
>> > fix issues and we keep using the sepia lab. I guess we'll end up
>> > somewhere in the middle : some suites will run fine on Openstack and
>> > we'll use sepia for others.
>> >
>> > In a previous mail I voiced my concerns regarding the lack of interest
>> > of developers regarding teuthology job failures that are cause by
>> > variations in the infrastructure. I still have no clue how to convey
>> > my belief that it is important for teuthology jobs to succeed despite
>> > infrastructure variations. But instead of just giving up and do
>> > nothing, I will work on that for the rados suite and hope things will
>> > evolve in a good way. To be honest, figuring out
>> > http://tracker.ceph.com/issues/15236 and seeing a good run of the
>> > rados suite on jewel as a result renewed my motivation in that area
>> > :-)
>>
>> If I was dedicating time to working on lab infrastructure, I think I
>> would prioritise stabilising the existing sepia lab.  I still see
>> infrastructure issues (these day usually package install failures)
>> sprinkled all over the place, so I have to question the value of
>> spreading ourselves even more thinly by trying to handle multiple
>> environments with their different quirks.
>
> I think we can't afford not to do both.  The problem with focusing only on
> sepia is that it makes it prevents new contributors from testing their
> code, and testing is one of the key pieces that preventing us from scaling
> our overall development velocity.
>
> Also, FWIW, Sam sank a couple days this week into improvements on the
> sepia side that have eliminated almost all of the sepia package install
> noise we've been seeing (at least on the rados suite).  With Jewel
> stabilizing now is a good time to do the same with openstack.
>
>> I have nothing against the openstack work, it is a good tool, but I
>> don't think it was wise to just deploy it and expect other developers
>> to handle the issues.  I would have liked to see at least one passing
>> filesystem run on openstack before regular nightlies were scheduled on
>> it.  Maybe now that we have fixes for #13980 and #13876 we will see a
>> passing run, and can get more of a sense of how stable/unstable these
>> tests are in the openstack environment: I think it's likely that we
>> will continue to see timeouts/instability from the comparatively
>> underpowered nodes.
>
> The earlier transition to openstack left much to be desired, although to
> be fair it would have been hard to do it all that differently given we
> were forced out of Irvine by the sepia lab move.  In my view the main
> lesson learned was that without everyone feeling invested in fixing the
> issues to make the tests pass, the issues won't get fixed.  The lab folks
> don't understand all of the tests and their weird issues, and the
> developers are too busy with code to help debug them.
>
> I'd like to convince everyone that making openstack a reliable testing
> environment is an importat strategic goal for the project as a whole, and
> in everyone's best interests, and that right now (while we're focusing on
> teuthology tests and waiting for the final blocking jewel bugs to be
> squashed) is as good a time as any to dig into the remaining issues...
> both with sepia *and* openstack.
>
> Is that reasonable?

OK -- if we are committed to doing both, then my opinions about
priorities are kind of academic :-)  If we're all pulling in the same
direction I think it's fine.

Cheers,
John

Thread overview: 9+ messages
2016-03-30 10:30 getting ready for jewel 10.2.1 Loic Dachary
2016-03-30 10:45 ` Abhishek Varshney
2016-03-30 18:47 ` Gregory Farnum
2016-03-31 14:31   ` Loic Dachary
2016-03-31 19:01     ` Gregory Farnum
2016-03-31 22:13       ` Loic Dachary
2016-03-31 15:49 ` John Spray
2016-04-01 15:18   ` Sage Weil
2016-04-01 16:19     ` John Spray
