Scaling Ceph reviews and testing

All of lore.kernel.org
 help / color / mirror / Atom feed

* Scaling Ceph reviews and testing
@ 2015-11-25 22:14 Gregory Farnum
  2015-11-25 23:31 ` Loic Dachary
  2015-11-26  8:56 ` Piotr.Dalek
  0 siblings, 2 replies; 7+ messages in thread
From: Gregory Farnum @ 2015-11-25 22:14 UTC (permalink / raw)
  To: ceph-devel

Everybody,
Ceph is popular! The global community of developers is growing
quickly, and that’s leading to some challenges for our leads and core
development team as we try to absorb incoming pull requests. Over the
past few weeks our leads have discussed (internally and with a few
external contributors) how to improve things, and we wanted to share
some conclusions.

It has been a long-standing requirement that all code be tested by
teuthology before being merged to master. In the past leads have
shouldered a lot of this burden through integration and testing
branches, but it’s become unsustainable in present form: some PRs
which are intended as RFCs are being mistakenly identified as final;
some PRs are submitted which pass cursory sniff tests but fail under
recovery conditions that the teuthology suites cover. To prevent that,
please comment on exactly what testing you’ve performed when
submitting a PR and a justification why that is sufficient to promote
it to integration testing. Be prepared for us to request more specific
testing before doing a careful review if we think it’s warranted: in
general, a run through the applicable regression suite (with new tests
added in a branch if applicable) will be required. Individual teams
and leads will develop specific regression testing requirements in the
near future.
For our most frequent and prolific contributors, we are going to start
expecting that you perform the above testing on your own before we
move on to a serious review or our own integration tests — this should
be much easier thanks to Loic’s work on teuthology-openstack!

It has also been policy that new features and bug fixes are
accompanied by tests which 1) demonstrate functionality and 2) check
failure cases. In this arena some of us have been lax, but nightly
stability has suffered. Some of us have also written tests for
external contributions, but this simply doesn’t scale and we are
cutting back. If you believe that a patch you’ve submitted is already
covered by tests, please point them out. If it’s not covered by
existing testing, write new ones! Specifically, the new feature (or
bug) should be covered by the area’s regression suite.  In most cases,
this will involve an addition to the ceph-qa-suite.  You should link
the branch with the change in the main ceph PR.  Your PR’s testing
should be performed with that ceph-qa-suite branch (since the existing
ceph-qa-suite coverage is presumably insufficient). If you need
guidance on how best to automate testing, ask! If you submit a PR
without these, it will just get bounced back to you and slow everybody
down.

We believe that these adjustments to our merge habits and the workload
distribution will increase code quality, increase throughput, allow
faster merges, and prevent the frequent “lost” PRs requiring rebases
that have been appearing over the last year. That will make Ceph
better for all of us.

Thanks!
-Greg
-Sam
-Yehuda
-Sage
-Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Scaling Ceph reviews and testing
  2015-11-25 22:14 Scaling Ceph reviews and testing Gregory Farnum
@ 2015-11-25 23:31 ` Loic Dachary
  2015-11-26  7:14   ` Piotr.Dalek
  2015-11-26  8:56 ` Piotr.Dalek
  1 sibling, 1 reply; 7+ messages in thread
From: Loic Dachary @ 2015-11-25 23:31 UTC (permalink / raw)
  To: Gregory Farnum, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5502 bytes --]

Hi Greg & Sam & Josh & Sage & Yehuda,

It would be most helpful to validate that the current ceph-qa-suite tests pass on master with the teuthology OpenStack backend, using the lab setup by Zack. The problems, if any, are usually easy to resolve and progress is being made in that direction[1]. 

Even seasoned contributors struggle to understand the logic behind teuthology. Knowing for certain that a given job is known to pass with OpenStack is a major enabler. When a job is supposed to pass with OpenStack but has never been actually verified, it quickly becomes a blocker because the contributor can hardly differentiate that from a bug in his pull request. For instance, today Piotr Dalek had to patiently run a rados/thrash job four times to sort out if the machine crashing came from his pull request or from a lack of memory (8GB by default).

At present I'm confident that the following suites run fine on OpenStack on hammer:

  * upgrade/hammer
  * rados
  * rbd
  * ceph-disk

As part of the work done for the infernalis backports, Abhishek Varshney is running the rados suite on OpenStack and we're figuring out problems together, one at a time. We're making steady but slow progress because it's not our main focus.

The problem, when a job fails with OpenStack, is usually a timing issue (because virtual machines tend to be slower than bare metal) that requires a fix of the test, or a resource issue (because virtual machines are by default 8GB RAM, 40GB disk, 2 cpu and no disk attached) that require the addition of a yaml file like[2]:

openstack:
   - machine:
       ram: 15000
     volumes:
       count: 2
       size: 10

to set the ram of the machines to at least 15GB instead of 8GB and attach two disks, 10GB each to each machine.

Cheers

[1] openstack: rbd/{thrash,qemu}: allocate three disks, always https://github.com/ceph/ceph-qa-suite/pull/727 etc.
[2] Defining instances flavor and volumes https://github.com/dachary/teuthology/tree/openstack#defining-instances-flavor-and-volumes

On 25/11/2015 23:14, Gregory Farnum wrote:
> Everybody,
> Ceph is popular! The global community of developers is growing
> quickly, and that’s leading to some challenges for our leads and core
> development team as we try to absorb incoming pull requests. Over the
> past few weeks our leads have discussed (internally and with a few
> external contributors) how to improve things, and we wanted to share
> some conclusions.
> 
> It has been a long-standing requirement that all code be tested by
> teuthology before being merged to master. In the past leads have
> shouldered a lot of this burden through integration and testing
> branches, but it’s become unsustainable in present form: some PRs
> which are intended as RFCs are being mistakenly identified as final;
> some PRs are submitted which pass cursory sniff tests but fail under
> recovery conditions that the teuthology suites cover. To prevent that,
> please comment on exactly what testing you’ve performed when
> submitting a PR and a justification why that is sufficient to promote
> it to integration testing. Be prepared for us to request more specific
> testing before doing a careful review if we think it’s warranted: in
> general, a run through the applicable regression suite (with new tests
> added in a branch if applicable) will be required. Individual teams
> and leads will develop specific regression testing requirements in the
> near future.
> For our most frequent and prolific contributors, we are going to start
> expecting that you perform the above testing on your own before we
> move on to a serious review or our own integration tests — this should
> be much easier thanks to Loic’s work on teuthology-openstack!
> 
> It has also been policy that new features and bug fixes are
> accompanied by tests which 1) demonstrate functionality and 2) check
> failure cases. In this arena some of us have been lax, but nightly
> stability has suffered. Some of us have also written tests for
> external contributions, but this simply doesn’t scale and we are
> cutting back. If you believe that a patch you’ve submitted is already
> covered by tests, please point them out. If it’s not covered by
> existing testing, write new ones! Specifically, the new feature (or
> bug) should be covered by the area’s regression suite.  In most cases,
> this will involve an addition to the ceph-qa-suite.  You should link
> the branch with the change in the main ceph PR.  Your PR’s testing
> should be performed with that ceph-qa-suite branch (since the existing
> ceph-qa-suite coverage is presumably insufficient). If you need
> guidance on how best to automate testing, ask! If you submit a PR
> without these, it will just get bounced back to you and slow everybody
> down.
> 
> We believe that these adjustments to our merge habits and the workload
> distribution will increase code quality, increase throughput, allow
> faster merges, and prevent the frequent “lost” PRs requiring rebases
> that have been appearing over the last year. That will make Ceph
> better for all of us.
> 
> Thanks!
> -Greg
> -Sam
> -Yehuda
> -Sage
> -Josh
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Scaling Ceph reviews and testing
  2015-11-25 23:31 ` Loic Dachary
@ 2015-11-26  7:14   ` Piotr.Dalek
  0 siblings, 0 replies; 7+ messages in thread
From: Piotr.Dalek @ 2015-11-26  7:14 UTC (permalink / raw)
  To: Loic Dachary, Gregory Farnum, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Loic Dachary
> Sent: Thursday, November 26, 2015 12:32 AM
>
> Even seasoned contributors struggle to understand the logic behind
> teuthology. Knowing for certain that a given job is known to pass with
> OpenStack is a major enabler. When a job is supposed to pass with
> OpenStack but has never been actually verified, it quickly becomes a blocker
> because the contributor can hardly differentiate that from a bug in his pull
> request. For instance, today Piotr Dalek had to patiently run a rados/thrash
> job four times to sort out if the machine crashing came from his pull request
> or from a lack of memory (8GB by default).

Turned out to be resource issue. Seeing how random it is, I assume 8GB is on a brink of being enough - maybe increasing it to 12 (twelve) GB would do the trick without requiring too much resources?

With best regards / Pozdrawiam
Piotr Dałek


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Scaling Ceph reviews and testing
  2015-11-25 22:14 Scaling Ceph reviews and testing Gregory Farnum
  2015-11-25 23:31 ` Loic Dachary
@ 2015-11-26  8:56 ` Piotr.Dalek
  2015-11-26  9:24   ` Igor.Podoski
  2015-11-26 10:02   ` Piotr.Dalek
  1 sibling, 2 replies; 7+ messages in thread
From: Piotr.Dalek @ 2015-11-26  8:56 UTC (permalink / raw)
  To: Gregory Farnum, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Gregory Farnum
> Sent: Wednesday, November 25, 2015 11:14 PM
>
> It has been a long-standing requirement that all code be tested by
> teuthology before being merged to master. In the past leads have
> shouldered a lot of this burden through integration and testing branches, but
> it’s become unsustainable in present form: some PRs which are intended as
> RFCs are being mistakenly identified as final; some PRs are submitted which
> pass cursory sniff tests but fail under recovery conditions that the teuthology
> suites cover. To prevent that, please comment on exactly what testing
> you’ve performed when submitting a PR and a justification why that is
> sufficient to promote it to integration testing. [..]

Unless people will be convinced that performing their own testing isn't that complex (teuthology-openstack is a leapfrog in right direction), they won't do it, either because they simply don't know how to do it, or they don't have resources to do so (small startups may not afford them at all, and large, global corporations might have hardware request procedures so complex and with a such large time span, that it scares the devs out).
But correctness and reliability regressions are one thing, performance regressions are another one. I already see PRs that promise performance increase, when at (my) first glance it looks totally contradictory, or it's just a big, 100+ line change which adds more complexity than performance. Not to mention utter nonsense like https://github.com/ceph/ceph/pull/6582 (excuse my finger-pointing, but this case is so extreme that it needs to be pointed out). Or, to put it more bluntly, some folks are spamming with performance PRs that in their opinion improve something, while in reality those PRs at best increase complexity of already complex code and add (sometimes potential) bugs, often with added bonus of actually degraded performance. So, my proposition is to postpone QA'ing performance pull requests until someone unrelated to PR author (or even author's company) can confirm that claims in that particular PR are true. Providing code snippet that shows the perf difference (or provide a way to verify those claims in reproducible matter) in PR should be enough for it (https://github.com/XinzeChi/ceph/commit/2c8a17560a797b316520cb689240d4dcecf3e4cc for a particular example), and it should help get rid of performance PRs that degrade performance or improve it only on particular hardware/software configuration and at best don't improve anything otherwise.

With best regards / Pozdrawiam
Piotr Dałek

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Scaling Ceph reviews and testing
  2015-11-26  8:56 ` Piotr.Dalek
@ 2015-11-26  9:24   ` Igor.Podoski
  2015-11-26  9:40     ` Piotr.Dalek
  2015-11-26 10:02   ` Piotr.Dalek
  1 sibling, 1 reply; 7+ messages in thread
From: Igor.Podoski @ 2015-11-26  9:24 UTC (permalink / raw)
  To: Piotr.Dalek@ts.fujitsu.com, Gregory Farnum, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Dalek, Piotr
> Sent: Thursday, November 26, 2015 9:56 AM
> To: Gregory Farnum; ceph-devel
> Subject: RE: Scaling Ceph reviews and testing
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Gregory Farnum
> > Sent: Wednesday, November 25, 2015 11:14 PM
> >
> > It has been a long-standing requirement that all code be tested by
> > teuthology before being merged to master. In the past leads have
> > shouldered a lot of this burden through integration and testing
> > branches, but it’s become unsustainable in present form: some PRs
> > which are intended as RFCs are being mistakenly identified as final;
> > some PRs are submitted which pass cursory sniff tests but fail under
> > recovery conditions that the teuthology suites cover. To prevent that,
> > please comment on exactly what testing you’ve performed when
> > submitting a PR and a justification why that is sufficient to promote
> > it to integration testing. [..]
> 
> Unless people will be convinced that performing their own testing isn't that
> complex (teuthology-openstack is a leapfrog in right direction), they won't do
> it, either because they simply don't know how to do it, or they don't have
> resources to do so (small startups may not afford them at all, and large,
> global corporations might have hardware request procedures so complex
> and with a such large time span, that it scares the devs out).
> But correctness and reliability regressions are one thing, performance
> regressions are another one. I already see PRs that promise performance
> increase, when at (my) first glance it looks totally contradictory, or it's just a
> big, 100+ line change which adds more complexity than performance. Not to
> mention utter nonsense like https://github.com/ceph/ceph/pull/6582
> (excuse my finger-pointing, but this case is so extreme that it needs to be
> pointed out). Or, to put it more bluntly, some folks are spamming with
> performance PRs that in their opinion improve something, while in reality
> those PRs at best increase complexity of already complex code and add
> (sometimes potential) bugs, often with added bonus of actually degraded
> performance. So, my proposition is to postpone QA'ing performance pull
> requests until someone unrelated to PR author (or even author's company)
> can confirm that claims in that particular PR are true. Providing code snippet
> that shows the perf difference (or provide a way to verify those claims in
> reproducible matter) in PR should be enough for it
> (https://github.com/XinzeChi/ceph/commit/2c8a17560a797b316520cb689240
> d4dcecf3e4cc for a particular example), and it should help get rid of
> performance PRs that degrade performance or improve it only on particular
> hardware/software configuration and at best don't improve anything
> otherwise.
> 
> 
> With best regards / Pozdrawiam
> Piotr Dałek

We could also add another label, like "explanation/data needed" and guys marking new PR's could add this to make this more restrict:  "Performance enhancements must come with test data and detailed explanations." (https://github.com/ceph/ceph/blob/master/CONTRIBUTING.rst )

Then Piotr's idea will be easier to do, when "PR validator" will have test data and explanation he could faster/easier decide if this PR make sense or not.

Regards,
Igor.

> 
> \x13  칻\x1c & ~ & \x18  +-  ݶ\x17  w  ˛   m \x1e \x17^  b  ^n r   z \x1a  h    &  \x1e G   h \x03( 階 ݢj"  \x1a ^[m     z
> ޖ   f   h   ~ m

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Scaling Ceph reviews and testing
  2015-11-26  9:24   ` Igor.Podoski
@ 2015-11-26  9:40     ` Piotr.Dalek
  0 siblings, 0 replies; 7+ messages in thread
From: Piotr.Dalek @ 2015-11-26  9:40 UTC (permalink / raw)
  To: Igor.Podoski@ts.fujitsu.com, Gregory Farnum, ceph-devel

> -----Original Message-----
> From: Podoski, Igor
> Sent: Thursday, November 26, 2015 10:25 AM
> 
> > But correctness and reliability regressions are one thing, performance
> > regressions are another one. I already see PRs that promise
> > performance increase, when at (my) first glance it looks totally
> > contradictory, or it's just a big, 100+ line change which adds more
> > complexity than performance. Not to mention utter nonsense like
> > https://github.com/ceph/ceph/pull/6582
> > (excuse my finger-pointing, but this case is so extreme that it needs
> > to be pointed out). Or, to put it more bluntly, some folks are
> > spamming with performance PRs that in their opinion improve something,
> > while in reality those PRs at best increase complexity of already
> > complex code and add (sometimes potential) bugs, often with added
> > bonus of actually degraded performance.
> > [..]
> 
> We could also add another label, like "explanation/data needed" and guys
> marking new PR's could add this to make this more restrict:  "Performance
> enhancements must come with test data and detailed explanations."
> (https://github.com/ceph/ceph/blob/master/CONTRIBUTING.rst )

There's already "pending-discussion" tag which could be used too.

> Then Piotr's idea will be easier to do, when "PR validator" will have test data
> and explanation he could faster/easier decide if this PR make sense or not.

Actually, it seems to be doable. PR submitter could submit some code that triggers optimized code path and some bot could build the PR branch and compare some performance metrics before and after applying modified code. The catch here is that it needs to be done on bare-metal machine, VMs are often too unstable in that regard and might give totally different results each run.


With best regards / Pozdrawiam
Piotr Dałek


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Scaling Ceph reviews and testing
  2015-11-26  8:56 ` Piotr.Dalek
  2015-11-26  9:24   ` Igor.Podoski
@ 2015-11-26 10:02   ` Piotr.Dalek
  1 sibling, 0 replies; 7+ messages in thread
From: Piotr.Dalek @ 2015-11-26 10:02 UTC (permalink / raw)
  To: Gregory Farnum, ceph-devel

> So, my proposition is to postpone QA'ing performance pull
> requests until someone unrelated to PR author (or even author's company)
> can confirm that claims in that particular PR are true. Providing code snippet
> that shows the perf difference (or provide a way to verify those claims in
> reproducible matter) in PR should be enough for it
> (https://github.com/XinzeChi/ceph/commit/2c8a17560a797b316520cb689240
> d4dcecf3e4cc for a particular example), and it should help get rid of
> performance PRs that degrade performance or improve it only on particular
> hardware/software configuration and at best don't improve anything
> otherwise.

To clarify: if PR degrades performance for someone, that person should provide details of the case so others (including PR author) can review it (and maybe fix it). 
We're just humans and as such, we both make mistakes and learn from them, and my point was to encourage learning from mistakes and sharing knowledge, instead of going into full-on inter-company political warfare.


With best regards / Pozdrawiam
Piotr Dałek

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-11-26 10:02 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-25 22:14 Scaling Ceph reviews and testing Gregory Farnum
2015-11-25 23:31 ` Loic Dachary
2015-11-26  7:14   ` Piotr.Dalek
2015-11-26  8:56 ` Piotr.Dalek
2015-11-26  9:24   ` Igor.Podoski
2015-11-26  9:40     ` Piotr.Dalek
2015-11-26 10:02   ` Piotr.Dalek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.