* Scaling Ceph reviews and testing @ 2015-11-25 22:14 Gregory Farnum 2015-11-25 23:31 ` Loic Dachary 2015-11-26 8:56 ` Piotr.Dalek 0 siblings, 2 replies; 7+ messages in thread From: Gregory Farnum @ 2015-11-25 22:14 UTC (permalink / raw) To: ceph-devel Everybody, Ceph is popular! The global community of developers is growing quickly, and that’s leading to some challenges for our leads and core development team as we try to absorb incoming pull requests. Over the past few weeks our leads have discussed (internally and with a few external contributors) how to improve things, and we wanted to share some conclusions. It has been a long-standing requirement that all code be tested by teuthology before being merged to master. In the past leads have shouldered a lot of this burden through integration and testing branches, but it’s become unsustainable in present form: some PRs which are intended as RFCs are being mistakenly identified as final; some PRs are submitted which pass cursory sniff tests but fail under recovery conditions that the teuthology suites cover. To prevent that, please comment on exactly what testing you’ve performed when submitting a PR and a justification why that is sufficient to promote it to integration testing. Be prepared for us to request more specific testing before doing a careful review if we think it’s warranted: in general, a run through the applicable regression suite (with new tests added in a branch if applicable) will be required. Individual teams and leads will develop specific regression testing requirements in the near future. For our most frequent and prolific contributors, we are going to start expecting that you perform the above testing on your own before we move on to a serious review or our own integration tests — this should be much easier thanks to Loic’s work on teuthology-openstack! It has also been policy that new features and bug fixes are accompanied by tests which 1) demonstrate functionality and 2) check failure cases. In this arena some of us have been lax, but nightly stability has suffered. Some of us have also written tests for external contributions, but this simply doesn’t scale and we are cutting back. If you believe that a patch you’ve submitted is already covered by tests, please point them out. If it’s not covered by existing testing, write new ones! Specifically, the new feature (or bug) should be covered by the area’s regression suite. In most cases, this will involve an addition to the ceph-qa-suite. You should link the branch with the change in the main ceph PR. Your PR’s testing should be performed with that ceph-qa-suite branch (since the existing ceph-qa-suite coverage is presumably insufficient). If you need guidance on how best to automate testing, ask! If you submit a PR without these, it will just get bounced back to you and slow everybody down. We believe that these adjustments to our merge habits and the workload distribution will increase code quality, increase throughput, allow faster merges, and prevent the frequent “lost” PRs requiring rebases that have been appearing over the last year. That will make Ceph better for all of us. Thanks! -Greg -Sam -Yehuda -Sage -Josh -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Scaling Ceph reviews and testing 2015-11-25 22:14 Scaling Ceph reviews and testing Gregory Farnum @ 2015-11-25 23:31 ` Loic Dachary 2015-11-26 7:14 ` Piotr.Dalek 2015-11-26 8:56 ` Piotr.Dalek 1 sibling, 1 reply; 7+ messages in thread From: Loic Dachary @ 2015-11-25 23:31 UTC (permalink / raw) To: Gregory Farnum, ceph-devel [-- Attachment #1: Type: text/plain, Size: 5502 bytes --] Hi Greg & Sam & Josh & Sage & Yehuda, It would be most helpful to validate that the current ceph-qa-suite tests pass on master with the teuthology OpenStack backend, using the lab setup by Zack. The problems, if any, are usually easy to resolve and progress is being made in that direction[1]. Even seasoned contributors struggle to understand the logic behind teuthology. Knowing for certain that a given job is known to pass with OpenStack is a major enabler. When a job is supposed to pass with OpenStack but has never been actually verified, it quickly becomes a blocker because the contributor can hardly differentiate that from a bug in his pull request. For instance, today Piotr Dalek had to patiently run a rados/thrash job four times to sort out if the machine crashing came from his pull request or from a lack of memory (8GB by default). At present I'm confident that the following suites run fine on OpenStack on hammer: * upgrade/hammer * rados * rbd * ceph-disk As part of the work done for the infernalis backports, Abhishek Varshney is running the rados suite on OpenStack and we're figuring out problems together, one at a time. We're making steady but slow progress because it's not our main focus. The problem, when a job fails with OpenStack, is usually a timing issue (because virtual machines tend to be slower than bare metal) that requires a fix of the test, or a resource issue (because virtual machines are by default 8GB RAM, 40GB disk, 2 cpu and no disk attached) that require the addition of a yaml file like[2]: openstack: - machine: ram: 15000 volumes: count: 2 size: 10 to set the ram of the machines to at least 15GB instead of 8GB and attach two disks, 10GB each to each machine. Cheers [1] openstack: rbd/{thrash,qemu}: allocate three disks, always https://github.com/ceph/ceph-qa-suite/pull/727 etc. [2] Defining instances flavor and volumes https://github.com/dachary/teuthology/tree/openstack#defining-instances-flavor-and-volumes On 25/11/2015 23:14, Gregory Farnum wrote: > Everybody, > Ceph is popular! The global community of developers is growing > quickly, and that’s leading to some challenges for our leads and core > development team as we try to absorb incoming pull requests. Over the > past few weeks our leads have discussed (internally and with a few > external contributors) how to improve things, and we wanted to share > some conclusions. > > It has been a long-standing requirement that all code be tested by > teuthology before being merged to master. In the past leads have > shouldered a lot of this burden through integration and testing > branches, but it’s become unsustainable in present form: some PRs > which are intended as RFCs are being mistakenly identified as final; > some PRs are submitted which pass cursory sniff tests but fail under > recovery conditions that the teuthology suites cover. To prevent that, > please comment on exactly what testing you’ve performed when > submitting a PR and a justification why that is sufficient to promote > it to integration testing. Be prepared for us to request more specific > testing before doing a careful review if we think it’s warranted: in > general, a run through the applicable regression suite (with new tests > added in a branch if applicable) will be required. Individual teams > and leads will develop specific regression testing requirements in the > near future. > For our most frequent and prolific contributors, we are going to start > expecting that you perform the above testing on your own before we > move on to a serious review or our own integration tests — this should > be much easier thanks to Loic’s work on teuthology-openstack! > > It has also been policy that new features and bug fixes are > accompanied by tests which 1) demonstrate functionality and 2) check > failure cases. In this arena some of us have been lax, but nightly > stability has suffered. Some of us have also written tests for > external contributions, but this simply doesn’t scale and we are > cutting back. If you believe that a patch you’ve submitted is already > covered by tests, please point them out. If it’s not covered by > existing testing, write new ones! Specifically, the new feature (or > bug) should be covered by the area’s regression suite. In most cases, > this will involve an addition to the ceph-qa-suite. You should link > the branch with the change in the main ceph PR. Your PR’s testing > should be performed with that ceph-qa-suite branch (since the existing > ceph-qa-suite coverage is presumably insufficient). If you need > guidance on how best to automate testing, ask! If you submit a PR > without these, it will just get bounced back to you and slow everybody > down. > > We believe that these adjustments to our merge habits and the workload > distribution will increase code quality, increase throughput, allow > faster merges, and prevent the frequent “lost” PRs requiring rebases > that have been appearing over the last year. That will make Ceph > better for all of us. > > Thanks! > -Greg > -Sam > -Yehuda > -Sage > -Josh > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Loïc Dachary, Artisan Logiciel Libre [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Scaling Ceph reviews and testing 2015-11-25 23:31 ` Loic Dachary @ 2015-11-26 7:14 ` Piotr.Dalek 0 siblings, 0 replies; 7+ messages in thread From: Piotr.Dalek @ 2015-11-26 7:14 UTC (permalink / raw) To: Loic Dachary, Gregory Farnum, ceph-devel > -----Original Message----- > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- > owner@vger.kernel.org] On Behalf Of Loic Dachary > Sent: Thursday, November 26, 2015 12:32 AM > > Even seasoned contributors struggle to understand the logic behind > teuthology. Knowing for certain that a given job is known to pass with > OpenStack is a major enabler. When a job is supposed to pass with > OpenStack but has never been actually verified, it quickly becomes a blocker > because the contributor can hardly differentiate that from a bug in his pull > request. For instance, today Piotr Dalek had to patiently run a rados/thrash > job four times to sort out if the machine crashing came from his pull request > or from a lack of memory (8GB by default). Turned out to be resource issue. Seeing how random it is, I assume 8GB is on a brink of being enough - maybe increasing it to 12 (twelve) GB would do the trick without requiring too much resources? With best regards / Pozdrawiam Piotr Dałek ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Scaling Ceph reviews and testing 2015-11-25 22:14 Scaling Ceph reviews and testing Gregory Farnum 2015-11-25 23:31 ` Loic Dachary @ 2015-11-26 8:56 ` Piotr.Dalek 2015-11-26 9:24 ` Igor.Podoski 2015-11-26 10:02 ` Piotr.Dalek 1 sibling, 2 replies; 7+ messages in thread From: Piotr.Dalek @ 2015-11-26 8:56 UTC (permalink / raw) To: Gregory Farnum, ceph-devel > -----Original Message----- > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- > owner@vger.kernel.org] On Behalf Of Gregory Farnum > Sent: Wednesday, November 25, 2015 11:14 PM > > It has been a long-standing requirement that all code be tested by > teuthology before being merged to master. In the past leads have > shouldered a lot of this burden through integration and testing branches, but > it’s become unsustainable in present form: some PRs which are intended as > RFCs are being mistakenly identified as final; some PRs are submitted which > pass cursory sniff tests but fail under recovery conditions that the teuthology > suites cover. To prevent that, please comment on exactly what testing > you’ve performed when submitting a PR and a justification why that is > sufficient to promote it to integration testing. [..] Unless people will be convinced that performing their own testing isn't that complex (teuthology-openstack is a leapfrog in right direction), they won't do it, either because they simply don't know how to do it, or they don't have resources to do so (small startups may not afford them at all, and large, global corporations might have hardware request procedures so complex and with a such large time span, that it scares the devs out). But correctness and reliability regressions are one thing, performance regressions are another one. I already see PRs that promise performance increase, when at (my) first glance it looks totally contradictory, or it's just a big, 100+ line change which adds more complexity than performance. Not to mention utter nonsense like https://github.com/ceph/ceph/pull/6582 (excuse my finger-pointing, but this case is so extreme that it needs to be pointed out). Or, to put it more bluntly, some folks are spamming with performance PRs that in their opinion improve something, while in reality those PRs at best increase complexity of already complex code and add (sometimes potential) bugs, often with added bonus of actually degraded performance. So, my proposition is to postpone QA'ing performance pull requests until someone unrelated to PR author (or even author's company) can confirm that claims in that particular PR are true. Providing code snippet that shows the perf difference (or provide a way to verify those claims in reproducible matter) in PR should be enough for it (https://github.com/XinzeChi/ceph/commit/2c8a17560a797b316520cb689240d4dcecf3e4cc for a particular example), and it should help get rid of performance PRs that degrade performance or improve it only on particular hardware/software configuration and at best don't improve anything otherwise. With best regards / Pozdrawiam Piotr Dałek ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Scaling Ceph reviews and testing 2015-11-26 8:56 ` Piotr.Dalek @ 2015-11-26 9:24 ` Igor.Podoski 2015-11-26 9:40 ` Piotr.Dalek 2015-11-26 10:02 ` Piotr.Dalek 1 sibling, 1 reply; 7+ messages in thread From: Igor.Podoski @ 2015-11-26 9:24 UTC (permalink / raw) To: Piotr.Dalek@ts.fujitsu.com, Gregory Farnum, ceph-devel > -----Original Message----- > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- > owner@vger.kernel.org] On Behalf Of Dalek, Piotr > Sent: Thursday, November 26, 2015 9:56 AM > To: Gregory Farnum; ceph-devel > Subject: RE: Scaling Ceph reviews and testing > > > -----Original Message----- > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- > > owner@vger.kernel.org] On Behalf Of Gregory Farnum > > Sent: Wednesday, November 25, 2015 11:14 PM > > > > It has been a long-standing requirement that all code be tested by > > teuthology before being merged to master. In the past leads have > > shouldered a lot of this burden through integration and testing > > branches, but it’s become unsustainable in present form: some PRs > > which are intended as RFCs are being mistakenly identified as final; > > some PRs are submitted which pass cursory sniff tests but fail under > > recovery conditions that the teuthology suites cover. To prevent that, > > please comment on exactly what testing you’ve performed when > > submitting a PR and a justification why that is sufficient to promote > > it to integration testing. [..] > > Unless people will be convinced that performing their own testing isn't that > complex (teuthology-openstack is a leapfrog in right direction), they won't do > it, either because they simply don't know how to do it, or they don't have > resources to do so (small startups may not afford them at all, and large, > global corporations might have hardware request procedures so complex > and with a such large time span, that it scares the devs out). > But correctness and reliability regressions are one thing, performance > regressions are another one. I already see PRs that promise performance > increase, when at (my) first glance it looks totally contradictory, or it's just a > big, 100+ line change which adds more complexity than performance. Not to > mention utter nonsense like https://github.com/ceph/ceph/pull/6582 > (excuse my finger-pointing, but this case is so extreme that it needs to be > pointed out). Or, to put it more bluntly, some folks are spamming with > performance PRs that in their opinion improve something, while in reality > those PRs at best increase complexity of already complex code and add > (sometimes potential) bugs, often with added bonus of actually degraded > performance. So, my proposition is to postpone QA'ing performance pull > requests until someone unrelated to PR author (or even author's company) > can confirm that claims in that particular PR are true. Providing code snippet > that shows the perf difference (or provide a way to verify those claims in > reproducible matter) in PR should be enough for it > (https://github.com/XinzeChi/ceph/commit/2c8a17560a797b316520cb689240 > d4dcecf3e4cc for a particular example), and it should help get rid of > performance PRs that degrade performance or improve it only on particular > hardware/software configuration and at best don't improve anything > otherwise. > > > With best regards / Pozdrawiam > Piotr Dałek We could also add another label, like "explanation/data needed" and guys marking new PR's could add this to make this more restrict: "Performance enhancements must come with test data and detailed explanations." (https://github.com/ceph/ceph/blob/master/CONTRIBUTING.rst ) Then Piotr's idea will be easier to do, when "PR validator" will have test data and explanation he could faster/easier decide if this PR make sense or not. Regards, Igor. > > \x13 칻\x1c & ~ & \x18 +- ݶ\x17 w ˛ m \x1e \x17^ b ^n r z \x1a h & \x1e G h \x03( 階 ݢj" \x1a ^[m z > ޖ f h ~ m ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Scaling Ceph reviews and testing 2015-11-26 9:24 ` Igor.Podoski @ 2015-11-26 9:40 ` Piotr.Dalek 0 siblings, 0 replies; 7+ messages in thread From: Piotr.Dalek @ 2015-11-26 9:40 UTC (permalink / raw) To: Igor.Podoski@ts.fujitsu.com, Gregory Farnum, ceph-devel > -----Original Message----- > From: Podoski, Igor > Sent: Thursday, November 26, 2015 10:25 AM > > > But correctness and reliability regressions are one thing, performance > > regressions are another one. I already see PRs that promise > > performance increase, when at (my) first glance it looks totally > > contradictory, or it's just a big, 100+ line change which adds more > > complexity than performance. Not to mention utter nonsense like > > https://github.com/ceph/ceph/pull/6582 > > (excuse my finger-pointing, but this case is so extreme that it needs > > to be pointed out). Or, to put it more bluntly, some folks are > > spamming with performance PRs that in their opinion improve something, > > while in reality those PRs at best increase complexity of already > > complex code and add (sometimes potential) bugs, often with added > > bonus of actually degraded performance. > > [..] > > We could also add another label, like "explanation/data needed" and guys > marking new PR's could add this to make this more restrict: "Performance > enhancements must come with test data and detailed explanations." > (https://github.com/ceph/ceph/blob/master/CONTRIBUTING.rst ) There's already "pending-discussion" tag which could be used too. > Then Piotr's idea will be easier to do, when "PR validator" will have test data > and explanation he could faster/easier decide if this PR make sense or not. Actually, it seems to be doable. PR submitter could submit some code that triggers optimized code path and some bot could build the PR branch and compare some performance metrics before and after applying modified code. The catch here is that it needs to be done on bare-metal machine, VMs are often too unstable in that regard and might give totally different results each run. With best regards / Pozdrawiam Piotr Dałek ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Scaling Ceph reviews and testing 2015-11-26 8:56 ` Piotr.Dalek 2015-11-26 9:24 ` Igor.Podoski @ 2015-11-26 10:02 ` Piotr.Dalek 1 sibling, 0 replies; 7+ messages in thread From: Piotr.Dalek @ 2015-11-26 10:02 UTC (permalink / raw) To: Gregory Farnum, ceph-devel > So, my proposition is to postpone QA'ing performance pull > requests until someone unrelated to PR author (or even author's company) > can confirm that claims in that particular PR are true. Providing code snippet > that shows the perf difference (or provide a way to verify those claims in > reproducible matter) in PR should be enough for it > (https://github.com/XinzeChi/ceph/commit/2c8a17560a797b316520cb689240 > d4dcecf3e4cc for a particular example), and it should help get rid of > performance PRs that degrade performance or improve it only on particular > hardware/software configuration and at best don't improve anything > otherwise. To clarify: if PR degrades performance for someone, that person should provide details of the case so others (including PR author) can review it (and maybe fix it). We're just humans and as such, we both make mistakes and learn from them, and my point was to encourage learning from mistakes and sharing knowledge, instead of going into full-on inter-company political warfare. With best regards / Pozdrawiam Piotr Dałek ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-11-26 10:02 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-11-25 22:14 Scaling Ceph reviews and testing Gregory Farnum 2015-11-25 23:31 ` Loic Dachary 2015-11-26 7:14 ` Piotr.Dalek 2015-11-26 8:56 ` Piotr.Dalek 2015-11-26 9:24 ` Igor.Podoski 2015-11-26 9:40 ` Piotr.Dalek 2015-11-26 10:02 ` Piotr.Dalek
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.