* python3 recipe PGO tests
From: Ryan Rowe @ 2020-06-12 21:28 UTC
To: openembedded-core@lists.openembedded.org
Cc: alex.kanavin@gmail.com, Martin Kelly, Jim Broadus

Hello Alex,

I’m investigating Python 3 performance issues on a Raspberry Pi Yocto
build; I would appreciate any insights you can provide into the problem.

In my investigation, I noticed that PGO was disabled in all cases due to
a small bug. I fixed it in a patch submitted to OE-Core (#139459
<https://lists.openembedded.org/g/openembedded-core/message/139459>).
Even with PGO enabled, Yocto-compiled Python 3.8.3 runs significantly
slower than the same version compiled on Raspbian.

In your patch,
0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch
<http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-devtools/python/python3/0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch>,
I see that you override the default PROFILE_TASK, which did not
explicitly specify test suites, with a command that explicitly provides
test suites. How did you decide on these tests? The standard PGO command
runs 43 tests, while you specify 7. When I compile Python 3.8.3 on
Raspbian, I see no intersection between the 43 tests run by default and
the 7 you specify. Additionally, the default module for PROFILE_TASK is
test while you use test.regrtest.

For reference, here are the results of a simple CPU-bound test. These
tests were run on the same Raspberry Pi 4 with the same SD card.

python3 -m timeit -r 10 --setup '
def fib(n):
    if n < 2:
        return n
    if n == 2:
        return 1
    return fib(n - 1) + fib(n - 2)
' '[fib(n) for n in range(20)]'

# Yocto Python 3.8.3
# 10 loops, best of 10: 28.9 msec per loop
# 10 loops, best of 10: 29.3 msec per loop
# 10 loops, best of 10: 27.9 msec per loop
# 10 loops, best of 10: 30.4 msec per loop
# Average result: 31.625 msec per loop

# Raspbian Python 3.8.3
# 50 loops, best of 10: 7.73 msec per loop
# 50 loops, best of 10: 7.72 msec per loop
# 50 loops, best of 10: 7.67 msec per loop
# 50 loops, best of 10: 7.74 msec per loop
# Average result: 7.715 msec per loop

# Raspbian speedup: 4.09x

Best,
Ryan Rowe
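As a cross-check on the arithmetic in the figures above, the averages and the speedup can be recomputed from the eight listed timings. This is only a sketch over the numbers quoted in the message; the variable names are made up for illustration. The four Yocto runs as listed average to about 29.1 msec rather than 31.625, which would put the ratio near 3.8x rather than 4.09x, so the quoted average may have come from a different set of runs.

```python
from statistics import mean

# Best-of-10 timings (msec per loop) exactly as listed in the message.
yocto_msec = [28.9, 29.3, 27.9, 30.4]
raspbian_msec = [7.73, 7.72, 7.67, 7.74]

yocto_avg = mean(yocto_msec)
raspbian_avg = mean(raspbian_msec)

# Ratio of the two averages: how many times faster the Raspbian build ran.
speedup = yocto_avg / raspbian_avg

print(f"Yocto average:    {yocto_avg:.3f} msec per loop")
print(f"Raspbian average: {raspbian_avg:.3f} msec per loop")
print(f"Raspbian speedup: {speedup:.2f}x")
```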
* Re: python3 recipe PGO tests
From: Alexander Kanavin @ 2020-06-12 21:37 UTC
To: Ryan Rowe, Anuj Mittal, Ross Burton
Cc: openembedded-core@lists.openembedded.org, Martin Kelly, Jim Broadus

Hello Ryan,

I did not write the pgo bits, I only preserved them (without testing)
when the Python recipe was rewritten from scratch (by me) in order to
bring some sanity to it, and make it possible again to update it to
newer versions.

The people you want to talk to are Anuj Mittal and Ross Burton (cc).

Alex

On Fri, 12 Jun 2020 at 23:28, Ryan Rowe <rrowe@xevo.com> wrote:
> In your patch,
> 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch,
> I see that you override the default PROFILE_TASK, which did not
> explicitly specify test suites, to a command that explicitly provides
> test suites. How did you decide on these tests?
* Re: [OE-core] python3 recipe PGO tests
From: Anuj Mittal @ 2020-06-15 1:05 UTC
To: openembedded-core@lists.openembedded.org, rrowe@xevo.com
Cc: alex.kanavin@gmail.com, jbroadus@xevo.com, mkelly@xevo.com

On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote:
> In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch,
> I see that you override the default PROFILE_TASK,
> which did not explicitly specify test suites, to a command that
> explicitly provides test suites. How did you decide on these tests?
> The standard PGO command runs 43 tests, while you specify 7. When I
> compile Python 3.8.3 on Raspbian, I see no intersection between the
> 43 tests run by default and the 7 you specify. Additionally, the
> default module for PROFILE is test while you use test.regrtest.

We used to run pybench and then switched to regrtest:

https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195e68b2c1b09e3eb42e623c9a20

It looks like the PROFILE_TASK value was changed recently:

https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585

If the performance is actually degrading, maybe we should change it to
something more useful. Do you know how much time the default set of
tasks takes to run in qemu?

Thanks,

Anuj
* Re: [OE-core] python3 recipe PGO tests
From: Ryan Rowe @ 2020-06-15 20:33 UTC
To: Mittal, Anuj, openembedded-core@lists.openembedded.org
Cc: Martin Kelly, Jim Broadus, alex.kanavin@gmail.com

On 14/6/20, 18:05, "Mittal, Anuj" <anuj.mittal@intel.com> wrote:
> We used to run pybench and then switched to regrtest:
>
> https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195e68b2c1b09e3eb42e623c9a20
>
> It looks like the PROFILE_TASK value was changed recently:
>
> https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585
>
> If the performance is actually degrading, maybe we should change it to
> something more useful. Do you know how much time the default set of
> tasks takes to run in qemu?
>
> Thanks,
>
> Anuj

Thanks for looking into this. It took me about 20 minutes to run the PGO
tests and I did notice a significant improvement in Python runtime.
However, that is compared against a non-PGO build. I have not compared
the existing PGO arguments against the new upstream arguments.

We've come to realize that our performance issues are not due to Python,
but to a much more deeply rooted issue. Simple C code takes 2-3 times
longer to run on our image, based on meta-raspberrypi's raspberrypi4
machine, than on stock Raspbian.

On a side note, it seems that CPython now exposes PROFILE_TASK as a
configuration option, so we can override that variable with our
desired profiling arguments rather than modifying the Makefile
directly with a patch.

Thanks,
Ryan
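Anuj's question about how long the profiling workload takes can be answered by timing the command directly. What follows is a minimal sketch, assuming the profiling task can be launched as an ordinary subprocess; `time_command` is a hypothetical helper and is not part of the recipe or of CPython's build system.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv as a subprocess and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


if __name__ == "__main__":
    # Stand-in workload; substitute the real profiling command to measure it.
    code, seconds = time_command([sys.executable, "-c", "sum(range(10**6))"])
    print(f"exit code {code}, {seconds / 60:.2f} minutes")
```

To time the upstream default described in this thread, the argument list would be along the lines of `[sys.executable, "-m", "test", "--pgo"]`, which needs the `test` package to be installed on the target.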
* Re: [OE-core] python3 recipe PGO tests
From: Khem Raj @ 2020-06-18 23:24 UTC
To: Mittal, Anuj, openembedded-core@lists.openembedded.org
Cc: Martin Kelly, Jim Broadus, alex.kanavin@gmail.com, Ryan Rowe

On Monday, June 15, 2020 1:33:26 PM PDT Ryan Rowe wrote:
> On a side note, it seems that CPython now exposes PROFILE_TASK as a
> configuration option, so we can override that variable with our
> desired profiling arguments rather than modifying the Makefile
> directly with a patch.

The patch 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch
seems to hardcode which tests to run; perhaps it would be better to use
PROFILE_TASK.

When the 3.5 -> 3.7 upgrade was done in

https://git.openembedded.org/openembedded-core/commit/?id=02714c105426b0d687620913c1a7401b386428b6

it silently dropped the use of PYTHON3_PROFILE_TASK, among the large
swath of changes that patch carried. I guess we have not checked the py3
runtime performance to detect this regression, so it would be good to
reinstate the variable to choose which tests one wants to run, with the
defaults being whatever is optimal for the autobuilder.
* Re: [OE-core] python3 recipe PGO tests
From: Andre McCurdy @ 2020-06-18 23:47 UTC
To: Khem Raj
Cc: Mittal, Anuj, openembedded-core@lists.openembedded.org, Martin Kelly, Jim Broadus, alex.kanavin@gmail.com, Ryan Rowe

On Thu, Jun 18, 2020 at 4:25 PM Khem Raj <raj.khem@gmail.com> wrote:
> When 3.5 -> 3.7 upgrade was done in
>
> https://git.openembedded.org/openembedded-core/commit/?id=02714c105426b0d687620913c1a7401b386428b6
>
> it dropped using PYTHON3_PROFILE_TASK silently, among large swath of
> changes this patch carried. I guess we have not checked the py3 runtime
> performance to detect this regression.

Are we sure there is a regression? Ryan posted a follow up saying
everything was slower in his tests, not just python.

> so it will be good to reinstate the variable to choose what tests one
> wants to run with defaults being whatever is optimal for autobuilder.
* Re: [OE-core] python3 recipe PGO tests
From: Khem Raj @ 2020-06-18 23:56 UTC
To: Andre McCurdy
Cc: Mittal, Anuj, openembedded-core@lists.openembedded.org, Martin Kelly, Jim Broadus, alex.kanavin@gmail.com, Ryan Rowe

On Thu, Jun 18, 2020 at 4:47 PM Andre McCurdy <armccurdy@gmail.com> wrote:
> Are we sure there is a regression? Ryan posted a follow up saying
> everything was slower in his tests, not just python.

The regression is that PGO was disabled by e53ebf29.
* Re: [OE-core] python3 recipe PGO tests 2020-06-18 23:56 ` Khem Raj @ 2020-06-19 1:30 ` Ryan Rowe 0 siblings, 0 replies; 8+ messages in thread From: Ryan Rowe @ 2020-06-19 1:30 UTC (permalink / raw) To: Khem Raj, Andre McCurdy Cc: openembedded-core@lists.openembedded.org, Mittal, Anuj, alex.kanavin@gmail.com, Jim Broadus, Martin Kelly On 18/6/20, 16:57, "Khem Raj" <raj.khem@gmail.com> wrote: > > On Thu, Jun 18, 2020 at 4:47 PM Andre McCurdy <armccurdy@gmail.com> wrote: > > > > On Thu, Jun 18, 2020 at 4:25 PM Khem Raj <raj.khem@gmail.com> wrote: > > > > > > On Monday, June 15, 2020 1:33:26 PM PDT Ryan Rowe wrote: > > > > > > > On 14/6/20, 18:05, "Mittal, Anuj" <anuj.mittal@intel.com> wrote: > > > > > > > > > On Fri, 2020-06-12 at 21:28 +0000, Ryan Rowe wrote: > > > > > > > > > > > Hello Alex, > > > > > > > > > > > > > > > > > > > > > > > > I’m investigating Python 3 performance issues on a Raspberry Pi Yocto > > > > > > build; I appreciate any insights you can provide into the problem. > > > > > > > > > > > > > > > > > > > > > > > > In my investigation, I noticed that PGO was disabled in all cases due > > > > > > to a small bug. I fixed it in a patch submitted to OE-Core (#139459). > > > > > > Even when PGO is indeed enabled, Python 3 runs significantly slower > > > > > > on Yocto-compiled Python 3.8.3 than the same version compiled on > > > > > > Raspbian. > > > > > > > > > > > > > > > > > > > > > > > > In your patch, 0001-Makefile.pre-use-qemu-wrapper-when-gathering- > > > > > > profile.patch, I see that you override the default PROFILE_TASK, > > > > > > which did not explicitly specify test suites, to a command that > > > > > > explicitly provides test suites. How did you decide on these tests? > > > > > > The standard PGO command runs 43 tests, while you specify 7. When I > > > > > > compile Python 3.8.3 on Raspbian, I see no intersection between the > > > > > > 43 tests run by default and the 7 you specify. 
Additionally, the > > > > > > default module for PROFILE is test while you use test.regrtest. > > > > > > > > > > > > > > > > > > > > We used to run pybench and then switched to regrtest: > > > > > > > > > > > > > > > > > > > > https://git.yoctoproject.org/cgit/cgit.cgi/poky/commit/?id=d9f7b9d3ad44195 > > > > > e68b2c1b09e3eb42e623c9a20 > > > > > > > > > > > > > > > > > > > The PROFILE_TASK value it looks like was changed recently: > > > > > > > > > > > > > > > > > > > > https://github.com/python/cpython/commit/2406672984e4c1b18629e615edad52928 > > > > > a72ffcc#diff-45e8b91057f0c5b60efcb5944125b585 > > > > > > > > > > > > > > > > > > > If the performance is actually degrading, may be we should change it to > > > > > something more useful. Do you know much time does the default set of > > > > > tasks take to run in qemu? > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > Anuj > > > > > > > > > > > > Thanks for looking into this. It took me about 20 minutes to run the PGO > > > > tests and I did notice a significant improvement in Python runtime. > > > > However, that is compared against a non-PGO build. I have not compared > > > > the existing PGO arguments against the new upstream arguments. > > > > > > > > We've come to realize that our performance issues are not due to Python, > > > > but in fact a much deeper rooted issue. Simple C code takes 2-3 times > > > > longer to run on our image based on meta-raspberrypi's raspberrypi4 > > > > machine than stock Raspbian. > > > > > > > > On a side node, it seems that cPython now exposes PROFILE_TASK as a > > > > configuration option, so we can override that variable with our > > > > desired profiling arguments rather than modifying the Makefile > > > > directly with a patch. 
> > > > > > > > > > The patch 0001-Makefile.pre-use-qemu-wrapper-when-gathering-profile.patch > > > seems to hardcode what tests to run, perhaps it will be better to use > > > PROFILE_TASK We can use the default PROFILE_TASK, however it sounds like Ross had reason to switch from Pybench to regrtest, mainly execution time. In his commit, he notes "also upstream have removed it from Python and instead use test.regrtest —pgo to profile the interpreter." This does not seem to be true anymore as upstream uses test rather than test.regrtest. However, the default tests do take 20 minutes to run which is considerably longer than the current explicit tests. > > > > > > When 3.5 -> 3.7 upgrade was done in > > > > > > https://git.openembedded.org/openembedded-core/commit/? > > > id=02714c105426b0d687620913c1a7401b386428b6 > > > > > > it dropped using PYTHON3_PROFILE_TASK silently, among large swath of changes > > > this patch carried. I guess we have not checked the py3 runtime performance to > > > detect this regression. > > > > Are we sure there is a regression? Ryan posted a follow up saying > > everything was slower in his tests, not just python. In case anyone is curious, I did find out the issue. The CPU governor was powersave rather than ondemand. Silly me, I only checked the min and max freq, not that they were being used. And a quirk of the OS prevented any of my benchmarks from printing the observed clock speed during test, just empty strings. With this fixed and when compiling with upstream PGO in Yocto, I do observe comparable performance to regular upstream Python 3.8 compiled with PGO on Raspbian. > > regression is disabling it with e53ebf29 Yes, that's correct. This inadvertently disabled PGO entirely. I can do some tests tomorrow to determine the performance loss due to PGO with these explicit test suites rather than the defaults from the upstream. I did notice performance gain when using PGO, but that was against non-PGO. 
> > >
> > > so it will be good to reinstate the variable to choose what tests one wants to
> > > run with defaults being whatever is optimal for autobuilder.
> > >
> > > > Thanks,
> > > > Ryan
> > > > > >
> > > > > > For reference, here’s the results of a simple CPU-bound test. These
> > > > > > tests were run on the same Raspberry Pi 4 with same SD card.
> > > > > >
> > > > > > python3 -m timeit -r 10 --setup '
> > > > > > def fib(n):
> > > > > >     if n < 2:
> > > > > >         return n
> > > > > >     if n == 2:
> > > > > >         return 1
> > > > > >     return fib(n - 1) + fib(n - 2)
> > > > > > ' '[fib(n) for n in range(20)]'
> > > > > >
> > > > > > # Yocto Python 3.8.3
> > > > > > # 10 loops, best of 10: 28.9 msec per loop
> > > > > > # 10 loops, best of 10: 29.3 msec per loop
> > > > > > # 10 loops, best of 10: 27.9 msec per loop
> > > > > > # 10 loops, best of 10: 30.4 msec per loop
> > > > > > # Average result: 31.625 msec per loop
> > > > > >
> > > > > > # Raspbian Python 3.8.3
> > > > > > # 50 loops, best of 10: 7.73 msec per loop
> > > > > > # 50 loops, best of 10: 7.72 msec per loop
> > > > > > # 50 loops, best of 10: 7.67 msec per loop
> > > > > > # 50 loops, best of 10: 7.74 msec per loop
> > > > > > # Average result: 7.715 msec per loop
> > > > > >
> > > > > > # Raspbian speedup: 4.09x
> > > > > >
> > > > > > Best,
> > > > > > Ryan Rowe
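
As a quick arithmetic check on the benchmark quoted above, the following sketch recomputes the averages from the four listed runs per platform. The variable and function names are illustrative. The mean of the four listed Yocto runs works out to 29.125 msec rather than the quoted 31.625 msec, which gives a ratio of roughly 3.78x; the quoted figure may include runs not shown in the message.

```python
from statistics import mean

# Per-loop timings in msec, copied from the quoted benchmark output.
yocto_msec = [28.9, 29.3, 27.9, 30.4]
raspbian_msec = [7.73, 7.72, 7.67, 7.74]


def speedup(slow, fast):
    """Ratio of the mean slow time to the mean fast time."""
    return mean(slow) / mean(fast)


if __name__ == "__main__":
    print(f"Yocto mean:    {mean(yocto_msec):.3f} msec")
    print(f"Raspbian mean: {mean(raspbian_msec):.3f} msec")
    print(f"Speedup:       {speedup(yocto_msec, raspbian_msec):.2f}x")
```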
Thread overview: 8+ messages
2020-06-12 21:28 python3 recipe PGO tests Ryan Rowe
2020-06-12 21:37 ` Alexander Kanavin
2020-06-15  1:05 ` [OE-core] " Anuj Mittal
2020-06-15 20:33   ` Ryan Rowe
2020-06-18 23:24     ` Khem Raj
2020-06-18 23:47       ` Andre McCurdy
2020-06-18 23:56         ` Khem Raj
2020-06-19  1:30           ` Ryan Rowe