* Improving Build Speed
@ 2013-11-20 21:05 Ulf Samuelsson
2013-11-20 21:29 ` Richard Purdie
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Ulf Samuelsson @ 2013-11-20 21:05 UTC (permalink / raw)
To: Discussion of the angstrom distribution development,
Patches and discussions about the oe-core layer
Finally got my new build machine running, so I thought I'd measure
its performance against the old machine.
Old machine (home built):
  Core i7-980X, 6 cores / 12 threads @ 3.33 GHz
  12 GB RAM @ 1333 MHz
  WD Black 1 TB @ 7200 rpm

New machine (Precision 7500):
  2 x Xeon X5670 (6 cores @ 2.93 GHz)
  2 x 24 GB RAM @ 1333 MHz
  2 x 600 GB SAS @ 15k rpm, striped RAID
Running the Angstrom distribution:

  oebb.sh config beaglebone
  bitbake cloud9-<my>-gnome-image (a slightly extended image)
The first machine built this in about three hours using
PARALLEL_MAKE = "-j6"
BB_NUMBER_THREADS = "6"
The second machine built this much faster.

Initially I tried
PARALLEL_MAKE = "-j2"
BB_NUMBER_THREADS = "12"
but the CPU frequency tool showed the machine mostly idling.
I changed to:
PARALLEL_MAKE = "-j6"
BB_NUMBER_THREADS = "24"
which was quicker, but still seemed a little flawed:
several times during the build, the CPU frequency utility
showed most of the cores dropping to their
minimum frequency (2.93 GHz -> 1.6 GHz).
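(For reference, a minimal way to watch per-core clocks on a Linux host,
assuming the cpufreq sysfs interface is present; values are in kHz:)

  # Print each core's current frequency once per second
  watch -n1 'cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq'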
The image build breaks down into 7658 tasks:

Time   Milestone               Elapsed
19:36  start of pseudo build
19:40  start of real build
19:42  task 1000                2 minutes
19:45  task 2000                3 minutes
19:47  task 3000                2 minutes
19:48  task 3500                1 minute
19:57  task 4000                9 minutes  ****** (1)
20:00  task 4500                3 minutes
20:04  task 5000                4 minutes
20:14  task 5700               10 minutes
20:17  task 6000                3 minutes
20:27  task 6500               10 minutes
20:43  task 7500               16 minutes
20:52  task 7657                9 minutes  ****** (2)
20:59  task 7658 (do_rootfs)    7 minutes  ****** (3)

Total time: 83 minutes
'******' marks areas with speed problems and very little parallelism.
These times are after a few fixes; the vanilla build will be slower.
There are several reasons for the speed traps.
(1) This occurs at the end of the build of the native tools.
The build of the cross packages has started; sources are unpacked
and patched, waiting for eglibc to be ready.
(2) This occurs at the end of the build, when very few packages
are left, so the RunQueue contains only a few tasks.
I had a look at the packages built at the end:
webkit-gtk, gimp, abiword, pulseaudio.
abiword has PARALLEL_MAKE = "" and takes forever.
I tried building an image with PARALLEL_MAKE = "-j24" and the build
completed without problems, but I have not loaded it onto a target yet.
AbiWord seems to be compiling almost alone for a long time.
Webkit-gtk has a strange fix in do_compile:

do_compile() {
    if [ x"$MAKE" = x ]; then MAKE=make; fi
    ...
    for error_count in 1 2 3; do
        ...
        ${MAKE} ${EXTRA_OEMAKE} "$@" || exit_code=1
        ...
    done
    ...
}
Not sure, but I think this means that PARALLEL_MAKE might get ignored.
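(A quick way to check, as a sketch — webkit-gtk is just the recipe
under suspicion here:)

  # What does PARALLEL_MAKE expand to for this recipe?
  bitbake -e webkit-gtk | grep '^PARALLEL_MAKE='
  # While it compiles: was make actually started with a -j flag?
  ps ax | grep '[m]ake.*-j'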
===================================
Since there are packages which, due to dependencies, are processed
almost alone, there is no reason to limit parallelism for those.
Why restrict PARALLEL_MAKE to anything less than the number of H/W
threads in the machine?
I came up with a construct, PARALLEL_HIGH, defined alongside
PARALLEL_MAKE in conf/local.conf:
PARALLEL_MAKE = "-j8"
PARALLEL_HIGH = "-j24"
In the appropriate recipes, which bitbake seems to process in
solitude, I do:
PARALLEL_HIGH ?= "${PARALLEL_MAKE}"
PARALLEL_MAKE = "${PARALLEL_HIGH}"
This means those recipes will try to use every H/W thread.
I added this to eglibc, abiword, nodejs and webkit-gtk.
I think this could shave off maybe 5% of the build time.
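(As a concrete sketch, the per-recipe lines can live in a bbappend in
one's own layer; the path and the '%' version wildcard are assumptions:)

  # meta-local/recipes-support/webkit-gtk/webkit-gtk_%.bbappend (hypothetical)
  # Fall back to the normal setting when local.conf defines no PARALLEL_HIGH
  PARALLEL_HIGH ?= "${PARALLEL_MAKE}"
  PARALLEL_MAKE = "${PARALLEL_HIGH}"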
===================================
When I looked at the bitbake runqueue code, it seems to prioritize
tasks with a lot of dependencies, which results in things like
webkit-gtk being built among the last packages.
It would probably be better if the webkit-gtk build started earlier,
so that the gimp build, which depends on webkit-gtk, does not have
to run as a single task for a few minutes.
I am thinking of adding a few dummy packages which depend on
webkit-gtk and the other long builds at the end, to fool bitbake into
starting their builds earlier (see the sketch below), but it might be
a better idea if a build hint could be part of the recipe.
I guess a value which could be added to the dependency count would
not be too hard to implement (for those who know how).
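(A minimal sketch of such a dummy recipe; the name, the license
checksum and the dependency list are all assumptions:)

  # pull-forward_1.0.bb (hypothetical): an empty package whose only job is
  # to raise the dependency count of the slow builds so they start earlier
  SUMMARY = "Dummy recipe to pull long builds forward in the runqueue"
  LICENSE = "MIT"
  LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/MIT;md5=0835ade698e0bcf8506ecda2f7b4f302"
  DEPENDS = "webkit-gtk abiword gimp pulseaudio"
  ALLOW_EMPTY_${PN} = "1"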
(3) Creating the rootfs seems to have zero parallelism,
but I have not investigated whether anything can be done.
===================================
So I propose the following changes:

1. Remove PARALLEL_MAKE = "" from abiword.
2. Add the PARALLEL_HIGH variable to a few recipes.
3. Investigate whether we can force the build of a few packages to
   start earlier.
=======================================
BTW: I have noticed that some dependencies are missing from the recipes.
DEPENDENCY BUGS
pangomm needs to depend on "pango".
Otherwise, the required pangocairo might not be available when
pangomm is configured.
goffice needs to depend on "librsvg gdk-pixbuf".
Also on "gobject-2.0 gmodule-2.0 gio-2.0", but I did not find those
packages, so I assume they are generated somewhere; I did not
investigate further.
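(A hedged sketch of the two fixes, expressed as bbappends; paths are
hypothetical, and the proper fix would of course go into the recipes
themselves:)

  # pangomm_%.bbappend (hypothetical)
  DEPENDS += "pango"

  # goffice_%.bbappend (hypothetical)
  DEPENDS += "librsvg gdk-pixbuf"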
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: Improving Build Speed 2013-11-20 21:05 Improving Build Speed Ulf Samuelsson @ 2013-11-20 21:29 ` Richard Purdie 2013-11-20 22:43 ` Ulf Samuelsson ` (2 more replies) 2013-11-21 10:05 ` Burton, Ross 2013-11-21 11:51 ` Enrico Scholz 2 siblings, 3 replies; 13+ messages in thread From: Richard Purdie @ 2013-11-20 21:29 UTC (permalink / raw) To: Ulf Samuelsson Cc: Discussion of the angstrom distribution development, Patches and discussions about the oe-core layer Hi Ulf, Nice to see someone else looking at this. I've shared some of my thoughts and observations below based on some of the work I've done trying to speed things up. On Wed, 2013-11-20 at 22:05 +0100, Ulf Samuelsson wrote: > Finally got my new build machine running. so I thought I'd measure > the performance vs the old machine > > Home Built > Core i7-980X > 6 core/12 threads @ 3,33GHz > 12 GB RAM @ 1333 Mhz. > WD Black 1 TB @ 7200 rpm > > Precision 7500 > 2 x (X5670 6 core 2,93 MHz) > 2 x (24 GB RAM @ 1333 MHz) > 2 x SAS 600 GB / 15K rpm, Striped RAID > > Run Angstrom Distribution > > oebb.sh config beaglebone > bitbake cloud9-<my>-gnome-image (It is slightly extended) > > The first machine build this in about three hours using > PARALLEL_MAKE = "-j6" > BB_NUMBER_THREADS = "6" > > The second machine build this much faster: > > Initially tried > > PARALLEL_MAKE = "-j2" > BB_NUMBER_THREADS = "12" > > but the CPU frequency tool showed it to idle. > Changed to: > > PARALLEL_MAKE = "-j6" > BB_NUMBER_THREADS = "24" > > and was quicker, but it seemed to be a little flawed. > At several times during the build, the CPU frequtil > showed that most of the cores went down to > minimum frequency (2,93 GHz -> 1,6 GHz) > > The image build breaks down into 7658 tasks > > 19:36 Start of Pseudo Build > 19:40 Start of real build > 19:42 Task 1000 built 2 minutes > 19:45 Task 2000 built 3 minutes > 19:47 Task 3000 built 2 minutes > 19:48 Task 3500 built 1 minute > 19:57 Task 4000 built 9 minutes ****** (1) > 20:00 Task 4500 built 3 minutes > 20:04 Task 5000 built 4 minutes > 20:14 Task 5700 built 10 minutes > 20:17 Task 6000 built 3 minutes > 20:27 Task 6500 built 10 minutes > 20:43 Task 7500 built 16 minutes > 20:52 Task 7657 built 9 minutes ******* (2) > 20:59 Task 7658 built 7 minutes ******* (3) (do_rootfs) > > Total Time 83 minutes FWIW this is clearly an older revision of the system. We now build pseudo in tree so the "Start of Pseudo Build" no longer exists. There have been several fixes in various performance areas recently too which all help a little. If that saves us the single threaded first 4 minutes that is clearly a good thing! :) > There are several reasons for the speed traps. > > (1) This occurs at the end of the build of the native tools > The build of the cross packages has started and stuff are unpacked > and patched, and waiting for eglibc to be ready. We have gone through this "critical path" and tried to strip out as many dependencies as we can without sacrificing correctness. I'm open to further ideas. > (2) This occurs at the end of the build, when very few packages > are left to build so the RunQueue only contains a few packages. > > Had a look at the packages built at the end. > > webkit-gtk, gimp, abiword pulseaudio. > > abiword has PARALLEL_MAKE = "" and takes forever. > I tried building an image with PARALLEL_MAKE = "-j24" and this > build completes without problem. > but I have not loaded it to a target yet. 
> AbiWord seems to be compiling almost alone for a long time. > > Webkit-gtk has a strange fix in do_compile. > > do_compile() { > if [ x"$MAKE" = x ]; then MAKE=make; fi > ... > for error_count in 1 2 3; do > ... > ${MAKE} ${EXTRA_OEMAKE} "$@" || exit_code=1 > ... > done > ... > } > > Not sure, but I think this means that PARALLEL_MAKE might get ignored. I think we got rid of this in master. It was to workaround make bugs which we now detect and error upon instead. > Why restrict PARALLEL_MAKE to anything less than the number of H/W > threads in the machine? > > Came up with a construct PARALLEL_HIGH which is defined alongside > PARALLEL_MAKE in conf/local.conf > > PARALLEL_MAKE = "-j8" > PARALLEL_HIGH = "-j24" > > In the appropriate recipes, which seems to be processed by bitbake > in solitude I do: > > PARALLEL_HIGH ?= "${PARALLEL_MAKE}" > PARALLEL_MAKE = "${PARALLEL_HIGH}" > > This means that they will try to use each H/W thread. Please benchmark the difference. I suspect we can just set the high number of make for everything. Note that few makefiles are well enough written to benefit from high levels of make (webkit being an notable exception). > When I looked at the bitbake runqueue stuff, it seems to prioritize > things with a lot of dependencies, which results in things like the > webkit-gtk > beeing built among the last packages. > > It would probably be better if the webkit-gtk build started earlier, > so that the gimp build which depends on webkit-gtk, does not have > to run as a single task for a few minutes. > > I am thinking of adding a few dummy packages which depend on > webkit-gtk and the > other long builds at the end, to fool bitbake to start their build > earlier, > but it might be a better idea, if a build hint could be part of the > recipe. > > I guess a value, which could be added to the dependency count would > not be > to hard to implement (for those that know how) It would be easy to write a custom scheduler which hardcoded prioritisation of critical path items (or slow ones). Its an idea I've not tried yet and would be easier than artificial dependency trees. One point to note is that looking at the build "bootcharts", there are "pinch points". For core-image-sato, these are notably the toolchain, then gettext, then gtk, then gstreamer. I suspect webkit has a similar issue to that. > (3) Creating the rootfs seems to have zero parallelism. > But I have not investigated if anything can be done. This is something I do want to fix in 1.6. We need to convert the core to python to gain access to easier threading mechanisms though. Certainly parallel image type generation and compression would be a win here. > =================================== > > So I propose the following changes: > > 1.Remove PARALLEL_MAKE = "" from abiword > 2.Add the PARALLEL_HIGH variable to a few recipes. > 3.Investigate if we can force the build of a few packages to an earlier > point. > > ======================================= > BTW: Have noticed that there are some dependencies missing from the recipes. > > > > DEPENDENCY BUGS > pangomm needs to depend on "pango" > Otherwise, the required pangocairo might not be available when > pangomm is configured > > goffice needs to depend on "librsvg gdk-pixbuf" > Also on "gobject-2.0 gmodule-2.0 gio-2.0", but I did not find > those packages, > so I assume they are generated somewhere. Did not investigate further. I'm sure patches would be most welcome for bugs like this. Cheers, Richard ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-20 21:29 ` Richard Purdie @ 2013-11-20 22:43 ` Ulf Samuelsson 2013-11-21 0:19 ` Martin Jansa 2013-11-21 0:10 ` Martin Jansa 2013-11-21 8:04 ` Ulf Samuelsson 2 siblings, 1 reply; 13+ messages in thread From: Ulf Samuelsson @ 2013-11-20 22:43 UTC (permalink / raw) To: openembedded-core 2013-11-20 22:29, Richard Purdie skrev: > Hi Ulf, > > Nice to see someone else looking at this. I've shared some of my > thoughts and observations below based on some of the work I've done > trying to speed things up. > > On Wed, 2013-11-20 at 22:05 +0100, Ulf Samuelsson wrote: >> Finally got my new build machine running. so I thought I'd measure >> the performance vs the old machine >> >> Home Built >> Core i7-980X >> 6 core/12 threads @ 3,33GHz >> 12 GB RAM @ 1333 Mhz. >> WD Black 1 TB @ 7200 rpm >> >> Precision 7500 >> 2 x (X5670 6 core 2,93 MHz) >> 2 x (24 GB RAM @ 1333 MHz) >> 2 x SAS 600 GB / 15K rpm, Striped RAID >> >> Run Angstrom Distribution >> >> oebb.sh config beaglebone >> bitbake cloud9-<my>-gnome-image (It is slightly extended) >> >> The first machine build this in about three hours using >> PARALLEL_MAKE = "-j6" >> BB_NUMBER_THREADS = "6" >> >> The second machine build this much faster: >> >> Initially tried >> >> PARALLEL_MAKE = "-j2" >> BB_NUMBER_THREADS = "12" >> >> but the CPU frequency tool showed it to idle. >> Changed to: >> >> PARALLEL_MAKE = "-j6" >> BB_NUMBER_THREADS = "24" >> >> and was quicker, but it seemed to be a little flawed. >> At several times during the build, the CPU frequtil >> showed that most of the cores went down to >> minimum frequency (2,93 GHz -> 1,6 GHz) >> >> The image build breaks down into 7658 tasks >> >> 19:36 Start of Pseudo Build >> 19:40 Start of real build >> 19:42 Task 1000 built 2 minutes >> 19:45 Task 2000 built 3 minutes >> 19:47 Task 3000 built 2 minutes >> 19:48 Task 3500 built 1 minute >> 19:57 Task 4000 built 9 minutes ****** (1) >> 20:00 Task 4500 built 3 minutes >> 20:04 Task 5000 built 4 minutes >> 20:14 Task 5700 built 10 minutes >> 20:17 Task 6000 built 3 minutes >> 20:27 Task 6500 built 10 minutes >> 20:43 Task 7500 built 16 minutes >> 20:52 Task 7657 built 9 minutes ******* (2) >> 20:59 Task 7658 built 7 minutes ******* (3) (do_rootfs) >> >> Total Time 83 minutes > FWIW this is clearly an older revision of the system. We now build > pseudo in tree so the "Start of Pseudo Build" no longer exists. There > have been several fixes in various performance areas recently too which > all help a little. If that saves us the single threaded first 4 minutes > that is clearly a good thing! :) This is the Angstrom Master, which is Yocto-1.3 Had problems getting the build to complete with the Angstrom Yocto-1.4 >> There are several reasons for the speed traps. >> >> (1) This occurs at the end of the build of the native tools >> The build of the cross packages has started and stuff are unpacked >> and patched, and waiting for eglibc to be ready. > We have gone through this "critical path" and tried to strip out as many > dependencies as we can without sacrificing correctness. I'm open to > further ideas. > >> (2) This occurs at the end of the build, when very few packages >> are left to build so the RunQueue only contains a few packages. >> >> Had a look at the packages built at the end. >> >> webkit-gtk, gimp, abiword pulseaudio. >> >> abiword has PARALLEL_MAKE = "" and takes forever. >> I tried building an image with PARALLEL_MAKE = "-j24" and this >> build completes without problem. 
>> but I have not loaded it to a target yet. >> AbiWord seems to be compiling almost alone for a long time. >> >> Webkit-gtk has a strange fix in do_compile. >> >> do_compile() { >> if [ x"$MAKE" = x ]; then MAKE=make; fi >> ... >> for error_count in 1 2 3; do >> ... >> ${MAKE} ${EXTRA_OEMAKE} "$@" || exit_code=1 >> ... >> done >> ... >> } >> >> Not sure, but I think this means that PARALLEL_MAKE might get ignored. > I think we got rid of this in master. It was to workaround make bugs > which we now detect and error upon instead. > >> Why restrict PARALLEL_MAKE to anything less than the number of H/W >> threads in the machine? >> >> Came up with a construct PARALLEL_HIGH which is defined alongside >> PARALLEL_MAKE in conf/local.conf >> >> PARALLEL_MAKE = "-j8" >> PARALLEL_HIGH = "-j24" >> >> In the appropriate recipes, which seems to be processed by bitbake >> in solitude I do: >> >> PARALLEL_HIGH ?= "${PARALLEL_MAKE}" >> PARALLEL_MAKE = "${PARALLEL_HIGH}" >> >> This means that they will try to use each H/W thread. > Please benchmark the difference. I suspect we can just set the high > number of make for everything. Note that few makefiles are well enough > written to benefit from high levels of make (webkit being an notable > exception). I only checked a few, and no hard data, but looking at the cpufreq it certainly seemed better. Hard data is needed of course, so I will try that tomorrow. > >> When I looked at the bitbake runqueue stuff, it seems to prioritize >> things with a lot of dependencies, which results in things like the >> webkit-gtk >> beeing built among the last packages. >> >> It would probably be better if the webkit-gtk build started earlier, >> so that the gimp build which depends on webkit-gtk, does not have >> to run as a single task for a few minutes. >> >> I am thinking of adding a few dummy packages which depend on >> webkit-gtk and the >> other long builds at the end, to fool bitbake to start their build >> earlier, >> but it might be a better idea, if a build hint could be part of the >> recipe. >> >> I guess a value, which could be added to the dependency count would >> not be >> to hard to implement (for those that know how) > It would be easy to write a custom scheduler which hardcoded > prioritisation of critical path items (or slow ones). Its an idea I've > not tried yet and would be easier than artificial dependency trees. I generated a recipe which just installs /home/root but depends on a few things like gimp, webkit-gtk etc to see if I can get them to start earlier. Then I duplicated it 15 times and made a recipe which depends on these 15, and included the latter recipe in the image. Unfortunately this does not seem to make a difference. It was actually a few seconds slower, which I guess is due to the extra build time of the new recipes. gimp is still there as the only thread at the end. It could be that webkit-gtk depends on so many things it *has* to be built at the end. > > One point to note is that looking at the build "bootcharts", there are > "pinch points". For core-image-sato, these are notably the toolchain, > then gettext, then gtk, then gstreamer. I suspect webkit has a similar > issue to that. Another idea: I suspect that there is a lot of unpacking and patching of recipes for the target when the native stuff is built. Does it make sense to have multiple threads reading the disk, for the target recipes during the native build or will we just lose out due to seek time? 
Having multiple threads accessing the disk, might force the disk to spend most of its time seeking. Found an application which measures seek time performance, and my WD Black will do 83 seeks per second, and my SAS disk will do twice that. The RAID of two SAS disks will provide close to SSD throughput (380 MB/s) but seek time is no better than a single SAS disk. Since there is "empty time" at the end of the native build, does it make sense to minimize unpack/patch of target stuff when we reach that point, and then we let loose? ======================== Now with 48 MB of RAM, (which I might grow to 96 GB, if someone proves that this makes it faster), this might be useful to speed things up. Can tmpfs beat the kernel cache system? 1. Typically, I work on less than 10 recipes, and if I continuosly rebuild those, why not create the build directories as links to a tmpfs file system. Maybe a configuration file with a list of recipes to build on tmpfs. During a build from scratch, this is not so useful, but once most stuff is in place, it might, 2. If the downloads directory was shadowed in a tmpfs system then there would be less seek time during the build. The downloads tmpfs should be poplulated at boot time, and rsynced with a real disk in the background when new stuff is downloaded from internet. 3. With 96 GB of RAM, maybe the complete build directory will fit. Would be nice to build everything on tmpfs, and automatically rsync to a real disk when there is nothing else to do... 4. If not tmpfs is used, then It would still be good to have better control over the build directory. It make sense to me to have the metadata on an SSD, but the build directory should be on my RAID cluster for fast rebuilds. I can set this up manually, but it would be better to be able to specify this in a configuration file. >> (3) Creating the rootfs seems to have zero parallelism. >> But I have not investigated if anything can be done. > This is something I do want to fix in 1.6. We need to convert the core > to python to gain access to easier threading mechanisms though. > Certainly parallel image type generation and compression would be a win > here. > >> =================================== >> >> So I propose the following changes: >> >> 1.Remove PARALLEL_MAKE = "" from abiword >> 2.Add the PARALLEL_HIGH variable to a few recipes. >> 3.Investigate if we can force the build of a few packages to an earlier >> point. >> >> ======================================= >> BTW: Have noticed that there are some dependencies missing from the recipes. >> >> >> >> DEPENDENCY BUGS >> pangomm needs to depend on "pango" >> Otherwise, the required pangocairo might not be available when >> pangomm is configured >> >> goffice needs to depend on "librsvg gdk-pixbuf" >> Also on "gobject-2.0 gmodule-2.0 gio-2.0", but I did not find >> those packages, >> so I assume they are generated somewhere. Did not investigate further. > I'm sure patches would be most welcome for bugs like this. > > Cheers, > > Richard > > _______________________________________________ > Openembedded-core mailing list > Openembedded-core@lists.openembedded.org > http://lists.openembedded.org/mailman/listinfo/openembedded-core -- Best Regards Ulf Samuelsson eMagii ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-20 22:43 ` Ulf Samuelsson @ 2013-11-21 0:19 ` Martin Jansa 2013-11-21 7:15 ` Ulf Samuelsson 0 siblings, 1 reply; 13+ messages in thread From: Martin Jansa @ 2013-11-21 0:19 UTC (permalink / raw) To: ulf; +Cc: openembedded-core [-- Attachment #1: Type: text/plain, Size: 3136 bytes --] On Wed, Nov 20, 2013 at 11:43:13PM +0100, Ulf Samuelsson wrote: > 2013-11-20 22:29, Richard Purdie skrev: > Another idea: > > I suspect that there is a lot of unpacking and patching of recipes > for the target when the native stuff is built. > Does it make sense to have multiple threads reading the disk, for > the target recipes during the native build or will we just lose out > due to seek time? > > Having multiple threads accessing the disk, might force the disk to spend > most of its time seeking. > Found an application which measures seek time performance, > and my WD Black will do 83 seeks per second, and my SAS disk will do > twice that. > The RAID of two SAS disks will provide close to SSD throughput (380 MB/s) > but seek time is no better than a single SAS disk. > > Since there is "empty time" at the end of the native build, does it make > sense > to minimize unpack/patch of target stuff when we reach that point, and > then we let loose? In my benchmarks increasing PARALLEL_MAKE till number of cores was significantly improving build time, but BB_NUMBER_THREADS had minimal influence somewhere above 6 or 8 (tested on various systems, even only 4 was optimum on my older RAID-0 and 2 on single disk). Of course it was quite different for clean build without sstate prepopulated and build where most of the stuff was reused from sstate. see http://wiki.webos-ports.org/wiki/OE_benchmark > ======================== > > Now with 48 MB of RAM, (which I might grow to 96 GB, if someone proves that > this makes it faster), this might be useful to speed things up. > > Can tmpfs beat the kernel cache system? > > 1. Typically, I work on less than 10 recipes, and if I continuosly > rebuild those, why not create the build directories as links to > a tmpfs file system. > Maybe a configuration file with a list of recipes to build on > tmpfs. > > During a build from scratch, this is not so useful, but once > most stuff is in place, it might, > > 2. If the downloads directory was shadowed in a tmpfs system > then there would be less seek time during the build. > The downloads tmpfs should be poplulated at boot time, > and rsynced with a real disk in the background when new stuff > is downloaded from internet. > > 3. With 96 GB of RAM, maybe the complete build directory will fit. > Would be nice to build everything on tmpfs, and automatically rsync > to a real disk when there is nothing else to do... > > 4. If not tmpfs is used, then It would still be good to have better > control > over the build directory. > It make sense to me to have the metadata on an SSD, but the > build directory should be on my RAID cluster for fast rebuilds. > I can set this up manually, but it would be better to be able to > specify this in a configuration file. > See http://www.mail-archive.com/yocto@yoctoproject.org/msg14879.html -- Martin 'JaMa' Jansa jabber: Martin.Jansa@gmail.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 205 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-21 0:19 ` Martin Jansa @ 2013-11-21 7:15 ` Ulf Samuelsson 2013-11-21 12:53 ` Martin Jansa 2013-11-23 18:39 ` Nicolas Dechesne 0 siblings, 2 replies; 13+ messages in thread From: Ulf Samuelsson @ 2013-11-21 7:15 UTC (permalink / raw) To: Martin Jansa; +Cc: openembedded-core 2013-11-21 01:19, Martin Jansa skrev: > On Wed, Nov 20, 2013 at 11:43:13PM +0100, Ulf Samuelsson wrote: >> 2013-11-20 22:29, Richard Purdie skrev: >> Another idea: >> >> I suspect that there is a lot of unpacking and patching of recipes >> for the target when the native stuff is built. >> Does it make sense to have multiple threads reading the disk, for >> the target recipes during the native build or will we just lose out >> due to seek time? >> >> Having multiple threads accessing the disk, might force the disk to spend >> most of its time seeking. >> Found an application which measures seek time performance, >> and my WD Black will do 83 seeks per second, and my SAS disk will do >> twice that. >> The RAID of two SAS disks will provide close to SSD throughput (380 MB/s) >> but seek time is no better than a single SAS disk. >> >> Since there is "empty time" at the end of the native build, does it make >> sense >> to minimize unpack/patch of target stuff when we reach that point, and >> then we let loose? > In my benchmarks increasing PARALLEL_MAKE till number of cores was > significantly improving build time, but BB_NUMBER_THREADS had minimal > influence somewhere above 6 or 8 (tested on various systems, even only 4 was > optimum on my older RAID-0 and 2 on single disk). > Of course it was quite different for clean build without sstate > prepopulated and build where most of the stuff was reused from sstate. > > see http://wiki.webos-ports.org/wiki/OE_benchmark How many cores do you have in your build machine? I started a build, and after 20 minutes it had completed 1500 tasks using: PARALLEL_MAKE = "-j24" BB_NUMBER_THREADS = "6" The I decided to kill it. When I did PARALLEL_MAKE = "-j12" BB_NUMBER_THREADS = "24" It completed 2000 tasks in less than half the time. This does not use tmpfs though. Do you have any comparision between tmpfs builds and RAID builds? I currently do not use INHERIT += "rm_work" since I want to be able to do changes on some packages. Is there a way to defined rm_work on a package basis? Then the majority of the packages can be removed. I use 75 GB without "rm_work" BR Ulf > >> ======================== >> >> Now with 48 MB of RAM, (which I might grow to 96 GB, if someone proves that >> this makes it faster), this might be useful to speed things up. >> >> Can tmpfs beat the kernel cache system? >> >> 1. Typically, I work on less than 10 recipes, and if I continuosly >> rebuild those, why not create the build directories as links to >> a tmpfs file system. >> Maybe a configuration file with a list of recipes to build on >> tmpfs. >> >> During a build from scratch, this is not so useful, but once >> most stuff is in place, it might, >> >> 2. If the downloads directory was shadowed in a tmpfs system >> then there would be less seek time during the build. >> The downloads tmpfs should be poplulated at boot time, >> and rsynced with a real disk in the background when new stuff >> is downloaded from internet. >> >> 3. With 96 GB of RAM, maybe the complete build directory will fit. >> Would be nice to build everything on tmpfs, and automatically rsync >> to a real disk when there is nothing else to do... >> >> 4. 
If not tmpfs is used, then It would still be good to have better >> control >> over the build directory. >> It make sense to me to have the metadata on an SSD, but the >> build directory should be on my RAID cluster for fast rebuilds. >> I can set this up manually, but it would be better to be able to >> specify this in a configuration file. >> > See > http://www.mail-archive.com/yocto@yoctoproject.org/msg14879.html > -- Best Regards Ulf Samuelsson ulf@emagii.com +46 722 427437 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-21 7:15 ` Ulf Samuelsson @ 2013-11-21 12:53 ` Martin Jansa 2013-11-23 18:39 ` Nicolas Dechesne 1 sibling, 0 replies; 13+ messages in thread From: Martin Jansa @ 2013-11-21 12:53 UTC (permalink / raw) To: Ulf Samuelsson; +Cc: openembedded-core [-- Attachment #1: Type: text/plain, Size: 3709 bytes --] On Thu, Nov 21, 2013 at 08:15:08AM +0100, Ulf Samuelsson wrote: > 2013-11-21 01:19, Martin Jansa skrev: > > On Wed, Nov 20, 2013 at 11:43:13PM +0100, Ulf Samuelsson wrote: > >> 2013-11-20 22:29, Richard Purdie skrev: > >> Another idea: > >> > >> I suspect that there is a lot of unpacking and patching of recipes > >> for the target when the native stuff is built. > >> Does it make sense to have multiple threads reading the disk, for > >> the target recipes during the native build or will we just lose out > >> due to seek time? > >> > >> Having multiple threads accessing the disk, might force the disk to spend > >> most of its time seeking. > >> Found an application which measures seek time performance, > >> and my WD Black will do 83 seeks per second, and my SAS disk will do > >> twice that. > >> The RAID of two SAS disks will provide close to SSD throughput (380 MB/s) > >> but seek time is no better than a single SAS disk. > >> > >> Since there is "empty time" at the end of the native build, does it make > >> sense > >> to minimize unpack/patch of target stuff when we reach that point, and > >> then we let loose? > > In my benchmarks increasing PARALLEL_MAKE till number of cores was > > significantly improving build time, but BB_NUMBER_THREADS had minimal > > influence somewhere above 6 or 8 (tested on various systems, even only 4 was > > optimum on my older RAID-0 and 2 on single disk). > > Of course it was quite different for clean build without sstate > > prepopulated and build where most of the stuff was reused from sstate. > > > > see http://wiki.webos-ports.org/wiki/OE_benchmark > > How many cores do you have in your build machine? The one used in OE_benchmark has 8, my local builder also 8, I got the same results on machines with 32 and 48 cores. My experience (which can be different than what you see), is that PARALLEL_MAKE scales well with number of cores, but BB_NUMBER_THREADS is more or less limited by I/O performance, so even when the machine has 48 cores, it doesn't say anything about running 48 do_populate or do_package tasks at the same time causing avalanche of seeks. The other extreme is when all 48 BB threads are in do_compile and you can get 48x48 gcc processes which again doesn't work well on machine with 48 cores. with PARALLEL_MAKE = "-j32" BB_NUMBER_THREADS = "6" and very big image build, I see all cores well used most of the time. > I started a build, and after 20 minutes it had completed 1500 tasks using: > > PARALLEL_MAKE = "-j24" > BB_NUMBER_THREADS = "6" > > The I decided to kill it. > > When I did > PARALLEL_MAKE = "-j12" > BB_NUMBER_THREADS = "24" > > It completed 2000 tasks in less than half the time. You should have finish whole image, you can get 2000 tasks sooner (tasks like fetch/unpack/patch) but then you're still waiting for the rest, with smaller BB_NUMBER_THREADS it seems to spread tasks more evenly (doing more fetch/unpack/patch tasks later when CPUs are busy compiling something, which is good for I/O). > This does not use tmpfs though. > Do you have any comparision between tmpfs builds and RAID builds? I've sent it to ML few months ago, cannot find it now. 
> I currently do not use INHERIT += "rm_work" > since I want to be able to do changes on some packages. > Is there a way to defined rm_work on a package basis? > Then the majority of the packages can be removed. > > I use 75 GB without "rm_work" Understood, in my scenario I want to build world as soon as possible, keep sstate, record issues and forget about BUILDDIR. -- Martin 'JaMa' Jansa jabber: Martin.Jansa@gmail.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 205 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-21 7:15 ` Ulf Samuelsson 2013-11-21 12:53 ` Martin Jansa @ 2013-11-23 18:39 ` Nicolas Dechesne 1 sibling, 0 replies; 13+ messages in thread From: Nicolas Dechesne @ 2013-11-23 18:39 UTC (permalink / raw) To: ulf; +Cc: Patches and discussions about the oe-core layer [-- Attachment #1: Type: text/plain, Size: 500 bytes --] On Thu, Nov 21, 2013 at 8:15 AM, Ulf Samuelsson <ulf@emagii.com> wrote: > I currently do not use INHERIT += "rm_work" > since I want to be able to do changes on some packages. > Is there a way to defined rm_work on a package basis? > Then the majority of the packages can be removed. > from rm_work.bbclass: # To inhibit rm_work for some recipes, specify them in RM_WORK_EXCLUDE. # For example, in conf/local.conf: # # RM_WORK_EXCLUDE += "icu-native icu busybox" [-- Attachment #2: Type: text/html, Size: 1613 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-20 21:29 ` Richard Purdie 2013-11-20 22:43 ` Ulf Samuelsson @ 2013-11-21 0:10 ` Martin Jansa 2013-11-21 8:04 ` Ulf Samuelsson 2 siblings, 0 replies; 13+ messages in thread From: Martin Jansa @ 2013-11-21 0:10 UTC (permalink / raw) To: Richard Purdie Cc: Patches and discussions about the oe-core layer, Discussion of the angstrom distribution development, Ulf Samuelsson [-- Attachment #1: Type: text/plain, Size: 1379 bytes --] On Wed, Nov 20, 2013 at 09:29:16PM +0000, Richard Purdie wrote: > Hi Ulf, > > > (3) Creating the rootfs seems to have zero parallelism. > > But I have not investigated if anything can be done. > > This is something I do want to fix in 1.6. We need to convert the core > to python to gain access to easier threading mechanisms though. > Certainly parallel image type generation and compression would be a win > here. If you're building .bz2 images, then installing lbzip2/pbzip2 saves a lot of time in FSTYPES creation. > > DEPENDENCY BUGS > > pangomm needs to depend on "pango" > > Otherwise, the required pangocairo might not be available when > > pangomm is configured > > > > goffice needs to depend on "librsvg gdk-pixbuf" > > Also on "gobject-2.0 gmodule-2.0 gio-2.0", but I did not find > > those packages, > > so I assume they are generated somewhere. Did not investigate further. > > I'm sure patches would be most welcome for bugs like this. But please upgrade to latest layers first, because there 2 weren't detected in my last test-dependencies.sh so I guess they were fixed already. http://lists.openembedded.org/pipermail/openembedded-core/2013-October/084905.html You can use the same script to keep your new toy busy for a while :). -- Martin 'JaMa' Jansa jabber: Martin.Jansa@gmail.com [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 205 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-20 21:29 ` Richard Purdie 2013-11-20 22:43 ` Ulf Samuelsson 2013-11-21 0:10 ` Martin Jansa @ 2013-11-21 8:04 ` Ulf Samuelsson 2013-11-21 13:53 ` Richard Purdie 2 siblings, 1 reply; 13+ messages in thread From: Ulf Samuelsson @ 2013-11-21 8:04 UTC (permalink / raw) To: openembedded-core >> Why restrict PARALLEL_MAKE to anything less than the number of H/W >> threads in the machine? >> >> Came up with a construct PARALLEL_HIGH which is defined alongside >> PARALLEL_MAKE in conf/local.conf >> >> PARALLEL_MAKE = "-j8" >> PARALLEL_HIGH = "-j24" >> >> In the appropriate recipes, which seems to be processed by bitbake >> in solitude I do: >> >> PARALLEL_HIGH ?= "${PARALLEL_MAKE}" >> PARALLEL_MAKE = "${PARALLEL_HIGH}" >> >> This means that they will try to use each H/W thread. > Please benchmark the difference. I suspect we can just set the high > number of make for everything. Note that few makefiles are well enough > written to benefit from high levels of make (webkit being an notable > exception). > It looks like it is shaving off ~2 minutes from a build which normally takes ~84 minutes. First build PARALLEL_MAKE = "-j12" PARALLEL_HIGH = "-j24" BB_NUMBER_THREADS = "24" real 83m24.093s Second build PARALLEL_MAKE = "-j12" PARALLEL_HIGH = "-j12" BB_NUMBER_THREADS = "24" real 85m12.007s BR Ulf > Cheers, Richard _______________________________________________ > Openembedded-core mailing list > Openembedded-core@lists.openembedded.org > http://lists.openembedded.org/mailman/listinfo/openembedded-core -- Best Regards Ulf Samuelsson eMagii ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-21 8:04 ` Ulf Samuelsson @ 2013-11-21 13:53 ` Richard Purdie 2013-11-23 15:06 ` Ulf Samuelsson 0 siblings, 1 reply; 13+ messages in thread From: Richard Purdie @ 2013-11-21 13:53 UTC (permalink / raw) To: ulf; +Cc: openembedded-core On Thu, 2013-11-21 at 09:04 +0100, Ulf Samuelsson wrote: > >> Why restrict PARALLEL_MAKE to anything less than the number of H/W > >> threads in the machine? > >> > >> Came up with a construct PARALLEL_HIGH which is defined alongside > >> PARALLEL_MAKE in conf/local.conf > >> > >> PARALLEL_MAKE = "-j8" > >> PARALLEL_HIGH = "-j24" > >> > >> In the appropriate recipes, which seems to be processed by bitbake > >> in solitude I do: > >> > >> PARALLEL_HIGH ?= "${PARALLEL_MAKE}" > >> PARALLEL_MAKE = "${PARALLEL_HIGH}" > >> > >> This means that they will try to use each H/W thread. > > Please benchmark the difference. I suspect we can just set the high > > number of make for everything. Note that few makefiles are well enough > > written to benefit from high levels of make (webkit being an notable > > exception). > > > It looks like it is shaving off ~2 minutes from a build which normally > takes ~84 minutes. > > First build > PARALLEL_MAKE = "-j12" > PARALLEL_HIGH = "-j24" > BB_NUMBER_THREADS = "24" > real 83m24.093s > > Second build > PARALLEL_MAKE = "-j12" > PARALLEL_HIGH = "-j12" > BB_NUMBER_THREADS = "24" > real 85m12.007s but what if you set both to -j24? What I'm trying to understand is if we really need two different variables? Note you can also do: PARALLEL_MAKE = "-j12" PARALLEL_MAKE_pn-webkit-gtk = "-j24" so I'm still not convinced we want to start having PARALLEL_HIGH as it will just confuse users IMO. Cheers, Richard ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-21 13:53 ` Richard Purdie @ 2013-11-23 15:06 ` Ulf Samuelsson 0 siblings, 0 replies; 13+ messages in thread From: Ulf Samuelsson @ 2013-11-23 15:06 UTC (permalink / raw) To: openembedded-core 2013-11-21 14:53, Richard Purdie skrev: > On Thu, 2013-11-21 at 09:04 +0100, Ulf Samuelsson wrote: >>>> Why restrict PARALLEL_MAKE to anything less than the number of H/W >>>> threads in the machine? >>>> >>>> Came up with a construct PARALLEL_HIGH which is defined alongside >>>> PARALLEL_MAKE in conf/local.conf >>>> >>>> PARALLEL_MAKE = "-j8" >>>> PARALLEL_HIGH = "-j24" >>>> >>>> In the appropriate recipes, which seems to be processed by bitbake >>>> in solitude I do: >>>> >>>> PARALLEL_HIGH ?= "${PARALLEL_MAKE}" >>>> PARALLEL_MAKE = "${PARALLEL_HIGH}" >>>> >>>> This means that they will try to use each H/W thread. >>> Please benchmark the difference. I suspect we can just set the high >>> number of make for everything. Note that few makefiles are well enough >>> written to benefit from high levels of make (webkit being an notable >>> exception). >>> >> It looks like it is shaving off ~2 minutes from a build which normally >> takes ~84 minutes. >> >> First build >> PARALLEL_MAKE = "-j12" >> PARALLEL_HIGH = "-j24" >> BB_NUMBER_THREADS = "24" >> real 83m24.093s >> >> Second build >> PARALLEL_MAKE = "-j12" >> PARALLEL_HIGH = "-j12" >> BB_NUMBER_THREADS = "24" >> real 85m12.007s > but what if you set both to -j24? > > What I'm trying to understand is if we really need two different > variables? > > Note you can also do: > > PARALLEL_MAKE = "-j12" > PARALLEL_MAKE_pn-webkit-gtk = "-j24" > > so I'm still not convinced we want to start having PARALLEL_HIGH as it > will just confuse users IMO. Today I tried building Angstrom cloud9-gnome-image which is about 75 GB. "sources" and "build" both located in tmpfs. (What the heck, RAM is cheap) PARALLEL_MAKE = "-j 12" PARALLEL_HIGH = "24" BB_NUMBER_THREADS = "24" The time to build from a RAID 0 (2 x SAS 15k RPM) was 01:23:25 The time to build from tmpfs was 01:21:15 This includes rsync'ing the deploy directory to the RAID disk so improving disk performance has its limits. (It was nice not listening to the disk seeks though) Only a 2 minute difference which is a bit disappointing... It completed 7658 task. I tried to check parallellity during the build by: ps -e | grep make | wc -l Everythings seems to be nice until about 3500 tasks. Then the numbed of makes drop dramatically When gcc-cross-linaro was built, only 2 makes are in progress. Between 4000 - 6000 the number of makes vary around 10-20 After 6000 it rises and varies between 30-50. There is a noticeable slowdown in task completion rate Around 7500 the number of tasks drop to a handful, and so does the number of makes. When gimp is the only package compiling, make count = 4 13:52:22 Building cloud9-icu-gnome-image 14:12:20 4000 19:58 14:19:04 5000 04:44 makes = (10-20) 14:27:21 5531 14:31:48 6000 12:44 makes = (30-50) 14:40:43 6500 08:57 14:57:42 7500 16:59 15:03:38 7647 15:06:45 7657 building gimp 15:13:56 7658 do_rootfs ============================================ I suspect that there are a number of packages that ignore PARALLEL_MAKE by "${MAKE} target inside the Makefile without passing PARALLEL_MAKE The gcc compiler build is one, but I suspect eglibc eglibc-locale webkit-gtk pulseaudio gimp inkscape glib-2.0 as well ============================================ Running 50 makes on a 24 thread machine is probably no good. 
One possible idea would be to count most tasks a "1" thread but to count a "do_compile" as "2" or "3" threads when determining whether to start new tasks or not. If there are few computables, then this would not limit anything, If there are many compiles that are computable, then fewer would be started. I suspect the latter part of the build will benefit. Know too little about the bitbake source to do modifications, but I think that if every time a do_compile is started, a variable "maketasks" is increased, and then decreased when stop you could do: if ((activity + (maketasks * scale_factor)) < number_tasks) then It would reduce the risk of getting into the situations where you have man more make provesses than H/W threads. Since the behaviour of the build varies over time, I think a dynamic algorithm of some kind is needed. Would it not be fun, if bitbake could tell the kernel how many makes to allow at a certan point of time? and make would request a number of threads, but would be satisfied with the number provided by the kernel =============================== BTW: found another lacking dependency parted needs libdl during configure, which means it needs to depend on "eglibc". BR Ulf Samuelsson > Cheers, > > Richard > > _______________________________________________ > Openembedded-core mailing list > Openembedded-core@lists.openembedded.org > http://lists.openembedded.org/mailman/listinfo/openembedded-core -- Best Regards Ulf Samuelsson eMagii ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-20 21:05 Improving Build Speed Ulf Samuelsson 2013-11-20 21:29 ` Richard Purdie @ 2013-11-21 10:05 ` Burton, Ross 2013-11-21 11:51 ` Enrico Scholz 2 siblings, 0 replies; 13+ messages in thread From: Burton, Ross @ 2013-11-21 10:05 UTC (permalink / raw) To: Ulf Samuelsson Cc: Discussion of the angstrom distribution development, Patches and discussions about the oe-core layer On 20 November 2013 21:05, Ulf Samuelsson <angstrom-dev@emagii.com> wrote: > do_compile() { > if [ x"$MAKE" = x ]; then MAKE=make; fi > ... > for error_count in 1 2 3; do > ... > ${MAKE} ${EXTRA_OEMAKE} "$@" || exit_code=1 > ... > done > ... > } > > Not sure, but I think this means that PARALLEL_MAKE might get ignored. Yeah, good catch - the point of the loop was to handle random failures caused by dependency chains breaking when doing parallel builds... Anyway, that hack isn't present in 1.5 because it was caused by using an unpatched (read: broken) make 3.82 which we now detect at startup. Ross ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Improving Build Speed 2013-11-20 21:05 Improving Build Speed Ulf Samuelsson 2013-11-20 21:29 ` Richard Purdie 2013-11-21 10:05 ` Burton, Ross @ 2013-11-21 11:51 ` Enrico Scholz 2 siblings, 0 replies; 13+ messages in thread From: Enrico Scholz @ 2013-11-21 11:51 UTC (permalink / raw) To: openembedded-core Ulf Samuelsson <angstrom-dev-AoFPY8dbyRPQT0dZR+AlfA@public.gmane.org> writes: > PARALLEL_MAKE = "-j6" > BB_NUMBER_THREADS = "24" I define | PARALLEL_MAKE = "\ | -j ${@int(os.sysconf(os.sysconf_names['SC_NPROCESSORS_ONLN'])) * 2} \ | -l ${@int(os.sysconf(os.sysconf_names['SC_NPROCESSORS_ONLN'])) * 150/100} \ | " | | BB_NUMBER_THREADS ?= "\ | ${@int(os.sysconf(os.sysconf_names['SC_NPROCESSORS_ONLN'])) * 150/100}" in my global configuration (note the '-l'). I would like to limit it by the available RAM size (e.g. one -j per GB) but BB_NUMBER_THREADS makes it difficultly to express it. There are also dependencies on the used filesystem (e.g. btrfs performance seems to degrade rapidly with higher -j). It would be perfect when bitbake takes the role of the toplevel jobserver[1] but that's probably very difficultly to implement and might interfere with recursive make. > and was quicker, but it seemed to be a little flawed. At several > times during the build, the CPU frequtil showed that most of the cores > went down to minimum frequency (2,93 GHz -> 1,6 GHz) Capturing resource usage (--> getrusage(2)) will give more details (e.g. about i/o load). E.g. see https://www.cvg.de/people/ensc/oe-metrics.html Enrico ^ permalink raw reply [flat|nested] 13+ messages in thread