Hello!

rm_work.bbclass currently isn't good enough for use in really tightly
constrained environments. For example, building core-image-sato with
20GB of free disk space should work (final disk usage with rm_work
active: <13GB). But in practice, that disk space gets exhausted in the
middle of a build because do_rm_work doesn't run often and/or soon
enough.

I ran into this when building meta-intel-iot-security under TravisCI.
I'm now trying to improve what I did to enhance rm_work.bbclass and
submit it upstream. It isn't ready yet, so consider this a request for
comments. Current code as set of patches on top of poky is here:
https://github.com/pohly/poky/commits/rmwork

While working on this I wanted to see system resource usage over time. I
found that pybootchart has support for drawing such charts, but that
code is all disabled because the information isn't collected for a
build. I've added that to buildstats.bbclass and also added logging of
the disk usage, by hooking into the existing disk monitoring.

Collecting this information works best when done at regular time
intervals. There's no support for that in current bitbake, so I had to
add something (HeartbeatEvent).

Data collection is currently enabled by default and cannot be turned
off. There is some rate limit, but despite that, for a fast
core-image-sato build of 45 minutes, that additional logging adds 14M to
the buildstats, an increase of 66%. The overhead is larger for slower
builds (information about tasks the same, longer time range -> more
system usage information): 127%.

Should this data collection be enabled by default when
buildstats.bbclass is active? If yes, then I can think of several ways
of reducing the overhead (pre-process the information instead of writing
raw /proc dumps, on-the-fly compression), but it might not be worthwhile
to bother with that if it only gets written when explicitly requested. 

OTOH, parsing seems a bit slower now, so some further optimizations may
be useful either way.

Personally I'd prefer to have it enabled by default, because it is
useful to have the information after something unusual happens - perhaps
it can't be reproduced, or doing so would be slow.

If it is enabled by default, the data is small enough and runtime
overhead is low, should it be possible to disable it at all when
buildstats.bbclass is inherited? How small and low is enough?

In the current implementation, disk space usage is logged exactly for
those volumes that are monitored already (i.e. BB_DISKMON_DIRS). Does
anyone see a need to configure that separately? I personally didn't.

Now regarding rm_work.bbclass: the main problem (unfortunately not
solved to my own satisfaction) is that the do_rm_work cleanup task
depends on do_build, to ensure that all work for the recipe really is
done. But that has the effect that foo:do_rm_work also depends on
bar:do_build when there is a foo->bar dependency. That's how bitbake
ensures that building foo also builds everything needed by it.

I tried injecting do_rm_work between the tasks of a recipe and do_build
via an anonymous python method. That also allows removing the
BB_DEFAULT_TASK modification. The problem with that is determining "the
tasks of a recipe", As I suspected, the approach runs into problems when
some other classes also use anonymous python code to add tasks. In
practice, do_package_write_rpm/ipk/deb ended up getting added after the
do_rm_work injection, leading to race conditions. To move on with my
testing, I hard-coded those tasks as pre-condition for do_rm_work, but
that's just too fragile.

To solve this I can think of two solutions:
     1. ordering anonymous python methods by priority and establishing
        some kind of convention which changes are done at which priority
        - no idea how realistic or hard that is
     2. a dedicated hook for rm_work.bbclass in bitbake, similar to
        calculate_extra_depends in meta-world-pkgdata.bb

In addition, I also played with a different scheduler. The "completion"
scheduler used by rm_work.bbclass orders tasks like this (most important
first):

1. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_fetch
2. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_unpack
3. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_patch
4. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_configure
5. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_compile
6. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_install
7. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_populate_sysroot
8. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_populate_lic
9. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_rm_work
10. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_build
11. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_fetch
12. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_unpack
13. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_patch
14. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_configure
15. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_compile
16. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_install
17. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_populate_sysroot
18. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_populate_lic
19. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_rm_work
20. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_build
21. ID virtual:native:/work/poky/meta/recipes-devtools/gnu-config/gnu-config_git.bb:do_fetch
...

It basically starts with the most important recipe (as determined by a
dependency analysis that checks for the number of other recipes directly
or indirectly depending on it) and then lists all tasks in that recipe,
then continues with the next recipe.

My impression was that this can lead to the following undesirable
situation: foo:do_rm_work and bar:do_compile are both ready to run,
"bar" is more important than "foo" => another compile job gets started
instead of cleaning up some disk space first.

So I wrote a scheduler which orders first by tasks, starting with the
ones that are necessary to complete a recipe. The ordering of recipes is
the same as in the speed scheduler, i.e. if normally foo:do_compile
comes before bar:do_compile, then so does it with the rmwork scheduler.
The result is something like this:

1. ID /work/poky/meta/recipes-support/popt/popt_1.16.bb:do_build
2. ID /work/poky/meta/recipes-core/readline/readline_6.3.bb:do_build
3. ID /work/poky/meta/recipes-connectivity/libnss-mdns/libnss-mdns_0.10.bb:do_build
...
464. ID /work/poky/meta/recipes-sato/images/core-image-sato.bb:do_build
465. ID /work/poky/meta/recipes-graphics/xorg-proto/inputproto_2.3.2.bb:do_rm_work
466. ID /work/poky/meta/recipes-devtools/python/python3_3.5.2.bb:do_rm_work
467. ID /work/poky/meta/recipes-core/packagegroups/packagegroup-base.bb:do_rm_work
...
3620. ID virtual:native:/work/poky/meta/recipes-extended/pbzip2/pbzip2_1.1.13.bb:do_install
3621. ID /work/poky/meta/recipes-devtools/qemu/qemu-helper-native_1.0.bb:do_install
3622. ID /work/poky/meta/recipes-core/zlib/zlib_1.2.8.bb:do_compile_ptest_base
3623. ID /work/poky/meta/recipes-extended/bzip2/bzip2_1.0.6.bb:do_compile_ptest_base
...
3645. ID /work/poky/meta/recipes-support/libevent/libevent_2.0.22.bb:do_compile_ptest_base
3646. ID /work/poky/meta/recipes-core/busybox/busybox_1.24.1.bb:do_compile_ptest_base
3647. ID /work/poky/meta/recipes-kernel/linux/linux-yocto_4.8.bb:do_uboot_mkimage
3648. ID /work/poky/meta/recipes-kernel/linux/linux-yocto_4.8.bb:do_sizecheck
3649. ID /work/poky/meta/recipes-kernel/linux/linux-yocto_4.8.bb:do_strip
3650. ID /work/poky/meta/recipes-kernel/linux/linux-yocto_4.8.bb:do_compile_kernelmodules
3651. ID /work/poky/meta/recipes-kernel/linux/linux-yocto_4.8.bb:do_shared_workdir
3652. ID /work/poky/meta/recipes-kernel/linux/linux-yocto_4.8.bb:do_kernel_link_images
3653. ID /work/poky/meta/recipes-devtools/quilt/quilt-native_0.64.bb:do_compile
3654. ID /work/poky/meta/recipes-extended/texinfo-dummy-native/texinfo-dummy-native.bb:do_compile
...

Finally, I also added the possibility to run more of the "light" tasks
like do_rm_work in parallel to "heavy" tasks like do_compile. The
intention was that if BB_NUM_THREADS is exhausted by active do_compile
tasks (which are usually CPU bound), then running some additional
do_rm_work tasks in parallel will help to keep disk usage lower and also
won't interfere much with on-going compile jobs because file removal is
more IO intensive. It might also be beneficial to prepare the next
do_compile with do_fetch/unpack/patch/configure.

Orthogonal to that, I implemented the possibility to let downloaded
sources get removed by do_rm_work by using a per-recipe DL_DIR (see
rm_work_and_downloads.bbclass). This trades potentially downloading the
same source multiple times against disk usage. It's not ideal and
unusable for incremental builds (removing the source currently does not
trigger a re-download), but was enough for me under TravisCI.

But how well do these different improvements really work, and which ones
are worth the additional complexity? I ran benchmarks with building
core-image-sato on x86-64 for the default qemux86, on an 8 core +
hyperthreading i7-5960X Processor Extreme Edition (20M Cache, up to 3.50
GHz) with 32GB of RAM. Disk space is on a striped RAID array of two
conventional disks.

Test script attached. I only ran this once, so I can't vouch that the
numbers are stable. I remember reading about a benchmark wrapper which
can run such a test script multiple times and then will automatically
merge the output of all runs together, adding min/max/average/deviation,
but couldn't find the tool again. Does anyone know which tool does that?

Base build (no rm_work, BB_NUM_THREADS=16):
elapsed: 45:38.18
final disk usage: 32538MiB
max disk usage: 32475MiB

Original rm_work.bbclass:
elapsed: 44:20.16
final disk usage: 12875MiB
max disk usage: 18740MiB

The maximum disk usage here is from the buildstats, computed as delta
between lowest observed value and maximum observed value. It is a bit
too low because disk monitoring starts after parsing recipes and may
miss spikes. I also ran experiments with building in a fixed 20GB
partition, and that was not enough, the builds were always aborted by
the disk monitor when hitting the 19GB mark.

Total build time though is a bit lower, also with the other rm_work
configurations. I'm not exactly sure why that is - perhaps the disk
cache works better because it can reuse blocks?

Attached is the pybootchart of that build. Note how disk space usage and
size of the amount of cached pages go down in parallel at the end of the
build.

new-rmwork (= same scheduler, do_rm_work injected before do_build):
elapsed: 44:03.53
final disk usage: 12873MiB
max disk usage: 14444MiB

That's the major improvement. I did the scheduler changes first, but
those were not enough to get a build to complete in a 20GB partition
because the do_build dependencies prevented do_rm_work from running.

new-rmwork-new-scheduler
elapsed: 42:58.54
final disk usage: 12873MiB
max disk usage: 14230MiB

A bit better in terms of max disk usage than with the completion
scheduler, but not by much. My observation is that in practice there
aren't that many ready tasks, so often the choice of priorities among
them doesn't have such a big influence. The lower bound of the build
time is determined by the length of the do_configure->do_compile
critical path and it does not matter much when the other tasks run, as
long as they run at some point in parallel to that. Perhaps that is
different for larger distros. I might redo this benchmark with Ostro OS
XT...

new-rmwork-new-scheduler-additional-tasks (16 additional "light" tasks allowed):
elapsed: 45:20.15
final disk usage: 12876MiB
max disk usage: 13838MiB

This brings down maximum disk usage down a bit more, but perhaps now the
additional tasks do start to interfere with the critical path, because
total build time goes up.

The implementation is a hack, too (see comments in the code). I'm
inclined to just drop this mode.

Because TravisCI not only has little disk space, but also few CPUs, I
also ran this experiment over night with BB_NUM_THREADS and the entire
build pinned to just four CPUs with "taskset -c 4,5,6,7".

base
elapsed: 1:29:40
final disk usage: 32459MiB
max disk usage: 32430MiB

original-rmwork
elapsed: 1:30:56
final disk usage: 12881MiB
max disk usage: 16449MiB

new-rmwork
elapsed: 1:31:06
final disk usage: 12856MiB
max disk usage: 14412MiB

new-rmwork-new-scheduler
elapsed: 1:31:39
final disk usage: 12860MiB
max disk usage: 14423MiB

new-rmwork-new-scheduler-additional-tasks
elapsed: 1:27:38
final disk usage: 12887MiB
max disk usage: 14485MiB

This last build is the first time that I ever got more than
BB_NUM_THREADS do_compile tasks running in parallel. There wasn't enough
parallelism in the build for that when using 16 CPUs.

The results are not that different. My overall conclusion still is that
injecting do_rm_work is crucial, while the improved scheduler isn't that
relevant, at least not in these tests.

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.