* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
@ 2019-07-11 9:54 Victor Huesca
2019-07-11 9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
0 siblings, 2 replies; 6+ messages in thread
From: Victor Huesca @ 2019-07-11 9:54 UTC (permalink / raw)
To: buildroot
The pkg-stats calls 3 times `make` to get a bunch of variables. These
calls all takes about the same time -- 95% of the time is spent parsing
the makefile -- so running these subprocesses in parallel can save this
shared time.
The gain scale quite well since with only 3 subprocesses I achieve a
x2.5 ~ x3 speedup.
Signed-off-by: Victor Huesca <victor.huesca@bootlin.com>
---
support/scripts/pkg-stats | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
index 1ff3fb2bd1..2ed41dffa9 100755
--- a/support/scripts/pkg-stats
+++ b/support/scripts/pkg-stats
@@ -220,11 +220,30 @@ def get_pkglist(npackages, package_list):
return packages
+def check_multiple_output(cmds, *popenargs, **kwargs):
+ '''
+ Run each `cmd` asynchronousely and return their respective output.
+ The execution time is the max of all subprocess execution time.
+ '''
+ processes = [subprocess.Popen(cmd, *popenargs,
+ stdout=subprocess.PIPE,
+ stderr=subprocess.PIPE,
+ **kwargs) for cmd in cmds]
+ for proc in processes:
+ out, err = proc.communicate()
+ yield out
+
+
def package_init_make_info():
+ # Fetch all variables at once
+ licenses, files, versions = check_multiple_output([
+ ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
+ ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
+ ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],
+ ])
+
# Licenses
- o = subprocess.check_output(["make", "BR2_HAVE_DOT_CONFIG=y",
- "-s", "printvars", "VARS=%_LICENSE"])
- for l in o.splitlines():
+ for l in licenses.splitlines():
# Get variable name and value
pkgvar, value = l.split("=")
@@ -241,9 +260,7 @@ def package_init_make_info():
Package.all_licenses.append(pkgvar)
# License files
- o = subprocess.check_output(["make", "BR2_HAVE_DOT_CONFIG=y",
- "-s", "printvars", "VARS=%_LICENSE_FILES"])
- for l in o.splitlines():
+ for l in files.splitlines():
# Get variable name and value
pkgvar, value = l.split("=")
@@ -260,14 +277,12 @@ def package_init_make_info():
Package.all_license_files.append(pkgvar)
# Version
- o = subprocess.check_output(["make", "BR2_HAVE_DOT_CONFIG=y",
- "-s", "printvars", "VARS=%_VERSION"])
# We process first the host package VERSION, and then the target
# package VERSION. This means that if a package exists in both
# target and host variants, with different version numbers
# (unlikely), we'll report the target version number.
- version_list = o.splitlines()
+ version_list = versions.splitlines()
version_list = [x for x in version_list if x.startswith("HOST_")] + \
[x for x in version_list if not x.startswith("HOST_")]
for l in version_list:
--
2.21.0
^ permalink raw reply related [flat|nested] 6+ messages in thread* [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool
2019-07-11 9:54 [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Victor Huesca
@ 2019-07-11 9:54 ` Victor Huesca
2019-07-11 13:37 ` Arnout Vandecappelle
2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
1 sibling, 1 reply; 6+ messages in thread
From: Victor Huesca @ 2019-07-11 9:54 UTC (permalink / raw)
To: buildroot
The major bottleneck in pkg-stats is the time spent waiting for
responses from distant servers. There is tow parts that imply
such communication with remote servers in pkg-stats:
- the function checking that package URLs are responding
- the function that fetch the latest package version from
release-monitoring.
The 1st one already make use of process-pools for speed-up but the 2nd
does not.
This patch simply add support of process-pools to this
`check_package_latest_version` function in addition to http-pool.
There have already been work trying to parallelize this function
using threads but there were a failure on some configurations [1].
This implementation rely on a dedicated module already in use on this
script, so it's unlikely to see failure on this one.
In terms of performances I achieve to run this function in ~3m vs ~25m
for the linear version.
[1] http://lists.busybox.net/pipermail/buildroot/2018-March/215368.html
Signed-off-by: Victor Huesca <victor.huesca@bootlin.com>
---
support/scripts/pkg-stats | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
index 2ed41dffa9..14033941ba 100755
--- a/support/scripts/pkg-stats
+++ b/support/scripts/pkg-stats
@@ -322,9 +322,9 @@ def check_package_urls(packages):
pkg.url_status = pkg.url_worker.get(timeout=3600)
-def release_monitoring_get_latest_version_by_distro(pool, name):
+def release_monitoring_get_latest_version_by_distro(name):
try:
- req = pool.request('GET', "/api/project/Buildroot/%s" % name)
+ req = Package.pool.request('GET', "/api/project/Buildroot/%s" % name)
except HTTPError:
return (RM_API_STATUS_ERROR, None, None)
@@ -339,9 +339,9 @@ def release_monitoring_get_latest_version_by_distro(pool, name):
return (RM_API_STATUS_FOUND_BY_DISTRO, None, data['id'])
-def release_monitoring_get_latest_version_by_guess(pool, name):
+def release_monitoring_get_latest_version_by_guess(name):
try:
- req = pool.request('GET', "/api/projects/?pattern=%s" % name)
+ req = Package.pool.request('GET', "/api/projects/?pattern=%s" % name)
except HTTPError:
return (RM_API_STATUS_ERROR, None, None)
@@ -375,18 +375,20 @@ def check_package_latest_version(packages):
- id: string containing the id of the project corresponding to this
package, as known by release-monitoring.org
"""
- pool = HTTPSConnectionPool('release-monitoring.org', port=443,
- cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
- timeout=30)
- count = 0
+ Package.pool = HTTPSConnectionPool('release-monitoring.org', port=443,
+ cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
+ timeout=30)
+ worker_pool = Pool(processes=64)
+ for pkg in packages:
+ pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_distro, (pkg.name, ))
+
for pkg in packages:
- v = release_monitoring_get_latest_version_by_distro(pool, pkg.name)
- if v[0] == RM_API_STATUS_NOT_FOUND:
- v = release_monitoring_get_latest_version_by_guess(pool, pkg.name)
+ status, _, _ = pkg.url_worker.get()
+ if status == RM_API_STATUS_NOT_FOUND:
+ pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_guess, (pkg.name, ))
- pkg.latest_version = v
- print("[%d/%d] Package %s" % (count, len(packages), pkg.name))
- count += 1
+ for count, pkg in enumerate(packages):
+ pkg.latest_version = pkg.url_worker.get()
def calculate_stats(packages):
@@ -773,7 +775,7 @@ def __main__():
pkg.set_check_package_warnings()
pkg.set_current_version()
pkg.set_url()
- print("Checking URL status")
+ print("Checking URL status ...")
check_package_urls(packages)
print("Getting latest versions ...")
check_package_latest_version(packages)
--
2.21.0
^ permalink raw reply related [flat|nested] 6+ messages in thread* [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool
2019-07-11 9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
@ 2019-07-11 13:37 ` Arnout Vandecappelle
0 siblings, 0 replies; 6+ messages in thread
From: Arnout Vandecappelle @ 2019-07-11 13:37 UTC (permalink / raw)
To: buildroot
On 11/07/2019 11:54, Victor Huesca wrote:
> The major bottleneck in pkg-stats is the time spent waiting for
> responses from distant servers. There is tow parts that imply
> such communication with remote servers in pkg-stats:
> - the function checking that package URLs are responding
> - the function that fetch the latest package version from
> release-monitoring.
> The 1st one already make use of process-pools for speed-up but the 2nd
> does not.
>
> This patch simply add support of process-pools to this
> `check_package_latest_version` function in addition to http-pool.
>
> There have already been work trying to parallelize this function
> using threads but there were a failure on some configurations [1].
> This implementation rely on a dedicated module already in use on this
> script, so it's unlikely to see failure on this one.
>
> In terms of performances I achieve to run this function in ~3m vs ~25m
> for the linear version.
>
> [1] http://lists.busybox.net/pipermail/buildroot/2018-March/215368.html
>
> Signed-off-by: Victor Huesca <victor.huesca@bootlin.com>
> ---
> support/scripts/pkg-stats | 32 +++++++++++++++++---------------
> 1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
> index 2ed41dffa9..14033941ba 100755
> --- a/support/scripts/pkg-stats
> +++ b/support/scripts/pkg-stats
> @@ -322,9 +322,9 @@ def check_package_urls(packages):
> pkg.url_status = pkg.url_worker.get(timeout=3600)
>
>
> -def release_monitoring_get_latest_version_by_distro(pool, name):
> +def release_monitoring_get_latest_version_by_distro(name):
Is there a good reason not to pass pool as an argument here? See below.
> try:
> - req = pool.request('GET', "/api/project/Buildroot/%s" % name)
> + req = Package.pool.request('GET', "/api/project/Buildroot/%s" % name)
> except HTTPError:
> return (RM_API_STATUS_ERROR, None, None)
>
> @@ -339,9 +339,9 @@ def release_monitoring_get_latest_version_by_distro(pool, name):
> return (RM_API_STATUS_FOUND_BY_DISTRO, None, data['id'])
>
>
> -def release_monitoring_get_latest_version_by_guess(pool, name):
> +def release_monitoring_get_latest_version_by_guess(name):
> try:
> - req = pool.request('GET', "/api/projects/?pattern=%s" % name)
> + req = Package.pool.request('GET', "/api/projects/?pattern=%s" % name)
> except HTTPError:
> return (RM_API_STATUS_ERROR, None, None)
>
> @@ -375,18 +375,20 @@ def check_package_latest_version(packages):
> - id: string containing the id of the project corresponding to this
> package, as known by release-monitoring.org
> """
> - pool = HTTPSConnectionPool('release-monitoring.org', port=443,
> - cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
> - timeout=30)
> - count = 0
> + Package.pool = HTTPSConnectionPool('release-monitoring.org', port=443,
> + cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
> + timeout=30)
> + worker_pool = Pool(processes=64)
> + for pkg in packages:
> + pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_distro, (pkg.name, ))
Here, you can just use (Package.pool, pkg.name) as args, no?
With that, you don't need to add pool as a class attribute of Package anymore.
That was anyway a bit ugly.
Also, adding url_worker as an attribute to pkg here is a bit ugly. It would be
less so if you initialize it to None in Package.__init__.
> +
> for pkg in packages:
> - v = release_monitoring_get_latest_version_by_distro(pool, pkg.name)
> - if v[0] == RM_API_STATUS_NOT_FOUND:
> - v = release_monitoring_get_latest_version_by_guess(pool, pkg.name)
> + status, _, _ = pkg.url_worker.get()
Does this work? I mean, can you call get() multiple times?
> + if status == RM_API_STATUS_NOT_FOUND:
> + pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_guess, (pkg.name, ))
I believe this approach has the disadvantage that it still kind of serializes
the retries (not exactly serializes, but they're not parallelized as much as
they could be). I think a better approach would be to include the retry in a
single callback function that tries both.
>
> - pkg.latest_version = v
> - print("[%d/%d] Package %s" % (count, len(packages), pkg.name))
It's a pity that we loose this progress indication, but I don't see a
convenient way to keep it.
> - count += 1
> + for count, pkg in enumerate(packages):
> + pkg.latest_version = pkg.url_worker.get()
You're not actually using count, so no need to enumerate. I guess that's a
leftover from attempting to keep the progress indication.
Regards,
Arnout
>
>
> def calculate_stats(packages):
> @@ -773,7 +775,7 @@ def __main__():
> pkg.set_check_package_warnings()
> pkg.set_current_version()
> pkg.set_url()
> - print("Checking URL status")
> + print("Checking URL status ...")
> check_package_urls(packages)
> print("Getting latest versions ...")
> check_package_latest_version(packages)
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
2019-07-11 9:54 [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Victor Huesca
2019-07-11 9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
@ 2019-07-11 13:08 ` Thomas Petazzoni
2019-07-11 13:17 ` Victor Huesca
2019-07-11 13:18 ` Arnout Vandecappelle
1 sibling, 2 replies; 6+ messages in thread
From: Thomas Petazzoni @ 2019-07-11 13:08 UTC (permalink / raw)
To: buildroot
Hello Victor,
On Thu, 11 Jul 2019 11:54:47 +0200
Victor Huesca <victor.huesca@bootlin.com> wrote:
> +def check_multiple_output(cmds, *popenargs, **kwargs):
> + '''
> + Run each `cmd` asynchronousely and return their respective output.
> + The execution time is the max of all subprocess execution time.
> + '''
> + processes = [subprocess.Popen(cmd, *popenargs,
> + stdout=subprocess.PIPE,
> + stderr=subprocess.PIPE,
> + **kwargs) for cmd in cmds]
> + for proc in processes:
> + out, err = proc.communicate()
> + yield out
> +
> +
> def package_init_make_info():
> + # Fetch all variables at once
> + licenses, files, versions = check_multiple_output([
> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],
Instead of doing this, what about running:
make printvars VARS="%_LICENSE %_LICENSE_FILES %_VERSION"
and then parse out the result, which will contain all the _LICENSE,
_LICENSE_FILES and _VERSION variables ?
Best regards,
Thomas
--
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
@ 2019-07-11 13:17 ` Victor Huesca
2019-07-11 13:18 ` Arnout Vandecappelle
1 sibling, 0 replies; 6+ messages in thread
From: Victor Huesca @ 2019-07-11 13:17 UTC (permalink / raw)
To: buildroot
Hi Thomas,
On 11/07/2019 15:08, Thomas Petazzoni wrote:
> Hello Victor,
>
> On Thu, 11 Jul 2019 11:54:47 +0200
> Victor Huesca <victor.huesca@bootlin.com> wrote:
>
>> +def check_multiple_output(cmds, *popenargs, **kwargs):
>> + '''
>> + Run each `cmd` asynchronousely and return their respective output.
>> + The execution time is the max of all subprocess execution time.
>> + '''
>> + processes = [subprocess.Popen(cmd, *popenargs,
>> + stdout=subprocess.PIPE,
>> + stderr=subprocess.PIPE,
>> + **kwargs) for cmd in cmds]
>> + for proc in processes:
>> + out, err = proc.communicate()
>> + yield out
>> +
>> +
>> def package_init_make_info():
>> + # Fetch all variables at once
>> + licenses, files, versions = check_multiple_output([
>> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
>> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
>> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],
>
> Instead of doing this, what about running:
>
> make printvars VARS="%_LICENSE %_LICENSE_FILES %_VERSION"
>
> and then parse out the result, which will contain all the _LICENSE,
> _LICENSE_FILES and _VERSION variables ?
Because I did not thought it was possible to do this >_<'
I remember that I tried several syntaxes to retrieve all variables with
a single `make` but did not find a working one and this solution was not
hard to implement...
Thanks for the hint, I'll send a v2 with this -- way better -- approach.
Thanks,
Victor
> Best regards,
>
> Thomas
>
--
Victor Huesca, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
2019-07-11 13:17 ` Victor Huesca
@ 2019-07-11 13:18 ` Arnout Vandecappelle
1 sibling, 0 replies; 6+ messages in thread
From: Arnout Vandecappelle @ 2019-07-11 13:18 UTC (permalink / raw)
To: buildroot
On 11/07/2019 15:08, Thomas Petazzoni wrote:
> Hello Victor,
>
> On Thu, 11 Jul 2019 11:54:47 +0200
> Victor Huesca <victor.huesca@bootlin.com> wrote:
>
>> +def check_multiple_output(cmds, *popenargs, **kwargs):
>> + '''
>> + Run each `cmd` asynchronousely and return their respective output.
>> + The execution time is the max of all subprocess execution time.
>> + '''
>> + processes = [subprocess.Popen(cmd, *popenargs,
>> + stdout=subprocess.PIPE,
>> + stderr=subprocess.PIPE,
>> + **kwargs) for cmd in cmds]
>> + for proc in processes:
>> + out, err = proc.communicate()
>> + yield out
>> +
>> +
>> def package_init_make_info():
>> + # Fetch all variables at once
>> + licenses, files, versions = check_multiple_output([
>> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
>> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
>> + ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],
>
> Instead of doing this, what about running:
>
> make printvars VARS="%_LICENSE %_LICENSE_FILES %_VERSION"
>
> and then parse out the result, which will contain all the _LICENSE,
> _LICENSE_FILES and _VERSION variables ?
Or better yet, run `make show-info` and parse JSON.
I'll admit, that requires some changes to show-info:
- include license_files;
- have an option to run it on *all* packages, not just the selected ones.
The second one might be tricky...
Regards,
Arnout
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2019-07-11 13:37 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-07-11 9:54 [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Victor Huesca
2019-07-11 9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
2019-07-11 13:37 ` Arnout Vandecappelle
2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
2019-07-11 13:17 ` Victor Huesca
2019-07-11 13:18 ` Arnout Vandecappelle
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox