All of lore.kernel.org
 help / color / mirror / Atom feed
* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
@ 2019-07-11  9:54 Victor Huesca
  2019-07-11  9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
  2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
  0 siblings, 2 replies; 6+ messages in thread
From: Victor Huesca @ 2019-07-11  9:54 UTC (permalink / raw)
  To: buildroot

The pkg-stats calls 3 times `make` to get a bunch of variables. These
calls all takes about the same time -- 95% of the time is spent parsing
the makefile -- so running these subprocesses in parallel can save this
shared time.

The gain scale quite well since with only 3 subprocesses I achieve a
x2.5 ~ x3 speedup.

Signed-off-by: Victor Huesca <victor.huesca@bootlin.com>
---
 support/scripts/pkg-stats | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
index 1ff3fb2bd1..2ed41dffa9 100755
--- a/support/scripts/pkg-stats
+++ b/support/scripts/pkg-stats
@@ -220,11 +220,30 @@ def get_pkglist(npackages, package_list):
     return packages
 
 
+def check_multiple_output(cmds, *popenargs, **kwargs):
+    '''
+    Run each `cmd` asynchronousely and return their respective output.
+    The execution time is the max of all subprocess execution time.
+    '''
+    processes = [subprocess.Popen(cmd, *popenargs,
+                                  stdout=subprocess.PIPE,
+                                  stderr=subprocess.PIPE,
+                                  **kwargs) for cmd in cmds]
+    for proc in processes:
+        out, err = proc.communicate()
+        yield out
+
+
 def package_init_make_info():
+    # Fetch all variables at once
+    licenses, files, versions = check_multiple_output([
+        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
+        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
+        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],
+    ])
+
     # Licenses
-    o = subprocess.check_output(["make", "BR2_HAVE_DOT_CONFIG=y",
-                                 "-s", "printvars", "VARS=%_LICENSE"])
-    for l in o.splitlines():
+    for l in licenses.splitlines():
         # Get variable name and value
         pkgvar, value = l.split("=")
 
@@ -241,9 +260,7 @@ def package_init_make_info():
         Package.all_licenses.append(pkgvar)
 
     # License files
-    o = subprocess.check_output(["make", "BR2_HAVE_DOT_CONFIG=y",
-                                 "-s", "printvars", "VARS=%_LICENSE_FILES"])
-    for l in o.splitlines():
+    for l in files.splitlines():
         # Get variable name and value
         pkgvar, value = l.split("=")
 
@@ -260,14 +277,12 @@ def package_init_make_info():
         Package.all_license_files.append(pkgvar)
 
     # Version
-    o = subprocess.check_output(["make", "BR2_HAVE_DOT_CONFIG=y",
-                                 "-s", "printvars", "VARS=%_VERSION"])
 
     # We process first the host package VERSION, and then the target
     # package VERSION. This means that if a package exists in both
     # target and host variants, with different version numbers
     # (unlikely), we'll report the target version number.
-    version_list = o.splitlines()
+    version_list = versions.splitlines()
     version_list = [x for x in version_list if x.startswith("HOST_")] + \
                    [x for x in version_list if not x.startswith("HOST_")]
     for l in version_list:
-- 
2.21.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool
  2019-07-11  9:54 [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Victor Huesca
@ 2019-07-11  9:54 ` Victor Huesca
  2019-07-11 13:37   ` Arnout Vandecappelle
  2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
  1 sibling, 1 reply; 6+ messages in thread
From: Victor Huesca @ 2019-07-11  9:54 UTC (permalink / raw)
  To: buildroot

The major bottleneck in pkg-stats is the time spent waiting for
responses from distant servers. There is tow parts that imply
such communication with remote servers in pkg-stats:
- the function checking that package URLs are responding
- the function that fetch the latest package version from
release-monitoring.
The 1st one already make use of process-pools for speed-up but the 2nd
does not.

This patch simply add support of process-pools to this
`check_package_latest_version` function in addition to http-pool.

There have already been work trying to parallelize this function
using threads but there were a failure on some configurations [1].
This implementation rely on a dedicated module already in use on this
script, so it's unlikely to see failure on this one.

In terms of performances I achieve to run this function in ~3m vs ~25m
for the linear version.

[1] http://lists.busybox.net/pipermail/buildroot/2018-March/215368.html

Signed-off-by: Victor Huesca <victor.huesca@bootlin.com>
---
 support/scripts/pkg-stats | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
index 2ed41dffa9..14033941ba 100755
--- a/support/scripts/pkg-stats
+++ b/support/scripts/pkg-stats
@@ -322,9 +322,9 @@ def check_package_urls(packages):
         pkg.url_status = pkg.url_worker.get(timeout=3600)
 
 
-def release_monitoring_get_latest_version_by_distro(pool, name):
+def release_monitoring_get_latest_version_by_distro(name):
     try:
-        req = pool.request('GET', "/api/project/Buildroot/%s" % name)
+        req = Package.pool.request('GET', "/api/project/Buildroot/%s" % name)
     except HTTPError:
         return (RM_API_STATUS_ERROR, None, None)
 
@@ -339,9 +339,9 @@ def release_monitoring_get_latest_version_by_distro(pool, name):
         return (RM_API_STATUS_FOUND_BY_DISTRO, None, data['id'])
 
 
-def release_monitoring_get_latest_version_by_guess(pool, name):
+def release_monitoring_get_latest_version_by_guess(name):
     try:
-        req = pool.request('GET', "/api/projects/?pattern=%s" % name)
+        req = Package.pool.request('GET', "/api/projects/?pattern=%s" % name)
     except HTTPError:
         return (RM_API_STATUS_ERROR, None, None)
 
@@ -375,18 +375,20 @@ def check_package_latest_version(packages):
     - id: string containing the id of the project corresponding to this
       package, as known by release-monitoring.org
     """
-    pool = HTTPSConnectionPool('release-monitoring.org', port=443,
-                               cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
-                               timeout=30)
-    count = 0
+    Package.pool = HTTPSConnectionPool('release-monitoring.org', port=443,
+                                       cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
+                                       timeout=30)
+    worker_pool = Pool(processes=64)
+    for pkg in packages:
+        pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_distro, (pkg.name, ))
+
     for pkg in packages:
-        v = release_monitoring_get_latest_version_by_distro(pool, pkg.name)
-        if v[0] == RM_API_STATUS_NOT_FOUND:
-            v = release_monitoring_get_latest_version_by_guess(pool, pkg.name)
+        status, _, _ = pkg.url_worker.get()
+        if status == RM_API_STATUS_NOT_FOUND:
+            pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_guess, (pkg.name, ))
 
-        pkg.latest_version = v
-        print("[%d/%d] Package %s" % (count, len(packages), pkg.name))
-        count += 1
+    for count, pkg in enumerate(packages):
+        pkg.latest_version = pkg.url_worker.get()
 
 
 def calculate_stats(packages):
@@ -773,7 +775,7 @@ def __main__():
         pkg.set_check_package_warnings()
         pkg.set_current_version()
         pkg.set_url()
-    print("Checking URL status")
+    print("Checking URL status ...")
     check_package_urls(packages)
     print("Getting latest versions ...")
     check_package_latest_version(packages)
-- 
2.21.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
  2019-07-11  9:54 [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Victor Huesca
  2019-07-11  9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
@ 2019-07-11 13:08 ` Thomas Petazzoni
  2019-07-11 13:17   ` Victor Huesca
  2019-07-11 13:18   ` Arnout Vandecappelle
  1 sibling, 2 replies; 6+ messages in thread
From: Thomas Petazzoni @ 2019-07-11 13:08 UTC (permalink / raw)
  To: buildroot

Hello Victor,

On Thu, 11 Jul 2019 11:54:47 +0200
Victor Huesca <victor.huesca@bootlin.com> wrote:

> +def check_multiple_output(cmds, *popenargs, **kwargs):
> +    '''
> +    Run each `cmd` asynchronousely and return their respective output.
> +    The execution time is the max of all subprocess execution time.
> +    '''
> +    processes = [subprocess.Popen(cmd, *popenargs,
> +                                  stdout=subprocess.PIPE,
> +                                  stderr=subprocess.PIPE,
> +                                  **kwargs) for cmd in cmds]
> +    for proc in processes:
> +        out, err = proc.communicate()
> +        yield out
> +
> +
>  def package_init_make_info():
> +    # Fetch all variables at once
> +    licenses, files, versions = check_multiple_output([
> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],

Instead of doing this, what about running:

	make printvars VARS="%_LICENSE %_LICENSE_FILES %_VERSION"

and then parse out the result, which will contain all the _LICENSE,
_LICENSE_FILES and _VERSION variables ?

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
  2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
@ 2019-07-11 13:17   ` Victor Huesca
  2019-07-11 13:18   ` Arnout Vandecappelle
  1 sibling, 0 replies; 6+ messages in thread
From: Victor Huesca @ 2019-07-11 13:17 UTC (permalink / raw)
  To: buildroot

Hi Thomas,



On 11/07/2019 15:08, Thomas Petazzoni wrote:
> Hello Victor,
> 
> On Thu, 11 Jul 2019 11:54:47 +0200
> Victor Huesca <victor.huesca@bootlin.com> wrote:
> 
>> +def check_multiple_output(cmds, *popenargs, **kwargs):
>> +    '''
>> +    Run each `cmd` asynchronousely and return their respective output.
>> +    The execution time is the max of all subprocess execution time.
>> +    '''
>> +    processes = [subprocess.Popen(cmd, *popenargs,
>> +                                  stdout=subprocess.PIPE,
>> +                                  stderr=subprocess.PIPE,
>> +                                  **kwargs) for cmd in cmds]
>> +    for proc in processes:
>> +        out, err = proc.communicate()
>> +        yield out
>> +
>> +
>>  def package_init_make_info():
>> +    # Fetch all variables at once
>> +    licenses, files, versions = check_multiple_output([
>> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
>> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
>> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],
> 
> Instead of doing this, what about running:
> 
> 	make printvars VARS="%_LICENSE %_LICENSE_FILES %_VERSION"
> 
> and then parse out the result, which will contain all the _LICENSE,
> _LICENSE_FILES and _VERSION variables ?

Because I did not thought it was possible to do this >_<'
I remember that I tried several syntaxes to retrieve all variables with
a single `make` but did not find a working one and this solution was not
hard to implement...

Thanks for the hint, I'll send a v2 with this -- way better -- approach.

Thanks,
Victor

> Best regards,
> 
> Thomas
> 

-- 
Victor Huesca, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel
  2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
  2019-07-11 13:17   ` Victor Huesca
@ 2019-07-11 13:18   ` Arnout Vandecappelle
  1 sibling, 0 replies; 6+ messages in thread
From: Arnout Vandecappelle @ 2019-07-11 13:18 UTC (permalink / raw)
  To: buildroot



On 11/07/2019 15:08, Thomas Petazzoni wrote:
> Hello Victor,
> 
> On Thu, 11 Jul 2019 11:54:47 +0200
> Victor Huesca <victor.huesca@bootlin.com> wrote:
> 
>> +def check_multiple_output(cmds, *popenargs, **kwargs):
>> +    '''
>> +    Run each `cmd` asynchronousely and return their respective output.
>> +    The execution time is the max of all subprocess execution time.
>> +    '''
>> +    processes = [subprocess.Popen(cmd, *popenargs,
>> +                                  stdout=subprocess.PIPE,
>> +                                  stderr=subprocess.PIPE,
>> +                                  **kwargs) for cmd in cmds]
>> +    for proc in processes:
>> +        out, err = proc.communicate()
>> +        yield out
>> +
>> +
>>  def package_init_make_info():
>> +    # Fetch all variables at once
>> +    licenses, files, versions = check_multiple_output([
>> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE"],
>> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_LICENSE_FILES"],
>> +        ["make", "BR2_HAVE_DOT_CONFIG=y", "-s", "printvars", "VARS=%_VERSION"],
> 
> Instead of doing this, what about running:
> 
> 	make printvars VARS="%_LICENSE %_LICENSE_FILES %_VERSION"
> 
> and then parse out the result, which will contain all the _LICENSE,
> _LICENSE_FILES and _VERSION variables ?

 Or better yet, run `make show-info` and parse JSON.

 I'll admit, that requires some changes to show-info:

- include license_files;
- have an option to run it on *all* packages, not just the selected ones.

 The second one might be tricky...

 Regards,
 Arnout

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool
  2019-07-11  9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
@ 2019-07-11 13:37   ` Arnout Vandecappelle
  0 siblings, 0 replies; 6+ messages in thread
From: Arnout Vandecappelle @ 2019-07-11 13:37 UTC (permalink / raw)
  To: buildroot



On 11/07/2019 11:54, Victor Huesca wrote:
> The major bottleneck in pkg-stats is the time spent waiting for
> responses from distant servers. There is tow parts that imply
> such communication with remote servers in pkg-stats:
> - the function checking that package URLs are responding
> - the function that fetch the latest package version from
> release-monitoring.
> The 1st one already make use of process-pools for speed-up but the 2nd
> does not.
> 
> This patch simply add support of process-pools to this
> `check_package_latest_version` function in addition to http-pool.
> 
> There have already been work trying to parallelize this function
> using threads but there were a failure on some configurations [1].
> This implementation rely on a dedicated module already in use on this
> script, so it's unlikely to see failure on this one.
> 
> In terms of performances I achieve to run this function in ~3m vs ~25m
> for the linear version.
> 
> [1] http://lists.busybox.net/pipermail/buildroot/2018-March/215368.html
> 
> Signed-off-by: Victor Huesca <victor.huesca@bootlin.com>
> ---
>  support/scripts/pkg-stats | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
> index 2ed41dffa9..14033941ba 100755
> --- a/support/scripts/pkg-stats
> +++ b/support/scripts/pkg-stats
> @@ -322,9 +322,9 @@ def check_package_urls(packages):
>          pkg.url_status = pkg.url_worker.get(timeout=3600)
>  
>  
> -def release_monitoring_get_latest_version_by_distro(pool, name):
> +def release_monitoring_get_latest_version_by_distro(name):

 Is there a good reason not to pass pool as an argument here? See below.

>      try:
> -        req = pool.request('GET', "/api/project/Buildroot/%s" % name)
> +        req = Package.pool.request('GET', "/api/project/Buildroot/%s" % name)
>      except HTTPError:
>          return (RM_API_STATUS_ERROR, None, None)
>  
> @@ -339,9 +339,9 @@ def release_monitoring_get_latest_version_by_distro(pool, name):
>          return (RM_API_STATUS_FOUND_BY_DISTRO, None, data['id'])
>  
>  
> -def release_monitoring_get_latest_version_by_guess(pool, name):
> +def release_monitoring_get_latest_version_by_guess(name):
>      try:
> -        req = pool.request('GET', "/api/projects/?pattern=%s" % name)
> +        req = Package.pool.request('GET', "/api/projects/?pattern=%s" % name)
>      except HTTPError:
>          return (RM_API_STATUS_ERROR, None, None)
>  
> @@ -375,18 +375,20 @@ def check_package_latest_version(packages):
>      - id: string containing the id of the project corresponding to this
>        package, as known by release-monitoring.org
>      """
> -    pool = HTTPSConnectionPool('release-monitoring.org', port=443,
> -                               cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
> -                               timeout=30)
> -    count = 0
> +    Package.pool = HTTPSConnectionPool('release-monitoring.org', port=443,
> +                                       cert_reqs='CERT_REQUIRED', ca_certs=certifi.where(),
> +                                       timeout=30)
> +    worker_pool = Pool(processes=64)
> +    for pkg in packages:
> +        pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_distro, (pkg.name, ))

 Here, you can just use (Package.pool, pkg.name) as args, no?

 With that, you don't need to add pool as a class attribute of Package anymore.
That was anyway a bit ugly.

 Also, adding url_worker as an attribute to pkg here is a bit ugly. It would be
less so if you initialize it to None in Package.__init__.

> +
>      for pkg in packages:
> -        v = release_monitoring_get_latest_version_by_distro(pool, pkg.name)
> -        if v[0] == RM_API_STATUS_NOT_FOUND:
> -            v = release_monitoring_get_latest_version_by_guess(pool, pkg.name)
> +        status, _, _ = pkg.url_worker.get()

 Does this work? I mean, can you call get() multiple times?

> +        if status == RM_API_STATUS_NOT_FOUND:
> +            pkg.url_worker = worker_pool.apply_async(release_monitoring_get_latest_version_by_guess, (pkg.name, ))

 I believe this approach has the disadvantage that it still kind of serializes
the retries (not exactly serializes, but they're not parallelized as much as
they could be). I think a better approach would be to include the retry in a
single callback function that tries both.

>  
> -        pkg.latest_version = v
> -        print("[%d/%d] Package %s" % (count, len(packages), pkg.name))

 It's a pity that we loose this progress indication, but I don't see a
convenient way to keep it.

> -        count += 1
> +    for count, pkg in enumerate(packages):
> +        pkg.latest_version = pkg.url_worker.get()

 You're not actually using count, so no need to enumerate. I guess that's a
leftover from attempting to keep the progress indication.

 Regards,
 Arnout

>  
>  
>  def calculate_stats(packages):
> @@ -773,7 +775,7 @@ def __main__():
>          pkg.set_check_package_warnings()
>          pkg.set_current_version()
>          pkg.set_url()
> -    print("Checking URL status")
> +    print("Checking URL status ...")
>      check_package_urls(packages)
>      print("Getting latest versions ...")
>      check_package_latest_version(packages)
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-07-11 13:37 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-07-11  9:54 [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Victor Huesca
2019-07-11  9:54 ` [Buildroot] [PATCH 2/2] support/script/pkg-stats: run 'get latest version' in a process-pool Victor Huesca
2019-07-11 13:37   ` Arnout Vandecappelle
2019-07-11 13:08 ` [Buildroot] [PATCH 1/2] support/script/pkg-stats: run the 3 `make` calls in parallel Thomas Petazzoni
2019-07-11 13:17   ` Victor Huesca
2019-07-11 13:18   ` Arnout Vandecappelle

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.