Re: [Buildroot] [PATCH 4/4] support/scripts/pkg-stats: reimplement CPE parsing in pkg-stats

From: "Yann E. MORIN" <yann.morin.1998@free.fr>
To: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: buildroot@buildroot.org
Subject: Re: [Buildroot] [PATCH 4/4] support/scripts/pkg-stats: reimplement CPE parsing in pkg-stats
Date: Sat, 2 Apr 2022 19:20:12 +0200
Message-ID: <20220402172012.GB1811301@scaer>
In-Reply-To: <20220402141531.1320584-4-thomas.petazzoni@bootlin.com>

Thomas, All,

On 2022-04-02 16:15 +0200, Thomas Petazzoni via buildroot spake thusly:
> pkg-stats currently uses the services from support/scripts/cpedb.py to
> match the CPE identifiers of packages with the official CPE database.
> 
> Unfortunately, the cpedb.py code uses regular ElementTree parsing,
> which involves loading the full XML tree into memory. This causes the
> pkg-stats process to consume a huge amount of memory:
> 
> thomas   1310458 85.2 21.4 3708952 3450164 pts/5 R+   16:04   0:33  |   |   \_ python3 ./support/scripts/pkg-stats
> 
> So, 3.7 GB of VSZ and 3.4 GB of RSS are used by the pkg-stats
> process. This is causing the OOM killer to kick-in on machines with
> relatively low memory.
> 
> This commit reimplements the XML parsing needed to do the CPE matching
> directly in pkg-stats, using the XmlParser functionality of
> ElementTree, also called "streaming parsing". Thanks to this, we never
> load the entire XML tree in RAM, but only stream it through the
> parser, and construct a very simple list of all CPE identifiers. The
> max memory consumption of pkg-stats is now:
> 
> thomas   1317511 74.2  0.9 381104 152224 pts/5   R+   16:08   0:17  |   |   \_ python3 ./support/scripts/pkg-stats
> 
> So, 381 MB of VSZ and 152 MB of RSS, which is obviously much better.
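
For readers following along, here is a minimal, self-contained sketch of
the streaming approach described above. It is not the code from the
patch: the names CpeCollector, collect_cpes and chunk_size are made up
for illustration, and the list is kept per instance rather than on the
class. Only the cpe23-item tag and the chunked feed() loop are taken
from the diff further down.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespaced tag of the element that carries the CPE 2.3 name (from the patch).
CPE23_ITEM = "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item"


class CpeCollector:
    """Parser target (hypothetical name): keeps only the CPE identifiers."""

    def __init__(self):
        # Per-instance list, so two parsers never share state.
        self.cpes = []

    def start(self, tag, attrib):
        # Called for every opening tag; no element tree is ever built.
        if tag == CPE23_ITEM:
            self.cpes.append(attrib["name"])

    def close(self):
        # The value returned here is what XMLParser.close() returns.
        return self.cpes


def collect_cpes(fileobj, chunk_size=1024 * 1024):
    """Stream a gzip-compressed CPE dictionary and return its CPE names."""
    parser = ET.XMLParser(target=CpeCollector())
    with gzip.GzipFile(fileobj=fileobj) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
    return parser.close()
```

With this shape, memory use is bounded by the chunk size plus the list
of identifiers, rather than by the size of the whole XML tree.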
> 
> Now, one will probably wonder why this isn't directly changed in
> cpedb.py. The reason is simple: cpedb.py is also used by
> support/scripts/missing-cpe, which (for now) heavily relies on having
> in memory the ElementTree objects, to re-generate a snippet of XML
> that allows us to submit to NIST new CPE entries.
> 
> So, future work could include one of those two options:
> 
>  (1) Re-integrate cpedb.py into missing-cpe directly, and live with
>      two different ways of processing the CPE database.
> 
>  (2) Rewrite the missing-cpe logic to also be compatible with a
>      streaming parsing, which would allow this logic to be again
>      shared between pkg-stats and missing-cpe.
> 
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
> ---
>  support/scripts/pkg-stats | 39 +++++++++++++++++++++++++++++++++++----
>  1 file changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
> index ae1a9aa5e4..cc163ebb1a 100755
> --- a/support/scripts/pkg-stats
> +++ b/support/scripts/pkg-stats
> @@ -27,12 +27,14 @@ import re
>  import subprocess
>  import json
>  import sys
> +import time
> +import gzip
> +import xml.etree.ElementTree

You forgot to import requests, which is used later on.

I also fixed a bunch of flake8 issues:

    support/scripts/pkg-stats:49:1: E302 expected 2 blank lines, found 1
    support/scripts/pkg-stats:632:9: E306 expected 1 blank line before a nested definition, found 0
    support/scripts/pkg-stats:635:9: E306 expected 1 blank line before a nested definition, found 0
    support/scripts/pkg-stats:639:5: E303 too many blank lines (2)
    1     E302 expected 2 blank lines, found 1
    1     E303 too many blank lines (2)
    2     E306 expected 1 blank line before a nested definition, found 0

>  brpath = os.path.normpath(os.path.join(os.path.dirname(__file__), "..", ".."))
>  
>  sys.path.append(os.path.join(brpath, "utils"))
>  from getdeveloperlib import parse_developers  # noqa: E402
> -from cpedb import CPEDB  # noqa: E402
>  
>  INFRA_RE = re.compile(r"\$\(eval \$\(([a-z-]*)-package\)\)")
>  URL_RE = re.compile(r"\s*https?://\S*\s*$")
> @@ -42,6 +44,7 @@ RM_API_STATUS_FOUND_BY_DISTRO = 2
>  RM_API_STATUS_FOUND_BY_PATTERN = 3
>  RM_API_STATUS_NOT_FOUND = 4
>  
> +CPEDB_URL = "https://static.nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.gz"

Instead of duplicating it here, I changed that to import it from cpedb.

Applied to master with all of the above fixed, thanks.

Regards,
Yann E. MORIN.

>  class Defconfig:
>      def __init__(self, name, path):
> @@ -624,12 +627,40 @@ def check_package_cves(nvd_path, packages):
>  
>  
>  def check_package_cpes(nvd_path, packages):
> -    cpedb = CPEDB(nvd_path)
> -    cpedb.get_xml_dict()
> +    class CpeXmlParser:
> +        cpes = []
> +        def start(self, tag, attrib):
> +            if tag == "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item":
> +                self.cpes.append(attrib['name'])
> +        def close(self):
> +            return self.cpes
> +
> +
> +    print("CPE: Setting up NIST dictionary")
> +    if not os.path.exists(os.path.join(nvd_path, "cpe")):
> +        os.makedirs(os.path.join(nvd_path, "cpe"))
> +
> +    cpe_dict_local = os.path.join(nvd_path, "cpe", os.path.basename(CPEDB_URL))
> +    if not os.path.exists(cpe_dict_local) or os.stat(cpe_dict_local).st_mtime < time.time() - 86400:
> +        print("CPE: Fetching xml manifest from [" + CPEDB_URL + "]")
> +        cpe_dict = requests.get(CPEDB_URL)
> +        open(cpe_dict_local, "wb").write(cpe_dict.content)
> +
> +    print("CPE: Unzipping xml manifest...")
> +    nist_cpe_file = gzip.GzipFile(fileobj=open(cpe_dict_local, 'rb'))
> +
> +    parser = xml.etree.ElementTree.XMLParser(target=CpeXmlParser())
> +    while True:
> +        c = nist_cpe_file.read(1024*1024)
> +        if not c:
> +            break
> +        parser.feed(c)
> +    cpes = parser.close()
> +
>      for p in packages:
>          if not p.cpeid:
>              continue
> -        if cpedb.find(p.cpeid):
> +        if p.cpeid in cpes:
>              p.status['cpe'] = ("ok", "verified CPE identifier")
>          else:
>              p.status['cpe'] = ("error", "CPE version unknown in CPE database")
> -- 
> 2.35.1
> 
> _______________________________________________
> buildroot mailing list
> buildroot@buildroot.org
> https://lists.buildroot.org/mailman/listinfo/buildroot

-- 
.-----------------.--------------------.------------------.--------------------.
|  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
| +33 561 099 427 `------------.-------:  X  AGAINST      |  \e/  There is no  |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
'------------------------------^-------^------------------^--------------------'