From: "Yann E. MORIN" <yann.morin.1998@free.fr>
To: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: buildroot@buildroot.org
Subject: Re: [Buildroot] [PATCH 4/4] support/scripts/pkg-stats: reimplement CPE parsing in pkg-stats
Date: Sat, 2 Apr 2022 19:20:12 +0200
Message-ID: <20220402172012.GB1811301@scaer>
In-Reply-To: <20220402141531.1320584-4-thomas.petazzoni@bootlin.com>
Thomas, All,
On 2022-04-02 16:15 +0200, Thomas Petazzoni via buildroot spake thusly:
> pkg-stats currently uses the services from support/scripts/cpedb.py to
> match the CPE identifiers of packages with the official CPE database.
>
> Unfortunately, the cpedb.py code uses regular ElementTree parsing,
> which involves loading the full XML tree into memory. This causes the
> pkg-stats process to consume a huge amount of memory:
>
> thomas 1310458 85.2 21.4 3708952 3450164 pts/5 R+ 16:04 0:33 | | \_ python3 ./support/scripts/pkg-stats
>
> So, 3.7 GB of VSZ and 3.4 GB of RSS are used by the pkg-stats
> process. This is causing the OOM killer to kick-in on machines with
> relatively low memory.
>
> This commit reimplements the XML parsing needed to do the CPE matching
> directly in pkg-stats, using the XmlParser functionality of
> ElementTree, also called "streaming parsing". Thanks to this, we never
> load the entire XML tree in RAM, but only stream it through the
> parser, and construct a very simple list of all CPE identifiers. The
> max memory consumption of pkg-stats is now:
>
> thomas 1317511 74.2 0.9 381104 152224 pts/5 R+ 16:08 0:17 | | \_ python3 ./support/scripts/pkg-stats
>
> So, 381 MB of VSZ and 152 MB of RSS, which is obviously much better.
>
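For readers who have not used a parser target before, here is a minimal standalone sketch of the streaming approach described above. It is not the code from the patch: the names CpeCollector, stream_cpe_names and SAMPLE are hypothetical, the sample dictionary entry is made up for illustration, and the names are kept in a per-instance set (rather than a class-level list) purely as an assumption about what is convenient for lookups.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Tag name as it appears in the quoted patch.
CPE23_ITEM = "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item"

# Made-up, tiny stand-in for the NIST dictionary (illustration only).
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<cpe-list xmlns="http://cpe.mitre.org/dictionary/2.0"
          xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3">
  <cpe-item name="cpe:/a:busybox:busybox:1.35.0">
    <cpe-23:cpe23-item name="cpe:2.3:a:busybox:busybox:1.35.0:*:*:*:*:*:*:*"/>
  </cpe-item>
</cpe-list>
"""


class CpeCollector:
    """Parser target: keeps only the 'name' attribute of cpe23-item tags."""

    def __init__(self):
        # Per-instance container, so two parsers never share state.
        self.cpes = set()

    def start(self, tag, attrib):
        if tag == CPE23_ITEM:
            self.cpes.add(attrib["name"])

    def close(self):
        # XMLParser.close() returns whatever the target's close() returns.
        return self.cpes


def stream_cpe_names(fileobj, chunk_size=1024 * 1024):
    """Feed a gzip-compressed XML stream to the parser chunk by chunk."""
    parser = ET.XMLParser(target=CpeCollector())
    with gzip.GzipFile(fileobj=fileobj) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)
    return parser.close()
```

Something like stream_cpe_names(io.BytesIO(gzip.compress(SAMPLE))) should then return the set of CPE 2.3 names without any element tree ever being built.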
> Now, one will probably wonder why this isn't directly changed in
> cpedb.py. The reason is simple: cpedb.py is also used by
> support/scripts/missing-cpe, which (for now) heavily relies on having
> in memory the ElementTree objects, to re-generate a snippet of XML
> that allows us to submit to NIST new CPE entries.
>
> So, future work could include one of those two options:
>
> (1) Re-integrate cpedb.py into missing-cpe directly, and live with
> two different ways of processing the CPE database.
>
> (2) Rewrite the missing-cpe logic to also be compatible with a
> streaming parsing, which would allow this logic to be again
> shared between pkg-stats and missing-cpe.
>
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
> ---
> support/scripts/pkg-stats | 39 +++++++++++++++++++++++++++++++++++----
> 1 file changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
> index ae1a9aa5e4..cc163ebb1a 100755
> --- a/support/scripts/pkg-stats
> +++ b/support/scripts/pkg-stats
> @@ -27,12 +27,14 @@ import re
> import subprocess
> import json
> import sys
> +import time
> +import gzip
> +import xml.etree.ElementTree
You forgot to import requests, which is used later on.

I also fixed a bunch of flake8 issues:

    support/scripts/pkg-stats:49:1: E302 expected 2 blank lines, found 1
    support/scripts/pkg-stats:632:9: E306 expected 1 blank line before a nested definition, found 0
    support/scripts/pkg-stats:635:9: E306 expected 1 blank line before a nested definition, found 0
    support/scripts/pkg-stats:639:5: E303 too many blank lines (2)
    1     E302 expected 2 blank lines, found 1
    1     E303 too many blank lines (2)
    2     E306 expected 1 blank line before a nested definition, found 0

> brpath = os.path.normpath(os.path.join(os.path.dirname(__file__), "..", ".."))
>
> sys.path.append(os.path.join(brpath, "utils"))
> from getdeveloperlib import parse_developers # noqa: E402
> -from cpedb import CPEDB # noqa: E402
>
> INFRA_RE = re.compile(r"\$\(eval \$\(([a-z-]*)-package\)\)")
> URL_RE = re.compile(r"\s*https?://\S*\s*$")
> @@ -42,6 +44,7 @@ RM_API_STATUS_FOUND_BY_DISTRO = 2
> RM_API_STATUS_FOUND_BY_PATTERN = 3
> RM_API_STATUS_NOT_FOUND = 4
>
> +CPEDB_URL = "https://static.nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.gz"
Instead of duplicating it here, I changed that to import it from cpedb.
Applied to master with all the above fixed, thanks.
Regards,
Yann E. MORIN.
>  class Defconfig:
>      def __init__(self, name, path):
> @@ -624,12 +627,40 @@ def check_package_cves(nvd_path, packages):
>
>
>  def check_package_cpes(nvd_path, packages):
> -    cpedb = CPEDB(nvd_path)
> -    cpedb.get_xml_dict()
> +    class CpeXmlParser:
> +        cpes = []
> +        def start(self, tag, attrib):
> +            if tag == "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item":
> +                self.cpes.append(attrib['name'])
> +        def close(self):
> +            return self.cpes
> +
> +
> +    print("CPE: Setting up NIST dictionary")
> +    if not os.path.exists(os.path.join(nvd_path, "cpe")):
> +        os.makedirs(os.path.join(nvd_path, "cpe"))
> +
> +    cpe_dict_local = os.path.join(nvd_path, "cpe", os.path.basename(CPEDB_URL))
> +    if not os.path.exists(cpe_dict_local) or os.stat(cpe_dict_local).st_mtime < time.time() - 86400:
> +        print("CPE: Fetching xml manifest from [" + CPEDB_URL + "]")
> +        cpe_dict = requests.get(CPEDB_URL)
> +        open(cpe_dict_local, "wb").write(cpe_dict.content)
> +
> +    print("CPE: Unzipping xml manifest...")
> +    nist_cpe_file = gzip.GzipFile(fileobj=open(cpe_dict_local, 'rb'))
> +
> +    parser = xml.etree.ElementTree.XMLParser(target=CpeXmlParser())
> +    while True:
> +        c = nist_cpe_file.read(1024*1024)
> +        if not c:
> +            break
> +        parser.feed(c)
> +    cpes = parser.close()
> +
>      for p in packages:
>          if not p.cpeid:
>              continue
> -        if cpedb.find(p.cpeid):
> +        if p.cpeid in cpes:
>              p.status['cpe'] = ("ok", "verified CPE identifier")
>          else:
>              p.status['cpe'] = ("error", "CPE version unknown in CPE database")
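As a side note, the "re-download when the local copy is missing or older than a day" test in the hunk above can be read as the small helper below. This is only a sketch to make the condition explicit: the names needs_refresh and MAX_AGE_SECONDS are hypothetical and do not exist in pkg-stats or cpedb.py.

```python
import os
import time

# One day, i.e. the 86400 seconds used in the quoted patch.
MAX_AGE_SECONDS = 24 * 60 * 60


def needs_refresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if 'path' is missing or was last modified more than
    'max_age' seconds before 'now' (defaults to the current time)."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No local copy yet: it has to be fetched.
        return True
    return mtime < now - max_age
```

Catching FileNotFoundError from a single os.stat() call avoids the separate os.path.exists() check, but the outcome is meant to be the same as the condition in the patch.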
> --
> 2.35.1
>
> _______________________________________________
> buildroot mailing list
> buildroot@buildroot.org
> https://lists.buildroot.org/mailman/listinfo/buildroot
--
.-----------------.--------------------.------------------.--------------------.
| Yann E. MORIN | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software Designer | \ / CAMPAIGN | ___ |
| +33 561 099 427 `------------.-------: X AGAINST | \e/ There is no |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL | v conspiracy. |
'------------------------------^-------^------------------^--------------------'