From: Arnout Vandecappelle via buildroot <buildroot@buildroot.org>
To: Daniel Lang <dalang@gmx.at>, buildroot@buildroot.org
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Subject: Re: [Buildroot] [PATCH v3 5/8] support/scripts/nvd_api_v2.py: new helper class
Date: Wed, 30 Aug 2023 23:37:51 +0200 [thread overview]
Message-ID: <da055153-9ec6-ade2-46ee-49c2d33ebf36@mind.be> (raw)
In-Reply-To: <20230812192842.135682-5-dalang@gmx.at>
Hi Daniel,
Since this is new code, I'm going to give some coding style comments that
differ a bit from existing code you see in pkg-stats.
I may actually end up applying the patch without those coding style changes,
because the new NVD API is more important than any coding style concerns...
We'll see how far I get.
This is also not a complete review of this patch because it's almost time for
me to go to bed.
On 12/08/2023 21:28, Daniel Lang wrote:
> The current NVD data feeds used for CVE and CPE checking will be retired
> by 2023-12-05 [0]. Both have to be switched to the new v2 API. Since
> fetching data from both sources workes the same, a common base class is
workes -> works
> used to handle the API interaction.
> To store the data locally a sqlite database is used.
Maybe explain a bit more than the new API doesn't allow to download files
anymore, so we need to invent our own file format for persisting it, and a
sqlite db is a convenient format that is always available in python.
>
> [0]: https://nvd.nist.gov/General/News/change-timeline
>
> Signed-off-by: Daniel Lang <dalang@gmx.at>
> ---
> DEVELOPERS | 1 +
> support/scripts/nvd_api_v2.py | 138 ++++++++++++++++++++++++++++++++++
> 2 files changed, 139 insertions(+)
> create mode 100755 support/scripts/nvd_api_v2.py
>
> diff --git a/DEVELOPERS b/DEVELOPERS
> index 6ffa3ee693..81f809a4c0 100644
> --- a/DEVELOPERS
> +++ b/DEVELOPERS
> @@ -668,6 +668,7 @@ F: package/paho-mqtt-cpp/
> F: package/pangomm/
> F: package/pangomm2_46/
> F: package/sam-ba/
> +F: support/scripts/nvd_api_v2.py
>
> N: Damien Lanson <damien@kal-host.com>
> F: package/libvdpau/
> diff --git a/support/scripts/nvd_api_v2.py b/support/scripts/nvd_api_v2.py
> new file mode 100755
> index 0000000000..3fdf32596f
> --- /dev/null
> +++ b/support/scripts/nvd_api_v2.py
> @@ -0,0 +1,138 @@
> +#!/usr/bin/env python3
> +
> +from datetime import datetime, timezone
> +import os
Personally, I think it's better to move away from the os module and use
pathlib.Path instead everywhere a path is used. We don't do that anywhere in
Buildroot at the moment, but I think we should.
> +import requests
> +import shutil
> +import sqlite3
> +import time
> +
> +NVD_API_VERSION = '2.0'
> +NVD_API_BASE_URL = 'https://services.nvd.nist.gov/rest/json'
> +
> +
> +class NVD_API:
> + """
> + A helper class that fetches data from a NVD API and
> + helps manage a sqlite database.
> + """
> + def __init__(self, nvd_path, service, database_file_prefix):
> + """
> + Initialize a new NVD API endpoint with service
> + as the URL postfix.
Could you add documentation for the arguments? I was initially confused by
nvd_path, it sounded like the path part of the URL...
> + """
> + self.nvd_path = nvd_path
> + self.service = service
> + self.url = '%s/%s/%s' % (NVD_API_BASE_URL, service.lower(), NVD_API_VERSION)
Prefer to use format strings:
self.url = f'{NVD_API_BASE_URL}/{service.lower()}/{NVD_API_VERSION}
> + self.db_file = os.path.join(nvd_path, '%s-%s.sqlite' % (database_file_prefix, NVD_API_VERSION))
> + self.db_file_tmp = '%s.tmp' % self.db_file
I haven't looked in too much detail at how the tmp file is used, but I wonder
if it is really needed. sqlite should be able to manage access to the database
so that it stays consistent. If we make sure to use transactions if more than
one row needs to be stored consistently (which I doubt is really needed anyway),
then we should always be able to read consistent data from the database. And if
we re-fetch information from the NVD database, then we should be able to update
the existing row if it actually was already downloaded before.
If I'm missing something here, please add a detailed explanation to the
beginning of this file or class to understand why the tmp file is needed.
> +
> + def init_db(self):
> + """
> + Needs to be implemented by derived classes.
> + Used to make sure that database tables exist.
> + """
> + pass
> +
> + def save_to_db(self, start_index, total_results, content):
> + """
> + Needs to be implemented by derived classes.
> + Used to save the data given by a single API request
> + to the database.
> + """
> + pass
> +
> + def cleanup_db(self):
The name of the function doesn't sound very good to me - it makes me think
you're deleting the db file itself, not the tmp file.
> + """
> + Clean up any files that where left by previously
> + failed runs.
> + """
> + if os.path.exists(self.db_file_tmp):
> + os.remove(self.db_file_tmp)
> +
> + def open_db(self, tmp=False):
> + """
> + Open and return a connection to the sqlite database.
> + """
> + if tmp:
> + return sqlite3.connect(self.db_file_tmp)
> + return sqlite3.connect(self.db_file)
I think putting this in a function is making things more complicated than
calling sqlite3.connect directly from the caller.
> +
> + def download(self, last_update):
> + """
> + Download all entries from NVD since last_update (if not None).
> + For each downloaded page save_to_db is called to together with
> + progress information.
> + NVD rate limiting allows 5 requests per 30 seconds or one every
> + 6 seconds.
Could we maybe define those numbers as class variables and calculate the sleep
etc. times that are used below based on those? Instead of having constant 5 and 6.
> + """
> + args = {}
> + start_index = 0
> + total_results = 0
> + results_per_page = 0
> +
> + print('Downloading new %s' % self.service)
> +
> + if (last_update is not None):
> + args['lastModStartDate'] = last_update.isoformat()
> + args['lastModEndDate'] = datetime.now(tz=timezone.utc).isoformat()
> +
> + while True:
> + args['startIndex'] = start_index
> +
> + for attempt in range(5):
> + try:
> + page = requests.get(self.url, params=args)
> + page.raise_for_status()
> + content = page.json()
> + except Exception:
This will also catch unrecoverable errors, e.g. DNS lookup failures, which
leads to an endless loop without any feedback. Would it be possible to instead
catch the exact exception that requests raises when we hit rate limiting?
> + time.sleep(6)
> + else:
> + break
> +
> + if content is None:
> + # Nothing was downloaded
> + return False
> +
> + results_per_page = content['resultsPerPage']
> + total_results = content['totalResults']
> + start_index = content['startIndex']
> +
> + start_index += results_per_page
> +
> + if self.save_to_db(start_index, total_results, content) is not True:
Instead of "is not True" just do "if not self.save_to_db".
I'm not sure what the start_index and total_results parameters are supposed to
be used for. Are they supposed to be persisted? If yes, then IMHO this should be
done by this class, in a separate table defined by this class.
> + return False
> +
> + self.connection.commit()
> +
> + if start_index >= total_results:
> + return True
> +
> + # Otherwise rate limiting will be hit.
> + time.sleep(6)
If the rate limit really is "5 requests per 30 seconds", then we can actually
do 3 requests back-to-back without any sleep without hitting the rate limit,
right? Assuming you do call the script pretty regularly (like, what we do on
a.b.o), the new results should be just a few pages, so we can avoid the sleep
entirely...
> +
> + def check_for_updates(self):
> + """
> + Check if the database file exists and if the last
> + update was more than 24 hours ago.
> + """
> + self.cleanup_db()
> + last_update = None
> + if os.path.exists(self.db_file):
> + last_update = os.stat(self.db_file).st_mtime
I don't think the mtime is a reliable way to determine the last update time.
Instead, IMHO it's better to add a table in the database that keeps track of the
last update. This would be a table that is added by the base class itself before
calling init_db(). After the download has completed entirely, this timestamp
would be updated. So if it's interrupted in the middle, the next try will
re-download with the previous timestamp. Also, the timestamp should be updated
with the timestamp that was used in the lastModEndDate (which can be quite a bit
earlier than the time at which the download actually finishes). Perhaps even
subtract one second from it to make sure we don't miss any updates that were
entered at the exact time we did the previous download.
> + if last_update >= time.time() - 86400:
> + return []
> + # NVD uses UTC timestamps
> + last_update = datetime.fromtimestamp(last_update, tz=timezone.utc)
> + shutil.copy2(self.db_file, self.db_file_tmp)
> +
> + self.connection = self.open_db(True)
self.connection = None should be added to __init__ to "declare" the attribute.
Also, db_connection would be a better name.
Now it's time for bed :-)
Regards,
Arnout
> + self.init_db()
> +
> + success = self.download(last_update)
> + self.connection.close()
> + if success:
> + shutil.move(self.db_file_tmp, self.db_file)
> + else:
> + print("Update failed!")
> + os.remove(self.db_file_tmp)
_______________________________________________
buildroot mailing list
buildroot@buildroot.org
https://lists.buildroot.org/mailman/listinfo/buildroot
next prev parent reply other threads:[~2023-08-30 21:38 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-12 19:28 [Buildroot] [PATCH v3 1/8] support/scripts/pkg-stats: fix typos Daniel Lang
2023-08-12 19:28 ` [Buildroot] [PATCH v3 2/8] support/scripts/pkg-stats: ignore llvm-project.mk Daniel Lang
2023-08-30 20:31 ` Arnout Vandecappelle via buildroot
2023-08-31 3:35 ` Daniel Lang
2023-08-12 19:28 ` [Buildroot] [PATCH v3 3/8] support/scripts/pkg-stats: check all files for warnings Daniel Lang
2023-08-30 20:36 ` Arnout Vandecappelle via buildroot
2023-08-30 21:04 ` Thomas Petazzoni via buildroot
2023-08-31 19:52 ` Daniel Lang
2023-09-01 6:47 ` Arnout Vandecappelle via buildroot
2023-08-12 19:28 ` [Buildroot] [PATCH v3 4/8] support/scripts/gen-missing-cpe: remove rarely used script Daniel Lang
2023-08-30 20:46 ` Arnout Vandecappelle via buildroot
2023-08-12 19:28 ` [Buildroot] [PATCH v3 5/8] support/scripts/nvd_api_v2.py: new helper class Daniel Lang
2023-08-30 21:37 ` Arnout Vandecappelle via buildroot [this message]
2023-08-31 20:18 ` Daniel Lang
2023-09-01 7:03 ` Arnout Vandecappelle via buildroot
2023-09-01 8:10 ` Thomas Petazzoni via buildroot
2023-09-01 11:08 ` Daniel Lang
2023-09-01 7:10 ` Arnout Vandecappelle via buildroot
2023-08-12 19:28 ` [Buildroot] [PATCH v3 6/8] support/scripts/cve.py: switch to NVD API v2 Daniel Lang
2023-08-12 19:28 ` [Buildroot] [PATCH v3 7/8] support/scripts/pkg-stats: switch CPEs " Daniel Lang
2023-08-12 19:28 ` [Buildroot] [PATCH v3 8/8] support/scripts/pkg-stats: Only match CPE vendor and product Daniel Lang
2023-08-30 20:45 ` [Buildroot] [PATCH v3 1/8] support/scripts/pkg-stats: fix typos Arnout Vandecappelle via buildroot
2023-09-14 8:26 ` Peter Korsgaard
2023-09-14 8:25 ` Peter Korsgaard
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=da055153-9ec6-ade2-46ee-49c2d33ebf36@mind.be \
--to=buildroot@buildroot.org \
--cc=arnout@mind.be \
--cc=dalang@gmx.at \
--cc=thomas.petazzoni@bootlin.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox