KernelCI Weekly Newsletter (Week 51)

From: Denys Fedoryshchenko
Date: 2024-12-20  5:53 UTC
To: kernelci

**Summary**

KCIDB notification generation is overloaded: some submissions carry too
much data, so the Cloud Function constantly runs out of RAM, and the
problem is exacerbated by the queue replaying failed messages. The 8GB
maximum RAM limit for Cloud Functions is not enough to fit all the data
- some submissions simply contain too many tests and incidents. Despite
this, smaller submissions are still being processed, and some
notifications are getting through. I have now limited the number of
retries to five for each update message and set up a dead-letter topic,
which should receive all the discarded messages. Because of this
increased load (constant Cloud Function execution), our finances are
taking a hit and we're projected to spend close to 800 USD this month.
One option is to completely disable notification generation once again
to save some money, but that would mean no notifications get through at
all. If everyone agrees that would be acceptable, I'll submit a small
PR disabling them.
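
For reference, the retry/dead-letter change is roughly along these
lines - a minimal sketch using the google-cloud-pubsub client, with
placeholder project, topic, and subscription names rather than the
actual deployment values:

    # Sketch only: resource names below are placeholders, not the real
    # KCIDB deployment values.
    from google.cloud import pubsub_v1

    PROJECT = "example-project"               # placeholder project ID
    TOPIC = "kcidb-updates"                   # placeholder source topic
    SUBSCRIPTION = "kcidb-updates-notify"     # placeholder subscription
    DEAD_LETTER_TOPIC = "kcidb-updates-dead"  # placeholder dead-letter topic

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(PROJECT, TOPIC)
    subscription_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
    dead_letter_topic_path = publisher.topic_path(PROJECT, DEAD_LETTER_TOPIC)

    # Discard a message to the dead-letter topic after five failed
    # delivery attempts instead of replaying it forever.
    subscription = pubsub_v1.types.Subscription(
        name=subscription_path,
        topic=topic_path,
        dead_letter_policy=pubsub_v1.types.DeadLetterPolicy(
            dead_letter_topic=dead_letter_topic_path,
            max_delivery_attempts=5,
        ),
    )
    subscriber.update_subscription(
        request={
            "subscription": subscription,
            "update_mask": {"paths": ["dead_letter_policy"]},
        }
    )

Note that the Pub/Sub service account also needs publish permission on
the dead-letter topic and subscribe permission on the subscription for
dead-lettering to actually take effect.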

I've been investigating this problem for over a week now, approaching
it from different angles and trying different solutions. I'm arriving
at the conclusion that we need a radical change in how we handle data
for notification generation. So far, for simplicity's sake, we have
cached the complete data for each processed revision in RAM, and that
no longer works for some revisions - there's just too much data. E.g.
this revision has 217288 tests and a similar number of incidents:
https://kcidb.kernelci.org/d/revision/revision?var-datasource=edquppk2ghfcwc&var-origin=$__all&var-build_architecture=$__all&var-build_config_name=$__all&var-git_commit_hash=b32913a5609a36c230e9b091da26d38f8e80a056&var-patchset_hash=&from=now-100y&to=now&timezone=browser&var-test_path=

We need a more sophisticated approach, involving paging, loading only
summaries by default, and narrowing down our queries with SQL on the
server side. It's possible to add these mechanisms to the existing
simple custom ORM we have in KCIDB, but I think it would be more
efficient and future-proof to try a ready-made, fully-fledged solution.
I think SQLAlchemy could be quite suitable, based on my conversations
with ChatGPT so far and on previous research. This will need to be
combined with pre-processing using (materialized) views on the
PostgreSQL side, as SQLAlchemy doesn't work well with aggregation and
denormalized data. I'll be doing more research and experiments on this.
It would also be good to re-evaluate our overall architecture and the
databases/solutions we're using, in case there are better approaches
that could improve costs, query performance, or flexibility. However,
that is orthogonal to the problem of fitting our data in memory, which
I expect will keep chasing us. E.g. I successfully retrieved the tests
for the above revision on a machine with 32GB RAM, but that is already
the maximum even for Cloud Run containers, and in any case it would be
expensive to keep increasing the memory.
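
To illustrate the direction I have in mind - this is only a sketch with
made-up view and column names, not KCIDB's actual schema - a
per-revision summary could be maintained as a materialized view on the
PostgreSQL side and read page by page through SQLAlchemy:

    # Sketch only: the view and column names are hypothetical, chosen
    # to illustrate paging over a server-side summary, not KCIDB's
    # actual schema.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://kcidb@localhost/kcidb")

    # Hypothetical materialized view aggregating tests per revision, so
    # the application never loads individual test rows just to build a
    # summary.
    CREATE_SUMMARY_VIEW = """
    CREATE MATERIALIZED VIEW IF NOT EXISTS revision_test_summary AS
    SELECT git_commit_hash, path, status, count(*) AS test_count
    FROM tests
    GROUP BY git_commit_hash, path, status;
    """

    with engine.begin() as conn:
        conn.execute(text(CREATE_SUMMARY_VIEW))

    PAGE_SIZE = 1000

    def iter_test_summary(git_commit_hash):
        """Yield summary rows for one revision, one page at a time."""
        offset = 0
        with engine.connect() as conn:
            while True:
                rows = conn.execute(
                    text(
                        "SELECT path, status, test_count "
                        "FROM revision_test_summary "
                        "WHERE git_commit_hash = :hash "
                        "ORDER BY path, status "
                        "LIMIT :limit OFFSET :offset"
                    ),
                    {"hash": git_commit_hash,
                     "limit": PAGE_SIZE, "offset": offset},
                ).fetchall()
                if not rows:
                    return
                yield from rows
                offset += PAGE_SIZE

The real implementation would likely use the ORM layer and keyset
pagination rather than OFFSET, but the point is that only aggregated,
bounded pages of data ever reach the notification code.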

** In other news: **

- KCIDB had a weird problem deploying from CI for a while, involving
PostgreSQL access issues from the Grafana container, which went away by
itself.
- I doubled the storage limit for PostgreSQL to 400 GB, as we were
already at 75% usage.

** Other Known Issues **

- The /docs endpoint remains broken due to compatibility issues between
FastAPI, FastAPI-users, and MongoDB in their current versions. This
will be investigated when additional resources become available.

** Ongoing Discussions and Development **

- On the Maestro side, we are trying to limit the number of submissions.
- Work is underway to implement a storage solution in Rust. We are
actively discussing potential storage features, which are detailed
here:
https://gist.github.com/nuclearcat/2d34f8266ccd1b8356bd993eadcf2eed

Happy Holidays!
Thank you for reading, and apologies for the delayed newsletter.


