All of lore.kernel.org
 help / color / mirror / Atom feed
* RE: [OE-core] [PATCH] sstate: remove corrupted artifacts from local mirror
@ 2024-10-08 18:47 Sobon, Przemyslaw
  2024-10-10 12:18 ` Alexander Kanavin
  0 siblings, 1 reply; 12+ messages in thread
From: Sobon, Przemyslaw @ 2024-10-08 18:47 UTC (permalink / raw)
  To: Alexander Kanavin, Yu, Max; +Cc: openembedded-core@lists.openembedded.org

On Tuesday, October 8, 2024 2:38 AM Alexander Kanavin <alex.kanavin@gmail.com> wrote:
>On Mon, 7 Oct 2024 at 22:09, Yu, Max <yumx@amazon.com> wrote:
>> Corruptions are unavoidable part of our life, disks and network can inject
>> failures due to unpredictable and unknown reasons like
>> https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723.
>> Even multi-layer protection is not perfect as it depends on where the error
>> is injected.
> I'm sorry but I am not convinced. You are the only one reporting the issue,
> so I don't think it's unavoidable.
The "only one reporting" does not mean the problem does not exist, it may be
just not as big problem for others as for us. Similar problem exist for DRAM
memory corruptions? Most of the people don't care about that but for some
this is important problem, e.g. when you see 1 OS crash per 10 years it is
not a big deal but if you own 10k servers you see 3 crashes per day. That is
the scale factor that is important. Max talked about our scale already.
Summarizing, the manual work is not a solution for us due to scale.

>> We proposed this patch as it is aligned with how we currently handle
>> incorrect checksums:
>> https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.
> What the patch is doing is essentially -c cleansstate on a cache used by
> several consumers. This makes it a non-starter: you can't remove items from
> live sstate like that. Once an object is in the cache, it needs to be removed
> offline, or replaced atomically if it's corrupted.
I disagree, we can overwrite bad artifact. Yocto indirectly does that as it has
to rebuild the package. This is "by design" behavior. And to be honest, there is
no difference between (1) rebuilding the package every time and (2) overwriting
sstate cache so any other build can reuse it. Is there any concern around
uploading such freshly built artifact?

> Before we decide how to handle corrupted items, it's perhaps better to
> consider first how they should be reported. Then users can decide what they
> want to do with that information without yocto forcing something on them.
> So please: how can I observe the issue?
This is random thing, we are not in control of e.g. DRAM bit flip error, they
simply happen. To simulate the situation you can inject an error yourself by
e.g., overwriting the random byte of the zstd file before it is uploaded.

>Alex

Regards,
Przemyslaw

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [OE-core] [PATCH] sstate: remove corrupted artifacts from local mirror
@ 2024-10-15 22:43 Yu, Max
  2024-10-15 23:01 ` Richard Purdie
  2024-10-16 10:01 ` Alexander Kanavin
  0 siblings, 2 replies; 12+ messages in thread
From: Yu, Max @ 2024-10-15 22:43 UTC (permalink / raw)
  To: Alexander Kanavin, Richard Purdie
  Cc: Sobon, Przemyslaw, openembedded-core@lists.openembedded.org

> > > So that's the tricky thing for us. We will have to run the script at
> > > a time window where no builds are happening. Because when I tried
> > > deleting corrupted objects from the s3 sstate cache manually, the
> > > corrupted sstate object ended up being reuploaded by ongoing
> > > builds... And this is pretty difficult to execute reliably, since we
> > > have so many builds.
> >
> > I'm a little bit puzzled here. How would something upload the corrupt
> > artefact back into the system?
>
> I am also puzzled by this. I was going to suggest that perhaps there's
> a rare bug in zstd compressor that deterministically (and
> reproducibly) creates a corrupt archive on a certain input, because
> ongoing builds can't simply 'reupload' something corrupted that has
> been deleted from sstate. But maybe there's some extra proprietary
> layer of 'overwriting' and 'synchronization' where all this trouble is
> coming from.

Just dug into some of the details more, maybe this is related to how we implemented the s3 mirror. For context, our Yocto builds work like this for a recipe:
1. download the sstate siginfo+object from s3, into our local sstate cache. (I think this is unmodified yocto logic for mirrors just using SSTATE_MIRRORDIR)
2. build
3. upload the local siginfo+object back to s3. (This only happens if s3's object is different)

The reupload happens when we delete the s3 object between steps 1 and 3, where step 3 will just reupload their local sstate cache objects (corrupted in this case). Especially with how yocto parallelizes tasks, this time window between 1 and 2 is quite large. And with the number of builds that happen, we then have no reliable way of cleaning out s3 sstate objects. Note that, even if we changed step 3 to only upload if the object doesn't exist, we would still have this problem.

Reflecting on our s3 mirror setup, maybe we're doing things in a non-standard way, especially regarding step 3.

How do folks normally update sstate mirrors? Do people usually only have a single/small number of writers? (For our use case, we want to keep the remote sstate mirror very up to date.)


> No, overwriting files is a recipe for disaster. See above, you get
> partially complete files which come out as corrupted.

I understand your concerns with overwriting. What I was suggesting with "overwriting" was more creating a new file and `mv`ing it, which should be atomic. Same with the s3 operations we use to upload, they will not leave a file partially written/overwritten if we trust s3 claims.

> I think this overwriting may be the source of your problems. You should
> not need to or be doing this. Why do you need to?
> 
> It is easy to blame bit flips and put a hack into the system to force
> an overwrite but I think you have a more fundamental issue going on
> which you may want to fix properly.

Like I said, it's probably not something like bit flips, but since we only see these corruptions after moving to kirkstone, it might be related to things like using pzstd or other infra changes. The s3 uploading/updating logic did not change, and we are even using the same boto3 (aws python client) versions. We definitely do also want to fix the corruption; it's just been incredibly hard to narrow down what is causing it... 

Thanks for your help and patience,
Max


^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [OE-core] [PATCH] sstate: remove corrupted artifacts from local mirror
@ 2024-10-15 18:25 Yu, Max
  2024-10-15 20:32 ` Richard Purdie
  0 siblings, 1 reply; 12+ messages in thread
From: Yu, Max @ 2024-10-15 18:25 UTC (permalink / raw)
  To: Alexander Kanavin, Sobon, Przemyslaw
  Cc: openembedded-core@lists.openembedded.org


> But can you say what your scale actually is? How many sstate objects
> are written into the shared cache per day? How often do you see
> corruptions?

Our sstate cache has 25TB and 12M objects. I do not have the read/write metrics, but we run ~5k image builds per day, and we were seeing the corruption issue about once per two weeks before this patch. Our image builds range from 3k - 8k build steps, but these numbers become funky since these are not builds from scratch. (These corruptions we're facing are likely not bit flips, the example I gave was just trying to illustrate how rare events can happen that shouldn't just be summarized with "fix your infra". If it makes things better or worse, we only started to see these corruption issues after moving to kirkstone where they are now compressed with pzstd...)

For some more context, we at AWS use Yocto to build the OS for a lot of hardware platforms and a number of smaller component images. Cartesian product the two and we end up with needing to build a lot of images...

Specifically related to sstate caches, we have a remote sstate mirror setup in s3. All CI builds update the s3 mirror, so we have a large number of writers. For all our builds, we setup a local build directory per repo (a repo can have multiple), which is where we host the local sstate cache, and we parallelize based on build directory.

> The concern is that your patch does not overwrite the artifact, rather
> it deletes it first, and recreates it later. This creates a time
> window where a cache object exists, and then it doesn't, and then it
> exists again. This will break builds running in parallel in all sorts
> of interesting ways, as sstate is not designed for objects
> disappearing after they've been checked and confirmed to exist by
> bitbake.

This makes a lot of sense. This is a use case we didn't consider when creating the patch and not applicable to us. (Since we parallelize by multiple build directories that don't share a local sstate cache. Developer builds don't really care about the extra local parallelization, while CI builds share the remote sstate cache.)

> Then you can take the report and run a script that deletes the
> offending items. This all can be automated, and doesn't have to be
> executed manually.

So that's the tricky thing for us. We will have to run the script at a time window where no builds are happening. Because when I tried deleting corrupted objects from the s3 sstate cache manually, the corrupted sstate object ended up being reuploaded by ongoing builds... And this is pretty difficult to execute reliably, since we have so many builds.

I do understand more where you're coming from now, and it does sound like the way we implemented this fix is not applicable for all use cases. Can we make this a configurable option instead? 

Or if we don't want to remove an existing sstate object from the cache, would it be fine to then overwrite existing sstate cache objects after we finish rebuild it? With our current remote s3 sstate mirror, we already overwrite the remote when the object is different. (I think it might be possible to implement a marker during the decompression phase, and later overwrite the previous sstate object after the local rebuilds. Will need to look more into it. )

Thanks,
Max


On 10/10/24, 5:19 AM, "Alexander Kanavin" <alex.kanavin@gmail.com <mailto:alex.kanavin@gmail.com>> wrote:


CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.






On Tue, 8 Oct 2024 at 20:47, Sobon, Przemyslaw <psobon@amazon.com <mailto:psobon@amazon.com>> wrote:
> The "only one reporting" does not mean the problem does not exist, it may be
> just not as big problem for others as for us. Similar problem exist for DRAM
> memory corruptions? Most of the people don't care about that but for some
> this is important problem, e.g. when you see 1 OS crash per 10 years it is
> not a big deal but if you own 10k servers you see 3 crashes per day. That is
> the scale factor that is important. Max talked about our scale already.
> Summarizing, the manual work is not a solution for us due to scale.


But can you say what your scale actually is? How many sstate objects
are written into the shared cache per day? How often do you see
corruptions?


Basically it helps if you introduce yourselves and your product first,
as this is I think you first time interacting with the community?


> I disagree, we can overwrite bad artifact. Yocto indirectly does that as it has
> to rebuild the package. This is "by design" behavior. And to be honest, there is
> no difference between (1) rebuilding the package every time and (2) overwriting
> sstate cache so any other build can reuse it. Is there any concern around
> uploading such freshly built artifact?


The concern is that your patch does not overwrite the artifact, rather
it deletes it first, and recreates it later. This creates a time
window where a cache object exists, and then it doesn't, and then it
exists again. This will break builds running in parallel in all sorts
of interesting ways, as sstate is not designed for objects
disappearing after they've been checked and confirmed to exist by
bitbake.


For example, when we do need to test cache deletions (for instance in
oe-selftest), we make super-sure that this is done on a private small
test cache that isn't shared with anything, as otherwise there have
been notoriously strange failures in random places.


The other concern I expressed to Max: this auto-recovery sweeps the
'flaky hardware' problem under the rug, instead of being loud and
clear about it. If someone had a perfectly working sstate (and many
people do, including the yocto upstream), and then it started throwing
random fails, they're not going to notice it. If someone had very rare
corruptions and then the rate increased, they're not going to notice
that either. Except when they start to wonder why builds seem to take
longer and longer and longer.


> This is random thing, we are not in control of e.g. DRAM bit flip error, they
> simply happen. To simulate the situation you can inject an error yourself by
> e.g., overwriting the random byte of the zstd file before it is uploaded.


Yes, I saw it. The key issue is this bit in sstate.bbclass before
actually creating the sstate archive:


if sstate_pkg.exists():
touch(sstate_pkg)
return


If sstate item exists, then it will not be replaced, even if it's been
determined to be corrupted earlier.


I don't know yet how to best handle this, but I would want to improve
*reporting* of corrupt sstate before we can decide whether yocto can
do something about it that doesn't make things worse than they are
now.


Then you can take the report and run a script that deletes the
offending items. This all can be automated, and doesn't have to be
executed manually.


Alex




^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH] sstate: remove corrupted artifacts from local mirror
@ 2024-10-05  0:44 Max Yu
  2024-10-07 11:57 ` [OE-core] " Alexander Kanavin
  0 siblings, 1 reply; 12+ messages in thread
From: Max Yu @ 2024-10-05  0:44 UTC (permalink / raw)
  To: openembedded-core; +Cc: Max Yu, Przemyslaw Sobon

We observe sstate cache corruptions sometimes which cause rebuilds. That is not
a fatal error as the package has to be rebuilt and updated artifact needs to be
pushed to remote sstate cache mirror. Currently, Yocto does not handle
corruptions properly, where the corrupted artifact is not deleted or
renamed. Later, after the package is built the same corrupted artifact is pushed
to remote mirror and the same procedure is circled again and again.

This change verifies the outcome of the unpacking action and renames the
artifact if a fatal error occurred ("tar" tool returns error 2). In such case we
rename the artifact what causes that a proper one is created and uploaded
overwriting the exisiting one - the corrupted one - in the remote mirror. That
way we break the loop of uploading corrupted file again and again.

Suggested-by: Przemyslaw Sobon <psobon@amazon.com>
Signed-off-by: Max Yu <yumx@amazon.com>
---
 meta/classes-global/sstate.bbclass | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
index 11bb892a42..5a7ce35341 100644
--- a/meta/classes-global/sstate.bbclass
+++ b/meta/classes-global/sstate.bbclass
@@ -937,7 +937,12 @@ sstate_unpack_package () {
 		ZSTD="pzstd -p ${ZSTD_THREADS}"
 	fi
 
-	tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
+	if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
+		echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
+		mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
+		exit 2
+	fi
+
 	# update .siginfo atime on local/NFS mirror if it is a symbolic link
 	[ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
 	# update each symbolic link instead of any referenced file
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-10-16 10:01 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-08 18:47 [OE-core] [PATCH] sstate: remove corrupted artifacts from local mirror Sobon, Przemyslaw
2024-10-10 12:18 ` Alexander Kanavin
  -- strict thread matches above, loose matches on Subject: below --
2024-10-15 22:43 Yu, Max
2024-10-15 23:01 ` Richard Purdie
2024-10-16 10:01 ` Alexander Kanavin
2024-10-15 18:25 Yu, Max
2024-10-15 20:32 ` Richard Purdie
2024-10-15 20:52   ` Alexander Kanavin
2024-10-05  0:44 Max Yu
2024-10-07 11:57 ` [OE-core] " Alexander Kanavin
2024-10-07 20:09   ` Yu, Max
2024-10-07 20:38     ` Alexander Kanavin
2024-10-08  9:37     ` Alexander Kanavin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.