Re: Optimizing C/R Image Format for Kubernetes

From: Adrian Reber <areber@redhat.com>
To: Andrei Vagin <avagin@gmail.com>
Cc: criu@lists.linux.dev, Radostin Stoyanov <rstoyanov1@gmail.com>
Subject: Re: Optimizing C/R Image Format for Kubernetes
Date: Thu, 19 Jun 2025 13:06:27 +0200
Message-ID: <aFPvM8burGtrCg7Z@dcbz.redhat.com>
In-Reply-To: <CANaxB-wB6GR1mNZb02uK49c-q_Kx4uO5i92gE81iSt9ort_GEQ@mail.gmail.com>

On Wed, Jun 18, 2025 at 04:58:24PM -0700, Andrei Vagin wrote:
> I've been spending the last few days diving into checkpoint/restore
> (C/R) within Kubernetes, specifically focusing on the restore process
> and the current image format.
> 
> I found the current container image format to be suboptimal.

You are right. When we came up with it we were looking for something
that works, and over time we also saw that it is far from perfect.

The good thing is we control the implementation in podman, containerd
and cri-o and can easily change it to something better. We are open to
anything.

> I've examined containerd, and I suspect CRI-O has similar issues.

containerd is even worse than CRI-O because of the way it works
internally. My first approach was to directly write the checkpoint to
disk, but the containerd authors asked me to use their internal image
store. So now the checkpoint is created on disk and tarred up in the
containerd internal format. Then it is transferred internally to another
layer of containerd, which unpacks it and adds the root-diff. Then it
writes this as another tar. Then, to create an OCI image, the tar is
again unpacked and written to another tar. So we are probably tarring up
the data 4 or 5 times. There is a lot of room for optimization, but with
containerd and Kubernetes we were happy to get any reviewers at all and
adopted their suboptimal suggestions.

> Essentially, it's a container image that encapsulates a
> checkpoint-restore archive. Each container start requires multiple
> unpacking steps:
> * Extracting the C/R archive: This yields two tar archives—one for the
>   filesystem delta and another for CRIU images.
> * Applying the filesystem delta: We need to mount the container's root
>   filesystem, then extract and apply this delta.
> * Restoring the container: Finally, we extract the CRIU images and
>   proceed with the restore.
> 
> I believe this format, with its nested tar archives, leads to a
> significant amount of time wasted on unpacking, which directly impacts
> performance.

As mentioned above. Totally correct.
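
To make the cost of the nesting a bit more concrete, here is a minimal
Python sketch of the unpacking steps quoted above. It is only an
illustration, not how any of the runtimes implements it: the inner
archive name checkpoint.tar, the file names and the payload sizes are
all assumptions made up for the example.

```python
import io
import tarfile


def make_tar(files):
    """Return the bytes of an uncompressed tar holding {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_tar(blob):
    """Unpack a tar blob into {name: bytes} and count the payload bytes read."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar.getmembers():
            out[member.name] = tar.extractfile(member).read()
    return out, sum(len(v) for v in out.values())


# Hypothetical names and placeholder contents, for illustration only.
fs_delta = make_tar({"etc/app.conf": b"x" * 1000})
criu_images = make_tar({"pages-1.img": b"y" * 4000})
outer = make_tar({"rootfs-diff.tar": fs_delta, "checkpoint.tar": criu_images})

# Pass 1 unpacks the outer archive, pass 2 unpacks each inner archive.
inner, bytes_pass1 = read_tar(outer)
delta_files, bytes_delta = read_tar(inner["rootfs-diff.tar"])
image_files, bytes_images = read_tar(inner["checkpoint.tar"])

payload = 1000 + 4000
total_read = bytes_pass1 + bytes_delta + bytes_images
print(f"payload: {payload} bytes, read while unpacking: {total_read} bytes")
```

Every payload byte is read at least twice here, once as part of an inner
archive and once as the actual file, before anything can be restored.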

> With the growing interest in using C/R to optimize application startup
> time, I've run some experiments. My findings indicate that the current
> image format significantly reduces the benefits of C/R, and in many
> cases, restoring a container from these images is actually slower than
> starting it from scratch.

We tried to have a proper format defined in the OCI spec:

https://github.com/opencontainers/image-spec/issues/962

But the discussion didn't result in anything useful so at some point we
just ignored it.

> Here's my vision for an ideal image format for C/R-ed containers:
> * Filesystem Delta as an Overlay Layer: The filesystem delta should be
>   treated just like any other container image delta. This means it would
>   be specified as one of the overlay layers when a container is mounted.

Yes. The current format was a wrong decision on my part, as I was not
familiar with how those delta layers work.
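
As a rough sketch of what "just another overlay layer" could mean at
mount time: overlayfs takes its read-only layers as a colon-separated
lowerdir list with the topmost layer first, so the checkpoint delta
would simply go in front of the image layers. The helper name and all
paths below are hypothetical, and none of the runtimes does it this way
today.

```python
def overlay_mount_options(image_layers, checkpoint_delta, upperdir, workdir):
    """Build an overlayfs option string with the checkpoint delta on top.

    image_layers is ordered bottom-first (the order layers are listed in
    an image), while overlayfs expects lowerdir entries topmost-first.
    """
    lower = [checkpoint_delta] + list(reversed(image_layers))
    return "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lower), upperdir, workdir
    )


# Hypothetical layer paths, for illustration only.
opts = overlay_mount_options(
    image_layers=["/layers/base", "/layers/app"],
    checkpoint_delta="/layers/ckpt-delta",
    upperdir="/run/c1/upper",
    workdir="/run/c1/work",
)
print(opts)
```

The resulting string is what would be passed via -o to
"mount -t overlay", so no delta has to be extracted on top of an
already mounted root filesystem.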

> * Directly Accessible CRIU Images: Once an image is pulled locally, the
>   CRIU images should not be bundled in a tar archive. Instead, they
>   should be placed directly in a directory, allowing CRIU to use them
>   immediately without any extra extraction steps.

This is not actually true. The OCI image does not contain the tar
archive but the actual checkpoint files directly:

# podman pull quay.io/adrianreber/checkpoint-test:tag73
Trying to pull quay.io/adrianreber/checkpoint-test:tag73...
Getting image source signatures
Copying blob e65839d7ec1b done
Copying config 27d63848a3 done
Writing manifest to image destination
Storing signatures
27d63848a32d24c68b131f99880411c11af6519820ef22b989a86b7f10038c79
# podman image mount quay.io/adrianreber/checkpoint-test:tag73
/var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
# ls -la /var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
total 44
dr-xr-xr-x. 1 root root  4096 Jun 19 10:53 .
drwx------. 6 root root  4096 Jun 19 10:53 ..
-rw-------. 1 root root  1120 Feb  1 11:11 bind.mounts
drw-------. 2 root root  4096 Feb  1 11:11 checkpoint
-rw-------. 1 root root   616 Feb  1 11:11 config.dump
-rw-------. 1 root root     0 Feb  1 11:11 dump.log
-rw-r--r--. 1 root root   315 Feb  1 11:11 io.kubernetes.cri-o.LogPath
-rw-r--r--. 1 root root  2048 Feb  1 11:11 rootfs-diff.tar
-rw-------. 1 root root 11276 Feb  1 11:11 spec.dump
-rw-r--r--. 1 root root    49 Feb  1 11:11 stats-dump
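
A quick way to see which parts of such a mounted image would still need
an extraction step is to look for tar archives in it. The following is
a minimal Python sketch under stated assumptions: find_tar_archives is
a hypothetical helper, and the demo runs on a throwaway directory with
placeholder files that only mimics the listing above.

```python
import io
import os
import tarfile
import tempfile


def find_tar_archives(root):
    """Return relative paths of regular files under root that are tar archives."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and tarfile.is_tarfile(path):
                found.append(os.path.relpath(path, root))
    return sorted(found)


# Demo on a temporary directory; the contents are placeholders, not real
# CRIU images or runtime metadata.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "checkpoint"))
    with open(os.path.join(root, "checkpoint", "placeholder.img"), "wb") as f:
        f.write(b"not a tar archive")
    with open(os.path.join(root, "config.dump"), "w") as f:
        f.write("{}")
    with tarfile.open(os.path.join(root, "rootfs-diff.tar"), "w") as tar:
        data = b"changed file"
        info = tarfile.TarInfo("etc/hostname")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    remaining = find_tar_archives(root)

print(remaining)
```

On a layout like the one above, only rootfs-diff.tar would be reported,
which is the part that still has to be unpacked at restore time.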

We currently have some metadata defined in
github.com/checkpoint-restore/checkpointctl which we want to use in
all three projects (podman, containerd and cri-o).

What I would also like to see is that we can directly write to an OCI
image and not first to a local tar archive that is then converted to
an OCI image (like Podman already does today). But that requires buy-in
from Kubernetes and changes to the CRI-API, which have always been
extremely difficult for me to get accepted by Kubernetes. The main
problem is that checkpoint/restore is not seen as an important feature
by most Kubernetes contributors (especially approvers and reviewers).

So having someone who supports our work instead of blocking it would
help us a lot.

There is also the fear of exposing secret information, which often
blocks any progress in the Kubernetes area. Having encryption in CRIU
would also make those discussions easier (even if the data is not
always encrypted, being able to check the encryption box would help).

		Adrian

Thread overview: 5+ messages
2025-06-18 23:58 Optimizing C/R Image Format for Kubernetes Andrei Vagin
2025-06-19  8:36 ` Radostin Stoyanov
2025-06-19 11:06 ` Adrian Reber [this message]
2025-06-20 19:34   ` Andrei Vagin
2025-06-22 11:46     ` Adrian Reber