* Optimizing C/R Image Format for Kubernetes
@ 2025-06-18 23:58 Andrei Vagin
2025-06-19 8:36 ` Radostin Stoyanov
2025-06-19 11:06 ` Adrian Reber
0 siblings, 2 replies; 5+ messages in thread
From: Andrei Vagin @ 2025-06-18 23:58 UTC (permalink / raw)
To: criu, Radostin Stoyanov, Adrian Reber
Hi everyone,
I've been spending the last few days diving into checkpoint/restore
(C/R) within Kubernetes, specifically focusing on the restore process
and the current image format.
I found the current container image format to be suboptimal. I've
examined containerd, and I suspect CRI-O has similar issues.
Essentially, it's a container image that encapsulates a
checkpoint-restore archive. Each container start requires multiple
unpacking steps:
* Extracting the C/R archive: This yields two tar archives—one for the
filesystem delta and another for CRIU images.
* Applying the filesystem delta: We need to mount the container's root
filesystem, then extract and apply this delta.
* Restoring the container: Finally, we extract the CRIU images and
proceed with the restore.
I believe this format, with its nested tar archives, leads to a
significant amount of time wasted on unpacking, which directly impacts
performance.
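A minimal sketch of the nested unpacking described above, assuming an outer archive that holds one tar for the filesystem delta and one for the CRIU images. The archive and file names here are hypothetical and chosen only for illustration; the real runtimes may name and order things differently.

```python
import io
import tarfile


def make_tar(files):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_tar(blob):
    """Read every regular file from a tar archive into a {name: bytes} dict."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


# Hypothetical contents; the names are illustrative only.
criu_images = make_tar({"checkpoint/core-1.img": b"core", "checkpoint/files.img": b"files"})
rootfs_diff = make_tar({"etc/hostname": b"sleeper\n"})
cr_archive = make_tar({"checkpoint.tar": criu_images, "rootfs-diff.tar": rootfs_diff})

# Pass 1: unpack the outer C/R archive, which only yields two more archives.
outer = read_tar(cr_archive)
# Pass 2: unpack the filesystem delta so it can be applied to the root filesystem.
delta = read_tar(outer["rootfs-diff.tar"])
# Pass 3: unpack the CRIU images before the restore can start.
images = read_tar(outer["checkpoint.tar"])

print(sorted(outer), sorted(delta), sorted(images))
```

Every byte of the checkpoint is read and written again in each pass, which is the cost the proposal below tries to remove.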
With the growing interest in using C/R to optimize application startup
time, I've run some experiments. My findings indicate that the current
image format significantly reduces the benefits of C/R, and in many
cases, restoring a container from these images is actually slower than
starting it from scratch.
Here's my vision for an ideal image format for C/R-ed containers:
* Filesystem Delta as an Overlay Layer: The filesystem delta should be
treated just like any other container image delta. This means it would
be specified as one of the overlay layers when a container is mounted.
* Directly Accessible CRIU Images: Once an image is pulled locally, the
CRIU images should not be bundled in a tar archive. Instead, they
should be placed directly in a directory, allowing CRIU to use them
immediately without any extra extraction steps.
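To make the first point concrete, here is a rough sketch of how a filesystem delta could be passed to overlayfs as just another lower layer. It only builds the mount option string; all paths are hypothetical, and this is not how containerd or CRI-O snapshotters are actually implemented.

```python
def overlay_mount_options(image_layers, checkpoint_delta, upperdir, workdir):
    """Return the option string for an overlayfs mount.

    image_layers is ordered from the base layer to the topmost image layer.
    overlayfs expects lowerdir entries from topmost to bottommost, so the
    checkpoint delta goes first and the image layers follow in reverse.
    """
    lower = [checkpoint_delta] + list(reversed(image_layers))
    return "lowerdir={},upperdir={},workdir={}".format(":".join(lower), upperdir, workdir)


opts = overlay_mount_options(
    image_layers=["/layers/base", "/layers/app"],  # hypothetical paths
    checkpoint_delta="/layers/checkpoint-delta",   # hypothetical path
    upperdir="/run/ctr/upper",                     # hypothetical path
    workdir="/run/ctr/work",                       # hypothetical path
)
print("mount -t overlay overlay -o " + opts + " /run/ctr/merged")
```

Because the leftmost lowerdir entry is the topmost layer, the delta takes precedence over the image layers without any extraction into the container's root filesystem.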
Thanks,
Andrei
* Re: Optimizing C/R Image Format for Kubernetes
From: Radostin Stoyanov @ 2025-06-19 8:36 UTC (permalink / raw)
To: Andrei Vagin, criu, Adrian Reber
Hi Andrei,
Thank you for investigating! I like the idea about "Directly Accessible CRIU
Images". This was also one of the reasons we chose to implement
encryption support directly within the checkpoint/restore operations in
CRIU, rather than encrypting the tar archives. Enabling support for
encryption is an important requirement for the checkpoint/restore
functionality in Kubernetes. I will rebase my patches on the criu-dev
branch and open a pull request.
Best wishes,
Radostin
On 19/06/2025 00:58, Andrei Vagin wrote:
> Hi everyone,
>
> I've been spending the last few days diving into checkpoint/restore
> (C/R) within Kubernetes, specifically focusing on the restore process
> and the current image format.
>
> I found the current container image format to be suboptimal. I've
> examined containerd, and I suspect CRI-O has similar issues.
> Essentially, it's a container image that encapsulates a
> checkpoint-restore archive. Each container start requires multiple
> unpacking steps:
> * Extracting the C/R archive: This yields two tar archives—one for the
> filesystem delta and another for CRIU images.
> * Applying the filesystem delta: We need to mount the container's root
> filesystem, then extract and apply this delta.
> * Restoring the container: Finally, we extract the CRIU images and
> proceed with the restore.
>
> I believe this format, with its nested tar archives, leads to a
> significant amount of time wasted on unpacking, which directly impacts
> performance.
>
> With the growing interest in using C/R to optimize application startup
> time, I've run some experiments. My findings indicate that the current
> image format significantly reduces the benefits of C/R, and in many
> cases, restoring a container from these images is actually slower than
> starting it from scratch.
>
> Here's my vision for an ideal image format for C/R-ed containers:
> * Filesystem Delta as an Overlay Layer: The filesystem delta should be
> treated just like any other container image delta. This means it would
> be specified as one of the overlay layers when a container is mounted.
> * Directly Accessible CRIU Images: Once an image is pulled locally, the
> CRIU images should not be bundled in a tar archive. Instead, they
> should be placed directly in a directory, allowing CRIU to use them
> immediately without any extra extraction steps.
>
> Thanks,
> Andrei
* Re: Optimizing C/R Image Format for Kubernetes
From: Adrian Reber @ 2025-06-19 11:06 UTC (permalink / raw)
To: Andrei Vagin; +Cc: criu, Radostin Stoyanov
On Wed, Jun 18, 2025 at 04:58:24PM -0700, Andrei Vagin wrote:
> I've been spending the last few days diving into checkpoint/restore
> (C/R) within Kubernetes, specifically focusing on the restore process
> and the current image format.
>
> I found the current container image format to be suboptimal.
You are right. When we came up with that format we were looking for
something that works, and over time we also saw that it is far from perfect.
The good thing is we control the implementation in podman, containerd
and cri-o and can easily change it to something better. We are open to
anything.
> I've examined containerd, and I suspect CRI-O has similar issues.
containerd is even worse than CRI-O because of the way it works
internally. My first approach was to write the checkpoint directly to
disk, but the containerd authors asked me to use their internal image
store. So now the checkpoint is created on disk and tarred up in the
containerd internal format. It is then transferred internally to another
layer of containerd, which unpacks it, adds the root-diff, and writes
the result as another tar. Then, to create an OCI image, the tar is
unpacked again and written to yet another tar. So we are probably
tarring up the data 4 or 5 times. There is a lot of room for
optimization, but with containerd and Kubernetes we were happy to get
any reviewers at all and adopted their suboptimal suggestions.
> Essentially, it's a container image that encapsulates a
> checkpoint-restore archive. Each container start requires multiple
> unpacking steps:
> * Extracting the C/R archive: This yields two tar archives—one for the
> filesystem delta and another for CRIU images.
> * Applying the filesystem delta: We need to mount the container's root
> filesystem, then extract and apply this delta.
> * Restoring the container: Finally, we extract the CRIU images and
> proceed with the restore.
>
> I believe this format, with its nested tar archives, leads to a
> significant amount of time wasted on unpacking, which directly impacts
> performance.
As mentioned above. Totally correct.
> With the growing interest in using C/R to optimize application startup
> time, I've run some experiments. My findings indicate that the current
> image format significantly reduces the benefits of C/R, and in many
> cases, restoring a container from these images is actually slower than
> starting it from scratch.
We tried to have a proper format defined in the OCI spec:
https://github.com/opencontainers/image-spec/issues/962
But the discussion didn't result in anything useful, so at some point we
just ignored it.
> Here's my vision for an ideal image format for C/R-ed containers:
> * Filesystem Delta as an Overlay Layer: The filesystem delta should be
> treated just like any other container image delta. This means it would
> be specified as one of the overlay layers when a container is mounted.
Yes. The current format was my wrong decision as I was not familiar
with how those delta layers are working.
> * Directly Accessible CRIU Images: Once an image is pulled locally, the
> CRIU images should not be bundled in a tar archive. Instead, they
> should be placed directly in a directory, allowing CRIU to use them
> immediately without any extra extraction steps.
This is not actually true. The OCI image does not contain the tar
archive but the actual checkpoint files directly:
# podman pull quay.io/adrianreber/checkpoint-test:tag73
Trying to pull quay.io/adrianreber/checkpoint-test:tag73...
Getting image source signatures
Copying blob e65839d7ec1b done
Copying config 27d63848a3 done
Writing manifest to image destination
Storing signatures
27d63848a32d24c68b131f99880411c11af6519820ef22b989a86b7f10038c79
# podman image mount quay.io/adrianreber/checkpoint-test:tag73
/var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
# ls -la /var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
total 44
dr-xr-xr-x. 1 root root 4096 Jun 19 10:53 .
drwx------. 6 root root 4096 Jun 19 10:53 ..
-rw-------. 1 root root 1120 Feb 1 11:11 bind.mounts
drw-------. 2 root root 4096 Feb 1 11:11 checkpoint
-rw-------. 1 root root 616 Feb 1 11:11 config.dump
-rw-------. 1 root root 0 Feb 1 11:11 dump.log
-rw-r--r--. 1 root root 315 Feb 1 11:11 io.kubernetes.cri-o.LogPath
-rw-r--r--. 1 root root 2048 Feb 1 11:11 rootfs-diff.tar
-rw-------. 1 root root 11276 Feb 1 11:11 spec.dump
-rw-r--r--. 1 root root 49 Feb 1 11:11 stats-dump
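As a rough illustration of the listing above, the following sketch checks a mounted checkpoint image for the two entries that matter for restore cost: the checkpoint directory and rootfs-diff.tar. The function name and the returned keys are hypothetical, and the demo runs against a throwaway directory rather than a real image.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def describe_checkpoint_dir(root):
    """Report which parts of a mounted checkpoint image are directly usable."""
    root = Path(root)
    checkpoint = root / "checkpoint"
    diff = root / "rootfs-diff.tar"
    return {
        # CRIU images are usable as-is when they sit in a plain directory.
        "criu_images_unpacked": checkpoint.is_dir(),
        # The filesystem delta still needs an extraction step if it is a tar.
        "rootfs_diff_is_tar": diff.is_file() and tarfile.is_tarfile(str(diff)),
    }


# Demo on a temporary directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "checkpoint").mkdir()
    with tarfile.open(str(Path(tmp) / "rootfs-diff.tar"), "w") as tar:
        info = tarfile.TarInfo("etc/hostname")  # illustrative member
        info.size = 8
        tar.addfile(info, io.BytesIO(b"sleeper\n"))
    report = describe_checkpoint_dir(tmp)

print(report)
```

For a layout like the one listed, this would report the CRIU images as directly usable while the filesystem delta remains a tar that has to be extracted.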
We currently have some metadata defined in
github.com/checkpoint-restore/checkpointctl which we want to use in
all three projects (podman, containerd and cri-o).
What I also would like to see is that we can directly write to an OCI
image and not just first to a local tar archive and then convert it to
an OCI image (like Podman already does today). But that requires buy-in
from Kubernetes and changes to the CRI-API which has always been
extremely difficult for me to get accepted by Kubernetes. The main
problem is that checkpoint/restore is not seen as an important feature
by most Kubernetes contributors (especially approvers and reviewers).
So having someone who supports our work instead of blocking it would
help us a lot.
There is also the fear of exposing secret information, which often blocks
any progress in the Kubernetes area. Having encryption in CRIU would
also make those discussions easier (even if the data is not always
encrypted, but being able to check the encryption box would make
discussions easier).
Adrian
* Re: Optimizing C/R Image Format for Kubernetes
From: Andrei Vagin @ 2025-06-20 19:34 UTC (permalink / raw)
To: Adrian Reber, Radostin Stoyanov; +Cc: Andrei Vagin, criu
On Thu, Jun 19, 2025 at 4:06 AM Adrian Reber <areber@redhat.com> wrote:
...
>
> > Here's my vision for an ideal image format for C/R-ed containers:
> > * Filesystem Delta as an Overlay Layer: The filesystem delta should be
> > treated just like any other container image delta. This means it would
> > be specified as one of the overlay layers when a container is mounted.
>
> Yes. The current format was my wrong decision as I was not familiar
> with how those delta layers are working.
>
> > * Directly Accessible CRIU Images: Once an image is pulled locally, the
> > CRIU images should not be bundled in a tar archive. Instead, they
> > should be placed directly in a directory, allowing CRIU to use them
> > immediately without any extra extraction steps.
>
> This is not actually true. The OCI image does not contain the tar
> archive but the actual checkpoint files directly:
>
> # podman pull quay.io/adrianreber/checkpoint-test:tag73
> Trying to pull quay.io/adrianreber/checkpoint-test:tag73...
> Getting image source signatures
> Copying blob e65839d7ec1b done
> Copying config 27d63848a3 done
> Writing manifest to image destination
> Storing signatures
> 27d63848a32d24c68b131f99880411c11af6519820ef22b989a86b7f10038c79
> # podman image mount quay.io/adrianreber/checkpoint-test:tag73
> /var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
> # ls -la /var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
> total 44
> dr-xr-xr-x. 1 root root 4096 Jun 19 10:53 .
> drwx------. 6 root root 4096 Jun 19 10:53 ..
> -rw-------. 1 root root 1120 Feb 1 11:11 bind.mounts
> drw-------. 2 root root 4096 Feb 1 11:11 checkpoint
> -rw-------. 1 root root 616 Feb 1 11:11 config.dump
> -rw-------. 1 root root 0 Feb 1 11:11 dump.log
> -rw-r--r--. 1 root root 315 Feb 1 11:11 io.kubernetes.cri-o.LogPath
> -rw-r--r--. 1 root root 2048 Feb 1 11:11 rootfs-diff.tar
> -rw-------. 1 root root 11276 Feb 1 11:11 spec.dump
> -rw-r--r--. 1 root root 49 Feb 1 11:11 stats-dump
>
> We currently have some metadata defined in
> github.com/checkpoint-restore/checkpointctl which we want to use in
> all three projects (podman, containerd and cri-o).
You know, maybe there's a difference between CRI-O and containerd.
I followed the steps from the containerd test to create an image:
https://github.com/containerd/containerd/blob/main/contrib/checkpoint/checkpoint-restore-kubernetes-test.sh#L105
root@gke-cluster-1-default-pool-595f3f31-2wft:/home/avagin# docker create --name test-image avagin/test-cpt:0.5 ls
c80dbf467d99a0e3a6684d6cb36d29c212a6b12bdfc9af8abe2ffe3fcb69a5de
root@gke-cluster-1-default-pool-595f3f31-2wft:/home/avagin# docker export test-image | tar -t
.dockerenv
blobs/
blobs/sha256/
blobs/sha256/5159244823d7bfa959a4249c912ffef669c5596fcf41a866264823152b6dbba9
blobs/sha256/9178f6d56b033b8221dda746c3fd9ad98552569f05e66241365ef8a722da96be
blobs/sha256/eca4c8bdd20acb007a5594777ace63727d2c17413a54d3a5a817e252d0390902
dev/
dev/console
dev/pts/
dev/shm/
etc/
etc/hostname
etc/hosts
etc/mtab
etc/resolv.conf
index.json
oci-layout
proc/
sys/
root@gke-cluster-1-default-pool-595f3f31-2wft:/home/avagin# docker export test-image | tar -x -C test-img/
# tar -tf test-img/blobs/sha256/eca4c8bdd20acb007a5594777ace63727d2c17413a54d3a5a817e252d0390902
checkpoint/
checkpoint/cgroup.img
checkpoint/core-1.img
checkpoint/core-8.img
checkpoint/descriptors.json
checkpoint/fdinfo-2.img
checkpoint/fdinfo-3.img
checkpoint/files.img
checkpoint/fs-1.img
...
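The manual inspection above can be approximated with a short sketch that walks an OCI image layout (index.json, then the manifest blob, then each layer blob) and lists the members of every tar layer. It assumes each index entry points at an image manifest rather than a nested index, and the demo layout it builds is synthetic, not the avagin/test-cpt image.

```python
import hashlib
import io
import json
import tarfile
import tempfile
from pathlib import Path


def blob_path(layout, digest):
    """Map a digest such as 'sha256:abc...' to its file under blobs/."""
    algorithm, _, hexdigest = digest.partition(":")
    return Path(layout) / "blobs" / algorithm / hexdigest


def list_layer_members(layout):
    """Yield (layer digest, member names) for every tar layer in the layout."""
    index = json.loads((Path(layout) / "index.json").read_text())
    for entry in index["manifests"]:
        # Assumption: the entry is an image manifest, not a nested index.
        manifest = json.loads(blob_path(layout, entry["digest"]).read_text())
        for layer in manifest.get("layers", []):
            # Mode "r:*" lets tarfile detect gzip, bzip2 or xz compression itself.
            with tarfile.open(str(blob_path(layout, layer["digest"])), mode="r:*") as tar:
                yield layer["digest"], tar.getnames()


def put_blob(layout, data):
    """Store data under its sha256 digest (demo helper) and return the digest."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    path = blob_path(layout, digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return digest


# Build a tiny synthetic layout with one tar layer that holds a CRIU image file.
with tempfile.TemporaryDirectory() as layout:
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo("checkpoint/core-1.img")
        info.size = 4
        tar.addfile(info, io.BytesIO(b"core"))
    layer_digest = put_blob(layout, buf.getvalue())
    manifest_digest = put_blob(layout, json.dumps({"layers": [{"digest": layer_digest}]}).encode())
    (Path(layout) / "index.json").write_text(json.dumps({"manifests": [{"digest": manifest_digest}]}))
    result = list(list_layer_members(layout))

print(result)
```

A layer whose members all sit under checkpoint/ is a sign that the CRIU images are still wrapped in a tar at that level.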
>
> What I also would like to see is that we can directly write to an OCI
> image and not just first to a local tar archive and then convert it to
> an OCI image (like Podman already does today). But that requires buy-in
> from Kubernetes and changes to the CRI-API which has always been
> extremely difficult for me to get accepted by Kubernetes. The main
> problem is that checkpoint/restore is not seen as an important feature
> by most Kubernetes contributors (especially approvers and reviewers).
>
> So having someone who supports our work instead of blocking it would
> help us a lot.
>
> There is also the fear of exposing secret information, which often blocks
> any progress in the Kubernetes area. Having encryption in CRIU would
> also make those discussions easier (even if the data is not always
> encrypted, but being able to check the encryption box would make
> discussions easier).
Radostin, what is the current state of encryption for CRIU images?
* Re: Optimizing C/R Image Format for Kubernetes
From: Adrian Reber @ 2025-06-22 11:46 UTC (permalink / raw)
To: Andrei Vagin; +Cc: Radostin Stoyanov, Andrei Vagin, criu
On Fri, Jun 20, 2025 at 12:34:22PM -0700, Andrei Vagin wrote:
> On Thu, Jun 19, 2025 at 4:06 AM Adrian Reber <areber@redhat.com> wrote:
> ...
> >
> > > Here's my vision for an ideal image format for C/R-ed containers:
> > > * Filesystem Delta as an Overlay Layer: The filesystem delta should be
> > > treated just like any other container image delta. This means it would
> > > be specified as one of the overlay layers when a container is mounted.
> >
> > Yes. The current format was my wrong decision as I was not familiar
> > with how those delta layers are working.
> >
> > > * Directly Accessible CRIU Images: Once an image is pulled locally, the
> > > CRIU images should not be bundled in a tar archive. Instead, they
> > > should be placed directly in a directory, allowing CRIU to use them
> > > immediately without any extra extraction steps.
> >
> > This is not actually true. The OCI image does not contain the tar
> > archive but the actual checkpoint files directly:
> >
> > # podman pull quay.io/adrianreber/checkpoint-test:tag73
> > Trying to pull quay.io/adrianreber/checkpoint-test:tag73...
> > Getting image source signatures
> > Copying blob e65839d7ec1b done
> > Copying config 27d63848a3 done
> > Writing manifest to image destination
> > Storing signatures
> > 27d63848a32d24c68b131f99880411c11af6519820ef22b989a86b7f10038c79
> > # podman image mount quay.io/adrianreber/checkpoint-test:tag73
> > /var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
> > # ls -la /var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
> > total 44
> > dr-xr-xr-x. 1 root root 4096 Jun 19 10:53 .
> > drwx------. 6 root root 4096 Jun 19 10:53 ..
> > -rw-------. 1 root root 1120 Feb 1 11:11 bind.mounts
> > drw-------. 2 root root 4096 Feb 1 11:11 checkpoint
> > -rw-------. 1 root root 616 Feb 1 11:11 config.dump
> > -rw-------. 1 root root 0 Feb 1 11:11 dump.log
> > -rw-r--r--. 1 root root 315 Feb 1 11:11 io.kubernetes.cri-o.LogPath
> > -rw-r--r--. 1 root root 2048 Feb 1 11:11 rootfs-diff.tar
> > -rw-------. 1 root root 11276 Feb 1 11:11 spec.dump
> > -rw-r--r--. 1 root root 49 Feb 1 11:11 stats-dump
> >
> > We currently have some metadata defined in
> > github.com/checkpoint-restore/checkpointctl which we want to use in
> > all three projects (podman, containerd and cri-o).
>
> You know, maybe there's a difference between CRI-O and containerd.
> I followed the steps from the containerd test to create an image:
> https://github.com/containerd/containerd/blob/main/contrib/checkpoint/checkpoint-restore-kubernetes-test.sh#L105
>
> root@gke-cluster-1-default-pool-595f3f31-2wft:/home/avagin# docker create --name test-image avagin/test-cpt:0.5 ls
> c80dbf467d99a0e3a6684d6cb36d29c212a6b12bdfc9af8abe2ffe3fcb69a5de
> root@gke-cluster-1-default-pool-595f3f31-2wft:/home/avagin# docker export test-image | tar -t
> .dockerenv
> blobs/
> blobs/sha256/
> blobs/sha256/5159244823d7bfa959a4249c912ffef669c5596fcf41a866264823152b6dbba9
> blobs/sha256/9178f6d56b033b8221dda746c3fd9ad98552569f05e66241365ef8a722da96be
> blobs/sha256/eca4c8bdd20acb007a5594777ace63727d2c17413a54d3a5a817e252d0390902
> dev/
> dev/console
> dev/pts/
> dev/shm/
> etc/
> etc/hostname
> etc/hosts
> etc/mtab
> etc/resolv.conf
> index.json
> oci-layout
> proc/
> sys/
> root@gke-cluster-1-default-pool-595f3f31-2wft:/home/avagin# docker export test-image | tar -x -C test-img/
> # tar -tf test-img/blobs/sha256/eca4c8bdd20acb007a5594777ace63727d2c17413a54d3a5a817e252d0390902
> checkpoint/
> checkpoint/cgroup.img
> checkpoint/core-1.img
> checkpoint/core-8.img
> checkpoint/descriptors.json
> checkpoint/fdinfo-2.img
> checkpoint/fdinfo-3.img
> checkpoint/files.img
> checkpoint/fs-1.img
I am a bit confused. Using the following steps I see this:
# kubectl apply -f /root/sleep.yaml
pod/sleeper created
# CP=$(curl -s --insecure --cert /var/run/kubernetes/client-admin.crt --key /var/run/kubernetes/client-admin.key -X POST "https://localhost:10250/checkpoint/default/sleeper/sleep" | jq -r ".items[0]")
# newcontainer=$(buildah from scratch)
# buildah add "$newcontainer" $CP /
# buildah config --annotation=org.criu.checkpoint.container.name=test "$newcontainer"
# buildah commit "$newcontainer" checkpoint-image:latest
# buildah rm "$newcontainer"
# podman image mount checkpoint-image:latest
/var/lib/containers/storage/overlay/58681367751de52d5c779da8ee826d3ba51b21c880e4051f88ee64746d02017e/merged
# ls -la /var/lib/containers/storage/overlay/58681367751de52d5c779da8ee826d3ba51b21c880e4051f88ee64746d02017e/merged
total 32
dr-xr-xr-x. 1 root root 155 Jun 22 13:27 .
drwx------. 6 root root 69 Jun 22 13:27 ..
drwx------. 2 root root 4096 Jun 22 13:27 checkpoint
-rw-------. 1 root root 555 Jun 22 13:27 config.dump
-rw-------. 1 root root 0 Jun 22 13:27 container.log
-rw-r--r--. 1 root root 202 Jun 22 13:27 rootfs-diff.tar
-rw-r--r--. 1 root root 4424 Jun 22 13:27 spec.dump
-rw-------. 1 root root 46 Jun 22 13:27 stats-dump
-rw-------. 1 root root 298 Jun 22 13:27 status
-rw-------. 1 root root 1666 Jun 22 13:27 status.dump
# cat /var/lib/containers/storage/overlay/58681367751de52d5c779da8ee826d3ba51b21c880e4051f88ee64746d02017e/merged/config.dump | jq
{
"id": "d974adb0cc366bbb49ef83123eac019f2326b90c5af6eab18db0abb6a084c329",
"name": "sleep_sleeper_default_250c35ee-e0a4-4bf2-a681-09d7b3faf175_1",
"rootfsImage": "quay.io/adrianreber/sleep:alpine",
"rootfsImageRef": "quay.io/adrianreber/sleep@sha256:d504e702fa984e59d0573ff23a16023adb16a5405abf4ba35a64a62dbc9d3a6d",
"rootfsImageName": "quay.io/adrianreber/sleep:alpine",
"runtime": "io.containerd.runc.v2",
"createdTime": "2025-06-22T11:27:13.446696907Z",
"checkpointedTime": "2025-06-22T13:27:18.835719622+02:00",
"restoredTime": "0001-01-01T00:00:00Z",
"restored": false
}
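A small sketch of how a tool might read this metadata, using only field names that appear in the config.dump above. The helper name is hypothetical, and the idea that a restore looks up the base image from rootfsImageName is an assumption made for illustration.

```python
import json

# Trimmed copy of the config.dump shown above.
CONFIG_DUMP = """
{
  "name": "sleep_sleeper_default_250c35ee-e0a4-4bf2-a681-09d7b3faf175_1",
  "rootfsImageName": "quay.io/adrianreber/sleep:alpine",
  "runtime": "io.containerd.runc.v2",
  "restored": false
}
"""


def summarize_config(config_text):
    """Pick out the fields a restore would plausibly need first."""
    config = json.loads(config_text)
    return {
        # Assumption: the base image is mounted before the rootfs diff is applied.
        "base_image": config["rootfsImageName"],
        "runtime": config["runtime"],
        "already_restored": config["restored"],
    }


summary = summarize_config(CONFIG_DUMP)
print(summary)
```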
I guess docker export provides something different from podman image mount.
But, whatever we have right now, we can change it to something better.
No problem. We are the authors of all the implementations in containerd
and CRI-O (and Podman) and can change it.
Adrian