Date: Thu, 19 Jun 2025 13:06:27 +0200
From: Adrian Reber
To: Andrei Vagin
Cc: criu@lists.linux.dev, Radostin Stoyanov
Subject: Re: Optimizing C/R Image Format for Kubernetes
Organization: Red Hat

On Wed, Jun 18, 2025 at 04:58:24PM -0700, Andrei Vagin wrote:
> I've been spending the last few days diving into checkpoint/restore
> (C/R) within Kubernetes, specifically focusing on the restore process
> and the current image format.
>
> I found the current container image format to be suboptimal.

You are right.
When we came up with that format we were looking for something that
works, and over time we also saw that it is far from perfect. The good
thing is that we control the implementation in podman, containerd and
cri-o and can easily change it to something better. We are open to
anything.

> I've examined containerd, and I suspect CRI-O has similar issues.

containerd is even worse than CRI-O because of the way it works
internally. My first approach was to write the checkpoint directly to
disk, but the containerd authors asked me to use their internal image
store. So now the checkpoint is created on disk and tarred up in the
containerd internal format, then it is transferred internally to another
layer of containerd, which unpacks it and adds the root-diff. Then it
writes this as another tar. Then, to create an OCI image, the tar is
again unpacked and written to another tar. So we are probably tarring up
the data 4 or 5 times. There is a lot of room for optimization, but with
containerd and Kubernetes we were happy to get any reviewers at all and
adopted their suboptimal suggestions.

> Essentially, it's a container image that encapsulates a
> checkpoint-restore archive. Each container start requires multiple
> unpacking steps:
> * Extracting the C/R archive: This yields two tar archives, one for the
>   filesystem delta and another for CRIU images.
> * Applying the filesystem delta: We need to mount the container's root
>   filesystem, then extract and apply this delta.
> * Restoring the container: Finally, we extract the CRIU images and
>   proceed with the restore.
>
> I believe this format, with its nested tar archives, leads to a
> significant amount of time wasted on unpacking, which directly impacts
> performance.

As mentioned above, totally correct.

> With the growing interest in using C/R to optimize application startup
> time, I've run some experiments. My findings indicate that the current
> image format significantly reduces the benefits of C/R, and in many
> cases, restoring a container from these images is actually slower than
> starting it from scratch.

We tried to have a proper format defined in the OCI spec:

https://github.com/opencontainers/image-spec/issues/962

But the discussion didn't result in anything useful, so at some point we
just ignored it.

> Here's my vision for an ideal image format for C/R-ed containers:
> * Filesystem Delta as an Overlay Layer: The filesystem delta should be
>   treated just like any other container image delta. This means it would
>   be specified as one of the overlay layers when a container is mounted.

Yes. The current format was a wrong decision on my part, as I was not
familiar with how those delta layers work.

> * Directly Accessible CRIU Images: Once an image is pulled locally, the
>   CRIU images should not be bundled in a tar archive. Instead, they
>   should be placed directly in a directory, allowing CRIU to use them
>   immediately without any extra extraction steps.

This is not actually true. The OCI image does not contain the tar
archive but the actual checkpoint files directly:

# podman pull quay.io/adrianreber/checkpoint-test:tag73
Trying to pull quay.io/adrianreber/checkpoint-test:tag73...
Getting image source signatures
Copying blob e65839d7ec1b done
Copying config 27d63848a3 done
Writing manifest to image destination
Storing signatures
27d63848a32d24c68b131f99880411c11af6519820ef22b989a86b7f10038c79
# podman image mount quay.io/adrianreber/checkpoint-test:tag73
/var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
# ls -la /var/lib/containers/storage/overlay/98aaf3c7dc28cfb2e79893ef952380b00169dcce910be48bbea1143b07ae2a0e/merged
total 44
dr-xr-xr-x. 1 root root  4096 Jun 19 10:53 .
drwx------. 6 root root  4096 Jun 19 10:53 ..
-rw-------. 1 root root  1120 Feb  1 11:11 bind.mounts
drw-------. 2 root root  4096 Feb  1 11:11 checkpoint
-rw-------. 1 root root   616 Feb  1 11:11 config.dump
-rw-------. 1 root root     0 Feb  1 11:11 dump.log
-rw-r--r--. 1 root root   315 Feb  1 11:11 io.kubernetes.cri-o.LogPath
-rw-r--r--. 1 root root  2048 Feb  1 11:11 rootfs-diff.tar
-rw-------. 1 root root 11276 Feb  1 11:11 spec.dump
-rw-r--r--. 1 root root    49 Feb  1 11:11 stats-dump

We currently have some metadata defined in
github.com/checkpoint-restore/checkpointctl which we want to use in all
three projects (podman, containerd and cri-o).

What I would also like to see is that we can write directly to an OCI
image, instead of first writing a local tar archive and then converting
it to an OCI image (like Podman already does today). But that requires
buy-in from Kubernetes and changes to the CRI-API, which have always been
extremely difficult for me to get accepted.

The main problem is that checkpoint/restore is not seen as an important
feature by most Kubernetes contributors (especially approvers and
reviewers). So having someone who supports our work instead of blocking
it would help us a lot. There is also the fear of exposing secret
information, which often blocks any progress in the Kubernetes area.
Having encryption in CRIU would also make those discussions easier (even
if the data is not always encrypted, being able to check the encryption
box would help).

		Adrian
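
Editorial illustration of the directory listing in the message: a minimal Python sketch that walks the top level of a mounted checkpoint image and separates entries that can be used as they are from entries that are still tar archives and would need one more unpacking step. This is not code from Podman, CRI-O, containerd or checkpointctl. The function name `classify_checkpoint_entries` is hypothetical, and "needs unpacking" is assumed to mean "a regular file that Python's `tarfile` module recognises as a tar archive".

```python
import os
import tarfile


def classify_checkpoint_entries(mount_dir):
    """Return (direct, needs_unpacking) for the top-level entries of mount_dir.

    Hypothetical helper for illustration only. An entry is reported as
    needing unpacking when it is a regular file that tarfile recognises
    as a tar archive; everything else (directories, plain files) is
    reported as directly usable.
    """
    direct, needs_unpacking = [], []
    for name in sorted(os.listdir(mount_dir)):
        path = os.path.join(mount_dir, name)
        if os.path.isfile(path) and tarfile.is_tarfile(path):
            needs_unpacking.append(name)
        else:
            direct.append(name)
    return direct, needs_unpacking
```

Run against a mount point like the one printed by `podman image mount` above, the sketch would be expected to list the `checkpoint` directory and the `*.dump` files as directly usable and to flag only `rootfs-diff.tar`, assuming that file is a regular tar archive.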