From: Barry Song <21cnbao@gmail.com>
Date: Wed, 6 Mar 2024 14:15:18 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chris Li
Cc: lsf-pc@lists.linux-foundation.org, linux-mm, ryan.roberts@arm.com, David Hildenbrand, Chuanhua Han
On Fri, Mar 1, 2024 at 10:24 PM Chris Li wrote:
>
> In last year's LSF/MM I talked about a VFS-like swap system. That is
> the pony that was chosen. However, I did not have much chance to go
> into details.
>
> This year, I would like to discuss what it takes to re-architect the
> whole swap back end from scratch.
>
> Let's start from the requirements for the swap back end.
>
> 1) Support the existing swap usage (not the implementation).
>
> Some other design goals:
>
> 2) Low per-swap-entry memory usage.
>
> 3) Low I/O latency.
>
> What are the functions the swap system needs to support?
>
> At the device level, swap systems need to support a list of swap files
> with a priority order.
> Swap devices of the same priority are written round-robin. Swap
> device types include zswap, zram, SSD, spinning hard disk, and a
> swap file in a file system.
>
> At the swap entry level, here is the list of existing swap entry usage:
>
> * Swap entry allocation and free. Each swap entry needs to be
> associated with a location of the disk space in the swapfile
> (offset of the swap entry).
> * Each swap entry needs to track the map count of the entry. (swap_map)
> * Each swap entry needs to be able to find the associated memory
> cgroup. (swap_cgroup_ctrl->map)
> * Swap cache: look up the folio/shadow from the swap entry.
> * Swap page writes through a swapfile in a file system other than a
> block device. (swap_extent)
> * Shadow entries. (stored in the swap cache)
>
> Any new swap back end might have a different internal implementation,
> but it needs to support the above usage. For example, using an
> existing file system as the swap back end, with a per-VMA or
> per-swap-entry mapping to a file, would require additional data
> structures to track swap_cgroup_ctrl, on top of the size of the file
> inode. It would be challenging to meet design goals 2) and 3) using
> another file system as it is.
>
> I am considering grouping the different per-entry data into one
> single struct and allocating it dynamically, so there is no upfront
> allocation of swap_map.
>
> For swap entry allocation, the current kernel supports swapping out
> 0-order or PMD-order pages.
>
> There are some discussions and patches that add swap-out for folio
> sizes in between (mTHP):
>
> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
>
> and swap-in for mTHP:
>
> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
>
> The introduction of swapping different orders of pages will further
> complicate the swap entry fragmentation issue. The swap back end has
> no way to predict the life cycle of the swap entries.
> Repeated allocation and freeing of swap entries of different sizes
> will fragment the swap entry array. If we can't allocate contiguous
> swap entries for an mTHP, the mTHP will have to be split to a smaller
> size to perform the swap-out and swap-in.
>
> Current swap only supports 4K pages or PMD-size pages. Adding the
> other in-between sizes greatly increases the chance of fragmenting
> the swap entry space. When there are no more contiguous swap entries
> for an mTHP, the mTHP is forced to split into 4K pages. If we don't
> solve the fragmentation issue, it will be a constant source of mTHP
> splits.
>
> Another limitation I would like to address is that swap_writepage can
> only write out IO in one contiguous chunk, not able to perform
> non-contiguous IO. When the swapfile is close to full, it is likely
> the unused entries will spread across different locations. It would
> be nice to be able to read and write large folios using discontiguous
> disk IO locations.

I don't think it would be too difficult for swap_writepage() to write
out a large folio that has discontiguous swap offsets. Taking zram as
an example, as long as the bio is organized correctly, zram should be
able to write a large folio subpage by subpage:
static void zram_bio_write(struct zram *zram, struct bio *bio)
{
	unsigned long start_time = bio_start_io_acct(bio);
	struct bvec_iter iter = bio->bi_iter;

	do {
		u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
		u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
				SECTOR_SHIFT;
		struct bio_vec bv = bio_iter_iovec(bio, iter);

		bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);

		if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
			atomic64_inc(&zram->stats.failed_writes);
			bio->bi_status = BLK_STS_IOERR;
			break;
		}

		zram_slot_lock(zram, index);
		zram_accessed(zram, index);
		zram_slot_unlock(zram, index);

		bio_advance_iter_single(bio, &iter, bv.bv_len);
	} while (iter.bi_size);

	bio_end_io_acct(bio, start_time);
	bio_endio(bio);
}

Right now, add_to_swap() lacks a way to record a discontiguous offset
for each subpage; instead we only have a single folio->swap. I wonder
if we can somehow make it page-granularity, so that each subpage has
its own offset, something like page->swap; then in swap_writepage() we
can build a bio with multiple discontiguous I/O indexes. We would then
allow add_to_swap() to get nr_pages different swap offsets and fill
one into each subpage. But would this be a step backwards for folios?

> Some possible ideas for the fragmentation issue.
>
> a) A buddy allocator for swap entries, similar to the buddy allocator
> for memory. We can use a buddy allocator system for swap entries to
> keep low-order swap entries from fragmenting too much of the
> high-order swap entry space. It should greatly reduce the
> fragmentation caused by allocating and freeing swap entries of
> different sizes. However, the buddy allocator has its own limits as
> well. Unlike system memory, which we can move and compact, there is
> no rmap for a swap entry, so it is much harder to move a swap entry
> to another disk location. The buddy allocator for swap will therefore
> help, but not solve all of the fragmentation issues.

I agree buddy will help.
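To make idea a) a bit more concrete, here is a minimal userspace
sketch of a buddy allocator over a tiny array of swap slots. All names
(swap_buddy_alloc() etc.), the free-list layout, and the sizes are
invented for illustration; a kernel version would need locking,
per-cluster scoping, and a far larger slot space:

```c
#include <assert.h>

#define MAX_ORDER 4                /* orders 0..4: blocks of 1..16 slots */
#define NSLOTS (1 << MAX_ORDER)    /* tiny swap area for illustration */

/* free_head[o] is the first free block of 1<<o slots; -1 terminates. */
static int free_head[MAX_ORDER + 1];
static int next_free[NSLOTS];      /* singly linked free lists by offset */

static void buddy_init(void)
{
	for (int o = 0; o <= MAX_ORDER; o++)
		free_head[o] = -1;
	free_head[MAX_ORDER] = 0;  /* one max-order block covers everything */
	next_free[0] = -1;
}

static void push(int order, int off)
{
	next_free[off] = free_head[order];
	free_head[order] = off;
}

static int pop(int order)
{
	int off = free_head[order];

	if (off >= 0)
		free_head[order] = next_free[off];
	return off;
}

/* Allocate 1<<order contiguous slots; returns the offset, or -1. */
static int swap_buddy_alloc(int order)
{
	int o = order;

	while (o <= MAX_ORDER && free_head[o] < 0)
		o++;
	if (o > MAX_ORDER)
		return -1;

	int off = pop(o);

	while (o > order) {        /* split, keeping the upper halves free */
		o--;
		push(o, off + (1 << o));
	}
	return off;
}

/* Free a block, coalescing with its buddy while the buddy is free. */
static void swap_buddy_free(int off, int order)
{
	while (order < MAX_ORDER) {
		int buddy = off ^ (1 << order);
		int *p = &free_head[order];

		while (*p >= 0 && *p != buddy)
			p = &next_free[*p];
		if (*p < 0)
			break;             /* buddy is busy: stop merging */
		*p = next_free[buddy];     /* unlink the buddy */
		off &= ~(1 << order);      /* merged block starts lower */
		order++;
	}
	push(order, off);
}
```

As Chris notes, splitting and coalescing only helps as long as
freed buddies actually pair up; without an rmap there is no way to
compact the slots that stay busy.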
Meanwhile, we might need something similar to the MOVABLE/UNMOVABLE
migratetypes: for example, try to gather swap allocations for small
folios together and not let them spread throughout the whole swapfile.
We might be able to dynamically classify swap clusters as being for
small folios or for large folios, and keep small folios from
spreading across all clusters.

> b) Large swap entries. Take a file as an example: a file on a file
> system can be written to discontiguous disk locations, and the file
> system is responsible for tracking how file offsets map to disk
> locations. A large swap entry can have a similar indirection array
> mapping out the disk location of each subpage within a folio. This
> allows a large folio to be written out to discontiguous swap entries
> in the swap file. The array will need to be stored somewhere as part
> of the overhead. When allocating swap entries for the folio, we can
> allocate a batch of smaller 4K swap entries into an array, and use
> this array to read/write the large folio. There will be a lot of
> plumbing work to get this to work.

We already have the page struct; I wonder if we can record the offset
there, if that is not a step backwards for folios. On the other hand,
while swapping in, we can also allow large folios to be swapped in
from discontiguous places, and those offsets are actually also in the
PTE entries. I feel we have "page" to record the offset before
pageout() is done, and we have the PTE entries to record the offset
after pageout() is done.

But (a) is still needed, as we really hope large folios can be put at
contiguous offsets. With that, we might get other benefits, such as
saving the whole compressed large folio as one object rather than
nr_pages objects in zsmalloc, and decompressing them together while
swapping in (a patchset is coming in a couple of days for this). When
a large folio is put in nr_pages different places, we can hardly do
this in zsmalloc.
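Idea b)'s indirection array could look roughly like the following
userspace sketch. The struct and function names (swap_desc,
swap_desc_create(), swap_desc_is_contiguous()) are invented for
illustration; the real plumbing would have to live alongside the swap
cache and be accounted as per-folio overhead:

```c
#include <stdlib.h>

/* Hypothetical per-folio descriptor: one swap slot per subpage, so
 * the subpages may land at discontiguous offsets in the swapfile. */
struct swap_desc {
	unsigned int nr_pages;     /* subpages in the large folio */
	unsigned long slots[];     /* slots[i] = swap offset of subpage i */
};

/* Build a descriptor from whatever offsets the slot allocator managed
 * to hand out, contiguous or not. Returns NULL on allocation failure. */
static struct swap_desc *swap_desc_create(const unsigned long *offs,
					  unsigned int nr)
{
	struct swap_desc *d = malloc(sizeof(*d) + nr * sizeof(d->slots[0]));

	if (!d)
		return NULL;
	d->nr_pages = nr;
	for (unsigned int i = 0; i < nr; i++)
		d->slots[i] = offs[i];
	return d;
}

/* If the folio happened to get one contiguous range anyway, the
 * indirection table is pure overhead and could be dropped. */
static int swap_desc_is_contiguous(const struct swap_desc *d)
{
	for (unsigned int i = 1; i < d->nr_pages; i++)
		if (d->slots[i] != d->slots[0] + i)
			return 0;
	return 1;
}
```

The contiguity check is where ideas a) and b) would meet: prefer the
buddy allocator's contiguous range, and fall back to the table only
when fragmentation leaves no choice.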
But at least we can still swap out large folios without splitting,
and swap in large folios even though we read them back from nr_pages
different objects.

> Solution a) and b) can work together as well. Only use b) if not
> able to allocate swap entries from a).
>
> Chris

Thanks
Barry