From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <CAF8kJuMQ7qBZqdHHS52jRyA-ETTfHnPv+V9ChaBsJ_q_G801Lw@mail.gmail.com>
 <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com> <20240307140344.4wlumk6zxustylh6@quack3>
 <8da6a093-346b-35cd-818a-a82abfa6a930@oppo.com> <20240314082651.ckfpp2tyslq2hl2c@quack3>
In-Reply-To: <20240314082651.ckfpp2tyslq2hl2c@quack3>
From: Chuanhua Han <chuanhuahan@gmail.com>
Date: Thu, 14 Mar 2024 19:19:58 +0800
Message-ID: <CANzGp4Ks_uTj2h=G8cBBZLT+qMhWqbJC229xOTR_uHzrf4LpWw@mail.gmail.com>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Jan Kara <jack@suse.cz>
Cc: Chuanhua Han <hanchuanhua@oppo.com>, Chris Li <chrisl@kernel.org>, 
	linux-mm <linux-mm@kvack.org>, lsf-pc@lists.linux-foundation.org, 
	ryan.roberts@arm.com, 21cnbao@gmail.com, david@redhat.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Jan Kara <jack@suse.cz> wrote on Thu, Mar 14, 2024 at 16:28:
>
> On Fri 08-03-24 10:02:20, Chuanhua Han wrote:
> >
> > On 2024/3/7 22:03, Jan Kara wrote:
> > > On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
> > >> On 2024/3/1 17:24, Chris Li wrote:
> > >>> In last year's LSF/MM I talked about a VFS-like swap system. That is
> > >>> the pony that was chosen.
> > >>> However, I did not have much chance to go into details.
> > >>>
> > >>> This year, I would like to discuss what it takes to re-architect the
> > >>> whole swap back end from scratch.
> > >>>
> > >>> Let’s start from the requirements for the swap back end.
> > >>>
> > >>> 1) support the existing swap usage (not the implementation).
> > >>>
> > >>> Some other design goals:
> > >>>
> > >>> 2) low per swap entry memory usage.
> > >>>
> > >>> 3) low io latency.
> > >>>
> > >>> What are the functions the swap system needs to support?
> > >>>
> > >>> At the device level, the swap system needs to support a list of swap
> > >>> files in priority order. Swap devices with the same priority are
> > >>> written to round robin. The swap device types include zswap, zram,
> > >>> SSD, spinning hard disk, and a swap file in a file system.
> > >>>
> > >>> At the swap entry level, here is the list of existing swap entry usage:
> > >>>
> > >>> * Swap entry allocation and free. Each swap entry needs to be
> > >>> associated with a location of the disk space in the swapfile (offset
> > >>> of swap entry).
> > >>> * Each swap entry needs to track the map count of the entry (swap_map).
> > >>> * Each swap entry needs to be able to find the associated memory
> > >>> cgroup (swap_cgroup_ctrl->map).
> > >>> * Swap cache. Look up folio/shadow from swap entry.
> > >>> * Swap page writes through a swapfile in a file system rather than a
> > >>> block device (swap_extent).
> > >>> * Shadow entry (stored in swap cache).
> > >>>
> > >>> Any new swap back end might have a different internal implementation,
> > >>> but it needs to support the above usage. For example, using an existing
> > >>> file system as the swap back end, with a per-vma or per-swap-entry
> > >>> mapping to a file, would mean it needs an additional data structure to
> > >>> track the swap_cgroup_ctrl, on top of the size of the file inode. It
> > >>> would be challenging to meet design goals 2) and 3) using another file
> > >>> system as it is.
> > >>>
> > >>> I am considering grouping the different swap entry data into one single
> > >>> struct and dynamically allocating it, so there is no upfront
> > >>> allocation of swap_map.
> > >>>
> > >>> For swap entry allocation, the current kernel supports swapping out
> > >>> order-0 or pmd-order pages.
> > >>>
> > >>> There are some discussions and patches that add swap out for folio
> > >>> sizes in between (mTHP):
> > >>>
> > >>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> > >>>
> > >>> and swap in for mTHP:
> > >>>
> > >>> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
> > >>>
> > >>> The introduction of swapping pages of different orders will further
> > >>> complicate the swap entry fragmentation issue. The swap back end has
> > >>> no way to predict the life cycle of the swap entries. Repeatedly
> > >>> allocating and freeing swap entries of different sizes will fragment
> > >>> the swap entry array. If we can’t allocate contiguous swap entries for
> > >>> an mTHP, it will have to be split into smaller sizes to perform the
> > >>> swap in and out.
> > >>>
> > >>> Current swap only supports 4K pages or pmd-size pages. Adding the
> > >>> other in-between sizes greatly increases the chance of fragmenting
> > >>> the swap entry space. When there are no more contiguous swap entries
> > >>> for an mTHP, the mTHP is forced to split into 4K pages. If we don’t
> > >>> solve the fragmentation issue, it will be a constant source of mTHP
> > >>> splitting.
> > >>>
> > >>> Another limitation I would like to address is that swap_writepage can
> > >>> only write out IO in one contiguous chunk; it is not able to perform
> > >>> discontiguous IO. When the swapfile is close to full, it is likely
> > >>> that the unused entries will be spread across different locations. It
> > >>> would be nice to be able to read and write a large folio using
> > >>> discontiguous disk IO locations.
> > >>>
> > >>> Some possible ideas for the fragmentation issue.
> > >>>
> > >>> a) Buddy allocator for swap entries, similar to the buddy allocator
> > >>> for memory. We can use a buddy allocator for swap entries to keep
> > >>> low-order swap entries from fragmenting too much of the high-order
> > >>> swap entry space. It should greatly reduce the fragmentation caused by
> > >>> allocating and freeing swap entries of different sizes. However, the
> > >>> buddy allocator has its own limits as well. With system memory, we
> > >>> can move and compact the memory. There is no rmap for swap entries, so
> > >>> it is much harder to move a swap entry to another disk location. So
> > >>> the buddy allocator for swap will help, but not solve all the
> > >>> fragmentation issues.
> > >> I have an idea here 😁
> > >>
> > >> Each swap device is divided into multiple chunks, and each chunk
> > >> serves allocations of a single order (the order of the folio being
> > >> swapped out; each chunk is used for only one order).
> > >> This can solve the fragmentation problem. It is much simpler than
> > >> buddy, easier to implement, and compatible with multiple sizes,
> > >> similar to a small slab allocator.
> > >>
> > >> 1) Add structure members
> > >> In the swap_info_struct structure, we only need to add an offset array
> > >> that records the search offset for each order, e.g.:
> > >>
> > >> #define MTHP_NR_ORDER 9
> > >>
> > >> struct swap_info_struct {
> > >>     ...
> > >>     long order_off[MTHP_NR_ORDER];
> > >>     ...
> > >> };
> > >>
> > >> Note: order_off = -1 indicates that this order is not supported.
> > >>
> > >> 2) Initialize
> > >> Set the proportion of the swap device occupied by each order.
> > >> For the sake of simplicity, assume there are 8 orders.
> > >> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
> > >> (maxpages is the maximum number of available slots in the current
> > >> swap device)
> > > Well, but then if you fill the space of a particular order and need to
> > > swap out a page of that order, what do you do? Return ENOSPC prematurely?
> > If we swap out a subpage of a large folio (due to a split of the large
> > folio), we simply search for a free swap entry from order_off[0].
>
> I meant what are you going to do if you want to swap out a 2MB huge page but
> you don't have any free swap entry of the appropriate order? History shows
> that these schemes, where you partition available space into buckets of
> pages of different order, tend to fragment rather quickly, so you need to
> also implement some defragmentation / compaction scheme, and once you do
> that you are at the complexity of a standard filesystem block allocator.
> That is all I wanted to point out :)
OK, got it! It's true that my approach doesn't eliminate fragmentation,
but it does mitigate it to some extent, and the method itself doesn't
currently involve complex file system operations.
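
To make the chunk-per-order idea concrete, here is a rough userspace
sketch. It is not kernel code and not a patch. The names struct
swap_sketch, swap_sketch_init(), swap_sketch_alloc(), swap_sketch_free()
and NR_CHUNK_ORDERS are made up for illustration; only MTHP_NR_ORDER,
order_off, chunk_size and maxpages come from the proposal quoted above.
It assumes that orders 0..7 each get one equal chunk, that order 8 is
marked unsupported (-1), and that a plain byte array stands in for
swap_map.

```c
#include <assert.h>

#define MTHP_NR_ORDER   9   /* orders 0..8, as in the proposal */
#define NR_CHUNK_ORDERS 8   /* assumption: orders 0..7 each get one chunk */

/* Hypothetical stand-in for the few swap_info_struct fields the idea needs. */
struct swap_sketch {
	long chunk_size;                /* slots reserved for each order */
	long order_off[MTHP_NR_ORDER];  /* next search offset; -1 = unsupported */
	unsigned char *used;            /* one byte per slot, nonzero = in use */
};

/* Split maxpages slots into equal chunks; 'used' holds maxpages zeroed bytes. */
static void swap_sketch_init(struct swap_sketch *si, long maxpages,
			     unsigned char *used)
{
	si->chunk_size = maxpages / NR_CHUNK_ORDERS;
	si->used = used;

	/* Every chunk must be a whole number of its own 2^order blocks. */
	assert(si->chunk_size > 0);
	assert(si->chunk_size % (1L << (NR_CHUNK_ORDERS - 1)) == 0);

	for (int order = 0; order < MTHP_NR_ORDER; order++)
		si->order_off[order] = order < NR_CHUNK_ORDERS ?
				       order * si->chunk_size : -1;
}

/* Return the first slot of a free run of 2^order slots, or -1 if none. */
static long swap_sketch_alloc(struct swap_sketch *si, int order)
{
	if (order < 0 || order >= MTHP_NR_ORDER || si->order_off[order] < 0)
		return -1;

	long nr = 1L << order;
	long base = order * si->chunk_size;
	long end = base + si->chunk_size;
	long start = si->order_off[order];

	/* Scan the chunk once, starting at the cursor and wrapping around. */
	for (long scanned = 0; scanned < si->chunk_size; scanned += nr) {
		long off = start + scanned;

		if (off >= end)
			off -= si->chunk_size;
		/* A chunk only holds blocks of one size, so the first slot
		 * tells us whether the whole block is free. */
		if (!si->used[off]) {
			for (long i = 0; i < nr; i++)
				si->used[off + i] = 1;
			si->order_off[order] = off + nr < end ? off + nr : base;
			return off;
		}
	}
	return -1;
}

/* Release a run previously returned by swap_sketch_alloc(). */
static void swap_sketch_free(struct swap_sketch *si, long off, int order)
{
	for (long i = 0; i < (1L << order); i++)
		si->used[off + i] = 0;
}
```

With maxpages = 1024 this gives chunk_size = 128, so the order-7 chunk
holds exactly one entry and a second order-7 request returns -1. At that
point the caller would split the folio and retry each subpage with order
0. If the order-0 chunk is full as well, the device reports no space even
though other chunks may still have free slots, which is the case Jan
describes.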
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
Thanks,
Chuanhua