public inbox for cgroups@vger.kernel.org
* [PATCH v4 00/18] Add Cgroup support for SGX EPC memory
@ 2023-09-13  4:06 Haitao Huang
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)
  To: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

SGX EPC memory allocations are separate from normal RAM allocations, and
are managed solely by the SGX subsystem. The existing cgroup memory
controller cannot be used to limit or account for SGX EPC memory, which is
a desirable feature in some environments, e.g., support for pod level
control in a Kubernetes cluster on a VM or bare-metal host [1,2].

This patchset implements the support for sgx_epc memory within the misc
cgroup controller. The user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
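
As a rough illustration of the user-facing side, the user-space sketch below
builds the "<resource> <limit>" line that the misc controller's misc.max file
accepts and writes it to a cgroup directory. It assumes the resource key is
"sgx_epc" with the limit expressed in bytes, and that cgroup v2 is mounted in
the usual place. The helper names epc_format_max() and epc_set_max() and the
example directory are hypothetical, not part of this series.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build the "sgx_epc <bytes>" line for misc.max.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int epc_format_max(char *buf, size_t len, uint64_t limit_bytes)
{
	int n = snprintf(buf, len, "sgx_epc %" PRIu64 "\n", limit_bytes);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write the limit to <cgroup_dir>/misc.max. cgroup_dir is a hypothetical
 * path such as "/sys/fs/cgroup/sgx_test". Returns 0 on success, -1 on error.
 */
static int epc_set_max(const char *cgroup_dir, uint64_t limit_bytes)
{
	char path[256], line[64];
	FILE *f;
	int n, ret = 0;

	n = snprintf(path, sizeof(path), "%s/misc.max", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (epc_format_max(line, sizeof(line), limit_bytes))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(line, f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

A caller would use something like epc_set_max("/sys/fs/cgroup/sgx_test", 64 << 20)
to cap that group at 64 MiB of EPC, then read misc.current and misc.events in
the same directory to observe usage and limit events.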

This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to work with more recent
kernels, and to utilize the misc cgroup controller rather than a custom
controller. I have now updated the patches based on review comments on the V2
and V3 series [3, 4], simplified a few aspects of the implementation/design
and fixed some stability issues found from testing, while keeping the same
user space facing interfaces.

The patchset adds support for multiple LRU lists to track both reclaimable
EPC pages (i.e., pages the reclaimer knows about), as well as unreclaimable
EPC pages (i.e., pages which the reclaimer isn't aware of, such as VA
pages).  These pages are assigned to an LRU list, as well as an enclave, so
that an enclave's full EPC usage can be tracked, and subject to the
per-cgroup limit. During OOM events, an enclave can have its memory zapped,
and all the EPC pages tracked by the LRU lists can be freed.

The EPC pages allocated for KVM guests by the virtual EPC driver are not
reclaimable by the host kernel [5]. Therefore they are not tracked by any
LRU lists for reclaiming purposes in this implementation, but they are
charged toward the cgroup of the user process (e.g., QEMU) launching the
guest. When the cgroup's EPC usage reaches its limit, the virtual EPC
driver will stop allocating more EPC for the VM and send SIGBUS to the
user process, which would abort the VM launch.

To make the series easier to follow, I reordered the patches in v4 into the
following clusters:
- Patches 1&2 are prerequisite misc cgroup changes
- Patches 3-8 deal with the 'reclaimable' pages
- Patches 9-12 deal with the 'unreclaimable' pages, which are freed only
  for OOM scenarios.
- Patches 13-15 re-organize EPC reclaiming code to be reusable by EPC
  cgroup.
- Patch 16 implements EPC cgroup as a misc cgroup.
- Patch 17 adds documentation for the EPC cgroup.
- Patch 18 adds test scripts. They depend on earlier fixes and enhancements
  reviewed previously [6].

I appreciate your comments and feedback.

---
v4:
* Collected "Tested-by" from Mikko. I kept it for now as there are no functional changes in v4.
* Rebased onto v6.6-rc1 and reordered patches as described above.
* Separated out the bug fixes [7,8,9]. This series depends on those patches. (Dave, Jarkko)
* Added comments in commit messages to preview what comes next. (Jarkko)
* Fixed documentation errors, gaps, and style issues. (Mikko, Randy)
* Fixed comments, typos, and style in code. (Mikko, Kai)
* Patch format and background for reclaimable vs unreclaimable (Kai, Jarkko)
* Fixed typo (Pavel)
* Excluded the previous fixes/enhancements for self-tests. Patch 18 now depends on series [6].
* Used the same To list for the cover letter and all patches. (Sohil)

v3:

* Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
* Unrolled wrappers for cond_resched, list (Dave)
* Separate patches for adding reclaimable and unreclaimable lists. (Dave)
* Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
* Simplified the cgroup tree walking with plain
  css_for_each_descendant_pre.
* Fixed race conditions and crashes.
* OOM killer to wait for the victim enclave pages being reclaimed.
* Unblock the user by handling misc_max_write callback asynchronously.
* Rebased onto 6.4 and no longer base this series on the MCA patchset.
* Fix an overflow in misc_try_charge.
* Fix a NULL pointer in SGX PF handler.
* Updated and included the SGX selftest patches previously reviewed. Those
  patches fix issues triggered in high EPC pressure required for cgroup
  testing.
* Added test scripts to help set up and test SGX EPC cgroups.

[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989-3jJ4emchXQBypjffeKSS3s1VXTxX1y3OvxpqHgZTriW3zl9H0oFU5g@public.gmane.org/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/all/20221202183655.3767674-1-kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org/
[4]https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org/
[5]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[6]https://lore.kernel.org/linux-sgx/20220905020411.17290-1-jarkko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org/
[7]https://lore.kernel.org/linux-sgx/ZLcXmvDKheCRYOjG-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org/
[8]https://lore.kernel.org/linux-sgx/20230721120231.13916-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org/
[9]https://lore.kernel.org/linux-sgx/20230728051024.33063-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org/

Haitao Huang (4):
  x86/sgx: Introduce EPC page states
  x86/sgx: Store struct sgx_encl when allocating new VA pages
  x86/sgx: Prepare for multiple LRUs
  selftests/sgx: Add scripts for epc cgroup testing

Kristen Carlson Accardi (9):
  cgroup/misc: Add per resource callbacks for CSS events
  cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
  x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
  x86/sgx: Use sgx_epc_lru_lists for existing active page list
  x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists
  x86/sgx: Use a list to track to-be-reclaimed pages
  x86/sgx: store unreclaimable pages in LRU lists
  x86/sgx: Limit process EPC usage with misc cgroup controller
  Docs/x86/sgx: Add description for cgroup support

Sean Christopherson (5):
  x86/sgx: Introduce RECLAIM_IN_PROGRESS state
  x86/sgx: Add EPC page flags to identify owner types
  x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  x86/sgx: Add helper to grab pages from an arbitrary EPC LRU

 Documentation/arch/x86/sgx.rst                |  82 ++++
 arch/x86/Kconfig                              |  13 +
 arch/x86/kernel/cpu/sgx/Makefile              |   1 +
 arch/x86/kernel/cpu/sgx/driver.c              |  27 +-
 arch/x86/kernel/cpu/sgx/encl.c                |  74 +++-
 arch/x86/kernel/cpu/sgx/encl.h                |   4 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 406 ++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h          |  59 +++
 arch/x86/kernel/cpu/sgx/ioctl.c               |  27 +-
 arch/x86/kernel/cpu/sgx/main.c                | 399 +++++++++++++----
 arch/x86/kernel/cpu/sgx/sgx.h                 | 115 ++++-
 include/linux/misc_cgroup.h                   |  34 ++
 kernel/cgroup/misc.c                          |  57 ++-
 .../selftests/sgx/run_tests_in_misc_cg.sh     |  68 +++
 tools/testing/selftests/sgx/setup_epc_cg.sh   |  29 ++
 .../selftests/sgx/watch_misc_for_tests.sh     |  13 +
 16 files changed, 1257 insertions(+), 151 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
 create mode 100755 tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/setup_epc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

-- 
2.25.1



* [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-2-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-15 17:55     ` Tejun Heo
  2023-09-13  4:06   ` [PATCH v4 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver Haitao Huang
                     ` (17 subsequent siblings)
  18 siblings, 2 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Consumers of the misc cgroup controller might need to perform separate
actions for Cgroup Subsystem State (CSS) events: cgroup alloc and free.
In addition, writes to the max value may also need separate action. Add
the ability to allow downstream users to set up callbacks for these
operations, and call the corresponding per-resource-type callback when
appropriate.

This code will be utilized by the SGX driver in a future patch.
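
The callback idea can be sketched outside the kernel as follows. This is a
simplified user-space model, not the code in this patch: the ops live in a
separate table rather than in struct misc_res, NR_RES stands in for
MISC_CG_RES_TYPES, and every name here (res_ops, cg_create(), cg_destroy(),
the demo_* hooks) is hypothetical. It shows one way to invoke a per-resource
alloc hook at cgroup creation and to unwind only the hooks that already
succeeded when a later one fails.

```c
#include <stddef.h>
#include <stdlib.h>

#define NR_RES 2	/* stand-in for MISC_CG_RES_TYPES */

struct cg {
	void *priv[NR_RES];	/* per-resource private data */
};

struct res_ops {
	int (*alloc)(struct cg *cg);	/* run when a cgroup is created */
	void (*release)(struct cg *cg);	/* run when a cgroup is destroyed */
};

/* Create a cgroup, calling each resource's alloc hook in index order. */
static struct cg *cg_create(const struct res_ops *ops, int *err)
{
	struct cg *cg = calloc(1, sizeof(*cg));
	int i;

	*err = 0;
	if (!cg) {
		*err = -1;
		return NULL;
	}

	for (i = 0; i < NR_RES; i++) {
		if (!ops[i].alloc)
			continue;
		*err = ops[i].alloc(cg);
		if (*err)
			goto unwind;
	}
	return cg;

unwind:
	/* Undo only the resources whose alloc hook ran, newest first. */
	while (--i >= 0)
		if (ops[i].alloc && ops[i].release)
			ops[i].release(cg);
	free(cg);
	return NULL;
}

/* Destroy a cgroup, giving every resource a chance to clean up. */
static void cg_destroy(const struct res_ops *ops, struct cg *cg)
{
	int i;

	for (i = 0; i < NR_RES; i++)
		if (ops[i].release)
			ops[i].release(cg);
	free(cg);
}

/* Demo hooks: count live per-cgroup objects so leaks are visible. */
static int live_objects;

static int demo_alloc_ok(struct cg *cg)
{
	cg->priv[0] = &live_objects;
	live_objects++;
	return 0;
}

static void demo_release(struct cg *cg)
{
	cg->priv[0] = NULL;
	live_objects--;
}

static int demo_alloc_fail(struct cg *cg)
{
	(void)cg;
	return -12;	/* mimics -ENOMEM */
}
```

The point of the model is the pairing rule: a release hook runs only for a
resource whose alloc hook has already succeeded, so a failure midway leaves
nothing half-initialized.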

Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
V4:
- Moved this to the front of the series.
- Applies on cgroup/for-6.6 with the overflow fix for misc.

V3:
- Removed the released() callback
---
 include/linux/misc_cgroup.h |  5 +++++
 kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..e1bcd176c2de 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -37,6 +37,11 @@ struct misc_res {
 	u64 max;
 	atomic64_t usage;
 	atomic64_t events;
+
+	/* per resource callback ops */
+	int (*misc_cg_alloc)(struct misc_cg *cg);
+	void (*misc_cg_free)(struct misc_cg *cg);
+	void (*misc_cg_max_write)(struct misc_cg *cg);
 };
 
 /**
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..e0092170d0dd 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
 
 	cg = css_misc(of_css(of));
 
-	if (READ_ONCE(misc_res_capacity[type]))
+	if (READ_ONCE(misc_res_capacity[type])) {
 		WRITE_ONCE(cg->res[type].max, max);
-	else
+		if (cg->res[type].misc_cg_max_write)
+			cg->res[type].misc_cg_max_write(cg);
+	} else {
 		ret = -EINVAL;
+	}
 
 	return ret ? ret : nbytes;
 }
@@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
+	struct misc_cg *parent_cg;
 	enum misc_res_type i;
 	struct misc_cg *cg;
+	int ret;
 
 	if (!parent_css) {
 		cg = &root_cg;
+		parent_cg = &root_cg;
 	} else {
 		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
 		if (!cg)
 			return ERR_PTR(-ENOMEM);
+		parent_cg = css_misc(parent_css);
 	}
 
 	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
 		WRITE_ONCE(cg->res[i].max, MAX_NUM);
 		atomic64_set(&cg->res[i].usage, 0);
+		if (parent_cg->res[i].misc_cg_alloc) {
+			ret = parent_cg->res[i].misc_cg_alloc(cg);
+			if (ret)
+				goto alloc_err;
+		}
 	}
 
 	return &cg->css;
+
+alloc_err:
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (cg->res[i].misc_cg_free)
+			cg->res[i].misc_cg_free(cg);
+	kfree(cg);
+	return ERR_PTR(ret);
 }
 
 /**
@@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
  */
 static void misc_cg_free(struct cgroup_subsys_state *css)
 {
-	kfree(css_misc(css));
+	struct misc_cg *cg = css_misc(css);
+	enum misc_res_type i;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (cg->res[i].misc_cg_free)
+			cg->res[i].misc_cg_free(cg);
+
+	kfree(cg);
 }
 
 /* Cgroup controller callbacks */
-- 
2.25.1



* [PATCH v4 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-3-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists Haitao Huang
                     ` (16 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

The SGX driver will need to get access to the root misc_cg object
to do iterative walks and also determine if a charge will be
towards the root cgroup or not.

To manage the SGX EPC memory via the misc controller, the SGX
driver will also need to be able to iterate over the misc cgroup
hierarchy.

Move parent_misc() into misc_cgroup.h and make it inline so that the
function is available to SGX, rename it to misc_cg_parent(), and update
misc.c to use the new name.

Add per resource type private data so that SGX can store additional
per cgroup data with the misc_cg struct.

Allow SGX EPC memory to be a valid resource type for the misc
controller.
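
To make the hierarchy walk concrete, here is a single-threaded user-space
model of charging a resource up the parent chain, in the spirit of how
misc_cg_try_charge() uses the parent helper. It is only a sketch: the kernel
uses atomics and also checks the global capacity, and the names node,
node_parent(), try_charge() and uncharge() are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	struct node *parent;	/* NULL for the root */
	uint64_t usage;		/* amount currently charged */
	uint64_t max;		/* limit for this node */
	uint64_t events;	/* failed charges seen at or below this node */
};

/* Model of misc_cg_parent(): NULL-safe parent lookup. */
static struct node *node_parent(struct node *n)
{
	return n ? n->parent : NULL;
}

/*
 * Charge @amount to @cg and every ancestor. If any level would exceed its
 * limit, record an event from that level up to the root, roll back the
 * levels already charged, and return -EBUSY.
 */
static int try_charge(struct node *cg, uint64_t amount)
{
	struct node *i, *j;

	for (i = cg; i; i = node_parent(i)) {
		i->usage += amount;
		if (i->usage > i->max)
			goto err_charge;
	}
	return 0;

err_charge:
	for (j = i; j; j = node_parent(j))
		j->events++;
	for (j = cg; j != i; j = node_parent(j))
		j->usage -= amount;
	i->usage -= amount;
	return -EBUSY;
}

/* Release @amount from @cg and every ancestor. */
static void uncharge(struct node *cg, uint64_t amount)
{
	struct node *i;

	for (i = cg; i; i = node_parent(i))
		i->usage -= amount;
}
```

Because a charge propagates to every ancestor, a limit set on an intermediate
cgroup caps the combined usage of all cgroups below it.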

Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
V4:
- Moved this to the second in the series.
---
 include/linux/misc_cgroup.h | 29 +++++++++++++++++++++++++++++
 kernel/cgroup/misc.c        | 25 ++++++++++++-------------
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e1bcd176c2de..6f8330f435ba 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
 	MISC_CG_RES_SEV,
 	/* AMD SEV-ES ASIDs resource */
 	MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* SGX EPC memory resource */
+	MISC_CG_RES_SGX_EPC,
 #endif
 	MISC_CG_RES_TYPES
 };
@@ -37,6 +41,7 @@ struct misc_res {
 	u64 max;
 	atomic64_t usage;
 	atomic64_t events;
+	void *priv;
 
 	/* per resource callback ops */
 	int (*misc_cg_alloc)(struct misc_cg *cg);
@@ -59,6 +64,7 @@ struct misc_cg {
 	struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
@@ -78,6 +84,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -102,6 +122,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+	return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+	return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index e0092170d0dd..dbd881be773f 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
 	/* AMD SEV-ES ASIDs resource */
 	"sev_es",
 #endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* Intel SGX EPC memory bytes */
+	"sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
@@ -40,18 +44,13 @@ static struct misc_cg root_cg;
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -150,7 +149,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!amount)
 		return 0;
 
-	for (i = cg; i; i = parent_misc(i)) {
+	for (i = cg; i; i = misc_cg_parent(i)) {
 		res = &i->res[type];
 
 		new_usage = atomic64_add_return(amount, &res->usage);
@@ -163,12 +162,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	return 0;
 
 err_charge:
-	for (j = i; j; j = parent_misc(j)) {
+	for (j = i; j; j = misc_cg_parent(j)) {
 		atomic64_inc(&j->res[type].events);
 		cgroup_file_notify(&j->events_file);
 	}
 
-	for (j = cg; j != i; j = parent_misc(j))
+	for (j = cg; j != i; j = misc_cg_parent(j))
 		misc_cg_cancel_charge(type, j, amount);
 	misc_cg_cancel_charge(type, i, amount);
 	return ret;
@@ -190,7 +189,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!(amount && valid_type(type) && cg))
 		return;
 
-	for (i = cg; i; i = parent_misc(i))
+	for (i = cg; i; i = misc_cg_parent(i))
 		misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1



* [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
  2023-09-13  4:06   ` [PATCH v4 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-4-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
                     ` (15 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage approaches
its limit.

Currently, ksgxd does not track VA and SECS pages. They are considered
as 'unreclaimable' pages that are only deallocated when their respective
owning enclaves are destroyed and all associated resources released.

When an EPC cgroup cannot reclaim any more reclaimable EPC pages to
reduce its usage below its limit, the cgroup must also reclaim those
unreclaimables by killing their owning enclaves. The VA and SECS pages
later are also tracked in an 'unreclaimable' list added to this structure
to support this OOM killing of enclaves.
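
For readers unfamiliar with the pattern, the following user-space sketch
models an LRU wrapper of this kind: pages are appended at the tail and a
reclaimer isolates a batch from the head, oldest first. It is an illustration
only. The real structure also carries a spinlock, which this single-threaded
model omits, and the names epc_lru, lru_add() and lru_isolate() as well as
the tiny list implementation are hypothetical stand-ins for the kernel's
list helpers.

```c
#include <stddef.h>

/* Minimal intrusive doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *prev, *next;
};

static void list_init(struct list_head *h)
{
	h->prev = h->next = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

struct epc_page {
	struct list_head list;
	int id;
};

/* Model of the wrapper: one reclaimable list (the lock is omitted here). */
struct epc_lru {
	struct list_head reclaimable;
};

static void lru_init(struct epc_lru *lru)
{
	list_init(&lru->reclaimable);
}

/* Newly tracked pages go to the tail, so the head is always the oldest. */
static void lru_add(struct epc_lru *lru, struct epc_page *page)
{
	list_add_tail(&page->list, &lru->reclaimable);
}

/* Detach up to @nr pages from the head into @out; returns how many. */
static int lru_isolate(struct epc_lru *lru, struct epc_page **out, int nr)
{
	int cnt = 0;

	while (cnt < nr && !list_is_empty(&lru->reclaimable)) {
		struct list_head *first = lru->reclaimable.next;

		list_del_init(first);
		out[cnt++] = (struct epc_page *)
			((char *)first - offsetof(struct epc_page, list));
	}
	return cnt;
}
```

Giving each cgroup its own instance of such a wrapper lets a per-cgroup
reclaimer scan only the pages charged to that cgroup.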

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists. (Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/sgx.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..018414b2abe8 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -83,6 +83,20 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 	return section->virt_addr + index * PAGE_SIZE;
 }
 
+/*
+ * Tracks EPC pages reclaimable by the reclaimer (ksgxd).
+ */
+struct sgx_epc_lru_lists {
+	spinlock_t lock;
+	struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
+{
+	spin_lock_init(&lrus->lock);
+	INIT_LIST_HEAD(&lrus->reclaimable);
+}
+
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
-- 
2.25.1



* [PATCH v4 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-5-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists Haitao Huang
                     ` (14 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

All EPC pages of enclaves including VA and SECS will be tracked in
sgx_epc_lru_lists structs, one per cgroup. For now just replace the
existing sgx_active_page_list in the reclaimer and its spinlock with a
global sgx_epc_lru_lists struct. VA and SECS pages are still not tracked
at this point but they will be tracked after an unreclaimable LRU list
is added to the sgx_epc_lru_lists struct.

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- No change, only reordered the patch.

V3:
- Remove usage of list wrapper
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..afce51d6e94a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -26,10 +26,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_lists sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -304,13 +303,13 @@ static void sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->owner;
 
@@ -322,7 +321,7 @@ static void sgx_reclaim_pages(void)
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -345,9 +344,9 @@ static void sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -378,7 +377,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -430,6 +429,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
 	ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
 	return true;
 }
 
@@ -505,10 +506,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -523,18 +524,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
 }
@@ -567,7 +568,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list))
+		if (list_empty(&sgx_global_lru.reclaimable))
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.25.1



* [PATCH v4 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-6-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 06/18] x86/sgx: Introduce EPC page states Haitao Huang
                     ` (13 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Replace sgx_mark_page_reclaimable() and sgx_unmark_page_reclaimable()
with sgx_record_epc_page() and sgx_drop_epc_page(). The
sgx_record_epc_page() function adds the epc_page to the "reclaimable"
list in the sgx_epc_lru_lists struct, while sgx_drop_epc_page() removes
the page from the LRU list.

For now, this change serves as a straightforward replacement of the two
functions for pages tracked by the reclaimer. When the unreclaimable
list is added to track VA and SECS pages for cgroups, these functions
will be updated to add/remove them from the unreclaimable lists.
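
The record/drop semantics described above can be sketched as a small
userspace model. This is only an illustration under stated assumptions:
the list helpers, struct names and PAGE_RECLAIMER_TRACKED are stand-ins
written for this sketch, locking is omitted, and the code is not the
kernel implementation.

```c
#include <assert.h>
#include <errno.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_append(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical stand-in for SGX_EPC_PAGE_RECLAIMER_TRACKED. */
#define PAGE_RECLAIMER_TRACKED 0x1UL

struct epc_page { unsigned long flags; struct list_head list; };
struct lru_lists { struct list_head reclaimable; };

/* Model of sgx_record_epc_page(): set flags, queue tracked pages on the LRU. */
static void record_page(struct lru_lists *lru, struct epc_page *page,
			unsigned long flags)
{
	page->flags |= flags;
	if (flags & PAGE_RECLAIMER_TRACKED)
		list_append(&page->list, &lru->reclaimable);
}

/* Model of sgx_drop_epc_page(): -EBUSY if the reclaimer already took the page. */
static int drop_page(struct epc_page *page)
{
	if (page->flags & PAGE_RECLAIMER_TRACKED) {
		if (list_is_empty(&page->list))
			return -EBUSY;
		list_unlink(&page->list);
		page->flags &= ~PAGE_RECLAIMER_TRACKED;
	}
	return 0;
}

/* Returns 0 when the model behaves as described in the commit message. */
static int demo_record_drop(void)
{
	struct lru_lists lru;
	struct epc_page a = { 0 }, b = { 0 };

	list_init(&lru.reclaimable);
	list_init(&a.list);
	list_init(&b.list);

	record_page(&lru, &a, PAGE_RECLAIMER_TRACKED);
	record_page(&lru, &b, PAGE_RECLAIMER_TRACKED);

	/* Pretend the reclaimer isolated page b from the LRU. */
	list_unlink(&b.list);

	if (drop_page(&a) != 0)
		return 1;
	if (drop_page(&b) != -EBUSY)
		return 2;
	if (!list_is_empty(&lru.reclaimable))
		return 3;
	return 0;
}
```

Calling demo_record_drop() from a test harness exercises both outcomes:
a page still on the list is dropped, while a page the reclaimer has
already isolated is reported as busy.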

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Code update needed for patch reordering
- Revised commit message.
---
 arch/x86/kernel/cpu/sgx/encl.c  |  8 +++++---
 arch/x86/kernel/cpu/sgx/ioctl.c | 10 ++++++----
 arch/x86/kernel/cpu/sgx/main.c  | 22 ++++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  4 ++--
 4 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..f84ee2eeb058 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -272,7 +272,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_mark_page_reclaimable(entry->epc_page);
+	sgx_record_epc_page(epc_page,
+			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	return entry;
 }
@@ -398,7 +399,8 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl_page->type = SGX_PAGE_TYPE_REG;
 	encl->secs_child_cnt++;
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
+	sgx_record_epc_page(epc_page,
+			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -714,7 +716,7 @@ void sgx_encl_release(struct kref *ref)
 			 * The page and its radix tree entry cannot be freed
 			 * if the page is being held by the reclaimer.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page))
+			if (sgx_drop_epc_page(entry->epc_page))
 				continue;
 
 			sgx_encl_free_epc_page(entry->epc_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 5d390df21440..0d79dec408af 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -322,7 +322,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
+	sgx_record_epc_page(epc_page,
+			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -961,7 +962,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			 * Prevent page from being reclaimed while mutex
 			 * is released.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+			if (sgx_drop_epc_page(entry->epc_page)) {
 				ret = -EAGAIN;
 				goto out_entry_changed;
 			}
@@ -976,7 +977,8 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 
 			mutex_lock(&encl->lock);
 
-			sgx_mark_page_reclaimable(entry->epc_page);
+			sgx_record_epc_page(entry->epc_page,
+					    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 		}
 
 		/* Change EPC type */
@@ -1133,7 +1135,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 			goto out_unlock;
 		}
 
-		if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+		if (sgx_drop_epc_page(entry->epc_page)) {
 			ret = -EBUSY;
 			goto out_unlock;
 		}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index afce51d6e94a..dec1d57cbff6 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -268,7 +268,6 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 			goto out;
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
-
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
@@ -498,31 +497,34 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_record_epc_page() - Add a page to the appropriate LRU list
  * @page:	EPC page
+ * @flags:	The type of page that is being recorded
  *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
+ * Mark a page with the specified flags and add it to the appropriate
+ * list.
  */
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	page->flags |= flags;
+	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
+		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_drop_epc_page() - Remove a page from a LRU list
  * @page:	EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
  *   -EBUSY if the page is in the process of being reclaimed
  */
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 018414b2abe8..113d930fd087 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -101,8 +101,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
 void sgx_reclaim_direct(void);
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
+int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 
 void sgx_ipi_cb(void *info);
-- 
2.25.1



* [PATCH v4 06/18] x86/sgx: Introduce EPC page states
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-7-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state Haitao Huang
                     ` (12 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

Use the lower 3 bits of the flags field in struct sgx_epc_page to track
the state of an EPC page through its life cycle, and define an enum for
the possible states. More states will be added later.
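
A minimal userspace sketch of the encoding, assuming shortened names
and a hypothetical OTHER_FLAG bit above the state field (neither is
taken from the patch). It shows why state checks have to compare the
masked value for equality: the states are enumerated values, not
independent bits, so a bitwise test can match the wrong state.

```c
#include <assert.h>
#include <stdbool.h>

/* Enumerated states kept in the low 3 bits of a flags word. */
enum page_state {
	STATE_NOT_TRACKED   = 0,
	STATE_FREE          = 1,
	STATE_RECLAIMABLE   = 2,
	STATE_UNRECLAIMABLE = 3,
};

#define STATE_MASK 0x7UL        /* same value as GENMASK(2, 0) */
#define OTHER_FLAG (1UL << 3)   /* hypothetical non-state flag bit */

/* Replace the state field while leaving the other flag bits untouched. */
static void set_state(unsigned long *flags, unsigned long state)
{
	*flags &= ~STATE_MASK;
	*flags |= state & STATE_MASK;
}

/* Equality on the masked field, not a bitwise AND against the state. */
static bool is_reclaimable(unsigned long flags)
{
	return (flags & STATE_MASK) == STATE_RECLAIMABLE;
}

/* Returns 0 when the encoding behaves as expected. */
static int demo_page_state(void)
{
	unsigned long flags = OTHER_FLAG;

	set_state(&flags, STATE_UNRECLAIMABLE);

	/* A bitwise test would wrongly report RECLAIMABLE: 3 & 2 != 0. */
	if ((flags & STATE_RECLAIMABLE) == 0)
		return 1;
	/* The equality test gets it right. */
	if (is_reclaimable(flags))
		return 2;

	set_state(&flags, STATE_RECLAIMABLE);
	if (!is_reclaimable(flags))
		return 3;
	/* Bits outside the state field survive state changes. */
	if (!(flags & OTHER_FLAG))
		return 4;
	return 0;
}
```

The same reasoning applies to the free-page check in the diff below,
which changes from a bitwise test to an equality comparison.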

Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
V4:
- No changes other than required for patch reordering.

V3:
- This is new in V3 to replace the bit mask based approach (requested by Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  | 14 +++++++---
 arch/x86/kernel/cpu/sgx/ioctl.c |  7 +++--
 arch/x86/kernel/cpu/sgx/main.c  | 19 +++++++------
 arch/x86/kernel/cpu/sgx/sgx.h   | 49 ++++++++++++++++++++++++++++++---
 4 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f84ee2eeb058..d11d4111aa98 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -244,8 +244,12 @@ static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
 {
 	struct sgx_epc_page *epc_page = encl->secs.epc_page;
 
-	if (!epc_page)
+	if (!epc_page) {
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
+		if (!IS_ERR(epc_page))
+			sgx_record_epc_page(epc_page,
+					    SGX_EPC_PAGE_UNRECLAIMABLE);
+	}
 
 	return epc_page;
 }
@@ -273,7 +277,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
 	encl->secs_child_cnt++;
 	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+			    SGX_EPC_PAGE_RECLAIMABLE);
 
 	return entry;
 }
@@ -400,7 +404,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl->secs_child_cnt++;
 
 	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+			    SGX_EPC_PAGE_RECLAIMABLE);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -1258,6 +1262,8 @@ struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
+	sgx_record_epc_page(epc_page,
+			    SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	return epc_page;
 }
@@ -1317,7 +1323,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
 {
 	int ret;
 
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
 
 	ret = __eremove(sgx_get_epc_virt_addr(page));
 	if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 0d79dec408af..c28f074d5d71 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
+	sgx_record_epc_page(encl->secs.epc_page,
+			    SGX_EPC_PAGE_UNRECLAIMABLE);
+
 	/* Set only after completion, as encl->lock has not been taken. */
 	set_bit(SGX_ENCL_CREATED, &encl->flags);
 
@@ -323,7 +326,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	}
 
 	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+			    SGX_EPC_PAGE_RECLAIMABLE);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -978,7 +981,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			mutex_lock(&encl->lock);
 
 			sgx_record_epc_page(entry->epc_page,
-					    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+					    SGX_EPC_PAGE_RECLAIMABLE);
 		}
 
 		/* Change EPC type */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index dec1d57cbff6..b26860399402 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -318,7 +318,7 @@ static void sgx_reclaim_pages(void)
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
-			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+			sgx_epc_page_reset_state(epc_page);
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -344,6 +344,7 @@ static void sgx_reclaim_pages(void)
 
 skip:
 		spin_lock(&sgx_global_lru.lock);
+		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
 		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
 		spin_unlock(&sgx_global_lru.lock);
 
@@ -367,7 +368,7 @@ static void sgx_reclaim_pages(void)
 		sgx_reclaimer_write(epc_page, &backing[i]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		sgx_epc_page_reset_state(epc_page);
 
 		sgx_free_epc_page(epc_page);
 	}
@@ -507,9 +508,9 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
 	page->flags |= flags;
-	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
+	if (sgx_epc_page_reclaimable(flags))
 		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
@@ -527,7 +528,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
+	if (sgx_epc_page_reclaimable(page->flags)) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
 			spin_unlock(&sgx_global_lru.lock);
@@ -535,7 +536,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 		}
 
 		list_del(&page->list);
-		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		sgx_epc_page_reset_state(page);
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -607,6 +608,8 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
 	struct sgx_numa_node *node = section->node;
 
+	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
+
 	spin_lock(&node->lock);
 
 	page->owner = NULL;
@@ -614,7 +617,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
 		list_add_tail(&page->list, &node->free_page_list);
-	page->flags = SGX_EPC_PAGE_IS_FREE;
+	page->flags = SGX_EPC_PAGE_FREE;
 
 	spin_unlock(&node->lock);
 	atomic_long_inc(&sgx_nr_free_pages);
@@ -715,7 +718,7 @@ int arch_memory_failure(unsigned long pfn, int flags)
 	 * If the page is on a free list, move it to the per-node
 	 * poison page list.
 	 */
-	if (page->flags & SGX_EPC_PAGE_IS_FREE) {
+	if (page->flags == SGX_EPC_PAGE_FREE) {
 		list_move(&page->list, &node->sgx_poison_page_list);
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 113d930fd087..2faeb40b345f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -23,11 +23,36 @@
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
 
-/* Pages, which are being tracked by the page reclaimer. */
-#define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
+enum sgx_epc_page_state {
+	/* Not tracked by the reclaimer:
+	 * Pages allocated for virtual EPC which are never tracked by the host
+	 * reclaimer; pages just allocated from free list but not yet put in
+	 * use; pages just reclaimed, but not yet returned to the free list.
+	 * Becomes FREE after sgx_free_epc()
+	 * Becomes RECLAIMABLE or UNRECLAIMABLE after sgx_record_epc()
+	 */
+	SGX_EPC_PAGE_NOT_TRACKED = 0,
+
+	/* Page is in the free list, ready for allocation
+	 * Becomes NOT_TRACKED after sgx_alloc_epc_page()
+	 */
+	SGX_EPC_PAGE_FREE = 1,
+
+	/* Page is in use and tracked in a reclaimable LRU list
+	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 */
+	SGX_EPC_PAGE_RECLAIMABLE = 2,
+
+	/* Page is in use but tracked in an unreclaimable LRU list. These are
+	 * only reclaimable when the whole enclave is OOM killed or the enclave
+	 * is released, e.g., VA, SECS pages
+	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 */
+	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
 
-/* Pages on free list */
-#define SGX_EPC_PAGE_IS_FREE		BIT(1)
+};
+
+#define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
 
 struct sgx_epc_page {
 	unsigned int section;
@@ -37,6 +62,22 @@ struct sgx_epc_page {
 	struct list_head list;
 };
 
+static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
+{
+	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+}
+
+static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned long flags)
+{
+	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
+static inline bool sgx_epc_page_reclaimable(unsigned long flags)
+{
+	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
 /*
  * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
  * the free page list local to the node is stored here.
-- 
2.25.1



* [PATCH v4 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 06/18] x86/sgx: Introduce EPC page states Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
  2023-09-13  4:06   ` [PATCH v4 08/18] x86/sgx: Use a list to track to-be-reclaimed pages Haitao Huang
                     ` (11 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add RECLAIM_IN_PROGRESS state to not rely on list_empty(&epc_page->list)
to determine if an EPC page is selected as a reclaiming candidate.

When a page is being reclaimed from the page pool (sgx_global_lru),
there is an intermediate stage where the page has been identified as
a candidate for reclaiming but has not yet been reclaimed. Currently
such pages are list_del_init()'d from the global LRU list and stored in
an array on the stack. To prevent another thread from dropping the same
page in the middle of reclaiming, sgx_drop_epc_page() checks for
list_empty(&epc_page->list).

A later patch will replace the array on stack with a temporary list to
store the candidate pages, so list_empty() should no longer be used for
this purpose.
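
The reasoning above can be illustrated with a small userspace model.
The names below are assumptions made for this sketch, and locking is
left out. Once a candidate page sits on a temporary list, its list node
is no longer empty, so a list_empty()-style check cannot tell that the
page is being reclaimed, while an explicit state can.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_append(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

enum { STATE_RECLAIMABLE = 2, STATE_RECLAIM_IN_PROGRESS = 4 };

struct epc_page { unsigned long state; struct list_head list; };

/* Old heuristic: "off every list" means the reclaimer holds the page. */
static bool busy_by_list_empty(const struct epc_page *page)
{
	return list_is_empty(&page->list);
}

/* New check: the state says so explicitly. */
static bool busy_by_state(const struct epc_page *page)
{
	return page->state == STATE_RECLAIM_IN_PROGRESS;
}

/* Returns 0 when only the state-based check detects the isolated page. */
static int demo_reclaim_in_progress(void)
{
	struct list_head lru, iso;
	struct epc_page page = { .state = STATE_RECLAIMABLE };

	list_init(&lru);
	list_init(&iso);
	list_init(&page.list);
	list_append(&page.list, &lru);

	/* Reclaimer picks the page and parks it on a temporary list. */
	list_unlink(&page.list);
	page.state = STATE_RECLAIM_IN_PROGRESS;
	list_append(&page.list, &iso);

	if (busy_by_list_empty(&page))
		return 1;	/* heuristic misses it: node is on iso */
	if (!busy_by_state(&page))
		return 2;
	return 0;
}
```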

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Fixed some typos.
- Revised commit message.

V3:
- Extend the sgx_epc_page_state enum introduced earlier to replace the
flag based approach.
---
 arch/x86/kernel/cpu/sgx/main.c | 21 ++++++++++-----------
 arch/x86/kernel/cpu/sgx/sgx.h  | 16 ++++++++++++++++
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b26860399402..c1ae19a154d0 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -312,13 +312,15 @@ static void sgx_reclaim_pages(void)
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->owner;
 
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
+			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
 			chunk[cnt++] = epc_page;
-		else
+		} else {
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
 			sgx_epc_page_reset_state(epc_page);
+		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -528,16 +530,13 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (sgx_epc_page_reclaimable(page->flags)) {
-		/* The page is being reclaimed. */
-		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
-			return -EBUSY;
-		}
-
-		list_del(&page->list);
-		sgx_epc_page_reset_state(page);
+	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
+		spin_unlock(&sgx_global_lru.lock);
+		return -EBUSY;
 	}
+
+	list_del(&page->list);
+	sgx_epc_page_reset_state(page);
 	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 2faeb40b345f..764cec23f4e5 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -40,6 +40,8 @@ enum sgx_epc_page_state {
 
 	/* Page is in use and tracked in a reclaimable LRU list
 	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
+	 * for reclaiming
 	 */
 	SGX_EPC_PAGE_RECLAIMABLE = 2,
 
@@ -50,6 +52,14 @@ enum sgx_epc_page_state {
 	 */
 	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
 
+	/* Page is being prepared for reclamation, tracked in a temporary
+	 * isolated list by the reclaimer.
+	 * Changes in sgx_reclaim_pages() back to RECLAIMABLE if preparation
+	 * fails for any reason.
+	 * Becomes NOT_TRACKED if reclaimed successfully in sgx_reclaim_pages()
+	 * and immediately sgx_free_epc() is called to make it FREE.
+	 */
+	SGX_EPC_PAGE_RECLAIM_IN_PROGRESS = 4,
 };
 
 #define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
@@ -73,6 +83,12 @@ static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned lo
 	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
 }
 
+static inline bool sgx_epc_page_reclaim_in_progress(unsigned long flags)
+{
+	return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags &
+						    SGX_EPC_PAGE_STATE_MASK);
+}
+
 static inline bool sgx_epc_page_reclaimable(unsigned long flags)
 {
 	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
-- 
2.25.1



* [PATCH v4 08/18] x86/sgx: Use a list to track to-be-reclaimed pages
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-9-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
                     ` (10 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Change sgx_reclaim_pages() to use a list rather than an array for
storing the epc_pages which will be reclaimed. This change is needed
to transition to the LRU implementation for EPC cgroup support.

When the EPC cgroup is implemented, the reclaiming process will do a
pre-order tree walk for the subtree starting from the limit-violating
cgroup.  When each node is visited, candidate pages are selected from
its "reclaimable" LRU list and moved into this temporary list. Passing a
list from node to node for temporary storage in this walk is more
straightforward than using an array.
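
As a rough illustration of the walk described above, the following
userspace sketch passes one isolation list through a pre-order
traversal of a toy cgroup tree. Everything here is hypothetical:
struct cg_node, the fixed fan-out and the page budget are assumptions
for the example and are not the EPC cgroup code from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_append(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink n from its current list and append it to head. */
static void list_move_to_tail(struct list_head *n, struct list_head *head)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_append(n, head);
}

static int list_length(const struct list_head *head)
{
	int n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

#define MAX_CHILDREN 2

struct page { struct list_head list; };
struct cg_node { struct list_head lru; struct cg_node *child[MAX_CHILDREN]; };

/* Visit node first, then its children, moving up to budget pages to iso. */
static int isolate_preorder(struct cg_node *node, struct list_head *iso,
			    int budget)
{
	int taken = 0;

	if (!node)
		return 0;
	while (taken < budget && !list_is_empty(&node->lru)) {
		list_move_to_tail(node->lru.next, iso);
		taken++;
	}
	for (size_t i = 0; i < MAX_CHILDREN && taken < budget; i++)
		taken += isolate_preorder(node->child[i], iso, budget - taken);
	return taken;
}

/* Returns 0 when four pages are isolated in pre-order from a 3-node tree. */
static int demo_preorder_isolation(void)
{
	struct cg_node root = { 0 }, left = { 0 }, right = { 0 };
	struct page pages[5];
	struct list_head iso;

	list_init(&iso);
	list_init(&root.lru);
	list_init(&left.lru);
	list_init(&right.lru);
	root.child[0] = &left;
	root.child[1] = &right;

	list_append(&pages[0].list, &root.lru);
	list_append(&pages[1].list, &left.lru);
	list_append(&pages[2].list, &left.lru);
	list_append(&pages[3].list, &right.lru);
	list_append(&pages[4].list, &right.lru);

	if (isolate_preorder(&root, &iso, 4) != 4)
		return 1;
	if (iso.next != &pages[0].list)	/* root's page comes first */
		return 2;
	if (list_length(&iso) != 4)
		return 3;
	if (list_length(&right.lru) != 1)
		return 4;
	return 0;
}
```

With a list, each node simply appends to the tail of the shared
container and no index bookkeeping has to travel between nodes, which
is the convenience the commit message refers to.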

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Changes needed for patch reordering
- Revised commit message

V3:
- Removed list wrappers
---
 arch/x86/kernel/cpu/sgx/main.c | 40 +++++++++++++++-------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c1ae19a154d0..fba06dc5abfe 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -293,12 +293,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  */
 static void sgx_reclaim_pages(void)
 {
-	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
-	struct sgx_epc_page *epc_page;
 	pgoff_t page_index;
-	int cnt = 0;
+	LIST_HEAD(iso);
 	int ret;
 	int i;
 
@@ -314,18 +313,22 @@ static void sgx_reclaim_pages(void)
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			chunk[cnt++] = epc_page;
+			list_move_tail(&epc_page->list, &iso);
 		} else {
-			/* The owner is freeing the page. No need to add the
-			 * page back to the list of reclaimable pages.
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
 			 */
 			sgx_epc_page_reset_state(epc_page);
+			list_del_init(&epc_page->list);
 		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
+	if (list_empty(&iso))
+		return;
+
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->owner;
 
 		if (!sgx_reclaimer_age(epc_page))
@@ -340,6 +343,7 @@ static void sgx_reclaim_pages(void)
 			goto skip;
 		}
 
+		i++;
 		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
 		mutex_unlock(&encl_page->encl->lock);
 		continue;
@@ -347,27 +351,19 @@ static void sgx_reclaim_pages(void)
 skip:
 		spin_lock(&sgx_global_lru.lock);
 		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
-		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
 		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-
-		chunk[i] = NULL;
-	}
-
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (epc_page)
-			sgx_reclaimer_block(epc_page);
 	}
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (!epc_page)
-			continue;
+	list_for_each_entry(epc_page, &iso, list)
+		sgx_reclaimer_block(epc_page);
 
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->owner;
-		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_reclaimer_write(epc_page, &backing[i++]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 		sgx_epc_page_reset_state(epc_page);
-- 
2.25.1



* [PATCH v4 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 08/18] x86/sgx: Use a list to track to-be-reclaimed pages Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-10-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 10/18] x86/sgx: Add EPC page flags to identify owner types Haitao Huang
                     ` (9 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

In a later patch, when a cgroup has exceeded the max capacity for EPC
pages, it may need to identify and OOM kill a less active enclave to
make room for other enclaves within the same group. Such a victim
enclave would have no active pages other than the unreclaimable Version
Array (VA) and SECS pages. Therefore, the cgroup needs to examine its
unreclaimable page list and find the enclave that owns a given SECS or
VA page. This requires a backpointer from a page to its enclave,
which is not available for VA pages.

Because struct sgx_epc_page instances of VA pages are not owned by an
sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
which will store this value in the owner field of the struct
sgx_epc_page.  In a later patch, VA pages will be placed in an
unreclaimable queue that can be examined by the cgroup to select the OOM
killed enclave.
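
A possible shape for the backpointer lookup, sketched in userspace
under assumptions: the owner_type tag and the simplified structs are
invented for this illustration. In this series, the owner type is
identified by EPC page flags added in a later patch, not by a separate
field.

```c
#include <assert.h>
#include <stddef.h>

struct encl { int id; };

/* Regular and SECS pages are described by an encl_page that knows its enclave. */
struct encl_page { struct encl *encl; };

/* Hypothetical tag saying which union member below is valid. */
enum owner_type { OWNER_ENCL_PAGE, OWNER_ENCL };

struct epc_page {
	enum owner_type owner_type;
	union {
		struct encl_page *encl_page;	/* regular and SECS pages */
		struct encl *encl;		/* VA pages */
	};
};

/* Resolve any tracked EPC page to the enclave that owns it. */
static struct encl *epc_page_to_encl(const struct epc_page *page)
{
	switch (page->owner_type) {
	case OWNER_ENCL_PAGE:
		return page->encl_page ? page->encl_page->encl : NULL;
	case OWNER_ENCL:
		return page->encl;
	}
	return NULL;
}

/* Returns 0 when both page kinds resolve to the same enclave. */
static int demo_owner_lookup(void)
{
	struct encl enclave = { .id = 1 };
	struct encl_page secs = { .encl = &enclave };
	struct epc_page secs_epc = {
		.owner_type = OWNER_ENCL_PAGE, .encl_page = &secs,
	};
	struct epc_page va_epc = {
		.owner_type = OWNER_ENCL, .encl = &enclave,
	};

	if (epc_page_to_encl(&secs_epc) != &enclave)
		return 1;
	if (epc_page_to_encl(&va_epc) != &enclave)
		return 2;
	return 0;
}
```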

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Changes needed for patch reordering
- Revised commit messages (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  |  5 +++--
 arch/x86/kernel/cpu/sgx/encl.h  |  2 +-
 arch/x86/kernel/cpu/sgx/ioctl.c |  2 +-
 arch/x86/kernel/cpu/sgx/main.c  | 20 ++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  5 ++++-
 5 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index d11d4111aa98..1aee0ad00e66 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -1238,6 +1238,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ * @encl:    The enclave that this page is allocated to.
  * @reclaim: Reclaim EPC pages directly if none available. Enclave
  *           mutex should not be held if this is set.
  *
@@ -1247,12 +1248,12 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 {
 	struct sgx_epc_page *epc_page;
 	int ret;
 
-	epc_page = sgx_alloc_epc_page(NULL, reclaim);
+	epc_page = sgx_alloc_epc_page(encl, reclaim);
 	if (IS_ERR(epc_page))
 		return ERR_CAST(epc_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..831d63f80f5a 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,7 +116,7 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 					  unsigned long offset,
 					  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index c28f074d5d71..3ab8c050e665 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -30,7 +30,7 @@ struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
 		if (!va_page)
 			return ERR_PTR(-ENOMEM);
 
-		va_page->epc_page = sgx_alloc_va_page(reclaim);
+		va_page->epc_page = sgx_alloc_va_page(encl, reclaim);
 		if (IS_ERR(va_page->epc_page)) {
 			err = ERR_CAST(va_page->epc_page);
 			kfree(va_page);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index fba06dc5abfe..ed813288af44 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -107,7 +107,7 @@ static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
 
 static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	struct sgx_encl *encl = page->encl;
 	struct sgx_encl_mm *encl_mm;
 	bool ret = true;
@@ -139,7 +139,7 @@ static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 
 static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	unsigned long addr = page->desc & PAGE_MASK;
 	struct sgx_encl *encl = page->encl;
 	int ret;
@@ -196,7 +196,7 @@ void sgx_ipi_cb(void *info)
 static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 			 struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_va_page *va_page;
 	unsigned int va_offset;
@@ -249,7 +249,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 				struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_backing secs_backing;
 	int ret;
@@ -309,7 +309,7 @@ static void sgx_reclaim_pages(void)
 			break;
 
 		list_del_init(&epc_page->list);
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
@@ -329,7 +329,7 @@ static void sgx_reclaim_pages(void)
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (!sgx_reclaimer_age(epc_page))
 			goto skip;
@@ -362,7 +362,7 @@ static void sgx_reclaim_pages(void)
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 		sgx_reclaimer_write(epc_page, &backing[i++]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -562,7 +562,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
 		if (!IS_ERR(page)) {
-			page->owner = owner;
+			page->encl_page = owner;
 			break;
 		}
 
@@ -607,7 +607,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	spin_lock(&node->lock);
 
-	page->owner = NULL;
+	page->encl_page = NULL;
 	if (page->poison)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
@@ -642,7 +642,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
 		section->pages[i].flags = 0;
-		section->pages[i].owner = NULL;
+		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 764cec23f4e5..c75ddc7168fa 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -68,7 +68,10 @@ struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
-	struct sgx_encl_page *owner;
+	union {
+		struct sgx_encl_page *encl_page;
+		struct sgx_encl *encl;
+	};
 	struct list_head list;
 };
 
-- 
2.25.1



* [PATCH v4 10/18] x86/sgx: Add EPC page flags to identify owner types
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
  2023-09-13  4:06   ` [PATCH v4 11/18] x86/sgx: store unreclaimable pages in LRU lists Haitao Huang
                     ` (8 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

struct sgx_epc_page can have two types of owners, 'sgx_encl' for VA pages
and 'sgx_encl_page' for other pages, stored in the previously introduced
union field.

OOM support for cgroups requires that the owner type be identified
when selecting pages from the unreclaimable list. Address this by adding
flags for the owner type.
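
For illustration only (not part of the patch): a minimal userspace sketch
of how an owner-type flag can disambiguate the union. The flag values and
field names follow the diff below; epc_page_to_encl() is a hypothetical
helper, and the structs are reduced to the fields the sketch needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; names mirror the patch but this is not kernel code. */
#define BIT(n)			(1u << (n))
#define SGX_EPC_OWNER_PAGE	BIT(3)	/* owner is a struct sgx_encl_page */
#define SGX_EPC_OWNER_ENCL	BIT(4)	/* owner is a struct sgx_encl (VA page) */

struct sgx_encl {
	int id;
};

struct sgx_encl_page {
	struct sgx_encl *encl;
};

struct sgx_epc_page {
	uint16_t flags;
	union {
		struct sgx_encl_page *encl_page;
		struct sgx_encl *encl;
	};
};

/* Hypothetical helper: resolve the owning enclave from the owner-type flag. */
struct sgx_encl *epc_page_to_encl(const struct sgx_epc_page *page)
{
	assert(page != NULL);

	if (page->flags & SGX_EPC_OWNER_PAGE)
		return page->encl_page->encl;
	if (page->flags & SGX_EPC_OWNER_ENCL)
		return page->encl;

	return NULL;	/* unknown owner type */
}
```

Without the flag, a reader of the union cannot tell which member is valid,
which is why the lookup returns NULL when neither bit is set.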

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Updates for patch reordering.
- Rename SGX_EPC_OWNER_ENCL_PAGE to SGX_EPC_OWNER_PAGE. (Jarkko)
- Commit message changes. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  | 13 +++++++------
 arch/x86/kernel/cpu/sgx/ioctl.c |  6 ++++--
 arch/x86/kernel/cpu/sgx/sgx.h   |  6 ++++++
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 1aee0ad00e66..91f83a5e543d 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -248,6 +248,7 @@ static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
 		if (!IS_ERR(epc_page))
 			sgx_record_epc_page(epc_page,
+					    SGX_EPC_OWNER_PAGE |
 					    SGX_EPC_PAGE_UNRECLAIMABLE);
 	}
 
@@ -276,8 +277,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_RECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_PAGE |
+				      SGX_EPC_PAGE_RECLAIMABLE);
 
 	return entry;
 }
@@ -403,8 +404,8 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl_page->type = SGX_PAGE_TYPE_REG;
 	encl->secs_child_cnt++;
 
-	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_RECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_PAGE |
+				      SGX_EPC_PAGE_RECLAIMABLE);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -1263,8 +1264,8 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
-	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_UNRECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL |
+				      SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	return epc_page;
 }
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 3ab8c050e665..95ec20a6992f 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -114,6 +114,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
 	sgx_record_epc_page(encl->secs.epc_page,
+			    SGX_EPC_OWNER_PAGE |
 			    SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	/* Set only after completion, as encl->lock has not been taken. */
@@ -325,8 +326,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_record_epc_page(epc_page,
-			    SGX_EPC_PAGE_RECLAIMABLE);
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_PAGE |
+				      SGX_EPC_PAGE_RECLAIMABLE);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -981,6 +982,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			mutex_lock(&encl->lock);
 
 			sgx_record_epc_page(entry->epc_page,
+					    SGX_EPC_OWNER_PAGE |
 					    SGX_EPC_PAGE_RECLAIMABLE);
 		}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index c75ddc7168fa..e06b4aadb6a1 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -64,6 +64,12 @@ enum sgx_epc_page_state {
 
 #define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
 
+/* flag for pages owned by a sgx_encl_page */
+#define SGX_EPC_OWNER_PAGE		BIT(3)
+
+/* flag for pages owned by a sgx_encl struct */
+#define SGX_EPC_OWNER_ENCL		BIT(4)
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
-- 
2.25.1



* [PATCH v4 11/18] x86/sgx: store unreclaimable pages in LRU lists
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 10/18] x86/sgx: Add EPC page flags to identify owner types Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-12-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Haitao Huang
                     ` (7 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

When an OOM event occurs, all pages associated with an enclave will need
to be freed, including pages that are not currently tracked by the
cgroup LRU lists.

Add a new "unreclaimable" list to the sgx_epc_lru_lists struct and
update the sgx_record_epc_page() and sgx_drop_epc_page() paths to
add/remove VA and SECS pages to/from this "unreclaimable" list.
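
For illustration only (not part of the patch): a minimal userspace sketch
of tracking pages on two lists. All names here (epc_lru_lists,
record_epc_page(), drop_epc_page(), and the list helpers) are hypothetical
stand-ins, and the spinlock that protects the lists in the kernel is
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tiny intrusive doubly-linked list, modeled on the kernel's list_head. */
struct list_head {
	struct list_head *prev, *next;
};

void list_init(struct list_head *h)
{
	h->prev = h->next = h;
}

bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

void list_add_tail_node(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

void list_del_node(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

struct epc_lru_lists {
	struct list_head reclaimable;
	struct list_head unreclaimable;	/* e.g. SECS and VA pages */
};

struct epc_page {
	struct list_head list;
};

void lru_init(struct epc_lru_lists *lru)
{
	list_init(&lru->reclaimable);
	list_init(&lru->unreclaimable);
}

/* Every recorded page lands on exactly one of the two lists. */
void record_epc_page(struct epc_lru_lists *lru, struct epc_page *page,
		     bool reclaimable)
{
	assert(lru != NULL && page != NULL);
	list_add_tail_node(&page->list,
			   reclaimable ? &lru->reclaimable : &lru->unreclaimable);
}

/* Dropping works the same way regardless of which list holds the page. */
void drop_epc_page(struct epc_page *page)
{
	list_del_node(&page->list);
}
```

Because every recorded page is on one list or the other, an OOM handler
can walk the unreclaimable list to find enclaves that still hold EPC.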

Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
V4:
- Updates for patch reordering.
- Revised commit messages.
- Revised comments for the list.

V3:
- Removed tracking virtual EPC pages in unreclaimable list as host
kernel does not reclaim them. The EPC cgroup implemented later only
blocks allocation for a guest if the limit is reached by returning
-ENOMEM from sgx_alloc_epc_page() called by virt_epc, and does nothing
else. Therefore, there is no need to track those in LRU lists.
---
 arch/x86/kernel/cpu/sgx/encl.c  | 2 ++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 arch/x86/kernel/cpu/sgx/main.c  | 3 +++
 arch/x86/kernel/cpu/sgx/sgx.h   | 8 +++++++-
 4 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 91f83a5e543d..bf0ac3677ca8 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -748,6 +748,7 @@ void sgx_encl_release(struct kref *ref)
 	xa_destroy(&encl->page_array);
 
 	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 	}
@@ -756,6 +757,7 @@ void sgx_encl_release(struct kref *ref)
 		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
 					   list);
 		list_del(&va_page->list);
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		kfree(va_page);
 	}
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 95ec20a6992f..8c23bb524674 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -48,6 +48,7 @@ void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
 	encl->page_cnt--;
 
 	if (va_page) {
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		list_del(&va_page->list);
 		kfree(va_page);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ed813288af44..f3a3ed894616 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -268,6 +268,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 			goto out;
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
@@ -510,6 +511,8 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 	page->flags |= flags;
 	if (sgx_epc_page_reclaimable(flags))
 		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	else
+		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index e06b4aadb6a1..e210af77f0cf 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -150,17 +150,23 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 }
 
 /*
- * Tracks EPC pages reclaimable by the reclaimer (ksgxd).
+ * Contains EPC pages tracked by the reclaimer (ksgxd).
  */
 struct sgx_epc_lru_lists {
 	spinlock_t lock;
 	struct list_head reclaimable;
+	/*
+	 * Tracks SECS, VA pages, etc.: pages that are only freeable after
+	 * all of their dependent reclaimable pages are freed.
+	 */
+	struct list_head unreclaimable;
 };
 
 static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
 {
 	spin_lock_init(&lrus->lock);
 	INIT_LIST_HEAD(&lrus->reclaimable);
+	INIT_LIST_HEAD(&lrus->unreclaimable);
 }
 
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
-- 
2.25.1



* [PATCH v4 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 11/18] x86/sgx: store unreclaimable pages in LRU lists Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-13-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
                     ` (6 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Introduce the OOM path for killing an enclave when the reclaimer is no
longer able to reclaim enough EPC pages. Find a victim enclave, which
will be an enclave with only "unreclaimable" EPC pages left in the
cgroup LRU lists. Once a victim is identified, mark the enclave as OOM,
zap the enclave's entire page range, and drain all mm references in
encl->mm_list. For an enclave marked as OOM, block allocating any EPC
pages in the #PF handler, reloading any pages in all paths, and creating
any new mappings.

The OOM killing path may race with the reclaimers: in some cases, the
victim enclave is in the process of reclaiming the last EPC pages when
OOM happens, that is, all pages other than SECS and VA pages are in
RECLAIM_IN_PROGRESS state. The reclaiming process requires access to
the enclave backing, VA pages, and the SECS. So the OOM killer does not
directly release those enclave resources. Instead, it lets all
reclaiming in progress finish and relies (as currently done) on
kref_put() on encl->refcount to trigger sgx_encl_release() to do the
final cleanup.
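
For illustration only (not part of the patch): a minimal userspace sketch
of the deferred-release idea described above. The struct and function
names are hypothetical, the refcount is a plain int rather than a kref,
and zapping and mm_list draining are left out. The point is only that
the OOM path marks the enclave and drops its own reference, while the
final put, whoever makes it, performs the cleanup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct encl {
	int refcount;	/* models encl->refcount (a kref in the kernel) */
	bool oom;	/* models the SGX_ENCL_OOM flag */
	bool released;	/* set by the final put; models sgx_encl_release() */
};

/* Take a reference unless the enclave is already being torn down. */
bool encl_get_unless_zero(struct encl *e)
{
	if (e->refcount == 0)
		return false;
	e->refcount++;
	return true;
}

/* Drop a reference; only the last put performs the cleanup. */
void encl_put(struct encl *e)
{
	assert(e->refcount > 0);
	if (--e->refcount == 0)
		e->released = true;
}

/* New allocations are refused once the enclave is marked as OOM. */
int encl_alloc_page(const struct encl *e)
{
	return e->oom ? -ENOMEM : 0;
}

/*
 * OOM kill: mark the enclave, then drop only the reference taken when
 * the victim was selected. Resources still in use by an in-progress
 * reclaimer are not freed here.
 */
bool oom_kill(struct encl *e)
{
	if (!encl_get_unless_zero(e))
		return false;
	e->oom = true;
	encl_put(e);
	return true;
}
```

In this model, a reclaimer holding its own reference can still finish its
work after oom_kill() returns; the enclave is released only when that
reference and any others are dropped.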

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Updates for patch reordering and typo fixes.

V3:
- Rebased to use the new VMA_ITERATOR to zap VMAs.
- Fixed the racing cases by blocking new page allocation/mapping and
reloading when the enclave is marked for OOM. Do not release any enclave
resources other than draining mm_list entries, and let pages in
RECLAIM_IN_PROGRESS be reaped by reclaimers.
- Due to the above changes, also removed the no-longer-needed encl->lock in
the OOM path, which was causing deadlocks reported by the lock prover.
---
 arch/x86/kernel/cpu/sgx/driver.c |  27 +-----
 arch/x86/kernel/cpu/sgx/encl.c   |  48 ++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |   2 +
 arch/x86/kernel/cpu/sgx/ioctl.c  |   9 ++
 arch/x86/kernel/cpu/sgx/main.c   | 140 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |   1 +
 6 files changed, 200 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index 262f5fb18d74..ff42d649c7b6 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -44,7 +44,6 @@ static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
-	struct sgx_encl_mm *encl_mm;
 
 	/*
 	 * Drain the remaining mm_list entries. At this point the list contains
@@ -52,31 +51,7 @@ static int sgx_release(struct inode *inode, struct file *file)
 	 * not exited yet. The processes, which have exited, are gone from the
 	 * list by sgx_mmu_notifier_release().
 	 */
-	for ( ; ; )  {
-		spin_lock(&encl->mm_lock);
-
-		if (list_empty(&encl->mm_list)) {
-			encl_mm = NULL;
-		} else {
-			encl_mm = list_first_entry(&encl->mm_list,
-						   struct sgx_encl_mm, list);
-			list_del_rcu(&encl_mm->list);
-		}
-
-		spin_unlock(&encl->mm_lock);
-
-		/* The enclave is no longer mapped by any mm. */
-		if (!encl_mm)
-			break;
-
-		synchronize_srcu(&encl->srcu);
-		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
-		kfree(encl_mm);
-
-		/* 'encl_mm' is gone, put encl_mm->encl reference: */
-		kref_put(&encl->refcount, sgx_encl_release);
-	}
-
+	sgx_encl_mm_drain(encl);
 	kref_put(&encl->refcount, sgx_encl_release);
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index bf0ac3677ca8..85b6f218f029 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -453,6 +453,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 	if (unlikely(!encl))
 		return VM_FAULT_SIGBUS;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return VM_FAULT_SIGBUS;
+
 	/*
 	 * The page_array keeps track of all enclave pages, whether they
 	 * are swapped out or not. If there is no entry for this page and
@@ -651,7 +654,8 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 	if (!encl)
 		return -EFAULT;
 
-	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
+	    test_bit(SGX_ENCL_OOM, &encl->flags))
 		return -EFAULT;
 
 	for (i = 0; i < len; i += cnt) {
@@ -776,6 +780,45 @@ void sgx_encl_release(struct kref *ref)
 	kfree(encl);
 }
 
+/**
+ * sgx_encl_mm_drain - drain all mm_list entries
+ * @encl:	address of the sgx_encl to drain
+ *
+ * Used during OOM kill to empty the mm_list entries after they have been
+ * zapped, and by sgx_release() to drain the remaining mm_list entries when
+ * the enclave fd is closing. After this call, sgx_encl_release() will be
+ * called with kref_put().
+ */
+void sgx_encl_mm_drain(struct sgx_encl *encl)
+{
+	struct sgx_encl_mm *encl_mm;
+
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
+		}
+
+		spin_unlock(&encl->mm_lock);
+
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
+
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
+
+		/* 'encl_mm' is gone, put encl_mm->encl reference: */
+		kref_put(&encl->refcount, sgx_encl_release);
+	}
+}
+
 /*
  * 'mm' is exiting and no longer needs mmu notifications.
  */
@@ -847,6 +890,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	/*
 	 * Even though a single enclave may be mapped into an mm more than once,
 	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 831d63f80f5a..47792fb00cee 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -39,6 +39,7 @@ enum sgx_encl_flags {
 	SGX_ENCL_DEBUG		= BIT(1),
 	SGX_ENCL_CREATED	= BIT(2),
 	SGX_ENCL_INITIALIZED	= BIT(3),
+	SGX_ENCL_OOM		= BIT(4),
 };
 
 struct sgx_encl_mm {
@@ -125,5 +126,6 @@ struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 					 unsigned long addr);
 struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+void sgx_encl_mm_drain(struct sgx_encl *encl);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 8c23bb524674..1f65c79664a2 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -421,6 +421,9 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
 		return -EFAULT;
 
@@ -606,6 +609,9 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
 		return -EFAULT;
 
@@ -682,6 +688,9 @@ static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
 	if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f3a3ed894616..c8900d62cfff 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -621,6 +621,146 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
+static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl *encl;
+
+	if (epc_page->flags & SGX_EPC_OWNER_PAGE)
+		encl = epc_page->encl_page->encl;
+	else if (epc_page->flags & SGX_EPC_OWNER_ENCL)
+		encl = epc_page->encl;
+	else
+		return false;
+
+	return kref_get_unless_zero(&encl->refcount);
+}
+
+static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *epc_page, *tmp;
+
+	if (list_empty(&lru->unreclaimable))
+		return NULL;
+
+	list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
+		list_del_init(&epc_page->list);
+
+		if (sgx_oom_get_ref(epc_page))
+			return epc_page;
+	}
+	return NULL;
+}
+
+static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
+			    unsigned long end, const struct vm_operations_struct *ops)
+{
+	VMA_ITERATOR(vmi, mm, start);
+	struct vm_area_struct *vma;
+
+	/*
+	 * Use end because start can be zero and not mapped into
+	 * enclave even if encl->base = 0.
+	 */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_ops == ops && vma->vm_private_data == owner &&
+		    vma->vm_start < end) {
+			zap_vma_pages(vma);
+		}
+	}
+}
+
+static bool sgx_oom_encl(struct sgx_encl *encl)
+{
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = false;
+	int idx;
+
+	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
+		goto out_put;
+
+	/* OOM was already done on this enclave; do not redo it.
+	 * This may happen when the SECS page is still UNRECLAIMABLE because
+	 * another page is in RECLAIM_IN_PROGRESS. Still return true so the
+	 * OOM killer can wait until the reclaimer is done with the held-up
+	 * page and the SECS before it moves on to find another victim.
+	 */
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		goto out;
+
+	set_bit(SGX_ENCL_OOM, &encl->flags);
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_wmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
+					encl->base + encl->size, &sgx_vm_ops);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
+
+	sgx_encl_mm_drain(encl);
+out:
+	ret = true;
+
+out_put:
+	/*
+	 * This puts the refcount we took when we identified this enclave as
+	 * an OOM victim.
+	 */
+	kref_put(&encl->refcount, sgx_encl_release);
+	return ret;
+}
+
+static inline bool sgx_oom_encl_page(struct sgx_encl_page *encl_page)
+{
+	return sgx_oom_encl(encl_page->encl);
+}
+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru:	LRU that is low
+ *
+ * Return:	%true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *victim;
+
+	spin_lock(&lru->lock);
+	victim = sgx_oom_get_victim(lru);
+	spin_unlock(&lru->lock);
+
+	if (!victim)
+		return false;
+
+	if (victim->flags & SGX_EPC_OWNER_PAGE)
+		return sgx_oom_encl_page(victim->encl_page);
+
+	if (victim->flags & SGX_EPC_OWNER_ENCL)
+		return sgx_oom_encl(victim->encl);
+
+	/* Will never happen unless we add more owner types in the future */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 					 unsigned long index,
 					 struct sgx_epc_section *section)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index e210af77f0cf..3818be5a8bd3 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -176,6 +176,7 @@ void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1



* [PATCH v4 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
  2023-09-13 15:36     ` Jarkko Sakkinen
  2023-09-13  4:06   ` [PATCH v4 14/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Haitao Huang
                     ` (5 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)

From: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Adjust and expose the top-level reclaim function as
sgx_reclaim_epc_pages() for use by the upcoming EPC cgroup, which will
initiate reclaim to enforce the max limit.

Make these adjustments to the function signature:

1) Take a parameter that specifies the number of pages to scan for
reclaiming. Define a max value of 32, but scan 16 in the case of the
global reclaimer (ksgxd). The EPC cgroup will use it to specify a
desired number of pages to be reclaimed up to the max value of 32.

2) Take a flag to force reclaiming a page regardless of its age. The
EPC cgroup will use the flag to enforce its limits by draining the
reclaimable lists before resorting to other measures, e.g., forcefully
killing enclaves.

3) Return the number of reclaimed pages. The EPC cgroup will use the
result to track reclaiming progress and escalate to a more forceful
reclaiming mode, e.g., calling this function with the flag to ignore the
age of pages.
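
For illustration only (not part of the patch): a minimal userspace sketch
of the three signature changes listed above. The page model and
reclaim_pages() are hypothetical. Clearing the "young" bit when a page
is skipped is a simplification standing in for the age check done by
sgx_reclaimer_age().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_TO_SCAN_MAX	32	/* upper bound on pages scanned per call */

struct page {
	bool young;	/* accessed since the last scan */
	bool reclaimed;
};

/*
 * Scan up to nr_to_scan pages (capped at NR_TO_SCAN_MAX). Young pages are
 * skipped unless ignore_age is set. Returns the number of pages reclaimed
 * so that the caller can track progress.
 */
size_t reclaim_pages(struct page *pages, size_t nr_pages,
		     size_t nr_to_scan, bool ignore_age)
{
	size_t reclaimed = 0;
	size_t i;

	assert(pages != NULL || nr_pages == 0);

	if (nr_to_scan > NR_TO_SCAN_MAX)
		nr_to_scan = NR_TO_SCAN_MAX;
	if (nr_to_scan > nr_pages)
		nr_to_scan = nr_pages;

	for (i = 0; i < nr_to_scan; i++) {
		if (pages[i].reclaimed)
			continue;
		if (pages[i].young && !ignore_age) {
			pages[i].young = false;	/* aged; eligible next pass */
			continue;
		}
		pages[i].reclaimed = true;
		reclaimed++;
	}

	return reclaimed;
}
```

A caller enforcing a limit could first call this with ignore_age set to
false and, if the return value shows too little progress, retry with
ignore_age set to true before resorting to stronger measures.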

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Combined the 3 patches that made the individual changes to the
function signature.
- Removed 'high' limit in commit message.
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++++++++++++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c8900d62cfff..e1dde431a400 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -17,6 +17,10 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+/**
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX	32
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
@@ -279,7 +283,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
-/*
+/**
+ * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
+ * @nr_to_scan:		 Number of EPC pages to scan for reclaim
+ * @ignore_age:		 Reclaim a page even if it is young
+ *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
  * been accessed since the last scan. Move those pages to the tail of active
@@ -292,15 +300,15 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void sgx_reclaim_pages(void)
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 {
-	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
-	int ret;
-	int i;
+	size_t ret;
+	size_t i;
 
 	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
@@ -326,13 +334,14 @@ static void sgx_reclaim_pages(void)
 	spin_unlock(&sgx_global_lru.lock);
 
 	if (list_empty(&iso))
-		return;
+		return 0;
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_page;
 
-		if (!sgx_reclaimer_age(epc_page))
+		if (i == SGX_NR_TO_SCAN_MAX ||
+		    (!ignore_age && !sgx_reclaimer_age(epc_page)))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -371,6 +380,7 @@ static void sgx_reclaim_pages(void)
 
 		sgx_free_epc_page(epc_page);
 	}
+	return i;
 }
 
 static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,7 +397,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages();
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 }
 
 static int ksgxd(void *p)
@@ -410,7 +420,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages();
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 
 		cond_resched();
 	}
@@ -582,7 +592,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages();
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 		cond_resched();
 	}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 3818be5a8bd3..aa4ec2c0ce96 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -177,6 +177,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 14/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
  2023-09-13  4:06   ` [PATCH v4 15/18] x86/sgx: Prepare for multiple LRUs Haitao Huang
                     ` (4 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)
  To: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

From: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Move the isolation loop into a helper, sgx_isolate_epc_pages(), in
preparation for the existence of multiple LRUs. Expose the helper to other
SGX code so that it can be called from the EPC cgroup code, e.g., to
isolate pages from a single cgroup LRU. Exposing the isolation loop
allows the cgroup iteration logic to be wholly encapsulated within the
cgroup code.
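
The shape of such an isolation helper can be sketched in userspace as
follows. This is a simplified model and not the kernel code: struct
node, struct queue and model_isolate() are hypothetical names, the
owner_alive field stands in for a successful kref_get_unless_zero(),
locking is omitted, and the model returns a count purely so that the
behaviour is easy to check (the helper in the patch returns void).

```c
#include <stddef.h>

/* Hypothetical minimal list node; the kernel uses struct list_head. */
struct node {
	struct node *next;
	int owner_alive;	/* models kref_get_unless_zero() succeeding */
};

struct queue {
	struct node *head, *tail;
};

static void queue_push(struct queue *q, struct node *n)
{
	n->next = NULL;
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
}

static struct node *queue_pop(struct queue *q)
{
	struct node *n = q->head;

	if (n) {
		q->head = n->next;
		if (!q->head)
			q->tail = NULL;
		n->next = NULL;
	}
	return n;
}

/*
 * Move up to nr_to_scan nodes from the head of lru to the tail of dst.
 * Nodes whose owner is going away are dropped from lru, not isolated.
 * Returns the number of nodes placed on dst.
 */
size_t model_isolate(struct queue *lru, size_t nr_to_scan, struct queue *dst)
{
	size_t isolated = 0;

	for (; nr_to_scan > 0; --nr_to_scan) {
		struct node *n = queue_pop(lru);

		if (!n)
			break;
		if (n->owner_alive) {
			queue_push(dst, n);
			isolated++;
		}
	}
	return isolated;
}
```

Because the source LRU is a parameter, the same routine can serve the
global LRU and any per-cgroup LRU, which is the point of the refactor.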

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- No changes other than reordering the patches
---
 arch/x86/kernel/cpu/sgx/main.c | 57 +++++++++++++++++++++-------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
 2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e1dde431a400..ce316bd5e5bb 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -283,6 +283,40 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru:	LRU from which to reclaim
+ * @nr_to_scan:	Number of pages to scan for reclaim
+ * @dst:	Destination list to hold the isolated pages
+ */
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+			   struct list_head *dst)
+{
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+
+	spin_lock(&lru->lock);
+	for (; nr_to_scan > 0; --nr_to_scan) {
+		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
+		if (!epc_page)
+			break;
+
+		encl_page = epc_page->encl_page;
+
+		if (kref_get_unless_zero(&encl_page->encl->refcount)) {
+			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
+			list_move_tail(&epc_page->list, dst);
+		} else {
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
+			 */
+			sgx_epc_page_reset_state(epc_page);
+			list_del_init(&epc_page->list);
+		}
+	}
+	spin_unlock(&lru->lock);
+}
+
 /**
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
@@ -310,28 +344,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	size_t ret;
 	size_t i;
 
-	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-						    struct sgx_epc_page, list);
-		if (!epc_page)
-			break;
-
-		list_del_init(&epc_page->list);
-		encl_page = epc_page->encl_page;
-
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
-			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			list_move_tail(&epc_page->list, &iso);
-		} else {
-			/* The owner is freeing the page, remove it from the
-			 * LRU list
-			 */
-			sgx_epc_page_reset_state(epc_page);
-			list_del_init(&epc_page->list);
-		}
-	}
-	spin_unlock(&sgx_global_lru.lock);
+	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index aa4ec2c0ce96..7e21192b87a8 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -178,6 +178,8 @@ int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+			   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 15/18] x86/sgx: Prepare for multiple LRUs
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 14/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
       [not found]     ` <20230913040635.28815-16-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  4:06   ` [PATCH v4 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
                     ` (3 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)
  To: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

Add the sgx_can_reclaim() and sgx_lru_lists() wrappers to encapsulate
direct references to the global LRU list in the reclaimer functions, so
that they can later be called with an LRU list per EPC cgroup.

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Re-organized this patch to include all changes related to
encapsulation of the global LRU
- Moved this patch to precede the EPC cgroup patch
---
 arch/x86/kernel/cpu/sgx/main.c | 41 +++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ce316bd5e5bb..3d396fe5ec09 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,6 +34,16 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_lists sgx_global_lru;
 
+static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
+
+static inline bool sgx_can_reclaim(void)
+{
+	return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -339,6 +349,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
+	struct sgx_epc_lru_lists *lru;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
 	size_t ret;
@@ -372,10 +383,11 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 		continue;
 
 skip:
-		spin_lock(&sgx_global_lru.lock);
+		lru = sgx_lru_lists(epc_page);
+		spin_lock(&lru->lock);
 		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
-		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
-		spin_unlock(&sgx_global_lru.lock);
+		list_move_tail(&epc_page->list, &lru->reclaimable);
+		spin_unlock(&lru->lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 	}
@@ -399,7 +411,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_global_lru.reclaimable);
+		sgx_can_reclaim();
 }
 
 /*
@@ -529,14 +541,16 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
 	page->flags |= flags;
 	if (sgx_epc_page_reclaimable(flags))
-		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+		list_add_tail(&page->list, &lru->reclaimable);
 	else
-		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
-	spin_unlock(&sgx_global_lru.lock);
+		list_add_tail(&page->list, &lru->unreclaimable);
+	spin_unlock(&lru->lock);
 }
 
 /**
@@ -551,15 +565,16 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
  */
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
-		spin_unlock(&sgx_global_lru.lock);
+		spin_unlock(&lru->lock);
 		return -EBUSY;
 	}
-
 	list_del(&page->list);
 	sgx_epc_page_reset_state(page);
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
 	return 0;
 }
@@ -592,7 +607,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_global_lru.reclaimable))
+		if (!sgx_can_reclaim())
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 15/18] x86/sgx: Prepare for multiple LRUs Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
  2023-09-13 15:48     ` Jarkko Sakkinen
  2023-09-13  4:06   ` [PATCH v4 17/18] Docs/x86/sgx: Add description for cgroup support Haitao Huang
                     ` (2 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)
  To: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g., it must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem).  The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit on EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".

This patch was modified from its original version to use the misc cgroup
controller instead of a custom controller.
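
As an illustration of the "misc.max" interface mentioned above, the
sketch below formats the line that userspace would write to a cgroup's
misc.max file. It assumes the usual misc controller syntax of
"<resource> <value>" and assumes the value is in bytes, since the patch
charges PAGE_SIZE per EPC page; format_sgx_epc_max() is a hypothetical
helper, and the file's location under the cgroup v2 mount is left to
the caller.

```c
#include <stdio.h>

/*
 * Format one "misc.max" entry for the sgx_epc resource into buf.
 * Returns the number of characters written (excluding the NUL), or -1
 * if buf is too small.
 */
int format_sgx_epc_max(char *buf, size_t len, unsigned long long bytes)
{
	int n = snprintf(buf, len, "sgx_epc %llu\n", bytes);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The resulting string would then be written to the misc.max file of the
target cgroup, after which the controller rejects charges that would
exceed the limit.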

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Tested-by: Mikko Ylinen <mikko.ylinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
V4:
- Fix a white space issue in Kconfig (Randy).
- Update comments for LRU list as it can be owned by a cgroup.
- Fix comments for sgx_reclaim_epc_pages() and use IS_ENABLED consistently (Mikko)

V3:

1) Use the same maximum number of reclaiming candidate pages to be
processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
the cgroup worker function and ksgxd. This fixes an overflow in the
backing store buffer, which has the same fixed size and is allocated on
the stack in sgx_reclaim_epc_pages().

2) Initialize max for the root EPC cgroup. Otherwise, all
misc_cg_try_charge() calls would fail, as it checks the limits of all
ancestors all the way to the root node.

3) Start reclaiming whenever misc_cg_try_charge() fails. Removed all
re-checks for limits and current usage. For all intents and purposes,
when misc_cg_try_charge() fails, reclaiming is needed. This also corrects
an error of not reclaiming when the child limit is larger than that of
one of its ancestors.

4) Handle failure on charging to the root EPC cgroup. Failure on charging
to root means we are at or above capacity, so start reclaiming or return
OOM error.

5) Removed the custom cgroup tree walking iterator with epoch tracking
logic. Replaced it with the plain css_for_each_descendant_pre()
iterator. The custom iterator implemented a rather complex epoch scheme
that I believe was intended to prevent extra reclaiming from multiple
worker threads doing the same walk, but it turned out not to matter much
as each thread would only reclaim when usage is above the limit. Using
the plain css_for_each_descendant_pre() iterator simplified the code a
bit.

6) Do not reclaim synchronously in the misc_max_write callback, which
would block the user. Instead, queue an asynchronous work item to run
the reclaiming loop.

7) Other minor refactoring:
- Remove unused params in epc_cgroup APIs
- Centralize uncharge into sgx_free_epc_page()
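
Items 2) and 3) above both follow from limits being hierarchical: a
charge must fit under every ancestor's max, so the effective limit of
a cgroup is the smallest max on its path to the root. A minimal
userspace sketch of that calculation is shown below, assuming 4 KiB
pages and limits stored in bytes; struct model_cg and
model_max_pages_to_root() are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* Hypothetical cgroup node; only the fields the sketch needs. */
struct model_cg {
	struct model_cg *parent;
	uint64_t max_bytes;	/* this cgroup's own sgx_epc max */
};

/* Effective limit in pages: the tightest max on the path to the root. */
uint64_t model_max_pages_to_root(const struct model_cg *cg)
{
	uint64_t m = UINT64_MAX;

	for (; cg; cg = cg->parent)
		if (cg->max_bytes < m)
			m = cg->max_bytes;
	return m / MODEL_PAGE_SIZE;
}
```

With a root left at the maximum value, a parent limited to 8 pages and
a child limited to 64 pages, the child's effective limit is 8 pages,
so a failed charge in the child has to trigger reclaim even though the
child is below its own max.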
---
 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 406 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 ++++
 arch/x86/kernel/cpu/sgx/main.c       |  67 ++++-
 arch/x86/kernel/cpu/sgx/sgx.h        |  17 +-
 6 files changed, 547 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 982b777eadc7..55fcf182d4a3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1921,6 +1921,19 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config CGROUP_SGX_EPC
+	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+	depends on X86_SGX && CGROUP_MISC
+	help
+	  Provides control over the EPC footprint of tasks in a cgroup via
+	  the Miscellaneous cgroup controller.
+
+	  EPC is a subset of regular memory that is usable only by SGX
+	  enclaves and is very limited in quantity, e.g. less than 1%
+	  of total DRAM.
+
+	  Say N if unsure.
+
 config X86_USER_SHADOW_STACK
 	bool "X86 userspace shadow stack"
 	depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..7b86eb074abe
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,406 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+
+#include "epc_cgroup.h"
+
+#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
+#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
+#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root);
+
+struct sgx_epc_reclaim_control {
+	struct sgx_epc_cgroup *epc_cg;
+	int nr_fails;
+	bool ignore_age;
+};
+
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+	return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+	return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+	struct misc_cg *i = epc_cg->cg;
+	u64 m = U64_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+	return m / PAGE_SIZE;
+}
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	if (cg)
+		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+	return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_lru_empty - check if a cgroup tree has no pages on its lrus
+ * @root:	root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root = NULL;
+	struct cgroup_subsys_state *pos = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+	bool ret = true;
+
+	/*
+	 * Caller ensures css_root ref is acquired
+	 */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+		spin_lock(&epc_cg->lru.lock);
+		ret = list_empty(&epc_cg->lru.reclaimable);
+		spin_unlock(&epc_cg->lru.lock);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/**
+ * sgx_epc_cgroup_isolate_pages - walk a cgroup tree and separate pages
+ * @root:	root of the tree to start walking
+ * @nr_to_scan: The number of pages that need to be isolated
+ * @dst:	Destination list to hold the isolated pages
+ *
+ * Walk the cgroup tree and isolate the pages in the hierarchy
+ * for reclaiming.
+ */
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst)
+{
+	struct cgroup_subsys_state *css_root = NULL;
+	struct cgroup_subsys_state *pos = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+
+	if (!*nr_to_scan)
+		return;
+
+	/* Caller ensures css_root ref is acquired */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!*nr_to_scan)
+			break;
+	}
+	rcu_read_unlock();
+}
+
+static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
+					struct sgx_epc_reclaim_control *rc)
+{
+	/*
+	 * Ensure sgx_reclaim_epc_pages() is called with a minimum and maximum
+	 * number of pages.  Attempting to reclaim only a few pages will
+	 * often fail and is inefficient, while reclaiming a huge number
+	 * of pages can result in soft lockups due to holding various
+	 * locks for an extended duration.
+	 */
+	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
+	nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
+
+	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
+}
+
+static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
+{
+	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
+		return -ENOMEM;
+
+	++rc->nr_fails;
+	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
+		rc->ignore_age = true;
+
+	return 0;
+}
+
+static inline
+void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
+				  struct sgx_epc_cgroup *epc_cg)
+{
+	rc->epc_cg = epc_cg;
+	rc->nr_fails = 0;
+	rc->ignore_age = false;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
+ * cgroup when the cgroup is at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+	u64 cur, max;
+
+	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+		/*
+		 * Adjust the limit down by one page, the goal is to free up
+		 * pages for fault allocations, not to simply obey the limit.
+		 * Conditionally decrementing max also means the cur vs. max
+		 * check will correctly handle the case where both are zero.
+		 */
+		if (max)
+			max--;
+
+		/*
+		 * Unless the limit is extremely low, in which case forcing
+		 * reclaim will likely cause thrashing, force the cgroup to
+		 * reclaim at least once if it's operating *near* its maximum
+		 * limit by adjusting @max down by half the min reclaim size.
+		 * This work func is scheduled by sgx_epc_cgroup_try_charge
+		 * when it cannot directly reclaim due to being in an atomic
+		 * context, e.g. EPC allocation in a fault handler.  Waiting
+		 * to reclaim until the cgroup is actually at its limit is less
+		 * performant as it means the faulting task is effectively
+		 * blocked until a worker makes its way through the global work
+		 * queue.
+		 */
+		if (max > SGX_NR_TO_SCAN_MAX)
+			max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
+
+		max = min(max, sgx_epc_total_pages);
+		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+		if (cur <= max)
+			break;
+		/* Nothing reclaimable */
+		if (sgx_epc_cgroup_lru_empty(epc_cg)) {
+			if (!sgx_epc_cgroup_oom(epc_cg))
+				break;
+
+			continue;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc))
+				break;
+		}
+	}
+}
+
+static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+				       bool reclaim)
+{
+	struct sgx_epc_reclaim_control rc;
+	unsigned int nr_empty = 0;
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					PAGE_SIZE))
+			break;
+
+		if (sgx_epc_cgroup_lru_empty(epc_cg))
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+					return -ENOMEM;
+				schedule();
+			}
+		}
+	}
+	if (epc_cg->cg != misc_cg_root())
+		css_get(&epc_cg->cg->css);
+
+	return 0;
+}
+
+/**
+ * sgx_epc_cgroup_try_charge - hierarchically try to charge a single EPC page
+ * @mm:			the mm_struct of the process to charge
+ * @reclaim:		whether or not synchronous reclaim is allowed
+ *
+ * Returns EPC cgroup or NULL on success, -errno on failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	int ret;
+
+	if (sgx_epc_cgroup_disabled())
+		return NULL;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
+	put_misc_cg(epc_cg->cg);
+
+	if (ret)
+		return ERR_PTR(ret);
+
+	return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge - hierarchically uncharge EPC pages
+ * @epc_cg:	the charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+	if (sgx_epc_cgroup_disabled())
+		return;
+
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+	if (epc_cg->cg != misc_cg_root())
+		put_misc_cg(epc_cg->cg);
+}
+
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root = NULL;
+	struct cgroup_subsys_state *pos = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+	bool oom = false;
+
+	/* Caller ensures css_root ref is acquired */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		/* skip dead ones */
+		if (!css_tryget(pos))
+			continue;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		oom = sgx_epc_oom(&epc_cg->lru);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (oom)
+			break;
+	}
+	rcu_read_unlock();
+	return oom;
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+	cancel_work_sync(&epc_cg->reclaim_work);
+	kfree(epc_cg);
+}
+
+static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+	/* Let the reclaimer do the work so the user is not blocked */
+	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
+	if (!epc_cg)
+		return -ENOMEM;
+
+	sgx_lru_init(&epc_cg->lru);
+	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_alloc = sgx_epc_cgroup_alloc;
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_free = sgx_epc_cgroup_free;
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_max_write = sgx_epc_cgroup_max_write;
+	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+	epc_cg->cg = cg;
+	return 0;
+}
+
+static int __init sgx_epc_cgroup_init(void)
+{
+	struct misc_cg *cg;
+
+	if (!boot_cpu_has(X86_FEATURE_SGX))
+		return 0;
+
+	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+					WQ_UNBOUND | WQ_FREEZABLE,
+					WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!sgx_epc_cg_wq);
+
+	cg = misc_cg_root();
+	BUG_ON(!cg);
+	WRITE_ONCE(cg->res[MISC_CG_RES_SGX_EPC].max, U64_MAX);
+	atomic64_set(&cg->res[MISC_CG_RES_SGX_EPC].usage, 0UL);
+	return sgx_epc_cgroup_alloc(cg);
+}
+subsys_initcall(sgx_epc_cgroup_init);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..dfc902f4d96f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	return NULL;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+						size_t *nr_to_scan,
+						struct list_head *dst) { }
+
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	return true;
+}
+#else
+struct sgx_epc_cgroup {
+	struct misc_cg *cg;
+	struct sgx_epc_lru_lists	lru;
+	struct work_struct	reclaim_work;
+};
+
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst);
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	if (epc_cg)
+		return &epc_cg->lru;
+	return NULL;
+}
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 3d396fe5ec09..20de17f4f576 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
 #include <linux/node.h>
 #include <linux/pagemap.h>
 #include <linux/ratelimit.h>
@@ -17,11 +18,9 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
-/**
- * Maximum number of pages to scan for reclaiming.
- */
-#define SGX_NR_TO_SCAN_MAX	32
+#include "epc_cgroup.h"
 
+u64 sgx_epc_total_pages;
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
@@ -36,11 +35,17 @@ static struct sgx_epc_lru_lists sgx_global_lru;
 
 static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return epc_cg_lru(epc_page->epc_cg);
+
 	return &sgx_global_lru;
 }
 
 static inline bool sgx_can_reclaim(void)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return !sgx_epc_cgroup_lru_empty(NULL);
+
 	return !list_empty(&sgx_global_lru.reclaimable);
 }
 
@@ -299,14 +304,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * @nr_to_scan:	Number of pages to scan for reclaim
  * @dst:	Destination list to hold the isolated pages
  */
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t *nr_to_scan,
 			   struct list_head *dst)
 {
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
 
 	spin_lock(&lru->lock);
-	for (; nr_to_scan > 0; --nr_to_scan) {
+	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
 		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
@@ -331,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
+ * @epc_cg:		 EPC cgroup from which to reclaim
  *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -344,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg)
 {
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
@@ -355,7 +362,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	size_t ret;
 	size_t i;
 
-	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
+	/*
+	 * If a specific cgroup is not being targeted, take from the global
+	 * list first, even when cgroups are enabled.  If there are
+	 * pages on the global LRU then they should get reclaimed asap.
+	 */
+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
+		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+
+	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
@@ -422,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 }
 
 static int ksgxd(void *p)
@@ -445,7 +460,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 
 		cond_resched();
 	}
@@ -599,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 {
 	struct sgx_epc_page *page;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
+	if (IS_ERR(epc_cg))
+		return ERR_CAST(epc_cg);
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -607,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (!sgx_can_reclaim())
-			return ERR_PTR(-ENOMEM);
+		if (!sgx_can_reclaim()) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (!reclaim) {
 			page = ERR_PTR(-EBUSY);
@@ -620,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 		cond_resched();
 	}
 
+	if (!IS_ERR(page)) {
+		WARN_ON_ONCE(page->epc_cg);
+		page->epc_cg = epc_cg;
+	} else {
+		sgx_epc_cgroup_uncharge(epc_cg);
+	}
+
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
@@ -646,6 +675,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
 
+	if (page->epc_cg) {
+		sgx_epc_cgroup_uncharge(page->epc_cg);
+		page->epc_cg = NULL;
+	}
+
 	spin_lock(&node->lock);
 
 	page->encl_page = NULL;
@@ -656,6 +690,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	page->flags = SGX_EPC_PAGE_FREE;
 
 	spin_unlock(&node->lock);
+
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
@@ -825,6 +860,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
+		section->pages[i].epc_cg = NULL;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -969,6 +1005,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
 static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
+	u64 capacity = 0;
 	u64 pa, size;
 	int nid;
 	int i;
@@ -1019,6 +1056,7 @@ static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		capacity += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -1028,6 +1066,9 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
+
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 7e21192b87a8..bf746d2af96d 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -19,6 +19,11 @@
 
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
+
+/*
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX		32UL
 #define SGX_NR_TO_SCAN			16
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
@@ -70,6 +75,8 @@ enum sgx_epc_page_state {
 /* flag for pages owned by a sgx_encl struct */
 #define SGX_EPC_OWNER_ENCL		BIT(4)
 
+struct sgx_epc_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
@@ -79,6 +86,7 @@ struct sgx_epc_page {
 		struct sgx_encl *encl;
 	};
 	struct list_head list;
+	struct sgx_epc_cgroup *epc_cg;
 };
 
 static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
@@ -127,6 +135,7 @@ struct sgx_epc_section {
 	struct sgx_numa_node *node;
 };
 
+extern u64 sgx_epc_total_pages;
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 
 static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
@@ -150,7 +159,8 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 }
 
 /*
- * Contains EPC pages tracked by the reclaimer (ksgxd).
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
  */
 struct sgx_epc_lru_lists {
 	spinlock_t lock;
@@ -177,8 +187,9 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t *nr_to_scan,
 			   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread
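
The patch above reports the EPC capacity in bytes through misc_cg_set_capacity() and derives sgx_epc_total_pages as "capacity >> PAGE_SHIFT". A minimal userspace sketch of the same arithmetic, reading a flat-keyed file such as misc.capacity, might look like the following. The helper names misc_key_value and epc_bytes_to_pages are hypothetical and not part of the series, and a 4 KiB page size (PAGE_SHIFT of 12, as on x86) is assumed.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; not part of the patch series.

# Print the value stored under a key in a flat-keyed cgroup file,
# e.g. the "sgx_epc" line of misc.capacity. Prints nothing if the
# key is absent.
misc_key_value() {
    local file="$1" key="$2"
    awk -v k="$key" '$1 == k { print $2; exit }' "$file"
}

# Convert a byte count to a number of 4 KiB pages, mirroring
# "capacity >> PAGE_SHIFT" with PAGE_SHIFT assumed to be 12.
epc_bytes_to_pages() {
    echo $(( $1 >> 12 ))
}
```

On a cgroup v2 system with this series applied, something like `epc_bytes_to_pages "$(misc_key_value /sys/fs/cgroup/misc.capacity sgx_epc)"` would presumably print the total number of EPC pages; the path is an assumption about where the unified hierarchy is mounted.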

* [PATCH v4 17/18] Docs/x86/sgx: Add description for cgroup support
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
  2023-09-13  4:06   ` [PATCH v4 18/18] selftests/sgx: Add scripts for epc cgroup testing Haitao Huang
  2023-09-15 18:26   ` [PATCH v4 00/18] Add Cgroup support for SGX EPC memory Tejun Heo
  18 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)
  To: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Bagas Sanjaya <bagasdotme-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when a VMM overcommits EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 82 ++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..65c211bd5342 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,85 @@ to expected failures and handle them as follows:
    first call.  It indicates a bug in the kernel or the userspace client
    if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
    a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that
+is used to provide SGX-enabled applications with protected memory,
+and is otherwise inaccessible, i.e. shows up as reserved in
+/proc/iomem and cannot be read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM,
+for all intents and purposes the EPC is independent from normal system
+memory, e.g. must be reserved at boot from RAM and cannot be converted
+between EPC and normal memory while the system is running.  The EPC is
+managed by the SGX subsystem and is not accounted by the memory
+controller.  Note that this is true only for EPC memory itself, i.e.
+normal memory allocations related to SGX and EPC memory, e.g. the
+backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via
+virtual memory techniques and pages can be swapped out of the EPC
+to their backing store (normal system memory allocated via shmem).
+The SGX EPC subsystem is analogous to the memory subsystem, and
+it implements limit and protection models for EPC memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface
+files, please see Documentation/admin-guide/cgroup-v2.rst.
+
+All SGX EPC memory amounts are in bytes unless explicitly stated
+otherwise.  If a value which is not PAGE_SIZE aligned is written,
+the actual value used by the controller will be rounded down to
+the closest PAGE_SIZE multiple.
+
+  misc.capacity
+        A read-only flat-keyed file shown only in the root cgroup.
+        The sgx_epc resource will show the total amount of EPC
+        memory available on the platform.
+
+  misc.current
+        A read-only flat-keyed file shown in the non-root cgroups.
+        The sgx_epc resource will show the current active EPC memory
+        usage of the cgroup and its descendants. EPC pages that are
+        swapped out to backing RAM are not included in the current count.
+
+  misc.max
+        A read-write single value file which exists on non-root
+        cgroups. The sgx_epc resource will show the EPC usage
+        hard limit. The default is "max".
+
+        If a cgroup's EPC usage reaches this limit, EPC allocations,
+        e.g. for page fault handling, will be blocked until EPC can
+        be reclaimed from the cgroup.  If EPC cannot be reclaimed in
+        a timely manner, reclaim will be forced, e.g. by ignoring LRU.
+
+        The EPC pages allocated for KVM guests by the virtual EPC driver
+        are not reclaimable by the host kernel SGX reclaimers. If a VMM
+        tries to start a VM within a cgroup whose EPC usage reaches this
+        limit, the virtual EPC driver will stop allocating more EPC for the
+        VM, and send SIGBUS to the VMM, which would abort the VM launch.
+
+  misc.events
+        A read-only flat-keyed file which exists on non-root cgroups.
+        A value change in this file generates a file modified event.
+
+          max
+                The number of times the cgroup has triggered a reclaim
+                due to its EPC usage approaching (or exceeding) its max
+                EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread
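
The documentation above states that a value written to misc.max which is not PAGE_SIZE aligned is rounded down to the closest PAGE_SIZE multiple. A rough sketch of that rounding, assuming a 4 KiB PAGE_SIZE, is shown below. The helper names epc_round_down_to_page and epc_max_line are hypothetical; the kernel performs the rounding itself, so this only predicts the value the controller would use.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; PAGE_SIZE is assumed to be 4096.

# Round a byte count down to the nearest multiple of the page size.
epc_round_down_to_page() {
    echo $(( $1 / 4096 * 4096 ))
}

# Build the "sgx_epc <bytes>" line that would be written to misc.max,
# using the already rounded value.
epc_max_line() {
    echo "sgx_epc $(epc_round_down_to_page "$1")"
}
```

For example, `epc_max_line 4097000 | tee /sys/fs/cgroup/test/test1/misc.max` would write the same limit as the SMALL value (4096000 bytes, i.e. 1000 pages) used by the selftest setup script in this series.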

* [PATCH v4 18/18] selftests/sgx: Add scripts for epc cgroup testing
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (16 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 17/18] Docs/x86/sgx: Add description for cgroup support Haitao Huang
@ 2023-09-13  4:06   ` Haitao Huang
  2023-09-15 18:26   ` [PATCH v4 00/18] Add Cgroup support for SGX EPC memory Tejun Heo
  18 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-13  4:06 UTC (permalink / raw)
  To: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

Scripts rely on cgroup-tools package from libcgroup [1].

To test:
1) sudo ./setup_epc_cg.sh (optional one time setup)
2) sudo ./run_tests_in_misc_cg.sh

To watch misc group current:
./watch_misc_for_tests.sh current

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
---
V4:

Note: Need to apply on top of this series previously reviewed:
https://lore.kernel.org/linux-sgx/20220905020411.17290-1-jarkko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org/
---
 .../selftests/sgx/run_tests_in_misc_cg.sh     | 68 +++++++++++++++++++
 tools/testing/selftests/sgx/setup_epc_cg.sh   | 29 ++++++++
 .../selftests/sgx/watch_misc_for_tests.sh     | 13 ++++
 3 files changed, 110 insertions(+)
 create mode 100755 tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/setup_epc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/run_tests_in_misc_cg.sh b/tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
new file mode 100755
index 000000000000..63da7b23b74e
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
@@ -0,0 +1,68 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if ! lscgroup | grep -q "test/test1/test3$"; then
+  echo "setting up cgroups for testing..."
+  ./setup_epc_cg.sh
+fi
+
+cmd='./test_sgx'
+default_test="augment_via_eaccept_long"
+
+# We use 'tail' to skip header lines and 'sed' to remove 'enclave' from the first non-header line.
+list=$($cmd -l 2>&1 | tail -n +4 | sed '0,/^enclave/ s/^enclave//' | sed 's/^ *//')
+
+IFS=$'\n' read -d '' -r -a lines <<< "$list"
+lines=("all" "${lines[@]}")
+
+echo "Available tests:"
+for i in "${!lines[@]}"; do
+  # Check if the current line is the default test
+  if [[ ${lines[$i]} == *"$default_test"* ]]; then
+    echo "$((i)). ${lines[$i]} (default)"
+  else
+    echo "$((i)). ${lines[$i]}"
+  fi
+done
+
+echo "Please enter the number of the test you want to run (or press enter for the default test):"
+read -r choice
+
+if [ -z "$choice" ]; then
+  testname="$default_test"
+else
+  testname="${lines[$choice]}"
+fi
+
+if [ "$testname" == "all" ]; then
+  test_cmd="$cmd"
+else
+  test_cmd="$cmd -t $testname"
+fi
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+# Always use a leaf node of the misc cgroups so this works for both v1 and v2.
+# These tests may fail on OOM.
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_1_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_2_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_3_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_4_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_5_$timestamp.log 2>&1 &
+
+# These tests may time out when EPC is oversubscribed on a 4G EPC platform
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_1_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_2_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_3_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_4_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_5_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_6_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_7_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_8_$timestamp.log 2>&1 &
+
+# These tests should work on a 4G EPC platform
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_1_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_2_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_3_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_4_$timestamp.log 2>&1 &
diff --git a/tools/testing/selftests/sgx/setup_epc_cg.sh b/tools/testing/selftests/sgx/setup_epc_cg.sh
new file mode 100755
index 000000000000..5fd137a66436
--- /dev/null
+++ b/tools/testing/selftests/sgx/setup_epc_cg.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+cgcreate -g misc:test
+if [ $? -ne 0 ]; then
+    echo "Please make sure cgroup-tools is installed, and misc cgroup is mounted."
+    exit 1
+fi
+cgcreate -g misc:test/test1
+cgcreate -g misc:test/test1/test3
+cgcreate -g misc:test/test2
+cgcreate -g misc:test4
+
+# Setup for a platform with 4G EPC
+LARGER=4096000000
+LARGE=409600000
+SMALL=4096000
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+    echo "cgroups v2 is in use. Only leaf nodes can run a process"
+    echo "sgx_epc $SMALL" | tee /sys/fs/cgroup/test/test1/misc.max
+    echo "sgx_epc $LARGE" | tee /sys/fs/cgroup/test/test2/misc.max
+    echo "sgx_epc $LARGER" | tee /sys/fs/cgroup/test4/misc.max
+else
+    echo "cgroups v1 is in use."
+    echo "sgx_epc $SMALL" | tee /sys/fs/cgroup/misc/test/test1/misc.max
+    echo "sgx_epc $LARGE" | tee /sys/fs/cgroup/misc/test/test2/misc.max
+    echo "sgx_epc $LARGER" | tee /sys/fs/cgroup/misc/test4/misc.max
+fi
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..dbd38f346e7b
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if [ -z "$1" ]
+  then
+    echo "No argument supplied, please provide 'max', 'current' or 'events'"
+    exit 1
+fi
+
+watch -n 1 "find /sys/fs/cgroup -wholename '*/test*/misc.$1' -exec sh -c \
+    'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
+
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread
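
setup_epc_cg.sh above picks the misc.max path by checking whether /sys/fs/cgroup/misc exists: if it does, the v1 layout is assumed, otherwise v2. The same decision could be factored into one helper, sketched below. The function name misc_cg_dir and its parameters are hypothetical and not part of the series; the cgroup mount root is passed in so the logic can be exercised against any directory.

```shell
#!/bin/bash
# Hypothetical helper for illustration only; not part of the patch series.

# Print the directory of a misc cgroup under the given cgroup mount root.
# Mirrors the check in setup_epc_cg.sh: a "misc" subdirectory under the
# root indicates the cgroup v1 layout, otherwise v2 is assumed.
misc_cg_dir() {
    local root="$1" name="$2"
    if [ -d "$root/misc" ]; then
        echo "$root/misc/$name"
    else
        echo "$root/$name"
    fi
}
```

With such a helper, the duplicated branches in the setup script could collapse to a single form, e.g. `echo "sgx_epc $SMALL" | tee "$(misc_cg_dir /sys/fs/cgroup test/test1)/misc.max"`.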

* Re: [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events
       [not found]     ` <20230913040635.28815-2-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13  9:39       ` Jarkko Sakkinen
  2023-09-16  4:11         ` Haitao Huang
  0 siblings, 1 reply; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13  9:39 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> Consumers of the misc cgroup controller might need to perform separate
> actions for Cgroups Subsystem State(CSS) events: cgroup alloc and free.

nit: s/State(CSS)/State (CSS)/

"cgroup alloc" and "cgroup free" mean absolutely nothing.


> In addition, writes to the max value may also need separate action. Add

What "the max value"?

> the ability to allow downstream users to setup callbacks for these
> operations, and call the corresponding per-resource-type callback when
> appropriate.

Who are "the downstream users" and what sort of callbacks do they set up?

>
> This code will be utilized by the SGX driver in a future patch.
>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> ---
> V4:
> - Moved this to the front of the series.
> - Applies on cgroup/for-6.6 with the overflow fix for misc.
>
> V3:
> - Removed the released() callback
> ---
>  include/linux/misc_cgroup.h |  5 +++++
>  kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index e799b1f8d05b..e1bcd176c2de 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -37,6 +37,11 @@ struct misc_res {
>  	u64 max;
>  	atomic64_t usage;
>  	atomic64_t events;
> +
> +	/* per resource callback ops */
> +	int (*misc_cg_alloc)(struct misc_cg *cg);
> +	void (*misc_cg_free)(struct misc_cg *cg);
> +	void (*misc_cg_max_write)(struct misc_cg *cg);
>  };
>  
>  /**
> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> index 79a3717a5803..e0092170d0dd 100644
> --- a/kernel/cgroup/misc.c
> +++ b/kernel/cgroup/misc.c
> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
>  
>  	cg = css_misc(of_css(of));
>  
> -	if (READ_ONCE(misc_res_capacity[type]))
> +	if (READ_ONCE(misc_res_capacity[type])) {
>  		WRITE_ONCE(cg->res[type].max, max);
> -	else
> +		if (cg->res[type].misc_cg_max_write)
> +			cg->res[type].misc_cg_max_write(cg);
> +	} else {
>  		ret = -EINVAL;
> +	}
>  
>  	return ret ? ret : nbytes;
>  }
> @@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
>  static struct cgroup_subsys_state *
>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>  {
> +	struct misc_cg *parent_cg;
>  	enum misc_res_type i;
>  	struct misc_cg *cg;
> +	int ret;
>  
>  	if (!parent_css) {
>  		cg = &root_cg;
> +		parent_cg = &root_cg;
>  	} else {
>  		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
>  		if (!cg)
>  			return ERR_PTR(-ENOMEM);
> +		parent_cg = css_misc(parent_css);
>  	}
>  
>  	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
>  		WRITE_ONCE(cg->res[i].max, MAX_NUM);
>  		atomic64_set(&cg->res[i].usage, 0);
> +		if (parent_cg->res[i].misc_cg_alloc) {
> +			ret = parent_cg->res[i].misc_cg_alloc(cg);
> +			if (ret)
> +				goto alloc_err;
> +		}
>  	}
>  
>  	return &cg->css;
> +
> +alloc_err:
> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> +		if (parent_cg->res[i].misc_cg_free)
> +			cg->res[i].misc_cg_free(cg);
> +	kfree(cg);
> +	return ERR_PTR(ret);
>  }
>  
>  /**
> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>   */
>  static void misc_cg_free(struct cgroup_subsys_state *css)
>  {
> -	kfree(css_misc(css));
> +	struct misc_cg *cg = css_misc(css);
> +	enum misc_res_type i;
> +
> +	for (i = 0; i < MISC_CG_RES_TYPES; i++)
> +		if (cg->res[i].misc_cg_free)
> +			cg->res[i].misc_cg_free(cg);
> +
> +	kfree(cg);
>  }
>  
>  /* Cgroup controller callbacks */
> -- 
> 2.25.1

BR, Jarkko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
       [not found]     ` <20230913040635.28815-3-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13  9:43       ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13  9:43 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> The SGX driver will need to get access to the root misc_cg object
> to do iterative walks and also determine if a charge will be
> towards the root cgroup or not.

What is "a charge" and why does SGX driver need to do iterative walks?
Neither is explained here.

> To manage the SGX EPC memory via the misc controller, the SGX
> driver will also need to be able to iterate over the misc cgroup
> hierarchy.

Ambiguous language: misc_cg vs "misc controller". Are they different
types of objects? If not, then stick to misc_cg everywhere.

> Move parent_misc() into misc_cgroup.h and make inline to make this
> function available to SGX, rename it to misc_cg_parent(), and update
> misc.c to use the new name.

net/rxrpc/misc.c?

The point being that plain "misc.c" is ambiguous.

> Add per resource type private data so that SGX can store additional
> per cgroup data with the misc_cg struct.

Yet another term "misc cg struct", and not just "misc_cg" like in the
first paragraph.

>
> Allow SGX EPC memory to be a valid resource type for the misc
> controller.
>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> ---
> V4:
> - Moved this to the second in the series.
> ---
>  include/linux/misc_cgroup.h | 29 +++++++++++++++++++++++++++++
>  kernel/cgroup/misc.c        | 25 ++++++++++++-------------
>  2 files changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index e1bcd176c2de..6f8330f435ba 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -17,6 +17,10 @@ enum misc_res_type {
>  	MISC_CG_RES_SEV,
>  	/* AMD SEV-ES ASIDs resource */
>  	MISC_CG_RES_SEV_ES,
> +#endif
> +#ifdef CONFIG_CGROUP_SGX_EPC
> +	/* SGX EPC memory resource */
> +	MISC_CG_RES_SGX_EPC,
>  #endif
>  	MISC_CG_RES_TYPES
>  };
> @@ -37,6 +41,7 @@ struct misc_res {
>  	u64 max;
>  	atomic64_t usage;
>  	atomic64_t events;
> +	void *priv;
>  
>  	/* per resource callback ops */
>  	int (*misc_cg_alloc)(struct misc_cg *cg);
> @@ -59,6 +64,7 @@ struct misc_cg {
>  	struct misc_res res[MISC_CG_RES_TYPES];
>  };
>  
> +struct misc_cg *misc_cg_root(void);
>  u64 misc_cg_res_total_usage(enum misc_res_type type);
>  int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
>  int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
> @@ -78,6 +84,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
>  	return css ? container_of(css, struct misc_cg, css) : NULL;
>  }
>  
> +/**
> + * misc_cg_parent() - Get the parent of the passed misc cgroup.
> + * @cgroup: cgroup whose parent needs to be fetched.
> + *
> + * Context: Any context.
> + * Return:
> + * * struct misc_cg* - Parent of the @cgroup.
> + * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
> + */
> +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
> +{
> +	return cgroup ? css_misc(cgroup->css.parent) : NULL;
> +}
> +
>  /*
>   * get_current_misc_cg() - Find and get the misc cgroup of the current task.
>   *
> @@ -102,6 +122,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
>  }
>  
>  #else /* !CONFIG_CGROUP_MISC */
> +static inline struct misc_cg *misc_cg_root(void)
> +{
> +	return NULL;
> +}
> +
> +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
> +{
> +	return NULL;
> +}
>  
>  static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
>  {
> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> index e0092170d0dd..dbd881be773f 100644
> --- a/kernel/cgroup/misc.c
> +++ b/kernel/cgroup/misc.c
> @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
>  	/* AMD SEV-ES ASIDs resource */
>  	"sev_es",
>  #endif
> +#ifdef CONFIG_CGROUP_SGX_EPC
> +	/* Intel SGX EPC memory bytes */
> +	"sgx_epc",
> +#endif
>  };
>  
>  /* Root misc cgroup */
> @@ -40,18 +44,13 @@ static struct misc_cg root_cg;
>  static u64 misc_res_capacity[MISC_CG_RES_TYPES];
>  
>  /**
> - * parent_misc() - Get the parent of the passed misc cgroup.
> - * @cgroup: cgroup whose parent needs to be fetched.
> - *
> - * Context: Any context.
> - * Return:
> - * * struct misc_cg* - Parent of the @cgroup.
> - * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
> + * misc_cg_root() - Return the root misc cgroup.
>   */
> -static struct misc_cg *parent_misc(struct misc_cg *cgroup)
> +struct misc_cg *misc_cg_root(void)
>  {
> -	return cgroup ? css_misc(cgroup->css.parent) : NULL;
> +	return &root_cg;
>  }
> +EXPORT_SYMBOL_GPL(misc_cg_root);
>  
>  /**
>   * valid_type() - Check if @type is valid or not.
> @@ -150,7 +149,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
>  	if (!amount)
>  		return 0;
>  
> -	for (i = cg; i; i = parent_misc(i)) {
> +	for (i = cg; i; i = misc_cg_parent(i)) {
>  		res = &i->res[type];
>  
>  		new_usage = atomic64_add_return(amount, &res->usage);
> @@ -163,12 +162,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
>  	return 0;
>  
>  err_charge:
> -	for (j = i; j; j = parent_misc(j)) {
> +	for (j = i; j; j = misc_cg_parent(j)) {
>  		atomic64_inc(&j->res[type].events);
>  		cgroup_file_notify(&j->events_file);
>  	}
>  
> -	for (j = cg; j != i; j = parent_misc(j))
> +	for (j = cg; j != i; j = misc_cg_parent(j))
>  		misc_cg_cancel_charge(type, j, amount);
>  	misc_cg_cancel_charge(type, i, amount);
>  	return ret;
> @@ -190,7 +189,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
>  	if (!(amount && valid_type(type) && cg))
>  		return;
>  
> -	for (i = cg; i; i = parent_misc(i))
> +	for (i = cg; i; i = misc_cg_parent(i))
>  		misc_cg_cancel_charge(type, i, amount);
>  }
>  EXPORT_SYMBOL_GPL(misc_cg_uncharge);
> -- 
> 2.25.1
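
For readers following the hierarchy walk in the diff above, here is a
minimal user-space sketch of the charge-and-roll-back pattern that
misc_cg_try_charge() builds on top of misc_cg_parent(). Everything named
demo_* is hypothetical and only stands in for the kernel types; the real
code uses atomics, a per-type capacity check and cgroup_file_notify(),
all of which are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct misc_cg (one resource type). */
struct demo_cg {
	struct demo_cg *parent;	/* NULL for the root */
	uint64_t max;		/* limit */
	uint64_t usage;		/* current charge */
	uint64_t events;	/* number of times the limit was hit */
};

/* Mirrors misc_cg_parent(): NULL-safe parent lookup. */
static inline struct demo_cg *demo_cg_parent(struct demo_cg *cg)
{
	return cg ? cg->parent : NULL;
}

/*
 * Charge @amount to @cg and to every ancestor. If any level would exceed
 * its limit, count an event on that level and its ancestors, undo all
 * charges made so far, and return -1. Returns 0 on success.
 */
static int demo_try_charge(struct demo_cg *cg, uint64_t amount)
{
	struct demo_cg *i, *j;

	assert(cg);

	for (i = cg; i; i = demo_cg_parent(i)) {
		i->usage += amount;
		if (i->usage > i->max)
			goto err_charge;
	}
	return 0;

err_charge:
	/* The failing level and everything above it see the event. */
	for (j = i; j; j = demo_cg_parent(j))
		j->events++;

	/* Roll back the levels below the failing one, then the failing one. */
	for (j = cg; j != i; j = demo_cg_parent(j))
		j->usage -= amount;
	i->usage -= amount;
	return -1;
}
```

With a root limited to 10 and a child limited to 100, charging 8 and then
5 to the child fails at the root on the second call and leaves both usage
counters at 8.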


BR, Jarkko

* Re: [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
       [not found]     ` <20230913040635.28815-4-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13  9:46       ` Jarkko Sakkinen
  2023-09-14 10:31       ` Huang, Kai
  1 sibling, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13  9:46 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> Introduce a data structure to wrap the existing reclaimable list and its
> spinlock. Each cgroup later will have one instance of this structure to
> track EPC pages allocated for processes associated with the same cgroup.
> Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
> from the reclaimable list in this structure when its usage reaches near
> its limit.
>
> Currently, ksgxd does not track the VA, SECS pages. They are considered
> as 'unreclaimable' pages that are only deallocated when their respective
> owning enclaves are destroyed and all associated resources released.
>
> When an EPC cgroup can not reclaim any more reclaimable EPC pages to
> reduce its usage below its limit, the cgroup must also reclaim those
> unreclaimables by killing their owning enclaves. The VA and SECS pages
> later are also tracked in an 'unreclaimable' list added to this structure
> to support this OOM killing of enclaves.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> V4:
> - Removed unneeded comments for the spinlock and the non-reclaimables.
> (Kai, Jarkko)
> - Revised the commit to add introduction comments for unreclaimables and
> multiple LRU lists.(Kai)
> - Reordered the patches: delay all changes for unreclaimables to
> later, and this one becomes the first change in the SGX subsystem.
>
> V3:
> - Removed the helper functions and revised commit messages.
> ---
>  arch/x86/kernel/cpu/sgx/sgx.h | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index d2dad21259a8..018414b2abe8 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -83,6 +83,20 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
>  	return section->virt_addr + index * PAGE_SIZE;
>  }
>  
> +/*
> + * Tracks EPC pages reclaimable by the reclaimer (ksgxd).
> + */
> +struct sgx_epc_lru_lists {
> +	spinlock_t lock;
> +	struct list_head reclaimable;
> +};
> +
> +static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
> +{
> +	spin_lock_init(&lrus->lock);
> +	INIT_LIST_HEAD(&lrus->reclaimable);
> +}
> +
>  struct sgx_epc_page *__sgx_alloc_epc_page(void);
>  void sgx_free_epc_page(struct sgx_epc_page *page);
>  
> -- 
> 2.25.1
>

Looks good but not yet time for ack'ing.

BR, Jarkko

* Re: [PATCH v4 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list
       [not found]     ` <20230913040635.28815-5-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:00       ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:00 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> All EPC pages of enclaves including VA and SECS will be tracked in

s/VA/Version Array (VA)/
s/SECS/SGX Enclave Control Structure (SECS)/

Just a nitpick, because it is always good to remind readers what these
acronyms stand for (there are so many of them in this world).

> sgx_epc_lru_lists structs, one per cgroup. For now just replace the
> existing sgx_active_page_list in the reclaimer and its spinlock with a
> global sgx_epc_lru_lists struct. VA and SECS pages are still not tracked
> at this point but they will be tracked after an unreclaimable LRU list
> is added to the sgx_epc_lru_lists struct.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> V4:
> - No change, only reordered the patch.
>
> V3:
> - Remove usage of list wrapper
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
>  1 file changed, 20 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 166692f2d501..afce51d6e94a 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -26,10 +26,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
>  
>  /*
>   * These variables are part of the state of the reclaimer, and must be accessed
> - * with sgx_reclaimer_lock acquired.
> + * with sgx_global_lru.lock acquired.
>   */
> -static LIST_HEAD(sgx_active_page_list);
> -static DEFINE_SPINLOCK(sgx_reclaimer_lock);
> +static struct sgx_epc_lru_lists sgx_global_lru;
>  
>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
> @@ -304,13 +303,13 @@ static void sgx_reclaim_pages(void)
>  	int ret;
>  	int i;
>  
> -	spin_lock(&sgx_reclaimer_lock);
> +	spin_lock(&sgx_global_lru.lock);
>  	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> -		if (list_empty(&sgx_active_page_list))
> +		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
> +						    struct sgx_epc_page, list);
> +		if (!epc_page)
>  			break;
>  
> -		epc_page = list_first_entry(&sgx_active_page_list,
> -					    struct sgx_epc_page, list);
>  		list_del_init(&epc_page->list);
>  		encl_page = epc_page->owner;
>  
> @@ -322,7 +321,7 @@ static void sgx_reclaim_pages(void)
>  			 */
>  			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>  	}
> -	spin_unlock(&sgx_reclaimer_lock);
> +	spin_unlock(&sgx_global_lru.lock);
>  
>  	for (i = 0; i < cnt; i++) {
>  		epc_page = chunk[i];
> @@ -345,9 +344,9 @@ static void sgx_reclaim_pages(void)
>  		continue;
>  
>  skip:
> -		spin_lock(&sgx_reclaimer_lock);
> -		list_add_tail(&epc_page->list, &sgx_active_page_list);
> -		spin_unlock(&sgx_reclaimer_lock);
> +		spin_lock(&sgx_global_lru.lock);
> +		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> +		spin_unlock(&sgx_global_lru.lock);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
>  
> @@ -378,7 +377,7 @@ static void sgx_reclaim_pages(void)
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
>  	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> -	       !list_empty(&sgx_active_page_list);
> +	       !list_empty(&sgx_global_lru.reclaimable);
>  }
>  
>  /*
> @@ -430,6 +429,8 @@ static bool __init sgx_page_reclaimer_init(void)
>  
>  	ksgxd_tsk = tsk;
>  
> +	sgx_lru_init(&sgx_global_lru);
> +
>  	return true;
>  }
>  
> @@ -505,10 +506,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>   */
>  void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>  {
> -	spin_lock(&sgx_reclaimer_lock);
> +	spin_lock(&sgx_global_lru.lock);
>  	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
> -	list_add_tail(&page->list, &sgx_active_page_list);
> -	spin_unlock(&sgx_reclaimer_lock);
> +	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
> +	spin_unlock(&sgx_global_lru.lock);
>  }
>  
>  /**
> @@ -523,18 +524,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>   */
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  {
> -	spin_lock(&sgx_reclaimer_lock);
> +	spin_lock(&sgx_global_lru.lock);
>  	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
>  		/* The page is being reclaimed. */
>  		if (list_empty(&page->list)) {
> -			spin_unlock(&sgx_reclaimer_lock);
> +			spin_unlock(&sgx_global_lru.lock);
>  			return -EBUSY;
>  		}
>  
>  		list_del(&page->list);
>  		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>  	}
> -	spin_unlock(&sgx_reclaimer_lock);
> +	spin_unlock(&sgx_global_lru.lock);
>  
>  	return 0;
>  }
> @@ -567,7 +568,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		if (list_empty(&sgx_active_page_list))
> +		if (list_empty(&sgx_global_lru.reclaimable))
>  			return ERR_PTR(-ENOMEM);
>  
>  		if (!reclaim) {
> -- 
> 2.25.1

Other than that looks good to me (including the commit description).
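
As an illustration of the list_first_entry_or_null() loop in
sgx_reclaim_pages(), the following is a minimal user-space sketch. The
list helpers are simplified re-implementations and the demo_* types are
hypothetical stand-ins for sgx_epc_page and sgx_epc_lru_lists; the
spinlock and the kref_get_unless_zero() check of the real code are
omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail_(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink @n and leave it self-linked, like list_del_init(). */
static void list_del_init_(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical stand-ins for sgx_epc_page and sgx_epc_lru_lists. */
struct demo_page {
	int id;
	struct list_head list;
};

struct demo_lru {
	/* The real structure also carries a spinlock; omitted here. */
	struct list_head reclaimable;
};

/* Same idea as list_first_entry_or_null(): NULL when the list is empty. */
static struct demo_page *demo_first_or_null(struct demo_lru *lru)
{
	struct list_head *first = lru->reclaimable.next;

	if (first == &lru->reclaimable)
		return NULL;
	return (struct demo_page *)((char *)first -
				    offsetof(struct demo_page, list));
}

/* Move up to @nr pages from the head of the LRU into @chunk. */
static int demo_isolate(struct demo_lru *lru, struct demo_page **chunk, int nr)
{
	int cnt = 0;

	assert(lru && chunk);

	while (cnt < nr) {
		struct demo_page *page = demo_first_or_null(lru);

		if (!page)
			break;
		list_del_init_(&page->list);
		chunk[cnt++] = page;
	}
	return cnt;
}
```

Fetching the entry and testing it for NULL replaces the separate
list_empty() check followed by list_first_entry() that the old code used.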

BR, Jarkko

* Re: [PATCH v4 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists
       [not found]     ` <20230913040635.28815-6-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:14       ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:14 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> Replace sgx_mark_page_reclaimable() and sgx_unmark_page_reclaimable()
> with sgx_record_epc_page() and sgx_drop_epc_page(). The
> sgx_record_epc_page() function adds the epc_page to the "reclaimable"
> list in the sgx_epc_lru_lists struct, while sgx_drop_epc_page() removes
> the page from the LRU list.
>
> For now, this change serves as a straightforward replacement of the two
> functions for pages tracked by the reclaimer. When the unreclaimable
> list is added to track VA and SECS pages for cgroups, these functions
> will be updated to add/remove them from the unreclaimable lists.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> V4:
> - Code update needed for patch reordering
> - Revised commit message.
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  |  8 +++++---
>  arch/x86/kernel/cpu/sgx/ioctl.c | 10 ++++++----
>  arch/x86/kernel/cpu/sgx/main.c  | 22 ++++++++++++----------
>  arch/x86/kernel/cpu/sgx/sgx.h   |  4 ++--
>  4 files changed, 25 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..f84ee2eeb058 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -272,7 +272,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  		return ERR_CAST(epc_page);
>  
>  	encl->secs_child_cnt++;
> -	sgx_mark_page_reclaimable(entry->epc_page);
> +	sgx_record_epc_page(epc_page,
> +			    SGX_EPC_PAGE_RECLAIMER_TRACKED);

	sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);

... this fits in even less than 80 characters (100 is the max these days)

>  
>  	return entry;
>  }
> @@ -398,7 +399,8 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  	encl_page->type = SGX_PAGE_TYPE_REG;
>  	encl->secs_child_cnt++;
>  
> -	sgx_mark_page_reclaimable(encl_page->epc_page);
> +	sgx_record_epc_page(epc_page,
> +			    SGX_EPC_PAGE_RECLAIMER_TRACKED);

Ditto.

>  
>  	phys_addr = sgx_get_epc_phys_addr(epc_page);
>  	/*
> @@ -714,7 +716,7 @@ void sgx_encl_release(struct kref *ref)
>  			 * The page and its radix tree entry cannot be freed
>  			 * if the page is being held by the reclaimer.
>  			 */
> -			if (sgx_unmark_page_reclaimable(entry->epc_page))
> +			if (sgx_drop_epc_page(entry->epc_page))
>  				continue;
>  
>  			sgx_encl_free_epc_page(entry->epc_page);
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 5d390df21440..0d79dec408af 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -322,7 +322,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
>  			goto err_out;
>  	}
>  
> -	sgx_mark_page_reclaimable(encl_page->epc_page);
> +	sgx_record_epc_page(epc_page,
> +			    SGX_EPC_PAGE_RECLAIMER_TRACKED);

Ditto.

>  	mutex_unlock(&encl->lock);
>  	mmap_read_unlock(current->mm);
>  	return ret;
> @@ -961,7 +962,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
>  			 * Prevent page from being reclaimed while mutex
>  			 * is released.
>  			 */
> -			if (sgx_unmark_page_reclaimable(entry->epc_page)) {
> +			if (sgx_drop_epc_page(entry->epc_page)) {
>  				ret = -EAGAIN;
>  				goto out_entry_changed;
>  			}
> @@ -976,7 +977,8 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
>  
>  			mutex_lock(&encl->lock);
>  
> -			sgx_mark_page_reclaimable(entry->epc_page);
> +			sgx_record_epc_page(entry->epc_page,
> +					    SGX_EPC_PAGE_RECLAIMER_TRACKED);

Ditto.

>  		}
>  
>  		/* Change EPC type */
> @@ -1133,7 +1135,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>  			goto out_unlock;
>  		}
>  
> -		if (sgx_unmark_page_reclaimable(entry->epc_page)) {
> +		if (sgx_drop_epc_page(entry->epc_page)) {
>  			ret = -EBUSY;
>  			goto out_unlock;
>  		}
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index afce51d6e94a..dec1d57cbff6 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -268,7 +268,6 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  			goto out;
>  
>  		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
> -
>  		sgx_encl_free_epc_page(encl->secs.epc_page);
>  		encl->secs.epc_page = NULL;
>  
> @@ -498,31 +497,34 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  }
>  
>  /**
> - * sgx_mark_page_reclaimable() - Mark a page as reclaimable
> + * sgx_record_epc_page() - Add a page to the appropriate LRU list
>   * @page:	EPC page
> + * @flags:	The type of page that is being recorded
>   *
> - * Mark a page as reclaimable and add it to the active page list. Pages
> - * are automatically removed from the active list when freed.
> + * Mark a page with the specified flags and add it to the appropriate
> + * list.
>   */
> -void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
> +void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
> -	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
> +	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	page->flags |= flags;
> +	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
> +		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
>  	spin_unlock(&sgx_global_lru.lock);
>  }
>  
>  /**
> - * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
> + * sgx_drop_epc_page() - Remove a page from a LRU list
>   * @page:	EPC page
>   *
> - * Clear the reclaimable flag and remove the page from the active page list.
> + * Clear the reclaimable flag if set and remove the page from its LRU.
>   *
>   * Return:
>   *   0 on success,
>   *   -EBUSY if the page is in the process of being reclaimed
>   */
> -int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> +int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
>  	spin_lock(&sgx_global_lru.lock);
>  	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 018414b2abe8..113d930fd087 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -101,8 +101,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void);
>  void sgx_free_epc_page(struct sgx_epc_page *page);
>  
>  void sgx_reclaim_direct(void);
> -void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
> -int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
> +void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
> +int sgx_drop_epc_page(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
>  
>  void sgx_ipi_cb(void *info);
> -- 
> 2.25.1
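
To make the record/drop contract described in the commit message
concrete, here is a minimal user-space sketch. The demo_* names and the
on_list flag are hypothetical; on_list replaces the kernel's
list_empty(&page->list) test, and the locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_TRACKED	0x1UL	/* stands in for SGX_EPC_PAGE_RECLAIMER_TRACKED */

/* Hypothetical stand-in for struct sgx_epc_page. */
struct demo_page {
	unsigned long flags;
	bool on_list;		/* true while linked into the reclaimable list */
};

/* Like sgx_record_epc_page(): set the flags and queue the page if tracked. */
static void demo_record(struct demo_page *page, unsigned long flags)
{
	assert(!(page->flags & DEMO_TRACKED));	/* WARN_ON_ONCE() in the patch */
	page->flags |= flags;
	if (flags & DEMO_TRACKED)
		page->on_list = true;
}

/* The reclaimer takes a page off the list but leaves the flag set. */
static void demo_isolate(struct demo_page *page)
{
	page->on_list = false;
}

/*
 * Like sgx_drop_epc_page(): a tracked page that is no longer on the list
 * is in the middle of being reclaimed, so the caller gets -EBUSY.
 */
static int demo_drop(struct demo_page *page)
{
	if (page->flags & DEMO_TRACKED) {
		if (!page->on_list)
			return -EBUSY;
		page->on_list = false;
		page->flags &= ~DEMO_TRACKED;
	}
	return 0;
}
```

The -EBUSY case is what callers such as sgx_encl_release() and
sgx_encl_remove_pages() react to in the diff above.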

BR, Jarkko

* Re: [PATCH v4 06/18] x86/sgx: Introduce EPC page states
       [not found]     ` <20230913040635.28815-7-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:15       ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:15 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> Use the lower 3 bits in the flags field of sgx_epc_page struct to
> track EPC states in its life cycle and define an enum for possible
> states. More state(s) will be added later.
>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> ---
> V4:
> - No changes other than required for patch reordering.
>
> V3:
> - This is new in V3 to replace the bit mask based approach (requested by Jarkko)
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 14 +++++++---
>  arch/x86/kernel/cpu/sgx/ioctl.c |  7 +++--
>  arch/x86/kernel/cpu/sgx/main.c  | 19 +++++++------
>  arch/x86/kernel/cpu/sgx/sgx.h   | 49 ++++++++++++++++++++++++++++++---
>  4 files changed, 71 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index f84ee2eeb058..d11d4111aa98 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -244,8 +244,12 @@ static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
>  {
>  	struct sgx_epc_page *epc_page = encl->secs.epc_page;
>  
> -	if (!epc_page)
> +	if (!epc_page) {
>  		epc_page = sgx_encl_eldu(&encl->secs, NULL);
> +		if (!IS_ERR(epc_page))
> +			sgx_record_epc_page(epc_page,
> +					    SGX_EPC_PAGE_UNRECLAIMABLE);

			sgx_record_epc_page(epc_page, SGX_EPC_PAGE_UNRECLAIMABLE);

> +	}
>  
>  	return epc_page;
>  }
> @@ -273,7 +277,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>  
>  	encl->secs_child_cnt++;
>  	sgx_record_epc_page(epc_page,
> -			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +			    SGX_EPC_PAGE_RECLAIMABLE);
>  
>  	return entry;
>  }
> @@ -400,7 +404,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>  	encl->secs_child_cnt++;
>  
>  	sgx_record_epc_page(epc_page,
> -			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +			    SGX_EPC_PAGE_RECLAIMABLE);
>  
>  	phys_addr = sgx_get_epc_phys_addr(epc_page);
>  	/*
> @@ -1258,6 +1262,8 @@ struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
>  		sgx_encl_free_epc_page(epc_page);
>  		return ERR_PTR(-EFAULT);
>  	}
> +	sgx_record_epc_page(epc_page,
> +			    SGX_EPC_PAGE_UNRECLAIMABLE);
>  
>  	return epc_page;
>  }
> @@ -1317,7 +1323,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
>  {
>  	int ret;
>  
> -	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
>  
>  	ret = __eremove(sgx_get_epc_virt_addr(page));
>  	if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 0d79dec408af..c28f074d5d71 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
>  	encl->attributes = secs->attributes;
>  	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
>  
> +	sgx_record_epc_page(encl->secs.epc_page,
> +			    SGX_EPC_PAGE_UNRECLAIMABLE);
> +
>  	/* Set only after completion, as encl->lock has not been taken. */
>  	set_bit(SGX_ENCL_CREATED, &encl->flags);
>  
> @@ -323,7 +326,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
>  	}
>  
>  	sgx_record_epc_page(epc_page,
> -			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +			    SGX_EPC_PAGE_RECLAIMABLE);
>  	mutex_unlock(&encl->lock);
>  	mmap_read_unlock(current->mm);
>  	return ret;
> @@ -978,7 +981,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
>  			mutex_lock(&encl->lock);
>  
>  			sgx_record_epc_page(entry->epc_page,
> -					    SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +					    SGX_EPC_PAGE_RECLAIMABLE);
>  		}
>  
>  		/* Change EPC type */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index dec1d57cbff6..b26860399402 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -318,7 +318,7 @@ static void sgx_reclaim_pages(void)
>  			/* The owner is freeing the page. No need to add the
>  			 * page back to the list of reclaimable pages.
>  			 */
> -			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +			sgx_epc_page_reset_state(epc_page);
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> @@ -344,6 +344,7 @@ static void sgx_reclaim_pages(void)
>  
>  skip:
>  		spin_lock(&sgx_global_lru.lock);
> +		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
>  		spin_unlock(&sgx_global_lru.lock);
>  
> @@ -367,7 +368,7 @@ static void sgx_reclaim_pages(void)
>  		sgx_reclaimer_write(epc_page, &backing[i]);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> -		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +		sgx_epc_page_reset_state(epc_page);
>  
>  		sgx_free_epc_page(epc_page);
>  	}
> @@ -507,9 +508,9 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
>  	page->flags |= flags;
> -	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
> +	if (sgx_epc_page_reclaimable(flags))
>  		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
>  	spin_unlock(&sgx_global_lru.lock);
>  }
> @@ -527,7 +528,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
>  	spin_lock(&sgx_global_lru.lock);
> -	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> +	if (sgx_epc_page_reclaimable(page->flags)) {
>  		/* The page is being reclaimed. */
>  		if (list_empty(&page->list)) {
>  			spin_unlock(&sgx_global_lru.lock);
> @@ -535,7 +536,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  		}
>  
>  		list_del(&page->list);
> -		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +		sgx_epc_page_reset_state(page);
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> @@ -607,6 +608,8 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>  	struct sgx_numa_node *node = section->node;
>  
> +	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
> +
>  	spin_lock(&node->lock);
>  
>  	page->owner = NULL;
> @@ -614,7 +617,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  		list_add(&page->list, &node->sgx_poison_page_list);
>  	else
>  		list_add_tail(&page->list, &node->free_page_list);
> -	page->flags = SGX_EPC_PAGE_IS_FREE;
> +	page->flags = SGX_EPC_PAGE_FREE;
>  
>  	spin_unlock(&node->lock);
>  	atomic_long_inc(&sgx_nr_free_pages);
> @@ -715,7 +718,7 @@ int arch_memory_failure(unsigned long pfn, int flags)
>  	 * If the page is on a free list, move it to the per-node
>  	 * poison page list.
>  	 */
> -	if (page->flags & SGX_EPC_PAGE_IS_FREE) {
> +	if (page->flags == SGX_EPC_PAGE_FREE) {
>  		list_move(&page->list, &node->sgx_poison_page_list);
>  		goto out;
>  	}
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 113d930fd087..2faeb40b345f 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -23,11 +23,36 @@
>  #define SGX_NR_LOW_PAGES		32
>  #define SGX_NR_HIGH_PAGES		64
>  
> -/* Pages, which are being tracked by the page reclaimer. */
> -#define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
> +enum sgx_epc_page_state {
> +	/* Not tracked by the reclaimer:
> +	 * Pages allocated for virtual EPC which are never tracked by the host
> +	 * reclaimer; pages just allocated from free list but not yet put in
> +	 * use; pages just reclaimed, but not yet returned to the free list.
> +	 * Becomes FREE after sgx_free_epc()
> +	 * Becomes RECLAIMABLE or UNRECLAIMABLE after sgx_record_epc()
> +	 */
> +	SGX_EPC_PAGE_NOT_TRACKED = 0,
> +
> +	/* Page is in the free list, ready for allocation
> +	 * Becomes NOT_TRACKED after sgx_alloc_epc_page()
> +	 */
> +	SGX_EPC_PAGE_FREE = 1,
> +
> +	/* Page is in use and tracked in a reclaimable LRU list
> +	 * Becomes NOT_TRACKED after sgx_drop_epc()
> +	 */
> +	SGX_EPC_PAGE_RECLAIMABLE = 2,
> +
> +	/* Page is in use but tracked in an unreclaimable LRU list. These are
> +	 * only reclaimable when the whole enclave is OOM killed or the enclave
> +	 * is released, e.g., VA, SECS pages
> +	 * Becomes NOT_TRACKED after sgx_drop_epc()
> +	 */
> +	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
>  
> -/* Pages on free list */
> -#define SGX_EPC_PAGE_IS_FREE		BIT(1)
> +};
> +
> +#define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
>  
>  struct sgx_epc_page {
>  	unsigned int section;
> @@ -37,6 +62,22 @@ struct sgx_epc_page {
>  	struct list_head list;
>  };
>  
> +static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
> +{
> +	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
> +}
> +
> +static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned long flags)
> +{
> +	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
> +	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
> +}
> +
> +static inline bool sgx_epc_page_reclaimable(unsigned long flags)
> +{
> +	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
> +}
> +
>  /*
>   * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
>   * the free page list local to the node is stored here.
> -- 
> 2.25.1

BR, Jarkko
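
For illustration only, the state encoding in the patch quoted above can be
sketched in user-space C. This is a simplified model, not the kernel code:
the epc_page_* names, the plain 0x7 mask standing in for GENMASK(2, 0), and
EPC_PAGE_OTHER_FLAG are all assumptions made for the example. It shows that
setting or resetting the state touches only the low three bits and leaves
any other flag bits alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as the enum in the quoted patch; they live in the low bits. */
enum epc_page_state {
	EPC_PAGE_NOT_TRACKED   = 0,
	EPC_PAGE_FREE          = 1,
	EPC_PAGE_RECLAIMABLE   = 2,
	EPC_PAGE_UNRECLAIMABLE = 3,
};

/* User-space stand-in for GENMASK(2, 0): bits 2..0 set. */
#define EPC_PAGE_STATE_MASK 0x7u

/* Hypothetical flag outside the state field, used to show it is preserved. */
#define EPC_PAGE_OTHER_FLAG (1u << 3)

struct epc_page {
	uint16_t flags;
};

/* Clear the state field, which leaves the page in NOT_TRACKED (0). */
static inline void epc_page_reset_state(struct epc_page *page)
{
	page->flags &= (uint16_t)~EPC_PAGE_STATE_MASK;
}

/* Replace the state field with a new state, keeping the other bits. */
static inline void epc_page_set_state(struct epc_page *page, unsigned int state)
{
	assert((state & ~EPC_PAGE_STATE_MASK) == 0);
	page->flags &= (uint16_t)~EPC_PAGE_STATE_MASK;
	page->flags |= (uint16_t)(state & EPC_PAGE_STATE_MASK);
}

/* Extract the state field from the flags word. */
static inline unsigned int epc_page_state(const struct epc_page *page)
{
	return page->flags & EPC_PAGE_STATE_MASK;
}

/* A state is a value, not a bit, so compare after masking. */
static inline bool epc_page_reclaimable(const struct epc_page *page)
{
	return epc_page_state(page) == EPC_PAGE_RECLAIMABLE;
}
```

Because the states are enumerated values rather than independent bits, a
test such as (flags & RECLAIMABLE) would also match UNRECLAIMABLE (3), which
is why the comparison is done on the masked field.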

* Re: [PATCH v4 08/18] x86/sgx: Use a list to track to-be-reclaimed pages
       [not found]     ` <20230913040635.28815-9-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:30       ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:30 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> Change sgx_reclaim_pages() to use a list rather than an array for
> storing the epc_pages which will be reclaimed. This change is needed
> to transition to the LRU implementation for EPC cgroup support.
>
> When the EPC cgroup is implemented, the reclaiming process will do a
> pre-order tree walk for the subtree starting from the limit-violating
> cgroup.  When each node is visited, candidate pages are selected from
> its "reclaimable" LRU list and moved into this temporary list. Passing a
> list from node to node for temporary storage in this walk is more
> straightforward than using an array.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> V4:
> - Changes needed for patch reordering
> - Revised commit message
>
> V3:
> - Removed list wrappers
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 40 +++++++++++++++-------------------
>  1 file changed, 18 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index c1ae19a154d0..fba06dc5abfe 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -293,12 +293,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   */
>  static void sgx_reclaim_pages(void)
>  {
> -	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>  	struct sgx_backing backing[SGX_NR_TO_SCAN];
> +	struct sgx_epc_page *epc_page, *tmp;
>  	struct sgx_encl_page *encl_page;
> -	struct sgx_epc_page *epc_page;
>  	pgoff_t page_index;
> -	int cnt = 0;
> +	LIST_HEAD(iso);
>  	int ret;
>  	int i;
>  
> @@ -314,18 +313,22 @@ static void sgx_reclaim_pages(void)
>  
>  		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
>  			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> -			chunk[cnt++] = epc_page;
> +			list_move_tail(&epc_page->list, &iso);
>  		} else {
> -			/* The owner is freeing the page. No need to add the
> -			 * page back to the list of reclaimable pages.
> +			/* The owner is freeing the page, remove it from the
> +			 * LRU list
>  			 */
>  			sgx_epc_page_reset_state(epc_page);
> +			list_del_init(&epc_page->list);
>  		}
>  	}
>  	spin_unlock(&sgx_global_lru.lock);
>  
> -	for (i = 0; i < cnt; i++) {
> -		epc_page = chunk[i];
> +	if (list_empty(&iso))
> +		return;
> +
> +	i = 0;
> +	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>  		encl_page = epc_page->owner;
>  
>  		if (!sgx_reclaimer_age(epc_page))
> @@ -340,6 +343,7 @@ static void sgx_reclaim_pages(void)
>  			goto skip;
>  		}
>  
> +		i++;
>  		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
>  		mutex_unlock(&encl_page->encl->lock);
>  		continue;
> @@ -347,27 +351,19 @@ static void sgx_reclaim_pages(void)
>  skip:
>  		spin_lock(&sgx_global_lru.lock);
>  		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
> -		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> +		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
>  		spin_unlock(&sgx_global_lru.lock);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> -
> -		chunk[i] = NULL;
> -	}
> -
> -	for (i = 0; i < cnt; i++) {
> -		epc_page = chunk[i];
> -		if (epc_page)
> -			sgx_reclaimer_block(epc_page);
>  	}
>  
> -	for (i = 0; i < cnt; i++) {
> -		epc_page = chunk[i];
> -		if (!epc_page)
> -			continue;
> +	list_for_each_entry(epc_page, &iso, list)
> +		sgx_reclaimer_block(epc_page);
>  
> +	i = 0;
> +	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>  		encl_page = epc_page->owner;
> -		sgx_reclaimer_write(epc_page, &backing[i]);
> +		sgx_reclaimer_write(epc_page, &backing[i++]);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
>  		sgx_epc_page_reset_state(epc_page);
> -- 
> 2.25.1

LGTM

BR, Jarkko
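
As a rough user-space sketch of the array-to-list change described in the
commit message above (not the kernel implementation): pages are first moved
from the LRU onto a temporary isolation list, and a second pass walks that
list with a saved next pointer so entries can be moved back to the LRU
during the walk. The list helpers below are simplified stand-ins for
<linux/list.h>, and struct page, its busy field, and the function names are
assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, modeled loosely on the kernel's. */
struct list_node {
	struct list_node *prev, *next;
};

static inline void list_init(struct list_node *h)
{
	h->prev = h->next = h;
}

static inline int list_is_empty(const struct list_node *h)
{
	return h->next == h;
}

static inline void list_unlink(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static inline void list_append(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static inline void list_move_to_tail(struct list_node *n, struct list_node *h)
{
	list_unlink(n);
	list_append(n, h);
}

/* Hypothetical page: 'busy' models a page the reclaimer decides to skip. */
struct page {
	struct list_node list;
	int id;
	int busy;
};

#define page_of(node) \
	((struct page *)((char *)(node) - offsetof(struct page, list)))

/* Pass 1: move up to nr_to_scan pages from the LRU head onto 'iso'. */
static inline int isolate_pages(struct list_node *lru, struct list_node *iso,
				int nr_to_scan)
{
	int n = 0;

	assert(nr_to_scan >= 0);
	while (n < nr_to_scan && !list_is_empty(lru)) {
		list_move_to_tail(lru->next, iso);
		n++;
	}
	return n;
}

/* Pass 2: return skipped pages to the LRU tail; count the ones kept. */
static inline int filter_isolated(struct list_node *lru, struct list_node *iso)
{
	struct list_node *pos, *tmp;
	int kept = 0;

	/* 'tmp' is saved before 'pos' may be moved to another list. */
	for (pos = iso->next, tmp = pos->next; pos != iso;
	     pos = tmp, tmp = pos->next) {
		if (page_of(pos)->busy)
			list_move_to_tail(pos, lru);
		else
			kept++;
	}
	return kept;
}
```

With a list, a skipped page simply leaves the isolation list, so later
passes need no NULL checks like the chunk[i] = NULL bookkeeping removed by
the patch, and the same list head can be handed from one cgroup node to the
next.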

* Re: [PATCH v4 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages
       [not found]     ` <20230913040635.28815-10-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:31       ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:31 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> In a later patch, when a cgroup has exceeded the max capacity for EPC
> pages, it may need to identify and OOM kill a less active enclave to
> make room for other enclaves within the same group. Such a victim
> enclave would have no active pages other than the unreclaimable Version
> Array (VA) and SECS pages.  Therefore, the cgroup needs to examine its
> unreclaimable page list and find the enclave that owns a given SECS or
> VA page. This requires a backpointer from a page to an enclave,
> which is not available for VA pages.
>
> Because struct sgx_epc_page instances of VA pages are not owned by an
> sgx_encl_page instance, mark their owner as sgx_encl: pass the struct
> sgx_encl of the enclave allocating the VA page to sgx_alloc_epc_page(),
> which will store this value in the owner field of the struct
> sgx_epc_page.  In a later patch, VA pages will be placed in an
> unreclaimable queue that can be examined by the cgroup to select the OOM
> killed enclave.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> V4:
> - Changes needed for patch reordering
> - Revised commit messages (Jarkko)
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  |  5 +++--
>  arch/x86/kernel/cpu/sgx/encl.h  |  2 +-
>  arch/x86/kernel/cpu/sgx/ioctl.c |  2 +-
>  arch/x86/kernel/cpu/sgx/main.c  | 20 ++++++++++----------
>  arch/x86/kernel/cpu/sgx/sgx.h   |  5 ++++-
>  5 files changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index d11d4111aa98..1aee0ad00e66 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -1238,6 +1238,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
>  
>  /**
>   * sgx_alloc_va_page() - Allocate a Version Array (VA) page
> + * @encl:    The enclave that this page is allocated to.

Maybe this would be clearer:

* @encl:	The new owner of the page

>   * @reclaim: Reclaim EPC pages directly if none available. Enclave
>   *           mutex should not be held if this is set.
>   *
> @@ -1247,12 +1248,12 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
>   *   a VA page,
>   *   -errno otherwise
>   */
> -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
> +struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
>  {
>  	struct sgx_epc_page *epc_page;
>  	int ret;
>  
> -	epc_page = sgx_alloc_epc_page(NULL, reclaim);
> +	epc_page = sgx_alloc_epc_page(encl, reclaim);
>  	if (IS_ERR(epc_page))
>  		return ERR_CAST(epc_page);
>  
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..831d63f80f5a 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -116,7 +116,7 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
>  					  unsigned long offset,
>  					  u64 secinfo_flags);
>  void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
> -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
> +struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim);
>  unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
>  void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
>  bool sgx_va_page_full(struct sgx_va_page *va_page);
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index c28f074d5d71..3ab8c050e665 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -30,7 +30,7 @@ struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
>  		if (!va_page)
>  			return ERR_PTR(-ENOMEM);
>  
> -		va_page->epc_page = sgx_alloc_va_page(reclaim);
> +		va_page->epc_page = sgx_alloc_va_page(encl, reclaim);
>  		if (IS_ERR(va_page->epc_page)) {
>  			err = ERR_CAST(va_page->epc_page);
>  			kfree(va_page);
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index fba06dc5abfe..ed813288af44 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -107,7 +107,7 @@ static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
>  
>  static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
>  {
> -	struct sgx_encl_page *page = epc_page->owner;
> +	struct sgx_encl_page *page = epc_page->encl_page;
>  	struct sgx_encl *encl = page->encl;
>  	struct sgx_encl_mm *encl_mm;
>  	bool ret = true;
> @@ -139,7 +139,7 @@ static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
>  
>  static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
>  {
> -	struct sgx_encl_page *page = epc_page->owner;
> +	struct sgx_encl_page *page = epc_page->encl_page;
>  	unsigned long addr = page->desc & PAGE_MASK;
>  	struct sgx_encl *encl = page->encl;
>  	int ret;
> @@ -196,7 +196,7 @@ void sgx_ipi_cb(void *info)
>  static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
>  			 struct sgx_backing *backing)
>  {
> -	struct sgx_encl_page *encl_page = epc_page->owner;
> +	struct sgx_encl_page *encl_page = epc_page->encl_page;
>  	struct sgx_encl *encl = encl_page->encl;
>  	struct sgx_va_page *va_page;
>  	unsigned int va_offset;
> @@ -249,7 +249,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
>  static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  				struct sgx_backing *backing)
>  {
> -	struct sgx_encl_page *encl_page = epc_page->owner;
> +	struct sgx_encl_page *encl_page = epc_page->encl_page;
>  	struct sgx_encl *encl = encl_page->encl;
>  	struct sgx_backing secs_backing;
>  	int ret;
> @@ -309,7 +309,7 @@ static void sgx_reclaim_pages(void)
>  			break;
>  
>  		list_del_init(&epc_page->list);
> -		encl_page = epc_page->owner;
> +		encl_page = epc_page->encl_page;
>  
>  		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
>  			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> @@ -329,7 +329,7 @@ static void sgx_reclaim_pages(void)
>  
>  	i = 0;
>  	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
> -		encl_page = epc_page->owner;
> +		encl_page = epc_page->encl_page;
>  
>  		if (!sgx_reclaimer_age(epc_page))
>  			goto skip;
> @@ -362,7 +362,7 @@ static void sgx_reclaim_pages(void)
>  
>  	i = 0;
>  	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
> -		encl_page = epc_page->owner;
> +		encl_page = epc_page->encl_page;
>  		sgx_reclaimer_write(epc_page, &backing[i++]);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> @@ -562,7 +562,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
>  		if (!IS_ERR(page)) {
> -			page->owner = owner;
> +			page->encl_page = owner;
>  			break;
>  		}
>  
> @@ -607,7 +607,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  
>  	spin_lock(&node->lock);
>  
> -	page->owner = NULL;
> +	page->encl_page = NULL;
>  	if (page->poison)
>  		list_add(&page->list, &node->sgx_poison_page_list);
>  	else
> @@ -642,7 +642,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  	for (i = 0; i < nr_pages; i++) {
>  		section->pages[i].section = index;
>  		section->pages[i].flags = 0;
> -		section->pages[i].owner = NULL;
> +		section->pages[i].encl_page = NULL;
>  		section->pages[i].poison = 0;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 764cec23f4e5..c75ddc7168fa 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -68,7 +68,10 @@ struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
>  	u16 poison;
> -	struct sgx_encl_page *owner;

	/* possible owner types */
> +	union {
> +		struct sgx_encl_page *encl_page;
> +		struct sgx_encl *encl;
> +	};
>  	struct list_head list;
>  };
>  
> -- 
> 2.25.1

BR, Jarkko
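
To make the owner union in the patch quoted above concrete, here is a small
user-space sketch, offered only as an illustration. The enum owner_kind
discriminant is an assumption for the example (a later patch in this series
uses SGX_EPC_OWNER_PAGE and SGX_EPC_OWNER_ENCL bits in the page flags for a
similar purpose), and the struct and function names are hypothetical. It
shows how an enclave can be found from either a regular page or a VA page.

```c
#include <assert.h>
#include <stddef.h>

struct encl {
	int id;
};

struct encl_page {
	struct encl *encl;	/* backpointer from an enclave page to its enclave */
};

/* Hypothetical discriminant telling which union member is valid. */
enum owner_kind {
	OWNER_NONE = 0,
	OWNER_ENCL_PAGE,	/* regular page: owned by an encl_page */
	OWNER_ENCL,		/* VA page: owned directly by the enclave */
};

struct epc_page {
	enum owner_kind kind;
	/* possible owner types; both members share the same storage */
	union {
		struct encl_page *encl_page;
		struct encl *encl;
	};
};

/* Resolve the owning enclave regardless of the owner type. */
static inline struct encl *epc_page_to_encl(const struct epc_page *page)
{
	assert(page != NULL);

	switch (page->kind) {
	case OWNER_ENCL_PAGE:
		return page->encl_page ? page->encl_page->encl : NULL;
	case OWNER_ENCL:
		return page->encl;
	default:
		return NULL;
	}
}
```

Without some discriminant, a reader of the union cannot tell whether the
stored pointer is an encl_page or an encl, so every consumer has to know the
page type from context.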

* Re: [PATCH v4 11/18] x86/sgx: store unreclaimable pages in LRU lists
       [not found]     ` <20230913040635.28815-12-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:33       ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:33 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>
> When an OOM event occurs, all pages associated with an enclave will need
> to be freed, including pages that are not currently tracked by the
> cgroup LRU lists.
>
> Add a new "unreclaimable" list to the sgx_epc_lru_lists struct and
> update the "sgx_record/drop_epc_pages()" functions for adding/removing
> VA and SECS pages to/from this "unreclaimable" list.
>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> ---
> V4:
> - Updates for patch reordering.
> - Revised commit messages.
> - Revised comments for the list.
>
> V3:
> - Removed tracking virtual EPC pages in unreclaimable list as host
> kernel does not reclaim them. The EPC cgroup implemented later only
> blocks allocation for a guest if the limit is reached, by returning
> -ENOMEM from sgx_alloc_epc_page() called by virt_epc, and does nothing
> else. Therefore, there is no need to track those in LRU lists.
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 2 ++
>  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>  arch/x86/kernel/cpu/sgx/main.c  | 3 +++
>  arch/x86/kernel/cpu/sgx/sgx.h   | 8 +++++++-
>  4 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 91f83a5e543d..bf0ac3677ca8 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -748,6 +748,7 @@ void sgx_encl_release(struct kref *ref)
>  	xa_destroy(&encl->page_array);
>  
>  	if (!encl->secs_child_cnt && encl->secs.epc_page) {
> +		sgx_drop_epc_page(encl->secs.epc_page);
>  		sgx_encl_free_epc_page(encl->secs.epc_page);
>  		encl->secs.epc_page = NULL;
>  	}
> @@ -756,6 +757,7 @@ void sgx_encl_release(struct kref *ref)
>  		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
>  					   list);
>  		list_del(&va_page->list);
> +		sgx_drop_epc_page(va_page->epc_page);
>  		sgx_encl_free_epc_page(va_page->epc_page);
>  		kfree(va_page);
>  	}
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 95ec20a6992f..8c23bb524674 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -48,6 +48,7 @@ void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
>  	encl->page_cnt--;
>  
>  	if (va_page) {
> +		sgx_drop_epc_page(va_page->epc_page);
>  		sgx_encl_free_epc_page(va_page->epc_page);
>  		list_del(&va_page->list);
>  		kfree(va_page);
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index ed813288af44..f3a3ed894616 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -268,6 +268,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  			goto out;
>  
>  		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
> +		sgx_drop_epc_page(encl->secs.epc_page);
>  		sgx_encl_free_epc_page(encl->secs.epc_page);
>  		encl->secs.epc_page = NULL;
>  
> @@ -510,6 +511,8 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  	page->flags |= flags;
>  	if (sgx_epc_page_reclaimable(flags))
>  		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
> +	else
> +		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
>  	spin_unlock(&sgx_global_lru.lock);
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index e06b4aadb6a1..e210af77f0cf 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -150,17 +150,23 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
>  }
>  
>  /*
> - * Tracks EPC pages reclaimable by the reclaimer (ksgxd).
> + * Contains EPC pages tracked by the reclaimer (ksgxd).
>   */
>  struct sgx_epc_lru_lists {
>  	spinlock_t lock;
>  	struct list_head reclaimable;
> +	/*
> +	 * Tracks SECS, VA pages, etc.: pages only freeable after all their
> +	 * dependent reclaimable pages are freed.
> +	 */
> +	struct list_head unreclaimable;
>  };
>  
>  static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
>  {
>  	spin_lock_init(&lrus->lock);
>  	INIT_LIST_HEAD(&lrus->reclaimable);
> +	INIT_LIST_HEAD(&lrus->unreclaimable);
>  }
>  
>  struct sgx_epc_page *__sgx_alloc_epc_page(void);
> -- 
> 2.25.1

LGTM

BR, Jarkko
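
The record/drop behaviour added by the patch quoted above can be modeled in
user-space C roughly as follows. This is a sketch under stated assumptions,
not the kernel code: locking is omitted (the patch takes the LRU spinlock),
the list helpers are simplified stand-ins for <linux/list.h>, and all names
here are hypothetical.

```c
#include <assert.h>

struct list_node {
	struct list_node *prev, *next;
};

static inline void list_init(struct list_node *h)
{
	h->prev = h->next = h;
}

static inline void list_append(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static inline void list_unlink(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static inline int list_len(const struct list_node *h)
{
	const struct list_node *n;
	int len = 0;

	for (n = h->next; n != h; n = n->next)
		len++;
	return len;
}

enum page_state {
	PAGE_NOT_TRACKED = 0,
	PAGE_RECLAIMABLE = 2,
	PAGE_UNRECLAIMABLE = 3,
};

struct page {
	struct list_node list;
	enum page_state state;
};

struct lru_lists {
	struct list_node reclaimable;	/* regular enclave pages */
	struct list_node unreclaimable;	/* SECS and VA pages */
};

static inline void lru_init(struct lru_lists *lru)
{
	list_init(&lru->reclaimable);
	list_init(&lru->unreclaimable);
}

/* Start tracking a page on the list that matches its state. */
static inline void record_page(struct lru_lists *lru, struct page *page,
			       enum page_state state)
{
	assert(state == PAGE_RECLAIMABLE || state == PAGE_UNRECLAIMABLE);
	page->state = state;
	if (state == PAGE_RECLAIMABLE)
		list_append(&page->list, &lru->reclaimable);
	else
		list_append(&page->list, &lru->unreclaimable);
}

/* Stop tracking a page before it is freed. */
static inline void drop_page(struct page *page)
{
	list_unlink(&page->list);
	page->state = PAGE_NOT_TRACKED;
}
```

Keeping SECS and VA pages on their own list means the OOM path can find an
enclave even when none of its pages remain on the reclaimable list.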

* Re: [PATCH v4 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
       [not found]     ` <20230913040635.28815-13-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:34       ` Jarkko Sakkinen
  2023-09-16  4:19         ` Haitao Huang
  0 siblings, 1 reply; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:34 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> Introduce the OOM path for killing an enclave with a reclaimer that is no
> longer able to reclaim enough EPC pages. Find a victim enclave, which
> will be an enclave with only "unreclaimable" EPC pages left in the
> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> and zap the enclave's entire page range, and drain all mm references in
> encl->mm_list. Block allocating any EPC pages in #PF handler, or
> reloading any pages in all paths, or creating any new mappings.
>
> The OOM killing path may race with the reclaimers: in some cases, the
> victim enclave is in the process of reclaiming the last EPC pages when
> OOM happens, that is, all pages other than SECS and VA pages are in
> RECLAIM_IN_PROGRESS state. The reclaiming process requires access to
> the enclave backing, VA pages as well as SECS. So the OOM killer does
> not directly release those enclave resources; instead, it lets all
> in-progress reclaiming finish, and relies (as currently done) on
> kref_put on encl->refcount to trigger sgx_encl_release() to do the
> final cleanup.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> V4:
> - Updates for patch reordering and typo fixes.
>
> V3:
> - Rebased to use the new VMA_ITERATOR to zap VMAs.
> - Fixed the racing cases by blocking new page allocation/mapping and
> reloading when the enclave is marked for OOM. Also, do not release any
> enclave resources other than draining mm_list entries, and let pages in
> RECLAIM_IN_PROGRESS state be reaped by reclaimers.
> - Due to above changes, also removed the no-longer needed encl->lock in
> the OOM path which was causing deadlocks reported by the lock prover.
> ---
>  arch/x86/kernel/cpu/sgx/driver.c |  27 +-----
>  arch/x86/kernel/cpu/sgx/encl.c   |  48 ++++++++++-
>  arch/x86/kernel/cpu/sgx/encl.h   |   2 +
>  arch/x86/kernel/cpu/sgx/ioctl.c  |   9 ++
>  arch/x86/kernel/cpu/sgx/main.c   | 140 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/sgx.h    |   1 +
>  6 files changed, 200 insertions(+), 27 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> index 262f5fb18d74..ff42d649c7b6 100644
> --- a/arch/x86/kernel/cpu/sgx/driver.c
> +++ b/arch/x86/kernel/cpu/sgx/driver.c
> @@ -44,7 +44,6 @@ static int sgx_open(struct inode *inode, struct file *file)
>  static int sgx_release(struct inode *inode, struct file *file)
>  {
>  	struct sgx_encl *encl = file->private_data;
> -	struct sgx_encl_mm *encl_mm;
>  
>  	/*
>  	 * Drain the remaining mm_list entries. At this point the list contains
> @@ -52,31 +51,7 @@ static int sgx_release(struct inode *inode, struct file *file)
>  	 * not exited yet. The processes, which have exited, are gone from the
>  	 * list by sgx_mmu_notifier_release().
>  	 */
> -	for ( ; ; )  {
> -		spin_lock(&encl->mm_lock);
> -
> -		if (list_empty(&encl->mm_list)) {
> -			encl_mm = NULL;
> -		} else {
> -			encl_mm = list_first_entry(&encl->mm_list,
> -						   struct sgx_encl_mm, list);
> -			list_del_rcu(&encl_mm->list);
> -		}
> -
> -		spin_unlock(&encl->mm_lock);
> -
> -		/* The enclave is no longer mapped by any mm. */
> -		if (!encl_mm)
> -			break;
> -
> -		synchronize_srcu(&encl->srcu);
> -		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> -		kfree(encl_mm);
> -
> -		/* 'encl_mm' is gone, put encl_mm->encl reference: */
> -		kref_put(&encl->refcount, sgx_encl_release);
> -	}
> -
> +	sgx_encl_mm_drain(encl);
>  	kref_put(&encl->refcount, sgx_encl_release);
>  	return 0;
>  }
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index bf0ac3677ca8..85b6f218f029 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -453,6 +453,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>  	if (unlikely(!encl))
>  		return VM_FAULT_SIGBUS;
>  
> +	if (test_bit(SGX_ENCL_OOM, &encl->flags))
> +		return VM_FAULT_SIGBUS;
> +
>  	/*
>  	 * The page_array keeps track of all enclave pages, whether they
>  	 * are swapped out or not. If there is no entry for this page and
> @@ -651,7 +654,8 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
>  	if (!encl)
>  		return -EFAULT;
>  
> -	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
> +	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
> +	    test_bit(SGX_ENCL_OOM, &encl->flags))
>  		return -EFAULT;
>  
>  	for (i = 0; i < len; i += cnt) {
> @@ -776,6 +780,45 @@ void sgx_encl_release(struct kref *ref)
>  	kfree(encl);
>  }
>  
> +/**
> + * sgx_encl_mm_drain - drain all mm_list entries
> + * @encl:	address of the sgx_encl to drain
> + *
> + * Used during OOM kill to empty the mm_list entries after they have been
> + * zapped, or by sgx_release() to drain the remaining mm_list entries when
> + * the enclave fd is closing. After this call, sgx_encl_release() will be
> + * called via kref_put().
> + */
> +void sgx_encl_mm_drain(struct sgx_encl *encl)
> +{
> +	struct sgx_encl_mm *encl_mm;
> +
> +	for ( ; ; )  {
> +		spin_lock(&encl->mm_lock);
> +
> +		if (list_empty(&encl->mm_list)) {
> +			encl_mm = NULL;
> +		} else {
> +			encl_mm = list_first_entry(&encl->mm_list,
> +						   struct sgx_encl_mm, list);
> +			list_del_rcu(&encl_mm->list);
> +		}
> +
> +		spin_unlock(&encl->mm_lock);
> +
> +		/* The enclave is no longer mapped by any mm. */
> +		if (!encl_mm)
> +			break;
> +
> +		synchronize_srcu(&encl->srcu);
> +		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> +		kfree(encl_mm);
> +
> +		/* 'encl_mm' is gone, put encl_mm->encl reference: */
> +		kref_put(&encl->refcount, sgx_encl_release);
> +	}
> +}
> +
>  /*
>   * 'mm' is exiting and no longer needs mmu notifications.
>   */
> @@ -847,6 +890,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
>  	struct sgx_encl_mm *encl_mm;
>  	int ret;
>  
> +	if (test_bit(SGX_ENCL_OOM, &encl->flags))
> +		return -ENOMEM;
> +
>  	/*
>  	 * Even though a single enclave may be mapped into an mm more than once,
>  	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index 831d63f80f5a..47792fb00cee 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -39,6 +39,7 @@ enum sgx_encl_flags {
>  	SGX_ENCL_DEBUG		= BIT(1),
>  	SGX_ENCL_CREATED	= BIT(2),
>  	SGX_ENCL_INITIALIZED	= BIT(3),
> +	SGX_ENCL_OOM		= BIT(4),

Given how the existing constants are named, maybe SGX_ENCL_NO_MEMORY would
be more obvious.

>  };
>  
>  struct sgx_encl_mm {
> @@ -125,5 +126,6 @@ struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>  					 unsigned long addr);
>  struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
>  void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
> +void sgx_encl_mm_drain(struct sgx_encl *encl);
>  
>  #endif /* _X86_ENCL_H */
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 8c23bb524674..1f65c79664a2 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -421,6 +421,9 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
>  	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
>  		return -EINVAL;
>  
> +	if (test_bit(SGX_ENCL_OOM, &encl->flags))
> +		return -ENOMEM;
> +
>  	if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
>  		return -EFAULT;
>  
> @@ -606,6 +609,9 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
>  	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
>  		return -EINVAL;
>  
> +	if (test_bit(SGX_ENCL_OOM, &encl->flags))
> +		return -ENOMEM;
> +
>  	if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
>  		return -EFAULT;
>  
> @@ -682,6 +688,9 @@ static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
>  	if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
>  		return -EINVAL;
>  
> +	if (test_bit(SGX_ENCL_OOM, &encl->flags))
> +		return -ENOMEM;
> +
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index f3a3ed894616..c8900d62cfff 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -621,6 +621,146 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	atomic_long_inc(&sgx_nr_free_pages);
>  }
>  
> +static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
> +{
> +	struct sgx_encl *encl;
> +
> +	if (epc_page->flags & SGX_EPC_OWNER_PAGE)
> +		encl = epc_page->encl_page->encl;
> +	else if (epc_page->flags & SGX_EPC_OWNER_ENCL)
> +		encl = epc_page->encl;
> +	else
> +		return false;
> +
> +	return kref_get_unless_zero(&encl->refcount);
> +}
> +
> +static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
> +{
> +	struct sgx_epc_page *epc_page, *tmp;
> +
> +	if (list_empty(&lru->unreclaimable))
> +		return NULL;
> +
> +	list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
> +		list_del_init(&epc_page->list);
> +
> +		if (sgx_oom_get_ref(epc_page))
> +			return epc_page;
> +	}
> +	return NULL;
> +}
> +
> +static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
> +			    unsigned long end, const struct vm_operations_struct *ops)
> +{
> +	VMA_ITERATOR(vmi, mm, start);
> +	struct vm_area_struct *vma;
> +
> +	/*
> +	 * Use end because start can be zero and not mapped into
> +	 * enclave even if encl->base = 0.
> +	 */
> +	for_each_vma_range(vmi, vma, end) {
> +		if (vma->vm_ops == ops && vma->vm_private_data == owner &&
> +		    vma->vm_start < end) {
> +			zap_vma_pages(vma);
> +		}
> +	}
> +}
> +
> +static bool sgx_oom_encl(struct sgx_encl *encl)
> +{
> +	unsigned long mm_list_version;
> +	struct sgx_encl_mm *encl_mm;
> +	bool ret = false;
> +	int idx;
> +
> +	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
> +		goto out_put;
> +
> +	/* OOM was already done on this enclave; do not redo it. This may
> +	 * happen when the SECS page is still UNRECLAIMABLE because another
> +	 * page is in RECLAIM_IN_PROGRESS. Still return true so the OOM
> +	 * killer can wait until the reclaimer is done with the hold-up page
> +	 * and SECS before it moves on to find another victim.
> +	 */
> +	if (test_bit(SGX_ENCL_OOM, &encl->flags))
> +		goto out;
> +
> +	set_bit(SGX_ENCL_OOM, &encl->flags);
> +
> +	do {
> +		mm_list_version = encl->mm_list_version;
> +
> +		/* Pairs with smp_wmb() in sgx_encl_mm_add(). */
> +		smp_rmb();
> +
> +		idx = srcu_read_lock(&encl->srcu);
> +
> +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> +			if (!mmget_not_zero(encl_mm->mm))
> +				continue;
> +
> +			mmap_read_lock(encl_mm->mm);
> +
> +			sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
> +					encl->base + encl->size, &sgx_vm_ops);
> +
> +			mmap_read_unlock(encl_mm->mm);
> +
> +			mmput_async(encl_mm->mm);
> +		}
> +
> +		srcu_read_unlock(&encl->srcu, idx);
> +	} while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
> +
> +	sgx_encl_mm_drain(encl);
> +out:
> +	ret = true;
> +
> +out_put:
> +	/*
> +	 * This puts the refcount we took when we identified this enclave as
> +	 * an OOM victim.
> +	 */
> +	kref_put(&encl->refcount, sgx_encl_release);
> +	return ret;
> +}
> +
> +static inline bool sgx_oom_encl_page(struct sgx_encl_page *encl_page)
> +{
> +	return sgx_oom_encl(encl_page->encl);
> +}
> +
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru:	LRU that is low
> + *
> + * Return:	%true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> +	struct sgx_epc_page *victim;
> +
> +	spin_lock(&lru->lock);
> +	victim = sgx_oom_get_victim(lru);
> +	spin_unlock(&lru->lock);
> +
> +	if (!victim)
> +		return false;
> +
> +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> +		return sgx_oom_encl_page(victim->encl_page);
> +
> +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> +		return sgx_oom_encl(victim->encl);
> +
> +	/* Will never happen unless more owner types are added in the future */
> +	WARN_ON_ONCE(1);
> +	return false;
> +}
> +
>  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  					 unsigned long index,
>  					 struct sgx_epc_section *section)
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index e210af77f0cf..3818be5a8bd3 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -176,6 +176,7 @@ void sgx_reclaim_direct(void);
>  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
>  int sgx_drop_epc_page(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
>  
>  void sgx_ipi_cb(void *info);
>  
> -- 
> 2.25.1

BR, Jarkko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  2023-09-13  4:06   ` [PATCH v4 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
@ 2023-09-13 15:36     ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:36 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Adjust and expose the top-level reclaim function as
> sgx_reclaim_epc_pages() for use by the upcoming EPC cgroup, which will
> initiate reclaim to enforce the max limit.
>
> Make these adjustments to the function signature.
>
> 1) To take a parameter that specifies the number of pages to scan for
> reclaiming. Define a max value of 32, but scan 16 in the case of the
> global reclaimer (ksgxd). The EPC cgroup will use it to specify a
> desired number of pages to be reclaimed up to the max value of 32.
>
> 2) To take a flag to force reclaiming a page regardless of its age.  The
> EPC cgroup will use the flag to enforce its limits by draining the
> reclaimable lists before resorting to other measures, e.g. forcefully
> killing enclaves.
>
> 3) Return the number of reclaimed pages. The EPC cgroup will use the
> result to track reclaiming progress and escalate to a more forceful
> reclaiming mode, e.g., calling this function with the flag to ignore age
> of pages.
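
To make the three signature changes concrete, here is a minimal userspace
sketch of the contract as I read it from the commit message. Everything
named toy_* is hypothetical and not part of the patch; the "young" bit
only loosely models what sgx_reclaimer_age() checks, and no locking or
backing storage is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_TO_SCAN_MAX 32	/* mirrors SGX_NR_TO_SCAN_MAX in the patch */

/* Hypothetical stand-in for an EPC page; only the age bit matters here. */
struct toy_page {
	bool young;		/* accessed since the last scan */
	bool reclaimed;
};

/*
 * Scan at most nr_to_scan pages (capped at NR_TO_SCAN_MAX), skip young
 * pages unless ignore_age is set, and return how many were reclaimed.
 */
size_t toy_reclaim_pages(struct toy_page *pages, size_t nr_pages,
			 size_t nr_to_scan, bool ignore_age)
{
	size_t reclaimed = 0;
	size_t i;

	if (nr_to_scan > NR_TO_SCAN_MAX)
		nr_to_scan = NR_TO_SCAN_MAX;
	if (nr_to_scan > nr_pages)
		nr_to_scan = nr_pages;

	for (i = 0; i < nr_to_scan; i++) {
		if (pages[i].reclaimed)
			continue;

		if (pages[i].young && !ignore_age) {
			/* Age the page and give it another chance. */
			pages[i].young = false;
			continue;
		}

		pages[i].reclaimed = true;
		reclaimed++;
	}

	return reclaimed;
}
```

A caller that gets 0 back several times in a row can retry with
ignore_age set, which is the escalation that item 3 describes.
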
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> V4:
> - Combined the 3 patches that made the individual changes to the
> function signature.
> - Removed 'high' limit in commit message.
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 30 ++++++++++++++++++++----------
>  arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
>  2 files changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index c8900d62cfff..e1dde431a400 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -17,6 +17,10 @@
>  #include "driver.h"
>  #include "encl.h"
>  #include "encls.h"

newline here

> +/**

Use a plain "/*" here, since this is not a kernel-doc comment.

> + * Maximum number of pages to scan for reclaiming.
> + */
> +#define SGX_NR_TO_SCAN_MAX	32
>  
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
> @@ -279,7 +283,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>  	mutex_unlock(&encl->lock);
>  }
>  
> -/*
> +/**
> + * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
> + * @nr_to_scan:		 Number of EPC pages to scan for reclaim
> + * @ignore_age:		 Reclaim a page even if it is young
> + *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which have
>   * been accessed since the last scan. Move those pages to the tail of active
> @@ -292,15 +300,15 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -static void sgx_reclaim_pages(void)
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  {
> -	struct sgx_backing backing[SGX_NR_TO_SCAN];
> +	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>  	struct sgx_epc_page *epc_page, *tmp;
>  	struct sgx_encl_page *encl_page;
>  	pgoff_t page_index;
>  	LIST_HEAD(iso);
> -	int ret;
> -	int i;
> +	size_t ret;
> +	size_t i;

I don't mind having these on separate lines, but you could also write:

	size_t ret, i;

>  
>  	spin_lock(&sgx_global_lru.lock);
>  	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> @@ -326,13 +334,14 @@ static void sgx_reclaim_pages(void)
>  	spin_unlock(&sgx_global_lru.lock);
>  
>  	if (list_empty(&iso))
> -		return;
> +		return 0;
>  
>  	i = 0;
>  	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>  		encl_page = epc_page->encl_page;
>  
> -		if (!sgx_reclaimer_age(epc_page))
> +		if (i == SGX_NR_TO_SCAN_MAX ||
> +		    (!ignore_age && !sgx_reclaimer_age(epc_page)))
>  			goto skip;
>  
>  		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
> @@ -371,6 +380,7 @@ static void sgx_reclaim_pages(void)
>  
>  		sgx_free_epc_page(epc_page);
>  	}

newline

> +	return i;
>  }
>  
>  static bool sgx_should_reclaim(unsigned long watermark)
> @@ -387,7 +397,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  void sgx_reclaim_direct(void)
>  {
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_pages();
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>  }
>  
>  static int ksgxd(void *p)
> @@ -410,7 +420,7 @@ static int ksgxd(void *p)
>  				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>  		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> -			sgx_reclaim_pages();
> +			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>  
>  		cond_resched();
>  	}
> @@ -582,7 +592,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		sgx_reclaim_pages();
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
>  		cond_resched();
>  	}
>  
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 3818be5a8bd3..aa4ec2c0ce96 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -177,6 +177,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
>  int sgx_drop_epc_page(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
>  bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
>  
>  void sgx_ipi_cb(void *info);
>  
> -- 
> 2.25.1

BR, Jarkko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 15/18] x86/sgx: Prepare for multiple LRUs
       [not found]     ` <20230913040635.28815-16-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-13 15:42       ` Jarkko Sakkinen
  2023-09-16  4:18         ` Haitao Huang
  0 siblings, 1 reply; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:42 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	tj-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w
  Cc: zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> Add sgx_can_reclaim() wrapper and encapsulate direct references to the
> global LRU list in the reclaimer functions so that they can be called with
> an LRU list per EPC cgroup.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kristen Carlson Accardi <kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> V4:
> - Re-organized this patch to include all changes related to
> encapsulation of the global LRU
> - Moved this patch to precede the EPC cgroup patch
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 41 +++++++++++++++++++++++-----------
>  1 file changed, 28 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index ce316bd5e5bb..3d396fe5ec09 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -34,6 +34,16 @@ static DEFINE_XARRAY(sgx_epc_address_space);
>   */
>  static struct sgx_epc_lru_lists sgx_global_lru;
>  
> +static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
> +{
> +	return &sgx_global_lru;
> +}

I'd simply export sgx_global_lru.

> +static inline bool sgx_can_reclaim(void)
> +{
> +	return !list_empty(&sgx_global_lru.reclaimable);
> +}


Accessors for the object should be named so that this fact is reflected,
e.g. sgx_global_lru_can_reclaim() in this case.

I would just open-code this at the call sites, though.

> +
>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
>  /* Nodes with one or more EPC sections. */
> @@ -339,6 +349,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>  	struct sgx_epc_page *epc_page, *tmp;
>  	struct sgx_encl_page *encl_page;
> +	struct sgx_epc_lru_lists *lru;
>  	pgoff_t page_index;
>  	LIST_HEAD(iso);
>  	size_t ret;
> @@ -372,10 +383,11 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  		continue;
>  
>  skip:
> -		spin_lock(&sgx_global_lru.lock);
> +		lru = sgx_lru_lists(epc_page);
> +		spin_lock(&lru->lock);
>  		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
> -		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> -		spin_unlock(&sgx_global_lru.lock);
> +		list_move_tail(&epc_page->list, &lru->reclaimable);
> +		spin_unlock(&lru->lock);
>  
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
>  	}
> @@ -399,7 +411,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
>  	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> -	       !list_empty(&sgx_global_lru.reclaimable);
> +		sgx_can_reclaim();
>  }
>  
>  /*
> @@ -529,14 +541,16 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>   */
>  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>  {
> -	spin_lock(&sgx_global_lru.lock);
> +	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
> +
> +	spin_lock(&lru->lock);
>  	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
>  	page->flags |= flags;
>  	if (sgx_epc_page_reclaimable(flags))
> -		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
> +		list_add_tail(&page->list, &lru->reclaimable);
>  	else
> -		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
> -	spin_unlock(&sgx_global_lru.lock);
> +		list_add_tail(&page->list, &lru->unreclaimable);
> +	spin_unlock(&lru->lock);
>  }
>  
>  /**
> @@ -551,15 +565,16 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
>   */
>  int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
> -	spin_lock(&sgx_global_lru.lock);
> +	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
> +
> +	spin_lock(&lru->lock);
>  	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> -		spin_unlock(&sgx_global_lru.lock);
> +		spin_unlock(&lru->lock);
>  		return -EBUSY;
>  	}
> -
>  	list_del(&page->list);
>  	sgx_epc_page_reset_state(page);
> -	spin_unlock(&sgx_global_lru.lock);
> +	spin_unlock(&lru->lock);
>  
>  	return 0;
>  }
> @@ -592,7 +607,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		if (list_empty(&sgx_global_lru.reclaimable))
> +		if (!sgx_can_reclaim())
>  			return ERR_PTR(-ENOMEM);
>  
>  		if (!reclaim) {
> -- 
> 2.25.1

BR, Jarkko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-09-13  4:06   ` [PATCH v4 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
@ 2023-09-13 15:48     ` Jarkko Sakkinen
  0 siblings, 0 replies; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-13 15:48 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie

On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
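
For readers who have not used the misc controller: its interface files
take one "<resource> <value>" pair per line. Below is a small sketch of
formatting the line for the "sgx_epc" resource. The helper name is made
up for illustration, and I am assuming the value is in bytes because the
code in this patch divides the limit by PAGE_SIZE.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the line to write into a cgroup's misc.max file to cap EPC
 * usage. Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small.
 */
int format_sgx_epc_max(char *buf, size_t len, uint64_t limit_bytes)
{
	int n = snprintf(buf, len, "sgx_epc %" PRIu64 "\n", limit_bytes);

	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

The resulting string would then be written to the misc.max file of the
target cgroup directory, wherever cgroup v2 is mounted on the system.
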
>
> This patch was modified from its original version to use the misc cgroup
> controller instead of a custom controller.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
>
> Cc: Sean Christopherson <seanjc@google.com>
> ---
> V4:
> - Fix a white space issue in Kconfig (Randy).
> - Update comments for LRU list as it can be owned by a cgroup.
> - Fix comments for sgx_reclaim_epc_pages() and use IS_ENABLED consistently (Mikko)
>
> V3:
>
> 1) Use the same maximum number of reclaiming candidate pages to be
> processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
> cgroup worker function and ksgxd. This fixes an overflow in the
> backing store buffer with the same fixed size allocated on stack in
> sgx_reclaim_epc_pages().
>
> 2) Initialize max for root EPC cgroup. Otherwise, all
> misc_cg_try_charge() calls would fail as it checks for all limits of
> ancestors all the way to the root node.
>
> 3) Start reclaiming whenever misc_cg_try_charge() fails. Removed all
> re-checks for limits and current usage. For all intents and purposes,
> when misc_cg_try_charge() fails, reclaiming is needed. This also corrects
> an error of not reclaiming when the child limit is larger than one of
> its ancestors.
>
> 4) Handle failure on charging to the root EPC cgroup. Failure on charging
> to root means we are at or above capacity, so start reclaiming or return
> OOM error.
>
> 5) Removed the custom cgroup tree walking iterator with epoch tracking
> logic. Replaced it with just the plain css_for_each_descendant_pre
> iterator. The custom iterator implemented a rather complex epoch scheme
> that I believe was intended to prevent extra reclaiming from multiple
> worker threads doing the same walk, but it turned out not to matter much
> as each thread would only reclaim when usage is above limit. Using the
> plain css_for_each_descendant_pre iterator simplified the code a bit.
>
> 6) Do not reclaim synchronously in misc_max_write callback which would
> block the user. Instead queue an async work item to run the reclaiming
> loop.
>
> 7) Other minor refactoring:
> - Remove unused params in epc_cgroup APIs
> - centralize uncharge into sgx_free_epc_page()
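
Items 2 to 4 above all follow from the charge being hierarchical. A toy
model of that behaviour, as a sketch only: the struct and function names
are hypothetical, and the real misc_cg_try_charge() uses atomics and also
checks capacity, which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cgroup node. */
struct toy_cg {
	struct toy_cg *parent;	/* NULL for the root */
	uint64_t usage;
	uint64_t max;
};

/*
 * Charge @amount to @cg and to every ancestor up to the root. If any
 * level would exceed its max, undo the partial charge and fail.
 */
bool toy_try_charge(struct toy_cg *cg, uint64_t amount)
{
	struct toy_cg *i, *j;

	for (i = cg; i; i = i->parent) {
		/* Written this way to avoid overflow in usage + amount. */
		if (amount > i->max || i->usage > i->max - amount)
			goto undo;
		i->usage += amount;
	}

	return true;

undo:
	for (j = cg; j != i; j = j->parent)
		j->usage -= amount;

	return false;
}
```

With this model a root whose max is left at zero fails every charge,
which is the situation item 2 fixes, and a child with a larger limit
than its parent still fails once the parent is full, which is the case
item 3 mentions.
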
> ---
>  arch/x86/Kconfig                     |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 406 +++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 ++++
>  arch/x86/kernel/cpu/sgx/main.c       |  67 ++++-
>  arch/x86/kernel/cpu/sgx/sgx.h        |  17 +-
>  6 files changed, 547 insertions(+), 16 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 982b777eadc7..55fcf182d4a3 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> +	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> +	depends on X86_SGX && CGROUP_MISC
> +	help
> +	  Provides control over the EPC footprint of tasks in a cgroup via
> +	  the Miscellaneous cgroup controller.
> +
> +	  EPC is a subset of regular memory that is usable only by SGX
> +	  enclaves and is very limited in quantity, e.g. less than 1%
> +	  of total DRAM.
> +
> +	  Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>  	bool "X86 userspace shadow stack"
>  	depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>  	ioctl.o \
>  	main.o
>  obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..7b86eb074abe
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include <linux/ratelimit.h>
> +#include <linux/sched/signal.h>
> +#include <linux/slab.h>
> +#include <linux/threads.h>
> +
> +#include "epc_cgroup.h"
> +
> +#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
> +#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
> +#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
> +
> +static struct workqueue_struct *sgx_epc_cg_wq;
> +static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root);
> +
> +struct sgx_epc_reclaim_control {
> +	struct sgx_epc_cgroup *epc_cg;
> +	int nr_fails;
> +	bool ignore_age;
> +};
> +
> +static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
> +}
> +
> +static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
> +}
> +

/*
 * A brief explanation of the calculation below.
 */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
> +{
> +	struct misc_cg *i = epc_cg->cg;
> +	u64 m = U64_MAX;
> +
> +	while (i) {
> +		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> +		i = misc_cg_parent(i);
> +	}

I'd add an empty line here.

> +	return m / PAGE_SIZE;
> +}
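
As an illustration of the kind of explanation I mean, here is a
standalone sketch of the same walk with the reasoning spelled out in the
comment. The toy_* names and the fixed 4096-byte page size are
assumptions made for the example only.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL	/* assumed page size for the example */

/* Hypothetical node carrying only what the walk needs. */
struct toy_node {
	struct toy_node *parent;	/* NULL for the root */
	uint64_t max;			/* limit in bytes */
};

/*
 * A charge must fit under every limit on the path to the root, so the
 * effective limit of a cgroup is the smallest max among itself and all
 * of its ancestors. Return that limit in pages.
 */
uint64_t toy_max_pages_to_root(const struct toy_node *node)
{
	uint64_t m = UINT64_MAX;

	for (; node; node = node->parent) {
		if (node->max < m)
			m = node->max;
	}

	return m / TOY_PAGE_SIZE;
}
```
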
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> +	if (cg)
> +		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +
> +	return NULL;
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> +	return !cgroup_subsys_enabled(misc_cgrp_subsys);
> +}
> +
> +/**
> + * sgx_epc_cgroup_lru_empty - check if a cgroup tree has no pages on its lrus

The function name is missing "()":

https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt

> + * @root:	root of the tree to check
> + *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> +	struct cgroup_subsys_state *css_root = NULL;
> +	struct cgroup_subsys_state *pos = NULL;
> +	struct sgx_epc_cgroup *epc_cg = NULL;
> +	bool ret = true;
> +
> +	/*
> +	 * Caller ensures that a css_root reference is held
> +	 */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> +		spin_lock(&epc_cg->lru.lock);
> +		ret = list_empty(&epc_cg->lru.reclaimable);
> +		spin_unlock(&epc_cg->lru.lock);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!ret)
> +			break;
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_isolate_pages - walk a cgroup tree and separate pages

Ditto.

> + * @root:	root of the tree to start walking
> + * @nr_to_scan: The number of pages that need to be isolated
> + * @dst:	Destination list to hold the isolated pages

Not correctly aligned.

> + *
> + * Walk the cgroup tree and isolate the pages in the hierarchy
> + * for reclaiming.
> + */
> +void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +				  size_t *nr_to_scan, struct list_head *dst)
> +{
> +	struct cgroup_subsys_state *css_root = NULL;

Spurious initialization to NULL.

> +	struct cgroup_subsys_state *pos = NULL;

Ditto.

> +	struct sgx_epc_cgroup *epc_cg = NULL;

Ditto.

> +
> +	if (!*nr_to_scan)
> +		return;
> +
> +	/* Caller ensures that a css_root reference is held */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!*nr_to_scan)
> +			break;
> +	}

I'd add an empty line here.

> +	rcu_read_unlock();
> +}
> +
> +static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
> +					struct sgx_epc_reclaim_control *rc)
> +{
> +	/*
> +	 * Ensure sgx_reclaim_epc_pages() is called with a minimum and maximum
> +	 * number of pages.  Attempting to reclaim only a few pages will
> +	 * often fail and is inefficient, while reclaiming a huge number
> +	 * of pages can result in soft lockups due to holding various
> +	 * locks for an extended duration.
> +	 */
> +	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
> +	nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
> +
> +	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
> +{
> +	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
> +		return -ENOMEM;
> +
> +	++rc->nr_fails;
> +	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
> +		rc->ignore_age = true;
> +
> +	return 0;
> +}
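
The two helpers above boil down to a clamp and a failure counter. A
userspace sketch of just that logic, with the constants copied from the
patch and hypothetical toy_* names; the LRU-empty check that returns
-ENOMEM is deliberately left out.

```c
#include <stdbool.h>

#define RECLAIM_MIN_PAGES	16UL	/* SGX_EPC_RECLAIM_MIN_PAGES */
#define NR_TO_SCAN_MAX		32UL	/* SGX_NR_TO_SCAN_MAX */
#define IGNORE_AGE_THRESHOLD	5	/* SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD */

struct toy_reclaim_control {
	int nr_fails;
	bool ignore_age;
};

/* Keep each reclaim request within [RECLAIM_MIN_PAGES, NR_TO_SCAN_MAX]. */
unsigned long toy_clamp_nr_pages(unsigned long nr_pages)
{
	if (nr_pages < RECLAIM_MIN_PAGES)
		nr_pages = RECLAIM_MIN_PAGES;
	if (nr_pages > NR_TO_SCAN_MAX)
		nr_pages = NR_TO_SCAN_MAX;

	return nr_pages;
}

/* Count a failed attempt; past the threshold, stop honouring page age. */
void toy_reclaim_failed(struct toy_reclaim_control *rc)
{
	if (++rc->nr_fails > IGNORE_AGE_THRESHOLD)
		rc->ignore_age = true;
}
```

So a request for a single page still scans 16, and the age check is
only dropped on the sixth consecutive failure.
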
> +
> +static inline
> +void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
> +				  struct sgx_epc_cgroup *epc_cg)
> +{
> +	rc->epc_cg = epc_cg;
> +	rc->nr_fails = 0;
> +	rc->ignore_age = false;
> +}
> +
> +/*
> + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
> + * cgroup when the cgroup is at/near its maximum capacity
> + */
> +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +	u64 cur, max;
> +
> +	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> +	for (;;) {
> +		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
> +
> +		/*
> +		 * Adjust the limit down by one page, the goal is to free up
> +		 * pages for fault allocations, not to simply obey the limit.
> +		 * Conditionally decrementing max also means the cur vs. max
> +		 * check will correctly handle the case where both are zero.
> +		 */
> +		if (max)
> +			max--;
> +
> +		/*
> +		 * Unless the limit is extremely low, in which case forcing
> +		 * reclaim will likely cause thrashing, force the cgroup to
> +		 * reclaim at least once if it's operating *near* its maximum
> +		 * limit by adjusting @max down by half the min reclaim size.
> +		 * This work func is scheduled by sgx_epc_cgroup_try_charge
> +		 * when it cannot directly reclaim due to being in an atomic
> +		 * context, e.g. EPC allocation in a fault handler.  Waiting
> +		 * to reclaim until the cgroup is actually at its limit is less
> +		 * performant as it means the faulting task is effectively
> +		 * blocked until a worker makes its way through the global work
> +		 * queue.
> +		 */
> +		if (max > SGX_NR_TO_SCAN_MAX)
> +			max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
> +
> +		max = min(max, sgx_epc_total_pages);
> +		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
> +		if (cur <= max)
> +			break;
> +		/* Nothing reclaimable */
> +		if (sgx_epc_cgroup_lru_empty(epc_cg)) {
> +			if (!sgx_epc_cgroup_oom(epc_cg))
> +				break;
> +
> +			continue;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
> +			if (sgx_epc_cgroup_reclaim_failed(&rc))
> +				break;
> +		}
> +	}
> +}
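
The adjustments to @max in the loop are easier to follow when pulled
out into one function. This is only a sketch of the arithmetic, with a
hypothetical function name; the worker keeps reclaiming while the
current usage is above the value returned here.

```c
#include <stdint.h>

#define RECLAIM_MIN_PAGES	16ULL	/* SGX_EPC_RECLAIM_MIN_PAGES */
#define NR_TO_SCAN_MAX		32ULL	/* SGX_NR_TO_SCAN_MAX */

/*
 * Compute the usage level (in pages) that the worker reclaims down to,
 * given the effective limit and the total EPC size, both in pages.
 */
uint64_t toy_reclaim_target(uint64_t max_pages, uint64_t total_pages)
{
	/* Leave room for one page so a fault allocation can succeed. */
	if (max_pages)
		max_pages--;

	/* Unless the limit is tiny, start reclaiming a little early. */
	if (max_pages > NR_TO_SCAN_MAX)
		max_pages -= RECLAIM_MIN_PAGES / 2;

	return max_pages < total_pages ? max_pages : total_pages;
}
```

Note the step at the threshold: in this model a limit of 33 pages gives
a target of 32, while a limit of 34 gives 25.
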
> +
> +static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
> +				       bool reclaim)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	unsigned int nr_empty = 0;
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> +	for (;;) {
> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> +					PAGE_SIZE))
> +			break;
> +
> +		if (sgx_epc_cgroup_lru_empty(epc_cg))
> +			return -ENOMEM;
> +
> +		if (signal_pending(current))
> +			return -ERESTARTSYS;
> +
> +		if (!reclaim) {
> +			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
> +			return -EBUSY;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
> +			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
> +				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
> +					return -ENOMEM;
> +				schedule();
> +			}
> +		}
> +	}
> +	if (epc_cg->cg != misc_cg_root())
> +		css_get(&epc_cg->cg->css);
> +
> +	return 0;
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge - hierarchically try to charge a single EPC page

"()"

> + * @mm:			the mm_struct of the process to charge
> + * @reclaim:		whether or not synchronous reclaim is allowed
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	int ret;
> +
> +	if (sgx_epc_cgroup_disabled())
> +		return NULL;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> +	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
> +	put_misc_cg(epc_cg->cg);
> +
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return epc_cg;
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge - hierarchically uncharge EPC pages

"()"

> + * @epc_cg:	the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (sgx_epc_cgroup_disabled())
> +		return;
> +
> +	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	if (epc_cg->cg != misc_cg_root())
> +		put_misc_cg(epc_cg->cg);
> +}
> +
> +static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
> +{
> +	struct cgroup_subsys_state *css_root = NULL;
> +	struct cgroup_subsys_state *pos = NULL;
> +	struct sgx_epc_cgroup *epc_cg = NULL;

Please also check through these initializations.

> +	bool oom = false;
> +
> +	/* Caller ensures that a css_root reference is held */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		/* skip dead ones */
> +		if (!css_tryget(pos))
> +			continue;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		oom = sgx_epc_oom(&epc_cg->lru);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (oom)
> +			break;
> +	}
> +	rcu_read_unlock();
> +	return oom;
> +}
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +	cancel_work_sync(&epc_cg->reclaim_work);
> +	kfree(epc_cg);
> +}
> +
> +static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +	/* Let the reclaimer do the work so the user is not blocked */
> +	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> +	if (!epc_cg)
> +		return -ENOMEM;
> +
> +	sgx_lru_init(&epc_cg->lru);
> +	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_alloc = sgx_epc_cgroup_alloc;
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_free = sgx_epc_cgroup_free;
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_max_write = sgx_epc_cgroup_max_write;
> +	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> +	epc_cg->cg = cg;
> +	return 0;
> +}
> +
> +static int __init sgx_epc_cgroup_init(void)
> +{
> +	struct misc_cg *cg;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SGX))
> +		return 0;
> +
> +	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
> +					WQ_UNBOUND | WQ_FREEZABLE,
> +					WQ_UNBOUND_MAX_ACTIVE);
> +	BUG_ON(!sgx_epc_cg_wq);
> +
> +	cg = misc_cg_root();
> +	BUG_ON(!cg);
> +	WRITE_ONCE(cg->res[MISC_CG_RES_SGX_EPC].max, U64_MAX);
> +	atomic64_set(&cg->res[MISC_CG_RES_SGX_EPC].usage, 0UL);
> +	return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..dfc902f4d96f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,59 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> +	return NULL;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +
> +static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +						size_t *nr_to_scan,
> +						struct list_head *dst) { }
> +
> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return NULL;
> +}
> +
> +static inline bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> +	return true;
> +}
> +#else
> +struct sgx_epc_cgroup {
> +	struct misc_cg *cg;
> +	struct sgx_epc_lru_lists	lru;
> +	struct work_struct	reclaim_work;
> +};
> +
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
> +void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +				  size_t *nr_to_scan, struct list_head *dst);
> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (epc_cg)
> +		return &epc_cg->lru;
> +	return NULL;
> +}
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 3d396fe5ec09..20de17f4f576 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
>  #include <linux/node.h>
>  #include <linux/pagemap.h>
>  #include <linux/ratelimit.h>
> @@ -17,11 +18,9 @@
>  #include "driver.h"
>  #include "encl.h"
>  #include "encls.h"
> -/**
> - * Maximum number of pages to scan for reclaiming.
> - */
> -#define SGX_NR_TO_SCAN_MAX	32
> +#include "epc_cgroup.h"
>  
> +u64 sgx_epc_total_pages;
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
>  static struct task_struct *ksgxd_tsk;
> @@ -36,11 +35,17 @@ static struct sgx_epc_lru_lists sgx_global_lru;
>  
>  static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
>  {
> +	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return epc_cg_lru(epc_page->epc_cg);
> +
>  	return &sgx_global_lru;
>  }
>  
>  static inline bool sgx_can_reclaim(void)
>  {
> +	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return !sgx_epc_cgroup_lru_empty(NULL);
> +
>  	return !list_empty(&sgx_global_lru.reclaimable);
>  }
>  
> @@ -299,14 +304,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   * @nr_to_scan:	Number of pages to scan for reclaim
>   * @dst:	Destination list to hold the isolated pages
>   */
> -void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
> +void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t *nr_to_scan,
>  			   struct list_head *dst)
>  {
>  	struct sgx_encl_page *encl_page;
>  	struct sgx_epc_page *epc_page;
>  
>  	spin_lock(&lru->lock);
> -	for (; nr_to_scan > 0; --nr_to_scan) {
> +	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
>  		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
>  		if (!epc_page)
>  			break;
> @@ -331,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>   * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>   * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>   * @ignore_age:		 Reclaim a page even if it is young
> + * @epc_cg:		 EPC cgroup from which to reclaim
>   *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which have
> @@ -344,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +			     struct sgx_epc_cgroup *epc_cg)
>  {
>  	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>  	struct sgx_epc_page *epc_page, *tmp;
> @@ -355,7 +362,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  	size_t ret;
>  	size_t i;
>  
> -	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
> +	/*
> +	 * If a specific cgroup is not being targeted, take from the global
> +	 * list first, even when cgroups are enabled.  If there are
> +	 * pages on the global LRU then they should get reclaimed asap.
> +	 */
> +	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
> +		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
> +
> +	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
>  
>  	if (list_empty(&iso))
>  		return 0;
> @@ -422,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  void sgx_reclaim_direct(void)
>  {
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  }
>  
>  static int ksgxd(void *p)
> @@ -445,7 +460,7 @@ static int ksgxd(void *p)
>  				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>  		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> -			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  
>  		cond_resched();
>  	}
> @@ -599,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
>  	struct sgx_epc_page *page;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
> +	if (IS_ERR(epc_cg))
> +		return ERR_CAST(epc_cg);
>  
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
> @@ -607,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		if (!sgx_can_reclaim())
> -			return ERR_PTR(-ENOMEM);
> +		if (!sgx_can_reclaim()) {
> +			page = ERR_PTR(-ENOMEM);
> +			break;
> +		}
>  
>  		if (!reclaim) {
>  			page = ERR_PTR(-EBUSY);
> @@ -620,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  		cond_resched();
>  	}
>  
> +	if (!IS_ERR(page)) {
> +		WARN_ON_ONCE(page->epc_cg);
> +		page->epc_cg = epc_cg;
> +	} else {
> +		sgx_epc_cgroup_uncharge(epc_cg);
> +	}
> +
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>  		wake_up(&ksgxd_waitq);
>  
> @@ -646,6 +675,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  
>  	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
>  
> +	if (page->epc_cg) {
> +		sgx_epc_cgroup_uncharge(page->epc_cg);
> +		page->epc_cg = NULL;
> +	}
> +
>  	spin_lock(&node->lock);
>  
>  	page->encl_page = NULL;
> @@ -656,6 +690,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	page->flags = SGX_EPC_PAGE_FREE;
>  
>  	spin_unlock(&node->lock);
> +
>  	atomic_long_inc(&sgx_nr_free_pages);
>  }
>  
> @@ -825,6 +860,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  		section->pages[i].flags = 0;
>  		section->pages[i].encl_page = NULL;
>  		section->pages[i].poison = 0;
> +		section->pages[i].epc_cg = NULL;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
>  
> @@ -969,6 +1005,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
>  static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
> +	u64 capacity = 0;
>  	u64 pa, size;
>  	int nid;
>  	int i;
> @@ -1019,6 +1056,7 @@ static bool __init sgx_page_cache_init(void)
>  
>  		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
>  		sgx_numa_nodes[nid].size += size;
> +		capacity += size;
>  
>  		sgx_nr_epc_sections++;
>  	}
> @@ -1028,6 +1066,9 @@ static bool __init sgx_page_cache_init(void)
>  		return false;
>  	}
>  
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
> +
>  	return true;
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 7e21192b87a8..bf746d2af96d 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -19,6 +19,11 @@
>  
>  #define SGX_MAX_EPC_SECTIONS		8
>  #define SGX_EEXTEND_BLOCK_SIZE		256
> +
> +/*
> + * Maximum number of pages to scan for reclaiming.
> + */
> +#define SGX_NR_TO_SCAN_MAX		32UL
>  #define SGX_NR_TO_SCAN			16
>  #define SGX_NR_LOW_PAGES		32
>  #define SGX_NR_HIGH_PAGES		64
> @@ -70,6 +75,8 @@ enum sgx_epc_page_state {
>  /* flag for pages owned by a sgx_encl struct */
>  #define SGX_EPC_OWNER_ENCL		BIT(4)
>  
> +struct sgx_epc_cgroup;
> +
>  struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
> @@ -79,6 +86,7 @@ struct sgx_epc_page {
>  		struct sgx_encl *encl;
>  	};
>  	struct list_head list;
> +	struct sgx_epc_cgroup *epc_cg;
>  };
>  
>  static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
> @@ -127,6 +135,7 @@ struct sgx_epc_section {
>  	struct sgx_numa_node *node;
>  };
>  
> +extern u64 sgx_epc_total_pages;
>  extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  
>  static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
> @@ -150,7 +159,8 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
>  }
>  
>  /*
> - * Contains EPC pages tracked by the reclaimer (ksgxd).
> + * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
> + * cgroup.
>   */
>  struct sgx_epc_lru_lists {
>  	spinlock_t lock;
> @@ -177,8 +187,9 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
>  int sgx_drop_epc_page(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
>  bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
> -void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +			     struct sgx_epc_cgroup *epc_cg);
> +void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t *nr_to_scan,
>  			   struct list_head *dst);
>  
>  void sgx_ipi_cb(void *info);
> -- 
> 2.25.1

BR, Jarkko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
       [not found]     ` <20230913040635.28815-4-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2023-09-13  9:46       ` Jarkko Sakkinen
@ 2023-09-14 10:31       ` Huang, Kai
       [not found]         ` <851f9b3043732c17cd8f86a77ccee0b7c6caa22f.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2023-09-15 16:28         ` Haitao Huang
  1 sibling, 2 replies; 45+ messages in thread
From: Huang, Kai @ 2023-09-14 10:31 UTC (permalink / raw)
  To: hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	dave.hansen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jarkko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org,
	haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
	Mehta, Sohil, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
	yangjie-0li6OtcxBFHby3iVrkZq2A@public.gmane.org, Li, Zhiquan1,
	Christopherson,, Sean,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org, Zhang, Bo,
	anakrish-0li6OtcxBFHby3iVrkZq2A@public.gmane.org

Some non-technical stuff:

On Tue, 2023-09-12 at 21:06 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>

The patch was from Kristen, but ...

> 
> Introduce a data structure to wrap the existing reclaimable list and its
> spinlock. Each cgroup later will have one instance of this structure to
> track EPC pages allocated for processes associated with the same cgroup.
> Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
> from the reclaimable list in this structure when its usage reaches near
> its limit.
> 
> Currently, ksgxd does not track the VA, SECS pages. They are considered
> as 'unreclaimable' pages that are only deallocated when their respective
> owning enclaves are destroyed and all associated resources released.
> 
> When an EPC cgroup can not reclaim any more reclaimable EPC pages to
> reduce its usage below its limit, the cgroup must also reclaim those
> unreclaimables by killing their owning enclaves. The VA and SECS pages
> later are also tracked in an 'unreclaimable' list added to this structure
> to support this OOM killing of enclaves.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>

... it was first signed by Sean and then by Kristen, which doesn't look right.

If the patch was from Kristen, then either Sean's SoB should come after
Kristen's (which means Sean took Kristen's patch and signed it), or you need to
have a Co-developed-by tag for Sean right before his SoB (which indicates Sean
participated in the development of the patch but likely he wasn't the main
developer).

But I _guess_ the patch was just from Sean.

> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>

You don't need to 'Cc:' Sean if the patch has Sean's SoB.

For more information, please refer to the "When to use Acked-by:, Cc:, and
Co-developed-by:" section here:

https://docs.kernel.org/process/submitting-patches.html

Also an explanation of when to use 'Cc:' from Sean (ignore the technical stuff):

https://lore.kernel.org/lkml/ZOZteOxJvq9v609G@google.com/

(And please check other patches too.)



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
       [not found]         ` <851f9b3043732c17cd8f86a77ccee0b7c6caa22f.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2023-09-14 16:13           ` Dave Hansen
  2023-09-14 21:58             ` Huang, Kai
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Hansen @ 2023-09-14 16:13 UTC (permalink / raw)
  To: Huang, Kai, hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	dave.hansen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jarkko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org,
	haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
	Mehta, Sohil, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: kristen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
	yangjie-0li6OtcxBFHby3iVrkZq2A@public.gmane.org, Li, Zhiquan1,
	Christopherson,, Sean,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org, Zhang, Bo,
	anakrish-0li6OtcxBFHby3iVrkZq2A@public.gmane.org

On 9/14/23 03:31, Huang, Kai wrote:
>> Signed-off-by: Haitao Huang <haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>> Cc: Sean Christopherson <seanjc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> You don't need 'Cc:' Sean if the patch has Sean's SoB.

It is a SoB for Sean's @intel address and cc's his @google address.

It is fine.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
  2023-09-14 16:13           ` Dave Hansen
@ 2023-09-14 21:58             ` Huang, Kai
  0 siblings, 0 replies; 45+ messages in thread
From: Huang, Kai @ 2023-09-14 21:58 UTC (permalink / raw)
  To: Hansen, Dave, linux-sgx@vger.kernel.org, x86@kernel.org,
	dave.hansen@linux.intel.com, cgroups@vger.kernel.org,
	hpa@zytor.com, linux-kernel@vger.kernel.org, jarkko@kernel.org,
	bp@alien8.de, haitao.huang@linux.intel.com, tglx@linutronix.de,
	tj@kernel.org, Mehta, Sohil, mingo@redhat.com
  Cc: kristen@linux.intel.com, anakrish@microsoft.com, Li, Zhiquan1,
	Christopherson,, Sean, mikko.ylinen@linux.intel.com,
	yangjie@microsoft.com, Zhang, Bo

On Thu, 2023-09-14 at 09:13 -0700, Dave Hansen wrote:
> On 9/14/23 03:31, Huang, Kai wrote:
> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Cc: Sean Christopherson <seanjc@google.com>
> > You don't need 'Cc:' Sean if the patch has Sean's SoB.
> 
> It is a SoB for Sean's @intel address and cc's his @google address.
> 
> It is fine.

Oops, I didn't notice the email difference.  Thanks for pointing it out!

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
  2023-09-14 10:31       ` Huang, Kai
       [not found]         ` <851f9b3043732c17cd8f86a77ccee0b7c6caa22f.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2023-09-15 16:28         ` Haitao Huang
  1 sibling, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-15 16:28 UTC (permalink / raw)
  To: hpa@zytor.com, linux-sgx@vger.kernel.org, x86@kernel.org,
	dave.hansen@linux.intel.com, cgroups@vger.kernel.org,
	bp@alien8.de, linux-kernel@vger.kernel.org, jarkko@kernel.org,
	tglx@linutronix.de, Mehta, Sohil, tj@kernel.org, mingo@redhat.com,
	Huang, Kai
  Cc: kristen@linux.intel.com, yangjie@microsoft.com, Li, Zhiquan1,
	Christopherson,, Sean, mikko.ylinen@linux.intel.com, Zhang, Bo,
	anakrish@microsoft.com

On Thu, 14 Sep 2023 05:31:30 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> Some non-technical stuff:
>
> On Tue, 2023-09-12 at 21:06 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> The patch was from Kristen, but ...
>
>>
>> Introduce a data structure to wrap the existing reclaimable list and its
>> spinlock. Each cgroup later will have one instance of this structure to
>> track EPC pages allocated for processes associated with the same cgroup.
>> Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
>> from the reclaimable list in this structure when its usage reaches near
>> its limit.
>>
>> Currently, ksgxd does not track the VA, SECS pages. They are considered
>> as 'unreclaimable' pages that are only deallocated when their respective
>> owning enclaves are destroyed and all associated resources released.
>>
>> When an EPC cgroup can not reclaim any more reclaimable EPC pages to
>> reduce its usage below its limit, the cgroup must also reclaim those
>> unreclaimables by killing their owning enclaves. The VA and SECS pages
>> later are also tracked in an 'unreclaimable' list added to this  
>> structure
>> to support this OOM killing of enclaves.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> ... it was firstly signed by Sean and then Kristen, which doesn't sound  
> right.
>
> If the patch was from Kristen, then either Sean's SoB should come after
> Kristen's (which means Sean took Kristen's patch and signed it), or you  
> need to
> have a Co-developed-by tag for Sean right before his SoB (which  
> indicates Sean
> participated in the development of the patch but likely he wasn't the  
> main
> developer).
>
> But I _guess_ the patch was just from Sean.
>
From what I see:
In v1 Kristen included a "From" tag for Sean. In v2 she split the original
patch into two and added some wrappers. At that time, she removed the
"From" tag for both patches but kept the SOB and CC.

@Kristen, could you confirm?

I only removed the wrappers from v2 based on Dave's comments.
So if confirmed by Kristen, should we add a "From" tag for Sean?

I'll double check the other patches.
Thanks
Haitao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-13  4:06   ` [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
       [not found]     ` <20230913040635.28815-2-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2023-09-15 17:55     ` Tejun Heo
       [not found]       ` <ZQSaoXBg-X4cwFdX-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-09-15 17:55 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, linux-kernel, linux-sgx, x86, cgroups, tglx,
	mingo, bp, hpa, sohil.mehta, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen, yangjie

On Tue, Sep 12, 2023 at 09:06:18PM -0700, Haitao Huang wrote:
> @@ -37,6 +37,11 @@ struct misc_res {
>  	u64 max;
>  	atomic64_t usage;
>  	atomic64_t events;
> +
> +	/* per resource callback ops */
> +	int (*misc_cg_alloc)(struct misc_cg *cg);
> +	void (*misc_cg_free)(struct misc_cg *cg);
> +	void (*misc_cg_max_write)(struct misc_cg *cg);

A nit about naming. These are already in misc_res and cgroup_ and cgrp_
prefixes are a lot more common. So, maybe go for sth like cgrp_alloc?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events
       [not found]       ` <ZQSaoXBg-X4cwFdX-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2023-09-15 17:58         ` Tejun Heo
  2023-09-16  1:27           ` Haitao Huang
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-09-15 17:58 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w,
	zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Fri, Sep 15, 2023 at 07:55:45AM -1000, Tejun Heo wrote:
> On Tue, Sep 12, 2023 at 09:06:18PM -0700, Haitao Huang wrote:
> > @@ -37,6 +37,11 @@ struct misc_res {
> >  	u64 max;
> >  	atomic64_t usage;
> >  	atomic64_t events;
> > +
> > +	/* per resource callback ops */
> > +	int (*misc_cg_alloc)(struct misc_cg *cg);
> > +	void (*misc_cg_free)(struct misc_cg *cg);
> > +	void (*misc_cg_max_write)(struct misc_cg *cg);
> 
> A nit about naming. These are already in misc_res and cgroup_ and cgrp_
> prefixes are a lot more common. So, maybe go for sth like cgrp_alloc?

Ah, never mind about the prefix part. misc is using cg_ prefix widely
already.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 00/18] Add Cgroup support for SGX EPC memory
       [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
                     ` (17 preceding siblings ...)
  2023-09-13  4:06   ` [PATCH v4 18/18] selftests/sgx: Add scripts for epc cgroup testing Haitao Huang
@ 2023-09-15 18:26   ` Tejun Heo
  18 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2023-09-15 18:26 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko-DgEjT+Ai2ygdnm+yROfE0A, dave.hansen-VuQAYsv1563Yd54FQh9/CA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sgx-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, bp-Gina5bIWoIWzQB+pC5nmwQ,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, sohil.mehta-ral2JQCrhuEAvxtiuMwx3w,
	zhiquan1.li-ral2JQCrhuEAvxtiuMwx3w,
	kristen-VuQAYsv1563Yd54FQh9/CA, seanjc-hpIqsD4AKlfQT0dZR+AlfA,
	zhanb-0li6OtcxBFHby3iVrkZq2A, anakrish-0li6OtcxBFHby3iVrkZq2A,
	mikko.ylinen-VuQAYsv1563Yd54FQh9/CA,
	yangjie-0li6OtcxBFHby3iVrkZq2A

On Tue, Sep 12, 2023 at 09:06:17PM -0700, Haitao Huang wrote:
> SGX EPC memory allocations are separate from normal RAM allocations, and
> are managed solely by the SGX subsystem. The existing cgroup memory
> controller cannot be used to limit or account for SGX EPC memory, which is
> a desirable feature in some environments, e.g., support for pod level
> control in a Kubernetes cluster on a VM or baremetal host [1,2].
> 
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. The user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.

Minor nit aside, it looks fine from cgroup side.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-15 17:58         ` Tejun Heo
@ 2023-09-16  1:27           ` Haitao Huang
  0 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-16  1:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jarkko, dave.hansen, linux-kernel, linux-sgx, x86, cgroups, tglx,
	mingo, bp, hpa, sohil.mehta, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen, yangjie

On Fri, 15 Sep 2023 12:58:11 -0500, Tejun Heo <tj@kernel.org> wrote:

> On Fri, Sep 15, 2023 at 07:55:45AM -1000, Tejun Heo wrote:
>> On Tue, Sep 12, 2023 at 09:06:18PM -0700, Haitao Huang wrote:
>> > @@ -37,6 +37,11 @@ struct misc_res {
>> >  	u64 max;
>> >  	atomic64_t usage;
>> >  	atomic64_t events;
>> > +
>> > +	/* per resource callback ops */
>> > +	int (*misc_cg_alloc)(struct misc_cg *cg);
>> > +	void (*misc_cg_free)(struct misc_cg *cg);
>> > +	void (*misc_cg_max_write)(struct misc_cg *cg);
>>
>> A nit about naming. These are already in misc_res and cgroup_ and cgrp_
>> prefixes are a lot more common. So, maybe go for sth like cgrp_alloc?
>
> Ah, never mind about the prefix part. misc is using cg_ prefix widely
> already.
>


How about changing them to plain alloc, free, max_write, since they are per
resource type, not per cgroup?
That would also follow the no-prefix naming scheme, like "open" for fops,
vma_ops, etc.
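
To make the naming concrete, here is a minimal user-space sketch of
callbacks with plain names. Everything prefixed with demo_ is hypothetical
and is not taken from the patch; only the alloc/free/max_write naming idea
comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

struct demo_cg;

/* Hypothetical ops table: plain member names, like "open" in fops. */
struct demo_res_ops {
	int (*alloc)(struct demo_cg *cg);
	void (*free)(struct demo_cg *cg);
	void (*max_write)(struct demo_cg *cg);
};

/* Hypothetical stand-in for a cgroup; the counters only record calls. */
struct demo_cg {
	const struct demo_res_ops *ops;
	int alloc_calls;
	int free_calls;
	int max_write_calls;
};

static int demo_alloc(struct demo_cg *cg)
{
	cg->alloc_calls++;
	return 0;
}

static void demo_free(struct demo_cg *cg)
{
	cg->free_calls++;
}

static void demo_max_write(struct demo_cg *cg)
{
	cg->max_write_calls++;
}

static const struct demo_res_ops demo_ops = {
	.alloc = demo_alloc,
	.free = demo_free,
	.max_write = demo_max_write,
};

/* Invoke the alloc callback if one is registered; 0 means success. */
static int demo_call_alloc(struct demo_cg *cg)
{
	assert(cg && cg->ops);
	return cg->ops->alloc ? cg->ops->alloc(cg) : 0;
}
```

With plain member names a call site reads cg->ops->alloc(cg) and the
resource type is implied by the table the callbacks live in.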

Thanks for your review.

Haitao

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events
  2023-09-13  9:39       ` Jarkko Sakkinen
@ 2023-09-16  4:11         ` Haitao Huang
       [not found]           ` <op.2bci9anpwjvjmi-yDQzE4XY+yVaPPhiJ6yCxLKMmGWinSIL2HeeBUIffwg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Haitao Huang @ 2023-09-16  4:11 UTC (permalink / raw)
  To: dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups, tglx,
	mingo, bp, hpa, sohil.mehta, Jarkko Sakkinen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie

Hi Jarkko

On Wed, 13 Sep 2023 04:39:06 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> Consumers of the misc cgroup controller might need to perform separate
>> actions for Cgroups Subsystem State(CSS) events: cgroup alloc and free.
>
> nit: s/State(CSS)/State (CSS)/
>
> "cgroup alloc" and "cgroup free" mean absolutely nothing.
>
>
>> In addition, writes to the max value may also need separate action. Add
>
> What "the max value"?
>
>> the ability to allow downstream users to setup callbacks for these
>> operations, and call the corresponding per-resource-type callback when
>> appropriate.
>
> Who are "the downstream users" and what sort of callbacks they setup?

How about this?

The misc cgroup controller (subsystem) currently does not perform
resource-type-specific actions for Cgroups Subsystem State (CSS) events,
i.e., the 'css_alloc' event when a cgroup is created and the 'css_free'
event when a cgroup is destroyed, or when a user writes the max value to
the misc.max file to set the consumption limit of a specific resource
[admin-guide/cgroup-v2.rst, 5-9. Misc].

Define callbacks for those events and allow resource providers to register  
the callbacks per resource type as needed. This will be utilized later by  
the EPC misc cgroup support implemented in the SGX driver:
- On cgroup alloc, allocate and initialize necessary structures for EPC  
reclaiming, e.g., LRU list, work queue, etc.
- On cgroup free, clean up and free those structures created in alloc.
- On max write, trigger EPC reclaiming if the new limit is at or below  
current consumption.
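
The max-write case described above could be sketched as follows. This is a
user-space illustration only: the demo_ types and functions are
hypothetical, and a boolean flag stands in for queuing reclaim work. It
assumes the callback is registered per resource type and is invoked after
the new limit has been stored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_res_type { DEMO_RES_OTHER, DEMO_RES_EPC, DEMO_RES_TYPES };

struct demo_cg;

/* Hypothetical per-resource state with an optional max_write callback. */
struct demo_res {
	uint64_t max;
	uint64_t usage;
	bool reclaim_queued;	/* stands in for queuing reclaim work */
	void (*max_write)(struct demo_cg *cg);
};

struct demo_cg {
	struct demo_res res[DEMO_RES_TYPES];
};

/* EPC callback: request reclaim if the limit is at or below usage. */
static void demo_epc_max_write(struct demo_cg *cg)
{
	struct demo_res *r = &cg->res[DEMO_RES_EPC];

	if (r->max <= r->usage)
		r->reclaim_queued = true;
}

/* Store the new limit, then call the per-resource callback, if any. */
static void demo_set_max(struct demo_cg *cg, enum demo_res_type type,
			 uint64_t max)
{
	struct demo_res *r;

	assert(cg && type < DEMO_RES_TYPES);
	r = &cg->res[type];
	r->max = max;
	if (r->max_write)
		r->max_write(cg);
}
```

Resource types that register no callback are unaffected, so existing misc
resources would keep their current behavior under this scheme.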

Thanks
Haitao


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 15/18] x86/sgx: Prepare for multiple LRUs
  2023-09-13 15:42       ` Jarkko Sakkinen
@ 2023-09-16  4:18         ` Haitao Huang
  0 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-16  4:18 UTC (permalink / raw)
  To: dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups, tglx,
	mingo, bp, hpa, sohil.mehta, Jarkko Sakkinen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie

On Wed, 13 Sep 2023 10:42:52 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
>> Add sgx_can_reclaim() wrapper and encapsulate direct references to the
>> global LRU list in the reclaimer functions so that they can be called  
>> with
>> an LRU list per EPC cgroup.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>> ---
>> V4:
>> - Re-organized this patch to include all changes related to
>> encapsulation of the global LRU
>> - Moved this patch to precede the EPC cgroup patch
>> ---
>>  arch/x86/kernel/cpu/sgx/main.c | 41 +++++++++++++++++++++++-----------
>>  1 file changed, 28 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c  
>> b/arch/x86/kernel/cpu/sgx/main.c
>> index ce316bd5e5bb..3d396fe5ec09 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -34,6 +34,16 @@ static DEFINE_XARRAY(sgx_epc_address_space);
>>   */
>>  static struct sgx_epc_lru_lists sgx_global_lru;
>>
>> +static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct  
>> sgx_epc_page *epc_page)
>> +{
>> +	return &sgx_global_lru;
>> +}
>
> I'd simply export sgx_global_lru.
>
The purpose of this patch is to hide sgx_global_lru so that later we can
have an LRU per cgroup.
I'll update the commit message to make it clear this is not just for
sgx_can_reclaim().
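
A rough user-space sketch of why the accessor helps, assuming a page carries
a pointer to the cgroup it was charged to. The demo_ names are hypothetical,
and the fallback to the global LRU for uncharged pages is an assumption made
for this illustration, not necessarily what the series does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified LRU: only a count of reclaimable pages. */
struct demo_lru {
	int nr_reclaimable;
};

struct demo_epc_cgroup {
	struct demo_lru lru;
};

struct demo_epc_page {
	struct demo_epc_cgroup *epc_cg;	/* NULL if not charged to a cgroup */
};

static struct demo_lru demo_global_lru;

/* Callers ask for "the LRU of this page" and never name the global one. */
static struct demo_lru *demo_page_lru(struct demo_epc_page *page)
{
	assert(page);
	if (page->epc_cg)
		return &page->epc_cg->lru;
	return &demo_global_lru;
}

static bool demo_lru_can_reclaim(const struct demo_lru *lru)
{
	return lru->nr_reclaimable > 0;
}
```

Because call sites only go through the accessor, switching from one global
list to per-cgroup lists changes the accessor body and nothing else.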

>> +static inline bool sgx_can_reclaim(void)
>> +{
>> +	return !list_empty(&sgx_global_lru.reclaimable);
>> +}
>
>
> Accessors for the object should be named so that this fact is reflected,
> e.g. sgx_global_lru_can_reclaim() in this case.
>
> I would just open code this to the call sites though.
>
ditto

Thanks
Haitao

* Re: [PATCH v4 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-09-13 15:34       ` Jarkko Sakkinen
@ 2023-09-16  4:19         ` Haitao Huang
  0 siblings, 0 replies; 45+ messages in thread
From: Haitao Huang @ 2023-09-16  4:19 UTC (permalink / raw)
  To: dave.hansen, tj, linux-kernel, linux-sgx, x86, cgroups, tglx, mingo,
	bp, hpa, sohil.mehta, Jarkko Sakkinen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie

On Wed, 13 Sep 2023 10:34:28 -0500, Jarkko Sakkinen <jarkko@kernel.org>
wrote:

>> +++ b/arch/x86/kernel/cpu/sgx/encl.h
>> @@ -39,6 +39,7 @@ enum sgx_encl_flags {
>>  	SGX_ENCL_DEBUG		= BIT(1),
>>  	SGX_ENCL_CREATED	= BIT(2),
>>  	SGX_ENCL_INITIALIZED	= BIT(3),
>> +	SGX_ENCL_OOM		= BIT(4),
>
> Given how the constants before it are named, maybe SGX_ENCL_NO_MEMORY
> would be more obvious.

Will do.
Thanks
Haitao

* Re: [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events
       [not found]           ` <op.2bci9anpwjvjmi-yDQzE4XY+yVaPPhiJ6yCxLKMmGWinSIL2HeeBUIffwg@public.gmane.org>
@ 2023-09-25 16:57             ` Jarkko Sakkinen
  2023-09-25 16:57               ` Jarkko Sakkinen
  0 siblings, 1 reply; 45+ messages in thread
From: Jarkko Sakkinen @ 2023-09-25 16:57 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, x86,
	cgroups, tglx, mingo, bp, hpa, sohil.mehta
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie

On Sat Sep 16, 2023 at 7:11 AM EEST, Haitao Huang wrote:
> Hi Jarkko
>
> On Wed, 13 Sep 2023 04:39:06 -0500, Jarkko Sakkinen <jarkko@kernel.org>
> wrote:
>
> > On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> >>
> >> Consumers of the misc cgroup controller might need to perform separate
> >> actions for Cgroups Subsystem State(CSS) events: cgroup alloc and free.
> >
> > nit: s/State(CSS)/State (CSS)/
> >
> > "cgroup alloc" and "cgroup free" mean absolutely nothing.
> >
> >
> >> In addition, writes to the max value may also need separate action. Add
> >
> > What "the max value"?
> >
> >> the ability to allow downstream users to setup callbacks for these
> >> operations, and call the corresponding per-resource-type callback when
> >> appropriate.
> >
> > Who are "the downstream users" and what sort of callbacks they setup?
>
> How about this?
>
> The misc cgroup controller (subsystem) currently does not perform
> resource-type-specific actions for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed. Nor does it do so when a user writes a new max
> value to the misc.max file to set the consumption limit of a specific
> resource [admin-guide/cgroup-v2.rst, 5-9. Misc].
>
> Define callbacks for those events and allow resource providers to register
> the callbacks per resource type as needed. This will be utilized later by
> the EPC misc cgroup support implemented in the SGX driver:
> - On cgroup alloc, allocate and initialize necessary structures for EPC
>   reclaiming, e.g., LRU list, work queue, etc.
> - On cgroup free, clean up and free those structures created in alloc.
> - On max write, trigger EPC reclaiming if the new limit is at or below
>   current consumption.

Yeah, this is much better (I was on holiday, hence the delay in
responding).
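
For readers following along, the three callbacks in the proposed commit message could be sketched roughly as below. This is a userspace illustration under stated assumptions, not the misc controller API: res_ops, register_res_ops, notify_max_write, and epc_state are all hypothetical names, and the "reclaim" is just a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resource types; only one provider in this sketch. */
enum res_type { RES_EPC, RES_TYPES };

/* Hypothetical per-resource-type callback table. */
struct res_ops {
	int  (*on_alloc)(void *priv);	/* cgroup created */
	void (*on_free)(void *priv);	/* cgroup destroyed */
	void (*on_max_write)(void *priv, unsigned long new_max,
			     unsigned long usage);	/* limit written */
};

static const struct res_ops *registered_ops[RES_TYPES];

/* Provider side: register callbacks once per resource type. */
static int register_res_ops(enum res_type type, const struct res_ops *ops)
{
	assert(type < RES_TYPES);
	if (!ops || registered_ops[type])
		return -1;	/* missing table or already registered */
	registered_ops[type] = ops;
	return 0;
}

/* Controller side: dispatch a max write to the provider, if any. */
static void notify_max_write(enum res_type type, void *priv,
			     unsigned long new_max, unsigned long usage)
{
	const struct res_ops *ops = registered_ops[type];

	if (ops && ops->on_max_write)
		ops->on_max_write(priv, new_max, usage);
}

/* Hypothetical per-cgroup EPC state kept by the provider. */
struct epc_state {
	int lru_ready;
	unsigned long reclaim_requests;
};

static int epc_on_alloc(void *priv)
{
	((struct epc_state *)priv)->lru_ready = 1;	/* set up LRU, work queue */
	return 0;
}

static void epc_on_free(void *priv)
{
	((struct epc_state *)priv)->lru_ready = 0;	/* tear down what alloc made */
}

static void epc_on_max_write(void *priv, unsigned long new_max,
			     unsigned long usage)
{
	/* Reclaim only when the new limit is at or below current usage. */
	if (new_max <= usage)
		((struct epc_state *)priv)->reclaim_requests++;
}

static const struct res_ops epc_ops = {
	.on_alloc = epc_on_alloc,
	.on_free = epc_on_free,
	.on_max_write = epc_on_max_write,
};
```

Keeping the table per resource type means the controller core needs no knowledge of EPC; it only checks whether a provider registered a callback for the event.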

> Thanks
> Haitao

BR, Jarkko

end of thread, other threads:[~2023-09-25 16:57 UTC | newest]

Thread overview: 45+ messages
2023-09-13  4:06 [PATCH v4 00/18] Add Cgroup support for SGX EPC memory Haitao Huang
     [not found] ` <20230913040635.28815-1-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13  4:06   ` [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
     [not found]     ` <20230913040635.28815-2-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13  9:39       ` Jarkko Sakkinen
2023-09-16  4:11         ` Haitao Huang
     [not found]           ` <op.2bci9anpwjvjmi-yDQzE4XY+yVaPPhiJ6yCxLKMmGWinSIL2HeeBUIffwg@public.gmane.org>
2023-09-25 16:57             ` Jarkko Sakkinen
2023-09-25 16:57               ` Jarkko Sakkinen
2023-09-15 17:55     ` Tejun Heo
     [not found]       ` <ZQSaoXBg-X4cwFdX-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2023-09-15 17:58         ` Tejun Heo
2023-09-16  1:27           ` Haitao Huang
2023-09-13  4:06   ` [PATCH v4 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver Haitao Huang
     [not found]     ` <20230913040635.28815-3-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13  9:43       ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 03/18] x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists Haitao Huang
     [not found]     ` <20230913040635.28815-4-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13  9:46       ` Jarkko Sakkinen
2023-09-14 10:31       ` Huang, Kai
     [not found]         ` <851f9b3043732c17cd8f86a77ccee0b7c6caa22f.camel-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2023-09-14 16:13           ` Dave Hansen
2023-09-14 21:58             ` Huang, Kai
2023-09-15 16:28         ` Haitao Huang
2023-09-13  4:06   ` [PATCH v4 04/18] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
     [not found]     ` <20230913040635.28815-5-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:00       ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 05/18] x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists Haitao Huang
     [not found]     ` <20230913040635.28815-6-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:14       ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 06/18] x86/sgx: Introduce EPC page states Haitao Huang
     [not found]     ` <20230913040635.28815-7-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:15       ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state Haitao Huang
2023-09-13  4:06   ` [PATCH v4 08/18] x86/sgx: Use a list to track to-be-reclaimed pages Haitao Huang
     [not found]     ` <20230913040635.28815-9-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:30       ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
     [not found]     ` <20230913040635.28815-10-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:31       ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 10/18] x86/sgx: Add EPC page flags to identify owner types Haitao Huang
2023-09-13  4:06   ` [PATCH v4 11/18] x86/sgx: store unreclaimable pages in LRU lists Haitao Huang
     [not found]     ` <20230913040635.28815-12-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:33       ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Haitao Huang
     [not found]     ` <20230913040635.28815-13-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:34       ` Jarkko Sakkinen
2023-09-16  4:19         ` Haitao Huang
2023-09-13  4:06   ` [PATCH v4 13/18] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
2023-09-13 15:36     ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 14/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Haitao Huang
2023-09-13  4:06   ` [PATCH v4 15/18] x86/sgx: Prepare for multiple LRUs Haitao Huang
     [not found]     ` <20230913040635.28815-16-haitao.huang-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2023-09-13 15:42       ` Jarkko Sakkinen
2023-09-16  4:18         ` Haitao Huang
2023-09-13  4:06   ` [PATCH v4 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
2023-09-13 15:48     ` Jarkko Sakkinen
2023-09-13  4:06   ` [PATCH v4 17/18] Docs/x86/sgx: Add description for cgroup support Haitao Huang
2023-09-13  4:06   ` [PATCH v4 18/18] selftests/sgx: Add scripts for epc cgroup testing Haitao Huang
2023-09-15 18:26   ` [PATCH v4 00/18] Add Cgroup support for SGX EPC memory Tejun Heo
