[PATCH v16 00/16] Add Cgroup support for SGX EPC memory

linux-sgx.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v16 00/16] Add Cgroup support for SGX EPC memory
@ 2024-08-21  1:53 Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 01/16] x86/sgx: Replace boolean parameters with enums Haitao Huang
                   ` (17 more replies)
  0 siblings, 18 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod level control in a Kubernates cluster on a VM or bare-metal host
[1,2].

This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.

Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which are normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.

For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design. 

Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------

tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes the existing LRU scanning algorithm, or to add new LRUs.
The result: Add a new LRU for each cgroup and scans those instead. Replace
the existing global cgroup with the root cgroup's LRU (only when this new
support is compiled in, obviously).

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.

Unreclaimable Enclave Pages
---------------------------

There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array(VA) and Secure Enclave
Control Structure (SECS)[5], holds meta data of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach and does
not track those in per-cgroup LRUs and considers them as unreclaimable
pages. The allocation of these pages is counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.

Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for the misc-controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves were removed,
and the limit is only enforced at the time of new EPC allocation request.
When a cgroup hits its limit but nothing left in the LRUs of the subtree,
i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
within that cgroup will result in an 'ENOMEM'.

Unreclaimable Guest VM EPC Pages
--------------------------------

The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable left within the cgroup. The virtual EPC driver translates the
ENOMEM error resulted from an EPC allocation request into a SIGBUS to the
user process exactly the same way handling host running out of physical
EPC.

This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to utilize the misc cgroup
controller rather than a custom controller. I have been updating the
patches based on review comments since V2 [7-20], simplified the
implementation/design, added selftest scripts, fixed some stability issues
found from testing.

Thanks to all for the review/test/tags/feedback provided on the previous
versions. 

I appreciate your further reviewing/testing and providing tags if
appropriate.

---
V16:
- Revised the per-cgroup reclamation basic flow
  sgx_cgroup_reclaim_pages(): add next_cg field in each cgroup to track
  the next descendant to scan, and create a synchronized iterator to more
  fairly scan all descendants if needed for reclamation. (Kai)
- Separate patches to abstract the uses of global LRU, sgx_cgroup_reclaim_direct(),
  sgx_cgroup_reclaim_global() implementaions. (Kai)
- MISC don't call the ops if capacity is zero. (Kai)
- Commit message improvements, clarified requirements for per-cgroup
  reclamation. (Kai)
- Fix bugs in handling failures during init.
- Only turn on callbacks and set capacity at the end of sgx_init()

V15:
- Disable SGX when sgx_cgroup_init() fails instead of using BUG_ON()
  (Jarkko)
- Reset capacity if sgx_cgroup_init() fails. (Kai)
- Style fixes (Jarkko, Kai)
- In misc.c, only invoke the ->free() callbacks for resource types whose
  ->alloc() callback  was called and returned success. (Ridong)

V14:
- modified sgx_cgroup_reclaim_pages() to return the next node. Caller can use it as the new
  starting node for next round of reclamation attempt if needed. This is to fix a corner case
where a super busy top level cgroup may block reclamation in lower level cgroups. (Kai)
- Move renaming of sgx_should_reclaim_global() to the patch 'x86/sgx: Add basic EPC reclamation
flow for cgroup'. (Kai)

v13:
- Only allocate workqueue for SGX cgroup when misc is enabled and BUG_ON() when allocation fails
- Add more tags
- Commit logs and style improvements (Kai)
- Test script improvements (Jarkko)

V12:
- Integrate test scripts to kselftests "run_tests" target. (Jarkko)
- Remove CGROUP_SGX_EPC kconfig, conditionally compile with CGROUP_MISC enabled. (Jarkko)
- Explain why taking 'struct misc_cg *cg' as parameter, but not 'struct misc_res *res' in the
  changelog for patch #2. (Kai)
- Remove "unlikely" in patch #2 (Kai)

V11:
- Update copyright years and use c style (Kai)
- Improve and simplify test scripts: remove cgroup-tools and bash dependency, drop cgroup v1.
  (Jarkko, Michal)
- Add more stub/wrapper functions to minimize #ifdefs in c file. (Kai)
- Revise commit message for patch #8 to clarify design rational (Kai)
- Print error instead of WARN for init failure. (Kai)
- Add check for need to queue an async reclamation before returning from
  sgx_cgroup_try_charge(), do so if needed.

V10:
- Use enum instead of boolean for the 'reclaim' parameters in
  sgx_alloc_epc_page(). (Dave, Jarkko)
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
- Add comments/macros to clarify the cgroup async reclaimer design. (Kai)
- Simplify sgx_reclaim_pages() signature, removing a pointer passed in.
  (Kai)
- Clarify design of sgx_cgroup_reclaim_pages(). (Kai)
	- Does not return a value for callers to check.
	- Its usage pattern is similar to that of sgx_reclaim_pages() now
- Add cond_resched() in the loop in the cgroup reclaimer to improve
  liveliness.
- Add logic for cgroup level reclamation in sgx_reclaim_direct()
- Restructure V9 patches 7-10 to make them flow better. (Kai)
- Disable cgroup if workqueue allocation failed during init. (Kai)
- Shorten names for EPC cgroup functions, structures and variables.
  (Jarkko)
- Separate out a helper for for addressing single iteration of the loop in
  sgx_cgroup_try_charge(). (Jarkko)
- More cleanup/clarifying/comments/style fixes. (Kai, Jarkko)  

V9:
- Add comments for static variables outside functions. (Jarkko)
- Remove unnecessary ifs. (Tim)
- Add more Reviewed-By: tags from Jarkko and TJ.

V8:
- Style fixes. (Jarkko)
- Abstract _misc_res_free/alloc() (Jarkko)
- Remove unneeded NULL checks. (Jarkko)

V7:
- Split the large patch for the final EPC implementation, #10 in V6, into
  smaller ones. (Dave, Kai)
- Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
  into two functions (Kai)
- Removed patches to introduce the EPC page states, list for storing
  candidate pages for reclamation. (not needed due to above changes)
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of
  required callback functions (Jarkko)
- Initialize epc cgroup in sgx driver init function. (Kai)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
- Use a static for root cgroup (Kai)

[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/
[4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
[5]Documentation/arch/x86/sgx.rst, Section"Enclave Page Types"
[6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[7]v2: https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
[8]v3: https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang@linux.intel.com/
[9]v4: https://lore.kernel.org/all/20230913040635.28815-1-haitao.huang@linux.intel.com/
[10]v5: https://lore.kernel.org/all/20230923030657.16148-1-haitao.huang@linux.intel.com/
[11]v6: https://lore.kernel.org/linux-sgx/20231030182013.40086-1-haitao.huang@linux.intel.com/
[12]v7: https://lore.kernel.org/linux-sgx/20240122172048.11953-1-haitao.huang@linux.intel.com/T/#t
[13]v8: https://lore.kernel.org/linux-sgx/20240130020938.10025-1-haitao.huang@linux.intel.com/T/#t
[14]v9: https://lore.kernel.org/lkml/20240205210638.157741-1-haitao.huang@linux.intel.com/T/
[15]v10: https://lore.kernel.org/linux-sgx/20240328002229.30264-1-haitao.huang@linux.intel.com/T/#t
[16]v11: https://lore.kernel.org/lkml/20240410182558.41467-1-haitao.huang@linux.intel.com/
[17]v12: https://lore.kernel.org/lkml/20240416032011.58578-1-haitao.huang@linux.intel.com/
[18]v13: https://lore.kernel.org/lkml/20240430195108.5676-1-haitao.huang@linux.intel.com/ 
[19]v14: https://lore.kernel.org/linux-sgx/20240531222630.4634-1-haitao.huang@linux.intel.com/T/#t
[20]v15: https://lore.kernel.org/linux-sgx/20240617125321.36658-1-haitao.huang@linux.intel.com/T/#t

Haitao Huang (7):
  x86/sgx: Replace boolean parameters with enums
  x86/sgx: Encapsulate uses of the global LRU
  x86/sgx: Add basic EPC reclamation flow for cgroup
  x86/sgx: Charge mem_cgroup for per-cgroup reclamation
  x86/sgx: Revise global reclamation for EPC cgroups
  x86/sgx: implement direct reclamation for cgroups
  selftests/sgx: Add scripts for EPC cgroup testing

Kristen Carlson Accardi (7):
  cgroup/misc: Add per resource callbacks for CSS events
  cgroup/misc: Export APIs for SGX driver
  cgroup/misc: Add SGX EPC resource type
  x86/sgx: Implement basic EPC misc cgroup functionality
  x86/sgx: Abstract tracking reclaimable pages in LRU
  x86/sgx: Implement async reclamation for cgroup
  x86/sgx: Turn on per-cgroup EPC reclamation

Sean Christopherson (2):
  x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
  Docs/x86/sgx: Add description for cgroup support

 Documentation/arch/x86/sgx.rst                |  83 ++++
 arch/x86/kernel/cpu/sgx/Makefile              |   1 +
 arch/x86/kernel/cpu/sgx/encl.c                |  41 +-
 arch/x86/kernel/cpu/sgx/encl.h                |   7 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 438 ++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h          | 108 +++++
 arch/x86/kernel/cpu/sgx/ioctl.c               |  10 +-
 arch/x86/kernel/cpu/sgx/main.c                | 219 ++++++---
 arch/x86/kernel/cpu/sgx/sgx.h                 |  54 ++-
 arch/x86/kernel/cpu/sgx/virt.c                |   2 +-
 include/linux/misc_cgroup.h                   |  41 ++
 kernel/cgroup/misc.c                          | 113 ++++-
 tools/testing/selftests/sgx/Makefile          |   3 +-
 tools/testing/selftests/sgx/README            | 109 +++++
 tools/testing/selftests/sgx/ash_cgexec.sh     |  16 +
 tools/testing/selftests/sgx/config            |   4 +
 .../selftests/sgx/run_epc_cg_selftests.sh     | 294 ++++++++++++
 tools/testing/selftests/sgx/settings          |   2 +
 .../selftests/sgx/watch_misc_for_tests.sh     |  11 +
 19 files changed, 1444 insertions(+), 112 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
 create mode 100644 tools/testing/selftests/sgx/README
 create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
 create mode 100644 tools/testing/selftests/sgx/config
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100644 tools/testing/selftests/sgx/settings
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba
-- 
2.43.0

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 01/16] x86/sgx: Replace boolean parameters with enums
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

Replace boolean parameters for 'reclaim' in the function
sgx_alloc_epc_page() and its callers with an enum.

Also opportunistically remove non-static declaration of
__sgx_alloc_epc_page() and a typo

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Suggested-by: Jarkko Sakkinen <jarkko@kernel.org>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
 arch/x86/kernel/cpu/sgx/encl.c  | 12 ++++++------
 arch/x86/kernel/cpu/sgx/encl.h  |  4 ++--
 arch/x86/kernel/cpu/sgx/ioctl.c | 10 +++++-----
 arch/x86/kernel/cpu/sgx/main.c  | 14 +++++++-------
 arch/x86/kernel/cpu/sgx/sgx.h   | 13 +++++++++++--
 arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
 6 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..f474179b6f77 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
 	struct sgx_epc_page *epc_page;
 	int ret;
 
-	epc_page = sgx_alloc_epc_page(encl_page, false);
+	epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
 	if (IS_ERR(epc_page))
 		return epc_page;
 
@@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 		goto err_out_unlock;
 	}
 
-	epc_page = sgx_alloc_epc_page(encl_page, false);
+	epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
 	if (IS_ERR(epc_page)) {
 		if (PTR_ERR(epc_page) == -EBUSY)
 			vmret =  VM_FAULT_NOPAGE;
 		goto err_out_unlock;
 	}
 
-	va_page = sgx_encl_grow(encl, false);
+	va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM);
 	if (IS_ERR(va_page)) {
 		if (PTR_ERR(va_page) == -EBUSY)
 			vmret = VM_FAULT_NOPAGE;
@@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
- * @reclaim: Reclaim EPC pages directly if none available. Enclave
- *           mutex should not be held if this is set.
+ * @reclaim: Whether reclaim EPC pages directly if none available. Enclave
+ *           mutex should not be held for SGX_DO_RECLAIM.
  *
  * Allocate a free EPC page and convert it to a Version Array (VA) page.
  *
@@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim)
 {
 	struct sgx_epc_page *epc_page;
 	int ret;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fe15ade02ca1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 					  unsigned long offset,
 					  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
 void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 					 unsigned long addr);
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..793a0ba2cb16 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -17,7 +17,7 @@
 #include "encl.h"
 #include "encls.h"
 
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim)
 {
 	struct sgx_va_page *va_page = NULL;
 	void *err;
@@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	struct file *backing;
 	long ret;
 
-	va_page = sgx_encl_grow(encl, true);
+	va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
 	if (IS_ERR(va_page))
 		return PTR_ERR(va_page);
 	else if (va_page)
@@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 
 	encl->backing = backing;
 
-	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+	secs_epc = sgx_alloc_epc_page(&encl->secs, SGX_DO_RECLAIM);
 	if (IS_ERR(secs_epc)) {
 		ret = PTR_ERR(secs_epc);
 		goto err_out_backing;
@@ -269,13 +269,13 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	if (IS_ERR(encl_page))
 		return PTR_ERR(encl_page);
 
-	epc_page = sgx_alloc_epc_page(encl_page, true);
+	epc_page = sgx_alloc_epc_page(encl_page, SGX_DO_RECLAIM);
 	if (IS_ERR(epc_page)) {
 		kfree(encl_page);
 		return PTR_ERR(epc_page);
 	}
 
-	va_page = sgx_encl_grow(encl, true);
+	va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
 	if (IS_ERR(va_page)) {
 		ret = PTR_ERR(va_page);
 		goto err_out_free;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 27892e57c4ef..e64073fb4256 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -464,14 +464,14 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
 /**
  * __sgx_alloc_epc_page() - Allocate an EPC page
  *
- * Iterate through NUMA nodes and reserve ia free EPC page to the caller. Start
+ * Iterate through NUMA nodes and reserve a free EPC page to the caller. Start
  * from the NUMA node, where the caller is executing.
  *
  * Return:
  * - an EPC page:	A borrowed EPC pages were available.
  * - NULL:		Out of EPC pages.
  */
-struct sgx_epc_page *__sgx_alloc_epc_page(void)
+static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 {
 	struct sgx_epc_page *page;
 	int nid_of_current = numa_node_id();
@@ -543,12 +543,12 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 /**
  * sgx_alloc_epc_page() - Allocate an EPC page
  * @owner:	the owner of the EPC page
- * @reclaim:	reclaim pages if necessary
+ * @reclaim:	whether reclaim pages if necessary
  *
  * Iterate through EPC sections and borrow a free EPC page to the caller. When a
  * page is no longer needed it must be released with sgx_free_epc_page(). If
- * @reclaim is set to true, directly reclaim pages when we are out of pages. No
- * mm's can be locked when @reclaim is set to true.
+ * @reclaim is set to SGX_DO_RECLAIM, directly reclaim pages when we are out of
+ * pages. No mm's can be locked for SGX_DO_RECLAIM.
  *
  * Finally, wake up ksgxd when the number of pages goes below the watermark
  * before returning back to the caller.
@@ -557,7 +557,7 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
  *   an EPC page,
  *   -errno on error
  */
-struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 {
 	struct sgx_epc_page *page;
 
@@ -571,7 +571,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 		if (list_empty(&sgx_active_page_list))
 			return ERR_PTR(-ENOMEM);
 
-		if (!reclaim) {
+		if (reclaim == SGX_NO_RECLAIM) {
 			page = ERR_PTR(-EBUSY);
 			break;
 		}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..ca34cd4f58ac 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -29,6 +29,16 @@
 /* Pages on free list */
 #define SGX_EPC_PAGE_IS_FREE		BIT(1)
 
+/**
+ * enum sgx_reclaim - Whether EPC reclamation is allowed within a function.
+ * %SGX_NO_RECLAIM:		Do not reclaim EPC pages.
+ * %SGX_DO_RECLAIM:		Reclaim EPC pages as needed.
+ */
+enum sgx_reclaim {
+	SGX_NO_RECLAIM,
+	SGX_DO_RECLAIM
+};
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
@@ -83,13 +93,12 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 	return section->virt_addr + index * PAGE_SIZE;
 }
 
-struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
 void sgx_reclaim_direct(void);
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
-struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
 
 void sgx_ipi_cb(void *info);
 
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 7aaa3652e31d..e7fdc3a9abae 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -46,7 +46,7 @@ static int __sgx_vepc_fault(struct sgx_vepc *vepc,
 	if (epc_page)
 		return 0;
 
-	epc_page = sgx_alloc_epc_page(vepc, false);
+	epc_page = sgx_alloc_epc_page(vepc, SGX_NO_RECLAIM);
 	if (IS_ERR(epc_page))
 		return PTR_ERR(epc_page);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 01/16] x86/sgx: Replace boolean parameters with enums Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-27 10:45   ` Huang, Kai
  2024-08-29  9:25   ` Huang, Kai
  2024-08-21  1:53 ` [PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver Haitao Huang
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Kristen Carlson Accardi <kristen@linux.intel.com>

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.

The SGX callbacks will need to access the 'struct misc_cg'.  Pass
'struct misc_cg' to the callbacks but not the 'struct misc_res' because
the latter doesn't have a pointer pointing back to 'struct misc_cg'.

Link: https://lore.kernel.org/lkml/op.2kdw36otwjvjmi@hhuan26-mobl.amr.corp.intel.com/

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V16:
- Don't call the ops if capacity is zero. (Kai)
- Simplify the last paragraph of the commit message. (Kai)

V15:
- Style fixes (Jarkko, Kai)
- _misc_cg_res_free() takes the last index. Only call ->free() for those res types with ->alloc() successfully called. (Ridong)

V12:
- Add comments in commit to clarify reason to pass in misc_cg, not
misc_res. (Kai)
- Remove unlikely (Kai)

V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)

V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
 include/linux/misc_cgroup.h | 11 +++++
 kernel/cgroup/misc.c        | 88 ++++++++++++++++++++++++++++++++++---
 2 files changed, 93 insertions(+), 6 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 49eef10c8e59..e5159770a68e 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -28,6 +28,16 @@ struct misc_cg;
 
 #include <linux/cgroup.h>
 
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+	int (*alloc)(struct misc_cg *cg);
+	void (*free)(struct misc_cg *cg);
+};
+
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
@@ -62,6 +72,7 @@ struct misc_cg {
 
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 0e26068995a6..f3ce896d2c21 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
  */
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
  * @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,41 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity)
 }
 EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
 
+/**
+ * misc_cg_set_ops() - register resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * The callbacks in @ops will not be invoked if the capacity of @type is 0.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+	if (!valid_type(type))
+		return -EINVAL;
+
+	if (!ops)
+		return -EINVAL;
+
+	if (!ops->alloc) {
+		pr_err("%s: alloc missing\n", __func__);
+		return -EINVAL;
+	}
+
+	if (!ops->free) {
+		pr_err("%s: free missing\n", __func__);
+		return -EINVAL;
+	}
+
+	misc_res_ops[type] = ops;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
 /**
  * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
  * @type: Misc res type in misc cg to cancel the charge from.
@@ -439,6 +477,36 @@ static struct cftype misc_cg_files[] = {
 	{}
 };
 
+static inline void _misc_cg_res_free(struct misc_cg *cg, enum misc_res_type last)
+{
+	enum misc_res_type i;
+
+	for (i = 0; i <= last; i++)
+		if (misc_res_ops[i] && READ_ONCE(misc_res_capacity[i]))
+			misc_res_ops[i]->free(cg);
+}
+
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+	enum misc_res_type i;
+	int ret;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+		WRITE_ONCE(cg->res[i].max, MAX_NUM);
+		atomic64_set(&cg->res[i].usage, 0);
+
+		if (misc_res_ops[i] && READ_ONCE(misc_res_capacity[i])) {
+			ret = misc_res_ops[i]->alloc(cg);
+			if (ret) {
+				_misc_cg_res_free(cg, i);
+				return ret;
+			}
+		}
+	}
+
+	return 0;
+}
+
 /**
  * misc_cg_alloc() - Allocate misc cgroup.
  * @parent_css: Parent cgroup.
@@ -451,20 +519,25 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
-	enum misc_res_type i;
-	struct misc_cg *cg;
+	struct misc_cg *parent_cg, *cg;
+	int ret;
 
 	if (!parent_css) {
+		parent_cg = &root_cg;
 		cg = &root_cg;
 	} else {
 		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
 		if (!cg)
 			return ERR_PTR(-ENOMEM);
+		parent_cg = css_misc(parent_css);
 	}
 
-	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
-		WRITE_ONCE(cg->res[i].max, MAX_NUM);
-		atomic64_set(&cg->res[i].usage, 0);
+	ret = _misc_cg_res_alloc(cg);
+	if (ret) {
+		if (likely(parent_css))
+			kfree(cg);
+
+		return ERR_PTR(ret);
 	}
 
 	return &cg->css;
@@ -478,7 +551,10 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
  */
 static void misc_cg_free(struct cgroup_subsys_state *css)
 {
-	kfree(css_misc(css));
+	struct misc_cg *cg = css_misc(css);
+
+	_misc_cg_res_free(cg, MISC_CG_RES_TYPES - 1);
+	kfree(cg);
 }
 
 /* Cgroup controller callbacks */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events
  2024-08-21  1:53 ` [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
@ 2024-08-27 10:45   ` Huang, Kai
  2024-08-29  9:25   ` Huang, Kai
  1 sibling, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 10:45 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com

On Tue, 2024-08-20 at 18:53 -0700, Haitao Huang wrote:
> +/**
> + * misc_cg_set_ops() - register resource specific operations.
> + * @type: Type of the misc res.
> + * @ops: Operations for the given type.
> + *
> + * The callbacks in @ops will not be invoked if the capacity of @type is 0.
> + *
> + * Context: Any context.
> + * Return:
> + * * %0 - Successfully registered the operations.
> + * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks.
> + */
> +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
> +{
> +	if (!valid_type(type))
> +		return -EINVAL;
> +
> +	if (!ops)
> +		return -EINVAL;
> +
> +	if (!ops->alloc) {
> +		pr_err("%s: alloc missing\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	if (!ops->free) {
> +		pr_err("%s: free missing\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	misc_res_ops[type] = ops;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(misc_cg_set_ops);

As mentioned in another reply, the export isn't mandatory for SGX cgroup
support due to SGX driver cannot be a module.

But I understand the intention is to treat this in a similar way to
misc_cg_set_capacity() etc which are currently exported.

I have no hard opinion whether to export this one.  But if we export this, I
don't quite like the fact that it doesn't accept the NULL ops to allow the
caller to unwind.

But I'll leave to cgroup maintainers.

> +
>  /**
>   * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
>   * @type: Misc res type in misc cg to cancel the charge from.
> @@ -439,6 +477,36 @@ static struct cftype misc_cg_files[] = {
>  	{}
>  };
>  
> +static inline void _misc_cg_res_free(struct misc_cg *cg, enum misc_res_type last)
> +{
> +	enum misc_res_type i;
> +
> +	for (i = 0; i <= last; i++)
> +		if (misc_res_ops[i] && READ_ONCE(misc_res_capacity[i]))
> +			misc_res_ops[i]->free(cg);
> +}
> +
> +static inline int _misc_cg_res_alloc(struct misc_cg *cg)
> +{
> +	enum misc_res_type i;
> +	int ret;
> +
> +	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
> +		WRITE_ONCE(cg->res[i].max, MAX_NUM);
> +		atomic64_set(&cg->res[i].usage, 0);
> +
> +		if (misc_res_ops[i] && READ_ONCE(misc_res_capacity[i])) {
> +			ret = misc_res_ops[i]->alloc(cg);
> +			if (ret) {
> +				_misc_cg_res_free(cg, i);
> +				return ret;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +

I don't fully understand why kernel uses READ_ONCE()/WRITE_ONCE() to access
misc_res_capacity[i] and others like 'cg->res[i].max', but it looks weird they
are not used when accessing misc_res_ops[i]?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events
  2024-08-21  1:53 ` [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
  2024-08-27 10:45   ` Huang, Kai
@ 2024-08-29  9:25   ` Huang, Kai
  1 sibling, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-29  9:25 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com

On Tue, 2024-08-20 at 18:53 -0700, Haitao Huang wrote:
>  /**
>   * misc_cg_alloc() - Allocate misc cgroup.
>   * @parent_css: Parent cgroup.
> @@ -451,20 +519,25 @@ static struct cftype misc_cg_files[] = {
>  static struct cgroup_subsys_state *
>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>  {
> -	enum misc_res_type i;
> -	struct misc_cg *cg;
> +	struct misc_cg *parent_cg, *cg;
> +	int ret;
>  
>  	if (!parent_css) {
> +		parent_cg = &root_cg;
>  		cg = &root_cg;
>  	} else {
>  		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
>  		if (!cg)
>  			return ERR_PTR(-ENOMEM);
> +		parent_cg = css_misc(parent_css);
>  	}
>  
> -	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
> -		WRITE_ONCE(cg->res[i].max, MAX_NUM);
> -		atomic64_set(&cg->res[i].usage, 0);
> +	ret = _misc_cg_res_alloc(cg);
> +	if (ret) {
> +		if (likely(parent_css))
> +			kfree(cg);
> +
> +		return ERR_PTR(ret);
>  	}
>  
>  	return &cg->css;

What's the purpose of @parent_cg?

# make kernel/cgroup/ W=1
...
kernel/cgroup/misc.c: In function ‘misc_cg_alloc’:
kernel/cgroup/misc.c:522:25: warning: variable ‘parent_cg’ set but not used [-
Wunused-but-set-variable]
  522 |         struct misc_cg *parent_cg, *cg;
      |                         ^~~~~~~~~


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 01/16] x86/sgx: Replace boolean parameters with enums Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-27 10:25   ` Huang, Kai
  2024-08-21  1:53 ` [PATCH v16 04/16] cgroup/misc: Add SGX EPC resource type Haitao Huang
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Kristen Carlson Accardi <kristen@linux.intel.com>

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its or ancestor's limit. This requires a walk from the current cgroup up
to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to
enable this walk.

The SGX driver also needs start a global level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
 include/linux/misc_cgroup.h | 24 ++++++++++++++++++++++++
 kernel/cgroup/misc.c        | 21 ++++++++-------------
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e5159770a68e..75c711333ad4 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -70,6 +70,7 @@ struct misc_cg {
 	struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -90,6 +91,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -114,6 +129,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+	return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+	return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index f3ce896d2c21..6cf1f0899f4e 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -177,7 +172,7 @@ static void misc_cg_event(enum misc_res_type type, struct misc_cg *cg)
 	atomic64_inc(&cg->res[type].events_local);
 	cgroup_file_notify(&cg->events_local_file);
 
-	for (; parent_misc(cg); cg = parent_misc(cg)) {
+	for (; misc_cg_parent(cg); cg = misc_cg_parent(cg)) {
 		atomic64_inc(&cg->res[type].events);
 		cgroup_file_notify(&cg->events_file);
 	}
@@ -212,7 +207,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!amount)
 		return 0;
 
-	for (i = cg; i; i = parent_misc(i)) {
+	for (i = cg; i; i = misc_cg_parent(i)) {
 		res = &i->res[type];
 
 		new_usage = atomic64_add_return(amount, &res->usage);
@@ -228,7 +223,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 err_charge:
 	misc_cg_event(type, i);
 
-	for (j = cg; j != i; j = parent_misc(j))
+	for (j = cg; j != i; j = misc_cg_parent(j))
 		misc_cg_cancel_charge(type, j, amount);
 	misc_cg_cancel_charge(type, i, amount);
 	return ret;
@@ -250,7 +245,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!(amount && valid_type(type) && cg))
 		return;
 
-	for (i = cg; i; i = parent_misc(i))
+	for (i = cg; i; i = misc_cg_parent(i))
 		misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver
  2024-08-21  1:53 ` [PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver Haitao Huang
@ 2024-08-27 10:25   ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 10:25 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com

On Tue, 2024-08-20 at 18:53 -0700, Haitao Huang wrote:
> -static struct misc_cg *parent_misc(struct misc_cg *cgroup)
> +struct misc_cg *misc_cg_root(void)
>  {
> -	return cgroup ? css_misc(cgroup->css.parent) : NULL;
> +	return &root_cg;
>  }
> +EXPORT_SYMBOL_GPL(misc_cg_root);
>  

This doesn't need to be exported.  SGX driver cannot be a module.

I tested building the this series with this export removed and the kernel
could be built successfully.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 04/16] cgroup/misc: Add SGX EPC resource type
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (2 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality Haitao Huang
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V12:
- Remove CONFIG_CGROUP_SGX_EPC (Jarkko)

V6:
- Split the original patch into this and the preceding one (Kai)
---
 include/linux/misc_cgroup.h | 4 ++++
 kernel/cgroup/misc.c        | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 75c711333ad4..b4119869b0d1 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
 	MISC_CG_RES_SEV,
 	/** @MISC_CG_RES_SEV_ES: AMD SEV-ES ASIDs resource */
 	MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_X86_SGX
+	/** @MISC_CG_RES_SGX_EPC: SGX EPC memory resource */
+	MISC_CG_RES_SGX_EPC,
 #endif
 	/** @MISC_CG_RES_TYPES: count of enum misc_res_type constants */
 	MISC_CG_RES_TYPES
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 6cf1f0899f4e..300af1ee20fc 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
 	/* AMD SEV-ES ASIDs resource */
 	"sev_es",
 #endif
+#ifdef CONFIG_X86_SGX
+	/* Intel SGX EPC memory bytes */
+	"sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (3 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 04/16] cgroup/misc: Add SGX EPC resource type Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-27 10:21   ` Huang, Kai
  2024-08-27 23:11   ` Huang, Kai
  2024-08-21  1:53 ` [PATCH v16 06/16] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list Haitao Huang
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Kristen Carlson Accardi <kristen@linux.intel.com>

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V16:
- Proper handling for failures during init (Kai)
- Register ops and capacity at the end when SGX is ready to handle
  callbacks.

V15:
- Declare __init for sgx_cgroup_init() (Jarkko)
- Disable SGX when sgx_cgroup_init() fails (Jarkko)

V13:
- Remove unneeded includes. (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation
conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Update copyright and format better (Kai)
- Create wrappers to remove #ifdefs in c file. (Kai)
- Remove unneeded comments (Kai)

V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)

V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/kernel/cpu/sgx/Makefile     |  1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 93 ++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 78 +++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/main.c       | 44 +++++++++++--
 arch/x86/kernel/cpu/sgx/sgx.h        | 24 +++++++
 include/linux/misc_cgroup.h          |  2 +
 6 files changed, 238 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..081cb424575e 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
+obj-$(CONFIG_CGROUP_MISC)	+= epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..0e422fef02bb
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include<linux/slab.h>
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg:	The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg:	The charged sgx cgroup.
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_cgroup *sgx_cg;
+
+	sgx_cg = sgx_cgroup_from_misc_cg(cg);
+	if (!sgx_cg)
+		return;
+
+	kfree(sgx_cg);
+}
+
+static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
+{
+	cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
+	sgx_cg->cg = cg;
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_cgroup *sgx_cg;
+
+	sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
+	if (!sgx_cg)
+		return -ENOMEM;
+
+	sgx_cgroup_misc_init(cg, sgx_cg);
+
+	return 0;
+}
+
+const struct misc_res_ops sgx_cgroup_ops = {
+	.alloc = sgx_cgroup_alloc,
+	.free = sgx_cgroup_free,
+};
+
+int __init sgx_cgroup_init(void)
+{
+	sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
+
+	return 0;
+}
+
+/**
+ * Register capacity and ops for SGX cgroup.
+ * Only called at the end of sgx_init() when SGX is ready to handle the ops
+ * callbacks.
+ */
+void __init sgx_cgroup_register(void)
+{
+	unsigned int nid = first_node(sgx_numa_mask);
+	unsigned int first = nid;
+	u64 capacity = 0;
+
+	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
+
+	/* sgx_numa_mask is not empty when this is called */
+	do {
+		capacity += sgx_numa_nodes[nid].size;
+		nid = next_node_in(nid, sgx_numa_mask);
+	} while (nid != first);
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..e74b1ea0b642
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/misc_cgroup.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_MISC
+
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_cgroup;
+
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+	return NULL;
+}
+
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
+
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+	return 0;
+}
+
+static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
+
+static inline int __init sgx_cgroup_init(void)
+{
+	return 0;
+}
+
+static inline void __init sgx_cgroup_register(void) { }
+
+#else /* CONFIG_CGROUP_MISC */
+
+struct sgx_cgroup {
+	struct misc_cg *cg;
+};
+
+static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * sgx_put_cg() to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
+	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}
+
+/**
+ * sgx_put_cg() - Put the EPC cgroup and reduce its ref count.
+ * @sgx_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
+{
+	put_misc_cg(sgx_cg->cg);
+}
+
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+int __init sgx_cgroup_init(void);
+void __init sgx_cgroup_register(void);
+
+#endif /* CONFIG_CGROUP_MISC */
+
+#endif /* _SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e64073fb4256..0fda964c0a7c 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -18,6 +18,7 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+#include "epc_cgroup.h"
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
@@ -35,14 +36,14 @@ static DEFINE_SPINLOCK(sgx_reclaimer_lock);
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
-static nodemask_t sgx_numa_mask;
+nodemask_t sgx_numa_mask;
 
 /*
  * Array with one list_head for each possible NUMA node.  Each
  * list contains all the sgx_epc_section's which are on that
  * node.
  */
-static struct sgx_numa_node *sgx_numa_nodes;
+struct sgx_numa_node *sgx_numa_nodes;
 
 static LIST_HEAD(sgx_dirty_page_list);
 
@@ -559,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
  */
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 {
+	struct sgx_cgroup *sgx_cg;
 	struct sgx_epc_page *page;
+	int ret;
+
+	sgx_cg = sgx_get_current_cg();
+	ret = sgx_cgroup_try_charge(sgx_cg);
+	if (ret) {
+		sgx_put_cg(sgx_cg);
+		return ERR_PTR(ret);
+	}
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -568,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list))
-			return ERR_PTR(-ENOMEM);
+		if (list_empty(&sgx_active_page_list)) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (reclaim == SGX_NO_RECLAIM) {
 			page = ERR_PTR(-EBUSY);
@@ -585,6 +597,15 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 		cond_resched();
 	}
 
+	if (!IS_ERR(page)) {
+		WARN_ON_ONCE(sgx_epc_page_get_cgroup(page));
+		/* sgx_put_cg() in sgx_free_epc_page() */
+		sgx_epc_page_set_cgroup(page, sgx_cg);
+	} else {
+		sgx_cgroup_uncharge(sgx_cg);
+		sgx_put_cg(sgx_cg);
+	}
+
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
@@ -603,8 +624,16 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 void sgx_free_epc_page(struct sgx_epc_page *page)
 {
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+	struct sgx_cgroup *sgx_cg = sgx_epc_page_get_cgroup(page);
 	struct sgx_numa_node *node = section->node;
 
+	/* sgx_cg could be NULL if called from __sgx_sanitize_pages() */
+	if (sgx_cg) {
+		sgx_cgroup_uncharge(sgx_cg);
+		sgx_put_cg(sgx_cg);
+		sgx_epc_page_set_cgroup(page, NULL);
+	}
+
 	spin_lock(&node->lock);
 
 	page->owner = NULL;
@@ -644,6 +673,8 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].owner = NULL;
 		section->pages[i].poison = 0;
+		sgx_epc_page_set_cgroup(&section->pages[i], NULL);
+
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -930,6 +961,9 @@ static int __init sgx_init(void)
 	if (ret)
 		goto err_kthread;
 
+	ret = sgx_cgroup_init();
+	if (ret)
+		goto err_provision;
 	/*
 	 * Always try to initialize the native *and* KVM drivers.
 	 * The KVM driver is less picky than the native one and
@@ -943,6 +977,8 @@ static int __init sgx_init(void)
 	if (sgx_vepc_init() && ret)
 		goto err_provision;
 
+	sgx_cgroup_register();
+
 	return 0;
 
 err_provision:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ca34cd4f58ac..c5208da7c8eb 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -39,14 +39,35 @@ enum sgx_reclaim {
 	SGX_DO_RECLAIM
 };
 
+struct sgx_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
 	struct sgx_encl_page *owner;
 	struct list_head list;
+#ifdef CONFIG_CGROUP_MISC
+	struct sgx_cgroup *sgx_cg;
+#endif
 };
 
+static inline void sgx_epc_page_set_cgroup(struct sgx_epc_page *page, struct sgx_cgroup *cg)
+{
+#ifdef CONFIG_CGROUP_MISC
+	page->sgx_cg = cg;
+#endif
+}
+
+static inline struct sgx_cgroup *sgx_epc_page_get_cgroup(struct sgx_epc_page *page)
+{
+#ifdef CONFIG_CGROUP_MISC
+	return page->sgx_cg;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
  * the free page list local to the node is stored here.
@@ -58,6 +79,9 @@ struct sgx_numa_node {
 	spinlock_t lock;
 };
 
+extern nodemask_t sgx_numa_mask;
+extern struct sgx_numa_node *sgx_numa_nodes;
+
 /*
  * The firmware can define multiple chunks of EPC to the different areas of the
  * physical memory e.g. for memory areas of the each node. This structure is
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index b4119869b0d1..df88e1ff9877 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -48,6 +48,7 @@ struct misc_res_ops {
  * @watermark: Historical maximum usage of the resource.
  * @usage: Current usage of the resource.
  * @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
  */
 struct misc_res {
 	u64 max;
@@ -55,6 +56,7 @@ struct misc_res {
 	atomic64_t usage;
 	atomic64_t events;
 	atomic64_t events_local;
+	void *priv;
 };
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality
  2024-08-21  1:53 ` [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality Haitao Huang
@ 2024-08-27 10:21   ` Huang, Kai
  2024-08-27 23:11   ` Huang, Kai
  1 sibling, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 10:21 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com

On Tue, 2024-08-20 at 18:53 -0700, Haitao Huang wrote:
> +/**
> + * Register capacity and ops for SGX cgroup.
> + * Only called at the end of sgx_init() when SGX is ready to handle the ops
> + * callbacks.
> + */

Got this warning when building with W=1:

arch/x86/kernel/cpu/sgx/epc_cgroup.c:420: warning: This comment starts with
'/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-
doc.rst
 * Register capacity and ops for SGX cgroup.

It should be fixed.

> +void __init sgx_cgroup_register(void)
> +{
> +	unsigned int nid = first_node(sgx_numa_mask);
> +	unsigned int first = nid;
> +	u64 capacity = 0;
> +
> +	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
> +
> +	/* sgx_numa_mask is not empty when this is called */
> +	do {
> +		capacity += sgx_numa_nodes[nid].size;
> +		nid = next_node_in(nid, sgx_numa_mask);
> +	} while (nid != first);
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +}

Nit (leave to you):

Is sgx_cgroup_enable() better?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality
  2024-08-21  1:53 ` [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality Haitao Huang
  2024-08-27 10:21   ` Huang, Kai
@ 2024-08-27 23:11   ` Huang, Kai
  2024-08-28  0:01     ` Huang, Kai
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:11 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

> +static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
> +{
> +	cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
> +	sgx_cg->cg = cg;
> +}
> +

[...]

> +int __init sgx_cgroup_init(void)
> +{
> +	sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
> +
> +	return 0;
> +} > +
> +/**
> + * Register capacity and ops for SGX cgroup.
> + * Only called at the end of sgx_init() when SGX is ready to handle the ops
> + * callbacks.
> + */
> +void __init sgx_cgroup_register(void)
> +{
> +	unsigned int nid = first_node(sgx_numa_mask);
> +	unsigned int first = nid;
> +	u64 capacity = 0;
> +
> +	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
> +
> +	/* sgx_numa_mask is not empty when this is called */
> +	do {
> +		capacity += sgx_numa_nodes[nid].size;
> +		nid = next_node_in(nid, sgx_numa_mask);
> +	} while (nid != first);
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +}

[...]

>   
> @@ -930,6 +961,9 @@ static int __init sgx_init(void)
>   	if (ret)
>   		goto err_kthread;
>   
> +	ret = sgx_cgroup_init();
> +	if (ret)
> +		goto err_provision;
>   	/*
>   	 * Always try to initialize the native *and* KVM drivers.
>   	 * The KVM driver is less picky than the native one and
> @@ -943,6 +977,8 @@ static int __init sgx_init(void)
>   	if (sgx_vepc_init() && ret)
>   		goto err_provision;

In sgx_cgroup_init():

     sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);

.. also cannot fail.

I think it should be moved to the sgx_cgroup_register().  Otherwise, if 
any step after sgx_cgroup_init() fails, there's no unwind for the above 
operation.

The consequence is the misc_cg_root()->res[EPC].priv will remain 
pointing to the SGX root cgroup.

It shouldn't cause any real issue for now, but it's weird to have that 
set, and can potentially cause problem in the future.

>   
> +	sgx_cgroup_register();
> +
>   	return 0;
>   
>   err_provision:

So, I think we should do:

1) Rename sgx_cgroup_register() -> sgx_cgroup_init(), and move the

	sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);

to it.  All operations in the (new) sgx_cgroup_init() won't fail.

2) Remove (existing) sgx_cgroup_init() form this patch, but introduce it 
in the patch "x86/sgx: Implement async reclamation for cgroup" and 
rename it to sgx_cgroup_prepare() or something.  It just allocates 
workqueue inside.  And sgx_cgroup_deinit() -> sgx_cgroup_cleanup().

Makes sense?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality
  2024-08-27 23:11   ` Huang, Kai
@ 2024-08-28  0:01     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-28  0:01 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan



On 28/08/2024 11:11 am, Huang, Kai wrote:
>> +static void sgx_cgroup_misc_init(struct misc_cg *cg, struct 
>> sgx_cgroup *sgx_cg)
>> +{
>> +    cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
>> +    sgx_cg->cg = cg;
>> +}
>> +
> 
> [...]
> 
>> +int __init sgx_cgroup_init(void)
>> +{
>> +    sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
>> +
>> +    return 0;
>> +} > +
>> +/**
>> + * Register capacity and ops for SGX cgroup.
>> + * Only called at the end of sgx_init() when SGX is ready to handle 
>> the ops
>> + * callbacks.
>> + */
>> +void __init sgx_cgroup_register(void)
>> +{
>> +    unsigned int nid = first_node(sgx_numa_mask);
>> +    unsigned int first = nid;
>> +    u64 capacity = 0;
>> +
>> +    misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
>> +
>> +    /* sgx_numa_mask is not empty when this is called */
>> +    do {
>> +        capacity += sgx_numa_nodes[nid].size;
>> +        nid = next_node_in(nid, sgx_numa_mask);
>> +    } while (nid != first);
>> +    misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
>> +}
> 
> [...]
> 
>> @@ -930,6 +961,9 @@ static int __init sgx_init(void)
>>       if (ret)
>>           goto err_kthread;
>> +    ret = sgx_cgroup_init();
>> +    if (ret)
>> +        goto err_provision;
>>       /*
>>        * Always try to initialize the native *and* KVM drivers.
>>        * The KVM driver is less picky than the native one and
>> @@ -943,6 +977,8 @@ static int __init sgx_init(void)
>>       if (sgx_vepc_init() && ret)
>>           goto err_provision;
> 
> In sgx_cgroup_init():
> 
>      sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
> 
> .. also cannot fail.
> 
> I think it should be moved to the sgx_cgroup_register().  Otherwise, if 
> any step after sgx_cgroup_init() fails, there's no unwind for the above 
> operation.
> 
> The consequence is the misc_cg_root()->res[EPC].priv will remain 
> pointing to the SGX root cgroup.
> 
> It shouldn't cause any real issue for now, but it's weird to have that 
> set, and can potentially cause problem in the future.
> 
>> +    sgx_cgroup_register();
>> +
>>       return 0;
>>   err_provision:
> 
> So, I think we should do:
> 
> 1) Rename sgx_cgroup_register() -> sgx_cgroup_init(), and move the
> 
>      sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
> 
> to it.  All operations in the (new) sgx_cgroup_init() won't fail.
> 
> 2) Remove (existing) sgx_cgroup_init() form this patch, but introduce it 
> in the patch "x86/sgx: Implement async reclamation for cgroup" and 
> rename it to sgx_cgroup_prepare() or something.  It just allocates 
> workqueue inside.  And sgx_cgroup_deinit() -> sgx_cgroup_cleanup().
> 
> Makes sense?
> 
> 

With the above addressed, and the k-doc warning fixed:

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 06/16] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (4 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 07/16] x86/sgx: Abstract tracking reclaimable pages in LRU Haitao Huang
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.

Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V15:
- Add comment for spinlock. (Jarkko, Kai)

V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
 arch/x86/kernel/cpu/sgx/sgx.h  | 16 ++++++++++++++
 2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 0fda964c0a7c..45832c9a1e1b 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->owner;
 
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
 	ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
 	return true;
 }
 
@@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
 }
@@ -578,7 +579,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list)) {
+		if (list_empty(&sgx_global_lru.reclaimable)) {
 			page = ERR_PTR(-ENOMEM);
 			break;
 		}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index c5208da7c8eb..0c8d88eb65ff 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -117,6 +117,22 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 	return section->virt_addr + index * PAGE_SIZE;
 }
 
+/*
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
+ */
+struct sgx_epc_lru_list {
+	/* Use this lock to protect access from multiple reclamation worker threads */
+	spinlock_t lock;
+	struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->reclaimable);
+}
+
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
 void sgx_reclaim_direct(void);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v16 07/16] x86/sgx: Abstract tracking reclaimable pages in LRU
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (5 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 06/16] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-21  1:53 ` [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU Haitao Huang
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Kristen Carlson Accardi <kristen@linux.intel.com>

The SGX driver tracks reclaimable EPC pages by adding a newly allocated
page into the global LRU list in sgx_mark_page_reclaimable(), and doing
the opposite in sgx_unmark_page_reclaimable().

To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and each newly allocated EPC page will need to be
added to the LRU of associated cgroup, not always the global LRU list.

When sgx_mark_page_reclaimable() is called, the cgroup that the newly
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.

Add a helper, sgx_epc_page_lru(), to return the LRU that the EPC page
should be added to for the given EPC page.  Currently it just returns
the global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the
helper function to get the LRU from the EPC page instead of referring to
the global LRU directly.

This allows EPC page being able to be tracked in "per-cgroup" LRU when
that becomes ready.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V13:
- Revise commit log (Kai)
- Rename sgx_lru_list() to sgx_epc_page_lru()

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 45832c9a1e1b..5d5f6baac9c8 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU.
  * @page:	EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+	spin_lock(&lru->lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-	spin_unlock(&sgx_global_lru.lock);
+	list_add_tail(&page->list, &lru->reclaimable);
+	spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:	EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+	spin_lock(&lru->lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
+			spin_unlock(&lru->lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (6 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 07/16] x86/sgx: Abstract tracking reclaimable pages in LRU Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-27 11:13   ` Huang, Kai
  2024-08-27 18:15   ` Jarkko Sakkinen
  2024-08-21  1:53 ` [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup Haitao Huang
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

To support the per-cgroup reclamation, each cgroup will have its own
"per-cgroup LRU" and EPC pages will be in its owner cgroup's LRU instead
of the global LRU. Abstract the code that is directly working with the
global LRU into functions reusable with per-cgroup LRUs.

Currently the basic reclamation procedure, sgx_reclaim_pages() directly
reclaims pages from the global LRU. Change it to take in an LRU.

Note the global EPC reclamation will still be needed when the total EPC
usage reaches the system capacity while usages of some cgroups are below
their respective limits. Create a separate wrapper for the global
reclamation, sgx_reclaim_pages_global(), passing in the global LRU to
the new sgx_reclaim_pages() now. Later it will be revised to reclaim
from multiple LRUs from all EPC cgroups instead of a single global LRU.

Wrap the existing emptiness check of the global LRU with a helper so
that it can be changed later to work with multiple LRUs when per-cgroup
LRU comes to play.

Also the per-cgroup EPC reclaim and global EPC reclaim will have
different check on whether they should be done.  Rename the existing
sgx_should_reclaim() to sgx_should_reclaim_global() to separate the two
cases.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
v16:
- Regroup all abstraction related to global LRU usage to this patch from
  different patches in previous version. Position this before adding
per-cgroup reclaim. (Kai)

V13:
- Rename sgx_can_reclaim() to sgx_can_reclaim_global() and
sgx_should_reclaim() to sgx_should_reclaim_global(). (Kai)

V10:
- Add comments for the new function. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 63 ++++++++++++++++++++++------------
 1 file changed, 41 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 5d5f6baac9c8..47dfba6f45af 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,21 @@ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc
 	return &sgx_global_lru;
 }
 
+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim_global(void)
+{
+	/*
+	 * Now all EPC pages are still tracked in the @sgx_global_lru, so only
+	 * check @sgx_global_lru.
+	 *
+	 * When EPC pages are tracked in the actual per-cgroup LRUs,
+	 * replace with sgx_cgroup_lru_empty(misc_cg_root()).
+	 */
+	return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -287,10 +302,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 }
 
 /*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+ * Take a fixed number of pages from the head of a given LRU and reclaim them to
+ * the enclave's private shmem files. Skip the pages, which have been accessed
+ * since the last scan. Move those pages to the tail of the list so that the
+ * pages get scanned in LRU like fashion.
  *
  * Batch process a chunk of pages (at the moment 16) in order to degrade amount
  * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
@@ -299,7 +314,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void sgx_reclaim_pages(void)
+static void sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
 {
 	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -310,10 +325,9 @@ static void sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_global_lru.lock);
+	spin_lock(&lru->lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-						    struct sgx_epc_page, list);
+		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
 
@@ -328,7 +342,7 @@ static void sgx_reclaim_pages(void)
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -351,9 +365,9 @@ static void sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_global_lru.lock);
-		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
-		spin_unlock(&sgx_global_lru.lock);
+		spin_lock(&lru->lock);
+		list_add_tail(&epc_page->list, &lru->reclaimable);
+		spin_unlock(&lru->lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -381,10 +395,15 @@ static void sgx_reclaim_pages(void)
 	}
 }
 
-static bool sgx_should_reclaim(unsigned long watermark)
+static bool sgx_should_reclaim_global(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_global_lru.reclaimable);
+		sgx_can_reclaim_global();
+}
+
+static void sgx_reclaim_pages_global(void)
+{
+	sgx_reclaim_pages(&sgx_global_lru);
 }
 
 /*
@@ -394,8 +413,8 @@ static bool sgx_should_reclaim(unsigned long watermark)
  */
 void sgx_reclaim_direct(void)
 {
-	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages();
+	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
+		sgx_reclaim_pages_global();
 }
 
 static int ksgxd(void *p)
@@ -415,10 +434,10 @@ static int ksgxd(void *p)
 
 		wait_event_freezable(ksgxd_waitq,
 				     kthread_should_stop() ||
-				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
+				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));
 
-		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages();
+		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
+			sgx_reclaim_pages_global();
 
 		cond_resched();
 	}
@@ -585,7 +604,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_global_lru.reclaimable)) {
+		if (!sgx_can_reclaim_global()) {
 			page = ERR_PTR(-ENOMEM);
 			break;
 		}
@@ -600,7 +619,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages();
+		sgx_reclaim_pages_global();
 		cond_resched();
 	}
 
@@ -613,7 +632,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 		sgx_put_cg(sgx_cg);
 	}
 
-	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
 	return page;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU
  2024-08-21  1:53 ` [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU Haitao Huang
@ 2024-08-27 11:13   ` Huang, Kai
  2024-08-27 23:58     ` Huang, Kai
  2024-08-27 18:15   ` Jarkko Sakkinen
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 11:13 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com


> +static inline bool sgx_can_reclaim_global(void)
> +{
> +	/*
> +	 * Now all EPC pages are still tracked in the @sgx_global_lru, so only
> +	 * check @sgx_global_lru.
> +	 *
> +	 * When EPC pages are tracked in the actual per-cgroup LRUs,
> +	 * replace with sgx_cgroup_lru_empty(misc_cg_root()).
> +	 */
> +	return !list_empty(&sgx_global_lru.reclaimable);
> +}

Firstly, sgx_cgroup_lru_empty() is only introduced in the next patch, so it's
wrong to mention it in the comment in _this_ patch.

It's weird to add the above comment here in this patch anyway, since ...

[...]


> +static void sgx_reclaim_pages_global(void)
> +{
> +	sgx_reclaim_pages(&sgx_global_lru);
>  }

... this function (which is no difference IMHO) doesn't have a similar comment
here.

The similar comment to this function is only added in the later "[PATCH v16
12/16] x86/sgx: Revise global reclamation for EPC cgroups":

 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
 {
+	/*
+	 * Now all EPC pages are still tracked in the @sgx_global_lru.
+	 * Still reclaim from it.
+	 *
+	 * When EPC pages are tracked in the actual per-cgroup LRUs,
+	 * sgx_cgroup_reclaim_pages_global() will be called.
+	 */
 	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
 }

So if the comment of sgx_can_reclaim_global() were needed, it's more
reasonable to add it in the later patch where the comment for
sgx_reclaim_pages_global() is also added IMHO?



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU
  2024-08-27 11:13   ` Huang, Kai
@ 2024-08-27 23:58     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:58 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com



On 27/08/2024 11:13 pm, Huang, Kai wrote:
> 
>> +static inline bool sgx_can_reclaim_global(void)
>> +{
>> +	/*
>> +	 * Now all EPC pages are still tracked in the @sgx_global_lru, so only
>> +	 * check @sgx_global_lru.
>> +	 *
>> +	 * When EPC pages are tracked in the actual per-cgroup LRUs,
>> +	 * replace with sgx_cgroup_lru_empty(misc_cg_root()).
>> +	 */
>> +	return !list_empty(&sgx_global_lru.reclaimable);
>> +}
> 
> Firstly, sgx_cgroup_lru_empty() is only introduced in the next patch, so it's
> wrong to mention it in the comment in _this_ patch.
> 
> It's weird to add the above comment here in this patch anyway, since ...

With this comment moved to patch "x86/sgx: Revise global reclamation for 
EPC cgroups",

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU
  2024-08-21  1:53 ` [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU Haitao Huang
  2024-08-27 11:13   ` Huang, Kai
@ 2024-08-27 18:15   ` Jarkko Sakkinen
  1 sibling, 0 replies; 42+ messages in thread
From: Jarkko Sakkinen @ 2024-08-27 18:15 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

On Wed Aug 21, 2024 at 4:53 AM EEST, Haitao Huang wrote:
> To support the per-cgroup reclamation, each cgroup will have its own
> "per-cgroup LRU" and EPC pages will be in its owner cgroup's LRU instead
> of the global LRU. Abstract the code that is directly working with the
> global LRU into functions reusable with per-cgroup LRUs.
>
> Currently the basic reclamation procedure, sgx_reclaim_pages() directly
> reclaims pages from the global LRU. Change it to take in an LRU.
>
> Note the global EPC reclamation will still be needed when the total EPC
> usage reaches the system capacity while usages of some cgroups are below
> their respective limits. Create a separate wrapper for the global
> reclamation, sgx_reclaim_pages_global(), passing in the global LRU to
> the new sgx_reclaim_pages() now. Later it will be revised to reclaim
> from multiple LRUs from all EPC cgroups instead of a single global LRU.
>
> Wrap the existing emptiness check of the global LRU with a helper so
> that it can be changed later to work with multiple LRUs when per-cgroup
> LRU comes to play.
>
> Also the per-cgroup EPC reclaim and global EPC reclaim will have
> different check on whether they should be done.  Rename the existing
> sgx_should_reclaim() to sgx_should_reclaim_global() to separate the two
> cases.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (7 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-22  4:00   ` Haitao Huang
                     ` (2 more replies)
  2024-08-21  1:53 ` [PATCH v16 10/16] x86/sgx: Implement async reclamation " Haitao Huang
                   ` (8 subsequent siblings)
  17 siblings, 3 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching limit.  This is not ideal.  When that happens, a better way is
to reclaim EPC page(s) from the current EPC cgroup to reduce its usage
so the new allocation can succeed.

Currently, all EPC pages are tracked in a single global LRU, and the
"global EPC reclamation" supports the following 3 cases:

1) On-demand asynchronous reclamation: For allocation requests that can
   not wait for reclamation but can be retried, an asynchronous
   reclamation is triggered, in which the global reclaimer, ksgxd, keeps
   reclaiming EPC pages until the free page count is above a minimal
   threshold.

2) On-demand synchronous reclamation: For allocations that can wait for
   reclamation, the EPC page allocator, sgx_alloc_epc_page() reclaims
   EPC page(s) immediately until at least one free page is available for
   allocation.

3) Preemptive reclamation: For some allocation requests, e.g.,
   allocation for reloading a reclaimed page to change its permissions
   or page type, the kernel invokes sgx_reclaim_direct() to preemptively
   reclaim EPC page(s) as a best effort to minimize on-demand
   reclamation for subsequent allocations.

Similarly, a "per-cgroup reclamation" is needed to support the above 3
cases as well:

1) For on-demand asynchronous reclamation, a per-cgroup reclamation
   needs to be invoked to maintain a minimal difference between the
   usage and the limit for each cgroup, analogous to the minimal free
   page threshold  maintained by the global reclaimer.

2) For on-demand synchronous reclamation, sgx_cgroup_try_charge() needs
   to invoke the per-cgroup reclamation until the cgroup usage become
   at least one page lower than its limit.

3) For preemptive reclamation, sgx_reclaim_direct() needs to invoke the
   per-cgroup reclamation to minimize per-cgroup on-demand reclamation
   for subsequent allocations.

To support the per-cgroup reclamation, introduce a "per-cgroup LRU" to
track all EPC pages belong to the owner cgroup to utilize the existing
sgx_reclaim_pages().

Currently, the global reclamation treats all EPC pages equally as it
scans all EPC pages in FIFO order in the global LRU.  The "per-cgroup
reclamation" needs to somehow achieve the same fairness of all EPC pages
that are tracked in the multiple LRUs of the given cgroup and all the
descendants to reflect the nature of the cgroup.

The idea is to achieve such fairness by scanning "all EPC cgroups" of
the subtree (the given cgroup and all the descendants) equally in turns,
and in the scan to each cgroup, apply the existing sgx_reclaim_pages()
to its LRU. This basic flow is encapsulated in a new function,
sgx_cgroup_reclaim_pages().

Export sgx_reclaim_pages() for use in sgx_cgroup_reclaim_pages(). And
modify sgx_reclaim_pages() to return the number of pages scanned so
sgx_cgroup_reclaim_pages() can track scanning progress and determine
whether enough scanning is done or to continue the scanning for next
descendant.

Whenever reclaiming in a subtree of a given root is needed, start the
scanning from the next descendant where scanning was stopped at last
time.  To keep track of the next descendant cgroup to scan, add a new
field, next_cg, in the sgx_cgroup struct.  Create an iterator function,
sgx_cgroup_next_get(), atomically returns a valid reference of the
descendant for next round of scanning and advances @next_cg to next
valid descendant in a preorder walk. This iterator function is used in
sgx_cgroup_reclaim_pages() to iterate descendants for scanning.
Separately also advances @next_cg to next valid descendant when the
cgroup referenced by @next_cg is to be freed.

Add support for on-demand synchronous reclamation in
sgx_cgroup_try_charge(), applying sgx_cgroup_reclaim_pages() iteratively
until cgroup usage is lower than its limit.

Later patches will reuse sgx_cgroup_reclaim_pages() to add support for
asynchronous and preemptive reclamation.

Note all reclaimable EPC pages are still tracked in the global LRU thus
no per-cgroup reclamation is actually active at the moment: -ENOMEM is
returned by __sgx_cgroup_try_charge() when LRUs are empty. Per-cgroup
tracking and reclamation will be turned on in the end after all
necessary infrastructure is in place.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
V16:
- Revise commit log to define reclamation requirement and the design more clearly. (Kai)
- Revise sgx_cgroup_reclaim_pages() to scan cgroups in subtree more
  fairly, track next_cg in each sgx_cgroup and add helpers that is used
  to iterate descendant in sgx_cgroup_reclaim_pages().(Kai)

V14:
- Allow sgx_cgroup_reclaim_pages() to continue from previous tree-walk.
It takes in a 'start' node and returns the 'next' node for the caller to
use as the new 'start'. This is to ensure pages in lower level cgroups
can be reclaimed if all pages in upper level nodes are "too young".
(Kai)
- Move renaming sgx_should_reclaim() to sgx_should_reclaim_global() from
a later patch to this one. (Kai)

V11:
- Use commit message suggested by Kai
- Remove "usage" comments for functions. (Kai)

V10:
- Simplify the signature by removing a pointer to nr_to_scan (Kai)
- Return pages attempted instead of reclaimed as it is really what the
cgroup caller needs to track progress. This further simplifies the design.
- Merge patch for exposing sgx_reclaim_pages() with basic synchronous
reclamation. (Kai)
- Shorten names for EPC cgroup functions. (Jarkko)
- Fix/add comments to justify the design (Kai)
- Separate out a helper for for addressing single iteration of the loop
in sgx_cgroup_try_charge(). (Jarkko)

V9:
- Add comments for static variables. (Jarkko)

V8:
- Use width of 80 characters in text paragraphs. (Jarkko)
- Remove alignment for substructure variables. (Jarkko)

V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 208 ++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  18 ++-
 arch/x86/kernel/cpu/sgx/main.c       |  19 ++-
 arch/x86/kernel/cpu/sgx/sgx.h        |   1 +
 4 files changed, 238 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 0e422fef02bb..ce28efd20a15 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -7,17 +7,203 @@
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * Return the next descendant in a preorder walk, given a root, @root and a
+ * cgroup, @cg, to start the walk from. Return @root if no descendant left for
+ * this walk, Otherwise, return next descendant and has its refcnt incremented.
+ */
+static struct sgx_cgroup *sgx_cgroup_next_descendant_pre(struct sgx_cgroup *root,
+							 struct sgx_cgroup *cg)
+{
+	struct cgroup_subsys_state *next;
+
+	rcu_read_lock();
+	for (;;) {
+		next = css_next_descendant_pre(&cg->cg->css, &root->cg->css);
+		if (!next) {
+			next = &root->cg->css;
+			break;
+		}
+
+		if (css_tryget(next))
+			break;
+	}
+	rcu_read_unlock();
+
+	return sgx_cgroup_from_misc_cg(css_misc(next));
+}
+
+/*
+ * For a given root, @root, if a given cgroup, @cg, is the next cgroup to
+ * reclaim pages from, i.e., referenced by @root->next_cg, then advance
+ * @root->next_cg to the next valid cgroup in a preorder walk or the root if no
+ * more descendants left to walk.
+ *
+ * Called from sgx_cgroup_free() when @cg is to be freed so it can no longer be
+ * used as 'next_cg'.
+ */
+static inline void sgx_cgroup_next_skip(struct sgx_cgroup *root, struct sgx_cgroup *cg)
+{
+	struct sgx_cgroup *p;
+
+	spin_lock(&root->next_cg_lock);
+	p = root->next_cg;
+	spin_unlock(&root->next_cg_lock);
+
+	/* Already moved by other threads, no need to update */
+	if (cg != p)
+		return;
+
+	p = sgx_cgroup_next_descendant_pre(root, cg);
+
+	spin_lock(&root->next_cg_lock);
+	if (root->next_cg == cg)
+		root->next_cg = p;
+	spin_unlock(&root->next_cg_lock);
+
+	/* Decrement refcnt so cgroup pointed to by p can be released */
+	if (p != cg && p != root)
+		sgx_put_cg(p);
+}
+
+/*
+ * Return the cgroup currently referenced by @root->next_cg and advance
+ * @root->next_cg to the next descendant or @root.  The returned cgroup has its
+ * refcnt incremented if it is not @root and caller must release the refcnt.
+ */
+static inline struct sgx_cgroup *sgx_cgroup_next_get(struct sgx_cgroup *root)
+{
+	struct sgx_cgroup *p;
+
+	/*
+	 * Acquire a reference for the to-be-returned cgroup and advance
+	 * next_cg with the lock so the same cg not returned to two threads.
+	 */
+	spin_lock(&root->next_cg_lock);
+
+	p = root->next_cg;
+
+	/* Advance the to-be-returned to next descendant if current one is dying */
+	if (p != root && !css_tryget(&p->cg->css))
+		p = sgx_cgroup_next_descendant_pre(root, p);
+
+	/* Advance next_cg */
+	root->next_cg = sgx_cgroup_next_descendant_pre(root, p);
+
+	/* Decrement ref here so it can be released by cgroup subsystem */
+	if (root->next_cg != root)
+		sgx_put_cg(root->next_cg);
+
+	spin_unlock(&root->next_cg_lock);
+
+	/* p is root or refcnt incremented */
+	return p;
+}
+
 /**
- * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root:	Root of the tree to check
  *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ */
+static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+	struct cgroup_subsys_state *css_root;
+	struct cgroup_subsys_state *pos;
+	struct sgx_cgroup *sgx_cg;
+	bool ret = true;
+
+	/*
+	 * Caller must ensure css_root ref acquired
+	 */
+	css_root = &root->css;
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			continue;
+
+		rcu_read_unlock();
+
+		sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+
+		spin_lock(&sgx_cg->lru.lock);
+		ret = list_empty(&sgx_cg->lru.reclaimable);
+		spin_unlock(&sgx_cg->lru.lock);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!ret)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/*
+ * Scan at least @nr_to_scan pages and attempt to reclaim them from the subtree of @root.
+ */
+static inline void sgx_cgroup_reclaim_pages(struct sgx_cgroup *root,
+					    unsigned int nr_to_scan)
+{
+	struct sgx_cgroup *next_cg = NULL;
+	unsigned int cnt = 0;
+
+	while (!sgx_cgroup_lru_empty(root->cg) && cnt < nr_to_scan) {
+		next_cg = sgx_cgroup_next_get(root);
+		cnt += sgx_reclaim_pages(&next_cg->lru);
+		if (next_cg != root)
+			sgx_put_cg(next_cg);
+	}
+}
+
+static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
+{
+	if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
+		return 0;
+
+	/* No reclaimable pages left in the cgroup */
+	if (sgx_cgroup_lru_empty(epc_cg->cg))
+		return -ENOMEM;
+
+	if (signal_pending(current))
+		return -ERESTARTSYS;
+
+	return -EBUSY;
+}
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
  * @sgx_cg:	The EPC cgroup to be charged for the page.
+ * @reclaim:	Whether or not synchronous EPC reclaim is allowed.
  * Return:
  * * %0 - If successfully charged.
  * * -errno - for failures.
  */
-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
 {
-	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+	int ret;
+
+	for (;;) {
+		ret = __sgx_cgroup_try_charge(sgx_cg);
+
+		if (ret != -EBUSY)
+			goto out;
+
+		if (reclaim == SGX_NO_RECLAIM) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		sgx_cgroup_reclaim_pages(sgx_cg, 1);
+
+		cond_resched();
+	}
+
+out:
+	return ret;
 }
 
 /**
@@ -32,18 +218,34 @@ void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
 static void sgx_cgroup_free(struct misc_cg *cg)
 {
 	struct sgx_cgroup *sgx_cg;
+	struct misc_cg *p;
 
 	sgx_cg = sgx_cgroup_from_misc_cg(cg);
 	if (!sgx_cg)
 		return;
 
+	/*
+	 * Notify ancestors to not reclaim from this dying cgroup.
+	 * Not start from this cgroup itself because at this point no reference
+	 * of this cgroup being hold, i.e., all pages in this cgroup are freed
+	 * and LRU is empty, so no reclamation possible.
+	 */
+	p = misc_cg_parent(cg);
+	while (p) {
+		sgx_cgroup_next_skip(sgx_cgroup_from_misc_cg(p), sgx_cg);
+		p = misc_cg_parent(p);
+	}
+
 	kfree(sgx_cg);
 }
 
 static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
 {
+	sgx_lru_init(&sgx_cg->lru);
 	cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
 	sgx_cg->cg = cg;
+	sgx_cg->next_cg = sgx_cg;
+	spin_lock_init(&sgx_cg->next_cg_lock);
 }
 
 static int sgx_cgroup_alloc(struct misc_cg *cg)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index e74b1ea0b642..da7f5315bff8 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -20,7 +20,7 @@ static inline struct sgx_cgroup *sgx_get_current_cg(void)
 
 static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
 
-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
 {
 	return 0;
 }
@@ -38,6 +38,20 @@ static inline void __init sgx_cgroup_register(void) { }
 
 struct sgx_cgroup {
 	struct misc_cg *cg;
+	struct sgx_epc_lru_list lru;
+	/*
+	 * Pointer to the next cgroup to scan when the per-cgroup reclamation
+	 * is triggered next time. It does not hold a reference to prevent it
+	 * from being freed in order to allow the misc cgroup subsystem to
+	 * release and free the cgroup as needed, e.g., when admin wants to
+	 * delete the cgroup. When the cgroup pointed to is being freed,
+	 * sgx_cgroup_next_cg_skip(), will be invoked to update the pointer to
+	 * next accessible cgroup in a preorder walk of the subtree of the same
+	 * root.
+	 */
+	struct sgx_cgroup *next_cg;
+	/* Lock to protect concurrent access to @next_cg */
+	spinlock_t next_cg_lock;
 };
 
 static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
@@ -68,7 +82,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 	put_misc_cg(sgx_cg->cg);
 }
 
-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
 int __init sgx_cgroup_init(void);
 void __init sgx_cgroup_register(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 47dfba6f45af..303b06f39e4e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -301,7 +301,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
-/*
+/**
+ * sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU
+ * @lru:	The LRU from which pages are reclaimed.
+ *
  * Take a fixed number of pages from the head of a given LRU and reclaim them to
  * the enclave's private shmem files. Skip the pages, which have been accessed
  * since the last scan. Move those pages to the tail of the list so that the
@@ -313,8 +316,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * + EWB) but not sufficiently. Reclaiming one page at a time would also be
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
+ *
+ * Return:	Number of pages attempted for reclamation.
  */
-static void sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
 {
 	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -393,6 +398,8 @@ static void sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
 
 		sgx_free_epc_page(epc_page);
 	}
+
+	return cnt;
 }
 
 static bool sgx_should_reclaim_global(unsigned long watermark)
@@ -591,7 +598,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 	int ret;
 
 	sgx_cg = sgx_get_current_cg();
-	ret = sgx_cgroup_try_charge(sgx_cg);
+	ret = sgx_cgroup_try_charge(sgx_cg, reclaim);
 	if (ret) {
 		sgx_put_cg(sgx_cg);
 		return ERR_PTR(ret);
@@ -619,6 +626,12 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
+		/*
+		 * At this point, the usage within this cgroup is under its
+		 * limit but there is no physical page left for allocation.
+		 * Perform a global reclaim to get some pages released from any
+		 * cgroup with reclaimable pages.
+		 */
 		sgx_reclaim_pages_global();
 		cond_resched();
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0c8d88eb65ff..f456abfe5c52 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -139,6 +139,7 @@ void sgx_reclaim_direct(void);
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup
  2024-08-21  1:53 ` [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup Haitao Huang
@ 2024-08-22  4:00   ` Haitao Huang
  2024-08-27 18:15   ` Jarkko Sakkinen
  2024-08-27 23:42   ` Huang, Kai
  2 siblings, 0 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-22  4:00 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen, Haitao Huang
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

> +static struct sgx_cgroup *sgx_cgroup_next_descendant_pre(struct  
> sgx_cgroup *root,
> +							 struct sgx_cgroup *cg)
> +{
> +	struct cgroup_subsys_state *next;
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		next = css_next_descendant_pre(&cg->cg->css, &root->cg->css);

I  messed it up in a last minute change and rebase. Above should be:

+	struct cgroup_subsys_state *next = &cg->cg->css;
+
+	rcu_read_lock();
+	for (;;) {
+		next = css_next_descendant_pre(next, &root->cg->css);

Fixed in a branch here:  
https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v16_plus
Will include in next version or update if needed.

> +		if (!next) {
> +			next = &root->cg->css;
> +			break;
> +		}
> +
> +		if (css_tryget(next))
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return sgx_cgroup_from_misc_cg(css_misc(next));
> +}
> +
BR
Haitao

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup
  2024-08-21  1:53 ` [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup Haitao Huang
  2024-08-22  4:00   ` Haitao Huang
@ 2024-08-27 18:15   ` Jarkko Sakkinen
  2024-08-27 23:42   ` Huang, Kai
  2 siblings, 0 replies; 42+ messages in thread
From: Jarkko Sakkinen @ 2024-08-27 18:15 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

On Wed Aug 21, 2024 at 4:53 AM EEST, Haitao Huang wrote:
> Currently in the EPC page allocation, the kernel simply fails the
> allocation when the current EPC cgroup fails to charge due to its usage
> reaching limit.  This is not ideal.  When that happens, a better way is
> to reclaim EPC page(s) from the current EPC cgroup to reduce its usage
> so the new allocation can succeed.
>
> Currently, all EPC pages are tracked in a single global LRU, and the
> "global EPC reclamation" supports the following 3 cases:
>
> 1) On-demand asynchronous reclamation: For allocation requests that can
>    not wait for reclamation but can be retried, an asynchronous
>    reclamation is triggered, in which the global reclaimer, ksgxd, keeps
>    reclaiming EPC pages until the free page count is above a minimal
>    threshold.
>
> 2) On-demand synchronous reclamation: For allocations that can wait for
>    reclamation, the EPC page allocator, sgx_alloc_epc_page() reclaims
>    EPC page(s) immediately until at least one free page is available for
>    allocation.
>
> 3) Preemptive reclamation: For some allocation requests, e.g.,
>    allocation for reloading a reclaimed page to change its permissions
>    or page type, the kernel invokes sgx_reclaim_direct() to preemptively
>    reclaim EPC page(s) as a best effort to minimize on-demand
>    reclamation for subsequent allocations.
>
> Similarly, a "per-cgroup reclamation" is needed to support the above 3
> cases as well:
>
> 1) For on-demand asynchronous reclamation, a per-cgroup reclamation
>    needs to be invoked to maintain a minimal difference between the
>    usage and the limit for each cgroup, analogous to the minimal free
>    page threshold  maintained by the global reclaimer.
>
> 2) For on-demand synchronous reclamation, sgx_cgroup_try_charge() needs
>    to invoke the per-cgroup reclamation until the cgroup usage become
>    at least one page lower than its limit.
>
> 3) For preemptive reclamation, sgx_reclaim_direct() needs to invoke the
>    per-cgroup reclamation to minimize per-cgroup on-demand reclamation
>    for subsequent allocations.
>
> To support the per-cgroup reclamation, introduce a "per-cgroup LRU" to
> track all EPC pages belong to the owner cgroup to utilize the existing
> sgx_reclaim_pages().
>
> Currently, the global reclamation treats all EPC pages equally as it
> scans all EPC pages in FIFO order in the global LRU.  The "per-cgroup
> reclamation" needs to somehow achieve the same fairness of all EPC pages
> that are tracked in the multiple LRUs of the given cgroup and all the
> descendants to reflect the nature of the cgroup.
>
> The idea is to achieve such fairness by scanning "all EPC cgroups" of
> the subtree (the given cgroup and all the descendants) equally in turns,
> and in the scan to each cgroup, apply the existing sgx_reclaim_pages()
> to its LRU. This basic flow is encapsulated in a new function,
> sgx_cgroup_reclaim_pages().
>
> Export sgx_reclaim_pages() for use in sgx_cgroup_reclaim_pages(). And
> modify sgx_reclaim_pages() to return the number of pages scanned so
> sgx_cgroup_reclaim_pages() can track scanning progress and determine
> whether enough scanning is done or to continue the scanning for next
> descendant.
>
> Whenever reclaiming in a subtree of a given root is needed, start the
> scanning from the next descendant where scanning was stopped at last
> time.  To keep track of the next descendant cgroup to scan, add a new
> field, next_cg, in the sgx_cgroup struct.  Create an iterator function,
> sgx_cgroup_next_get(), atomically returns a valid reference of the
> descendant for next round of scanning and advances @next_cg to next
> valid descendant in a preorder walk. This iterator function is used in
> sgx_cgroup_reclaim_pages() to iterate descendants for scanning.
> Separately also advances @next_cg to next valid descendant when the
> cgroup referenced by @next_cg is to be freed.
>
> Add support for on-demand synchronous reclamation in
> sgx_cgroup_try_charge(), applying sgx_cgroup_reclaim_pages() iteratively
> until cgroup usage is lower than its limit.
>
> Later patches will reuse sgx_cgroup_reclaim_pages() to add support for
> asynchronous and preemptive reclamation.
>
> Note all reclaimable EPC pages are still tracked in the global LRU thus
> no per-cgroup reclamation is actually active at the moment: -ENOMEM is
> returned by __sgx_cgroup_try_charge() when LRUs are empty. Per-cgroup
> tracking and reclamation will be turned on in the end after all
> necessary infrastructure is in place.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup
  2024-08-21  1:53 ` [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup Haitao Huang
  2024-08-22  4:00   ` Haitao Huang
  2024-08-27 18:15   ` Jarkko Sakkinen
@ 2024-08-27 23:42   ` Huang, Kai
  2 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:42 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan


> 2) For on-demand synchronous reclamation, sgx_cgroup_try_charge() needs
>     to invoke the per-cgroup reclamation until the cgroup usage become

"become" -> "becomes"

>   
> +/*
> + * Return the next descendant in a preorder walk, given a root, @root and a
> + * cgroup, @cg, to start the walk from. Return @root if no descendant left for
> + * this walk, Otherwise, return next descendant and has its refcnt incremented.

"," -> "." before the "Otherwise".

"has its refcnt ..." -> "have its refcnt ...",

Or, "return ... with its refcnt ..." (which is better I guess).

> + */
> +static struct sgx_cgroup *sgx_cgroup_next_descendant_pre(struct sgx_cgroup *root,
> +							 struct sgx_cgroup *cg)
> +{

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 10/16] x86/sgx: Implement async reclamation for cgroup
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (8 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-27 10:22   ` Huang, Kai
  2024-08-27 18:15   ` Jarkko Sakkinen
  2024-08-21  1:53 ` [PATCH v16 11/16] x86/sgx: Charge mem_cgroup for per-cgroup reclamation Haitao Huang
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Kristen Carlson Accardi <kristen@linux.intel.com>

In cases EPC pages need be allocated during a page fault and the cgroup
usage is near its limit, an asynchronous reclamation needs to be
triggered to avoid blocking the page fault handling.

To keep implementation simple, use a workqueue instead of kthread to
schedule the asynchronous reclamation work. Add corresponding work item and
function definitions for EPC cgroup.

In sgx_cgroup_try_charge(), if caller does not allow synchronous
reclamation, queue an asynchronous work into the workqueue.

The current global reclaimer, ksgxd, maintains a threshold for the
minimal free EPC pages to avoid thrashing when allocating EPC pages.
Similarly, only reclaiming EPC pages from the current cgroup when its
usage actually reaches limit could also cause thrashing.  To avoid that,
define a similar "per-cgroup usage threshold" and actively trigger
asynchronous per-cgroup EPC reclamation when the usage reaches the
threshold after try_charge() is successful.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V16:
- Destroy workqueue in sgx_cgroup_deinit()
- Reuse the new sgx_cgroup_reclaim_pages() to scan at least
  SGX_NR_TO_SCAN pages for each round async reclaim.
- Revise commit message. (Kai)

V15:
- Disable SGX when sgx_cgroup_init() fails instead of BUG_ON() (Jarkko)
- Reset capacity to zero when sgx_cgroup_init() fails. (Kai)
V13:
- Revert to BUG_ON() in case of workq allocation failure in init and
only alloc if misc is enabled.

V11:
- Print error instead of WARN (Kai)
- Add check for need to queue an async reclamation before returning from
try_charge(), do so if needed. This is to be consistent with global
reclaimer to minimize thrashing during allocation time.

V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fail during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 115 ++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   4 +
 arch/x86/kernel/cpu/sgx/main.c       |   5 +-
 3 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index ce28efd20a15..c40f601d06d9 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -4,9 +4,37 @@
 #include<linux/slab.h>
 #include "epc_cgroup.h"
 
+/*
+ * The minimal free pages, or the minimal margin between limit and usage
+ * maintained by per-cgroup reclaimer.
+ *
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE	(SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal
+ * free pages would barely leave any page for use, causing excessive reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75%
+ * pages used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT	(SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
 /*
  * Return the next descendant in a preorder walk, given a root, @root and a
  * cgroup, @cg, to start the walk from. Return @root if no descendant left for
@@ -100,6 +128,34 @@ static inline struct sgx_cgroup *sgx_cgroup_next_get(struct sgx_cgroup *root)
 	return p;
 }
 
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+	return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+	return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+	struct misc_cg *i = sgx_cg->cg;
+	u64 m = U64_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+
+	return m / PAGE_SIZE;
+}
+
 /**
  * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
  * @root:	Root of the tree to check
@@ -159,6 +215,43 @@ static inline void sgx_cgroup_reclaim_pages(struct sgx_cgroup *root,
 	}
 }
 
+/* Check whether EPC reclaim should be performed for a given EPC cgroup.*/
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+	u64 cur, max;
+
+	if (sgx_cgroup_lru_empty(sgx_cg->cg))
+		return false;
+
+	max = sgx_cgroup_max_pages_to_root(sgx_cg);
+
+	/*
+	 * Unless the limit is very low, maintain a minimal "credit" available
+	 * for charge to avoid per-cgroup reclamation and to serve new
+	 * allocation requests more quickly.
+	 */
+	if (max > SGX_CG_LOW_LIMIT)
+		max -= SGX_CG_MIN_FREE_PAGE;
+
+	cur = sgx_cgroup_page_counter_read(sgx_cg);
+
+	return (cur >= max);
+}
+
+/*
+ * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity.
+ */
+static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_cgroup *root = container_of(work, struct sgx_cgroup, reclaim_work);
+
+	while (sgx_cgroup_should_reclaim(root)) {
+		sgx_cgroup_reclaim_pages(root, SGX_NR_TO_SCAN);
+		cond_resched();
+	}
+}
+
 static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
 {
 	if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
@@ -193,7 +286,8 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
 			goto out;
 
 		if (reclaim == SGX_NO_RECLAIM) {
-			ret = -ENOMEM;
+			queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
+			ret = -EBUSY;
 			goto out;
 		}
 
@@ -202,6 +296,9 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
 		cond_resched();
 	}
 
+	if (sgx_cgroup_should_reclaim(sgx_cg))
+		queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
+
 out:
 	return ret;
 }
@@ -224,6 +321,7 @@ static void sgx_cgroup_free(struct misc_cg *cg)
 	if (!sgx_cg)
 		return;
 
+	cancel_work_sync(&sgx_cg->reclaim_work);
 	/*
 	 * Notify ancestors to not reclaim from this dying cgroup.
 	 * Not start from this cgroup itself because at this point no reference
@@ -242,6 +340,7 @@ static void sgx_cgroup_free(struct misc_cg *cg)
 static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
 {
 	sgx_lru_init(&sgx_cg->lru);
+	INIT_WORK(&sgx_cg->reclaim_work, sgx_cgroup_reclaim_work_func);
 	cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
 	sgx_cg->cg = cg;
 	sgx_cg->next_cg = sgx_cg;
@@ -268,11 +367,25 @@ const struct misc_res_ops sgx_cgroup_ops = {
 
 int __init sgx_cgroup_init(void)
 {
+	sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE,
+				    WQ_UNBOUND_MAX_ACTIVE);
+	if (!sgx_cg_wq) {
+		pr_err("alloc_workqueue() failed for SGX cgroup.\n");
+		return -ENOMEM;
+	}
 	sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
 
 	return 0;
 }
 
+/**
+ * Only called during init to unwind what's done in sgx_cgroup_init()
+ */
+void __init sgx_cgroup_deinit(void)
+{
+	destroy_workqueue(sgx_cg_wq);
+}
+
 /**
  * Register capacity and ops for SGX cgroup.
  * Only called at the end of sgx_init() when SGX is ready to handle the ops
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index da7f5315bff8..37364bdb797f 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -32,6 +32,8 @@ static inline int __init sgx_cgroup_init(void)
 	return 0;
 }
 
+static inline void __init sgx_cgroup_deinit(void) { }
+
 static inline void __init sgx_cgroup_register(void) { }
 
 #else /* CONFIG_CGROUP_MISC */
@@ -39,6 +41,7 @@ static inline void __init sgx_cgroup_register(void) { }
 struct sgx_cgroup {
 	struct misc_cg *cg;
 	struct sgx_epc_lru_list lru;
+	struct work_struct reclaim_work;
 	/*
 	 * Pointer to the next cgroup to scan when the per-cgroup reclamation
 	 * is triggered next time. It does not hold a reference to prevent it
@@ -86,6 +89,7 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
 int __init sgx_cgroup_init(void);
 void __init sgx_cgroup_register(void);
+void __init sgx_cgroup_deinit(void);
 
 #endif /* CONFIG_CGROUP_MISC */
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 303b06f39e4e..4bb8aa019c6a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -1014,12 +1014,15 @@ static int __init sgx_init(void)
 	ret = sgx_drv_init();
 
 	if (sgx_vepc_init() && ret)
-		goto err_provision;
+		goto err_cgroup;
 
 	sgx_cgroup_register();
 
 	return 0;
 
+err_cgroup:
+	sgx_cgroup_deinit();
+
 err_provision:
 	misc_deregister(&sgx_dev_provision);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 10/16] x86/sgx: Implement async reclamation for cgroup
  2024-08-21  1:53 ` [PATCH v16 10/16] x86/sgx: Implement async reclamation " Haitao Huang
@ 2024-08-27 10:22   ` Huang, Kai
  2024-08-27 23:46     ` Huang, Kai
  2024-08-27 18:15   ` Jarkko Sakkinen
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 10:22 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com

On Tue, 2024-08-20 at 18:53 -0700, Haitao Huang wrote:
> +/**
> + * Only called during init to unwind what's done in sgx_cgroup_init()
> + */
> +void __init sgx_cgroup_deinit(void)
> +{
> +	destroy_workqueue(sgx_cg_wq);
> +}
> +

Ditto:

arch/x86/kernel/cpu/sgx/epc_cgroup.c:412: warning: This comment starts with
'/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-
doc.rst
 * Only called during init to unwind what's done in sgx_cgroup_init()

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 10/16] x86/sgx: Implement async reclamation for cgroup
  2024-08-27 10:22   ` Huang, Kai
@ 2024-08-27 23:46     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:46 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com



On 27/08/2024 10:22 pm, Huang, Kai wrote:
> On Tue, 2024-08-20 at 18:53 -0700, Haitao Huang wrote:
>> +/**
>> + * Only called during init to unwind what's done in sgx_cgroup_init()
>> + */
>> +void __init sgx_cgroup_deinit(void)
>> +{
>> +	destroy_workqueue(sgx_cg_wq);
>> +}
>> +
> 
> Ditto:
> 
> arch/x86/kernel/cpu/sgx/epc_cgroup.c:412: warning: This comment starts with
> '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-
> doc.rst
>   * Only called during init to unwind what's done in sgx_cgroup_init()

With this fixed, and things mentioned in ..

https://lore.kernel.org/lkml/D3QWEFR2E2BZ.187FVXI3QQU9U@kernel.org/T/#md5267379c3787d436d3297295fe4da587522444c

.. done:

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 10/16] x86/sgx: Implement async reclamation for cgroup
  2024-08-21  1:53 ` [PATCH v16 10/16] x86/sgx: Implement async reclamation " Haitao Huang
  2024-08-27 10:22   ` Huang, Kai
@ 2024-08-27 18:15   ` Jarkko Sakkinen
  1 sibling, 0 replies; 42+ messages in thread
From: Jarkko Sakkinen @ 2024-08-27 18:15 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

On Wed Aug 21, 2024 at 4:53 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> In cases EPC pages need be allocated during a page fault and the cgroup
> usage is near its limit, an asynchronous reclamation needs to be
> triggered to avoid blocking the page fault handling.
>
> To keep implementation simple, use a workqueue instead of kthread to
> schedule the asynchronous reclamation work. Add corresponding work item and
> function definitions for EPC cgroup.
>
> In sgx_cgroup_try_charge(), if caller does not allow synchronous
> reclamation, queue an asynchronous work into the workqueue.
>
> The current global reclaimer, ksgxd, maintains a threshold for the
> minimal free EPC pages to avoid thrashing when allocating EPC pages.
> Similarly, only reclaiming EPC pages from the current cgroup when its
> usage actually reaches limit could also cause thrashing.  To avoid that,
> define a similar "per-cgroup usage threshold" and actively trigger
> asynchronous per-cgroup EPC reclamation when the usage reaches the
> threshold after try_charge() is successful.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Tested-by: Jarkko Sakkinen <jarkko@kernel.org>

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 11/16] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (9 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 10/16] x86/sgx: Implement async reclamation " Haitao Huang
@ 2024-08-21  1:53 ` Haitao Huang
  2024-08-21  1:54 ` [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups Haitao Huang
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:53 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

Enclave Page Cache(EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(). But it considers all contexts other than the
ksgxd thread are user processes. With the new EPC cgroup implementation,
the swapping can also happen in EPC cgroup work-queue threads. In those
cases, it improperly selects the root mem_cgroup to charge for the RAM
usage.

Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
an additional argument to explicitly specify the mm struct to charge for
allocations. Callers from background kthreads not associated with a
charging mm struct would set it to NULL, while callers in user process
contexts set it to current->mm.

Internally, it handles the case when the charging mm given is NULL, by
searching for an mm struct from enclave's mm_list.

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reported-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V10:
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)

V9:
- Reduce number of if statements. (Tim)

V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c       | 29 ++++++++++++++--------------
 arch/x86/kernel/cpu/sgx/encl.h       |  3 +--
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  9 ++++++---
 arch/x86/kernel/cpu/sgx/main.c       | 29 +++++++++++++---------------
 arch/x86/kernel/cpu/sgx/sgx.h        |  2 +-
 5 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f474179b6f77..7b77dad41daf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde
 }
 
 /*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave.
+ *
+ * Used in sgx_encl_alloc_backing() for backing store allocation.
+ *
+ * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup
+ * of a struct mm stored in the enclave's mm_list.
  */
-static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
+						  struct mm_struct *charge_mm)
 {
 	struct mem_cgroup *memcg = NULL;
 	struct sgx_encl_mm *encl_mm;
 	int idx;
 
-	/*
-	 * If called from normal task context, return the mem_cgroup
-	 * of the current task's mm. The remainder of the handling is for
-	 * ksgxd.
-	 */
-	if (!current_is_ksgxd())
-		return get_mem_cgroup_from_mm(current->mm);
+	 /* Use the charge_mm if given. */
+	if (charge_mm)
+		return get_mem_cgroup_from_mm(charge_mm);
 
 	/*
 	 * Search the enclave's mm_list to find an mm associated with
@@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
  * @encl:	an enclave pointer
  * @page_index:	enclave page index
  * @backing:	data for accessing backing storage for the page
+ * @charge_mm:	the mm to charge for the allocation
  *
- * When called from ksgxd, sets the active memcg from one of the
+ * When charge_mm is NULL, sets the active memcg from one of the
  * mms in the enclave's mm_list prior to any backing page allocation,
  * in order to ensure that shmem page allocations are charged to the
  * enclave.  Create a backing page for loading data back into an EPC page with
@@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
  *   -errno otherwise.
  */
 int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
-			   struct sgx_backing *backing)
+			   struct sgx_backing *backing, struct mm_struct *charge_mm)
 {
-	struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+	struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm);
 	struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
 	int ret;
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fe15ade02ca1..5ce9d108290f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 		     unsigned long end, unsigned long vm_flags);
 
-bool current_is_ksgxd(void);
 void sgx_encl_release(struct kref *ref);
 int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
 const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
 int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
-			   struct sgx_backing *backing);
+			   struct sgx_backing *backing, struct mm_struct *charge_mm);
 void sgx_encl_put_backing(struct sgx_backing *backing);
 int sgx_encl_test_and_clear_young(struct mm_struct *mm,
 				  struct sgx_encl_page *page);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index c40f601d06d9..ae65cac858f8 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -200,8 +200,10 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
 
 /*
  * Scan at least @nr_to_scan pages and attempt to reclaim them from the subtree of @root.
+ * Charge backing store allocation to the given mm, @charge_mm.
  */
 static inline void sgx_cgroup_reclaim_pages(struct sgx_cgroup *root,
+					    struct mm_struct *charge_mm,
 					    unsigned int nr_to_scan)
 {
 	struct sgx_cgroup *next_cg = NULL;
@@ -209,7 +211,7 @@ static inline void sgx_cgroup_reclaim_pages(struct sgx_cgroup *root,
 
 	while (!sgx_cgroup_lru_empty(root->cg) && cnt < nr_to_scan) {
 		next_cg = sgx_cgroup_next_get(root);
-		cnt += sgx_reclaim_pages(&next_cg->lru);
+		cnt += sgx_reclaim_pages(&next_cg->lru, charge_mm);
 		if (next_cg != root)
 			sgx_put_cg(next_cg);
 	}
@@ -247,7 +249,8 @@ static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
 	struct sgx_cgroup *root = container_of(work, struct sgx_cgroup, reclaim_work);
 
 	while (sgx_cgroup_should_reclaim(root)) {
-		sgx_cgroup_reclaim_pages(root, SGX_NR_TO_SCAN);
+		/* Indirect reclaim, no mm to charge, so NULL: */
+		sgx_cgroup_reclaim_pages(root, NULL, SGX_NR_TO_SCAN);
 		cond_resched();
 	}
 }
@@ -291,7 +294,7 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
 			goto out;
 		}
 
-		sgx_cgroup_reclaim_pages(sgx_cg, 1);
+		sgx_cgroup_reclaim_pages(sgx_cg, current->mm, 1);
 
 		cond_resched();
 	}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4bb8aa019c6a..fa00ed5e884d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -268,8 +268,8 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 	}
 }
 
-static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
-				struct sgx_backing *backing)
+static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, struct sgx_backing *backing,
+				struct mm_struct *charge_mm)
 {
 	struct sgx_encl_page *encl_page = epc_page->owner;
 	struct sgx_encl *encl = encl_page->encl;
@@ -285,7 +285,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 
 	if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
 		ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
-					   &secs_backing);
+					   &secs_backing, charge_mm);
 		if (ret)
 			goto out;
 
@@ -304,6 +304,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 /**
  * sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU
  * @lru:	The LRU from which pages are reclaimed.
+ * @charge_mm:	The mm to charge for backing store allocation.
  *
  * Take a fixed number of pages from the head of a given LRU and reclaim them to
  * the enclave's private shmem files. Skip the pages, which have been accessed
@@ -319,7 +320,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  *
  * Return:	Number of pages attempted for reclamation.
  */
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *charge_mm)
 {
 	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -359,7 +360,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
 
 		mutex_lock(&encl_page->encl->lock);
-		ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
+		ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i], charge_mm);
 		if (ret) {
 			mutex_unlock(&encl_page->encl->lock);
 			goto skip;
@@ -391,7 +392,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
 			continue;
 
 		encl_page = epc_page->owner;
-		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_reclaimer_write(epc_page, &backing[i], charge_mm);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
@@ -408,9 +409,9 @@ static bool sgx_should_reclaim_global(unsigned long watermark)
 		sgx_can_reclaim_global();
 }
 
-static void sgx_reclaim_pages_global(void)
+static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
 {
-	sgx_reclaim_pages(&sgx_global_lru);
+	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
 }
 
 /*
@@ -421,7 +422,7 @@ static void sgx_reclaim_pages_global(void)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages_global();
+		sgx_reclaim_pages_global(current->mm);
 }
 
 static int ksgxd(void *p)
@@ -444,7 +445,8 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages_global();
+			/* Indirect reclaim, no mm to charge, so NULL: */
+			sgx_reclaim_pages_global(NULL);
 
 		cond_resched();
 	}
@@ -467,11 +469,6 @@ static bool __init sgx_page_reclaimer_init(void)
 	return true;
 }
 
-bool current_is_ksgxd(void)
-{
-	return current == ksgxd_tsk;
-}
-
 static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
 {
 	struct sgx_numa_node *node = &sgx_numa_nodes[nid];
@@ -632,7 +629,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 		 * Perform a global reclaim to get some pages released from any
 		 * cgroup with reclaimable pages.
 		 */
-		sgx_reclaim_pages_global();
+		sgx_reclaim_pages_global(current->mm);
 		cond_resched();
 	}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index f456abfe5c52..ddc88ae2a488 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -139,7 +139,7 @@ void sgx_reclaim_direct(void);
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *charge_mm);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (10 preceding siblings ...)
  2024-08-21  1:53 ` [PATCH v16 11/16] x86/sgx: Charge mem_cgroup for per-cgroup reclamation Haitao Huang
@ 2024-08-21  1:54 ` Haitao Huang
  2024-08-27 11:32   ` Huang, Kai
  2024-08-27 18:16   ` Jarkko Sakkinen
  2024-08-21  1:54 ` [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups Haitao Huang
                   ` (5 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:54 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

With EPC cgroups, the global reclamation function,
sgx_reclaim_pages_global(), can no longer apply to the global LRU as
pages are now in per-cgroup LRUs.

Create a wrapper, sgx_cgroup_reclaim_global() to invoke
sgx_cgroup_reclaim_pages() passing in the root cgroup. Call this wrapper
from sgx_reclaim_pages_global() when cgroup is enabled. The wrapper will
scan and attempt to reclaim SGX_NR_TO_SCAN pages just like the current
global reclaim.

Note this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in batch
of SGX_NR_TO_SCAN pages): in rare cases when LRUs have less than
SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will
be split into smaller batches _across_ multiple LRUs with each being
smaller than SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch.  But this is unnecessarily complicated at
this stage to address such rare cases.

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 12 ++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  3 +++
 arch/x86/kernel/cpu/sgx/main.c       |  7 +++++++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index ae65cac858f8..23a61689e0d9 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -240,6 +240,18 @@ static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
 	return (cur >= max);
 }
 
+/**
+ * sgx_cgroup_reclaim_pages_global() - Perform one round of global reclamation.
+ *
+ * @charge_mm:	The mm to be charged for the backing store of reclaimed pages.
+ *
+ * Try to scan and attempt reclamation from root cgroup for %SGX_NR_TO_SCAN pages.
+ */
+void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm)
+{
+	sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN);
+}
+
 /*
  * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
  * at/near its maximum capacity.
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 37364bdb797f..c0390111e28c 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -36,6 +36,8 @@ static inline void __init sgx_cgroup_deinit(void) { }
 
 static inline void __init sgx_cgroup_register(void) { }
 
+static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { }
+
 #else /* CONFIG_CGROUP_MISC */
 
 struct sgx_cgroup {
@@ -87,6 +89,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
 int __init sgx_cgroup_init(void);
 void __init sgx_cgroup_register(void);
 void __init sgx_cgroup_deinit(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index fa00ed5e884d..d00cb012838b 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -411,6 +411,13 @@ static bool sgx_should_reclaim_global(unsigned long watermark)
 
 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
 {
+	/*
+	 * Now all EPC pages are still tracked in the @sgx_global_lru.
+	 * Still reclaim from it.
+	 *
+	 * When EPC pages are tracked in the actual per-cgroup LRUs,
+	 * sgx_cgroup_reclaim_pages_global() will be called.
+	 */
 	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups
  2024-08-21  1:54 ` [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups Haitao Huang
@ 2024-08-27 11:32   ` Huang, Kai
  2024-08-27 23:29     ` Huang, Kai
  2024-08-27 18:16   ` Jarkko Sakkinen
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 11:32 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com

On Tue, 2024-08-20 at 18:54 -0700, Haitao Huang wrote:
> With EPC cgroups, the global reclamation function,
> sgx_reclaim_pages_global(), can no longer apply to the global LRU as
> pages are now in per-cgroup LRUs.
> 
> Create a wrapper, sgx_cgroup_reclaim_global() to invoke
> sgx_cgroup_reclaim_pages() passing in the root cgroup. 
> 

[...]

> Call this wrapper
> from sgx_reclaim_pages_global() when cgroup is enabled. 
> 

This is not done in this patch.

> The wrapper will
> scan and attempt to reclaim SGX_NR_TO_SCAN pages just like the current
> global reclaim.
> 
> Note this simple implementation doesn't _exactly_ mimic the current
> global EPC reclaim (which always tries to do the actual reclaim in batch
> of SGX_NR_TO_SCAN pages): in rare cases when LRUs have less than
> SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will
> be split into smaller batches _across_ multiple LRUs with each being
> smaller than SGX_NR_TO_SCAN pages.
> 
> A more precise way to mimic the current global EPC reclaim would be to
> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
> _across_ the given EPC cgroup _AND_ its descendants, and then do the
> actual reclaim in one batch.  But this is unnecessarily complicated at
> this stage to address such rare cases.
> 
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 12 ++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  3 +++
>  arch/x86/kernel/cpu/sgx/main.c       |  7 +++++++
>  3 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index ae65cac858f8..23a61689e0d9 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -240,6 +240,18 @@ static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
>  	return (cur >= max);
>  }
>  
> +/**
> + * sgx_cgroup_reclaim_pages_global() - Perform one round of global reclamation.
> + *
> + * @charge_mm:	The mm to be charged for the backing store of reclaimed pages.
> + *
> + * Try to scan and attempt reclamation from root cgroup for %SGX_NR_TO_SCAN pages.
> + */
> +void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm)
> +{
> +	sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN);
> +}
> +
> 

[...]

>  
>  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>  {
> +	/*
> +	 * Now all EPC pages are still tracked in the @sgx_global_lru.
> +	 * Still reclaim from it.
> +	 *
> +	 * When EPC pages are tracked in the actual per-cgroup LRUs,
> +	 * sgx_cgroup_reclaim_pages_global() will be called.
> +	 */
>  	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>  }
>  

I didn't realize the only functional change of this patch is to add a new
helper sgx_cgroup_reclaim_pages_global() (hmm... it's not even a functional
change because the helper isn't called).

It might make sense to have this as a separate patch with the comment of
sgx_can_reclaim_global() being moved here, from the perspective that this
patch only adds the building block to do global reclaim from per-cgroup LRUs
but doesn't actually turn that on.

But given this patch only adds a helper it might also make sense to just merge
this with the patch "x86/sgx: Turn on per-cgroup EPC reclamation".

I have no hard opinion on this.  Maybe we can keep the current way unless
other people comment in.

Btw, I don't quite follow why you placed this patch here.  It should be placed
right before the patch "x86/sgx: Turn on per-cgroup EPC reclamation" (if we
want to make this as a separate patch).

Or, the patch "x86/sgx: implement direct reclamation for cgroups" should be
moved to an earlier place after patch "x86/sgx: Implement async reclamation
for cgroup".


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups
  2024-08-27 11:32   ` Huang, Kai
@ 2024-08-27 23:29     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:29 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com



On 27/08/2024 11:32 pm, Huang, Kai wrote:
> On Tue, 2024-08-20 at 18:54 -0700, Haitao Huang wrote:
>> With EPC cgroups, the global reclamation function,
>> sgx_reclaim_pages_global(), can no longer apply to the global LRU as
>> pages are now in per-cgroup LRUs.
>>
>> Create a wrapper, sgx_cgroup_reclaim_global() to invoke
>> sgx_cgroup_reclaim_pages() passing in the root cgroup.
>>
> 
> [...]
> 
>> Call this wrapper
>> from sgx_reclaim_pages_global() when cgroup is enabled.
>>
> 
> This is not done in this patch.
> 

With this removed, and ...

> 
> It might make sense to have this as a separate patch with the comment of
> sgx_can_reclaim_global() being moved here, from the perspective that this
> patch only adds the building block to do global reclaim from per-cgroup LRUs
> but doesn't actually turn that on.
> 

... with the comment of sgx_can_reclaim_global() moved to this patch:

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups
  2024-08-21  1:54 ` [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups Haitao Huang
  2024-08-27 11:32   ` Huang, Kai
@ 2024-08-27 18:16   ` Jarkko Sakkinen
  1 sibling, 0 replies; 42+ messages in thread
From: Jarkko Sakkinen @ 2024-08-27 18:16 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

On Wed Aug 21, 2024 at 4:54 AM EEST, Haitao Huang wrote:
> With EPC cgroups, the global reclamation function,
> sgx_reclaim_pages_global(), can no longer apply to the global LRU as
> pages are now in per-cgroup LRUs.
>
> Create a wrapper, sgx_cgroup_reclaim_global() to invoke
> sgx_cgroup_reclaim_pages() passing in the root cgroup. Call this wrapper
> from sgx_reclaim_pages_global() when cgroup is enabled. The wrapper will
> scan and attempt to reclaim SGX_NR_TO_SCAN pages just like the current
> global reclaim.
>
> Note this simple implementation doesn't _exactly_ mimic the current
> global EPC reclaim (which always tries to do the actual reclaim in batch
> of SGX_NR_TO_SCAN pages): in rare cases when LRUs have less than
> SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will
> be split into smaller batches _across_ multiple LRUs with each being
> smaller than SGX_NR_TO_SCAN pages.
>
> A more precise way to mimic the current global EPC reclaim would be to
> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
> _across_ the given EPC cgroup _AND_ its descendants, and then do the
> actual reclaim in one batch.  But this is unnecessarily complicated at
> this stage to address such rare cases.
>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (11 preceding siblings ...)
  2024-08-21  1:54 ` [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups Haitao Huang
@ 2024-08-21  1:54 ` Haitao Huang
  2024-08-27 18:16   ` Jarkko Sakkinen
  2024-08-27 23:55   ` Huang, Kai
  2024-08-21  1:54 ` [PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation Haitao Huang
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:54 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

sgx_reclaim_direct() was introduced to preemptively reclaim some pages
as the best effort to avoid on-demand reclamation that can stall forward
progress in some situations, e.g., allocating pages to load previously
reclaimed page to perform EDMM operations on [1].

Currently when the global usage is close to the capacity,
sgx_reclaim_direct() makes one invocation to sgx_reclaim_pages_global()
but does not guarantee there are free pages available for later
allocations to succeed. In other words, the only goal here is to reduce
the chance of on-demand reclamation at allocation time. In cases of
allocation failure, the caller, the EDMM ioctl()'s, would return -EAGAIN
to user space and let the user space to decide whether to retry or not.

With EPC cgroups enabled, usage of a cgroup can also reach its limit
(usually much lower than capacity) and trigger per-cgroup reclamation.
Implement a similar strategy to reduce the chance of on-demand
per-cgroup reclamation for this use case.

Create a wrapper, sgx_cgroup_reclaim_direct(), to perform a preemptive
reclamation at cgroup level, and have sgx_reclaim_direct() call it when
EPC cgroup is enabled.

[1] https://lore.kernel.org/all/a0d8f037c4a075d56bf79f432438412985f7ff7a.1652137848.git.reinette.chatre@intel.com/T/#u

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 15 +++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  3 +++
 arch/x86/kernel/cpu/sgx/main.c       |  4 ++++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 23a61689e0d9..b7d60b2d878d 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -252,6 +252,21 @@ void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm)
 	sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN);
 }
 
+/**
+ * sgx_cgroup_reclaim_direct() - Preemptive reclamation.
+ *
+ * Scan and attempt to reclaim %SGX_NR_TO_SCAN as best effort to allow caller
+ * make quick progress.
+ */
+void sgx_cgroup_reclaim_direct(void)
+{
+	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
+
+	if (sgx_cgroup_should_reclaim(sgx_cg))
+		sgx_cgroup_reclaim_pages(sgx_cg, current->mm, SGX_NR_TO_SCAN);
+	sgx_put_cg(sgx_cg);
+}
+
 /*
  * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
  * at/near its maximum capacity.
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index c0390111e28c..cf2b946d993e 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -38,6 +38,8 @@ static inline void __init sgx_cgroup_register(void) { }
 
 static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { }
 
+static inline void sgx_cgroup_reclaim_direct(void) { }
+
 #else /* CONFIG_CGROUP_MISC */
 
 struct sgx_cgroup {
@@ -90,6 +92,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
 void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
+void sgx_cgroup_reclaim_direct(void);
 int __init sgx_cgroup_init(void);
 void __init sgx_cgroup_register(void);
 void __init sgx_cgroup_deinit(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d00cb012838b..9a8f91ebd21b 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -428,6 +428,10 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
  */
 void sgx_reclaim_direct(void)
 {
+	/* Reduce chance of per-cgroup reclamation for later allocation */
+	sgx_cgroup_reclaim_direct();
+
+	/* Reduce chance of the global reclamation for later allocation */
 	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
 		sgx_reclaim_pages_global(current->mm);
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups
  2024-08-21  1:54 ` [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups Haitao Huang
@ 2024-08-27 18:16   ` Jarkko Sakkinen
  2024-08-27 23:55   ` Huang, Kai
  1 sibling, 0 replies; 42+ messages in thread
From: Jarkko Sakkinen @ 2024-08-27 18:16 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

On Wed Aug 21, 2024 at 4:54 AM EEST, Haitao Huang wrote:
> sgx_reclaim_direct() was introduced to preemptively reclaim some pages
> as the best effort to avoid on-demand reclamation that can stall forward
> progress in some situations, e.g., allocating pages to load previously
> reclaimed page to perform EDMM operations on [1].
>
> Currently when the global usage is close to the capacity,
> sgx_reclaim_direct() makes one invocation to sgx_reclaim_pages_global()
> but does not guarantee there are free pages available for later
> allocations to succeed. In other words, the only goal here is to reduce
> the chance of on-demand reclamation at allocation time. In cases of
> allocation failure, the caller, the EDMM ioctl()'s, would return -EAGAIN
> to user space and let the user space to decide whether to retry or not.
>
> With EPC cgroups enabled, usage of a cgroup can also reach its limit
> (usually much lower than capacity) and trigger per-cgroup reclamation.
> Implement a similar strategy to reduce the chance of on-demand
> per-cgroup reclamation for this use case.
>
> Create a wrapper, sgx_cgroup_reclaim_direct(), to perform a preemptive
> reclamation at cgroup level, and have sgx_reclaim_direct() call it when
> EPC cgroup is enabled.
>
> [1] https://lore.kernel.org/all/a0d8f037c4a075d56bf79f432438412985f7ff7a.1652137848.git.reinette.chatre@intel.com/T/#u
>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 15 +++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  3 +++
>  arch/x86/kernel/cpu/sgx/main.c       |  4 ++++
>  3 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index 23a61689e0d9..b7d60b2d878d 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -252,6 +252,21 @@ void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm)
>  	sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN);
>  }
>  
> +/**
> + * sgx_cgroup_reclaim_direct() - Preemptive reclamation.
> + *
> + * Scan and attempt to reclaim %SGX_NR_TO_SCAN as best effort to allow caller
> + * make quick progress.
> + */
> +void sgx_cgroup_reclaim_direct(void)
> +{
> +	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +
> +	if (sgx_cgroup_should_reclaim(sgx_cg))
> +		sgx_cgroup_reclaim_pages(sgx_cg, current->mm, SGX_NR_TO_SCAN);
> +	sgx_put_cg(sgx_cg);
> +}
> +
>  /*
>   * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
>   * at/near its maximum capacity.
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index c0390111e28c..cf2b946d993e 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -38,6 +38,8 @@ static inline void __init sgx_cgroup_register(void) { }
>  
>  static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { }
>  
> +static inline void sgx_cgroup_reclaim_direct(void) { }
> +
>  #else /* CONFIG_CGROUP_MISC */
>  
>  struct sgx_cgroup {
> @@ -90,6 +92,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
>  int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
>  void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
>  void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
> +void sgx_cgroup_reclaim_direct(void);
>  int __init sgx_cgroup_init(void);
>  void __init sgx_cgroup_register(void);
>  void __init sgx_cgroup_deinit(void);
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index d00cb012838b..9a8f91ebd21b 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -428,6 +428,10 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>   */
>  void sgx_reclaim_direct(void)
>  {
> +	/* Reduce chance of per-cgroup reclamation for later allocation */
> +	sgx_cgroup_reclaim_direct();
> +
> +	/* Reduce chance of the global reclamation for later allocation */
>  	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
>  		sgx_reclaim_pages_global(current->mm);
>  }


Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups
  2024-08-21  1:54 ` [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups Haitao Huang
  2024-08-27 18:16   ` Jarkko Sakkinen
@ 2024-08-27 23:55   ` Huang, Kai
  2024-08-29  9:34     ` Huang, Kai
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:55 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan



On 21/08/2024 1:54 pm, Haitao Huang wrote:
> sgx_reclaim_direct() was introduced to preemptively reclaim some pages
> as the best effort to avoid on-demand reclamation that can stall forward
> progress in some situations, e.g., allocating pages to load previously
> reclaimed page to perform EDMM operations on [1].
> 
> Currently when the global usage is close to the capacity,
> sgx_reclaim_direct() makes one invocation to sgx_reclaim_pages_global()
> but does not guarantee there are free pages available for later
> allocations to succeed. In other words, the only goal here is to reduce
> the chance of on-demand reclamation at allocation time. In cases of
> allocation failure, the caller, the EDMM ioctl()'s, would return -EAGAIN
> to user space and let the user space to decide whether to retry or not.
> 
> With EPC cgroups enabled, usage of a cgroup can also reach its limit
> (usually much lower than capacity) and trigger per-cgroup reclamation.
> Implement a similar strategy to reduce the chance of on-demand
> per-cgroup reclamation for this use case.

I wish there's some explanation about why we don't just try to bring 
down the usage to limit here, but I guess that's OK since what we do is 
just _trying_ to increase the success rate of the later EPC allocation.

Also, when this is invoked, it should be very rare that the limit is way 
lower than the usage, so ...

> 
> Create a wrapper, sgx_cgroup_reclaim_direct(), to perform a preemptive
> reclamation at cgroup level, and have sgx_reclaim_direct() call it when
> EPC cgroup is enabled.
> 
> [1] https://lore.kernel.org/all/a0d8f037c4a075d56bf79f432438412985f7ff7a.1652137848.git.reinette.chatre@intel.com/T/#u
> 
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

... feel free to add:

Reviewed-by: Kai Huang <kai.huang@intel.com>

> ---
>   arch/x86/kernel/cpu/sgx/epc_cgroup.c | 15 +++++++++++++++
>   arch/x86/kernel/cpu/sgx/epc_cgroup.h |  3 +++
>   arch/x86/kernel/cpu/sgx/main.c       |  4 ++++
>   3 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index 23a61689e0d9..b7d60b2d878d 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -252,6 +252,21 @@ void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm)
>   	sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN);
>   }
>   
> +/**
> + * sgx_cgroup_reclaim_direct() - Preemptive reclamation.
> + *
> + * Scan and attempt to reclaim %SGX_NR_TO_SCAN as best effort to allow caller
> + * make quick progress.
> + */

Nit:

I don't think this is to allow the "caller" to make quick(er) progress?

I should be making the "later EPC allocation" quicker?

> +void sgx_cgroup_reclaim_direct(void)
> +{
> +	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +
> +	if (sgx_cgroup_should_reclaim(sgx_cg))
> +		sgx_cgroup_reclaim_pages(sgx_cg, current->mm, SGX_NR_TO_SCAN);
> +	sgx_put_cg(sgx_cg);
> +}
> +
>   /*
>    * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
>    * at/near its maximum capacity.
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index c0390111e28c..cf2b946d993e 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -38,6 +38,8 @@ static inline void __init sgx_cgroup_register(void) { }
>   
>   static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { }
>   
> +static inline void sgx_cgroup_reclaim_direct(void) { }
> +
>   #else /* CONFIG_CGROUP_MISC */
>   
>   struct sgx_cgroup {
> @@ -90,6 +92,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
>   int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
>   void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
>   void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
> +void sgx_cgroup_reclaim_direct(void);
>   int __init sgx_cgroup_init(void);
>   void __init sgx_cgroup_register(void);
>   void __init sgx_cgroup_deinit(void);
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index d00cb012838b..9a8f91ebd21b 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -428,6 +428,10 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>    */
>   void sgx_reclaim_direct(void)
>   {
> +	/* Reduce chance of per-cgroup reclamation for later allocation */
> +	sgx_cgroup_reclaim_direct();

See.  It says "for later allocation" here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups
  2024-08-27 23:55   ` Huang, Kai
@ 2024-08-29  9:34     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-29  9:34 UTC (permalink / raw)
  To: chenridong@huawei.com, linux-sgx@vger.kernel.org,
	cgroups@vger.kernel.org, mkoutny@suse.com,
	dave.hansen@linux.intel.com, haitao.huang@linux.intel.com,
	tim.c.chen@linux.intel.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, tglx@linutronix.de, tj@kernel.org,
	jarkko@kernel.org, Mehta, Sohil, hpa@zytor.com, bp@alien8.de,
	x86@kernel.org
  Cc: mikko.ylinen@linux.intel.com, seanjc@google.com,
	anakrish@microsoft.com, Zhang, Bo, kristen@linux.intel.com,
	yangjie@microsoft.com, Li, Zhiquan1, chrisyan@microsoft.com


> 
> ... feel free to add:
> 
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> 
> 

One more nitpicking about patch title:

"implement" -> "Implement"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (12 preceding siblings ...)
  2024-08-21  1:54 ` [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups Haitao Huang
@ 2024-08-21  1:54 ` Haitao Huang
  2024-08-27 23:25   ` Huang, Kai
  2024-08-21  1:54 ` [PATCH v16 15/16] Docs/x86/sgx: Add description for cgroup support Haitao Huang
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:54 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_epc_page_lru() always
returns reference to the global LRU.

Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
given EPC page is allocated.

Update sgx_can_reclaim_global(), to check emptiness of LRUs of all
cgroups, and update sgx_reclaim_pages_global(), to utilize
sgx_cgroup_reclaim_pages_global(), when EPC cgroup is enabled.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V16:
- Separated out the global and direct reclamation to earlier patch.(Kai)

V14:
- Update global reclamation to use the new sgx_cgroup_reclaim_pages() to
iterate cgroups at lower level if the top cgroups are too busy.

V13:
- Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC, conditional compile SGX Cgroup for
CONFIGCONFIG_CGROUPMISC. (Jarkko)

V11:
- Reword the comments for global reclamation for allocation failure
after passing cgroup charging. (Kai)
- Add stub functions to remove ifdefs in c file (Kai)
- Add more detailed comments to clarify each page belongs to one cgroup, or the
root. (Kai)

V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  2 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  6 ++++
 arch/x86/kernel/cpu/sgx/main.c       | 45 ++++++++++++++++++----------
 3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index b7d60b2d878d..c3f0c7bc13c6 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -162,7 +162,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
  *
  * Return: %true if all cgroups under the specified root have empty LRU lists.
  */
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
 {
 	struct cgroup_subsys_state *css_root;
 	struct cgroup_subsys_state *pos;
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index cf2b946d993e..cd957cf38204 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -27,6 +27,11 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl
 
 static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
 
+static inline bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+	return true;
+}
+
 static inline int __init sgx_cgroup_init(void)
 {
 	return 0;
@@ -91,6 +96,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+bool sgx_cgroup_lru_empty(struct misc_cg *root);
 void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
 void sgx_cgroup_reclaim_direct(void);
 int __init sgx_cgroup_init(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 9a8f91ebd21b..2a23a10d882e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,9 +32,30 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+/*
+ * Get the per-cgroup or global LRU list that tracks the given reclaimable page.
+ */
 static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page)
 {
+#ifdef CONFIG_CGROUP_MISC
+	/*
+	 * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
+	 * life between sgx_alloc_epc_page() and sgx_free_epc_page():
+	 *
+	 * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
+	 * sgx_get_current_cg() which is the misc cgroup of the current task, or
+	 * the root by default even if the misc cgroup is disabled by kernel
+	 * command line.
+	 *
+	 * epc_page->sgx_cg is only unset by sgx_free_epc_page().
+	 *
+	 * This function is never used before sgx_alloc_epc_page() or after
+	 * sgx_free_epc_page().
+	 */
+	return &epc_page->sgx_cg->lru;
+#else
 	return &sgx_global_lru;
+#endif
 }
 
 /*
@@ -42,14 +63,10 @@ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc
  */
 static inline bool sgx_can_reclaim_global(void)
 {
-	/*
-	 * Now all EPC pages are still tracked in the @sgx_global_lru, so only
-	 * check @sgx_global_lru.
-	 *
-	 * When EPC pages are tracked in the actual per-cgroup LRUs,
-	 * replace with sgx_cgroup_lru_empty(misc_cg_root()).
-	 */
-	return !list_empty(&sgx_global_lru.reclaimable);
+	if (IS_ENABLED(CONFIG_CGROUP_MISC))
+		return !sgx_cgroup_lru_empty(misc_cg_root());
+	else
+		return !list_empty(&sgx_global_lru.reclaimable);
 }
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -411,14 +428,10 @@ static bool sgx_should_reclaim_global(unsigned long watermark)
 
 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
 {
-	/*
-	 * Now all EPC pages are still tracked in the @sgx_global_lru.
-	 * Still reclaim from it.
-	 *
-	 * When EPC pages are tracked in the actual per-cgroup LRUs,
-	 * sgx_cgroup_reclaim_pages_global() will be called.
-	 */
-	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
+	if (IS_ENABLED(CONFIG_CGROUP_MISC))
+		sgx_cgroup_reclaim_pages_global(charge_mm);
+	else
+		sgx_reclaim_pages(&sgx_global_lru, charge_mm);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation
  2024-08-21  1:54 ` [PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation Haitao Huang
@ 2024-08-27 23:25   ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:25 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan



On 21/08/2024 1:54 pm, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Previous patches have implemented all infrastructure needed for
> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
> pages are still tracked in the global LRU as sgx_epc_page_lru() always
> returns reference to the global LRU.
> 
> Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
> given EPC page is allocated.
> 
> Update sgx_can_reclaim_global(), to check emptiness of LRUs of all
> cgroups, and update sgx_reclaim_pages_global(), to utilize
> sgx_cgroup_reclaim_pages_global(), when EPC cgroup is enabled.
> 
> With these changes, the global reclamation and per-cgroup reclamation
> both work properly with all pages tracked in per-cgroup LRUs.
> 
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v16 15/16] Docs/x86/sgx: Add description for cgroup support
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (13 preceding siblings ...)
  2024-08-21  1:54 ` [PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation Haitao Huang
@ 2024-08-21  1:54 ` Haitao Huang
  2024-08-21  1:54 ` [PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing Haitao Huang
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:54 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang<haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang<haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 83 ++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
    first call.  It indicates a bug in the kernel or the userspace client
    if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
    a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running.  The EPC is managed by the SGX subsystem and
+is not accounted by the memory controller.  Note that this is true only for EPC
+memory itself, i.e.  normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing store
+(normal system memory allocated via shmem).  The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+  misc.capacity
+        A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+        resource will show the total amount of EPC memory available on the
+        platform.
+
+  misc.current
+        A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+        resource will show the current active EPC memory usage of the cgroup and
+        its descendants. EPC pages that are swapped out to backing RAM are not
+        included in the current count.
+
+  misc.max
+        A read-write single value file which exists on non-root cgroups. The
+        sgx_epc resource will show the EPC usage hard limit. The default is
+        "max".
+
+        If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+        page fault handling, will be blocked until EPC can be reclaimed from the
+        cgroup. If there are no pages left that are reclaimable within the same
+        group, the kernel returns ENOMEM.
+
+        The EPC pages allocated for a guest VM by the virtual EPC driver are not
+        reclaimable by the host kernel. In case the guest cgroup's limit is
+        reached and no reclaimable pages left in the same cgroup, the virtual
+        EPC driver returns SIGBUS to the user space process to indicate failure
+        on new EPC allocation requests.
+
+        The misc.max limit is non-preemptive. If a user writes a limit lower
+        than the current usage to this file, the cgroup will not preemptively
+        deallocate pages currently in use, and will only start blocking the next
+        allocation and reclaiming EPC at that time.
+
+  misc.events
+        A read-only flat-keyed file which exists on non-root cgroups.
+        A value change in this file generates a file modified event.
+
+          max
+                The number of times the cgroup has triggered a reclaim due to
+                its EPC usage approaching (or exceeding) its max EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it remains charged
+to the original cgroup until the page is released or reclaimed. Migrating a
+process to a different cgroup doesn't move the EPC charges that it incurred
+while in the previous cgroup to its new cgroup.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (14 preceding siblings ...)
  2024-08-21  1:54 ` [PATCH v16 15/16] Docs/x86/sgx: Add description for cgroup support Haitao Huang
@ 2024-08-21  1:54 ` Haitao Huang
  2024-08-27 23:26   ` Huang, Kai
  2024-08-26 10:57 ` [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Mikko Ylinen
  2024-08-27 18:18 ` Jarkko Sakkinen
  17 siblings, 1 reply; 42+ messages in thread
From: Haitao Huang @ 2024-08-21  1:54 UTC (permalink / raw)
  To: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

With different cgroups, the script starts one or multiple concurrent SGX
selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
test case, which loads an enclave of EPC size equal to the EPC capacity
available on the platform. The script checks results against the
expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) small - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) large - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) larger - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all processes exit.

The script also includes a test with low mem_cg limit and large sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg. For this test, it turns off swapping before start,
and turns swapping back on afterwards.

Add README to document how to run the tests.

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V13:
- More improvement on handling error cases and style fixes.
- Add settings file for custom timeout

V12:
- Integrate the scripts to the "run_tests" target. (Jarkko)

V11:
- Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko)
- Drop support for cgroup v1 and simplify. (Michal, Jarkko)
- Add documentation for functions. (Jarkko)
- Turn off swapping before memcontrol tests and back on after
- Format and style fixes, name for hard coded values

V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
 tools/testing/selftests/sgx/Makefile          |   3 +-
 tools/testing/selftests/sgx/README            | 109 +++++++
 tools/testing/selftests/sgx/ash_cgexec.sh     |  16 +
 tools/testing/selftests/sgx/config            |   4 +
 .../selftests/sgx/run_epc_cg_selftests.sh     | 294 ++++++++++++++++++
 tools/testing/selftests/sgx/settings          |   2 +
 .../selftests/sgx/watch_misc_for_tests.sh     |  11 +
 7 files changed, 438 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/sgx/README
 create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
 create mode 100644 tools/testing/selftests/sgx/config
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100644 tools/testing/selftests/sgx/settings
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile
index 03b5e13b872b..3e673b8ace3f 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none
 
 ifeq ($(CAN_BUILD_X86_64), 1)
 TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
-TEST_FILES := $(OUTPUT)/test_encl.elf
+TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh
+TEST_PROGS := run_epc_cg_selftests.sh
 
 all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
 endif
diff --git a/tools/testing/selftests/sgx/README b/tools/testing/selftests/sgx/README
new file mode 100644
index 000000000000..f84406bf29a4
--- /dev/null
+++ b/tools/testing/selftests/sgx/README
@@ -0,0 +1,109 @@
+SGX selftests
+
+The SGX selftests includes a c program (test_sgx) that covers basic user space
+facing APIs and a shell scripts (run_sgx_cg_selftests.sh) testing SGX misc
+cgroup. The SGX cgroup test script requires root privileges and runs a
+specific test case of  the test_sgx in different cgroups configured by the
+script. More details about the cgroup test can be found below.
+
+All SGX selftests can run with or without kselftest framework.
+
+WITH KSELFTEST FRAMEWORK
+=======================
+
+BUILD
+-----
+
+Build executable file "test_sgx" from top level directory of the kernel source:
+ $ make -C tools/testing/selftests TARGETS=sgx
+
+RUN
+---
+
+Run all sgx tests as sudo or root since the cgroup tests need to configure cgroup
+limits in files under /sys/fs/cgroup.
+
+ $ sudo make -C tools/testing/selftests/sgx run_tests
+
+Without sudo, SGX cgroup tests will be skipped.
+
+On platforms with large Enclave Page Cache (EPC) and/or less cpu cores, you may
+need adjust the timeout in 'settings' to avoid timeouts.
+
+More details about kselftest framework can be found in
+Documentation/dev-tools/kselftest.rst.
+
+WITHOUT KSELFTEST FRAMEWORK
+===========================
+
+BUILD
+-----
+
+Build executable file "test_sgx" from this
+directory(tools/testing/selftests/sgx/):
+
+  $ make
+
+RUN
+---
+
+Run all non-cgroup tests:
+
+ $ ./test_sgx
+
+To test SGX cgroup:
+
+ $ sudo ./run_sgx_cg_selftests.sh
+
+THE SGX CGROUP TEST SCRIPTS
+===========================
+
+Overview of the main cgroup test script
+---------------------------------------
+
+With different cgroups, the script (run_sgx_cg_selftests.sh) starts one or
+multiple concurrent SGX selftests (test_sgx), each to run the
+unclobbered_vdso_oversubscribed test case, which loads an enclave of EPC size
+equal to the EPC capacity available on the platform. The script checks results
+against the expectation set for each cgroup and reports success or failure.
+
+The script creates 3 different cgroups at the beginning with following
+expectations:
+
+  1) small - intentionally small enough to fail the test loading an enclave of
+             size equal to the capacity.
+
+  2) large - large enough to run up to 4 concurrent tests but fail some if more
+	     than 4 concurrent tests are run. The script starts 4 expecting at
+	     least one test to pass, and then starts 5 expecting at least one
+             test to fail.
+
+  3) larger - limit is the same as the capacity, large enough to run lots of
+	      concurrent tests. The script starts 8 of them and expects all
+	      pass.  Then it reruns the same test with one process randomly
+	      killed and usage checked to be zero after all processes exit.
+
+The script also includes a test with low mem_cg limit (memory.max) and the
+'large' sgx_epc limit to verify that the RAM used for per-cgroup reclamation is
+charged to a proper mem_cg. To validate mem_cg OOM-kills processes when its
+memory.max limit is reached due to SGX EPC reclamation, the script turns off
+swapping before start, and turns swapping back on afterwards for this particular
+test.
+
+The helper script
+------------------------------------------------------
+
+To monitor the SGX cgroup settings and behaviors or trouble-shoot during
+testing, the helper script, watch_misc_for_tests.sh, can be used to watch
+relevant entries in cgroupfs files. For example, to watch the SGX cgroup
+'current' counter changes during testing, run this in a separate terminal from
+this directory:
+
+  $ ./watch_misc_for_tests.sh current
+
+For more details about SGX cgroups, see "Cgroup Support" in
+Documentation/arch/x86/sgx.rst.
+
+The scripts require cgroup v2 support. More details about cgroup v2 can be found
+in Documentation/admin-guide/cgroup-v2.rst.
+
diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh b/tools/testing/selftests/sgx/ash_cgexec.sh
new file mode 100755
index 000000000000..cfa5d2b0e795
--- /dev/null
+++ b/tools/testing/selftests/sgx/ash_cgexec.sh
@@ -0,0 +1,16 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2024 Intel Corporation.
+
+# Start a program in a given cgroup.
+# Supports V2 cgroup paths, relative to /sys/fs/cgroup
+if [ "$#" -lt 2 ]; then
+    echo "Usage: $0 <v2 cgroup path> <command> [args...]"
+    exit 1
+fi
+# Move this shell to the cgroup.
+echo 0 >/sys/fs/cgroup/$1/cgroup.procs
+shift
+# Execute the command within the cgroup
+exec "$@"
+
diff --git a/tools/testing/selftests/sgx/config b/tools/testing/selftests/sgx/config
new file mode 100644
index 000000000000..e7f1db1d3eff
--- /dev/null
+++ b/tools/testing/selftests/sgx/config
@@ -0,0 +1,4 @@
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_MISC=y
+CONFIG_MEMCG=y
+CONFIG_X86_SGX=y
diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index 000000000000..dab648e0bb53
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,294 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023, 2024 Intel Corporation.
+
+PROCESS_SUCCESS=1
+PROCESS_FAILURE=0
+# Wait for a process and check for expected exit status.
+#
+# Arguments:
+#	$1 - the pid of the process to wait and check.
+#	$2 - 1 if expecting success, 0 for failure.
+#
+# Return:
+#	0 if the exit status of the process matches the expectation.
+#	1 otherwise.
+wait_check_process_status() {
+    pid=$1
+    check_for_success=$2
+
+    wait "$pid"
+    status=$?
+
+    if [ $check_for_success -eq $PROCESS_SUCCESS ] && [ $status -eq 0 ]; then
+        echo "# Process $pid succeeded."
+        return 0
+    elif [ $check_for_success -eq $PROCESS_FAILURE ] && [ $status -ne 0 ]; then
+        echo "# Process $pid returned failure."
+        return 0
+    fi
+    return 1
+}
+
+# Wait for a set of processes and check for expected exit status
+#
+# Arguments:
+#	$1 - 1 if expecting success, 0 for failure.
+#	remaining args - The pids of the processes
+#
+# Return:
+#	0 if exit status of any process matches the expectation.
+#	1 otherwise.
+wait_and_detect_for_any() {
+    check_for_success=$1
+
+    shift
+    detected=1 # 0 for success detection
+
+    for pid in $@; do
+        if wait_check_process_status "$pid" "$check_for_success"; then
+            detected=0
+            # Wait for other processes to exit
+        fi
+    done
+
+    return $detected
+}
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+if [ "$(id -u)" -ne 0 ]; then
+    echo "SKIP: SGX cgroup tests need root privileges."
+    exit $ksft_skip
+fi
+
+cg_root=/sys/fs/cgroup
+if [ ! -d "$cg_root/$test_root_cg" ]; then
+    echo "SKIP: SGX cgroup tests require v2 cgroups."
+    exit $ksft_skip
+fi
+test_root_cg=sgx_kselftest
+#make sure we start clean
+if [ -d "$cg_root/$test_root_cg" ]; then
+    echo "SKIP: Please clean up $cg_root/$test_root_cg."
+    exit $ksft_skip
+fi
+
+test_cg_small_parent=$test_root_cg/sgx_test_small_parent
+test_cg_large=$test_root_cg/sgx_test_large
+test_cg_small=$test_cg_small_parent/sgx_test_small
+test_cg_larger=$test_root_cg/sgx_test_larger
+
+clean_up()
+{
+    # Wait a little for cgroups to reset counters for dead processes.
+    sleep 2
+    rmdir $cg_root/$test_cg_large
+    rmdir $cg_root/$test_cg_small
+    rmdir $cg_root/$test_cg_larger
+    rmdir $cg_root/$test_cg_small_parent
+    rmdir $cg_root/$test_root_cg
+}
+
+mkdir $cg_root/$test_root_cg && \
+mkdir $cg_root/$test_cg_small_parent && \
+mkdir $cg_root/$test_cg_large && \
+mkdir $cg_root/$test_cg_small && \
+mkdir $cg_root/$test_cg_larger
+if [ $? -ne 0 ]; then
+    echo "FAIL: Failed creating cgroups."
+    exit 1
+fi
+
+# Turn on misc and memory controller in non-leaf nodes
+echo "+misc" >  $cg_root/cgroup.subtree_control && \
+echo "+memory" > $cg_root/cgroup.subtree_control && \
+echo "+misc" >  $cg_root/$test_root_cg/cgroup.subtree_control && \
+echo "+memory" > $cg_root/$test_root_cg/cgroup.subtree_control && \
+echo "+misc" >  $cg_root/$test_cg_small_parent/cgroup.subtree_control
+if [ $? -ne 0 ]; then
+    echo "FAIL: can't set up cgroups, make sure misc and memory cgroups are enabled."
+    clean_up
+    exit 1
+fi
+
+epc_capacity=$(grep "sgx_epc" "$cg_root/misc.capacity" | awk '{print $2}')
+
+# This is below number of VA pages needed for enclave of capacity size. So
+# should fail oversubscribed cases
+epc_small_limit=$(( epc_capacity / 512 ))
+
+# At least load one enclave of capacity size successfully, maybe up to 4.
+# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+epc_large_limit=$(( epc_small_limit * 4 ))
+
+# Load lots of enclaves
+epc_larger_limit=$epc_capacity
+echo "# Setting up SGX cgroup limits."
+echo "sgx_epc $epc_small_limit" > $cg_root/$test_cg_small_parent/misc.max && \
+echo "sgx_epc $epc_large_limit" >  $cg_root/$test_cg_large/misc.max && \
+echo "sgx_epc $epc_larger_limit" > $cg_root/$test_cg_larger/misc.max
+if [ $? -ne 0 ]; then
+    echo "# Failed setting up misc limits for sgx_epc."
+    echo "SKIP: Kernel does not support SGX cgroup."
+    clean_up
+    exit $ksft_skip
+fi
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+echo "# Start unclobbered_vdso_oversubscribed with small EPC limit, expecting failure..."
+./ash_cgexec.sh $test_cg_small $test_cmd >/dev/null 2>&1
+if [ $? -eq 0 ]; then
+    echo "FAIL: Fail on small EPC limit, not expecting any test passes."
+    clean_up
+    exit 1
+else
+    echo "# Test failed as expected."
+fi
+
+echo "PASS: small EPC limit test."
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with large EPC limit, \
+expecting at least one success...."
+
+pids=""
+for i in 1 2 3 4; do
+    (
+        ./ash_cgexec.sh $test_cg_large $test_cmd >/dev/null 2>&1
+    ) &
+    pids="$pids $!"
+done
+
+if wait_and_detect_for_any $PROCESS_SUCCESS "$pids"; then
+    echo "PASS: large EPC limit positive testing."
+else
+    echo "FAIL: Failed on large EPC limit positive testing, no test passes."
+    clean_up
+    exit 1
+fi
+
+echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with large EPC limit, \
+expecting at least one failure...."
+pids=""
+for i in 1 2 3 4 5; do
+    (
+        ./ash_cgexec.sh $test_cg_large $test_cmd >/dev/null 2>&1
+    ) &
+    pids="$pids $!"
+done
+
+if wait_and_detect_for_any $PROCESS_FAILURE "$pids"; then
+    echo "PASS: large EPC limit negative testing."
+else
+    echo "FAIL: Failed on large EPC limit negative testing, no test fails."
+    clean_up
+    exit 1
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with larger EPC limit, \
+expecting no failure...."
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
+    (
+        ./ash_cgexec.sh $test_cg_larger $test_cmd >/dev/null 2>&1
+    ) &
+    pids="$pids $!"
+done
+
+if wait_and_detect_for_any $PROCESS_FAILURE "$pids"; then
+    echo "FAIL: Failed on larger EPC limit, at least one test fails."
+    clean_up
+    exit 1
+else
+    echo "PASS: larger EPC limit tests."
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with larger EPC limit,\
+ randomly kill one, expecting no failure...."
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
+    (
+        ./ash_cgexec.sh $test_cg_larger $test_cmd >/dev/null 2>&1
+    ) &
+    pids="$pids $!"
+done
+random_number=$(awk 'BEGIN{srand();print int(rand()*5)}')
+sleep $((random_number + 1))
+
+# Randomly select a process to kill
+# Make sure usage counter not leaked at the end.
+random_index=$(awk 'BEGIN{srand();print int(rand()*8)}')
+counter=0
+for pid in $pids; do
+    if [ "$counter" -eq "$random_index" ]; then
+        pid_to_kill=$pid
+        break
+    fi
+    counter=$((counter + 1))
+done
+
+kill $pid_to_kill
+echo "# Killed process with PID: $pid_to_kill"
+
+any_failure=0
+for pid in $pids; do
+    wait "$pid"
+    status=$?
+    if [ "$pid" != "$pid_to_kill" ]; then
+        if [ $status -ne 0 ]; then
+	    echo "# Process $pid returned failure."
+            any_failure=1
+        fi
+    fi
+done
+
+if [ $any_failure -ne 0 ]; then
+    echo "FAIL: Failed on random killing, at least one test fails."
+    clean_up
+    exit 1
+fi
+echo "PASS: larger EPC limit test with a process randomly killed."
+
+mem_limit_too_small=$((epc_capacity - 2 * epc_large_limit))
+
+echo "$mem_limit_too_small" > $cg_root/$test_cg_large/memory.max
+if [ $? -ne 0 ]; then
+    echo "FAIL: Failed setting up memory controller."
+    clean_up
+    exit 1
+fi
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with large EPC limit, \
+and too small RAM limit, expecting all failures...."
+# Ensure swapping off so the OOM killer is activated when mem_cgroup limit is hit.
+swapoff -a
+pids=""
+for i in 1 2 3 4; do
+    (
+        ./ash_cgexec.sh $test_cg_large $test_cmd >/dev/null 2>&1
+    ) &
+    pids="$pids $!"
+done
+
+if wait_and_detect_for_any $PROCESS_SUCCESS "$pids"; then
+    echo "FAIL: Failed on tests with memcontrol, some tests did not fail."
+    clean_up
+    swapon -a
+    exit 1
+else
+    swapon -a
+    echo "PASS: large EPC limit tests with memcontrol."
+fi
+
+sleep 2
+
+epc_usage=$(grep '^sgx_epc' "$cg_root/$test_root_cg/misc.current" | awk '{print $2}')
+if [ "$epc_usage" -ne 0 ]; then
+    echo "FAIL: Final usage is $epc_usage, not 0."
+else
+    echo "PASS: leakage check."
+    echo "PASS: ALL cgroup limit tests, cleanup cgroups..."
+fi
+clean_up
+echo "# Done SGX cgroup tests."
diff --git a/tools/testing/selftests/sgx/settings b/tools/testing/selftests/sgx/settings
new file mode 100644
index 000000000000..4bf7dcbf9fa8
--- /dev/null
+++ b/tools/testing/selftests/sgx/settings
@@ -0,0 +1,2 @@
+# This timeout may need be increased for platforms with EPC larger than 4G
+timeout=140
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..9280a5e0962b
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023, 2024 Intel Corporation.
+
+if [ -z "$1" ]; then
+    echo "No argument supplied, please provide 'max', 'current', or 'events'"
+    exit 1
+fi
+
+watch -n 1 'find /sys/fs/cgroup -wholename "*/sgx_test*/misc.'$1'" -exec \
+    sh -c '\''echo "$1:"; cat "$1"'\'' _ {} \;'
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing
  2024-08-21  1:54 ` [PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing Haitao Huang
@ 2024-08-27 23:26   ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2024-08-27 23:26 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan



On 21/08/2024 1:54 pm, Haitao Huang wrote:
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
> 
> The script creates 3 different cgroups at the beginning with following
> expectations:
> 
> 1) small - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) large - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) larger - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
> 
> The script also includes a test with low mem_cg limit and large sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
> 
> Add README to document how to run the tests.
> 
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
> ---

Acked-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 00/16] Add Cgroup support for SGX EPC memory
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (15 preceding siblings ...)
  2024-08-21  1:54 ` [PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing Haitao Huang
@ 2024-08-26 10:57 ` Mikko Ylinen
  2024-08-27 18:18 ` Jarkko Sakkinen
  17 siblings, 0 replies; 42+ messages in thread
From: Mikko Ylinen @ 2024-08-26 10:57 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, yangjie, chrisyan

On Tue, Aug 20, 2024 at 06:53:48PM -0700, Haitao Huang wrote:
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The existing
> cgroup memory controller cannot be used to limit or account for SGX EPC
> memory, which is a desirable feature in some environments, e.g., support
> for pod level control in a Kubernates cluster on a VM or bare-metal host
> [1,2].
>  
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. A user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.
>  
> Much like normal system memory, EPC memory can be overcommitted via virtual
> memory techniques and pages can be swapped out of the EPC to their backing
> store, which are normal system memory allocated via shmem and accounted by
> the memory controller. Similar to per-cgroup reclamation done by the memory
> controller, the EPC misc controller needs to implement a per-cgroup EPC
> reclaiming process: when the EPC usage of a cgroup reaches its hard limit
> ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
> some EPC pages within the same cgroup to make room for new allocations.
>  
> For that, this implementation tracks reclaimable EPC pages in a separate
> LRU list in each cgroup, and below are more details and justification of
> this design. 
>  
> Track EPC pages in per-cgroup LRUs (from Dave)
> ----------------------------------------------
>  
> tl;dr: A cgroup hitting its limit should be as similar as possible to the
> system running out of EPC memory. The only two choices to implement that
> are nasty changes the existing LRU scanning algorithm, or to add new LRUs.
> The result: Add a new LRU for each cgroup and scans those instead. Replace
> the existing global cgroup with the root cgroup's LRU (only when this new
> support is compiled in, obviously).
>  
> The existing EPC memory management aims to be a miniature version of the
> core VM where EPC memory can be overcommitted and reclaimed. EPC
> allocations can wait for reclaim. The alternative to waiting would have
> been to send a signal and let the enclave die.
>  
> This series attempts to implement that same logic for cgroups, for the same
> reasons: it's preferable to wait for memory to become available and let
> reclaim happen than to do things that are fatal to enclaves.
>  
> There is currently a global reclaimable page SGX LRU list. That list (and
> the existing scanning algorithm) is essentially useless for doing reclaim
> when a cgroup hits its limit because the cgroup's pages are scattered
> around that LRU. It is unspeakably inefficient to scan a linked list with
> millions of entries for what could be dozens of pages from a cgroup that
> needs reclaim.
>  
> Even if unspeakably slow reclaim was accepted, the existing scanning
> algorithm only picks a few pages off the head of the global LRU. It would
> either need to hold the list locks for unreasonable amounts of time, or be
> taught to scan the list in pieces, which has its own challenges.
>  
> Unreclaimable Enclave Pages
> ---------------------------
>  
> There are a variety of page types for enclaves, each serving different
> purposes [5]. Although the SGX architecture supports swapping for all
> types, some special pages, e.g., Version Array(VA) and Secure Enclave
> Control Structure (SECS)[5], holds meta data of reclaimed pages and
> enclaves. That makes reclamation of such pages more intricate to manage.
> The SGX driver global reclaimer currently does not swap out VA pages. It
> only swaps the SECS page of an enclave when all other associated pages have
> been swapped out. The cgroup reclaimer follows the same approach and does
> not track those in per-cgroup LRUs and considers them as unreclaimable
> pages. The allocation of these pages is counted towards the usage of a
> specific cgroup and is subject to the cgroup's set EPC limits.
>  
> Earlier versions of this series implemented forced enclave-killing to
> reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
> particularly in scenarios where a user or administrator reduces this limit
> post-launch of enclaves. However, subsequent discussions [3, 4] indicated
> that such preemptive enforcement is not necessary for the misc-controllers.
> Therefore, reclaiming SECS/VA pages by force-killing enclaves were removed,
> and the limit is only enforced at the time of new EPC allocation request.
> When a cgroup hits its limit but nothing left in the LRUs of the subtree,
> i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
> within that cgroup will result in an 'ENOMEM'.
>  
> Unreclaimable Guest VM EPC Pages
> --------------------------------
>  
> The EPC pages allocated for guest VMs by the virtual EPC driver are not
> reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
> those as unreclaimable and returns ENOMEM when its limit is hit and nothing
> reclaimable left within the cgroup. The virtual EPC driver translates the
> ENOMEM error resulted from an EPC allocation request into a SIGBUS to the
> user process exactly the same way handling host running out of physical
> EPC.
>  
> This work was originally authored by Sean Christopherson a few years ago,
> and previously modified by Kristen C. Accardi to utilize the misc cgroup
> controller rather than a custom controller. I have been updating the
> patches based on review comments since V2 [7-20], simplified the
> implementation/design, added selftest scripts, fixed some stability issues
> found from testing.
>  
> Thanks to all for the review/test/tags/feedback provided on the previous
> versions. 
>  
> I appreciate your further reviewing/testing and providing tags if
> appropriate.

I tested using this series and also with the v16_plus fix. My container/
Kubernetes tests pass as expected.

Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>

Regards, Mikko

>  
> ---
> V16:
> - Revised the per-cgroup reclamation basic flow
>   sgx_cgroup_reclaim_pages(): add next_cg field in each cgroup to track
>   the next descendant to scan, and create a synchronized iterator to more
>   fairly scan all descendants if needed for reclamation. (Kai)
> - Separate patches to abstract the uses of global LRU, sgx_cgroup_reclaim_direct(),
>   sgx_cgroup_reclaim_global() implementaions. (Kai)
> - MISC don't call the ops if capacity is zero. (Kai)
> - Commit message improvements, clarified requirements for per-cgroup
>   reclamation. (Kai)
> - Fix bugs in handling failures during init.
> - Only turn on callbacks and set capacity at the end of sgx_init()
> 
> V15:
> - Disable SGX when sgx_cgroup_init() fails instead of using BUG_ON()
>   (Jarkko)
> - Reset capacity if sgx_cgroup_init() fails. (Kai)
> - Style fixes (Jarkko, Kai)
> - In misc.c, only invoke the ->free() callbacks for resource types whose
>   ->alloc() callback  was called and returned success. (Ridong)
> 
> V14:
> - modified sgx_cgroup_reclaim_pages() to return the next node. Caller can use it as the new
>   starting node for next round of reclamation attempt if needed. This is to fix a corner case
> where a super busy top level cgroup may block reclamation in lower level cgroups. (Kai)
> - Move renaming of sgx_should_reclaim_global() to the patch 'x86/sgx: Add basic EPC reclamation
> flow for cgroup'. (Kai)
> 
> v13:
> - Only allocate workqueue for SGX cgroup when misc is enabled and BUG_ON() when allocation fails
> - Add more tags
> - Commit logs and style improvements (Kai)
> - Test script improvements (Jarkko)
>  
> V12:
> - Integrate test scripts to kselftests "run_tests" target. (Jarkko)
> - Remove CGROUP_SGX_EPC kconfig, conditionally compile with CGROUP_MISC enabled. (Jarkko)
> - Explain why taking 'struct misc_cg *cg' as parameter, but not 'struct misc_res *res' in the
>   changelog for patch #2. (Kai)
> - Remove "unlikely" in patch #2 (Kai)
>   
> V11:
> - Update copyright years and use c style (Kai)
> - Improve and simplify test scripts: remove cgroup-tools and bash dependency, drop cgroup v1.
>   (Jarkko, Michal)
> - Add more stub/wrapper functions to minimize #ifdefs in c file. (Kai)
> - Revise commit message for patch #8 to clarify design rational (Kai)
> - Print error instead of WARN for init failure. (Kai)
> - Add check for need to queue an async reclamation before returning from
>   sgx_cgroup_try_charge(), do so if needed.
> 
> V10:
> - Use enum instead of boolean for the 'reclaim' parameters in
>   sgx_alloc_epc_page(). (Dave, Jarkko)
> - Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
> - Add comments/macros to clarify the cgroup async reclaimer design. (Kai)
> - Simplify sgx_reclaim_pages() signature, removing a pointer passed in.
>   (Kai)
> - Clarify design of sgx_cgroup_reclaim_pages(). (Kai)
> 	- Does not return a value for callers to check.
> 	- Its usage pattern is similar to that of sgx_reclaim_pages() now
> - Add cond_resched() in the loop in the cgroup reclaimer to improve
>   liveliness.
> - Add logic for cgroup level reclamation in sgx_reclaim_direct()
> - Restructure V9 patches 7-10 to make them flow better. (Kai)
> - Disable cgroup if workqueue allocation failed during init. (Kai)
> - Shorten names for EPC cgroup functions, structures and variables.
>   (Jarkko)
> - Separate out a helper for for addressing single iteration of the loop in
>   sgx_cgroup_try_charge(). (Jarkko)
> - More cleanup/clarifying/comments/style fixes. (Kai, Jarkko)  
>  
> V9:
> - Add comments for static variables outside functions. (Jarkko)
> - Remove unnecessary ifs. (Tim)
> - Add more Reviewed-By: tags from Jarkko and TJ.
>  
> V8:
> - Style fixes. (Jarkko)
> - Abstract _misc_res_free/alloc() (Jarkko)
> - Remove unneeded NULL checks. (Jarkko)
>  
> V7:
> - Split the large patch for the final EPC implementation, #10 in V6, into
>   smaller ones. (Dave, Kai)
> - Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
>   into two functions (Kai)
> - Removed patches to introduce the EPC page states, list for storing
>   candidate pages for reclamation. (not needed due to above changes)
> - Make ops one per resource type and store them in array (Michal)
> - Rename the ops struct to misc_res_ops, and enforce the constraints of
>   required callback functions (Jarkko)
> - Initialize epc cgroup in sgx driver init function. (Kai)
> - Moved addition of priv field to patch 4 where it was used first. (Jarkko)
> - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
> - Use a static for root cgroup (Kai)
>  
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/
> [4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
> [5]Documentation/arch/x86/sgx.rst, Section"Enclave Page Types"
> [6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
> [7]v2: https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
> [8]v3: https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang@linux.intel.com/
> [9]v4: https://lore.kernel.org/all/20230913040635.28815-1-haitao.huang@linux.intel.com/
> [10]v5: https://lore.kernel.org/all/20230923030657.16148-1-haitao.huang@linux.intel.com/
> [11]v6: https://lore.kernel.org/linux-sgx/20231030182013.40086-1-haitao.huang@linux.intel.com/
> [12]v7: https://lore.kernel.org/linux-sgx/20240122172048.11953-1-haitao.huang@linux.intel.com/T/#t
> [13]v8: https://lore.kernel.org/linux-sgx/20240130020938.10025-1-haitao.huang@linux.intel.com/T/#t
> [14]v9: https://lore.kernel.org/lkml/20240205210638.157741-1-haitao.huang@linux.intel.com/T/
> [15]v10: https://lore.kernel.org/linux-sgx/20240328002229.30264-1-haitao.huang@linux.intel.com/T/#t
> [16]v11: https://lore.kernel.org/lkml/20240410182558.41467-1-haitao.huang@linux.intel.com/
> [17]v12: https://lore.kernel.org/lkml/20240416032011.58578-1-haitao.huang@linux.intel.com/
> [18]v13: https://lore.kernel.org/lkml/20240430195108.5676-1-haitao.huang@linux.intel.com/ 
> [19]v14: https://lore.kernel.org/linux-sgx/20240531222630.4634-1-haitao.huang@linux.intel.com/T/#t
> [20]v15: https://lore.kernel.org/linux-sgx/20240617125321.36658-1-haitao.huang@linux.intel.com/T/#t
> 
> Haitao Huang (7):
>   x86/sgx: Replace boolean parameters with enums
>   x86/sgx: Encapsulate uses of the global LRU
>   x86/sgx: Add basic EPC reclamation flow for cgroup
>   x86/sgx: Charge mem_cgroup for per-cgroup reclamation
>   x86/sgx: Revise global reclamation for EPC cgroups
>   x86/sgx: implement direct reclamation for cgroups
>   selftests/sgx: Add scripts for EPC cgroup testing
> 
> Kristen Carlson Accardi (7):
>   cgroup/misc: Add per resource callbacks for CSS events
>   cgroup/misc: Export APIs for SGX driver
>   cgroup/misc: Add SGX EPC resource type
>   x86/sgx: Implement basic EPC misc cgroup functionality
>   x86/sgx: Abstract tracking reclaimable pages in LRU
>   x86/sgx: Implement async reclamation for cgroup
>   x86/sgx: Turn on per-cgroup EPC reclamation
> 
> Sean Christopherson (2):
>   x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
>   Docs/x86/sgx: Add description for cgroup support
> 
>  Documentation/arch/x86/sgx.rst                |  83 ++++
>  arch/x86/kernel/cpu/sgx/Makefile              |   1 +
>  arch/x86/kernel/cpu/sgx/encl.c                |  41 +-
>  arch/x86/kernel/cpu/sgx/encl.h                |   7 +-
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 438 ++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h          | 108 +++++
>  arch/x86/kernel/cpu/sgx/ioctl.c               |  10 +-
>  arch/x86/kernel/cpu/sgx/main.c                | 219 ++++++---
>  arch/x86/kernel/cpu/sgx/sgx.h                 |  54 ++-
>  arch/x86/kernel/cpu/sgx/virt.c                |   2 +-
>  include/linux/misc_cgroup.h                   |  41 ++
>  kernel/cgroup/misc.c                          | 113 ++++-
>  tools/testing/selftests/sgx/Makefile          |   3 +-
>  tools/testing/selftests/sgx/README            | 109 +++++
>  tools/testing/selftests/sgx/ash_cgexec.sh     |  16 +
>  tools/testing/selftests/sgx/config            |   4 +
>  .../selftests/sgx/run_epc_cg_selftests.sh     | 294 ++++++++++++
>  tools/testing/selftests/sgx/settings          |   2 +
>  .../selftests/sgx/watch_misc_for_tests.sh     |  11 +
>  19 files changed, 1444 insertions(+), 112 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>  create mode 100644 tools/testing/selftests/sgx/README
>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
>  create mode 100644 tools/testing/selftests/sgx/config
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100644 tools/testing/selftests/sgx/settings
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
> 
> 
> base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v16 00/16] Add Cgroup support for SGX EPC memory
  2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (16 preceding siblings ...)
  2024-08-26 10:57 ` [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Mikko Ylinen
@ 2024-08-27 18:18 ` Jarkko Sakkinen
  17 siblings, 0 replies; 42+ messages in thread
From: Jarkko Sakkinen @ 2024-08-27 18:18 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, kai.huang, tj, mkoutny, chenridong,
	linux-kernel, linux-sgx, x86, cgroups, tglx, mingo, bp, hpa,
	sohil.mehta, tim.c.chen
  Cc: zhiquan1.li, kristen, seanjc, zhanb, anakrish, mikko.ylinen,
	yangjie, chrisyan

On Wed Aug 21, 2024 at 4:53 AM EEST, Haitao Huang wrote:
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The existing
> cgroup memory controller cannot be used to limit or account for SGX EPC
> memory, which is a desirable feature in some environments, e.g., support
> for pod level control in a Kubernates cluster on a VM or bare-metal host
> [1,2].
>  
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. A user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.
>  
> Much like normal system memory, EPC memory can be overcommitted via virtual
> memory techniques and pages can be swapped out of the EPC to their backing
> store, which are normal system memory allocated via shmem and accounted by
> the memory controller. Similar to per-cgroup reclamation done by the memory
> controller, the EPC misc controller needs to implement a per-cgroup EPC
> reclaiming process: when the EPC usage of a cgroup reaches its hard limit
> ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
> some EPC pages within the same cgroup to make room for new allocations.
>  
> For that, this implementation tracks reclaimable EPC pages in a separate
> LRU list in each cgroup, and below are more details and justification of
> this design. 
>  
> Track EPC pages in per-cgroup LRUs (from Dave)
> ----------------------------------------------
>  
> tl;dr: A cgroup hitting its limit should be as similar as possible to the
> system running out of EPC memory. The only two choices to implement that
> are nasty changes the existing LRU scanning algorithm, or to add new LRUs.
> The result: Add a new LRU for each cgroup and scans those instead. Replace
> the existing global cgroup with the root cgroup's LRU (only when this new
> support is compiled in, obviously).
>  
> The existing EPC memory management aims to be a miniature version of the
> core VM where EPC memory can be overcommitted and reclaimed. EPC
> allocations can wait for reclaim. The alternative to waiting would have
> been to send a signal and let the enclave die.
>  
> This series attempts to implement that same logic for cgroups, for the same
> reasons: it's preferable to wait for memory to become available and let
> reclaim happen than to do things that are fatal to enclaves.
>  
> There is currently a global reclaimable page SGX LRU list. That list (and
> the existing scanning algorithm) is essentially useless for doing reclaim
> when a cgroup hits its limit because the cgroup's pages are scattered
> around that LRU. It is unspeakably inefficient to scan a linked list with
> millions of entries for what could be dozens of pages from a cgroup that
> needs reclaim.
>  
> Even if unspeakably slow reclaim was accepted, the existing scanning
> algorithm only picks a few pages off the head of the global LRU. It would
> either need to hold the list locks for unreasonable amounts of time, or be
> taught to scan the list in pieces, which has its own challenges.
>  
> Unreclaimable Enclave Pages
> ---------------------------
>  
> There are a variety of page types for enclaves, each serving different
> purposes [5]. Although the SGX architecture supports swapping for all
> types, some special pages, e.g., Version Array(VA) and Secure Enclave
> Control Structure (SECS)[5], holds meta data of reclaimed pages and
> enclaves. That makes reclamation of such pages more intricate to manage.
> The SGX driver global reclaimer currently does not swap out VA pages. It
> only swaps the SECS page of an enclave when all other associated pages have
> been swapped out. The cgroup reclaimer follows the same approach and does
> not track those in per-cgroup LRUs and considers them as unreclaimable
> pages. The allocation of these pages is counted towards the usage of a
> specific cgroup and is subject to the cgroup's set EPC limits.
>  
> Earlier versions of this series implemented forced enclave-killing to
> reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
> particularly in scenarios where a user or administrator reduces this limit
> post-launch of enclaves. However, subsequent discussions [3, 4] indicated
> that such preemptive enforcement is not necessary for the misc-controllers.
> Therefore, reclaiming SECS/VA pages by force-killing enclaves were removed,
> and the limit is only enforced at the time of new EPC allocation request.
> When a cgroup hits its limit but nothing left in the LRUs of the subtree,
> i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
> within that cgroup will result in an 'ENOMEM'.
>  
> Unreclaimable Guest VM EPC Pages
> --------------------------------
>  
> The EPC pages allocated for guest VMs by the virtual EPC driver are not
> reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
> those as unreclaimable and returns ENOMEM when its limit is hit and nothing
> reclaimable left within the cgroup. The virtual EPC driver translates the
> ENOMEM error resulted from an EPC allocation request into a SIGBUS to the
> user process exactly the same way handling host running out of physical
> EPC.
>  
> This work was originally authored by Sean Christopherson a few years ago,
> and previously modified by Kristen C. Accardi to utilize the misc cgroup
> controller rather than a custom controller. I have been updating the
> patches based on review comments since V2 [7-20], simplified the
> implementation/design, added selftest scripts, fixed some stability issues
> found from testing.
>  
> Thanks to all for the review/test/tags/feedback provided on the previous
> versions. 
>  
> I appreciate your further reviewing/testing and providing tags if
> appropriate.
>  
> ---
> V16:
> - Revised the per-cgroup reclamation basic flow
>   sgx_cgroup_reclaim_pages(): add next_cg field in each cgroup to track
>   the next descendant to scan, and create a synchronized iterator to more
>   fairly scan all descendants if needed for reclamation. (Kai)
> - Separate patches to abstract the uses of global LRU, sgx_cgroup_reclaim_direct(),
>   sgx_cgroup_reclaim_global() implementaions. (Kai)
> - MISC don't call the ops if capacity is zero. (Kai)
> - Commit message improvements, clarified requirements for per-cgroup
>   reclamation. (Kai)
> - Fix bugs in handling failures during init.
> - Only turn on callbacks and set capacity at the end of sgx_init()
>
> V15:
> - Disable SGX when sgx_cgroup_init() fails instead of using BUG_ON()
>   (Jarkko)
> - Reset capacity if sgx_cgroup_init() fails. (Kai)
> - Style fixes (Jarkko, Kai)
> - In misc.c, only invoke the ->free() callbacks for resource types whose
>   ->alloc() callback  was called and returned success. (Ridong)
>
> V14:
> - modified sgx_cgroup_reclaim_pages() to return the next node. Caller can use it as the new
>   starting node for next round of reclamation attempt if needed. This is to fix a corner case
> where a super busy top level cgroup may block reclamation in lower level cgroups. (Kai)
> - Move renaming of sgx_should_reclaim_global() to the patch 'x86/sgx: Add basic EPC reclamation
> flow for cgroup'. (Kai)
>
> v13:
> - Only allocate workqueue for SGX cgroup when misc is enabled and BUG_ON() when allocation fails
> - Add more tags
> - Commit logs and style improvements (Kai)
> - Test script improvements (Jarkko)
>  
> V12:
> - Integrate test scripts to kselftests "run_tests" target. (Jarkko)
> - Remove CGROUP_SGX_EPC kconfig, conditionally compile with CGROUP_MISC enabled. (Jarkko)
> - Explain why taking 'struct misc_cg *cg' as parameter, but not 'struct misc_res *res' in the
>   changelog for patch #2. (Kai)
> - Remove "unlikely" in patch #2 (Kai)
>   
> V11:
> - Update copyright years and use c style (Kai)
> - Improve and simplify test scripts: remove cgroup-tools and bash dependency, drop cgroup v1.
>   (Jarkko, Michal)
> - Add more stub/wrapper functions to minimize #ifdefs in c file. (Kai)
> - Revise commit message for patch #8 to clarify design rational (Kai)
> - Print error instead of WARN for init failure. (Kai)
> - Add check for need to queue an async reclamation before returning from
>   sgx_cgroup_try_charge(), do so if needed.
>
> V10:
> - Use enum instead of boolean for the 'reclaim' parameters in
>   sgx_alloc_epc_page(). (Dave, Jarkko)
> - Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
> - Add comments/macros to clarify the cgroup async reclaimer design. (Kai)
> - Simplify sgx_reclaim_pages() signature, removing a pointer passed in.
>   (Kai)
> - Clarify design of sgx_cgroup_reclaim_pages(). (Kai)
> 	- Does not return a value for callers to check.
> 	- Its usage pattern is similar to that of sgx_reclaim_pages() now
> - Add cond_resched() in the loop in the cgroup reclaimer to improve
>   liveliness.
> - Add logic for cgroup level reclamation in sgx_reclaim_direct()
> - Restructure V9 patches 7-10 to make them flow better. (Kai)
> - Disable cgroup if workqueue allocation failed during init. (Kai)
> - Shorten names for EPC cgroup functions, structures and variables.
>   (Jarkko)
> - Separate out a helper for for addressing single iteration of the loop in
>   sgx_cgroup_try_charge(). (Jarkko)
> - More cleanup/clarifying/comments/style fixes. (Kai, Jarkko)  
>  
> V9:
> - Add comments for static variables outside functions. (Jarkko)
> - Remove unnecessary ifs. (Tim)
> - Add more Reviewed-By: tags from Jarkko and TJ.
>  
> V8:
> - Style fixes. (Jarkko)
> - Abstract _misc_res_free/alloc() (Jarkko)
> - Remove unneeded NULL checks. (Jarkko)
>  
> V7:
> - Split the large patch for the final EPC implementation, #10 in V6, into
>   smaller ones. (Dave, Kai)
> - Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
>   into two functions (Kai)
> - Removed patches to introduce the EPC page states, list for storing
>   candidate pages for reclamation. (not needed due to above changes)
> - Make ops one per resource type and store them in array (Michal)
> - Rename the ops struct to misc_res_ops, and enforce the constraints of
>   required callback functions (Jarkko)
> - Initialize epc cgroup in sgx driver init function. (Kai)
> - Moved addition of priv field to patch 4 where it was used first. (Jarkko)
> - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
> - Use a static for root cgroup (Kai)
>  
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/
> [4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
> [5]Documentation/arch/x86/sgx.rst, Section"Enclave Page Types"
> [6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
> [7]v2: https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
> [8]v3: https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang@linux.intel.com/
> [9]v4: https://lore.kernel.org/all/20230913040635.28815-1-haitao.huang@linux.intel.com/
> [10]v5: https://lore.kernel.org/all/20230923030657.16148-1-haitao.huang@linux.intel.com/
> [11]v6: https://lore.kernel.org/linux-sgx/20231030182013.40086-1-haitao.huang@linux.intel.com/
> [12]v7: https://lore.kernel.org/linux-sgx/20240122172048.11953-1-haitao.huang@linux.intel.com/T/#t
> [13]v8: https://lore.kernel.org/linux-sgx/20240130020938.10025-1-haitao.huang@linux.intel.com/T/#t
> [14]v9: https://lore.kernel.org/lkml/20240205210638.157741-1-haitao.huang@linux.intel.com/T/
> [15]v10: https://lore.kernel.org/linux-sgx/20240328002229.30264-1-haitao.huang@linux.intel.com/T/#t
> [16]v11: https://lore.kernel.org/lkml/20240410182558.41467-1-haitao.huang@linux.intel.com/
> [17]v12: https://lore.kernel.org/lkml/20240416032011.58578-1-haitao.huang@linux.intel.com/
> [18]v13: https://lore.kernel.org/lkml/20240430195108.5676-1-haitao.huang@linux.intel.com/ 
> [19]v14: https://lore.kernel.org/linux-sgx/20240531222630.4634-1-haitao.huang@linux.intel.com/T/#t
> [20]v15: https://lore.kernel.org/linux-sgx/20240617125321.36658-1-haitao.huang@linux.intel.com/T/#t
>
> Haitao Huang (7):
>   x86/sgx: Replace boolean parameters with enums
>   x86/sgx: Encapsulate uses of the global LRU
>   x86/sgx: Add basic EPC reclamation flow for cgroup
>   x86/sgx: Charge mem_cgroup for per-cgroup reclamation
>   x86/sgx: Revise global reclamation for EPC cgroups
>   x86/sgx: implement direct reclamation for cgroups
>   selftests/sgx: Add scripts for EPC cgroup testing
>
> Kristen Carlson Accardi (7):
>   cgroup/misc: Add per resource callbacks for CSS events
>   cgroup/misc: Export APIs for SGX driver
>   cgroup/misc: Add SGX EPC resource type
>   x86/sgx: Implement basic EPC misc cgroup functionality
>   x86/sgx: Abstract tracking reclaimable pages in LRU
>   x86/sgx: Implement async reclamation for cgroup
>   x86/sgx: Turn on per-cgroup EPC reclamation
>
> Sean Christopherson (2):
>   x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
>   Docs/x86/sgx: Add description for cgroup support
>
>  Documentation/arch/x86/sgx.rst                |  83 ++++
>  arch/x86/kernel/cpu/sgx/Makefile              |   1 +
>  arch/x86/kernel/cpu/sgx/encl.c                |  41 +-
>  arch/x86/kernel/cpu/sgx/encl.h                |   7 +-
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 438 ++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h          | 108 +++++
>  arch/x86/kernel/cpu/sgx/ioctl.c               |  10 +-
>  arch/x86/kernel/cpu/sgx/main.c                | 219 ++++++---
>  arch/x86/kernel/cpu/sgx/sgx.h                 |  54 ++-
>  arch/x86/kernel/cpu/sgx/virt.c                |   2 +-
>  include/linux/misc_cgroup.h                   |  41 ++
>  kernel/cgroup/misc.c                          | 113 ++++-
>  tools/testing/selftests/sgx/Makefile          |   3 +-
>  tools/testing/selftests/sgx/README            | 109 +++++
>  tools/testing/selftests/sgx/ash_cgexec.sh     |  16 +
>  tools/testing/selftests/sgx/config            |   4 +
>  .../selftests/sgx/run_epc_cg_selftests.sh     | 294 ++++++++++++
>  tools/testing/selftests/sgx/settings          |   2 +
>  .../selftests/sgx/watch_misc_for_tests.sh     |  11 +
>  19 files changed, 1444 insertions(+), 112 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>  create mode 100644 tools/testing/selftests/sgx/README
>  create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
>  create mode 100644 tools/testing/selftests/sgx/config
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100644 tools/testing/selftests/sgx/settings
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
>
> base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba

Personally I think that I merging would be an appropriate action to
take now. There's ever continuing stream of small glitches like for
any new code but I don't see anything that could not be tuned over
time.

I.e. it is something that we can understand and maintain.

BR, Jarkko

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2024-08-29  9:34 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-21  1:53 [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Haitao Huang
2024-08-21  1:53 ` [PATCH v16 01/16] x86/sgx: Replace boolean parameters with enums Haitao Huang
2024-08-21  1:53 ` [PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
2024-08-27 10:45   ` Huang, Kai
2024-08-29  9:25   ` Huang, Kai
2024-08-21  1:53 ` [PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver Haitao Huang
2024-08-27 10:25   ` Huang, Kai
2024-08-21  1:53 ` [PATCH v16 04/16] cgroup/misc: Add SGX EPC resource type Haitao Huang
2024-08-21  1:53 ` [PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality Haitao Huang
2024-08-27 10:21   ` Huang, Kai
2024-08-27 23:11   ` Huang, Kai
2024-08-28  0:01     ` Huang, Kai
2024-08-21  1:53 ` [PATCH v16 06/16] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list Haitao Huang
2024-08-21  1:53 ` [PATCH v16 07/16] x86/sgx: Abstract tracking reclaimable pages in LRU Haitao Huang
2024-08-21  1:53 ` [PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU Haitao Huang
2024-08-27 11:13   ` Huang, Kai
2024-08-27 23:58     ` Huang, Kai
2024-08-27 18:15   ` Jarkko Sakkinen
2024-08-21  1:53 ` [PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup Haitao Huang
2024-08-22  4:00   ` Haitao Huang
2024-08-27 18:15   ` Jarkko Sakkinen
2024-08-27 23:42   ` Huang, Kai
2024-08-21  1:53 ` [PATCH v16 10/16] x86/sgx: Implement async reclamation " Haitao Huang
2024-08-27 10:22   ` Huang, Kai
2024-08-27 23:46     ` Huang, Kai
2024-08-27 18:15   ` Jarkko Sakkinen
2024-08-21  1:53 ` [PATCH v16 11/16] x86/sgx: Charge mem_cgroup for per-cgroup reclamation Haitao Huang
2024-08-21  1:54 ` [PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups Haitao Huang
2024-08-27 11:32   ` Huang, Kai
2024-08-27 23:29     ` Huang, Kai
2024-08-27 18:16   ` Jarkko Sakkinen
2024-08-21  1:54 ` [PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups Haitao Huang
2024-08-27 18:16   ` Jarkko Sakkinen
2024-08-27 23:55   ` Huang, Kai
2024-08-29  9:34     ` Huang, Kai
2024-08-21  1:54 ` [PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation Haitao Huang
2024-08-27 23:25   ` Huang, Kai
2024-08-21  1:54 ` [PATCH v16 15/16] Docs/x86/sgx: Add description for cgroup support Haitao Huang
2024-08-21  1:54 ` [PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing Haitao Huang
2024-08-27 23:26   ` Huang, Kai
2024-08-26 10:57 ` [PATCH v16 00/16] Add Cgroup support for SGX EPC memory Mikko Ylinen
2024-08-27 18:18 ` Jarkko Sakkinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).