Re: [PATCH v15 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

public inbox for cgroups@vger.kernel.org
 help / color / mirror / Atom feed

From: "Huang, Kai" <kai.huang@intel.com>
To: "hpa@zytor.com" <hpa@zytor.com>,
	"tim.c.chen@linux.intel.com" <tim.c.chen@linux.intel.com>,
	"linux-sgx@vger.kernel.org" <linux-sgx@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"jarkko@kernel.org" <jarkko@kernel.org>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mkoutny@suse.com" <mkoutny@suse.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"haitao.huang@linux.intel.com" <haitao.huang@linux.intel.com>,
	"Mehta, Sohil" <sohil.mehta@intel.com>,
	"tj@kernel.org" <tj@kernel.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"bp@alien8.de" <bp@alien8.de>
Cc: "mikko.ylinen@linux.intel.com" <mikko.ylinen@linux.intel.com>,
	"seanjc@google.com" <seanjc@google.com>,
	"anakrish@microsoft.com" <anakrish@microsoft.com>,
	"Zhang, Bo" <zhanb@microsoft.com>,
	"kristen@linux.intel.com" <kristen@linux.intel.com>,
	"yangjie@microsoft.com" <yangjie@microsoft.com>,
	"Li, Zhiquan1" <zhiquan1.li@intel.com>,
	"chrisyan@microsoft.com" <chrisyan@microsoft.com>
Subject: Re: [PATCH v15 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
Date: Thu, 20 Jun 2024 13:28:57 +0000	[thread overview]
Message-ID: <ecaab6953b36adb278c98b01f5eb647ff0cc9aab.camel@intel.com> (raw)
In-Reply-To: <20240617125321.36658-9-haitao.huang@linux.intel.com>


On 18/06/2024 12:53 am, Huang, Haitao wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Currently in the EPC page allocation, the kernel simply fails the
> allocation when the current EPC cgroup fails to charge due to its usage
> reaching limit.  This is not ideal. When that happens, a better way is
> to reclaim EPC page(s) from the current EPC cgroup (and/or its
> descendants) to reduce its usage so the new allocation can succeed.
> 
> Add the basic building blocks to support per-cgroup reclamation.
> 
> Currently the kernel only has one place to reclaim EPC pages: the global
> EPC LRU list.  To support the "per-cgroup" EPC reclaim, maintain an LRU
> list for each EPC cgroup, and introduce a "cgroup" variant function to
> reclaim EPC pages from a given EPC cgroup and its descendants.
> 
> Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
> It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
> pages.  Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
> pages from the global LRU, and then tries to reclaim these pages at once
> for better performance.
> 
> Implement the "cgroup" variant EPC reclaim in a similar way, but keep
> the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
> as input, and return the pages that are "scanned" and attempted for
> reclamation (but not necessarily reclaimed successfully); 2) loop the
> given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
> until SGX_NR_TO_SCAN pages are "scanned". 
> 
> This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
> tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
> cgroup, and only moves to its descendants when there's no enough
> reclaimable EPC pages to "scan" in its LRU.  It should be enough for
> most cases. 

[...]

> In other cases, the caller may invoke this function in a
> loop to ensure enough pages reclaimed for its usage. To ensure all
> descendant groups scanned in a round-robin fashion in those cases,
> sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the
> next cgroup that the caller can pass in as the new starting cgroup for a
> subsequent call.


AFAICT this part is new, and I believe this "round-robin" thing is just 
for the "global reclaim"?  Or is it also for per-cgroup reclaim where more
than SGX_NR_TO_SCAN pages needs to be reclaimed?

I wish the changelog should just point out what consumers will use this
new sgx_cgroup_reclaim_pages(), like:

The sgx_cgroup_reclaim_pages() will be used in three cases:

 1) direct/sync per-cgroup reclaim in try_charge()
 2) indirect/async per-cgroup reclaim triggered in try_charge()
 3) global reclaim

And then describe how will they use sgx_cgroup_reclaim_pages():

Both 1) and 2) can result in needing to reclaim more than SGX_NR_TO_SCAN
pages, in which case we should <fill in how to reclaim>.

For 3), the new global reclaim should try tot match the existing global
reclaim behaviour, that is to try to treat all EPC pages equally. 
<continue to explain how can sgx_cgroup_reclaim_pages() achieve this.>

With above context, we can justify why to make sgx_cgroup_reclaim_pages()
in this form.

> 
> Note, this simple implementation doesn't _exactly_ mimic the current
> global EPC reclaim (which always tries to do the actual reclaim in batch
> of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
> reclaimable pages, the actual reclaim of EPC pages will be split into
> smaller batches _across_ multiple LRUs with each being smaller than
> SGX_NR_TO_SCAN pages.
> 
> A more precise way to mimic the current global EPC reclaim would be to
> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
> _across_ the given EPC cgroup _AND_ its descendants, and then do the
> actual reclaim in one batch.  But this is unnecessarily complicated at
> this stage.
> 
> Alternatively, the current sgx_reclaim_pages() could be changed to
> return the actual "reclaimed" pages, but not "scanned" pages. However,
> the reclamation is a lengthy process, forcing a successful reclamation
> of predetermined number of pages may block the caller for too long. And
> that may not be acceptable in some synchronous contexts, e.g., in
> serving an ioctl().
> 
> With this building block in place, add synchronous reclamation support
> in sgx_cgroup_try_charge(): trigger a call to
> sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
> caller allows synchronous reclaim as indicated by s newly added
> parameter.
> 
> A later patch will add support for asynchronous reclamation reusing
> sgx_cgroup_reclaim_pages().

It seems you also should mention the new global reclaim will also use 
this sgx_cgroup_reclaim_pages()?

[...]

> +/**
> + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
> + * @root:	The root of cgroup tree to reclaim from.
> + * @start:	The descendant cgroup from which to start the tree walking.
> + *
> + * This function performs a pre-order walk in the cgroup tree under the given
> + * root, starting from the node %start, or from the root if %start is NULL. The
> + * function will attempt to reclaim pages at each node until a fixed number of
> + * pages (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of
> + * success on the actual reclamation process. In extreme cases, if all pages in
> + * front of the LRUs are recently accessed, i.e., considered "too young" to
> + * reclaim, no page will actually be reclaimed after walking the whole tree.
> + *
> + * In some cases, a caller may want to ensure enough reclamation until its
> + * specific need is met. In those cases, the caller should invoke this function
> + * in a loop, and at each iteration passes in the same root and the next node
> + * returned from the previous call as the new %start.
> + *
> + * Return: The next misc cgroup in the subtree to continue the scanning and
> + * attempt for more reclamation from this subtree if needed.  
> 

[...]

> Caller must
> + * release the reference if the returned is not used as %start for a subsequent
> + * call.
> 

This sentence isn't clear to me.

First of all, release the reference "of what"?  The %start, or the one
returned by this function?

And is it because of ...

> + */
> +static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start)
> +{
> +	struct cgroup_subsys_state *css_root, *pos;
> +	struct cgroup_subsys_state *next = NULL;
> +	struct sgx_cgroup *sgx_cg;
> +	unsigned int cnt = 0;
> +
> +	 /* Caller must ensure css_root and start ref's acquired */

... the caller must acquire the ref of both @css_root and @css_start, and
...

> +	css_root = &root->css;
> +	if (start)
> +		pos = &start->css;
> +	else
> +		pos = css_root;
> +
> +	while (cnt < SGX_NR_TO_SCAN) {
> +		sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
> +		cnt += sgx_reclaim_pages(&sgx_cg->lru);
> +
> +		rcu_read_lock();
> +
> +		next = css_next_descendant_pre(pos, css_root);
> +
> +		if (pos != css_root)
> +			css_put(pos);

... the ref is decreased internally?

> +
> +		if (!next || !css_tryget(next)) {
> +			/* We are done if next is NULL or not safe to continue
> +			 * the walk if next is dead. Return NULL and the caller
> +			 * determines whether to restart from root.
> +			 */

Incorrect comment style.

> +			rcu_read_unlock();
> +			return NULL;
> +		}
> +
> +		rcu_read_unlock();
> +		pos = next;

There's no ref grab here, wouldn't the above ...

		if (pos != css_root)
			css_put(pos);

... decrease the ref w/o having it been increased?

> +	}
> +
> +	return css_misc(next);

Here AFAICT the ref isn't increased, but ...

[...]


> +/**
> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
>    * @sgx_cg:	The EPC cgroup to be charged for the page.
> + * @reclaim:	Whether or not synchronous EPC reclaim is allowed.
>    * Return:
>    * * %0 - If successfully charged.
>    * * -errno - for failures.
>    */
> -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
>   {
> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
> +	int ret;
> +	struct misc_cg *cg_next = NULL;
> +
> +	for (;;) {
> +		ret = __sgx_cgroup_try_charge(sgx_cg);
> +
> +		if (ret != -EBUSY)
> +			goto out;
> +
> +		if (reclaim == SGX_NO_RECLAIM) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next);
> +		cond_resched();
> +	}
> +
> +out:
> +	if (cg_next != sgx_cg->cg)
> +		put_misc_cg(cg_next);

... if I am reading correctly, here you does the put anyway.

> +	return ret;
>   }
>   

And when there are more than SGX_NR_TO_SCAN pages that need to reclaim,
the above ...

	for (;;) {
		cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next);
	}

... actually tries to reclaim those pages from @sgx_cg _AND_ it's
descendants, and tries to do it _EQUALLY_.

Is this desired, or should we always try to reclaim from the @sgx_cg
first, but only moves to the desendants when the @sgx_cg shouldn't be
reclaimed anymore?

Anyway, it's different from the previous behaviour.

[...]

> -static bool sgx_should_reclaim(unsigned long watermark)
> +static bool sgx_should_reclaim_global(unsigned long watermark)
>   {
>   	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
>   	       !list_empty(&sgx_global_lru.reclaimable);
>   }
>   
> +static void sgx_reclaim_pages_global(void)
> +{
> +	sgx_reclaim_pages(&sgx_global_lru);
> +}
> +
>   /*
>    * sgx_reclaim_direct() should be called (without enclave's mutex held)
>    * in locations where SGX memory resources might be low and might be
> @@ -394,8 +405,8 @@ static bool sgx_should_reclaim(unsigned long watermark)
>    */
>   void sgx_reclaim_direct(void)
>   {
> -	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_pages();
> +	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
> +		sgx_reclaim_pages_global();
>   }
>   

I wish the rename was mentioned in the changelog too.

next prev parent reply	other threads:[~2024-06-20 13:29 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-17 12:53 [PATCH v15 00/14] Add Cgroup support for SGX EPC memory Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 01/14] x86/sgx: Replace boolean parameters with enums Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 02/14] cgroup/misc: Add per resource callbacks for CSS events Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 03/14] cgroup/misc: Export APIs for SGX driver Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 04/14] cgroup/misc: Add SGX EPC resource type Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 05/14] x86/sgx: Implement basic EPC misc cgroup functionality Huang, Haitao
2024-06-18 11:31   ` Huang, Kai
2024-06-18 12:56     ` Haitao Huang
2024-06-18 23:15       ` Huang, Kai
2024-06-18 23:23         ` Haitao Huang
2024-06-19  2:00           ` Huang, Kai
2024-06-17 12:53 ` [PATCH v15 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup Huang, Haitao
2024-06-20 13:28   ` Huang, Kai [this message]
2024-06-20 16:05     ` Haitao Huang
2024-06-20 22:52       ` Huang, Kai
2024-06-17 12:53 ` [PATCH v15 09/14] x86/sgx: Abstract check for global reclaimable pages Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 10/14] x86/sgx: Implement async reclamation for cgroup Huang, Haitao
2024-06-20 15:39   ` Haitao Huang
2024-06-17 12:53 ` [PATCH v15 11/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 12/14] x86/sgx: Turn on per-cgroup EPC reclamation Huang, Haitao
2024-06-20 10:30   ` Huang, Kai
2024-06-20 15:06     ` Haitao Huang
2024-06-20 23:53       ` Huang, Kai
2024-06-17 12:53 ` [PATCH v15 13/14] Docs/x86/sgx: Add description for cgroup support Huang, Haitao
2024-06-17 12:53 ` [PATCH v15 14/14] selftests/sgx: Add scripts for EPC cgroup testing Huang, Haitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ecaab6953b36adb278c98b01f5eb647ff0cc9aab.camel@intel.com \
    --to=kai.huang@intel.com \
    --cc=anakrish@microsoft.com \
    --cc=bp@alien8.de \
    --cc=cgroups@vger.kernel.org \
    --cc=chrisyan@microsoft.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=haitao.huang@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=jarkko@kernel.org \
    --cc=kristen@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-sgx@vger.kernel.org \
    --cc=mikko.ylinen@linux.intel.com \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=seanjc@google.com \
    --cc=sohil.mehta@intel.com \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=tj@kernel.org \
    --cc=x86@kernel.org \
    --cc=yangjie@microsoft.com \
    --cc=zhanb@microsoft.com \
    --cc=zhiquan1.li@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox