Re: [RFC PATCH 0/4] SGX shmem backing store issue

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jarkko Sakkinen <jarkko@kernel.org>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: dave.hansen@linux.intel.com, linux-sgx@vger.kernel.org,
	haitao.huang@intel.com
Subject: Re: [RFC PATCH 0/4] SGX shmem backing store issue
Date: Wed, 4 May 2022 09:40:50 +0300	[thread overview]
Message-ID: <YnIf8vvtfqHxdaUP@kernel.org> (raw)
In-Reply-To: <cover.1651171455.git.reinette.chatre@intel.com>

On Thu, Apr 28, 2022 at 01:11:23PM -0700, Reinette Chatre wrote:
> Hi Everybody,
> 
> Haitao reported encountering the following WARN while stress testing SGX
> with the SGX2 series [1] applied:
> 
> ELDU returned 1073741837 (0x4000000d)
> WARNING: CPU: 72 PID: 24407 at arch/x86/kernel/cpu/sgx/encl.c:81 sgx_encl_eldu+0x3cf/0x400
> ...
> Call Trace:
> <TASK>
> ? xa_load+0x6e/0xa0
> __sgx_encl_load_page+0x3d/0x80
> sgx_encl_load_page_in_vma+0x4a/0x60
> sgx_vma_fault+0x7f/0x3b0
> ? sysvec_call_function_single+0x66/0xd0
> ? asm_sysvec_call_function_single+0x12/0x20
> __do_fault+0x39/0x110
> __handle_mm_fault+0x1222/0x15a0
> handle_mm_fault+0xdb/0x2c0
> do_user_addr_fault+0x1d1/0x650
> ? exit_to_user_mode_prepare+0x45/0x170
> exc_page_fault+0x76/0x170
> ? asm_exc_page_fault+0x8/0x30
> asm_exc_page_fault+0x1e/0x30
> ...
> </TASK>
> 
> ENCLS[ELDU] is returning a #GP when attempting to load an enclave
> page from the backing store into the enclave. This is likely because
> of a MAC verification failure.
> 
> Haitao's stress testing involves running two concurrent loops of the SGX
> selftests on a system with 4GB EPC memory. One of the tests is modified
> to reduce the oversubscription heap size:
> 
> diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
> index d480c2dd2858..12008789325b 100644
> --- a/tools/testing/selftests/sgx/main.c
> +++ b/tools/testing/selftests/sgx/main.c
> @@ -398,7 +398,7 @@ TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
>  	 * Create enclave with additional heap that is as big as all
>  	 * available physical SGX memory.
>  	 */
> -	total_mem = get_total_epc_mem();
> +	total_mem = get_total_epc_mem()/16;
>  	ASSERT_NE(total_mem, 0);
>  	TH_LOG("Creating an enclave with %lu bytes heap may take a while ...",
>  	       total_mem);
> 
> If the the test compiled with above snippet is renamed as "test_sgx_small"
> and the original renamed as "test_sgx_large" the two concurrent loops are
> run as follows:
> 
> (for i in $(seq 1 999); do echo "Iteration $i"; sudo ./test_sgx_large; done ) > log.large 2>&1
> (for i in $(seq 1 9999); do echo "Iteration $i"; sudo ./test_sgx_small; done ) > log.small 2>&1
> 
> If the SGX driver is modified to always WARN when ENCLS[ELDU] encounters a #GP
> (see below) then the WARN appears after a few iterations of "test_sgx_large"
> and shows up throughout the testing.
> 
> diff --git a/arch/x86/kernel/cpu/sgx/encls.h b/arch/x86/kernel/cpu/sgx/encls.h
> index 99004b02e2ed..68c1dbc84ed3 100644
> --- a/arch/x86/kernel/cpu/sgx/encls.h
> +++ b/arch/x86/kernel/cpu/sgx/encls.h
> @@ -18,7 +18,7 @@
>  #define ENCLS_WARN(r, name) {						  \
>  	do {								  \
>  		int _r = (r);						  \
> -		WARN_ONCE(_r, "%s returned %d (0x%x)\n", (name), _r, _r); \
> +		WARN(_r, "%s returned %d (0x%x)\n", (name), _r, _r); \
>  	} while (0);							  \
>  }
> 
> I focused on investigating this issue the past two weeks and was able to
> narrow down the cause but not yet fix the issue. Now I am out of ideas and
> would really appreciate if you could provide guidance on how to proceed.
> 
> Here is what I have learned so far:
> * Reverting commit 08999b2489b4 ("x86/sgx: Free backing memory after
>   faulting the enclave page") resolves the issue. With that commit
>   reverted the concurrent selftest loops can run to completion without
>   encountering any WARNs.
> 
> * The issue is also resolved if only the calls (introduced by commit
>   08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave
>   page") ) to sgx_encl_truncate_backing_page() within __sgx_encl_eldu()
>   are disabled.
> 
> * Two issues with commit 08999b2489b4 ("x86/sgx: Free backing memory
>   after faulting the enclave page") are fixed with patch 1 and 2 in
>   this series. These fixes do not resolve the WARN.
> 
> * There is a potential race between the reclaimer and the page fault
>   handler while interacting with the backing store. This is addressed
>   in patch 3, but it does not resolve the WARN.
> 
> * Lockdep did reveal a locking issue in SGX2 - the EAUG flow should not
>   do direct reclaim since the page fault handler already has mmap_lock
>   that the reclaimer will attempt to obtain. This will be fixed in the
>   next version of SGX2 but that fix does not resolve the WARN.
> 
> * Since ENCLS[ELDU] returns #GP on MAC verification failure a possible
>   cause could be if a problem occurs while loading the page from
>   backing store.
>   sgx_encl_get_backing_page() is used to both allocate backing storage
>   in the ENCLS[EWB] flow as well as to lookup backing storage in the
>   ENCLS[ELDU] flow.
>   As part of the investigation the interface was split to ensure that
>   backing storage is only allocated during the ENCLS[EWB] flow and only
>   lookup occurs during the ENCLS[ELDU] flow. This can be found in
>   patch 4.
>   This change revealed that there are cases during ENCLS[ELDU] flow
>   when the backing page is not present - attempting to lookup a
>   backing page returns -ENOENT. Before patch 4 a new backing page
>   would be allocated and thus trigger a #GP when ENCLS[ELDU] is
>   attempted.
>   This change is not a fix, the code path just fails earlier now, but
>   it reveals a cause of the WARNs encountered (MAC verification will
>   fail on a newly allocated page) but not why the pages cannot be found
>   in the backing store. With this change the number of WARNs are
>   reduced significantly but not completely. Even with this patch
>   applied the WARN was encountered once while running the stress
>   selftests.
> 
> Could you please advise on how to proceed? Do you perhaps have ideas
> why the backing store could not contain a page when it is loaded back
> during the ENCLS[ELDU] flow?
> There is some interesting text in the comments within shmem_fault()
> that hints that faulting pages should be avoided into holes as they
> are being punched ... but I do not understand those details and
> would really appreciate guidance on how to understand what is going on
> here.
> 
> These SGX2 tests exercise the concurrent operation of reclaimer and page
> fault handler and have a good track record of locating issues. Thanks
> to it we already have:
> 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
> ac5d272a0ad0 ("x86/sgx: Fix free page accounting")
> While this issue is triggered by the SGX2 tests I cannot find a connection
> with the SGX2 code paths. I have made a few attempts to create an
> equivalent test for SGX1 but have not yet had success to create a SGX1
> test that can trigger this issue.
> 
> These patches are based on v5.18-rc4 with the SGX2 series applied but
> do not touch SGX2 code paths. Patch 1,2, and 3 also apply cleanly to
> v5.18-rc4 while patch 4 has a conflict in arch/x86/kernel/cpu/sgx/encl.h
> due to the SGX2 declarations.
> 
> Your feedback will be greatly appreciated.
> Thank you very much
> 
> Reinette

Hi, sorry for the response lag. I was on sick leave for the 2nd half of
last week. Looking into this.

BR, Jarkko

next prev parent reply	other threads:[~2022-05-04  6:42 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-28 20:11 [RFC PATCH 0/4] SGX shmem backing store issue Reinette Chatre
2022-04-28 20:11 ` [RFC PATCH 1/4] x86/sgx: Do not free backing memory on ENCLS[ELDU] failure Reinette Chatre
2022-04-28 21:30   ` Dave Hansen
2022-04-28 22:20     ` Reinette Chatre
2022-04-28 22:53       ` Dave Hansen
2022-04-28 23:49         ` Reinette Chatre
2022-05-03  2:01           ` Kai Huang
2022-05-07 17:25           ` Jarkko Sakkinen
2022-05-09 17:17             ` Reinette Chatre
2022-05-10  0:36               ` Kai Huang
2022-05-11 10:26                 ` Jarkko Sakkinen
2022-05-11 18:29                   ` Haitao Huang
2022-05-11 22:00                     ` Kai Huang
2022-05-12 21:14                     ` Jarkko Sakkinen
2022-05-06 22:09     ` Jarkko Sakkinen
2022-04-28 20:11 ` [RFC PATCH 2/4] x86/sgx: Set dirty bit after modifying page contents Reinette Chatre
2022-04-28 21:40   ` Dave Hansen
2022-04-28 22:41     ` Reinette Chatre
2022-05-06 22:27   ` Jarkko Sakkinen
2022-05-06 22:40     ` Reinette Chatre
2022-05-07 18:01       ` Jarkko Sakkinen
2022-04-28 20:11 ` [RFC PATCH 3/4] x86/sgx: Obtain backing storage page with enclave mutex held Reinette Chatre
2022-04-28 21:58   ` Dave Hansen
2022-04-28 22:44     ` Reinette Chatre
2022-05-06 22:43   ` Jarkko Sakkinen
2022-04-28 20:11 ` [RFC PATCH 4/4] x86/sgx: Do not allocate backing pages when loading from backing store Reinette Chatre
2022-04-28 21:12 ` [RFC PATCH 0/4] SGX shmem backing store issue Dave Hansen
2022-04-29 18:50   ` Reinette Chatre
2022-04-29 19:45     ` Dave Hansen
2022-04-30  3:22       ` Reinette Chatre
2022-04-30 15:52         ` Reinette Chatre
2022-05-02 14:36         ` Dave Hansen
2022-05-02 17:11           ` Reinette Chatre
2022-05-02 21:33             ` Dave Hansen
2022-05-04 22:13               ` Reinette Chatre
2022-05-04 22:58                 ` Dave Hansen
2022-05-04 23:36                   ` Reinette Chatre
2022-05-04 23:50                     ` Dave Hansen
2022-05-05  0:08                       ` Reinette Chatre
2022-05-04 23:05                 ` Dave Hansen
2022-05-07 17:46               ` Jarkko Sakkinen
2022-05-07 17:48                 ` Jarkko Sakkinen
2022-05-09 17:09                   ` Reinette Chatre
2022-05-10 22:28                     ` Jarkko Sakkinen
2022-05-11 17:23                       ` Reinette Chatre
2022-05-12 14:10                         ` Jarkko Sakkinen
2022-04-28 21:29 ` Dave Hansen
2022-04-28 22:20   ` Reinette Chatre
2022-05-04  6:40 ` Jarkko Sakkinen [this message]
2022-05-05  6:09 ` Jarkko Sakkinen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YnIf8vvtfqHxdaUP@kernel.org \
    --to=jarkko@kernel.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=haitao.huang@intel.com \
    --cc=linux-sgx@vger.kernel.org \
    --cc=reinette.chatre@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.