* [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows
@ 2024-07-05 7:45 Dmitrii Kuvaiskii
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-07-05 7:45 UTC (permalink / raw)
To: dave.hansen, jarkko, kai.huang, haitao.huang, reinette.chatre,
linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin
SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
EDMM-based lazy allocation and MADV_DONTNEED semantics provide
significant performance improvement for some workloads that run on
Gramine. For example, a Java workload with a 16GB enclave size has
approx. 57x improvement in total runtime. Thus, we consider it important
to permit these optimizations in Gramine. However, we observed hangs of
applications (Node.js, PyTorch, R, iperf, Blender, Nginx) when run on
Gramine with EDMM, lazy allocation and MADV_DONTNEED features enabled.
We wrote a trivial stress test to reproduce the hangs observed in
real-world applications. The test stresses #PF-based page allocation and
SGX_IOC_ENCLAVE_REMOVE_PAGES flows in the SGX driver:
/* repeatedly touch different enclave pages at random and mix with
* madvise(MADV_DONTNEED) to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
size_t num_pages = 0xA000 / page_size;
for (int i = 0; i < 5000; i++) {
size_t page = get_random_ulong() % num_pages;
char data = READ_ONCE(((char*)arg)[page * page_size]);
page = get_random_ulong() % num_pages;
madvise(arg + page * page_size, page_size, MADV_DONTNEED);
}
}
addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
pthread_create(&threads[i], NULL, thread_func, addr);
This test uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.
I performed several stress tests to verify that there are no other data
races (at least with the test program above):
- On Icelake server with 128GB of PRM, without madvise(). This stresses
the first data race. A Gramine SGX test suite running in the
background for additional stressing. Result: 1,000 runs without hangs
(result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRM, with madvise(). This stresses the
second data race. A Gramine SGX test suite running in the background
for additional stressing. Result: 1,000 runs without hangs (result
with the first bug fix but without the second bug fix: hangs approx.
once in 50 runs).
- On Icelake server with 4GB of PRM, with madvise(). This additionally
stresses the enclave page swapping flows. Two Gramine SGX test suites
running in the background for additional stressing of swapping (I
observe 100% CPU utilization from ksgxd which confirms that swapping
happens). Result: 1,000 runs without hangs.
v3 -> v4:
- Added a preparatory patch to split the SGX_ENCL_PAGE_BEING_RECLAIMED
flag into two: SGX_ENCL_PAGE_BUSY and SGX_ENCL_PAGE_PCMD_BUSY
(split suggested by Dave Hansen [2])
- No changes in the second patch (that fixes the first bug)
- Trivial changes in the third patch (that fixes the second bug), now
that we have a preparatory patch; plus expanded a comment (as
suggested by Dave Hansen)
v2 -> v3:
- No changes in code itself
- Improved commit message of the first patch (text suggested by Dave
Hansen); kept the CPU1 vs CPU2 diagram (as all reviewers liked it)
- No changes in the commit message of the second patch
v1 -> v2:
- No changes in code itself
- Expanded cover letter
- Added CPU1 vs CPU2 race scenarios in commit messages
[1] https://github.com/gramineproject/gramine/pull/1513
[2] https://lore.kernel.org/all/1d405428-3847-4862-b146-dd57711c881e@intel.com/
v1: https://lore.kernel.org/all/20240429104330.3636113-3-dmitrii.kuvaiskii@intel.com/
v2: https://lore.kernel.org/all/20240515131240.1304824-1-dmitrii.kuvaiskii@intel.com/
v3: https://lore.kernel.org/all/20240517110631.3441817-1-dmitrii.kuvaiskii@intel.com/
Dmitrii Kuvaiskii (3):
x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
x86/sgx: Resolve EREMOVE page vs EAUG page data race
arch/x86/kernel/cpu/sgx/encl.c | 23 ++++++++++++-----------
arch/x86/kernel/cpu/sgx/encl.h | 10 ++++++++--
arch/x86/kernel/cpu/sgx/ioctl.c | 7 +++++++
arch/x86/kernel/cpu/sgx/main.c | 4 ++--
4 files changed, 29 insertions(+), 15 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-07-05 7:45 [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Dmitrii Kuvaiskii
@ 2024-07-05 7:45 ` Dmitrii Kuvaiskii
2024-07-10 15:15 ` Haitao Huang
` (3 more replies)
2024-07-05 7:45 ` [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
` (2 subsequent siblings)
3 siblings, 4 replies; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-07-05 7:45 UTC (permalink / raw)
To: dave.hansen, jarkko, kai.huang, haitao.huang, reinette.chatre,
linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable
SGX_ENCL_PAGE_BEING_RECLAIMED flag is set when the enclave page is being
reclaimed (moved to the backing store). This flag however has two
logical meanings:
1. Don't attempt to load the enclave page (the page is busy).
2. Don't attempt to remove the PCMD page corresponding to this enclave
page (the PCMD page is busy).
To reflect these two meanings, split SGX_ENCL_PAGE_BEING_RECLAIMED into
two flags: SGX_ENCL_PAGE_BUSY and SGX_ENCL_PAGE_PCMD_BUSY. Currently,
both flags are set only when the enclave page is being reclaimed. A
future commit will introduce a new case when the enclave page is being
removed; this new case will set only the SGX_ENCL_PAGE_BUSY flag.
Cc: stable@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
---
arch/x86/kernel/cpu/sgx/encl.c | 16 +++++++---------
arch/x86/kernel/cpu/sgx/encl.h | 10 ++++++++--
arch/x86/kernel/cpu/sgx/main.c | 4 ++--
3 files changed, 17 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..c0a3c00284c8 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -46,10 +46,10 @@ static int sgx_encl_lookup_backing(struct sgx_encl *encl, unsigned long page_ind
* a check if an enclave page sharing the PCMD page is in the process of being
* reclaimed.
*
- * The reclaimer sets the SGX_ENCL_PAGE_BEING_RECLAIMED flag when it
- * intends to reclaim that enclave page - it means that the PCMD page
- * associated with that enclave page is about to get some data and thus
- * even if the PCMD page is empty, it should not be truncated.
+ * The reclaimer sets the SGX_ENCL_PAGE_PCMD_BUSY flag when it intends to
+ * reclaim that enclave page - it means that the PCMD page associated with that
+ * enclave page is about to get some data and thus even if the PCMD page is
+ * empty, it should not be truncated.
*
* Context: Enclave mutex (&sgx_encl->lock) must be held.
* Return: 1 if the reclaimer is about to write to the PCMD page
@@ -77,8 +77,7 @@ static int reclaimer_writing_to_pcmd(struct sgx_encl *encl,
* Stop when reaching the SECS page - it does not
* have a page_array entry and its reclaim is
* started and completed with enclave mutex held so
- * it does not use the SGX_ENCL_PAGE_BEING_RECLAIMED
- * flag.
+ * it does not use the SGX_ENCL_PAGE_PCMD_BUSY flag.
*/
if (addr == encl->base + encl->size)
break;
@@ -91,8 +90,7 @@ static int reclaimer_writing_to_pcmd(struct sgx_encl *encl,
* VA page slot ID uses same bit as the flag so it is important
* to ensure that the page is not already in backing store.
*/
- if (entry->epc_page &&
- (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)) {
+ if (entry->epc_page && (entry->desc & SGX_ENCL_PAGE_PCMD_BUSY)) {
reclaimed = 1;
break;
}
@@ -257,7 +255,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
/* Entry successfully located. */
if (entry->epc_page) {
- if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+ if (entry->desc & SGX_ENCL_PAGE_BUSY)
return ERR_PTR(-EBUSY);
return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..11b09899cd92 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -22,8 +22,14 @@
/* 'desc' bits holding the offset in the VA (version array) page. */
#define SGX_ENCL_PAGE_VA_OFFSET_MASK GENMASK_ULL(11, 3)
-/* 'desc' bit marking that the page is being reclaimed. */
-#define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)
+/* 'desc' bit indicating that the page is busy (e.g. being reclaimed). */
+#define SGX_ENCL_PAGE_BUSY BIT(2)
+
+/*
+ * 'desc' bit indicating that PCMD page associated with the enclave page is
+ * busy (e.g. because the enclave page is being reclaimed).
+ */
+#define SGX_ENCL_PAGE_PCMD_BUSY BIT(3)
struct sgx_encl_page {
unsigned long desc;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..e94b09c43673 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -204,7 +204,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
void *va_slot;
int ret;
- encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
+ encl_page->desc &= ~(SGX_ENCL_PAGE_BUSY | SGX_ENCL_PAGE_PCMD_BUSY);
va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
list);
@@ -340,7 +340,7 @@ static void sgx_reclaim_pages(void)
goto skip;
}
- encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
+ encl_page->desc |= SGX_ENCL_PAGE_BUSY | SGX_ENCL_PAGE_PCMD_BUSY;
mutex_unlock(&encl_page->encl->lock);
continue;
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
2024-07-05 7:45 [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Dmitrii Kuvaiskii
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
@ 2024-07-05 7:45 ` Dmitrii Kuvaiskii
2024-07-17 10:38 ` Jarkko Sakkinen
2024-07-25 0:52 ` Huang, Kai
2024-07-05 7:45 ` [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
2024-07-05 9:03 ` [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Jarkko Sakkinen
3 siblings, 2 replies; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-07-05 7:45 UTC (permalink / raw)
To: dave.hansen, jarkko, kai.huang, haitao.huang, reinette.chatre,
linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka
Imagine an mmap()'d file. Two threads touch the same address at the same
time and fault. Both allocate a physical page and race to install a PTE
for that page. Only one will win the race. The loser frees its page, but
still continues handling the fault as a success and returns
VM_FAULT_NOPAGE from the fault handler.
The same race can happen with SGX. But there's a bug: the loser in the
SGX steers into a failure path. The loser EREMOVE's the winner's EPC
page, then returns SIGBUS, likely killing the app.
Fix the SGX loser's behavior. Change the return code to VM_FAULT_NOPAGE
to avoid SIGBUS and call sgx_free_epc_page() which avoids EREMOVE'ing
the winner's page and only frees the page that the loser allocated.
The race can be illustrated as follows:
/* /*
* Fault on CPU1 * Fault on CPU2
* on enclave page X * on enclave page X
*/ */
sgx_vma_fault() { sgx_vma_fault() {
xa_load(&encl->page_array) xa_load(&encl->page_array)
== NULL --> == NULL -->
sgx_encl_eaug_page() { sgx_encl_eaug_page() {
... ...
/* /*
* alloc encl_page * alloc encl_page
*/ */
mutex_lock(&encl->lock);
/*
* alloc EPC page
*/
epc_page = sgx_alloc_epc_page(...);
/*
* add page to enclave's xarray
*/
xa_insert(&encl->page_array, ...);
/*
* add page to enclave via EAUG
* (page is in pending state)
*/
/*
* add PTE entry
*/
vmf_insert_pfn(...);
mutex_unlock(&encl->lock);
return VM_FAULT_NOPAGE;
}
}
/*
* All good up to here: enclave page
* successfully added to enclave,
* ready for EACCEPT from user space
*/
mutex_lock(&encl->lock);
/*
* alloc EPC page
*/
epc_page = sgx_alloc_epc_page(...);
/*
* add page to enclave's xarray,
* this fails with -EBUSY as this
* page was already added by CPU2
*/
xa_insert(&encl->page_array, ...);
err_out_shrink:
sgx_encl_free_epc_page(epc_page) {
/*
* remove page via EREMOVE
*
* *BUG*: page added by CPU2 is
* yanked from enclave while it
* remains accessible from OS
* perspective (PTE installed)
*/
/*
* free EPC page
*/
sgx_free_epc_page(epc_page);
}
mutex_unlock(&encl->lock);
/*
* *BUG*: SIGBUS is returned
* for a valid enclave page
*/
return VM_FAULT_SIGBUS;
}
}
Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Cc: stable@vger.kernel.org
Reported-by: Marcelina Kościelnicka <mwk@invisiblethingslab.com>
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index c0a3c00284c8..9f7f9e57cdeb 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -380,8 +380,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
* If ret == -EBUSY then page was created in another flow while
* running without encl->lock
*/
- if (ret)
+ if (ret) {
+ if (ret == -EBUSY)
+ vmret = VM_FAULT_NOPAGE;
goto err_out_shrink;
+ }
pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -417,7 +420,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
err_out_shrink:
sgx_encl_shrink(encl, va_page);
err_out_epc:
- sgx_encl_free_epc_page(epc_page);
+ sgx_free_epc_page(epc_page);
err_out_unlock:
mutex_unlock(&encl->lock);
kfree(encl_page);
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-07-05 7:45 [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Dmitrii Kuvaiskii
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
2024-07-05 7:45 ` [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
@ 2024-07-05 7:45 ` Dmitrii Kuvaiskii
2024-07-10 15:16 ` Haitao Huang
` (2 more replies)
2024-07-05 9:03 ` [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Jarkko Sakkinen
3 siblings, 3 replies; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-07-05 7:45 UTC (permalink / raw)
To: dave.hansen, jarkko, kai.huang, haitao.huang, reinette.chatre,
linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable
Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and MADV_DONTNEED semantics). Consider some enclave page added to the
enclave. User space decides to temporarily remove this page (e.g.,
emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
space performs a memory access on the same page on CPU2, which results
in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
follows:
/*
* CPU1: User space performs
* ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
* on enclave page X
*/
sgx_encl_remove_pages() {
mutex_lock(&encl->lock);
entry = sgx_encl_load_page(encl);
/*
* verify that page is
* trimmed and accepted
*/
mutex_unlock(&encl->lock);
/*
* remove PTE entry; cannot
* be performed under lock
*/
sgx_zap_enclave_ptes(encl);
/*
* Fault on CPU2 on same page X
*/
sgx_vma_fault() {
/*
* PTE entry was removed, but the
* page is still in enclave's xarray
*/
xa_load(&encl->page_array) != NULL ->
/*
* SGX driver thinks that this page
* was swapped out and loads it
*/
mutex_lock(&encl->lock);
/*
* this is effectively a no-op
*/
entry = sgx_encl_load_page_in_vma();
/*
* add PTE entry
*
* *BUG*: a PTE is installed for a
* page in process of being removed
*/
vmf_insert_pfn(...);
mutex_unlock(&encl->lock);
return VM_FAULT_NOPAGE;
}
/*
* continue with page removal
*/
mutex_lock(&encl->lock);
sgx_encl_free_epc_page(epc_page) {
/*
* remove page via EREMOVE
*/
/*
* free EPC page
*/
sgx_free_epc_page(epc_page);
}
xa_erase(&encl->page_array);
mutex_unlock(&encl->lock);
}
Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
same page. This enclave page becomes perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed, and any subsequent
access to this page raises a fault: with the kernel believing there to
be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
path do_user_addr_fault() -> access_error() causes the SGX driver's
sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
The userspace SIGSEGV handler cannot perform EACCEPT because the page
was not EAUGed. Thus, the user space is stuck with the inaccessible
page.
Fix this race by forcing the fault handler on CPU2 to back off if the
page is currently being removed (on CPU1). This is achieved by
setting SGX_ENCL_PAGE_BUSY flag right-before the first mutex_unlock() in
sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
page is busy, and if yes then CPU2 backs off and waits until the page is
completely removed. After that, any memory access to this page results
in a normal "allocate and EAUG a page on #PF" flow.
Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
---
arch/x86/kernel/cpu/sgx/ioctl.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 5d390df21440..02441883401d 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -1141,7 +1141,14 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
/*
* Do not keep encl->lock because of dependency on
* mmap_lock acquired in sgx_zap_enclave_ptes().
+ *
+ * Releasing encl->lock leads to a data race: while CPU1
+ * performs sgx_zap_enclave_ptes() and removes the PTE entry
+ * for the enclave page, CPU2 may attempt to load this page
+ * (because the page is still in enclave's xarray). To prevent
+ * CPU2 from loading the page, mark the page as busy.
*/
+ entry->desc |= SGX_ENCL_PAGE_BUSY;
mutex_unlock(&encl->lock);
sgx_zap_enclave_ptes(encl, addr);
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows
2024-07-05 7:45 [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Dmitrii Kuvaiskii
` (2 preceding siblings ...)
2024-07-05 7:45 ` [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
@ 2024-07-05 9:03 ` Jarkko Sakkinen
3 siblings, 0 replies; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-07-05 9:03 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, jarkko, kai.huang, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin
On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> enclave pages and may support MADV_DONTNEED semantics [1]. The former
> implies #PF-based page allocation, and the latter implies the usage of
> SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
>
> EDMM-based lazy allocation and MADV_DONTNEED semantics provide
> significant performance improvement for some workloads that run on
> Gramine. For example, a Java workload with a 16GB enclave size has
> approx. 57x improvement in total runtime. Thus, we consider it important
> to permit these optimizations in Gramine. However, we observed hangs of
> applications (Node.js, PyTorch, R, iperf, Blender, Nginx) when run on
> Gramine with EDMM, lazy allocation and MADV_DONTNEED features enabled.
>
> We wrote a trivial stress test to reproduce the hangs observed in
> real-world applications. The test stresses #PF-based page allocation and
> SGX_IOC_ENCLAVE_REMOVE_PAGES flows in the SGX driver:
>
> /* repeatedly touch different enclave pages at random and mix with
> * madvise(MADV_DONTNEED) to stress EAUG/EREMOVE flows */
> static void* thread_func(void* arg) {
> size_t num_pages = 0xA000 / page_size;
> for (int i = 0; i < 5000; i++) {
> size_t page = get_random_ulong() % num_pages;
> char data = READ_ONCE(((char*)arg)[page * page_size]);
>
> page = get_random_ulong() % num_pages;
> madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> }
> }
>
> addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> pthread_t threads[16];
> for (int i = 0; i < 16; i++)
> pthread_create(&threads[i], NULL, thread_func, addr);
>
> This test uncovers two data races in the SGX driver. The remaining
> patches describe and fix these races.
>
> I performed several stress tests to verify that there are no other data
> races (at least with the test program above):
>
> - On Icelake server with 128GB of PRM, without madvise(). This stresses
> the first data race. A Gramine SGX test suite running in the
> background for additional stressing. Result: 1,000 runs without hangs
> (result without the first bug fix: hangs every time).
> - On Icelake server with 128GB of PRM, with madvise(). This stresses the
> second data race. A Gramine SGX test suite running in the background
> for additional stressing. Result: 1,000 runs without hangs (result
> with the first bug fix but without the second bug fix: hangs approx.
> once in 50 runs).
> - On Icelake server with 4GB of PRM, with madvise(). This additionally
> stresses the enclave page swapping flows. Two Gramine SGX test suites
> running in the background for additional stressing of swapping (I
> observe 100% CPU utilization from ksgxd which confirms that swapping
> happens). Result: 1,000 runs without hangs.
>
> v3 -> v4:
> - Added a preparatory patch to split the SGX_ENCL_PAGE_BEING_RECLAIMED
> flag into two: SGX_ENCL_PAGE_BUSY and SGX_ENCL_PAGE_PCMD_BUSY
> (split suggested by Dave Hansen [2])
> - No changes in the second patch (that fixes the first bug)
> - Trivial changes in the third patch (that fixes the second bug), now
> that we have a preparatory patch; plus expanded a comment (as
> suggested by Dave Hansen)
>
> v2 -> v3:
> - No changes in code itself
> - Improved commit message of the first patch (text suggested by Dave
> Hansen); kept the CPU1 vs CPU2 diagram (as all reviewers liked it)
> - No changes in the commit message of the second patch
>
> v1 -> v2:
> - No changes in code itself
> - Expanded cover letter
> - Added CPU1 vs CPU2 race scenarios in commit messages
>
> [1] https://github.com/gramineproject/gramine/pull/1513
> [2] https://lore.kernel.org/all/1d405428-3847-4862-b146-dd57711c881e@intel.com/
>
> v1: https://lore.kernel.org/all/20240429104330.3636113-3-dmitrii.kuvaiskii@intel.com/
> v2: https://lore.kernel.org/all/20240515131240.1304824-1-dmitrii.kuvaiskii@intel.com/
> v3: https://lore.kernel.org/all/20240517110631.3441817-1-dmitrii.kuvaiskii@intel.com/
>
> Dmitrii Kuvaiskii (3):
> x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
> x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
> x86/sgx: Resolve EREMOVE page vs EAUG page data race
>
> arch/x86/kernel/cpu/sgx/encl.c | 23 ++++++++++++-----------
> arch/x86/kernel/cpu/sgx/encl.h | 10 ++++++++--
> arch/x86/kernel/cpu/sgx/ioctl.c | 7 +++++++
> arch/x86/kernel/cpu/sgx/main.c | 4 ++--
> 4 files changed, 29 insertions(+), 15 deletions(-)
Piling this to my TODO folder, on holiday up until week 31.
Happened to see this as I'm doing some last bit release stuff for 6.10
(unrelated to SGX).
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
@ 2024-07-10 15:15 ` Haitao Huang
2024-07-17 10:36 ` Jarkko Sakkinen
` (2 subsequent siblings)
3 siblings, 0 replies; 27+ messages in thread
From: Haitao Huang @ 2024-07-10 15:15 UTC (permalink / raw)
To: dave.hansen, jarkko, kai.huang, reinette.chatre, linux-sgx,
linux-kernel, Dmitrii Kuvaiskii
Cc: mona.vij, kailun.qin, stable
On Fri, 05 Jul 2024 02:45:22 -0500, Dmitrii Kuvaiskii
<dmitrii.kuvaiskii@intel.com> wrote:
> SGX_ENCL_PAGE_BEING_RECLAIMED flag is set when the enclave page is being
> reclaimed (moved to the backing store). This flag however has two
> logical meanings:
>
> 1. Don't attempt to load the enclave page (the page is busy).
> 2. Don't attempt to remove the PCMD page corresponding to this enclave
> page (the PCMD page is busy).
>
> To reflect these two meanings, split SGX_ENCL_PAGE_BEING_RECLAIMED into
> two flags: SGX_ENCL_PAGE_BUSY and SGX_ENCL_PAGE_PCMD_BUSY. Currently,
> both flags are set only when the enclave page is being reclaimed. A
> future commit will introduce a new case when the enclave page is being
> removed; this new case will set only the SGX_ENCL_PAGE_BUSY flag.
>
LGTM.
Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>
Thanks
Haitao
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-07-05 7:45 ` [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
@ 2024-07-10 15:16 ` Haitao Huang
2024-07-17 10:38 ` Jarkko Sakkinen
2024-07-25 1:21 ` Huang, Kai
2 siblings, 0 replies; 27+ messages in thread
From: Haitao Huang @ 2024-07-10 15:16 UTC (permalink / raw)
To: dave.hansen, jarkko, kai.huang, reinette.chatre, linux-sgx,
linux-kernel, Dmitrii Kuvaiskii
Cc: mona.vij, kailun.qin, stable
On Fri, 05 Jul 2024 02:45:24 -0500, Dmitrii Kuvaiskii
<dmitrii.kuvaiskii@intel.com> wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and MADV_DONTNEED semantics). Consider some enclave page added to the
> enclave. User space decides to temporarily remove this page (e.g.,
> emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> space performs a memory access on the same page on CPU2, which results
> in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
> follows:
>
> /*
> * CPU1: User space performs
> * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
> * on enclave page X
> */
> sgx_encl_remove_pages() {
>
> mutex_lock(&encl->lock);
>
> entry = sgx_encl_load_page(encl);
> /*
> * verify that page is
> * trimmed and accepted
> */
>
> mutex_unlock(&encl->lock);
>
> /*
> * remove PTE entry; cannot
> * be performed under lock
> */
> sgx_zap_enclave_ptes(encl);
> /*
> * Fault on CPU2 on same page X
> */
> sgx_vma_fault() {
> /*
> * PTE entry was removed, but the
> * page is still in enclave's xarray
> */
> xa_load(&encl->page_array) != NULL ->
> /*
> * SGX driver thinks that this page
> * was swapped out and loads it
> */
> mutex_lock(&encl->lock);
> /*
> * this is effectively a no-op
> */
> entry = sgx_encl_load_page_in_vma();
> /*
> * add PTE entry
> *
> * *BUG*: a PTE is installed for a
> * page in process of being removed
> */
> vmf_insert_pfn(...);
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_NOPAGE;
> }
> /*
> * continue with page removal
> */
> mutex_lock(&encl->lock);
>
> sgx_encl_free_epc_page(epc_page) {
> /*
> * remove page via EREMOVE
> */
> /*
> * free EPC page
> */
> sgx_free_epc_page(epc_page);
> }
>
> xa_erase(&encl->page_array);
>
> mutex_unlock(&encl->lock);
> }
>
> Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page.
>
> Fix this race by forcing the fault handler on CPU2 to back off if the
> page is currently being removed (on CPU1). This is achieved by
> setting SGX_ENCL_PAGE_BUSY flag right-before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is busy, and if yes then CPU2 backs off and waits until the page is
> completely removed. After that, any memory access to this page results
> in a normal "allocate and EAUG a page on #PF" flow.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
> arch/x86/kernel/cpu/sgx/ioctl.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c
> b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 5d390df21440..02441883401d 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1141,7 +1141,14 @@ static long sgx_encl_remove_pages(struct sgx_encl
> *encl,
> /*
> * Do not keep encl->lock because of dependency on
> * mmap_lock acquired in sgx_zap_enclave_ptes().
> + *
> + * Releasing encl->lock leads to a data race: while CPU1
> + * performs sgx_zap_enclave_ptes() and removes the PTE entry
> + * for the enclave page, CPU2 may attempt to load this page
> + * (because the page is still in enclave's xarray). To prevent
> + * CPU2 from loading the page, mark the page as busy.
> */
> + entry->desc |= SGX_ENCL_PAGE_BUSY;
> mutex_unlock(&encl->lock);
> sgx_zap_enclave_ptes(encl, addr);
Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>
Thanks
Haitao
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
2024-07-10 15:15 ` Haitao Huang
@ 2024-07-17 10:36 ` Jarkko Sakkinen
2024-08-12 8:12 ` Dmitrii Kuvaiskii
2024-07-17 10:37 ` Jarkko Sakkinen
2024-07-25 2:00 ` Huang, Kai
3 siblings, 1 reply; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-07-17 10:36 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, kai.huang, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable
On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> SGX_ENCL_PAGE_BEING_RECLAIMED flag is set when the enclave page is being
> reclaimed (moved to the backing store). This flag however has two
> logical meanings:
~~~~~~~~
side-effects
Missing the actor.
"The page reclaimer thread sets SGX_ENC_PAGE_BEING_RECLAIMED flag..."
It is not just randomly "set".
>
> 1. Don't attempt to load the enclave page (the page is busy).
Please point out where in the source code.
> 2. Don't attempt to remove the PCMD page corresponding to this enclave
> page (the PCMD page is busy).
Please point out where in the source code.
These would be great reference when looking back.
>
> To reflect these two meanings, split SGX_ENCL_PAGE_BEING_RECLAIMED into
I don't care about meanings. Only who does and what.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
2024-07-10 15:15 ` Haitao Huang
2024-07-17 10:36 ` Jarkko Sakkinen
@ 2024-07-17 10:37 ` Jarkko Sakkinen
2024-08-12 8:16 ` Dmitrii Kuvaiskii
2024-07-25 2:00 ` Huang, Kai
3 siblings, 1 reply; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-07-17 10:37 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, kai.huang, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable
On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> +/*
> + * 'desc' bit indicating that PCMD page associated with the enclave page is
> + * busy (e.g. because the enclave page is being reclaimed).
> + */
> +#define SGX_ENCL_PAGE_PCMD_BUSY BIT(3)
What are other situations when this flag is set than being
reclaimed? The comment says that it is only one use for this
flag.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
2024-07-05 7:45 ` [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
@ 2024-07-17 10:38 ` Jarkko Sakkinen
2024-08-12 8:21 ` Dmitrii Kuvaiskii
2024-07-25 0:52 ` Huang, Kai
1 sibling, 1 reply; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-07-17 10:38 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, kai.huang, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka
On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> Imagine an mmap()'d file. Two threads touch the same address at the same
> time and fault. Both allocate a physical page and race to install a PTE
> for that page. Only one will win the race. The loser frees its page, but
> still continues handling the fault as a success and returns
> VM_FAULT_NOPAGE from the fault handler.
>
> The same race can happen with SGX. But there's a bug: the loser in the
> SGX steers into a failure path. The loser EREMOVE's the winner's EPC
> page, then returns SIGBUS, likely killing the app.
>
> Fix the SGX loser's behavior. Change the return code to VM_FAULT_NOPAGE
> to avoid SIGBUS and call sgx_free_epc_page() which avoids EREMOVE'ing
> the winner's page and only frees the page that the loser allocated.
>
> The race can be illustrated as follows:
>
> /* /*
> * Fault on CPU1 * Fault on CPU2
> * on enclave page X * on enclave page X
> */ */
> sgx_vma_fault() { sgx_vma_fault() {
>
> xa_load(&encl->page_array) xa_load(&encl->page_array)
> == NULL --> == NULL -->
>
> sgx_encl_eaug_page() { sgx_encl_eaug_page() {
>
> ... ...
>
> /* /*
> * alloc encl_page * alloc encl_page
> */ */
> mutex_lock(&encl->lock);
> /*
> * alloc EPC page
> */
> epc_page = sgx_alloc_epc_page(...);
> /*
> * add page to enclave's xarray
> */
> xa_insert(&encl->page_array, ...);
> /*
> * add page to enclave via EAUG
> * (page is in pending state)
> */
> /*
> * add PTE entry
> */
> vmf_insert_pfn(...);
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_NOPAGE;
> }
> }
> /*
> * All good up to here: enclave page
> * successfully added to enclave,
> * ready for EACCEPT from user space
> */
> mutex_lock(&encl->lock);
> /*
> * alloc EPC page
> */
> epc_page = sgx_alloc_epc_page(...);
> /*
> * add page to enclave's xarray,
> * this fails with -EBUSY as this
> * page was already added by CPU2
> */
> xa_insert(&encl->page_array, ...);
>
> err_out_shrink:
> sgx_encl_free_epc_page(epc_page) {
> /*
> * remove page via EREMOVE
> *
> * *BUG*: page added by CPU2 is
> * yanked from enclave while it
> * remains accessible from OS
> * perspective (PTE installed)
> */
> /*
> * free EPC page
> */
> sgx_free_epc_page(epc_page);
> }
>
> mutex_unlock(&encl->lock);
> /*
> * *BUG*: SIGBUS is returned
> * for a valid enclave page
> */
> return VM_FAULT_SIGBUS;
> }
> }
>
> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
> Cc: stable@vger.kernel.org
> Reported-by: Marcelina Kościelnicka <mwk@invisiblethingslab.com>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index c0a3c00284c8..9f7f9e57cdeb 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -380,8 +380,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> * If ret == -EBUSY then page was created in another flow while
> * running without encl->lock
> */
> - if (ret)
> + if (ret) {
> + if (ret == -EBUSY)
> + vmret = VM_FAULT_NOPAGE;
> goto err_out_shrink;
> + }
>
> pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
> pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -417,7 +420,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> err_out_shrink:
> sgx_encl_shrink(encl, va_page);
> err_out_epc:
> - sgx_encl_free_epc_page(epc_page);
> + sgx_free_epc_page(epc_page);
> err_out_unlock:
> mutex_unlock(&encl->lock);
> kfree(encl_page);
Fixes should be in the head of the series so please reorder.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-07-05 7:45 ` [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
2024-07-10 15:16 ` Haitao Huang
@ 2024-07-17 10:38 ` Jarkko Sakkinen
2024-08-12 8:25 ` Dmitrii Kuvaiskii
2024-07-25 1:21 ` Huang, Kai
2 siblings, 1 reply; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-07-17 10:38 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, kai.huang, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable
On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and MADV_DONTNEED semantics). Consider some enclave page added to the
> enclave. User space decides to temporarily remove this page (e.g.,
> emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> space performs a memory access on the same page on CPU2, which results
> in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
> follows:
>
> /*
> * CPU1: User space performs
> * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
> * on enclave page X
> */
> sgx_encl_remove_pages() {
>
> mutex_lock(&encl->lock);
>
> entry = sgx_encl_load_page(encl);
> /*
> * verify that page is
> * trimmed and accepted
> */
>
> mutex_unlock(&encl->lock);
>
> /*
> * remove PTE entry; cannot
> * be performed under lock
> */
> sgx_zap_enclave_ptes(encl);
> /*
> * Fault on CPU2 on same page X
> */
> sgx_vma_fault() {
> /*
> * PTE entry was removed, but the
> * page is still in enclave's xarray
> */
> xa_load(&encl->page_array) != NULL ->
> /*
> * SGX driver thinks that this page
> * was swapped out and loads it
> */
> mutex_lock(&encl->lock);
> /*
> * this is effectively a no-op
> */
> entry = sgx_encl_load_page_in_vma();
> /*
> * add PTE entry
> *
> * *BUG*: a PTE is installed for a
> * page in process of being removed
> */
> vmf_insert_pfn(...);
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_NOPAGE;
> }
> /*
> * continue with page removal
> */
> mutex_lock(&encl->lock);
>
> sgx_encl_free_epc_page(epc_page) {
> /*
> * remove page via EREMOVE
> */
> /*
> * free EPC page
> */
> sgx_free_epc_page(epc_page);
> }
>
> xa_erase(&encl->page_array);
>
> mutex_unlock(&encl->lock);
> }
>
> Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page.
>
> Fix this race by forcing the fault handler on CPU2 to back off if the
> page is currently being removed (on CPU1). This is achieved by
> setting SGX_ENCL_PAGE_BUSY flag right-before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is busy, and if yes then CPU2 backs off and waits until the page is
> completely removed. After that, any memory access to this page results
> in a normal "allocate and EAUG a page on #PF" flow.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
> arch/x86/kernel/cpu/sgx/ioctl.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 5d390df21440..02441883401d 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1141,7 +1141,14 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> /*
> * Do not keep encl->lock because of dependency on
> * mmap_lock acquired in sgx_zap_enclave_ptes().
> + *
> + * Releasing encl->lock leads to a data race: while CPU1
> + * performs sgx_zap_enclave_ptes() and removes the PTE entry
> + * for the enclave page, CPU2 may attempt to load this page
> + * (because the page is still in enclave's xarray). To prevent
> + * CPU2 from loading the page, mark the page as busy.
> */
> + entry->desc |= SGX_ENCL_PAGE_BUSY;
> mutex_unlock(&encl->lock);
>
> sgx_zap_enclave_ptes(encl, addr);
Ditto.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
2024-07-05 7:45 ` [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
2024-07-17 10:38 ` Jarkko Sakkinen
@ 2024-07-25 0:52 ` Huang, Kai
1 sibling, 0 replies; 27+ messages in thread
From: Huang, Kai @ 2024-07-25 0:52 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, jarkko, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable, Marcelina Kościelnicka
On 5/07/2024 7:45 pm, Dmitrii Kuvaiskii wrote:
> Imagine an mmap()'d file. Two threads touch the same address at the same
> time and fault. Both allocate a physical page and race to install a PTE
> for that page. Only one will win the race. The loser frees its page, but
> still continues handling the fault as a success and returns
> VM_FAULT_NOPAGE from the fault handler.
>
> The same race can happen with SGX. But there's a bug: the loser in the
> SGX steers into a failure path. The loser EREMOVE's the winner's EPC
> page, then returns SIGBUS, likely killing the app.
>
> Fix the SGX loser's behavior. Change the return code to VM_FAULT_NOPAGE
> to avoid SIGBUS and call sgx_free_epc_page() which avoids EREMOVE'ing
> the winner's page and only frees the page that the loser allocated.
>
> The race can be illustrated as follows:
>
> /* /*
> * Fault on CPU1 * Fault on CPU2
> * on enclave page X * on enclave page X
> */ */
> sgx_vma_fault() { sgx_vma_fault() {
>
> xa_load(&encl->page_array) xa_load(&encl->page_array)
> == NULL --> == NULL -->
>
> sgx_encl_eaug_page() { sgx_encl_eaug_page() {
>
> ... ...
>
> /* /*
> * alloc encl_page * alloc encl_page
> */ */
> mutex_lock(&encl->lock);
> /*
> * alloc EPC page
> */
> epc_page = sgx_alloc_epc_page(...);
> /*
> * add page to enclave's xarray
> */
> xa_insert(&encl->page_array, ...);
> /*
> * add page to enclave via EAUG
> * (page is in pending state)
> */
> /*
> * add PTE entry
> */
> vmf_insert_pfn(...);
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_NOPAGE;
> }
> }
> /*
> * All good up to here: enclave page
> * successfully added to enclave,
> * ready for EACCEPT from user space
> */
> mutex_lock(&encl->lock);
> /*
> * alloc EPC page
> */
> epc_page = sgx_alloc_epc_page(...);
> /*
> * add page to enclave's xarray,
> * this fails with -EBUSY as this
> * page was already added by CPU2
> */
> xa_insert(&encl->page_array, ...);
Seems the reason of this issue is we allocate encl_page outside of the
encl->lock mutex, and the current way to detect "whether the fault has
been handled by another thread" is by checking whether xa_insert()
returns -EBUSY -- which ...
>
> err_out_shrink:
> sgx_encl_free_epc_page(epc_page) {
> /*
> * remove page via EREMOVE
> *
> * *BUG*: page added by CPU2 is
> * yanked from enclave while it
> * remains accessible from OS
> * perspective (PTE installed)
> */
> /*
> * free EPC page
> */
> sgx_free_epc_page(epc_page);
> }
> > mutex_unlock(&encl->lock);
> /*
> * *BUG*: SIGBUS is returned
> * for a valid enclave page
> */
> return VM_FAULT_SIGBUS;
> }
> }
>
> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
> Cc: stable@vger.kernel.org
> Reported-by: Marcelina Kościelnicka <mwk@invisiblethingslab.com>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index c0a3c00284c8..9f7f9e57cdeb 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -380,8 +380,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> * If ret == -EBUSY then page was created in another flow while
> * running without encl->lock
> */
> - if (ret)
> + if (ret) {
> + if (ret == -EBUSY)
> + vmret = VM_FAULT_NOPAGE;
> goto err_out_shrink;
> + }
... isn't done in the current code despite there's a comment for it
ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
encl_page, GFP_KERNEL);
/*
* If ret == -EBUSY then page was created in another flow while
* running without encl->lock
*/
if (ret)
goto err_out_shrink;
And this patch actually does that.
But instead of using xa_insert() to detect such case, where we have done
bunch of things and needs to revert of all them if xa_insert() fails,
could we just re-check the encl_page inside the encl->lock and quickly
mark it as done if another thread has already done the job?
Something like below (build tested only):
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..7bf63d1b047b 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -339,6 +339,18 @@ static vm_fault_t sgx_encl_eaug_page(struct
vm_area_struct *vma,
if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
return VM_FAULT_SIGBUS;
+
+ mutex_lock(&encl->lock);
+
+ /*
+ * Multiple threads may try to fault in the same EPC page
+ * concurrently. Re-check if another thread has already
+ * done that.
+ */
+ encl_page = xa_load(&encl->page_array, PFN_DOWN(addr));
+ if(encl_page)
+ goto done;
+
/*
* Ignore internal permission checking for dynamically added pages.
* They matter only for data added during the pre-initialization
@@ -347,10 +359,10 @@ static vm_fault_t sgx_encl_eaug_page(struct
vm_area_struct *vma,
*/
secinfo_flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X;
encl_page = sgx_encl_page_alloc(encl, addr - encl->base,
secinfo_flags);
- if (IS_ERR(encl_page))
- return VM_FAULT_OOM;
-
- mutex_lock(&encl->lock);
+ if (IS_ERR(encl_page)) {
+ vmret = VM_FAULT_OOM;
+ goto err_out;
+ }
epc_page = sgx_encl_load_secs(encl);
if (IS_ERR(epc_page)) {
@@ -378,10 +390,6 @@ static vm_fault_t sgx_encl_eaug_page(struct
vm_area_struct *vma,
ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
encl_page, GFP_KERNEL);
- /*
- * If ret == -EBUSY then page was created in another flow while
- * running without encl->lock
- */
if (ret)
goto err_out_shrink;
@@ -391,7 +399,7 @@ static vm_fault_t sgx_encl_eaug_page(struct
vm_area_struct *vma,
ret = __eaug(&pginfo, sgx_get_epc_virt_addr(epc_page));
if (ret)
- goto err_out;
+ goto err_out_eaug;
encl_page->encl = encl;
encl_page->epc_page = epc_page;
@@ -410,10 +418,11 @@ static vm_fault_t sgx_encl_eaug_page(struct
vm_area_struct *vma,
mutex_unlock(&encl->lock);
return VM_FAULT_SIGBUS;
}
+done:
mutex_unlock(&encl->lock);
return VM_FAULT_NOPAGE;
-err_out:
+err_out_eaug:
xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
err_out_shrink:
@@ -421,9 +430,9 @@ static vm_fault_t sgx_encl_eaug_page(struct
vm_area_struct *vma,
err_out_epc:
sgx_encl_free_epc_page(epc_page);
err_out_unlock:
- mutex_unlock(&encl->lock);
kfree(encl_page);
-
+err_out:
+ mutex_unlock(&encl->lock);
return vmret;
}
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-07-05 7:45 ` [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
2024-07-10 15:16 ` Haitao Huang
2024-07-17 10:38 ` Jarkko Sakkinen
@ 2024-07-25 1:21 ` Huang, Kai
2024-08-09 9:35 ` Dmitrii Kuvaiskii
2 siblings, 1 reply; 27+ messages in thread
From: Huang, Kai @ 2024-07-25 1:21 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, jarkko, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable
On 5/07/2024 7:45 pm, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and MADV_DONTNEED semantics). Consider some enclave page added to the
> enclave. User space decides to temporarily remove this page (e.g.,
> emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> space performs a memory access on the same page on CPU2, which results
> in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
> follows:
>
> /*
> * CPU1: User space performs
> * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
> * on enclave page X
> */
> sgx_encl_remove_pages() {
>
> mutex_lock(&encl->lock);
>
> entry = sgx_encl_load_page(encl);
> /*
> * verify that page is
> * trimmed and accepted
> */
>
> mutex_unlock(&encl->lock);
>
> /*
> * remove PTE entry; cannot
> * be performed under lock
> */
> sgx_zap_enclave_ptes(encl);
> /*
> * Fault on CPU2 on same page X
> */
> sgx_vma_fault() {
> /*
> * PTE entry was removed, but the
> * page is still in enclave's xarray
> */
> xa_load(&encl->page_array) != NULL ->
> /*
> * SGX driver thinks that this page
> * was swapped out and loads it
> */
> mutex_lock(&encl->lock);
> /*
> * this is effectively a no-op
> */
> entry = sgx_encl_load_page_in_vma();
> /*
> * add PTE entry
> *
> * *BUG*: a PTE is installed for a
> * page in process of being removed
> */
> vmf_insert_pfn(...);
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_NOPAGE;
> }
> /*
> * continue with page removal
> */
> mutex_lock(&encl->lock);
>
> sgx_encl_free_epc_page(epc_page) {
> /*
> * remove page via EREMOVE
> */
> /*
> * free EPC page
> */
> sgx_free_epc_page(epc_page);
> }
>
> xa_erase(&encl->page_array);
>
> mutex_unlock(&encl->lock);
> }
>
> Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed, and any subsequent
> access to this page raises a fault: with the kernel believing there to
> be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> path do_user_addr_fault() -> access_error() causes the SGX driver's
> sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> The userspace SIGSEGV handler cannot perform EACCEPT because the page
> was not EAUGed. Thus, the user space is stuck with the inaccessible
> page.
Reading the code, it seems the ioctl(sgx_ioc_enclave_modify_types) also
zaps EPC mapping when converting a normal page to TSC. Thus IIUC it
should also suffer this issue?
>
> Fix this race by forcing the fault handler on CPU2 to back off if the
> page is currently being removed (on CPU1). This is achieved by
> setting SGX_ENCL_PAGE_BUSY flag right-before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is busy, and if yes then CPU2 backs off and waits until the page is
> completely removed. After that, any memory access to this page results
> in a normal "allocate and EAUG a page on #PF" flow.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
> ---
> arch/x86/kernel/cpu/sgx/ioctl.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 5d390df21440..02441883401d 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1141,7 +1141,14 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> /*
> * Do not keep encl->lock because of dependency on
> * mmap_lock acquired in sgx_zap_enclave_ptes().
> + *
> + * Releasing encl->lock leads to a data race: while CPU1
> + * performs sgx_zap_enclave_ptes() and removes the PTE entry
> + * for the enclave page, CPU2 may attempt to load this page
> + * (because the page is still in enclave's xarray). To prevent
> + * CPU2 from loading the page, mark the page as busy.
> */
> + entry->desc |= SGX_ENCL_PAGE_BUSY;
> mutex_unlock(&encl->lock);
>
> sgx_zap_enclave_ptes(encl, addr);
The fix seems reasonable to me for the REMOVE case. But IIUC the BUSY
flag should be applied to the above case (PT change) too?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
` (2 preceding siblings ...)
2024-07-17 10:37 ` Jarkko Sakkinen
@ 2024-07-25 2:00 ` Huang, Kai
3 siblings, 0 replies; 27+ messages in thread
From: Huang, Kai @ 2024-07-25 2:00 UTC (permalink / raw)
To: Dmitrii Kuvaiskii, dave.hansen, jarkko, haitao.huang,
reinette.chatre, linux-sgx, linux-kernel
Cc: mona.vij, kailun.qin, stable
On 5/07/2024 7:45 pm, Dmitrii Kuvaiskii wrote:
> SGX_ENCL_PAGE_BEING_RECLAIMED flag is set when the enclave page is being
> reclaimed (moved to the backing store). This flag however has two
> logical meanings:
>
> 1. Don't attempt to load the enclave page (the page is busy).
> 2. Don't attempt to remove the PCMD page corresponding to this enclave
> page (the PCMD page is busy).
Nit:
I think the SGX_ENCL_PAGE_PCMD_BUSY isn't that accurate, because
obviously the actual backing page is busy (i.e., cannot be truncated)
too. So the current SGX_ENCL_PAGE_BEING_RECLAIMED is also fine to me.
But truncating the actual backing page can also be determined by
SGX_ENCL_PAGE_PCMD_BUSY I suppose (if needed in the future), so this
looks fine to me too.
>
> To reflect these two meanings, split SGX_ENCL_PAGE_BEING_RECLAIMED into
> two flags: SGX_ENCL_PAGE_BUSY and SGX_ENCL_PAGE_PCMD_BUSY. Currently,
> both flags are set only when the enclave page is being reclaimed. A
> future commit will introduce a new case when the enclave page is being
> removed; this new case will set only the SGX_ENCL_PAGE_BUSY flag.
As replied to the last patch, IIUC EREMOVE ioctl doesn't seem to be the
only case where the BUSY needs to be marked, so let's just say something
general but not mention the removal case specifically.
Anyway, feel free to add:
Acked-by: Kai Huang <kai.huang@intel.com>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-07-25 1:21 ` Huang, Kai
@ 2024-08-09 9:35 ` Dmitrii Kuvaiskii
2024-08-09 11:19 ` Huang, Kai
0 siblings, 1 reply; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-08-09 9:35 UTC (permalink / raw)
To: kai.huang
Cc: dave.hansen, dmitrii.kuvaiskii, haitao.huang, jarkko, kailun.qin,
linux-kernel, linux-sgx, mona.vij, reinette.chatre, stable
On Thu, Jul 25, 2024 at 01:21:56PM +1200, Huang, Kai wrote:
>
> > Two enclave threads may try to add and remove the same enclave page
> > simultaneously (e.g., if the SGX runtime supports both lazy allocation
> > and MADV_DONTNEED semantics). Consider some enclave page added to the
> > enclave. User space decides to temporarily remove this page (e.g.,
> > emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> > space performs a memory access on the same page on CPU2, which results
> > in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
> > follows:
> >
> > [ ... skipped ... ]
> >
> > Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
> > same page. This enclave page becomes perpetually inaccessible (until
> > another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> > marked accessible in the PTE entry but is not EAUGed, and any subsequent
> > access to this page raises a fault: with the kernel believing there to
> > be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> > path do_user_addr_fault() -> access_error() causes the SGX driver's
> > sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> > The userspace SIGSEGV handler cannot perform EACCEPT because the page
> > was not EAUGed. Thus, the user space is stuck with the inaccessible
> > page.
>
> Reading the code, it seems the ioctl(sgx_ioc_enclave_modify_types) also zaps
> EPC mapping when converting a normal page to TSC. Thus IIUC it should also
> suffer this issue?
Technically yes, sgx_enclave_modify_types() has a similar code path and
can be patched in a similar way.
Practically though, I can't imagine an SGX program or framework to allow a
scenario when CPU1 modifies the type of the enclave page from REG to TCS
and at the same time CPU2 performs a memory access on the same page. This
would be clearly a bug in the SGX program/framework. For example, Gramine
always follows the path of: create a new REG enclave page, modify it to
TCS, only then start using it; i.e., there is never a point in time at
which the REG page is allocated and ready to be converted to a TCS page,
and some other thread/CPU accesses it in-between these steps.
TLDR: I can add similar handling to sgx_enclave_modify_types() if
reviewers insist, but I don't see how this data race can ever be
triggered by benign real-world SGX applications.
--
Dmitrii Kuvaiskii
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-08-09 9:35 ` Dmitrii Kuvaiskii
@ 2024-08-09 11:19 ` Huang, Kai
2024-08-12 8:32 ` Dmitrii Kuvaiskii
0 siblings, 1 reply; 27+ messages in thread
From: Huang, Kai @ 2024-08-09 11:19 UTC (permalink / raw)
To: Kuvaiskii, Dmitrii
Cc: linux-sgx@vger.kernel.org, dave.hansen@linux.intel.com,
jarkko@kernel.org, Chatre, Reinette, linux-kernel@vger.kernel.org,
Qin, Kailun, haitao.huang@linux.intel.com, stable@vger.kernel.org,
Vij, Mona
On Fri, 2024-08-09 at 02:35 -0700, Dmitrii Kuvaiskii wrote:
> On Thu, Jul 25, 2024 at 01:21:56PM +1200, Huang, Kai wrote:
> >
> > > Two enclave threads may try to add and remove the same enclave page
> > > simultaneously (e.g., if the SGX runtime supports both lazy allocation
> > > and MADV_DONTNEED semantics). Consider some enclave page added to the
> > > enclave. User space decides to temporarily remove this page (e.g.,
> > > emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
> > > space performs a memory access on the same page on CPU2, which results
> > > in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
> > > follows:
> > >
> > > [ ... skipped ... ]
> > >
> > > Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
> > > same page. This enclave page becomes perpetually inaccessible (until
> > > another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> > > marked accessible in the PTE entry but is not EAUGed, and any subsequent
> > > access to this page raises a fault: with the kernel believing there to
> > > be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
> > > path do_user_addr_fault() -> access_error() causes the SGX driver's
> > > sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
> > > The userspace SIGSEGV handler cannot perform EACCEPT because the page
> > > was not EAUGed. Thus, the user space is stuck with the inaccessible
> > > page.
> >
> > Reading the code, it seems the ioctl(sgx_ioc_enclave_modify_types) also zaps
> > EPC mapping when converting a normal page to TSC. Thus IIUC it should also
> > suffer this issue?
>
> Technically yes, sgx_enclave_modify_types() has a similar code path and
> can be patched in a similar way.
>
> Practically though, I can't imagine an SGX program or framework to allow a
> scenario when CPU1 modifies the type of the enclave page from REG to TCS
> and at the same time CPU2 performs a memory access on the same page. This
> would be clearly a bug in the SGX program/framework. For example, Gramine
> always follows the path of: create a new REG enclave page, modify it to
> TCS, only then start using it; i.e., there is never a point in time at
> which the REG page is allocated and ready to be converted to a TCS page,
> and some other thread/CPU accesses it in-between these steps.
I think we need to understand the consequence of such bug (assuming such
behaviour is 100% a bug) both to kernel and to enclave.
To the kernel I don't see any big issue: for the sgx_vma_fault() path it will
find the EPC page is already loaded thus just setup the mapping again; for the
sgx_enclave_modify_types() path the worst case is __emodt() could fail,
resulting in enclave being killed probably.
So if this race is 100% a bug, and will also certainly kill the enclave, then
I guess it is fine not to handle. But there's an "assuming" here.
On the other hand, there's no risk if we apply BUSY flag here too. If it is a
bug in enclave, then it can die anyway; otherwise it may survive.
>
> TLDR: I can add similar handling to sgx_enclave_modify_types() if
> reviewers insist, but I don't see how this data race can ever be
> triggered by benign real-world SGX applications.
>
So as mentioned above, I intend to suggest to also apply the BUSY flag here.
And we can have a consist rule in the kernel:
If an enclave page is under certainly operation by the kernel with the mapping
removed, other threads trying to access that page are temporarily blocked and
should retry.
But this is only my 2cents, and I'll leave to maintainers.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-07-17 10:36 ` Jarkko Sakkinen
@ 2024-08-12 8:12 ` Dmitrii Kuvaiskii
2024-08-15 18:29 ` Jarkko Sakkinen
0 siblings, 1 reply; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-08-12 8:12 UTC (permalink / raw)
To: jarkko
Cc: dave.hansen, dmitrii.kuvaiskii, haitao.huang, kai.huang,
kailun.qin, linux-kernel, linux-sgx, mona.vij, reinette.chatre,
stable
On Wed, Jul 17, 2024 at 01:36:08PM +0300, Jarkko Sakkinen wrote:
> On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> > SGX_ENCL_PAGE_BEING_RECLAIMED flag is set when the enclave page is being
> > reclaimed (moved to the backing store). This flag however has two
> > logical meanings:
> ~~~~~~~~
> side-effects
Could you clarify the required action here? Do you expect me to replace
"This flag however has two logical meanings" with "This flag however has
two logical side-effects"? The suggested word doesn't seem to apply nicely
to this case. In my text, I have the following two sentences: "Don't
attempt to load the enclave page" and "Don't attempt to remove the PCMD
page ...". I don't think it's proper English to say that "Don't attempt
..." is a side effect. Or do you want me to also modify the two sentences
in the list?
By the way, this text is a rephrasing of Dave Hansen's comment:
https://lore.kernel.org/all/1d405428-3847-4862-b146-dd57711c881e@intel.com/
> > To reflect these two meanings, split SGX_ENCL_PAGE_BEING_RECLAIMED into
>
> I don't care about meanings. Only who does and what.
Could you clarify the required action here? What would be a better
rephrasing? Aren't we supposed to clarify the rationale behind the code
changes in the commit message?
--
Dmitrii Kuvaiskii
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-07-17 10:37 ` Jarkko Sakkinen
@ 2024-08-12 8:16 ` Dmitrii Kuvaiskii
2024-08-15 18:30 ` Jarkko Sakkinen
0 siblings, 1 reply; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-08-12 8:16 UTC (permalink / raw)
To: jarkko
Cc: dave.hansen, dmitrii.kuvaiskii, haitao.huang, kai.huang,
kailun.qin, linux-kernel, linux-sgx, mona.vij, reinette.chatre,
stable
On Wed, Jul 17, 2024 at 01:37:39PM +0300, Jarkko Sakkinen wrote:
> On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> > +/*
> > + * 'desc' bit indicating that PCMD page associated with the enclave page is
> > + * busy (e.g. because the enclave page is being reclaimed).
> > + */
> > +#define SGX_ENCL_PAGE_PCMD_BUSY BIT(3)
>
> What are other situations when this flag is set than being
> reclaimed? The comment says that it is only one use for this
> flag.
Yes, your understanding is correct, currently there is only one situation.
Do you want me to modify the comment somehow?
--
Dmitrii Kuvaiskii
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
2024-07-17 10:38 ` Jarkko Sakkinen
@ 2024-08-12 8:21 ` Dmitrii Kuvaiskii
2024-08-15 18:31 ` Jarkko Sakkinen
0 siblings, 1 reply; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-08-12 8:21 UTC (permalink / raw)
To: jarkko
Cc: dave.hansen, dmitrii.kuvaiskii, haitao.huang, kai.huang,
kailun.qin, linux-kernel, linux-sgx, mona.vij, mwk,
reinette.chatre, stable
On Wed, Jul 17, 2024 at 01:38:37PM +0300, Jarkko Sakkinen wrote:
> Fixes should be in the head of the series so please reorder.
Do you mean that the preparation patch [1] should be applied after the two
bug fixes? This seems to not make sense -- isn't it more correct to first
refactor code, and then to fix in a cleaner way? I thought that was the
point of previous Dave Hansen's comments [2].
[1] https://lore.kernel.org/all/20240705074524.443713-2-dmitrii.kuvaiskii@intel.com/
[2] https://lore.kernel.org/all/1d405428-3847-4862-b146-dd57711c881e@intel.com/
--
Dmitrii Kuvaiskii
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-07-17 10:38 ` Jarkko Sakkinen
@ 2024-08-12 8:25 ` Dmitrii Kuvaiskii
2024-08-15 18:34 ` Jarkko Sakkinen
0 siblings, 1 reply; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-08-12 8:25 UTC (permalink / raw)
To: jarkko
Cc: dave.hansen, dmitrii.kuvaiskii, haitao.huang, kai.huang,
kailun.qin, linux-kernel, linux-sgx, mona.vij, reinette.chatre,
stable
On Wed, Jul 17, 2024 at 01:38:59PM +0300, Jarkko Sakkinen wrote:
> Ditto.
Just to be sure: I assume this means "Fixes should be in the head of the
series so please reorder"? If yes, please see my reply in the other email
[1].
[1] https://lore.kernel.org/all/20240812082128.3084051-1-dmitrii.kuvaiskii@intel.com/
--
Dmitrii Kuvaiskii
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-08-09 11:19 ` Huang, Kai
@ 2024-08-12 8:32 ` Dmitrii Kuvaiskii
2024-08-12 10:34 ` Huang, Kai
0 siblings, 1 reply; 27+ messages in thread
From: Dmitrii Kuvaiskii @ 2024-08-12 8:32 UTC (permalink / raw)
To: kai.huang
Cc: dave.hansen, dmitrii.kuvaiskii, haitao.huang, jarkko, kailun.qin,
linux-kernel, linux-sgx, mona.vij, reinette.chatre, stable
On Fri, Aug 09, 2024 at 11:19:22AM +0000, Huang, Kai wrote:
> > TLDR: I can add similar handling to sgx_enclave_modify_types() if
> > reviewers insist, but I don't see how this data race can ever be
> > triggered by benign real-world SGX applications.
>
> So as mentioned above, I intend to suggest to also apply the BUSY flag here.
> And we can have a consist rule in the kernel:
>
> If an enclave page is under certainly operation by the kernel with the mapping
> removed, other threads trying to access that page are temporarily blocked and
> should retry.
I agree with your assessment on the consequences of such bug in
sgx_enclave_modify_types(). To my understanding, this bug can only affect
the SGX enclave (i.e. the userspace) -- either the SGX enclave will hang
or will be terminated.
Anyway, I will apply the BUSY flag also in sgx_enclave_modify_types() in
the next iteration of this patch series.
--
Dmitrii Kuvaiskii
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-08-12 8:32 ` Dmitrii Kuvaiskii
@ 2024-08-12 10:34 ` Huang, Kai
0 siblings, 0 replies; 27+ messages in thread
From: Huang, Kai @ 2024-08-12 10:34 UTC (permalink / raw)
To: Kuvaiskii, Dmitrii
Cc: linux-sgx@vger.kernel.org, dave.hansen@linux.intel.com,
jarkko@kernel.org, Chatre, Reinette, linux-kernel@vger.kernel.org,
Qin, Kailun, haitao.huang@linux.intel.com, stable@vger.kernel.org,
Vij, Mona
On Mon, 2024-08-12 at 01:32 -0700, Kuvaiskii, Dmitrii wrote:
> On Fri, Aug 09, 2024 at 11:19:22AM +0000, Huang, Kai wrote:
>
> > > TLDR: I can add similar handling to sgx_enclave_modify_types() if
> > > reviewers insist, but I don't see how this data race can ever be
> > > triggered by benign real-world SGX applications.
> >
> > So as mentioned above, I intend to suggest to also apply the BUSY flag here.
> > And we can have a consist rule in the kernel:
> >
> > If an enclave page is under certainly operation by the kernel with the mapping
> > removed, other threads trying to access that page are temporarily blocked and
> > should retry.
>
> I agree with your assessment on the consequences of such bug in
> sgx_enclave_modify_types(). To my understanding, this bug can only affect
> the SGX enclave (i.e. the userspace) -- either the SGX enclave will hang
> or will be terminated.
>
> Anyway, I will apply the BUSY flag also in sgx_enclave_modify_types() in
> the next iteration of this patch series.
>
Thanks.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-08-12 8:12 ` Dmitrii Kuvaiskii
@ 2024-08-15 18:29 ` Jarkko Sakkinen
0 siblings, 0 replies; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-08-15 18:29 UTC (permalink / raw)
To: Dmitrii Kuvaiskii
Cc: dave.hansen, haitao.huang, kai.huang, kailun.qin, linux-kernel,
linux-sgx, mona.vij, reinette.chatre, stable
On Mon Aug 12, 2024 at 11:12 AM EEST, Dmitrii Kuvaiskii wrote:
> On Wed, Jul 17, 2024 at 01:36:08PM +0300, Jarkko Sakkinen wrote:
> > On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> > > SGX_ENCL_PAGE_BEING_RECLAIMED flag is set when the enclave page is being
> > > reclaimed (moved to the backing store). This flag however has two
> > > logical meanings:
> > ~~~~~~~~
> > side-effects
>
> Could you clarify the required action here? Do you expect me to replace
> "This flag however has two logical meanings" with "This flag however has
> two logical side-effects"? The suggested word doesn't seem to apply nicely
> to this case. In my text, I have the following two sentences: "Don't
> attempt to load the enclave page" and "Don't attempt to remove the PCMD
> page ...". I don't think it's proper English to say that "Don't attempt
> ..." is a side effect. Or do you want me to also modify the two sentences
> in the list?
I agree with you here, you can ignore this comment.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
2024-08-12 8:16 ` Dmitrii Kuvaiskii
@ 2024-08-15 18:30 ` Jarkko Sakkinen
0 siblings, 0 replies; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-08-15 18:30 UTC (permalink / raw)
To: Dmitrii Kuvaiskii
Cc: dave.hansen, haitao.huang, kai.huang, kailun.qin, linux-kernel,
linux-sgx, mona.vij, reinette.chatre, stable
On Mon Aug 12, 2024 at 11:16 AM EEST, Dmitrii Kuvaiskii wrote:
> On Wed, Jul 17, 2024 at 01:37:39PM +0300, Jarkko Sakkinen wrote:
> > On Fri Jul 5, 2024 at 10:45 AM EEST, Dmitrii Kuvaiskii wrote:
> > > +/*
> > > + * 'desc' bit indicating that PCMD page associated with the enclave page is
> > > + * busy (e.g. because the enclave page is being reclaimed).
> > > + */
> > > +#define SGX_ENCL_PAGE_PCMD_BUSY BIT(3)
> >
> > What are other situations when this flag is set than being
> > reclaimed? The comment says that it is only one use for this
> > flag.
>
> Yes, your understanding is correct, currently there is only one situation.
>
> Do you want me to modify the comment somehow?
Yes, just s/e.g.//
>
> --
> Dmitrii Kuvaiskii
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
2024-08-12 8:21 ` Dmitrii Kuvaiskii
@ 2024-08-15 18:31 ` Jarkko Sakkinen
0 siblings, 0 replies; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-08-15 18:31 UTC (permalink / raw)
To: Dmitrii Kuvaiskii
Cc: dave.hansen, haitao.huang, kai.huang, kailun.qin, linux-kernel,
linux-sgx, mona.vij, mwk, reinette.chatre, stable
On Mon Aug 12, 2024 at 11:21 AM EEST, Dmitrii Kuvaiskii wrote:
> On Wed, Jul 17, 2024 at 01:38:37PM +0300, Jarkko Sakkinen wrote:
>
> > Fixes should be in the head of the series so please reorder.
>
> Do you mean that the preparation patch [1] should be applied after the two
> bug fixes? This seems to not make sense -- isn't it more correct to first
> refactor code, and then to fix in a cleaner way? I thought that was the
> point of previous Dave Hansen's comments [2].
>
> [1] https://lore.kernel.org/all/20240705074524.443713-2-dmitrii.kuvaiskii@intel.com/
> [2] https://lore.kernel.org/all/1d405428-3847-4862-b146-dd57711c881e@intel.com/
>
> --
> Dmitrii Kuvaiskii
OK, I read the references you put, and agree with you here. Thanks for the remarks.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-08-12 8:25 ` Dmitrii Kuvaiskii
@ 2024-08-15 18:34 ` Jarkko Sakkinen
2024-08-15 18:37 ` Jarkko Sakkinen
0 siblings, 1 reply; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-08-15 18:34 UTC (permalink / raw)
To: Dmitrii Kuvaiskii
Cc: dave.hansen, haitao.huang, kai.huang, kailun.qin, linux-kernel,
linux-sgx, mona.vij, reinette.chatre, stable
On Mon Aug 12, 2024 at 11:25 AM EEST, Dmitrii Kuvaiskii wrote:
> On Wed, Jul 17, 2024 at 01:38:59PM +0300, Jarkko Sakkinen wrote:
>
> > Ditto.
>
> Just to be sure: I assume this means "Fixes should be in the head of the
> series so please reorder"? If yes, please see my reply in the other email
> [1].
OK, based on your earlier remarks and references I agree with you.
>
> [1] https://lore.kernel.org/all/20240812082128.3084051-1-dmitrii.kuvaiskii@intel.com/
>
> --
> Dmitrii Kuvaiskii
I think for future and since we have bunch of state flags, removing
that "e.g." is worth of doing. Often you need to go through all of
the flags to remind you how they interact, and at that point "one
vs many" does help navigating the complexity.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race
2024-08-15 18:34 ` Jarkko Sakkinen
@ 2024-08-15 18:37 ` Jarkko Sakkinen
0 siblings, 0 replies; 27+ messages in thread
From: Jarkko Sakkinen @ 2024-08-15 18:37 UTC (permalink / raw)
To: Jarkko Sakkinen, Dmitrii Kuvaiskii
Cc: dave.hansen, haitao.huang, kai.huang, kailun.qin, linux-kernel,
linux-sgx, mona.vij, reinette.chatre, stable
On Thu Aug 15, 2024 at 9:34 PM EEST, Jarkko Sakkinen wrote:
> On Mon Aug 12, 2024 at 11:25 AM EEST, Dmitrii Kuvaiskii wrote:
> > On Wed, Jul 17, 2024 at 01:38:59PM +0300, Jarkko Sakkinen wrote:
> >
> > > Ditto.
> >
> > Just to be sure: I assume this means "Fixes should be in the head of the
> > series so please reorder"? If yes, please see my reply in the other email
> > [1].
>
> OK, based on your earlier remarks and references I agree with you.
>
> >
> > [1] https://lore.kernel.org/all/20240812082128.3084051-1-dmitrii.kuvaiskii@intel.com/
> >
> > --
> > Dmitrii Kuvaiskii
>
> I think for future and since we have bunch of state flags, removing
> that "e.g." is worth of doing. Often you need to go through all of
> the flags to remind you how they interact, and at that point "one
> vs many" does help navigating the complexity.
Actually every time there's a patch that has anything to do with
the state flags I go through all of em as a reminder. Might seem
like irrelevant detail but really is not (and neither unnecessarry
nitpicking). All small clues speed up that process or can mislead.
BR, Jarkko
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2024-08-15 18:37 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-05 7:45 [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Dmitrii Kuvaiskii
2024-07-05 7:45 ` [PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags Dmitrii Kuvaiskii
2024-07-10 15:15 ` Haitao Huang
2024-07-17 10:36 ` Jarkko Sakkinen
2024-08-12 8:12 ` Dmitrii Kuvaiskii
2024-08-15 18:29 ` Jarkko Sakkinen
2024-07-17 10:37 ` Jarkko Sakkinen
2024-08-12 8:16 ` Dmitrii Kuvaiskii
2024-08-15 18:30 ` Jarkko Sakkinen
2024-07-25 2:00 ` Huang, Kai
2024-07-05 7:45 ` [PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS Dmitrii Kuvaiskii
2024-07-17 10:38 ` Jarkko Sakkinen
2024-08-12 8:21 ` Dmitrii Kuvaiskii
2024-08-15 18:31 ` Jarkko Sakkinen
2024-07-25 0:52 ` Huang, Kai
2024-07-05 7:45 ` [PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race Dmitrii Kuvaiskii
2024-07-10 15:16 ` Haitao Huang
2024-07-17 10:38 ` Jarkko Sakkinen
2024-08-12 8:25 ` Dmitrii Kuvaiskii
2024-08-15 18:34 ` Jarkko Sakkinen
2024-08-15 18:37 ` Jarkko Sakkinen
2024-07-25 1:21 ` Huang, Kai
2024-08-09 9:35 ` Dmitrii Kuvaiskii
2024-08-09 11:19 ` Huang, Kai
2024-08-12 8:32 ` Dmitrii Kuvaiskii
2024-08-12 10:34 ` Huang, Kai
2024-07-05 9:03 ` [PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows Jarkko Sakkinen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox