[PATCH] tee: optee: Add support for supplicant timeout

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] tee: optee: Add support for supplicant timeout
@ 2024-12-13 11:14 Sumit Garg
  2024-12-18  6:57 ` Jens Wiklander
  2025-01-20  9:24 ` Sumit Garg
  0 siblings, 2 replies; 18+ messages in thread
From: Sumit Garg @ 2024-12-13 11:14 UTC (permalink / raw)
  To: op-tee; +Cc: jens.wiklander, jerome.forissier, linux-kernel, Sumit Garg

OP-TEE supplicant is a user-space daemon and it's possible for it
being crashed or killed in the middle of processing an OP-TEE RPC call.
It becomes more complicated when there is incorrect shutdown ordering
of the supplicant process vs the OP-TEE client application which can
eventually lead to system hang-up waiting for the closure of the client
application.

In order to gracefully handle this scenario, let's add a long enough
timeout to wait for supplicant to process requests. In case there is a
timeout then we return a proper error code for the RPC request.

Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
---
 drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 22 deletions(-)

diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
index 322a543b8c27..92e86ac4cdd4 100644
--- a/drivers/tee/optee/supp.c
+++ b/drivers/tee/optee/supp.c
@@ -7,6 +7,15 @@
 #include <linux/uaccess.h>
 #include "optee_private.h"
 
+/*
+ * OP-TEE supplicant timeout, the user-space supplicant may get
+ * crashed or killed while servicing an RPC call. This will just lead
+ * to OP-TEE client hung indefinitely just waiting for supplicant to
+ * serve requests which isn't expected. It is rather expected to fail
+ * gracefully with a timeout which is long enough.
+ */
+#define SUPP_TIMEOUT	(msecs_to_jiffies(10000))
+
 struct optee_supp_req {
 	struct list_head link;
 
@@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
 
 	/* Abort all queued requests */
 	list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
-		list_del(&req->link);
-		req->in_queue = false;
+		if (req->in_queue) {
+			list_del(&req->link);
+			req->in_queue = false;
+		}
 		req->ret = TEEC_ERROR_COMMUNICATION;
 		complete(&req->c);
 	}
@@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
 	struct optee_supp_req *req;
 	bool interruptable;
 	u32 ret;
+	int res = 1;
 
 	/*
 	 * Return in case there is no supplicant available and
@@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
 	/* Tell an eventual waiter there's a new request */
 	complete(&supp->reqs_c);
 
-	/*
-	 * Wait for supplicant to process and return result, once we've
-	 * returned from wait_for_completion(&req->c) successfully we have
-	 * exclusive access again.
-	 */
-	while (wait_for_completion_interruptible(&req->c)) {
+	/* Wait for supplicant to process and return result */
+	while (res) {
+		res = wait_for_completion_interruptible_timeout(&req->c,
+								SUPP_TIMEOUT);
+		/* Check if supplicant served the request */
+		if (res > 0)
+			break;
+
 		mutex_lock(&supp->mutex);
+		/*
+		 * There's no supplicant available and since the supp->mutex
+		 * currently is held none can become available until the mutex
+		 * released again.
+		 *
+		 * Interrupting an RPC to supplicant is only allowed as a way
+		 * of slightly improving the user experience in case the
+		 * supplicant hasn't been started yet. During normal operation
+		 * the supplicant will serve all requests in a timely manner and
+		 * interrupting then wouldn't make sense.
+		 */
 		interruptable = !supp->ctx;
-		if (interruptable) {
-			/*
-			 * There's no supplicant available and since the
-			 * supp->mutex currently is held none can
-			 * become available until the mutex released
-			 * again.
-			 *
-			 * Interrupting an RPC to supplicant is only
-			 * allowed as a way of slightly improving the user
-			 * experience in case the supplicant hasn't been
-			 * started yet. During normal operation the supplicant
-			 * will serve all requests in a timely manner and
-			 * interrupting then wouldn't make sense.
-			 */
+		if (interruptable || (res == 0)) {
 			if (req->in_queue) {
 				list_del(&req->link);
 				req->in_queue = false;
@@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
 			req->ret = TEEC_ERROR_COMMUNICATION;
 			break;
 		}
+		if (res == 0)
+			req->ret = TEE_ERROR_TIMEOUT;
 	}
 
 	ret = req->ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2024-12-13 11:14 [PATCH] tee: optee: Add support for supplicant timeout Sumit Garg
@ 2024-12-18  6:57 ` Jens Wiklander
  2024-12-18  7:30   ` Sumit Garg
  2025-01-20  9:24 ` Sumit Garg
  1 sibling, 1 reply; 18+ messages in thread
From: Jens Wiklander @ 2024-12-18  6:57 UTC (permalink / raw)
  To: Sumit Garg; +Cc: op-tee, jerome.forissier, linux-kernel

Hi Sumit,

On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> OP-TEE supplicant is a user-space daemon and it's possible for it
> being crashed or killed in the middle of processing an OP-TEE RPC call.
> It becomes more complicated when there is incorrect shutdown ordering
> of the supplicant process vs the OP-TEE client application which can
> eventually lead to system hang-up waiting for the closure of the client
> application.
>
> In order to gracefully handle this scenario, let's add a long enough
> timeout to wait for supplicant to process requests. In case there is a
> timeout then we return a proper error code for the RPC request.
>
> Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> ---
>  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
>  1 file changed, 36 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> index 322a543b8c27..92e86ac4cdd4 100644
> --- a/drivers/tee/optee/supp.c
> +++ b/drivers/tee/optee/supp.c
> @@ -7,6 +7,15 @@
>  #include <linux/uaccess.h>
>  #include "optee_private.h"
>
> +/*
> + * OP-TEE supplicant timeout, the user-space supplicant may get
> + * crashed or killed while servicing an RPC call. This will just lead
> + * to OP-TEE client hung indefinitely just waiting for supplicant to
> + * serve requests which isn't expected. It is rather expected to fail
> + * gracefully with a timeout which is long enough.
> + */
> +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> +
>  struct optee_supp_req {
>         struct list_head link;
>
> @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
>
>         /* Abort all queued requests */
>         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> -               list_del(&req->link);
> -               req->in_queue = false;
> +               if (req->in_queue) {
> +                       list_del(&req->link);
> +                       req->in_queue = false;
> +               }
>                 req->ret = TEEC_ERROR_COMMUNICATION;
>                 complete(&req->c);
>         }
> @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
>         struct optee_supp_req *req;
>         bool interruptable;
>         u32 ret;
> +       int res = 1;
>
>         /*
>          * Return in case there is no supplicant available and
> @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
>         /* Tell an eventual waiter there's a new request */
>         complete(&supp->reqs_c);
>
> -       /*
> -        * Wait for supplicant to process and return result, once we've
> -        * returned from wait_for_completion(&req->c) successfully we have
> -        * exclusive access again.
> -        */
> -       while (wait_for_completion_interruptible(&req->c)) {
> +       /* Wait for supplicant to process and return result */
> +       while (res) {
> +               res = wait_for_completion_interruptible_timeout(&req->c,
> +                                                               SUPP_TIMEOUT);
> +               /* Check if supplicant served the request */
> +               if (res > 0)
> +                       break;
> +
>                 mutex_lock(&supp->mutex);
> +               /*
> +                * There's no supplicant available and since the supp->mutex
> +                * currently is held none can become available until the mutex
> +                * released again.
> +                *
> +                * Interrupting an RPC to supplicant is only allowed as a way
> +                * of slightly improving the user experience in case the
> +                * supplicant hasn't been started yet. During normal operation
> +                * the supplicant will serve all requests in a timely manner and
> +                * interrupting then wouldn't make sense.
> +                */
>                 interruptable = !supp->ctx;
> -               if (interruptable) {
> -                       /*
> -                        * There's no supplicant available and since the
> -                        * supp->mutex currently is held none can
> -                        * become available until the mutex released
> -                        * again.
> -                        *
> -                        * Interrupting an RPC to supplicant is only
> -                        * allowed as a way of slightly improving the user
> -                        * experience in case the supplicant hasn't been
> -                        * started yet. During normal operation the supplicant
> -                        * will serve all requests in a timely manner and
> -                        * interrupting then wouldn't make sense.
> -                        */
> +               if (interruptable || (res == 0)) {

Are you fixing an observed problem or a theoretical one? If the
supplicant has died then "interruptable" is expected to be true so the
timeout shouldn't matter.

Cheers,
Jens

>                         if (req->in_queue) {
>                                 list_del(&req->link);
>                                 req->in_queue = false;
> @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
>                         req->ret = TEEC_ERROR_COMMUNICATION;
>                         break;
>                 }
> +               if (res == 0)
> +                       req->ret = TEE_ERROR_TIMEOUT;
>         }
>
>         ret = req->ret;
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2024-12-18  6:57 ` Jens Wiklander
@ 2024-12-18  7:30   ` Sumit Garg
  2024-12-18 10:32     ` Jens Wiklander
  0 siblings, 1 reply; 18+ messages in thread
From: Sumit Garg @ 2024-12-18  7:30 UTC (permalink / raw)
  To: Jens Wiklander; +Cc: op-tee, jerome.forissier, linux-kernel, Erik Schilling

+ Erik

Hi Jens,

On Wed, 18 Dec 2024 at 12:27, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>
> Hi Sumit,
>
> On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > OP-TEE supplicant is a user-space daemon and it's possible for it
> > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > It becomes more complicated when there is incorrect shutdown ordering
> > of the supplicant process vs the OP-TEE client application which can
> > eventually lead to system hang-up waiting for the closure of the client
> > application.
> >
> > In order to gracefully handle this scenario, let's add a long enough
> > timeout to wait for supplicant to process requests. In case there is a
> > timeout then we return a proper error code for the RPC request.
> >
> > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > ---
> >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> >  1 file changed, 36 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> > index 322a543b8c27..92e86ac4cdd4 100644
> > --- a/drivers/tee/optee/supp.c
> > +++ b/drivers/tee/optee/supp.c
> > @@ -7,6 +7,15 @@
> >  #include <linux/uaccess.h>
> >  #include "optee_private.h"
> >
> > +/*
> > + * OP-TEE supplicant timeout, the user-space supplicant may get
> > + * crashed or killed while servicing an RPC call. This will just lead
> > + * to OP-TEE client hung indefinitely just waiting for supplicant to
> > + * serve requests which isn't expected. It is rather expected to fail
> > + * gracefully with a timeout which is long enough.
> > + */
> > +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> > +
> >  struct optee_supp_req {
> >         struct list_head link;
> >
> > @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
> >
> >         /* Abort all queued requests */
> >         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> > -               list_del(&req->link);
> > -               req->in_queue = false;
> > +               if (req->in_queue) {
> > +                       list_del(&req->link);
> > +                       req->in_queue = false;
> > +               }
> >                 req->ret = TEEC_ERROR_COMMUNICATION;
> >                 complete(&req->c);
> >         }
> > @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> >         struct optee_supp_req *req;
> >         bool interruptable;
> >         u32 ret;
> > +       int res = 1;
> >
> >         /*
> >          * Return in case there is no supplicant available and
> > @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> >         /* Tell an eventual waiter there's a new request */
> >         complete(&supp->reqs_c);
> >
> > -       /*
> > -        * Wait for supplicant to process and return result, once we've
> > -        * returned from wait_for_completion(&req->c) successfully we have
> > -        * exclusive access again.
> > -        */
> > -       while (wait_for_completion_interruptible(&req->c)) {
> > +       /* Wait for supplicant to process and return result */
> > +       while (res) {
> > +               res = wait_for_completion_interruptible_timeout(&req->c,
> > +                                                               SUPP_TIMEOUT);
> > +               /* Check if supplicant served the request */
> > +               if (res > 0)
> > +                       break;
> > +
> >                 mutex_lock(&supp->mutex);
> > +               /*
> > +                * There's no supplicant available and since the supp->mutex
> > +                * currently is held none can become available until the mutex
> > +                * released again.
> > +                *
> > +                * Interrupting an RPC to supplicant is only allowed as a way
> > +                * of slightly improving the user experience in case the
> > +                * supplicant hasn't been started yet. During normal operation
> > +                * the supplicant will serve all requests in a timely manner and
> > +                * interrupting then wouldn't make sense.
> > +                */
> >                 interruptable = !supp->ctx;
> > -               if (interruptable) {
> > -                       /*
> > -                        * There's no supplicant available and since the
> > -                        * supp->mutex currently is held none can
> > -                        * become available until the mutex released
> > -                        * again.
> > -                        *
> > -                        * Interrupting an RPC to supplicant is only
> > -                        * allowed as a way of slightly improving the user
> > -                        * experience in case the supplicant hasn't been
> > -                        * started yet. During normal operation the supplicant
> > -                        * will serve all requests in a timely manner and
> > -                        * interrupting then wouldn't make sense.
> > -                        */
> > +               if (interruptable || (res == 0)) {
>
> Are you fixing an observed problem or a theoretical one?

It is an observed problem, I was able to reproduce it using following
sequence with OP-TEE buildroot setup:

$ xtest 600 &  // Run some secure storage tests using supplicant in
the background
$ kill -9 `pidof tee-supplicant` // Kill supplicant when the tests are
in progress.

This will cause the xtest to hang up.

> If the
> supplicant has died then "interruptable" is expected to be true so the
> timeout shouldn't matter.

When the supplicant dies, it doesn't lead to releasing the supplicant
context in the above test scenario. The reason is probably the
supplicant shared memory reference is held by OP-TEE which is in turn
is holding a reference to supplicant context.

-Sumit

>
> Cheers,
> Jens
>
> >                         if (req->in_queue) {
> >                                 list_del(&req->link);
> >                                 req->in_queue = false;
> > @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> >                         req->ret = TEEC_ERROR_COMMUNICATION;
> >                         break;
> >                 }
> > +               if (res == 0)
> > +                       req->ret = TEE_ERROR_TIMEOUT;
> >         }
> >
> >         ret = req->ret;
> > --
> > 2.43.0
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2024-12-18  7:30   ` Sumit Garg
@ 2024-12-18 10:32     ` Jens Wiklander
  2024-12-18 11:07       ` Sumit Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Wiklander @ 2024-12-18 10:32 UTC (permalink / raw)
  To: Sumit Garg; +Cc: op-tee, jerome.forissier, linux-kernel, Erik Schilling

On Wed, Dec 18, 2024 at 8:30 AM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> + Erik
>
> Hi Jens,
>
> On Wed, 18 Dec 2024 at 12:27, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> >
> > Hi Sumit,
> >
> > On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > It becomes more complicated when there is incorrect shutdown ordering
> > > of the supplicant process vs the OP-TEE client application which can
> > > eventually lead to system hang-up waiting for the closure of the client
> > > application.
> > >
> > > In order to gracefully handle this scenario, let's add a long enough
> > > timeout to wait for supplicant to process requests. In case there is a
> > > timeout then we return a proper error code for the RPC request.
> > >
> > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > ---
> > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> > > index 322a543b8c27..92e86ac4cdd4 100644
> > > --- a/drivers/tee/optee/supp.c
> > > +++ b/drivers/tee/optee/supp.c
> > > @@ -7,6 +7,15 @@
> > >  #include <linux/uaccess.h>
> > >  #include "optee_private.h"
> > >
> > > +/*
> > > + * OP-TEE supplicant timeout, the user-space supplicant may get
> > > + * crashed or killed while servicing an RPC call. This will just lead
> > > + * to OP-TEE client hung indefinitely just waiting for supplicant to
> > > + * serve requests which isn't expected. It is rather expected to fail
> > > + * gracefully with a timeout which is long enough.
> > > + */
> > > +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> > > +
> > >  struct optee_supp_req {
> > >         struct list_head link;
> > >
> > > @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
> > >
> > >         /* Abort all queued requests */
> > >         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> > > -               list_del(&req->link);
> > > -               req->in_queue = false;
> > > +               if (req->in_queue) {
> > > +                       list_del(&req->link);
> > > +                       req->in_queue = false;
> > > +               }
> > >                 req->ret = TEEC_ERROR_COMMUNICATION;
> > >                 complete(&req->c);
> > >         }
> > > @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > >         struct optee_supp_req *req;
> > >         bool interruptable;
> > >         u32 ret;
> > > +       int res = 1;
> > >
> > >         /*
> > >          * Return in case there is no supplicant available and
> > > @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > >         /* Tell an eventual waiter there's a new request */
> > >         complete(&supp->reqs_c);
> > >
> > > -       /*
> > > -        * Wait for supplicant to process and return result, once we've
> > > -        * returned from wait_for_completion(&req->c) successfully we have
> > > -        * exclusive access again.
> > > -        */
> > > -       while (wait_for_completion_interruptible(&req->c)) {
> > > +       /* Wait for supplicant to process and return result */
> > > +       while (res) {
> > > +               res = wait_for_completion_interruptible_timeout(&req->c,
> > > +                                                               SUPP_TIMEOUT);
> > > +               /* Check if supplicant served the request */
> > > +               if (res > 0)
> > > +                       break;
> > > +
> > >                 mutex_lock(&supp->mutex);
> > > +               /*
> > > +                * There's no supplicant available and since the supp->mutex
> > > +                * currently is held none can become available until the mutex
> > > +                * released again.
> > > +                *
> > > +                * Interrupting an RPC to supplicant is only allowed as a way
> > > +                * of slightly improving the user experience in case the
> > > +                * supplicant hasn't been started yet. During normal operation
> > > +                * the supplicant will serve all requests in a timely manner and
> > > +                * interrupting then wouldn't make sense.
> > > +                */
> > >                 interruptable = !supp->ctx;
> > > -               if (interruptable) {
> > > -                       /*
> > > -                        * There's no supplicant available and since the
> > > -                        * supp->mutex currently is held none can
> > > -                        * become available until the mutex released
> > > -                        * again.
> > > -                        *
> > > -                        * Interrupting an RPC to supplicant is only
> > > -                        * allowed as a way of slightly improving the user
> > > -                        * experience in case the supplicant hasn't been
> > > -                        * started yet. During normal operation the supplicant
> > > -                        * will serve all requests in a timely manner and
> > > -                        * interrupting then wouldn't make sense.
> > > -                        */
> > > +               if (interruptable || (res == 0)) {
> >
> > Are you fixing an observed problem or a theoretical one?
>
> It is an observed problem, I was able to reproduce it using following
> sequence with OP-TEE buildroot setup:
>
> $ xtest 600 &  // Run some secure storage tests using supplicant in
> the background
> $ kill -9 `pidof tee-supplicant` // Kill supplicant when the tests are
> in progress.
>
> This will cause the xtest to hang up.
>
> > If the
> > supplicant has died then "interruptable" is expected to be true so the
> > timeout shouldn't matter.
>
> When the supplicant dies, it doesn't lead to releasing the supplicant
> context in the above test scenario. The reason is probably the
> supplicant shared memory reference is held by OP-TEE which is in turn
> is holding a reference to supplicant context.

This sounds like the problem Amirreza is trying to solve for the
QCOMTEE driver. If we could get the supplicant context removed soon
after the supplicant has died we wouldn't need this, except that we
may need some trick to avoid ignoring an eventual signal received
while tee-supplicant is dying.

Wait, would it work to break the loop on SIGKILL? It's an uncatchable
signal so there's no reason for the calling process to wait anyway.

Cheers,
Jens

>
> -Sumit
>
> >
> > Cheers,
> > Jens
> >
> > >                         if (req->in_queue) {
> > >                                 list_del(&req->link);
> > >                                 req->in_queue = false;
> > > @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > >                         req->ret = TEEC_ERROR_COMMUNICATION;
> > >                         break;
> > >                 }
> > > +               if (res == 0)
> > > +                       req->ret = TEE_ERROR_TIMEOUT;
> > >         }
> > >
> > >         ret = req->ret;
> > > --
> > > 2.43.0
> > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2024-12-18 10:32     ` Jens Wiklander
@ 2024-12-18 11:07       ` Sumit Garg
  2024-12-18 13:34         ` Jens Wiklander
  0 siblings, 1 reply; 18+ messages in thread
From: Sumit Garg @ 2024-12-18 11:07 UTC (permalink / raw)
  To: Jens Wiklander; +Cc: op-tee, jerome.forissier, linux-kernel, Erik Schilling

On Wed, 18 Dec 2024 at 16:02, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>
> On Wed, Dec 18, 2024 at 8:30 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > + Erik
> >
> > Hi Jens,
> >
> > On Wed, 18 Dec 2024 at 12:27, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > >
> > > Hi Sumit,
> > >
> > > On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > > >
> > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > > It becomes more complicated when there is incorrect shutdown ordering
> > > > of the supplicant process vs the OP-TEE client application which can
> > > > eventually lead to system hang-up waiting for the closure of the client
> > > > application.
> > > >
> > > > In order to gracefully handle this scenario, let's add a long enough
> > > > timeout to wait for supplicant to process requests. In case there is a
> > > > timeout then we return a proper error code for the RPC request.
> > > >
> > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > > ---
> > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> > > > index 322a543b8c27..92e86ac4cdd4 100644
> > > > --- a/drivers/tee/optee/supp.c
> > > > +++ b/drivers/tee/optee/supp.c
> > > > @@ -7,6 +7,15 @@
> > > >  #include <linux/uaccess.h>
> > > >  #include "optee_private.h"
> > > >
> > > > +/*
> > > > + * OP-TEE supplicant timeout, the user-space supplicant may get
> > > > + * crashed or killed while servicing an RPC call. This will just lead
> > > > + * to OP-TEE client hung indefinitely just waiting for supplicant to
> > > > + * serve requests which isn't expected. It is rather expected to fail
> > > > + * gracefully with a timeout which is long enough.
> > > > + */
> > > > +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> > > > +
> > > >  struct optee_supp_req {
> > > >         struct list_head link;
> > > >
> > > > @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
> > > >
> > > >         /* Abort all queued requests */
> > > >         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> > > > -               list_del(&req->link);
> > > > -               req->in_queue = false;
> > > > +               if (req->in_queue) {
> > > > +                       list_del(&req->link);
> > > > +                       req->in_queue = false;
> > > > +               }
> > > >                 req->ret = TEEC_ERROR_COMMUNICATION;
> > > >                 complete(&req->c);
> > > >         }
> > > > @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > >         struct optee_supp_req *req;
> > > >         bool interruptable;
> > > >         u32 ret;
> > > > +       int res = 1;
> > > >
> > > >         /*
> > > >          * Return in case there is no supplicant available and
> > > > @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > >         /* Tell an eventual waiter there's a new request */
> > > >         complete(&supp->reqs_c);
> > > >
> > > > -       /*
> > > > -        * Wait for supplicant to process and return result, once we've
> > > > -        * returned from wait_for_completion(&req->c) successfully we have
> > > > -        * exclusive access again.
> > > > -        */
> > > > -       while (wait_for_completion_interruptible(&req->c)) {
> > > > +       /* Wait for supplicant to process and return result */
> > > > +       while (res) {
> > > > +               res = wait_for_completion_interruptible_timeout(&req->c,
> > > > +                                                               SUPP_TIMEOUT);
> > > > +               /* Check if supplicant served the request */
> > > > +               if (res > 0)
> > > > +                       break;
> > > > +
> > > >                 mutex_lock(&supp->mutex);
> > > > +               /*
> > > > +                * There's no supplicant available and since the supp->mutex
> > > > +                * currently is held none can become available until the mutex
> > > > +                * released again.
> > > > +                *
> > > > +                * Interrupting an RPC to supplicant is only allowed as a way
> > > > +                * of slightly improving the user experience in case the
> > > > +                * supplicant hasn't been started yet. During normal operation
> > > > +                * the supplicant will serve all requests in a timely manner and
> > > > +                * interrupting then wouldn't make sense.
> > > > +                */
> > > >                 interruptable = !supp->ctx;
> > > > -               if (interruptable) {
> > > > -                       /*
> > > > -                        * There's no supplicant available and since the
> > > > -                        * supp->mutex currently is held none can
> > > > -                        * become available until the mutex released
> > > > -                        * again.
> > > > -                        *
> > > > -                        * Interrupting an RPC to supplicant is only
> > > > -                        * allowed as a way of slightly improving the user
> > > > -                        * experience in case the supplicant hasn't been
> > > > -                        * started yet. During normal operation the supplicant
> > > > -                        * will serve all requests in a timely manner and
> > > > -                        * interrupting then wouldn't make sense.
> > > > -                        */
> > > > +               if (interruptable || (res == 0)) {
> > >
> > > Are you fixing an observed problem or a theoretical one?
> >
> > It is an observed problem, I was able to reproduce it using following
> > sequence with OP-TEE buildroot setup:
> >
> > $ xtest 600 &  // Run some secure storage tests using supplicant in
> > the background
> > $ kill -9 `pidof tee-supplicant` // Kill supplicant when the tests are
> > in progress.
> >
> > This will cause the xtest to hang up.
> >
> > > If the
> > > supplicant has died then "interruptable" is expected to be true so the
> > > timeout shouldn't matter.
> >
> > When the supplicant dies, it doesn't lead to releasing the supplicant
> > context in the above test scenario. The reason is probably the
> > supplicant shared memory reference is held by OP-TEE which is in turn
> > is holding a reference to supplicant context.
>
> This sounds like the problem Amirreza is trying to solve for the
> QCOMTEE driver. If we could get the supplicant context removed soon
> after the supplicant has died we wouldn't need this, except that we
> may need some trick to avoid ignoring an eventual signal received
> while tee-supplicant is dying.

That would be an improvement but it may still get unnoticed in future
once something else starts ref counting the supplicant context.

>
> Wait, would it work to break the loop on SIGKILL? It's an uncatchable
> signal so there's no reason for the calling process to wait anyway.

I agree this can be one way to solve the issue when the supplicant
gets killed but what if the supplicant gets crashed then it will be
another signal to handle. This approach sounds error prone to me as we
might miss a corner case.

So the question here is why do we need an infinite wait loop for the
supplicant which breaks only if we receive events from the user-space?
Isn't it rather robust for the kernel to have a bounded supplicant
wait loop? Do you have any particular use-case where this bounded wait
loop won't suffice?

-Sumit

>
> Cheers,
> Jens
>
> >
> > -Sumit
> >
> > >
> > > Cheers,
> > > Jens
> > >
> > > >                         if (req->in_queue) {
> > > >                                 list_del(&req->link);
> > > >                                 req->in_queue = false;
> > > > @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > >                         req->ret = TEEC_ERROR_COMMUNICATION;
> > > >                         break;
> > > >                 }
> > > > +               if (res == 0)
> > > > +                       req->ret = TEE_ERROR_TIMEOUT;
> > > >         }
> > > >
> > > >         ret = req->ret;
> > > > --
> > > > 2.43.0
> > > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2024-12-18 11:07       ` Sumit Garg
@ 2024-12-18 13:34         ` Jens Wiklander
  2024-12-18 14:41           ` Sumit Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Wiklander @ 2024-12-18 13:34 UTC (permalink / raw)
  To: Sumit Garg; +Cc: op-tee, jerome.forissier, linux-kernel, Erik Schilling

On Wed, Dec 18, 2024 at 12:07 PM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> On Wed, 18 Dec 2024 at 16:02, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> >
> > On Wed, Dec 18, 2024 at 8:30 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > + Erik
> > >
> > > Hi Jens,
> > >
> > > On Wed, 18 Dec 2024 at 12:27, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > > >
> > > > Hi Sumit,
> > > >
> > > > On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > >
> > > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > > > It becomes more complicated when there is incorrect shutdown ordering
> > > > > of the supplicant process vs the OP-TEE client application which can
> > > > > eventually lead to system hang-up waiting for the closure of the client
> > > > > application.
> > > > >
> > > > > In order to gracefully handle this scenario, let's add a long enough
> > > > > timeout to wait for supplicant to process requests. In case there is a
> > > > > timeout then we return a proper error code for the RPC request.
> > > > >
> > > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > > > ---
> > > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > > > >
> > > > > diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> > > > > index 322a543b8c27..92e86ac4cdd4 100644
> > > > > --- a/drivers/tee/optee/supp.c
> > > > > +++ b/drivers/tee/optee/supp.c
> > > > > @@ -7,6 +7,15 @@
> > > > >  #include <linux/uaccess.h>
> > > > >  #include "optee_private.h"
> > > > >
> > > > > +/*
> > > > > + * OP-TEE supplicant timeout, the user-space supplicant may get
> > > > > + * crashed or killed while servicing an RPC call. This will just lead
> > > > > + * to OP-TEE client hung indefinitely just waiting for supplicant to
> > > > > + * serve requests which isn't expected. It is rather expected to fail
> > > > > + * gracefully with a timeout which is long enough.
> > > > > + */
> > > > > +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> > > > > +
> > > > >  struct optee_supp_req {
> > > > >         struct list_head link;
> > > > >
> > > > > @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
> > > > >
> > > > >         /* Abort all queued requests */
> > > > >         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> > > > > -               list_del(&req->link);
> > > > > -               req->in_queue = false;
> > > > > +               if (req->in_queue) {
> > > > > +                       list_del(&req->link);
> > > > > +                       req->in_queue = false;
> > > > > +               }
> > > > >                 req->ret = TEEC_ERROR_COMMUNICATION;
> > > > >                 complete(&req->c);
> > > > >         }
> > > > > @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > > >         struct optee_supp_req *req;
> > > > >         bool interruptable;
> > > > >         u32 ret;
> > > > > +       int res = 1;
> > > > >
> > > > >         /*
> > > > >          * Return in case there is no supplicant available and
> > > > > @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > > >         /* Tell an eventual waiter there's a new request */
> > > > >         complete(&supp->reqs_c);
> > > > >
> > > > > -       /*
> > > > > -        * Wait for supplicant to process and return result, once we've
> > > > > -        * returned from wait_for_completion(&req->c) successfully we have
> > > > > -        * exclusive access again.
> > > > > -        */
> > > > > -       while (wait_for_completion_interruptible(&req->c)) {
> > > > > +       /* Wait for supplicant to process and return result */
> > > > > +       while (res) {
> > > > > +               res = wait_for_completion_interruptible_timeout(&req->c,
> > > > > +                                                               SUPP_TIMEOUT);
> > > > > +               /* Check if supplicant served the request */
> > > > > +               if (res > 0)
> > > > > +                       break;
> > > > > +
> > > > >                 mutex_lock(&supp->mutex);
> > > > > +               /*
> > > > > +                * There's no supplicant available and since the supp->mutex
> > > > > +                * currently is held none can become available until the mutex
> > > > > +                * released again.
> > > > > +                *
> > > > > +                * Interrupting an RPC to supplicant is only allowed as a way
> > > > > +                * of slightly improving the user experience in case the
> > > > > +                * supplicant hasn't been started yet. During normal operation
> > > > > +                * the supplicant will serve all requests in a timely manner and
> > > > > +                * interrupting then wouldn't make sense.
> > > > > +                */
> > > > >                 interruptable = !supp->ctx;
> > > > > -               if (interruptable) {
> > > > > -                       /*
> > > > > -                        * There's no supplicant available and since the
> > > > > -                        * supp->mutex currently is held none can
> > > > > -                        * become available until the mutex released
> > > > > -                        * again.
> > > > > -                        *
> > > > > -                        * Interrupting an RPC to supplicant is only
> > > > > -                        * allowed as a way of slightly improving the user
> > > > > -                        * experience in case the supplicant hasn't been
> > > > > -                        * started yet. During normal operation the supplicant
> > > > > -                        * will serve all requests in a timely manner and
> > > > > -                        * interrupting then wouldn't make sense.
> > > > > -                        */
> > > > > +               if (interruptable || (res == 0)) {
> > > >
> > > > Are you fixing an observed problem or a theoretical one?
> > >
> > > It is an observed problem, I was able to reproduce it using following
> > > sequence with OP-TEE buildroot setup:
> > >
> > > $ xtest 600 &  // Run some secure storage tests using supplicant in
> > > the background
> > > $ kill -9 `pidof tee-supplicant` // Kill supplicant when the tests are
> > > in progress.
> > >
> > > This will cause the xtest to hang up.
> > >
> > > > If the
> > > > supplicant has died then "interruptable" is expected to be true so the
> > > > timeout shouldn't matter.
> > >
> > > When the supplicant dies, it doesn't lead to releasing the supplicant
> > > context in the above test scenario. The reason is probably the
> > > supplicant shared memory reference is held by OP-TEE which is in turn
> > > is holding a reference to supplicant context.
> >
> > This sounds like the problem Amirreza is trying to solve for the
> > QCOMTEE driver. If we could get the supplicant context removed soon
> > after the supplicant has died we wouldn't need this, except that we
> > may need some trick to avoid ignoring an eventual signal received
> > while tee-supplicant is dying.
>
> That would be an improvement but it may still get unnoticed in future
> once something else starts ref counting the supplicant context.
>

That depends on the implementation. We were discussing adding a
callback when the file descriptor is closed.

> >
> > Wait, would it work to break the loop on SIGKILL? It's an uncatchable
> > signal so there's no reason for the calling process to wait anyway.
>
> I agree this can be one way to solve the issue when the supplicant
> gets killed but what if the supplicant gets crashed then it will be
> another signal to handle. This approach sounds error prone to me as we
> might miss a corner case.
>
> So the question here is why do we need an infinite wait loop for the
> supplicant which breaks only if we receive events from the user-space?
> Isn't it rather robust for the kernel to have a bounded supplicant
> wait loop? Do you have any particular use-case where this bounded wait
> loop won't suffice?

It will make it tricky to debug tee-supplicant with GDB. Is there any
risk of timeout during suspend?

Cheers,
Jens

>
> -Sumit
>
> >
> > Cheers,
> > Jens
> >
> > >
> > > -Sumit
> > >
> > > >
> > > > Cheers,
> > > > Jens
> > > >
> > > > >                         if (req->in_queue) {
> > > > >                                 list_del(&req->link);
> > > > >                                 req->in_queue = false;
> > > > > @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > > >                         req->ret = TEEC_ERROR_COMMUNICATION;
> > > > >                         break;
> > > > >                 }
> > > > > +               if (res == 0)
> > > > > +                       req->ret = TEE_ERROR_TIMEOUT;
> > > > >         }
> > > > >
> > > > >         ret = req->ret;
> > > > > --
> > > > > 2.43.0
> > > > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2024-12-18 13:34         ` Jens Wiklander
@ 2024-12-18 14:41           ` Sumit Garg
  0 siblings, 0 replies; 18+ messages in thread
From: Sumit Garg @ 2024-12-18 14:41 UTC (permalink / raw)
  To: Jens Wiklander; +Cc: op-tee, jerome.forissier, linux-kernel, Erik Schilling

On Wed, 18 Dec 2024 at 19:04, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>
> On Wed, Dec 18, 2024 at 12:07 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > On Wed, 18 Dec 2024 at 16:02, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > >
> > > On Wed, Dec 18, 2024 at 8:30 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > > >
> > > > + Erik
> > > >
> > > > Hi Jens,
> > > >
> > > > On Wed, 18 Dec 2024 at 12:27, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > > > >
> > > > > Hi Sumit,
> > > > >
> > > > > On Fri, Dec 13, 2024 at 12:15 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > > >
> > > > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > > > > It becomes more complicated when there is incorrect shutdown ordering
> > > > > > of the supplicant process vs the OP-TEE client application which can
> > > > > > eventually lead to system hang-up waiting for the closure of the client
> > > > > > application.
> > > > > >
> > > > > > In order to gracefully handle this scenario, let's add a long enough
> > > > > > timeout to wait for supplicant to process requests. In case there is a
> > > > > > timeout then we return a proper error code for the RPC request.
> > > > > >
> > > > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > > > > ---
> > > > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > > > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> > > > > > index 322a543b8c27..92e86ac4cdd4 100644
> > > > > > --- a/drivers/tee/optee/supp.c
> > > > > > +++ b/drivers/tee/optee/supp.c
> > > > > > @@ -7,6 +7,15 @@
> > > > > >  #include <linux/uaccess.h>
> > > > > >  #include "optee_private.h"
> > > > > >
> > > > > > +/*
> > > > > > + * OP-TEE supplicant timeout, the user-space supplicant may get
> > > > > > + * crashed or killed while servicing an RPC call. This will just lead
> > > > > > + * to OP-TEE client hung indefinitely just waiting for supplicant to
> > > > > > + * serve requests which isn't expected. It is rather expected to fail
> > > > > > + * gracefully with a timeout which is long enough.
> > > > > > + */
> > > > > > +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> > > > > > +
> > > > > >  struct optee_supp_req {
> > > > > >         struct list_head link;
> > > > > >
> > > > > > @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
> > > > > >
> > > > > >         /* Abort all queued requests */
> > > > > >         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> > > > > > -               list_del(&req->link);
> > > > > > -               req->in_queue = false;
> > > > > > +               if (req->in_queue) {
> > > > > > +                       list_del(&req->link);
> > > > > > +                       req->in_queue = false;
> > > > > > +               }
> > > > > >                 req->ret = TEEC_ERROR_COMMUNICATION;
> > > > > >                 complete(&req->c);
> > > > > >         }
> > > > > > @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > > > >         struct optee_supp_req *req;
> > > > > >         bool interruptable;
> > > > > >         u32 ret;
> > > > > > +       int res = 1;
> > > > > >
> > > > > >         /*
> > > > > >          * Return in case there is no supplicant available and
> > > > > > @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > > > >         /* Tell an eventual waiter there's a new request */
> > > > > >         complete(&supp->reqs_c);
> > > > > >
> > > > > > -       /*
> > > > > > -        * Wait for supplicant to process and return result, once we've
> > > > > > -        * returned from wait_for_completion(&req->c) successfully we have
> > > > > > -        * exclusive access again.
> > > > > > -        */
> > > > > > -       while (wait_for_completion_interruptible(&req->c)) {
> > > > > > +       /* Wait for supplicant to process and return result */
> > > > > > +       while (res) {
> > > > > > +               res = wait_for_completion_interruptible_timeout(&req->c,
> > > > > > +                                                               SUPP_TIMEOUT);
> > > > > > +               /* Check if supplicant served the request */
> > > > > > +               if (res > 0)
> > > > > > +                       break;
> > > > > > +
> > > > > >                 mutex_lock(&supp->mutex);
> > > > > > +               /*
> > > > > > +                * There's no supplicant available and since the supp->mutex
> > > > > > +                * currently is held none can become available until the mutex
> > > > > > +                * released again.
> > > > > > +                *
> > > > > > +                * Interrupting an RPC to supplicant is only allowed as a way
> > > > > > +                * of slightly improving the user experience in case the
> > > > > > +                * supplicant hasn't been started yet. During normal operation
> > > > > > +                * the supplicant will serve all requests in a timely manner and
> > > > > > +                * interrupting then wouldn't make sense.
> > > > > > +                */
> > > > > >                 interruptable = !supp->ctx;
> > > > > > -               if (interruptable) {
> > > > > > -                       /*
> > > > > > -                        * There's no supplicant available and since the
> > > > > > -                        * supp->mutex currently is held none can
> > > > > > -                        * become available until the mutex released
> > > > > > -                        * again.
> > > > > > -                        *
> > > > > > -                        * Interrupting an RPC to supplicant is only
> > > > > > -                        * allowed as a way of slightly improving the user
> > > > > > -                        * experience in case the supplicant hasn't been
> > > > > > -                        * started yet. During normal operation the supplicant
> > > > > > -                        * will serve all requests in a timely manner and
> > > > > > -                        * interrupting then wouldn't make sense.
> > > > > > -                        */
> > > > > > +               if (interruptable || (res == 0)) {
> > > > >
> > > > > Are you fixing an observed problem or a theoretical one?
> > > >
> > > > It is an observed problem, I was able to reproduce it using following
> > > > sequence with OP-TEE buildroot setup:
> > > >
> > > > $ xtest 600 &  // Run some secure storage tests using supplicant in
> > > > the background
> > > > $ kill -9 `pidof tee-supplicant` // Kill supplicant when the tests are
> > > > in progress.
> > > >
> > > > This will cause the xtest to hang up.
> > > >
> > > > > If the
> > > > > supplicant has died then "interruptable" is expected to be true so the
> > > > > timeout shouldn't matter.
> > > >
> > > > When the supplicant dies, it doesn't lead to releasing the supplicant
> > > > context in the above test scenario. The reason is probably the
> > > > supplicant shared memory reference is held by OP-TEE which is in turn
> > > > is holding a reference to supplicant context.
> > >
> > > This sounds like the problem Amirreza is trying to solve for the
> > > QCOMTEE driver. If we could get the supplicant context removed soon
> > > after the supplicant has died we wouldn't need this, except that we
> > > may need some trick to avoid ignoring an eventual signal received
> > > while tee-supplicant is dying.
> >
> > That would be an improvement but it may still get unnoticed in future
> > once something else starts ref counting the supplicant context.
> >
>
> That depends on the implementation. We were discussing adding a
> callback when the file descriptor is closed.
>
> > >
> > > Wait, would it work to break the loop on SIGKILL? It's an uncatchable
> > > signal so there's no reason for the calling process to wait anyway.
> >
> > I agree this can be one way to solve the issue when the supplicant
> > gets killed but what if the supplicant gets crashed then it will be
> > another signal to handle. This approach sounds error prone to me as we
> > might miss a corner case.
> >
> > So the question here is why do we need an infinite wait loop for the
> > supplicant which breaks only if we receive events from the user-space?
> > Isn't it rather robust for the kernel to have a bounded supplicant
> > wait loop? Do you have any particular use-case where this bounded wait
> > loop won't suffice?
>
> It will make it tricky to debug tee-supplicant with GDB.

Debugging with GDB is always a bit tricky when you are debugging on
the kernel/user-space boundary.

> Is there any
> risk of timeout during suspend?

I don't think so since this timeout is per supplicant request and I
can't think of a scenario where the system suspend occurs without all
CPUs going into idle state first. That means there won't be any
supplicant request running at that point.

-Sumit

>
> Cheers,
> Jens
>
> >
> > -Sumit
> >
> > >
> > > Cheers,
> > > Jens
> > >
> > > >
> > > > -Sumit
> > > >
> > > > >
> > > > > Cheers,
> > > > > Jens
> > > > >
> > > > > >                         if (req->in_queue) {
> > > > > >                                 list_del(&req->link);
> > > > > >                                 req->in_queue = false;
> > > > > > @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> > > > > >                         req->ret = TEEC_ERROR_COMMUNICATION;
> > > > > >                         break;
> > > > > >                 }
> > > > > > +               if (res == 0)
> > > > > > +                       req->ret = TEE_ERROR_TIMEOUT;
> > > > > >         }
> > > > > >
> > > > > >         ret = req->ret;
> > > > > > --
> > > > > > 2.43.0
> > > > > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2024-12-13 11:14 [PATCH] tee: optee: Add support for supplicant timeout Sumit Garg
  2024-12-18  6:57 ` Jens Wiklander
@ 2025-01-20  9:24 ` Sumit Garg
  2025-01-20 13:31   ` Jens Wiklander
  1 sibling, 1 reply; 18+ messages in thread
From: Sumit Garg @ 2025-01-20  9:24 UTC (permalink / raw)
  To: jens.wiklander; +Cc: op-tee, jerome.forissier, linux-kernel

Hi Jens,

On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
>
> OP-TEE supplicant is a user-space daemon and it's possible for it
> being crashed or killed in the middle of processing an OP-TEE RPC call.
> It becomes more complicated when there is incorrect shutdown ordering
> of the supplicant process vs the OP-TEE client application which can
> eventually lead to system hang-up waiting for the closure of the client
> application.
>
> In order to gracefully handle this scenario, let's add a long enough
> timeout to wait for supplicant to process requests. In case there is a
> timeout then we return a proper error code for the RPC request.
>
> Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> ---
>  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
>  1 file changed, 36 insertions(+), 22 deletions(-)
>

Do you have any further comments here? Or is it fine for you to pick it up?

-Sumit

> diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> index 322a543b8c27..92e86ac4cdd4 100644
> --- a/drivers/tee/optee/supp.c
> +++ b/drivers/tee/optee/supp.c
> @@ -7,6 +7,15 @@
>  #include <linux/uaccess.h>
>  #include "optee_private.h"
>
> +/*
> + * OP-TEE supplicant timeout, the user-space supplicant may get
> + * crashed or killed while servicing an RPC call. This will just lead
> + * to OP-TEE client hung indefinitely just waiting for supplicant to
> + * serve requests which isn't expected. It is rather expected to fail
> + * gracefully with a timeout which is long enough.
> + */
> +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> +
>  struct optee_supp_req {
>         struct list_head link;
>
> @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
>
>         /* Abort all queued requests */
>         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> -               list_del(&req->link);
> -               req->in_queue = false;
> +               if (req->in_queue) {
> +                       list_del(&req->link);
> +                       req->in_queue = false;
> +               }
>                 req->ret = TEEC_ERROR_COMMUNICATION;
>                 complete(&req->c);
>         }
> @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
>         struct optee_supp_req *req;
>         bool interruptable;
>         u32 ret;
> +       int res = 1;
>
>         /*
>          * Return in case there is no supplicant available and
> @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
>         /* Tell an eventual waiter there's a new request */
>         complete(&supp->reqs_c);
>
> -       /*
> -        * Wait for supplicant to process and return result, once we've
> -        * returned from wait_for_completion(&req->c) successfully we have
> -        * exclusive access again.
> -        */
> -       while (wait_for_completion_interruptible(&req->c)) {
> +       /* Wait for supplicant to process and return result */
> +       while (res) {
> +               res = wait_for_completion_interruptible_timeout(&req->c,
> +                                                               SUPP_TIMEOUT);
> +               /* Check if supplicant served the request */
> +               if (res > 0)
> +                       break;
> +
>                 mutex_lock(&supp->mutex);
> +               /*
> +                * There's no supplicant available and since the supp->mutex
> +                * currently is held none can become available until the mutex
> +                * released again.
> +                *
> +                * Interrupting an RPC to supplicant is only allowed as a way
> +                * of slightly improving the user experience in case the
> +                * supplicant hasn't been started yet. During normal operation
> +                * the supplicant will serve all requests in a timely manner and
> +                * interrupting then wouldn't make sense.
> +                */
>                 interruptable = !supp->ctx;
> -               if (interruptable) {
> -                       /*
> -                        * There's no supplicant available and since the
> -                        * supp->mutex currently is held none can
> -                        * become available until the mutex released
> -                        * again.
> -                        *
> -                        * Interrupting an RPC to supplicant is only
> -                        * allowed as a way of slightly improving the user
> -                        * experience in case the supplicant hasn't been
> -                        * started yet. During normal operation the supplicant
> -                        * will serve all requests in a timely manner and
> -                        * interrupting then wouldn't make sense.
> -                        */
> +               if (interruptable || (res == 0)) {
>                         if (req->in_queue) {
>                                 list_del(&req->link);
>                                 req->in_queue = false;
> @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
>                         req->ret = TEEC_ERROR_COMMUNICATION;
>                         break;
>                 }
> +               if (res == 0)
> +                       req->ret = TEE_ERROR_TIMEOUT;
>         }
>
>         ret = req->ret;
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-20  9:24 ` Sumit Garg
@ 2025-01-20 13:31   ` Jens Wiklander
  2025-01-22  9:15     ` Sumit Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Wiklander @ 2025-01-20 13:31 UTC (permalink / raw)
  To: Sumit Garg; +Cc: op-tee, jerome.forissier, linux-kernel

On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> Hi Jens,
>
> On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > OP-TEE supplicant is a user-space daemon and it's possible for it
> > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > It becomes more complicated when there is incorrect shutdown ordering
> > of the supplicant process vs the OP-TEE client application which can
> > eventually lead to system hang-up waiting for the closure of the client
> > application.
> >
> > In order to gracefully handle this scenario, let's add a long enough
> > timeout to wait for supplicant to process requests. In case there is a
> > timeout then we return a proper error code for the RPC request.
> >
> > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > ---
> >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> >  1 file changed, 36 insertions(+), 22 deletions(-)
> >
>
> Do you have any further comments here? Or is it fine for you to pick it up?

I don't like the timeout, it's a bit too hacky.

Cheers,
Jens

>
> -Sumit
>
> > diff --git a/drivers/tee/optee/supp.c b/drivers/tee/optee/supp.c
> > index 322a543b8c27..92e86ac4cdd4 100644
> > --- a/drivers/tee/optee/supp.c
> > +++ b/drivers/tee/optee/supp.c
> > @@ -7,6 +7,15 @@
> >  #include <linux/uaccess.h>
> >  #include "optee_private.h"
> >
> > +/*
> > + * OP-TEE supplicant timeout, the user-space supplicant may get
> > + * crashed or killed while servicing an RPC call. This will just lead
> > + * to OP-TEE client hung indefinitely just waiting for supplicant to
> > + * serve requests which isn't expected. It is rather expected to fail
> > + * gracefully with a timeout which is long enough.
> > + */
> > +#define SUPP_TIMEOUT   (msecs_to_jiffies(10000))
> > +
> >  struct optee_supp_req {
> >         struct list_head link;
> >
> > @@ -52,8 +61,10 @@ void optee_supp_release(struct optee_supp *supp)
> >
> >         /* Abort all queued requests */
> >         list_for_each_entry_safe(req, req_tmp, &supp->reqs, link) {
> > -               list_del(&req->link);
> > -               req->in_queue = false;
> > +               if (req->in_queue) {
> > +                       list_del(&req->link);
> > +                       req->in_queue = false;
> > +               }
> >                 req->ret = TEEC_ERROR_COMMUNICATION;
> >                 complete(&req->c);
> >         }
> > @@ -82,6 +93,7 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> >         struct optee_supp_req *req;
> >         bool interruptable;
> >         u32 ret;
> > +       int res = 1;
> >
> >         /*
> >          * Return in case there is no supplicant available and
> > @@ -108,28 +120,28 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> >         /* Tell an eventual waiter there's a new request */
> >         complete(&supp->reqs_c);
> >
> > -       /*
> > -        * Wait for supplicant to process and return result, once we've
> > -        * returned from wait_for_completion(&req->c) successfully we have
> > -        * exclusive access again.
> > -        */
> > -       while (wait_for_completion_interruptible(&req->c)) {
> > +       /* Wait for supplicant to process and return result */
> > +       while (res) {
> > +               res = wait_for_completion_interruptible_timeout(&req->c,
> > +                                                               SUPP_TIMEOUT);
> > +               /* Check if supplicant served the request */
> > +               if (res > 0)
> > +                       break;
> > +
> >                 mutex_lock(&supp->mutex);
> > +               /*
> > +                * There's no supplicant available and since the supp->mutex
> > +                * currently is held none can become available until the mutex
> > +                * released again.
> > +                *
> > +                * Interrupting an RPC to supplicant is only allowed as a way
> > +                * of slightly improving the user experience in case the
> > +                * supplicant hasn't been started yet. During normal operation
> > +                * the supplicant will serve all requests in a timely manner and
> > +                * interrupting then wouldn't make sense.
> > +                */
> >                 interruptable = !supp->ctx;
> > -               if (interruptable) {
> > -                       /*
> > -                        * There's no supplicant available and since the
> > -                        * supp->mutex currently is held none can
> > -                        * become available until the mutex released
> > -                        * again.
> > -                        *
> > -                        * Interrupting an RPC to supplicant is only
> > -                        * allowed as a way of slightly improving the user
> > -                        * experience in case the supplicant hasn't been
> > -                        * started yet. During normal operation the supplicant
> > -                        * will serve all requests in a timely manner and
> > -                        * interrupting then wouldn't make sense.
> > -                        */
> > +               if (interruptable || (res == 0)) {
> >                         if (req->in_queue) {
> >                                 list_del(&req->link);
> >                                 req->in_queue = false;
> > @@ -141,6 +153,8 @@ u32 optee_supp_thrd_req(struct tee_context *ctx, u32 func, size_t num_params,
> >                         req->ret = TEEC_ERROR_COMMUNICATION;
> >                         break;
> >                 }
> > +               if (res == 0)
> > +                       req->ret = TEE_ERROR_TIMEOUT;
> >         }
> >
> >         ret = req->ret;
> > --
> > 2.43.0
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-20 13:31   ` Jens Wiklander
@ 2025-01-22  9:15     ` Sumit Garg
  2025-01-22 10:06       ` Jens Wiklander
  0 siblings, 1 reply; 18+ messages in thread
From: Sumit Garg @ 2025-01-22  9:15 UTC (permalink / raw)
  To: Jens Wiklander; +Cc: op-tee, jerome.forissier, linux-kernel

On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>
> On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > Hi Jens,
> >
> > On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > It becomes more complicated when there is incorrect shutdown ordering
> > > of the supplicant process vs the OP-TEE client application which can
> > > eventually lead to system hang-up waiting for the closure of the client
> > > application.
> > >
> > > In order to gracefully handle this scenario, let's add a long enough
> > > timeout to wait for supplicant to process requests. In case there is a
> > > timeout then we return a proper error code for the RPC request.
> > >
> > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > ---
> > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > >
> >
> > Do you have any further comments here? Or is it fine for you to pick it up?
>
> I don't like the timeout, it's a bit too hacky.
>

Can you please elaborate here as to why? Currently the kernel is
relying too heavily on tee-supplicant daemon to respond properly in an
unbounded loop. I think this is really a bug for devices in production
in scenarios where the tee-supplicant gets stuck for some reason or
crashes in the middle or gets killed. These can simply lead to system
hung up issues during shutdown requiring a hard power off or reset
which isn't expected from a sane system. So I rather consider the
unbounded wait loop for tee-supplicant a bug we have currently
requiring a fix to be backported.

So do you have a better suggestion to fix this in the mainline as well
as backported to stable releases?

-Sumit

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-22  9:15     ` Sumit Garg
@ 2025-01-22 10:06       ` Jens Wiklander
  2025-01-22 12:25         ` Sumit Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Wiklander @ 2025-01-22 10:06 UTC (permalink / raw)
  To: Sumit Garg; +Cc: op-tee, jerome.forissier, linux-kernel

On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> >
> > On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > Hi Jens,
> > >
> > > On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > >
> > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > > It becomes more complicated when there is incorrect shutdown ordering
> > > > of the supplicant process vs the OP-TEE client application which can
> > > > eventually lead to system hang-up waiting for the closure of the client
> > > > application.
> > > >
> > > > In order to gracefully handle this scenario, let's add a long enough
> > > > timeout to wait for supplicant to process requests. In case there is a
> > > > timeout then we return a proper error code for the RPC request.
> > > >
> > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > > ---
> > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > > >
> > >
> > > Do you have any further comments here? Or is it fine for you to pick it up?
> >
> > I don't like the timeout, it's a bit too hacky.
> >
>
> Can you please elaborate here as to why?

Tee-supplicant is supposed to respond in a timely manner. What is
timely manner depends on the use case, but you now want to say that
it's 10 seconds regardless of what's going on. This makes it
impossible to debug tee-supplicant with a debugger unless you're very
quick. It might also introduce random timouts in a system under a
heavy IO load.

> Currently the kernel is
> relying too heavily on tee-supplicant daemon to respond properly in an
> unbounded loop. I think this is really a bug for devices in production
> in scenarios where the tee-supplicant gets stuck for some reason or
> crashes in the middle or gets killed.

Tee-supplicant is a system daemon, the system depends heavily on it.
It should not be killable by unprivileged processes.
 What happens if init crashes or is killed?

> These can simply lead to system
> hung up issues during shutdown requiring a hard power off or reset
> which isn't expected from a sane system.

This is new information. Is the use case to avoid blocking for too
long during shutdown or reset? We may need to do something to ensure
that the tee-supplicant isn't killed before OP-TEE is done with it.
Enforcing this timeout during shutdown makes sense, but not during
normal operation.

> So I rather consider the
> unbounded wait loop for tee-supplicant a bug we have currently
> requiring a fix to be backported.

Yeah, but only during certain circumstances.

>
> So do you have a better suggestion to fix this in the mainline as well
> as backported to stable releases?

Let's start by finding out what problem you're trying to fix.

Cheers,
Jens

>
> -Sumit

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-22 10:06       ` Jens Wiklander
@ 2025-01-22 12:25         ` Sumit Garg
  2025-01-27  7:33           ` Jens Wiklander
  0 siblings, 1 reply; 18+ messages in thread
From: Sumit Garg @ 2025-01-22 12:25 UTC (permalink / raw)
  To: Jens Wiklander; +Cc: op-tee, jerome.forissier, linux-kernel

On Wed, 22 Jan 2025 at 15:36, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>
> On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > >
> > > On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > > >
> > > > Hi Jens,
> > > >
> > > > On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > >
> > > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > > > It becomes more complicated when there is incorrect shutdown ordering
> > > > > of the supplicant process vs the OP-TEE client application which can
> > > > > eventually lead to system hang-up waiting for the closure of the client
> > > > > application.
> > > > >
> > > > > In order to gracefully handle this scenario, let's add a long enough
> > > > > timeout to wait for supplicant to process requests. In case there is a
> > > > > timeout then we return a proper error code for the RPC request.
> > > > >
> > > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > > > ---
> > > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > > > >
> > > >
> > > > Do you have any further comments here? Or is it fine for you to pick it up?
> > >
> > > I don't like the timeout, it's a bit too hacky.
> > >
> >
> > Can you please elaborate here as to why?
>
> Tee-supplicant is supposed to respond in a timely manner. What is
> timely manner depends on the use case, but you now want to say that
> it's 10 seconds regardless of what's going on. This makes it
> impossible to debug tee-supplicant with a debugger unless you're very
> quick. It might also introduce random timouts in a system under a
> heavy IO load.

Although, initially I thought 10 seconds should be enough for any
user-space process to be considered hung but then I saw
DEFAULT_HUNG_TASK_TIMEOUT which is 120 seconds for a task to be
considered hung. How about rather a Kconfig option like
OPTEE_SUPPLICANT_HUNG_TIMEOUT which defaults to 120 seconds? It can be
configured as 0 to disable timeout entirely for debugging purposes.

>
> > Currently the kernel is
> > relying too heavily on tee-supplicant daemon to respond properly in an
> > unbounded loop. I think this is really a bug for devices in production
> > in scenarios where the tee-supplicant gets stuck for some reason or
> > crashes in the middle or gets killed.
>
> Tee-supplicant is a system daemon, the system depends heavily on it.
> It should not be killable by unprivileged processes.
>  What happens if init crashes or is killed?

I am not aware of any other place in the kernel where a kernel thread
does an unbounded loop for even other system daemons.

>
> > These can simply lead to system
> > hung up issues during shutdown requiring a hard power off or reset
> > which isn't expected from a sane system.
>
> This is new information. Is the use case to avoid blocking for too
> long during shutdown or reset? We may need to do something to ensure
> that the tee-supplicant isn't killed before OP-TEE is done with it.
> Enforcing this timeout during shutdown makes sense, but not during
> normal operation.

This was entirely the motivation of this patch, maybe the commit
description wasn't clear as I expected.

    OP-TEE supplicant is a user-space daemon and it's possible for it
    being crashed or killed in the middle of processing an OP-TEE RPC call.
    It becomes more complicated when there is incorrect shutdown ordering
    of the supplicant process vs the OP-TEE client application which can
    eventually lead to system hang-up waiting for the closure of the client
    application.

It's essentially trying to fix the system hang up problem during
shutdown/reboot. You can't always control user-space to do the right
thing but the golden rule is that the user-space shouldn't be able to
break or hang-up the kernel.

>
> > So I rather consider the
> > unbounded wait loop for tee-supplicant a bug we have currently
> > requiring a fix to be backported.
>
> Yeah, but only during certain circumstances.
>
> >
> > So do you have a better suggestion to fix this in the mainline as well
> > as backported to stable releases?
>
> Let's start by finding out what problem you're trying to fix.

Let me know if the Kconfig option as proposed above sounds reasonable to you.

-Sumit

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-22 12:25         ` Sumit Garg
@ 2025-01-27  7:33           ` Jens Wiklander
  2025-01-27  8:29             ` Arnd Bergmann
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Wiklander @ 2025-01-27  7:33 UTC (permalink / raw)
  To: Sumit Garg, Arnd Bergmann; +Cc: op-tee, jerome.forissier, linux-kernel

[+Arnd for a question below]

On Wed, Jan 22, 2025 at 1:25 PM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> On Wed, 22 Jan 2025 at 15:36, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> >
> > On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > > >
> > > > On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > >
> > > > > Hi Jens,
> > > > >
> > > > > On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > > >
> > > > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > > > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > > > > > It becomes more complicated when there is incorrect shutdown ordering
> > > > > > of the supplicant process vs the OP-TEE client application which can
> > > > > > eventually lead to system hang-up waiting for the closure of the client
> > > > > > application.
> > > > > >
> > > > > > In order to gracefully handle this scenario, let's add a long enough
> > > > > > timeout to wait for supplicant to process requests. In case there is a
> > > > > > timeout then we return a proper error code for the RPC request.
> > > > > >
> > > > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > > > > > ---
> > > > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > > > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > > > > >
> > > > >
> > > > > Do you have any further comments here? Or is it fine for you to pick it up?
> > > >
> > > > I don't like the timeout, it's a bit too hacky.
> > > >
> > >
> > > Can you please elaborate here as to why?
> >
> > Tee-supplicant is supposed to respond in a timely manner. What is
> > timely manner depends on the use case, but you now want to say that
> > it's 10 seconds regardless of what's going on. This makes it
> > impossible to debug tee-supplicant with a debugger unless you're very
> > quick. It might also introduce random timouts in a system under a
> > heavy IO load.
>
> Although, initially I thought 10 seconds should be enough for any
> user-space process to be considered hung but then I saw
> DEFAULT_HUNG_TASK_TIMEOUT which is 120 seconds for a task to be
> considered hung. How about rather a Kconfig option like
> OPTEE_SUPPLICANT_HUNG_TIMEOUT which defaults to 120 seconds? It can be
> configured as 0 to disable timeout entirely for debugging purposes.

Adding a timeout when a timeout isn't needed seems wrong, even if the
timeout is very long. Arnd, what do you think?

>
> >
> > > Currently the kernel is
> > > relying too heavily on tee-supplicant daemon to respond properly in an
> > > unbounded loop. I think this is really a bug for devices in production
> > > in scenarios where the tee-supplicant gets stuck for some reason or
> > > crashes in the middle or gets killed.
> >
> > Tee-supplicant is a system daemon, the system depends heavily on it.
> > It should not be killable by unprivileged processes.
> >  What happens if init crashes or is killed?
>
> I am not aware of any other place in the kernel where a kernel thread
> does an unbounded loop for even other system daemons.

I imagine that FUSE and NFS mounts behave similarly.

>
> >
> > > These can simply lead to system
> > > hung up issues during shutdown requiring a hard power off or reset
> > > which isn't expected from a sane system.
> >
> > This is new information. Is the use case to avoid blocking for too
> > long during shutdown or reset? We may need to do something to ensure
> > that the tee-supplicant isn't killed before OP-TEE is done with it.
> > Enforcing this timeout during shutdown makes sense, but not during
> > normal operation.
>
> This was entirely the motivation of this patch, maybe the commit
> description wasn't clear as I expected.
>
>     OP-TEE supplicant is a user-space daemon and it's possible for it
>     being crashed or killed in the middle of processing an OP-TEE RPC call.
>     It becomes more complicated when there is incorrect shutdown ordering
>     of the supplicant process vs the OP-TEE client application which can
>     eventually lead to system hang-up waiting for the closure of the client
>     application.
>
> It's essentially trying to fix the system hang up problem during
> shutdown/reboot. You can't always control user-space to do the right
> thing but the golden rule is that the user-space shouldn't be able to
> break or hang-up the kernel.
>
> >
> > > So I rather consider the
> > > unbounded wait loop for tee-supplicant a bug we have currently
> > > requiring a fix to be backported.
> >
> > Yeah, but only during certain circumstances.
> >
> > >
> > > So do you have a better suggestion to fix this in the mainline as well
> > > as backported to stable releases?
> >
> > Let's start by finding out what problem you're trying to fix.
>
> Let me know if the Kconfig option as proposed above sounds reasonable to you.

No one is going to bother with that config option, recompiling the
kernel to be able to debug tee-supplicant is a bit much.

Cheers,
Jens


>
> -Sumit

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-27  7:33           ` Jens Wiklander
@ 2025-01-27  8:29             ` Arnd Bergmann
  2025-01-28  8:19               ` Jens Wiklander
  0 siblings, 1 reply; 18+ messages in thread
From: Arnd Bergmann @ 2025-01-27  8:29 UTC (permalink / raw)
  To: Jens Wiklander, Sumit Garg; +Cc: op-tee, Jerome Forissier, linux-kernel

On Mon, Jan 27, 2025, at 08:33, Jens Wiklander wrote:
> [+Arnd for a question below]
>
> On Wed, Jan 22, 2025 at 1:25 PM Sumit Garg <sumit.garg@linaro.org> wrote:
>>
>> On Wed, 22 Jan 2025 at 15:36, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>> >
>> > On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg <sumit.garg@linaro.org> wrote:
>> > >
>> > > On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>> > > >
>> > > > On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
>> > > > >
>> > > > > Hi Jens,
>> > > > >
>> > > > > On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
>> > > > > >
>> > > > > > OP-TEE supplicant is a user-space daemon and it's possible for it
>> > > > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
>> > > > > > It becomes more complicated when there is incorrect shutdown ordering
>> > > > > > of the supplicant process vs the OP-TEE client application which can
>> > > > > > eventually lead to system hang-up waiting for the closure of the client
>> > > > > > application.
>> > > > > >
>> > > > > > In order to gracefully handle this scenario, let's add a long enough
>> > > > > > timeout to wait for supplicant to process requests. In case there is a
>> > > > > > timeout then we return a proper error code for the RPC request.
>> > > > > >
>> > > > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
>> > > > > > ---
>> > > > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
>> > > > > >  1 file changed, 36 insertions(+), 22 deletions(-)
>> > > > > >
>> > > > >
>> > > > > Do you have any further comments here? Or is it fine for you to pick it up?
>> > > >
>> > > > I don't like the timeout, it's a bit too hacky.
>> > > >
>> > >
>> > > Can you please elaborate here as to why?
>> >
>> > Tee-supplicant is supposed to respond in a timely manner. What is
>> > timely manner depends on the use case, but you now want to say that
>> > it's 10 seconds regardless of what's going on. This makes it
>> > impossible to debug tee-supplicant with a debugger unless you're very
>> > quick. It might also introduce random timouts in a system under a
>> > heavy IO load.
>>
>> Although, initially I thought 10 seconds should be enough for any
>> user-space process to be considered hung but then I saw
>> DEFAULT_HUNG_TASK_TIMEOUT which is 120 seconds for a task to be
>> considered hung. How about rather a Kconfig option like
>> OPTEE_SUPPLICANT_HUNG_TIMEOUT which defaults to 120 seconds? It can be
>> configured as 0 to disable timeout entirely for debugging purposes.
>
> Adding a timeout when a timeout isn't needed seems wrong, even if the
> timeout is very long. Arnd, what do you think?

It's hard to put an upper bound on user space latency.

As far as I can tell, even that DEFAULT_HUNG_TASK_TIMEOUT
limit is only for tasks that are in an unkillable state for more
than two minutes, but the supplicant not providing results
to the kernel could also happen when it waits in a killable
or interruptible state, or when it does multiple I/Os in a
row that each block for a time under 120 seconds.

A single sector write to an eMMC can easily take multiple
seconds by itself when nothing is going on and the device is
in need of garbage collection. If the system is already in
low memory or there are other tasks writing to the file system,
you can have many such I/O operations queued up in the device
when it adds another I/O to the back of the queue.

Looking at the function that Sumit suggested changing, I see
another problem, both before and after the patch:

        while (wait_for_completion_interruptible(&req->c)) {
                mutex_lock(&supp->mutex);
                interruptable = !supp->ctx;
                if (interruptable) {
...
                }
                mutex_unlock(&supp->mutex);

                if (interruptable) {
                        req->ret = TEEC_ERROR_COMMUNICATION;
                        break;
                }
        }

The "_interruptible()" wait makes no sense here if the
"interruptable" variable is unset: The task remains in
interrupted state, so the while() loop that was waiting
for the wake_up_interruptible() turns into a busy loop
if the task actually gets a signal.

If the task at this point is running at a realtime
priority, it would prevent the thing it is waiting for
from getting scheduled on the same CPU.

>> > > So do you have a better suggestion to fix this in the mainline as well
>> > > as backported to stable releases?
>> >
>> > Let's start by finding out what problem you're trying to fix.
>>
>> Let me know if the Kconfig option as proposed above sounds reasonable to you.
>
> No one is going to bother with that config option, recompiling the
> kernel to be able to debug tee-supplicant is a bit much.

If a timeout is needed, having it runtime configurable seems
more useful than a build time option.

      Arnd

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-27  8:29             ` Arnd Bergmann
@ 2025-01-28  8:19               ` Jens Wiklander
  2025-01-31 11:06                 ` Sumit Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Wiklander @ 2025-01-28  8:19 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Sumit Garg, op-tee, Jerome Forissier, linux-kernel

On Mon, Jan 27, 2025 at 9:29 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Mon, Jan 27, 2025, at 08:33, Jens Wiklander wrote:
> > [+Arnd for a question below]
> >
> > On Wed, Jan 22, 2025 at 1:25 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> >>
> >> On Wed, 22 Jan 2025 at 15:36, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> >> >
> >> > On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> >> > >
> >> > > On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> >> > > >
> >> > > > On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> >> > > > >
> >> > > > > Hi Jens,
> >> > > > >
> >> > > > > On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> >> > > > > >
> >> > > > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> >> > > > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> >> > > > > > It becomes more complicated when there is incorrect shutdown ordering
> >> > > > > > of the supplicant process vs the OP-TEE client application which can
> >> > > > > > eventually lead to system hang-up waiting for the closure of the client
> >> > > > > > application.
> >> > > > > >
> >> > > > > > In order to gracefully handle this scenario, let's add a long enough
> >> > > > > > timeout to wait for supplicant to process requests. In case there is a
> >> > > > > > timeout then we return a proper error code for the RPC request.
> >> > > > > >
> >> > > > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> >> > > > > > ---
> >> > > > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> >> > > > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> >> > > > > >
> >> > > > >
> >> > > > > Do you have any further comments here? Or is it fine for you to pick it up?
> >> > > >
> >> > > > I don't like the timeout, it's a bit too hacky.
> >> > > >
> >> > >
> >> > > Can you please elaborate here as to why?
> >> >
> >> > Tee-supplicant is supposed to respond in a timely manner. What is
> >> > timely manner depends on the use case, but you now want to say that
> >> > it's 10 seconds regardless of what's going on. This makes it
> >> > impossible to debug tee-supplicant with a debugger unless you're very
> >> > quick. It might also introduce random timouts in a system under a
> >> > heavy IO load.
> >>
> >> Although, initially I thought 10 seconds should be enough for any
> >> user-space process to be considered hung but then I saw
> >> DEFAULT_HUNG_TASK_TIMEOUT which is 120 seconds for a task to be
> >> considered hung. How about rather a Kconfig option like
> >> OPTEE_SUPPLICANT_HUNG_TIMEOUT which defaults to 120 seconds? It can be
> >> configured as 0 to disable timeout entirely for debugging purposes.
> >
> > Adding a timeout when a timeout isn't needed seems wrong, even if the
> > timeout is very long. Arnd, what do you think?
>
> It's hard to put an upper bound on user space latency.
>
> As far as I can tell, even that DEFAULT_HUNG_TASK_TIMEOUT
> limit is only for tasks that are in an unkillable state for more
> than two minutes, but the supplicant not providing results
> to the kernel could also happen when it waits in a killable
> or interruptible state, or when it does multiple I/Os in a
> row that each block for a time under 120 seconds.
>
> A single sector write to an eMMC can easily take multiple
> seconds by itself when nothing is going on and the device is
> in need of garbage collection. If the system is already in
> low memory or there are other tasks writing to the file system,
> you can have many such I/O operations queued up in the device
> when it adds another I/O to the back of the queue.

Adding a timeout means we must somehow handle them to avoid spurious errors.

>
> Looking at the function that Sumit suggested changing, I see
> another problem, both before and after the patch:
>
>         while (wait_for_completion_interruptible(&req->c)) {
>                 mutex_lock(&supp->mutex);
>                 interruptable = !supp->ctx;
>                 if (interruptable) {
> ...
>                 }
>                 mutex_unlock(&supp->mutex);
>
>                 if (interruptable) {
>                         req->ret = TEEC_ERROR_COMMUNICATION;
>                         break;
>                 }
>         }
>
> The "_interruptible()" wait makes no sense here if the
> "interruptable" variable is unset: The task remains in
> interrupted state, so the while() loop that was waiting
> for the wake_up_interruptible() turns into a busy loop
> if the task actually gets a signal.
>
> If the task at this point is running at a realtime
> priority, it would prevent the thing it is waiting for
> from getting scheduled on the same CPU.

I see the problem, thanks.

>
> >> > > So do you have a better suggestion to fix this in the mainline as well
> >> > > as backported to stable releases?
> >> >
> >> > Let's start by finding out what problem you're trying to fix.
> >>
> >> Let me know if the Kconfig option as proposed above sounds reasonable to you.
> >
> > No one is going to bother with that config option, recompiling the
> > kernel to be able to debug tee-supplicant is a bit much.
>
> If a timeout is needed, having it runtime configurable seems
> more useful than a build time option.

To address Sumit's issue of hung shutdowns and reboots, the timeout
could be activated only during shutdowns and reboots.

Thanks,
Jens

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-28  8:19               ` Jens Wiklander
@ 2025-01-31 11:06                 ` Sumit Garg
  2025-01-31 11:32                   ` Arnd Bergmann
  0 siblings, 1 reply; 18+ messages in thread
From: Sumit Garg @ 2025-01-31 11:06 UTC (permalink / raw)
  To: Jens Wiklander, Arnd Bergmann; +Cc: op-tee, Jerome Forissier, linux-kernel

Hi Arnd, Jens,

On Tue, 28 Jan 2025 at 13:49, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>
> On Mon, Jan 27, 2025 at 9:29 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Mon, Jan 27, 2025, at 08:33, Jens Wiklander wrote:
> > > [+Arnd for a question below]
> > >
> > > On Wed, Jan 22, 2025 at 1:25 PM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >>
> > >> On Wed, 22 Jan 2025 at 15:36, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > >> >
> > >> > On Wed, Jan 22, 2025 at 10:15 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >> > >
> > >> > > On Mon, 20 Jan 2025 at 19:01, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> > >> > > >
> > >> > > > On Mon, Jan 20, 2025 at 10:24 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >> > > > >
> > >> > > > > Hi Jens,
> > >> > > > >
> > >> > > > > On Fri, 13 Dec 2024 at 16:45, Sumit Garg <sumit.garg@linaro.org> wrote:
> > >> > > > > >
> > >> > > > > > OP-TEE supplicant is a user-space daemon and it's possible for it
> > >> > > > > > being crashed or killed in the middle of processing an OP-TEE RPC call.
> > >> > > > > > It becomes more complicated when there is incorrect shutdown ordering
> > >> > > > > > of the supplicant process vs the OP-TEE client application which can
> > >> > > > > > eventually lead to system hang-up waiting for the closure of the client
> > >> > > > > > application.
> > >> > > > > >
> > >> > > > > > In order to gracefully handle this scenario, let's add a long enough
> > >> > > > > > timeout to wait for supplicant to process requests. In case there is a
> > >> > > > > > timeout then we return a proper error code for the RPC request.
> > >> > > > > >
> > >> > > > > > Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
> > >> > > > > > ---
> > >> > > > > >  drivers/tee/optee/supp.c | 58 +++++++++++++++++++++++++---------------
> > >> > > > > >  1 file changed, 36 insertions(+), 22 deletions(-)
> > >> > > > > >
> > >> > > > >
> > >> > > > > Do you have any further comments here? Or is it fine for you to pick it up?
> > >> > > >
> > >> > > > I don't like the timeout, it's a bit too hacky.
> > >> > > >
> > >> > >
> > >> > > Can you please elaborate here as to why?
> > >> >
> > >> > Tee-supplicant is supposed to respond in a timely manner. What is
> > >> > timely manner depends on the use case, but you now want to say that
> > >> > it's 10 seconds regardless of what's going on. This makes it
> > >> > impossible to debug tee-supplicant with a debugger unless you're very
> > >> > quick. It might also introduce random timouts in a system under a
> > >> > heavy IO load.
> > >>
> > >> Although, initially I thought 10 seconds should be enough for any
> > >> user-space process to be considered hung but then I saw
> > >> DEFAULT_HUNG_TASK_TIMEOUT which is 120 seconds for a task to be
> > >> considered hung. How about rather a Kconfig option like
> > >> OPTEE_SUPPLICANT_HUNG_TIMEOUT which defaults to 120 seconds? It can be
> > >> configured as 0 to disable timeout entirely for debugging purposes.
> > >
> > > Adding a timeout when a timeout isn't needed seems wrong, even if the
> > > timeout is very long. Arnd, what do you think?
> >
> > It's hard to put an upper bound on user space latency.
> >
> > As far as I can tell, even that DEFAULT_HUNG_TASK_TIMEOUT
> > limit is only for tasks that are in an unkillable state for more
> > than two minutes, but the supplicant not providing results
> > to the kernel could also happen when it waits in a killable
> > or interruptible state, or when it does multiple I/Os in a
> > row that each block for a time under 120 seconds.

The supplicant itself can be in a killable state but the client
application which will be waiting for it in the kernel is in
unkillable state. So if a shutdown/reboot is issued at that point
which has to kill the client application first and then the
tee-supplicant, it will just cause the system to hang up. That is a
real problem that I am trying to address here.

> >
> > A single sector write to an eMMC can easily take multiple
> > seconds by itself when nothing is going on and the device is
> > in need of garbage collection. If the system is already in
> > low memory or there are other tasks writing to the file system,
> > you can have many such I/O operations queued up in the device
> > when it adds another I/O to the back of the queue.
>
> Adding a timeout means we must somehow handle them to avoid spurious errors.
>

The timeout I am proposing here as default being 2 minutes is just for
a single tee-supplicant RPC request. I suppose that should be
sufficient to avoid any spurious errors.

> >
> > Looking at the function that Sumit suggested changing, I see
> > another problem, both before and after the patch:
> >
> >         while (wait_for_completion_interruptible(&req->c)) {
> >                 mutex_lock(&supp->mutex);
> >                 interruptable = !supp->ctx;
> >                 if (interruptable) {
> > ...
> >                 }
> >                 mutex_unlock(&supp->mutex);
> >
> >                 if (interruptable) {
> >                         req->ret = TEEC_ERROR_COMMUNICATION;
> >                         break;
> >                 }
> >         }
> >
> > The "_interruptible()" wait makes no sense here if the
> > "interruptable" variable is unset: The task remains in
> > interrupted state, so the while() loop that was waiting
> > for the wake_up_interruptible() turns into a busy loop
> > if the task actually gets a signal.
> >
> > If the task at this point is running at a realtime
> > priority, it would prevent the thing it is waiting for
> > from getting scheduled on the same CPU.
>
> I see the problem, thanks.

Thanks Arnd for spotting it, I will drop the "_interruptible()" bit in
the next version.

>
> >
> > >> > > So do you have a better suggestion to fix this in the mainline as well
> > >> > > as backported to stable releases?
> > >> >
> > >> > Let's start by finding out what problem you're trying to fix.
> > >>
> > >> Let me know if the Kconfig option as proposed above sounds reasonable to you.
> > >
> > > No one is going to bother with that config option, recompiling the
> > > kernel to be able to debug tee-supplicant is a bit much.
> >
> > If a timeout is needed, having it runtime configurable seems
> > more useful than a build time option.
>

I liked the idea of a runtime configurable timeout option as module
parameter. I will use that instead.

> To address Sumit's issue of hung shutdowns and reboots, the timeout
> could be activated only during shutdowns and reboots.

You never know if the client application gets hung for tee-supplicant
before or after a shutdown/reboot is issued.

-Sumit

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-31 11:06                 ` Sumit Garg
@ 2025-01-31 11:32                   ` Arnd Bergmann
  2025-02-03  8:03                     ` Sumit Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Arnd Bergmann @ 2025-01-31 11:32 UTC (permalink / raw)
  To: Sumit Garg, Jens Wiklander; +Cc: op-tee, Jerome Forissier, linux-kernel

On Fri, Jan 31, 2025, at 12:06, Sumit Garg wrote:
> On Tue, 28 Jan 2025 at 13:49, Jens Wiklander <jens.wiklander@linaro.org> wrote:
>> >
>> > As far as I can tell, even that DEFAULT_HUNG_TASK_TIMEOUT
>> > limit is only for tasks that are in an unkillable state for more
>> > than two minutes, but the supplicant not providing results
>> > to the kernel could also happen when it waits in a killable
>> > or interruptible state, or when it does multiple I/Os in a
>> > row that each block for a time under 120 seconds.
>
> The supplicant itself can be in a killable state but the client
> application which will be waiting for it in the kernel is in
> unkillable state. So if a shutdown/reboot is issued at that point
> which has to kill the client application first and then the
> tee-supplicant, it will just cause the system to hang up. That is a
> real problem that I am trying to address here.

Is the client the one waiting in optee_supp_thrd_req(), or
is that the supplicant?

If the problem is the client being unkillable at the moment,
could you address that by using wait_even_killable() or
mutex_lock_killable() and just bail out safely? I assume
at the point the system is rebooting, there is no need to
wait for the supplicant to complete, other than making sure
the function can return without running into undefined state
of the kernel.

>> > A single sector write to an eMMC can easily take multiple
>> > seconds by itself when nothing is going on and the device is
>> > in need of garbage collection. If the system is already in
>> > low memory or there are other tasks writing to the file system,
>> > you can have many such I/O operations queued up in the device
>> > when it adds another I/O to the back of the queue.
>>
>> Adding a timeout means we must somehow handle them to avoid spurious errors.
>>
>
> The timeout I am proposing here as default being 2 minutes is just for
> a single tee-supplicant RPC request. I suppose that should be
> sufficient to avoid any spurious errors.

It won't guarantee that, but if there is a 2 minute delay, that
likely means that the system has become quite unusable from being
slow, even if it hasn't otherwise misbehaved to that point.

     Arnd

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tee: optee: Add support for supplicant timeout
  2025-01-31 11:32                   ` Arnd Bergmann
@ 2025-02-03  8:03                     ` Sumit Garg
  0 siblings, 0 replies; 18+ messages in thread
From: Sumit Garg @ 2025-02-03  8:03 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Jens Wiklander, op-tee, Jerome Forissier, linux-kernel

On Fri, 31 Jan 2025 at 17:03, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Jan 31, 2025, at 12:06, Sumit Garg wrote:
> > On Tue, 28 Jan 2025 at 13:49, Jens Wiklander <jens.wiklander@linaro.org> wrote:
> >> >
> >> > As far as I can tell, even that DEFAULT_HUNG_TASK_TIMEOUT
> >> > limit is only for tasks that are in an unkillable state for more
> >> > than two minutes, but the supplicant not providing results
> >> > to the kernel could also happen when it waits in a killable
> >> > or interruptible state, or when it does multiple I/Os in a
> >> > row that each block for a time under 120 seconds.
> >
> > The supplicant itself can be in a killable state but the client
> > application which will be waiting for it in the kernel is in
> > unkillable state. So if a shutdown/reboot is issued at that point
> > which has to kill the client application first and then the
> > tee-supplicant, it will just cause the system to hang up. That is a
> > real problem that I am trying to address here.
>
> Is the client the one waiting in optee_supp_thrd_req(), or
> is that the supplicant?

It's the client only.

>
> If the problem is the client being unkillable at the moment,
> could you address that by using wait_even_killable() or
> mutex_lock_killable() and just bail out safely? I assume
> at the point the system is rebooting, there is no need to
> wait for the supplicant to complete, other than making sure
> the function can return without running into undefined state
> of the kernel.

Thanks for the suggestion, it helps to resolve the shutdown/reboot
hung up issue we were observing. Incorporated your suggestion in v2.

>
> >> > A single sector write to an eMMC can easily take multiple
> >> > seconds by itself when nothing is going on and the device is
> >> > in need of garbage collection. If the system is already in
> >> > low memory or there are other tasks writing to the file system,
> >> > you can have many such I/O operations queued up in the device
> >> > when it adds another I/O to the back of the queue.
> >>
> >> Adding a timeout means we must somehow handle them to avoid spurious errors.
> >>
> >
> > The timeout I am proposing here as default being 2 minutes is just for
> > a single tee-supplicant RPC request. I suppose that should be
> > sufficient to avoid any spurious errors.
>
> It won't guarantee that, but if there is a 2 minute delay, that
> likely means that the system has become quite unusable from being
> slow, even if it hasn't otherwise misbehaved to that point.

Agree.

-Sumit

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-02-03  8:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-13 11:14 [PATCH] tee: optee: Add support for supplicant timeout Sumit Garg
2024-12-18  6:57 ` Jens Wiklander
2024-12-18  7:30   ` Sumit Garg
2024-12-18 10:32     ` Jens Wiklander
2024-12-18 11:07       ` Sumit Garg
2024-12-18 13:34         ` Jens Wiklander
2024-12-18 14:41           ` Sumit Garg
2025-01-20  9:24 ` Sumit Garg
2025-01-20 13:31   ` Jens Wiklander
2025-01-22  9:15     ` Sumit Garg
2025-01-22 10:06       ` Jens Wiklander
2025-01-22 12:25         ` Sumit Garg
2025-01-27  7:33           ` Jens Wiklander
2025-01-27  8:29             ` Arnd Bergmann
2025-01-28  8:19               ` Jens Wiklander
2025-01-31 11:06                 ` Sumit Garg
2025-01-31 11:32                   ` Arnd Bergmann
2025-02-03  8:03                     ` Sumit Garg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).