All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jack Pham <quic_jackp@quicinc.com>
To: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Cc: Linyu Yuan <quic_linyyuan@quicinc.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	<linux-usb@vger.kernel.org>,
	Wesley Cheng <quic_wcheng@quicinc.com>,
	Subbaraman Narayanamurthy <quic_subbaram@quicinc.com>,
	Fenglin Wu <quic_fenglinw@quicinc.com>
Subject: Re: [PATCH v2] usb: ucsi: fix connector partner ucsi work issue
Date: Mon, 9 Jan 2023 12:46:11 -0800	[thread overview]
Message-ID: <20230109204611.GE28337@jackp-linux.qualcomm.com> (raw)
In-Reply-To: <Y7v/Qk+dyDYyCGT4@kuha.fi.intel.com>

On Mon, Jan 09, 2023 at 01:49:22PM +0200, Heikki Krogerus wrote:
> Hi,
> 
> On Thu, Jan 05, 2023 at 10:34:41AM -0800, Jack Pham wrote:
> > On Thu, Jan 05, 2023 at 01:27:07PM +0200, Heikki Krogerus wrote:
> > > On Thu, Jan 05, 2023 at 01:42:40PM +0800, Linyu Yuan wrote:
> > > > When a PPM client unregisters with UCSI framework, connector specific work
> > > > queue is destroyed. However, a pending delayed work queued before may
> > > > still exist. Once the delay timer expires and the work is scheduled,
> > > > this can cause a system crash as the workqueue is destroyed already.
> > > 
> > > When the workqueue is destroyed it's also flushed, no? So how could
> > > there be still something pending, delayed or not? Did you actually see
> > > this happening?
> > 
> > Yes, we encountered this during a scenario in which our PPM firmware 
> > is undergoing a reset which is handled by calling ucsi_unregister().
> > The connectors' workqueues are destroyed but apparently the
> > destroy_workqueue() does *not* seem to take care pending delayed items.
> > 
> > 	Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> > 	...
> > 	Call trace:
> > 	 __queue_work+0x1f4/0x550
> > 	 delayed_work_timer_fn+0x28/0x38
> > 	 call_timer_fn+0x3c/0x238
> > 	 expire_timers+0xcc/0x168
> > 	 __run_timers+0x194/0x1f8
> > 	 run_timer_softirq+0x2c/0x54
> > 	 _stext+0xec/0x3a8
> > 	 __irq_exit_rcu+0xa0/0xfc
> > 	 irq_exit_rcu+0x18/0x28
> > 	 el0_interrupt+0x5c/0x138
> > 	 __el0_irq_handler_common+0x20/0x30
> > 	 el0t_64_irq_handler+0x18/0x28
> > 	 el0t_64_irq+0x1a0/0x1a4
> > 	Code: eb16013f 54000300 aa1a03e0 9441be2a (f9400280) 
> > 
> > In particular this is happening for the ucsi_check_connection() which is
> > the currently the only work item using a non-zero delay.  If we look
> > closely at queue_delayed_work() we can see in that case it defers by
> > using a separate timer:
> > 
> > 	static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
> > 					struct delayed_work *dwork, unsigned long delay)
> > 	{
> > 		struct timer_list *timer = &dwork->timer;
> > 		struct work_struct *work = &dwork->work;
> > 
> > 		<...snip...>
> > 
> > 		/*
> > 		 * If @delay is 0, queue @dwork->work immediately.  This is for
> > 		 * both optimization and correctness.  The earliest @timer can
> > 		 * expire is on the closest next tick and delayed_work users depend
> > 		 * on that there's no such delay when @delay is 0.
> > 		 */
> > 		if (!delay) {
> > 			__queue_work(cpu, wq, &dwork->work);
> > 			return;
> > 		}
> > 
> > 		dwork->wq = wq;
> > 		       ^^^^^^^^ wq gets saved here, but might be
> > 				destroyed before timer expires
> > 
> > 		dwork->cpu = cpu;
> > 		timer->expires = jiffies + delay;
> > 
> > 		if (unlikely(cpu != WORK_CPU_UNBOUND))
> > 			add_timer_on(timer, cpu);
> > 		else
> > 			add_timer(timer);
> > 	}
> > 
> > In ucsi_unregister() we destroy the connector's wq, but there is a
> > pending timer still for the ucsi_check_connection() item and upon
> > expiry it tries to do the real __queue_work() call on a dangling 'wq'.
> 
> Okay.
> 
> > > > Fix this by moving all partner related delayed work to connector instance
> > > > and cancel all of them when ucsi_unregister() is called by PPM client.
> > > 
> > > I would love to be able to cancel these works, though not because of
> > > driver removal - I'm more concerned about disconnections. The reason
> > > why each task is a unique work is because it allows the driver to add
> > > the same task to the queue as many times as needed, and that was
> > > needed inorder to recover from some firmware issues. If there's only a
> > > dedicated work per task like in your proposal, we can only schedule
> > > the task once at a time, and that may lead into other issues.
> > 
> > I see, queuing a task multiple times is a good reason to allocate the
> > workers dynamically.  Then what we really need is a way to reliably
> > cancel & reclaim any pending items that are sitting on their own timers,
> > since they are otherwise unreachable via the 'wq'. 
> > 
> > cancel_delayed_work(), cancel_delayed_work_sync(), flush_delayed_work()
> > all require the handle to the delayed_work itself which we don't keep a
> > reference to.
> > 
> > Do you have any other suggestions for this?
> 
> How about separate list for the works?
> 
> diff --git a/drivers/usb/typec/ucsi/ucsi.c b/drivers/usb/typec/ucsi/ucsi.c
> index 8695eb2c6227e..d5cf7573a2cfa 100644
> --- a/drivers/usb/typec/ucsi/ucsi.c
> +++ b/drivers/usb/typec/ucsi/ucsi.c
> @@ -187,6 +187,7 @@ EXPORT_SYMBOL_GPL(ucsi_send_command);
>  
>  struct ucsi_work {
>         struct delayed_work work;
> +       struct list_head node;
>         unsigned long delay;
>         unsigned int count;
>         struct ucsi_connector *con;
> @@ -202,6 +203,7 @@ static void ucsi_poll_worker(struct work_struct *work)
>         mutex_lock(&con->lock);
>  
>         if (!con->partner) {
> +               list_del(&uwork->node);
>                 mutex_unlock(&con->lock);
>                 kfree(uwork);
>                 return;
> @@ -209,10 +211,12 @@ static void ucsi_poll_worker(struct work_struct *work)
>  
>         ret = uwork->cb(con);
>  
> -       if (uwork->count-- && (ret == -EBUSY || ret == -ETIMEDOUT))
> +       if (uwork->count-- && (ret == -EBUSY || ret == -ETIMEDOUT)) {
>                 queue_delayed_work(con->wq, &uwork->work, uwork->delay);
> -       else
> +       } else {
> +               list_del(&uwork->node);
>                 kfree(uwork);
> +       }
>  
>         mutex_unlock(&con->lock);
>  }
> @@ -236,6 +240,7 @@ static int ucsi_partner_task(struct ucsi_connector *con,
>         uwork->con = con;
>         uwork->cb = cb;
>  
> +       list_add_tail(&uwork->node, &con->works);
>         queue_delayed_work(con->wq, &uwork->work, delay);
>  
>         return 0;
> @@ -1056,6 +1061,7 @@ static int ucsi_register_port(struct ucsi *ucsi, int index)
>         INIT_WORK(&con->work, ucsi_handle_connector_change);
>         init_completion(&con->complete);
>         mutex_init(&con->lock);
> +       INIT_LIST_HEAD(&con->works);
>         con->num = index + 1;
>         con->ucsi = ucsi;
>  
> diff --git a/drivers/usb/typec/ucsi/ucsi.h b/drivers/usb/typec/ucsi/ucsi.h
> index 8361e1cfc8eba..dcb792eeedb94 100644
> --- a/drivers/usb/typec/ucsi/ucsi.h
> +++ b/drivers/usb/typec/ucsi/ucsi.h
> @@ -325,6 +325,7 @@ struct ucsi_connector {
>         struct work_struct work;
>         struct completion complete;
>         struct workqueue_struct *wq;
> +       struct list_head works;
>  
>         struct typec_port *port;
>         struct typec_partner *partner;
> 
> 
> Something like that. Then just walk through the list and cancel each
> work in it before destroying the wq. Would that work?

Thanks Heikki for the suggestion!  I think it should work (plus the
actual list walk in ucsi_unregister).  We will give it a try and will
send a v2 if it works out.

Jack

      reply	other threads:[~2023-01-09 20:46 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-05  5:42 [PATCH v2] usb: ucsi: fix connector partner ucsi work issue Linyu Yuan
2023-01-05 11:27 ` Heikki Krogerus
2023-01-05 18:34   ` Jack Pham
2023-01-09 11:49     ` Heikki Krogerus
2023-01-09 20:46       ` Jack Pham [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230109204611.GE28337@jackp-linux.qualcomm.com \
    --to=quic_jackp@quicinc.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=heikki.krogerus@linux.intel.com \
    --cc=linux-usb@vger.kernel.org \
    --cc=quic_fenglinw@quicinc.com \
    --cc=quic_linyyuan@quicinc.com \
    --cc=quic_subbaram@quicinc.com \
    --cc=quic_wcheng@quicinc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.