From mboxrd@z Thu Jan  1 00:00:00 1970
From: willy@linux.intel.com (Matthew Wilcox)
Date: Thu, 2 Jan 2014 11:58:53 -0500
Subject: [PATCH] NVMe: use-after-free on surprise removal fixes
In-Reply-To: <1388528118-4866-1-git-send-email-keith.busch@intel.com>
References: <1388528118-4866-1-git-send-email-keith.busch@intel.com>
Message-ID: <20140102165853.GN4945@linux.intel.com>

On Tue, Dec 31, 2013@03:15:18PM -0700, Keith Busch wrote:
> A surprise removal of a device while a program has an open reference to
> the block handle may cause it to use the nvme dev, namespace, or queue
> after we've freed them. Reference count the opens on the block device,
> do not free the namespaces until the device ref count is 0, and protect
> queue access with RCU.

Brave.  I love it.

> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index b59a93a..d71fc27 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -263,12 +263,12 @@ static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
>  
>  struct nvme_queue *get_nvmeq(struct nvme_dev *dev)
>  {
> -	return dev->queues[get_cpu() + 1];
> +	return rcu_dereference(dev->queues[part_stat_lock() + 1]);
>  }
>  
>  void put_nvmeq(struct nvme_queue *nvmeq)
>  {
> -	put_cpu();
> +	part_stat_unlock();
>  }
>  
>  /**

I'm not so keen on using part_stat_lock,unlock here.  If the part stats
switch to a different protection scheme, this will be broken.  Also,
I'm not entirely sure that we're safe in the !SMP case (can't we end up
without having called preempt_disable()?).  I think it's probably better
to do:

-	return dev->queues[get_cpu() + 1];
+	int queue;
+	rcu_read_lock();
+	queue = get_cpu() + 1;
+	return rcu_dereference(dev->queues[queue]);

> @@ -1155,10 +1155,20 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
>  {
>  	int i;
>  
> +	for (i = num_possible_cpus(); i > dev->queue_count - 1; i--)
> +		rcu_assign_pointer(dev->queues[i], NULL);
>  	for (i = dev->queue_count - 1; i >= lowest; i--) {
> -		nvme_free_queue(dev->queues[i]);
> +		struct nvme_queue *nvmeq = dev->queues[i];
> +
> +		spin_lock_irq(&nvmeq->q_lock);
> +		nvme_cancel_ios(nvmeq, false);
> +		spin_unlock_irq(&nvmeq->q_lock);
> +
> +		rcu_assign_pointer(dev->queues[i], NULL);
> +		synchronize_rcu();
> +
> +		nvme_free_queue(nvmeq);
>  		dev->queue_count--;
> -		dev->queues[i] = NULL;
>  	}
>  }
>  

I don't think you want to do this.  You're calling synchronize_rcu() for
each queue individually, which is extremely heavy-weight.  This routine
could take minutes to execute on a machine with a lot of RCU activity.

Two ways to fix this; accumulate the queues into a function-local array,
call synchronize_rcu() once, then free them all.  Alternatively, use
call_rcu() to have nvme_free_queue() executed after a grace period
has elapsed.

If we go that route, then as described in
Documentation/RCU/rcubarrier.txt, we will need to add a call to
rcu_barrier() in the module exit path to be sure that all the call_rcu()
functions have finished executing.  So there's probabaly no real
difference between the execution time of the two options in the case of
module unload, though it'd be visible in device hotplug.

By the way, this is an extremely common mistake people make when
introducing RCU.  If one can figure out a way to avoid calling
synchronize_rcu(), then one probably should.

Also, why is
		rcu_assign_pointer(dev->queues[i], NULL);
after cancelling the I/Os?  It seems to me that another thread could submit
an I/O between the cancellations and setting the queue to NULL.  I think
the pointer assignment needs to at least be within the spin_lock section,
but it could also be done before taking the spinlock ... right?