All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vivek Goyal <vgoyal@redhat.com>
To: Jerome Marchand <jmarchan@redhat.com>
Cc: Jens Axboe <jaxboe@fusionio.com>,
	Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] block: fix accounting bug on cross partition merges
Date: Thu, 23 Dec 2010 10:39:15 -0500	[thread overview]
Message-ID: <20101223153915.GE9502@redhat.com> (raw)
In-Reply-To: <4D13664C.3020500@redhat.com>

On Thu, Dec 23, 2010 at 04:10:04PM +0100, Jerome Marchand wrote:
> On 12/17/2010 08:06 PM, Jens Axboe wrote:
> > On 2010-12-17 14:42, Jerome Marchand wrote:
> >>
> >> /proc/diskstats would display a strange output as follows.
> > 
> > [snip]
> > 
> > This looks a lot better! One comment:
> > 
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 4ce953f..064921d 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -64,13 +64,16 @@ static void drive_stat_acct(struct request *rq, int new_io)
> >>  		return;
> >>  
> >>  	cpu = part_stat_lock();
> >> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>  
> >> -	if (!new_io)
> >> +	if (!new_io) {
> >> +		part = rq->part;
> >>  		part_stat_inc(cpu, part, merges[rw]);
> >> -	else {
> >> +	} else {
> >> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>  		part_round_stats(cpu, part);
> >>  		part_inc_in_flight(part, rw);
> >> +		kref_get(&part->ref);
> >> +		rq->part = part;
> >>  	}
> >>  
> >>  	part_stat_unlock();
> > 
> > I don't think this is completely safe. The rcu lock is held due to the
> > part_stat_lock(), but that only prevents the __delete_partition()
> > callback from happening. Lets say you have this:
> > 
> > CPU0                                         CPU1
> > part = disk_map_sector_rcu()
> >                                              kref_put(part); <- now 0
> 
> A kref_get(part) happens here, or am I missing something?

It might happen that kref_get(part) is called after kref_put() has been
called and reference has reached 0 and and delete_parition() call has
been scheduled as soon as rcu grace period is over. So if you do
kref_get() after that, it is not going to help.

> If we use an atomic kref_test_and_get() function, which only increment the
> refcount if it's not zero, we would be safe. If kref_test_and_get() fails,
> we can cache rq->rq_disk->part0 instead. See my drafty patch below.

Conceptually it kind of makes sense to me. So if even if we get the
pointer to partition under rcu_read_lock(), we will not account the IO
to partition if it is going away.

> 
> > part_stat_unlock()
> >                                              __delete_partition();
> >                                              ...
> >                                              delete_partition_rcu_cb();
> > merge, or endio, boom
> > 
> > Now rq has ->part pointing to freed memory, later merges or end
> > accounting will touch freed memory.
> > 
> > I think we can fix this by just having delete_partition_rcu_rb() check
> > the reference count and return if non-zero. Since someone holds a
> > reference to the table, they will drop it and we'll re-schedule the rcu
> > callback.
> > 
> > 
> 
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4ce953f..72d12d2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io)
>  		return;
>  
>  	cpu = part_stat_lock();
> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>  
> -	if (!new_io)
> +	if (!new_io) {
> +		part = rq->part;
>  		part_stat_inc(cpu, part, merges[rw]);
> -	else {
> +	} else {
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> +		if(part->partno && !kref_test_and_get(&part->ref))
> +			/*

Do we have to check this part->partno both while taking and releasing
reference. Can't we take one extra reference for disk->part0, at
alloc_disk_node() time so that it is never freed and only freed when
disk is going away and gendisk is being freed.

That way, you don't have to differentiate between disk->part0 and rest
of the partitions while taking or dropping references.

Thanks
Vivek

> +			 * The partition is already being removed,
> +			 * the request will be accounted on the disk only
> +			 */
> +			part = &rq->rq_disk->part0;
>  		part_round_stats(cpu, part);
>  		part_inc_in_flight(part, rw);
> +		rq->part = part;
>  	}
>  
>  	part_stat_unlock();
> @@ -128,6 +136,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->ref_count = 1;
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
> +	rq->part = NULL;
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>  
> @@ -1776,7 +1785,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = req->part;
>  		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
>  		part_stat_unlock();
>  	}
> @@ -1796,13 +1805,15 @@ static void blk_account_io_done(struct request *req)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = req->part;
>  
>  		part_stat_inc(cpu, part, ios[rw]);
>  		part_stat_add(cpu, part, ticks[rw], duration);
>  		part_round_stats(cpu, part);
>  		part_dec_in_flight(part, rw);
>  
> +		if (part->partno)
> +			kref_put(&part->ref, __delete_partition);
>  		part_stat_unlock();
>  	}
>  }
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 77b7c26..3334578 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -351,11 +351,13 @@ static void blk_account_io_merge(struct request *req)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = req->part;
>  
>  		part_round_stats(cpu, part);
>  		part_dec_in_flight(part, rq_data_dir(req));
>  
> +		if (part->partno)
> +			kref_put(&part->ref, __delete_partition);
>  		part_stat_unlock();
>  	}
>  }
> diff --git a/block/genhd.c b/block/genhd.c
> index 5fa2b44..0c55eae 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1192,6 +1192,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
>  			return NULL;
>  		}
>  		disk->part_tbl->part[0] = &disk->part0;
> +		kref_init(&disk->part0.ref);
>  
>  		disk->minors = minors;
>  		rand_initialize_disk(disk);
> diff --git a/fs/partitions/check.c b/fs/partitions/check.c
> index 0a8b0ad..0123717 100644
> --- a/fs/partitions/check.c
> +++ b/fs/partitions/check.c
> @@ -372,6 +372,13 @@ static void delete_partition_rcu_cb(struct rcu_head *head)
>  	put_device(part_to_dev(part));
>  }
>  
> +void __delete_partition(struct kref *ref)
> +{
> +	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
> +
> +	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
> +}
> +
>  void delete_partition(struct gendisk *disk, int partno)
>  {
>  	struct disk_part_tbl *ptbl = disk->part_tbl;
> @@ -390,7 +397,7 @@ void delete_partition(struct gendisk *disk, int partno)
>  	kobject_put(part->holder_dir);
>  	device_del(part_to_dev(part));
>  
> -	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
> +	kref_put(&part->ref, __delete_partition);
>  }
>  
>  static ssize_t whole_disk_show(struct device *dev,
> @@ -489,6 +496,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	if (!dev_get_uevent_suppress(ddev))
>  		kobject_uevent(&pdev->kobj, KOBJ_ADD);
>  
> +	kref_init(&p->ref);
>  	return p;
>  
>  out_free_info:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aae86fd..482a7fd 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -115,6 +115,7 @@ struct request {
>  	void *elevator_private3;
>  
>  	struct gendisk *rq_disk;
> +	struct hd_struct *part;
>  	unsigned long start_time;
>  #ifdef CONFIG_BLK_CGROUP
>  	unsigned long long start_time_ns;
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 7a7b9c1..2ba2792 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -116,6 +116,7 @@ struct hd_struct {
>  	struct disk_stats dkstats;
>  #endif
>  	struct rcu_head rcu_head;
> +	struct kref ref;
>  };
>  
>  #define GENHD_FL_REMOVABLE			1
> @@ -583,6 +584,7 @@ extern struct hd_struct * __must_check add_partition(struct gendisk *disk,
>  						     sector_t len, int flags,
>  						     struct partition_meta_info
>  						       *info);
> +extern void __delete_partition(struct kref *ref);
>  extern void delete_partition(struct gendisk *, int);
>  extern void printk_all_partitions(void);
>  
> diff --git a/include/linux/kref.h b/include/linux/kref.h
> index 6cc38fc..90b9e44 100644
> --- a/include/linux/kref.h
> +++ b/include/linux/kref.h
> @@ -23,6 +23,7 @@ struct kref {
>  
>  void kref_init(struct kref *kref);
>  void kref_get(struct kref *kref);
> +int kref_test_and_get(struct kref *kref);
>  int kref_put(struct kref *kref, void (*release) (struct kref *kref));
>  
>  #endif /* _KREF_H_ */
> diff --git a/lib/kref.c b/lib/kref.c
> index d3d227a..5f663b9 100644
> --- a/lib/kref.c
> +++ b/lib/kref.c
> @@ -37,6 +37,22 @@ void kref_get(struct kref *kref)
>  }
>  
>  /**
> + * kref_test_and_get - increment refcount for object only if refcount is not
> + * zero.
> + * @kref: object.
> + *
> + * Return non-zero if the refcount was incremented, 0 otherwise
> + */
> +int kref_test_and_get(struct kref *kref)
> +{
> +	int ret;
> +	smp_mb__before_atomic_inc();
> +	ret = atomic_inc_not_zero(&kref->refcount);
> +	smp_mb__after_atomic_inc();
> +	return ret;
> +}
> +
> +/**
>   * kref_put - decrement refcount for object.
>   * @kref: object.
>   * @release: pointer to the function that will clean up the object when the

  reply	other threads:[~2010-12-23 15:39 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-06  9:44 [PATCH 1/2] Don't merge different partition's IOs Yasuaki Ishimatsu
2010-12-06 16:08 ` Linus Torvalds
2010-12-07  7:18   ` Satoru Takeuchi
2010-12-07 18:39     ` Vivek Goyal
2010-12-08  7:33     ` Jens Axboe
2010-12-08  7:59       ` Satoru Takeuchi
2010-12-08  8:06         ` Jens Axboe
2010-12-08  8:11           ` Satoru Takeuchi
2010-12-08 14:46             ` Jens Axboe
2010-12-08 15:51               ` Vivek Goyal
2010-12-08 15:58                 ` Vivek Goyal
2010-12-10 11:22                   ` Jerome Marchand
2010-12-10 16:12               ` Jerome Marchand
2010-12-10 16:55                 ` Vivek Goyal
2010-12-14 20:25                   ` Jens Axboe
2010-12-17 13:42                     ` [PATCH] block: fix accounting bug on cross partition merges Jerome Marchand
2010-12-17 19:06                       ` Jens Axboe
2010-12-17 22:32                         ` Vivek Goyal
2010-12-23 15:10                         ` Jerome Marchand
2010-12-23 15:39                           ` Vivek Goyal [this message]
2010-12-23 17:04                             ` Jerome Marchand
2010-12-24 19:29                               ` Vivek Goyal
2011-01-04 15:52                                 ` [PATCH 1/2] kref: add kref_test_and_get Jerome Marchand
2011-01-04 15:55                                   ` [PATCH 2/2] block: fix accounting bug on cross partition merges Jerome Marchand
2011-01-04 21:00                                     ` Greg KH
2011-01-05 13:51                                       ` Jerome Marchand
2011-01-05 16:00                                         ` Greg KH
2011-01-05 16:19                                           ` Jerome Marchand
2011-01-05 16:27                                             ` Greg KH
2011-01-05 13:55                                       ` Jens Axboe
2011-01-05 15:58                                         ` Greg KH
2011-01-05 18:46                                           ` Jens Axboe
2011-01-05 20:08                                             ` Greg KH
2011-01-05 21:38                                               ` Jens Axboe
2011-01-05 22:16                                                 ` Greg KH
2011-01-06  9:46                                                   ` Jens Axboe
2011-01-05 14:00                                     ` Jens Axboe
2011-01-05 14:09                                       ` Jerome Marchand
2011-01-05 14:17                                         ` Jens Axboe
2011-01-04 16:05                                   ` [PATCH 1/2] kref: add kref_test_and_get Eric Dumazet
2011-01-05 15:02                                     ` [PATCH 1/2 v2] " Jerome Marchand
2011-01-05 15:43                                       ` Alexey Dobriyan
2011-01-05 15:57                                         ` Greg KH
2011-01-05 15:56                                       ` Greg KH
2011-01-04 20:57                                   ` [PATCH 1/2] " Greg KH
2011-01-05 13:35                                     ` Jerome Marchand
2011-01-05 15:55                                       ` Greg KH

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101223153915.GE9502@redhat.com \
    --to=vgoyal@redhat.com \
    --cc=isimatu.yasuaki@jp.fujitsu.com \
    --cc=jaxboe@fusionio.com \
    --cc=jmarchan@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=takeuchi_satoru@jp.fujitsu.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.