* [PATCH 1/2] btrfs: raid56: Don't keep rbio for later steal
@ 2017-01-17 1:10 Qu Wenruo
2017-01-17 1:10 ` [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled Qu Wenruo
0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2017-01-17 1:10 UTC (permalink / raw)
To: linux-btrfs
Before this patch, btrfs raid56 will keep raid56 rbio even all its IO is
done.
This may save some time allocating rbio, but it can cause deadly
use-after-free bug, for the following case:
Original fs: 4 devices RAID5
Process A | Process B
--------------------------------------------------------------------------
| Start device replace
| Now the fs has 5 devices
| devid 0: replace device
| devid 1~4: old devices
btrfs_map_bio() |
|- __btrfs_map_block() |
| bbio has 5 stripes |
| including devid 0 |
|- raid56_parity_write() |
|
raid_write_end_io() |
|- rbio_orig_end_io() |
|- unlock_stripe() |
Keeps the old rbio for |
later steal, which has |
stripe for devid 0 |
| Cancel device replace
| Now the fs has 4 devices
| devid 0 is freed
Some IO happens |
raid_write_end_io() |
|- rbio_orig_end_io() |
|- unlock_stripe() |
|- steal_rbio() |
Use old rbio, whose |
bbio has freed devid 0|
stripe |
Any access to rbio->bbio will |
cause general protection or NULL |
pointer dereference |
Such bug can already be triggered by fstests btrfs/069 for RAID5/6
profiles.
Fix it by not keeping old rbio in unlock_stripe(), so we just free the
finished rbio and rbio->bbio, so above problem wont' happen.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
fs/btrfs/raid56.c | 18 +-----------------
1 file changed, 1 insertion(+), 17 deletions(-)
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b8ffd9ea7499..41ced67165c0 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -774,7 +774,6 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
int bucket;
struct btrfs_stripe_hash *h;
unsigned long flags;
- int keep_cache = 0;
bucket = rbio_bucket(rbio);
h = rbio->fs_info->stripe_hash_table->table + bucket;
@@ -786,19 +785,6 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
spin_lock(&rbio->bio_list_lock);
if (!list_empty(&rbio->hash_list)) {
- /*
- * if we're still cached and there is no other IO
- * to perform, just leave this rbio here for others
- * to steal from later
- */
- if (list_empty(&rbio->plug_list) &&
- test_bit(RBIO_CACHE_BIT, &rbio->flags)) {
- keep_cache = 1;
- clear_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
- BUG_ON(!bio_list_empty(&rbio->bio_list));
- goto done;
- }
-
list_del_init(&rbio->hash_list);
atomic_dec(&rbio->refs);
@@ -846,13 +832,11 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
goto done_nolock;
}
}
-done:
spin_unlock(&rbio->bio_list_lock);
spin_unlock_irqrestore(&h->lock, flags);
done_nolock:
- if (!keep_cache)
- remove_rbio_from_cache(rbio);
+ remove_rbio_from_cache(rbio);
}
static void __free_raid_bio(struct btrfs_raid_bio *rbio)
--
2.11.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled
2017-01-17 1:10 [PATCH 1/2] btrfs: raid56: Don't keep rbio for later steal Qu Wenruo
@ 2017-01-17 1:10 ` Qu Wenruo
2017-01-17 22:47 ` Josef Bacik
2017-02-01 22:37 ` Filipe Manana
0 siblings, 2 replies; 6+ messages in thread
From: Qu Wenruo @ 2017-01-17 1:10 UTC (permalink / raw)
To: linux-btrfs
When dev-replace and scrub are run at the same time, dev-replace can be
canceled by scrub. It's quite common for btrfs/069.
While in that case, target device can be destroyed at cancel time,
leading to a user-after-free bug:
Process A (dev-replace) | Process B(scrub)
----------------------------------------------------------------------
|(Any RW is OK)
|scrub_setup_recheck_block()
||- btrfs_map_sblock()
| Got a bbio with tgtdev
btrfs_dev_replace_finishing() |
|- btrfs_destory_dev_replace_tgtdev()|
|- call_rcu(free_device) |
|- __free_device() |
|- kfree(device) |
| Scrub worker:
| Access bbio->stripes[], which
| contains tgtdev.
| This triggers general protection.
The bug is mostly obvious for RAID5/6 since raid56 choose to keep old
rbio and rbio->bbio for later steal, this hugely enlarged the race
window and makes it much easier to trigger the bug.
This patch introduces 'tgtdev_refs' and 'tgtdev_wait' for btrfs_device
to wait for all its user released the target device.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
fs/btrfs/dev-replace.c | 7 ++++++-
fs/btrfs/volumes.c | 36 +++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 10 ++++++++++
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 5de280b9ad73..794a6a0bedf2 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -558,7 +558,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
rcu_str_deref(src_device->name),
src_device->devid,
rcu_str_deref(tgt_device->name));
- tgt_device->is_tgtdev_for_dev_replace = 0;
tgt_device->devid = src_device->devid;
src_device->devid = BTRFS_DEV_REPLACE_DEVID;
memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
@@ -579,6 +578,12 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
btrfs_dev_replace_unlock(dev_replace, 1);
+ /*
+ * Only change is_tgtdev_for_dev_replace flag after all its
+ * users get released.
+ */
+ wait_target_device(tgt_device);
+ tgt_device->is_tgtdev_for_dev_replace = 0;
btrfs_rm_dev_replace_blocked(fs_info);
btrfs_rm_dev_replace_remove_srcdev(fs_info, src_device);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bb8592e1a364..74a6ee981b78 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2064,6 +2064,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
WARN_ON(!tgtdev);
mutex_lock(&fs_info->fs_devices->device_list_mutex);
+ wait_target_device(tgtdev);
btrfs_sysfs_rm_device_link(fs_info->fs_devices, tgtdev);
if (tgtdev->bdev)
@@ -2598,6 +2599,8 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
device->is_tgtdev_for_dev_replace = 1;
device->mode = FMODE_EXCL;
device->dev_stats_valid = 1;
+ atomic_set(&device->tgtdev_refs, 0);
+ init_waitqueue_head(&device->tgtdev_wait);
set_blocksize(device->bdev, 4096);
device->fs_devices = fs_info->fs_devices;
list_add(&device->dev_list, &fs_info->fs_devices->devices);
@@ -2624,6 +2627,8 @@ void btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info *fs_info,
tgtdev->sector_size = sectorsize;
tgtdev->fs_info = fs_info;
tgtdev->in_fs_metadata = 1;
+ atomic_set(&tgtdev->tgtdev_refs, 0);
+ init_waitqueue_head(&tgtdev->tgtdev_wait);
}
static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
@@ -5302,6 +5307,32 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
return bbio;
}
+static void pin_bbio_target_device(struct btrfs_bio *bbio)
+{
+ int i;
+
+ for (i = 0; i < bbio->num_stripes; i++) {
+ struct btrfs_device *device = bbio->stripes[i].dev;
+
+ if (device->is_tgtdev_for_dev_replace)
+ atomic_inc(&device->tgtdev_refs);
+ }
+}
+
+static void unpin_bbio_target_device(struct btrfs_bio *bbio)
+{
+ int i;
+
+ for (i = 0; i < bbio->num_stripes; i++) {
+ struct btrfs_device *device = bbio->stripes[i].dev;
+
+ if (device->is_tgtdev_for_dev_replace) {
+ atomic_dec(&device->tgtdev_refs);
+ wake_up(&device->tgtdev_wait);
+ }
+ }
+}
+
void btrfs_get_bbio(struct btrfs_bio *bbio)
{
WARN_ON(!atomic_read(&bbio->refs));
@@ -5312,8 +5343,10 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
{
if (!bbio)
return;
- if (atomic_dec_and_test(&bbio->refs))
+ if (atomic_dec_and_test(&bbio->refs)) {
+ unpin_bbio_target_device(bbio);
kfree(bbio);
+ }
}
static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
@@ -5868,6 +5901,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
bbio->stripes[0].physical = physical_to_patch_in_first_stripe;
bbio->mirror_num = map->num_stripes + 1;
}
+ pin_bbio_target_device(bbio);
out:
if (dev_replace_is_ongoing) {
btrfs_dev_replace_clear_lock_blocking(dev_replace);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 811328984702..1b4069dbe1c3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -149,6 +149,10 @@ struct btrfs_device {
/* Counter to record the change of device stats */
atomic_t dev_stats_ccnt;
atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
+
+ /* To ensure we wait this target device before destroying it */
+ atomic_t tgtdev_refs;
+ wait_queue_head_t tgtdev_wait;
};
/*
@@ -538,4 +542,10 @@ struct list_head *btrfs_get_fs_uuids(void);
void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
+static inline void wait_target_device(struct btrfs_device *tgtdev)
+{
+ if (!tgtdev || !tgtdev->is_tgtdev_for_dev_replace)
+ return;
+ wait_event(tgtdev->tgtdev_wait, atomic_read(&tgtdev->tgtdev_refs) == 0);
+}
#endif
--
2.11.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled
2017-01-17 1:10 ` [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled Qu Wenruo
@ 2017-01-17 22:47 ` Josef Bacik
2017-01-18 0:46 ` Qu Wenruo
2017-02-01 22:37 ` Filipe Manana
1 sibling, 1 reply; 6+ messages in thread
From: Josef Bacik @ 2017-01-17 22:47 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Jan 16, 2017 at 5:10 PM, Qu Wenruo <quwenruo@cn.fujitsu.com>
wrote:
> When dev-replace and scrub are run at the same time, dev-replace can
> be
> canceled by scrub. It's quite common for btrfs/069.
>
> While in that case, target device can be destroyed at cancel time,
> leading to a user-after-free bug:
>
> Process A (dev-replace) | Process B(scrub)
> ----------------------------------------------------------------------
> |(Any RW is OK)
> |scrub_setup_recheck_block()
> ||- btrfs_map_sblock()
> | Got a bbio with tgtdev
> btrfs_dev_replace_finishing() |
> |- btrfs_destory_dev_replace_tgtdev()|
> |- call_rcu(free_device) |
> |- __free_device() |
> |- kfree(device) |
> | Scrub worker:
> | Access bbio->stripes[], which
> | contains tgtdev.
> | This triggers general
> protection.
>
> The bug is mostly obvious for RAID5/6 since raid56 choose to keep old
> rbio and rbio->bbio for later steal, this hugely enlarged the race
> window and makes it much easier to trigger the bug.
>
> This patch introduces 'tgtdev_refs' and 'tgtdev_wait' for btrfs_device
> to wait for all its user released the target device.
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> fs/btrfs/dev-replace.c | 7 ++++++-
> fs/btrfs/volumes.c | 36 +++++++++++++++++++++++++++++++++++-
> fs/btrfs/volumes.h | 10 ++++++++++
> 3 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 5de280b9ad73..794a6a0bedf2 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -558,7 +558,6 @@ static int btrfs_dev_replace_finishing(struct
> btrfs_fs_info *fs_info,
> rcu_str_deref(src_device->name),
> src_device->devid,
> rcu_str_deref(tgt_device->name));
> - tgt_device->is_tgtdev_for_dev_replace = 0;
> tgt_device->devid = src_device->devid;
> src_device->devid = BTRFS_DEV_REPLACE_DEVID;
> memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
> @@ -579,6 +578,12 @@ static int btrfs_dev_replace_finishing(struct
> btrfs_fs_info *fs_info,
>
> btrfs_dev_replace_unlock(dev_replace, 1);
>
> + /*
> + * Only change is_tgtdev_for_dev_replace flag after all its
> + * users get released.
> + */
> + wait_target_device(tgt_device);
> + tgt_device->is_tgtdev_for_dev_replace = 0;
> btrfs_rm_dev_replace_blocked(fs_info);
>
> btrfs_rm_dev_replace_remove_srcdev(fs_info, src_device);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index bb8592e1a364..74a6ee981b78 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2064,6 +2064,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct
> btrfs_fs_info *fs_info,
> WARN_ON(!tgtdev);
> mutex_lock(&fs_info->fs_devices->device_list_mutex);
>
> + wait_target_device(tgtdev);
> btrfs_sysfs_rm_device_link(fs_info->fs_devices, tgtdev);
>
> if (tgtdev->bdev)
> @@ -2598,6 +2599,8 @@ int btrfs_init_dev_replace_tgtdev(struct
> btrfs_fs_info *fs_info,
> device->is_tgtdev_for_dev_replace = 1;
> device->mode = FMODE_EXCL;
> device->dev_stats_valid = 1;
> + atomic_set(&device->tgtdev_refs, 0);
> + init_waitqueue_head(&device->tgtdev_wait);
> set_blocksize(device->bdev, 4096);
> device->fs_devices = fs_info->fs_devices;
> list_add(&device->dev_list, &fs_info->fs_devices->devices);
> @@ -2624,6 +2627,8 @@ void
> btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info
> *fs_info,
> tgtdev->sector_size = sectorsize;
> tgtdev->fs_info = fs_info;
> tgtdev->in_fs_metadata = 1;
> + atomic_set(&tgtdev->tgtdev_refs, 0);
> + init_waitqueue_head(&tgtdev->tgtdev_wait);
> }
>
> static noinline int btrfs_update_device(struct btrfs_trans_handle
> *trans,
> @@ -5302,6 +5307,32 @@ static struct btrfs_bio *alloc_btrfs_bio(int
> total_stripes, int real_stripes)
> return bbio;
> }
>
> +static void pin_bbio_target_device(struct btrfs_bio *bbio)
> +{
> + int i;
> +
> + for (i = 0; i < bbio->num_stripes; i++) {
> + struct btrfs_device *device = bbio->stripes[i].dev;
> +
> + if (device->is_tgtdev_for_dev_replace)
> + atomic_inc(&device->tgtdev_refs);
> + }
> +}
Can we just do this at the map time? So when we add a new stripe we go
ahead and take the ref then, and the same at complete time? Thanks,
Josef
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled
2017-01-17 22:47 ` Josef Bacik
@ 2017-01-18 0:46 ` Qu Wenruo
0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2017-01-18 0:46 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
At 01/18/2017 06:47 AM, Josef Bacik wrote:
> On Mon, Jan 16, 2017 at 5:10 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>> When dev-replace and scrub are run at the same time, dev-replace can be
>> canceled by scrub. It's quite common for btrfs/069.
>>
>> While in that case, target device can be destroyed at cancel time,
>> leading to a user-after-free bug:
>>
>> Process A (dev-replace) | Process B(scrub)
>> ----------------------------------------------------------------------
>> |(Any RW is OK)
>> |scrub_setup_recheck_block()
>> ||- btrfs_map_sblock()
>> | Got a bbio with tgtdev
>> btrfs_dev_replace_finishing() |
>> |- btrfs_destory_dev_replace_tgtdev()|
>> |- call_rcu(free_device) |
>> |- __free_device() |
>> |- kfree(device) |
>> | Scrub worker:
>> | Access bbio->stripes[], which
>> | contains tgtdev.
>> | This triggers general protection.
>>
>> The bug is mostly obvious for RAID5/6 since raid56 choose to keep old
>> rbio and rbio->bbio for later steal, this hugely enlarged the race
>> window and makes it much easier to trigger the bug.
>>
>> This patch introduces 'tgtdev_refs' and 'tgtdev_wait' for btrfs_device
>> to wait for all its user released the target device.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>> fs/btrfs/dev-replace.c | 7 ++++++-
>> fs/btrfs/volumes.c | 36 +++++++++++++++++++++++++++++++++++-
>> fs/btrfs/volumes.h | 10 ++++++++++
>> 3 files changed, 51 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index 5de280b9ad73..794a6a0bedf2 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -558,7 +558,6 @@ static int btrfs_dev_replace_finishing(struct
>> btrfs_fs_info *fs_info,
>> rcu_str_deref(src_device->name),
>> src_device->devid,
>> rcu_str_deref(tgt_device->name));
>> - tgt_device->is_tgtdev_for_dev_replace = 0;
>> tgt_device->devid = src_device->devid;
>> src_device->devid = BTRFS_DEV_REPLACE_DEVID;
>> memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
>> @@ -579,6 +578,12 @@ static int btrfs_dev_replace_finishing(struct
>> btrfs_fs_info *fs_info,
>>
>> btrfs_dev_replace_unlock(dev_replace, 1);
>>
>> + /*
>> + * Only change is_tgtdev_for_dev_replace flag after all its
>> + * users get released.
>> + */
>> + wait_target_device(tgt_device);
>> + tgt_device->is_tgtdev_for_dev_replace = 0;
>> btrfs_rm_dev_replace_blocked(fs_info);
>>
>> btrfs_rm_dev_replace_remove_srcdev(fs_info, src_device);
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index bb8592e1a364..74a6ee981b78 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2064,6 +2064,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct
>> btrfs_fs_info *fs_info,
>> WARN_ON(!tgtdev);
>> mutex_lock(&fs_info->fs_devices->device_list_mutex);
>>
>> + wait_target_device(tgtdev);
>> btrfs_sysfs_rm_device_link(fs_info->fs_devices, tgtdev);
>>
>> if (tgtdev->bdev)
>> @@ -2598,6 +2599,8 @@ int btrfs_init_dev_replace_tgtdev(struct
>> btrfs_fs_info *fs_info,
>> device->is_tgtdev_for_dev_replace = 1;
>> device->mode = FMODE_EXCL;
>> device->dev_stats_valid = 1;
>> + atomic_set(&device->tgtdev_refs, 0);
>> + init_waitqueue_head(&device->tgtdev_wait);
>> set_blocksize(device->bdev, 4096);
>> device->fs_devices = fs_info->fs_devices;
>> list_add(&device->dev_list, &fs_info->fs_devices->devices);
>> @@ -2624,6 +2627,8 @@ void
>> btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info *fs_info,
>> tgtdev->sector_size = sectorsize;
>> tgtdev->fs_info = fs_info;
>> tgtdev->in_fs_metadata = 1;
>> + atomic_set(&tgtdev->tgtdev_refs, 0);
>> + init_waitqueue_head(&tgtdev->tgtdev_wait);
>> }
>>
>> static noinline int btrfs_update_device(struct btrfs_trans_handle
>> *trans,
>> @@ -5302,6 +5307,32 @@ static struct btrfs_bio *alloc_btrfs_bio(int
>> total_stripes, int real_stripes)
>> return bbio;
>> }
>>
>> +static void pin_bbio_target_device(struct btrfs_bio *bbio)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < bbio->num_stripes; i++) {
>> + struct btrfs_device *device = bbio->stripes[i].dev;
>> +
>> + if (device->is_tgtdev_for_dev_replace)
>> + atomic_inc(&device->tgtdev_refs);
>> + }
>> +}
>
> Can we just do this at the map time? So when we add a new stripe we go
> ahead and take the ref then, and the same at complete time? Thanks,
>
> Josef
Thanks for the review.
But I'm not quite sure what you mean here.
This pin_bbio_target_device() is called inside the dev_replace lock
protection, so it's called at map time.
And we must call it after patching mirror inside __btrfs_map_block(), so
I can't find a better timing to call it.
Would you please give me some hint?
Thanks,
Qu
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled
2017-01-17 1:10 ` [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled Qu Wenruo
2017-01-17 22:47 ` Josef Bacik
@ 2017-02-01 22:37 ` Filipe Manana
2017-02-02 0:23 ` Qu Wenruo
1 sibling, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2017-02-01 22:37 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs@vger.kernel.org
On Tue, Jan 17, 2017 at 1:10 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> When dev-replace and scrub are run at the same time, dev-replace can be
> canceled by scrub. It's quite common for btrfs/069.
>
> While in that case, target device can be destroyed at cancel time,
> leading to a user-after-free bug:
>
> Process A (dev-replace) | Process B(scrub)
> ----------------------------------------------------------------------
> |(Any RW is OK)
> |scrub_setup_recheck_block()
> ||- btrfs_map_sblock()
> | Got a bbio with tgtdev
> btrfs_dev_replace_finishing() |
> |- btrfs_destory_dev_replace_tgtdev()|
> |- call_rcu(free_device) |
> |- __free_device() |
> |- kfree(device) |
> | Scrub worker:
> | Access bbio->stripes[], which
> | contains tgtdev.
> | This triggers general protection.
>
> The bug is mostly obvious for RAID5/6 since raid56 choose to keep old
> rbio and rbio->bbio for later steal, this hugely enlarged the race
> window and makes it much easier to trigger the bug.
>
> This patch introduces 'tgtdev_refs' and 'tgtdev_wait' for btrfs_device
> to wait for all its user released the target device.
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
I haven't read the patch, but seeing it's a fix for the issue you
reported in the thread with the subject "v4.9-rc7 scrub kernel panic",
could you please include the dmesg trace in the changelog? This helps
people figuring out if there's a fix for an issue they're hitting...
(I just hit it today for example).
Thanks.
> ---
> fs/btrfs/dev-replace.c | 7 ++++++-
> fs/btrfs/volumes.c | 36 +++++++++++++++++++++++++++++++++++-
> fs/btrfs/volumes.h | 10 ++++++++++
> 3 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 5de280b9ad73..794a6a0bedf2 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -558,7 +558,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
> rcu_str_deref(src_device->name),
> src_device->devid,
> rcu_str_deref(tgt_device->name));
> - tgt_device->is_tgtdev_for_dev_replace = 0;
> tgt_device->devid = src_device->devid;
> src_device->devid = BTRFS_DEV_REPLACE_DEVID;
> memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
> @@ -579,6 +578,12 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>
> btrfs_dev_replace_unlock(dev_replace, 1);
>
> + /*
> + * Only change is_tgtdev_for_dev_replace flag after all its
> + * users get released.
> + */
> + wait_target_device(tgt_device);
> + tgt_device->is_tgtdev_for_dev_replace = 0;
> btrfs_rm_dev_replace_blocked(fs_info);
>
> btrfs_rm_dev_replace_remove_srcdev(fs_info, src_device);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index bb8592e1a364..74a6ee981b78 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2064,6 +2064,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
> WARN_ON(!tgtdev);
> mutex_lock(&fs_info->fs_devices->device_list_mutex);
>
> + wait_target_device(tgtdev);
> btrfs_sysfs_rm_device_link(fs_info->fs_devices, tgtdev);
>
> if (tgtdev->bdev)
> @@ -2598,6 +2599,8 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
> device->is_tgtdev_for_dev_replace = 1;
> device->mode = FMODE_EXCL;
> device->dev_stats_valid = 1;
> + atomic_set(&device->tgtdev_refs, 0);
> + init_waitqueue_head(&device->tgtdev_wait);
> set_blocksize(device->bdev, 4096);
> device->fs_devices = fs_info->fs_devices;
> list_add(&device->dev_list, &fs_info->fs_devices->devices);
> @@ -2624,6 +2627,8 @@ void btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info *fs_info,
> tgtdev->sector_size = sectorsize;
> tgtdev->fs_info = fs_info;
> tgtdev->in_fs_metadata = 1;
> + atomic_set(&tgtdev->tgtdev_refs, 0);
> + init_waitqueue_head(&tgtdev->tgtdev_wait);
> }
>
> static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
> @@ -5302,6 +5307,32 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
> return bbio;
> }
>
> +static void pin_bbio_target_device(struct btrfs_bio *bbio)
> +{
> + int i;
> +
> + for (i = 0; i < bbio->num_stripes; i++) {
> + struct btrfs_device *device = bbio->stripes[i].dev;
> +
> + if (device->is_tgtdev_for_dev_replace)
> + atomic_inc(&device->tgtdev_refs);
> + }
> +}
> +
> +static void unpin_bbio_target_device(struct btrfs_bio *bbio)
> +{
> + int i;
> +
> + for (i = 0; i < bbio->num_stripes; i++) {
> + struct btrfs_device *device = bbio->stripes[i].dev;
> +
> + if (device->is_tgtdev_for_dev_replace) {
> + atomic_dec(&device->tgtdev_refs);
> + wake_up(&device->tgtdev_wait);
> + }
> + }
> +}
> +
> void btrfs_get_bbio(struct btrfs_bio *bbio)
> {
> WARN_ON(!atomic_read(&bbio->refs));
> @@ -5312,8 +5343,10 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
> {
> if (!bbio)
> return;
> - if (atomic_dec_and_test(&bbio->refs))
> + if (atomic_dec_and_test(&bbio->refs)) {
> + unpin_bbio_target_device(bbio);
> kfree(bbio);
> + }
> }
>
> static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
> @@ -5868,6 +5901,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
> bbio->stripes[0].physical = physical_to_patch_in_first_stripe;
> bbio->mirror_num = map->num_stripes + 1;
> }
> + pin_bbio_target_device(bbio);
> out:
> if (dev_replace_is_ongoing) {
> btrfs_dev_replace_clear_lock_blocking(dev_replace);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 811328984702..1b4069dbe1c3 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -149,6 +149,10 @@ struct btrfs_device {
> /* Counter to record the change of device stats */
> atomic_t dev_stats_ccnt;
> atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
> +
> + /* To ensure we wait this target device before destroying it */
> + atomic_t tgtdev_refs;
> + wait_queue_head_t tgtdev_wait;
> };
>
> /*
> @@ -538,4 +542,10 @@ struct list_head *btrfs_get_fs_uuids(void);
> void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
> void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
>
> +static inline void wait_target_device(struct btrfs_device *tgtdev)
> +{
> + if (!tgtdev || !tgtdev->is_tgtdev_for_dev_replace)
> + return;
> + wait_event(tgtdev->tgtdev_wait, atomic_read(&tgtdev->tgtdev_refs) == 0);
> +}
> #endif
> --
> 2.11.0
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Filipe David Manana,
"People will forget what you said,
people will forget what you did,
but people will never forget how you made them feel."
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled
2017-02-01 22:37 ` Filipe Manana
@ 2017-02-02 0:23 ` Qu Wenruo
0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2017-02-02 0:23 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs@vger.kernel.org
At 02/02/2017 06:37 AM, Filipe Manana wrote:
> On Tue, Jan 17, 2017 at 1:10 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>> When dev-replace and scrub are run at the same time, dev-replace can be
>> canceled by scrub. It's quite common for btrfs/069.
>>
>> While in that case, target device can be destroyed at cancel time,
>> leading to a user-after-free bug:
>>
>> Process A (dev-replace) | Process B(scrub)
>> ----------------------------------------------------------------------
>> |(Any RW is OK)
>> |scrub_setup_recheck_block()
>> ||- btrfs_map_sblock()
>> | Got a bbio with tgtdev
>> btrfs_dev_replace_finishing() |
>> |- btrfs_destory_dev_replace_tgtdev()|
>> |- call_rcu(free_device) |
>> |- __free_device() |
>> |- kfree(device) |
>> | Scrub worker:
>> | Access bbio->stripes[], which
>> | contains tgtdev.
>> | This triggers general protection.
>>
>> The bug is mostly obvious for RAID5/6 since raid56 choose to keep old
>> rbio and rbio->bbio for later steal, this hugely enlarged the race
>> window and makes it much easier to trigger the bug.
>>
>> This patch introduces 'tgtdev_refs' and 'tgtdev_wait' for btrfs_device
>> to wait for all its user released the target device.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>
> I haven't read the patch, but seeing it's a fix for the issue you
> reported in the thread with the subject "v4.9-rc7 scrub kernel panic",
> could you please include the dmesg trace in the changelog? This helps
> people figuring out if there's a fix for an issue they're hitting...
> (I just hit it today for example).
No problem, I'll update it with backtrace.
Thanks,
Qu
>
> Thanks.
>
>> ---
>> fs/btrfs/dev-replace.c | 7 ++++++-
>> fs/btrfs/volumes.c | 36 +++++++++++++++++++++++++++++++++++-
>> fs/btrfs/volumes.h | 10 ++++++++++
>> 3 files changed, 51 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index 5de280b9ad73..794a6a0bedf2 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -558,7 +558,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>> rcu_str_deref(src_device->name),
>> src_device->devid,
>> rcu_str_deref(tgt_device->name));
>> - tgt_device->is_tgtdev_for_dev_replace = 0;
>> tgt_device->devid = src_device->devid;
>> src_device->devid = BTRFS_DEV_REPLACE_DEVID;
>> memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
>> @@ -579,6 +578,12 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>
>> btrfs_dev_replace_unlock(dev_replace, 1);
>>
>> + /*
>> + * Only change is_tgtdev_for_dev_replace flag after all its
>> + * users get released.
>> + */
>> + wait_target_device(tgt_device);
>> + tgt_device->is_tgtdev_for_dev_replace = 0;
>> btrfs_rm_dev_replace_blocked(fs_info);
>>
>> btrfs_rm_dev_replace_remove_srcdev(fs_info, src_device);
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index bb8592e1a364..74a6ee981b78 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2064,6 +2064,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>> WARN_ON(!tgtdev);
>> mutex_lock(&fs_info->fs_devices->device_list_mutex);
>>
>> + wait_target_device(tgtdev);
>> btrfs_sysfs_rm_device_link(fs_info->fs_devices, tgtdev);
>>
>> if (tgtdev->bdev)
>> @@ -2598,6 +2599,8 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>> device->is_tgtdev_for_dev_replace = 1;
>> device->mode = FMODE_EXCL;
>> device->dev_stats_valid = 1;
>> + atomic_set(&device->tgtdev_refs, 0);
>> + init_waitqueue_head(&device->tgtdev_wait);
>> set_blocksize(device->bdev, 4096);
>> device->fs_devices = fs_info->fs_devices;
>> list_add(&device->dev_list, &fs_info->fs_devices->devices);
>> @@ -2624,6 +2627,8 @@ void btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info *fs_info,
>> tgtdev->sector_size = sectorsize;
>> tgtdev->fs_info = fs_info;
>> tgtdev->in_fs_metadata = 1;
>> + atomic_set(&tgtdev->tgtdev_refs, 0);
>> + init_waitqueue_head(&tgtdev->tgtdev_wait);
>> }
>>
>> static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
>> @@ -5302,6 +5307,32 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
>> return bbio;
>> }
>>
>> +static void pin_bbio_target_device(struct btrfs_bio *bbio)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < bbio->num_stripes; i++) {
>> + struct btrfs_device *device = bbio->stripes[i].dev;
>> +
>> + if (device->is_tgtdev_for_dev_replace)
>> + atomic_inc(&device->tgtdev_refs);
>> + }
>> +}
>> +
>> +static void unpin_bbio_target_device(struct btrfs_bio *bbio)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < bbio->num_stripes; i++) {
>> + struct btrfs_device *device = bbio->stripes[i].dev;
>> +
>> + if (device->is_tgtdev_for_dev_replace) {
>> + atomic_dec(&device->tgtdev_refs);
>> + wake_up(&device->tgtdev_wait);
>> + }
>> + }
>> +}
>> +
>> void btrfs_get_bbio(struct btrfs_bio *bbio)
>> {
>> WARN_ON(!atomic_read(&bbio->refs));
>> @@ -5312,8 +5343,10 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
>> {
>> if (!bbio)
>> return;
>> - if (atomic_dec_and_test(&bbio->refs))
>> + if (atomic_dec_and_test(&bbio->refs)) {
>> + unpin_bbio_target_device(bbio);
>> kfree(bbio);
>> + }
>> }
>>
>> static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>> @@ -5868,6 +5901,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>> bbio->stripes[0].physical = physical_to_patch_in_first_stripe;
>> bbio->mirror_num = map->num_stripes + 1;
>> }
>> + pin_bbio_target_device(bbio);
>> out:
>> if (dev_replace_is_ongoing) {
>> btrfs_dev_replace_clear_lock_blocking(dev_replace);
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 811328984702..1b4069dbe1c3 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -149,6 +149,10 @@ struct btrfs_device {
>> /* Counter to record the change of device stats */
>> atomic_t dev_stats_ccnt;
>> atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
>> +
>> + /* To ensure we wait this target device before destroying it */
>> + atomic_t tgtdev_refs;
>> + wait_queue_head_t tgtdev_wait;
>> };
>>
>> /*
>> @@ -538,4 +542,10 @@ struct list_head *btrfs_get_fs_uuids(void);
>> void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
>> void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
>>
>> +static inline void wait_target_device(struct btrfs_device *tgtdev)
>> +{
>> + if (!tgtdev || !tgtdev->is_tgtdev_for_dev_replace)
>> + return;
>> + wait_event(tgtdev->tgtdev_wait, atomic_read(&tgtdev->tgtdev_refs) == 0);
>> +}
>> #endif
>> --
>> 2.11.0
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-02-02 0:23 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-01-17 1:10 [PATCH 1/2] btrfs: raid56: Don't keep rbio for later steal Qu Wenruo
2017-01-17 1:10 ` [PATCH 2/2] btrfs: replace: Use ref counts to avoid destroying target device when canceled Qu Wenruo
2017-01-17 22:47 ` Josef Bacik
2017-01-18 0:46 ` Qu Wenruo
2017-02-01 22:37 ` Filipe Manana
2017-02-02 0:23 ` Qu Wenruo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).