From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cantor2.suse.de ([195.135.220.15]:56576 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752642AbbEHWp4 convert rfc822-to-8bit (ORCPT ); Fri, 8 May 2015 18:45:56 -0400 References: <554909A5.4030307@suse.com> Mime-Version: 1.0 (1.0) In-Reply-To: Content-Type: text/plain; charset=us-ascii Message-Id: <5343733B-D959-4207-976F-EF5EEAAE9E5F@suse.com> Cc: linux-btrfs From: Jeff Mahoney Subject: Re: [PATCH v2] btrfs: iterate over unused chunk space in FITRIM Date: Fri, 8 May 2015 18:45:50 -0400 To: "fdmanana@gmail.com" Sender: linux-btrfs-owner@vger.kernel.org List-ID: > On May 8, 2015, at 5:16 PM, Filipe David Manana wrote: > >> On Fri, May 8, 2015 at 9:38 PM, Filipe David Manana wrote: >>> On Tue, May 5, 2015 at 7:19 PM, Jeff Mahoney wrote: >>> Since we now clean up block groups automatically as they become >>> empty, iterating over block groups is no longer sufficient to discard >>> unused space. >>> >>> This patch iterates over the unused chunk space and discards any regions >>> that are unallocated, regardless of whether they were ever used. This is >>> a change for btrfs but is consistent with other file systems. >>> >>> We do this in a transactionless manner since the discard process can take >>> a substantial amount of time and a transaction would need to be started >>> before the acquisition of the device list lock. That would mean a >>> transaction would be held open across /all/ of the discards collectively. >>> In order to prevent other threads from allocating or freeing chunks, we >>> hold the chunks lock across the search and discard calls. We release it >>> between searches to allow the file system to perform more-or-less >>> normally. Since the running transaction can commit and disappear while >>> we're using the transaction pointer, we take a reference to it and >>> release it after the search. This is safe since it would happen normally >>> at the end of the transaction commit after any locks are released anyway. >>> >>> Signed-off-by: Jeff Mahoney >>> --- >> >> Hi Jeff, >> >> Missing changelog describing what changed between v1 and v2. >> >>> fs/btrfs/extent-tree.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++ >>> fs/btrfs/volumes.c | 61 ++++++++++++++++++++------------ >>> fs/btrfs/volumes.h | 3 ++ >>> 3 files changed, 137 insertions(+), 23 deletions(-) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 0ec8e22..6d1d74d 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -10031,10 +10031,94 @@ int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end) >>> return unpin_extent_range(root, start, end, false); >>> } >>> >>> +/* >>> + * It used to be that old block groups would be left around forever. >>> + * Iterating over them would be enough to trim unused space. Since we >>> + * now automatically remove them, we also need to iterate over unallocated >>> + * space. >>> + * >>> + * We don't want a transaction for this since the discard may take a >>> + * substantial amount of time. We don't require that a transaction be >>> + * running, but we do need to take a running transaction into account >>> + * to ensure that we're not discarding chunks that were released in >>> + * the current transaction. >>> + * >>> + * Holding the chunks lock will prevent other threads from allocating >>> + * or releasing chunks, but it won't prevent a running transaction >>> + * from committing and releasing the memory that the pending chunks >>> + * list head uses. For that, we need to take a reference to the >>> + * transaction. >>> + */ >>> +static int btrfs_trim_free_extents(struct btrfs_device *device, >>> + u64 minlen, u64 *trimmed) >>> +{ >>> + u64 start = 0, len = 0; >>> + int ret; >>> + >>> + *trimmed = 0; >>> + >>> + /* Not writeable = nothing to do. */ >>> + if (!device->writeable) >>> + return 0; >>> + >>> + /* No free space = nothing to do. */ >>> + if (device->total_bytes <= device->bytes_used) >>> + return 0; >>> + >>> + ret = 0; >>> + >>> + while (1) { >>> + struct btrfs_fs_info *fs_info = device->dev_root->fs_info; >>> + struct btrfs_transaction *trans; >>> + >>> + ret = mutex_lock_interruptible(&fs_info->chunk_mutex); >>> + if (ret) >>> + return ret; >>> + >>> + spin_lock(&fs_info->trans_lock); >>> + trans = fs_info->running_transaction; >>> + if (trans) >>> + atomic_inc(&trans->use_count); >>> + spin_unlock(&fs_info->trans_lock); >>> + >>> + ret = find_free_dev_extent_start(trans, device, minlen, start, >>> + &start, &len); >>> + if (trans) >>> + btrfs_put_transaction(trans); >>> + >>> + if (ret) { >>> + mutex_unlock(&fs_info->chunk_mutex); >>> + if (ret == -ENOSPC) >>> + ret = 0; >>> + break; >>> + } >>> + >>> + ret = btrfs_issue_discard(device->bdev, start, len); >>> + mutex_unlock(&fs_info->chunk_mutex); >>> + >>> + if (ret) >>> + break; >>> + >>> + start += len; >>> + *trimmed += len; >>> + >>> + if (fatal_signal_pending(current)) { >>> + ret = -ERESTARTSYS; >>> + break; >>> + } >>> + >>> + cond_resched(); >>> + } >>> + >>> + return ret; >>> +} >>> + >>> int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range) >>> { >>> struct btrfs_fs_info *fs_info = root->fs_info; >>> struct btrfs_block_group_cache *cache = NULL; >>> + struct btrfs_device *device; >>> + struct list_head *devices; >>> u64 group_trimmed; >>> u64 start; >>> u64 end; >>> @@ -10089,6 +10173,18 @@ int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range) >>> cache = next_block_group(fs_info->tree_root, cache); >>> } >>> >>> + mutex_lock(&root->fs_info->fs_devices->device_list_mutex); >>> + devices = &root->fs_info->fs_devices->alloc_list; >>> + list_for_each_entry(device, devices, dev_alloc_list) { >>> + ret = btrfs_trim_free_extents(device, range->minlen, >>> + &group_trimmed); >>> + if (ret) >>> + break; >>> + >>> + trimmed += group_trimmed; >>> + } >>> + mutex_unlock(&root->fs_info->fs_devices->device_list_mutex); >>> + >>> range->len = trimmed; >>> return ret; >>> } >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index 96aebf3..ada8965 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -1051,15 +1051,19 @@ out: >>> return ret; >>> } >>> >>> -static int contains_pending_extent(struct btrfs_trans_handle *trans, >>> +static int contains_pending_extent(struct btrfs_transaction *transaction, >>> struct btrfs_device *device, >>> u64 *start, u64 len) >>> { >>> + struct btrfs_fs_info *fs_info = device->dev_root->fs_info; >>> struct extent_map *em; >>> - struct list_head *search_list = &trans->transaction->pending_chunks; >>> + struct list_head *search_list = &transaction->pending_chunks; >> >> transaction can be NULL so we need to check for that first (as you do below). >> >>> int ret = 0; >>> u64 physical_start = *start; >>> >>> + if (!transaction) >>> + search_list = &fs_info->pinned_chunks; >>> + >>> again: >>> list_for_each_entry(em, search_list, list) { >>> struct map_lookup *map; >>> @@ -1078,8 +1082,8 @@ again: >>> ret = 1; >>> } >>> } >>> - if (search_list == &trans->transaction->pending_chunks) { >>> - search_list = &trans->root->fs_info->pinned_chunks; >>> + if (search_list == &transaction->pending_chunks) { > > And here missing the null test too, which should be something like: > > if (transaction && search_list == &transaction->pending_chunks) { > That's the same case as above. We're not dereferencing transaction here either. The test in the if statement will fail because &transaction->pending_chunks will be 0x100 or something. -Jeff > >>> + search_list = &fs_info->pinned_chunks; >>> goto again; >>> } >>> >>> @@ -1088,12 +1092,13 @@ again: >>> >>> >>> /* >>> - * find_free_dev_extent - find free space in the specified device >>> - * @device: the device which we search the free space in >>> - * @num_bytes: the size of the free space that we need >>> - * @start: store the start of the free space. >>> - * @len: the size of the free space. that we find, or the size of the max >>> - * free space if we don't find suitable free space >>> + * find_free_dev_extent_start - find free space in the specified device >>> + * @device: the device which we search the free space in >>> + * @num_bytes: the size of the free space that we need >>> + * @search_start: the position from which to begin the search >>> + * @start: store the start of the free space. >>> + * @len: the size of the free space. that we find, or the size >>> + * of the max free space if we don't find suitable free space >>> * >>> * this uses a pretty simple search, the expectation is that it is >>> * called very infrequently and that a given device has a small number >>> @@ -1107,9 +1112,9 @@ again: >>> * But if we don't find suitable free space, it is used to store the size of >>> * the max free space. >>> */ >>> -int find_free_dev_extent(struct btrfs_trans_handle *trans, >>> - struct btrfs_device *device, u64 num_bytes, >>> - u64 *start, u64 *len) >>> +int find_free_dev_extent_start(struct btrfs_transaction *transaction, >>> + struct btrfs_device *device, u64 num_bytes, >>> + u64 search_start, u64 *start, u64 *len) >>> { >>> struct btrfs_key key; >>> struct btrfs_root *root = device->dev_root; >>> @@ -1119,19 +1124,11 @@ int find_free_dev_extent(struct btrfs_trans_handle *trans, >>> u64 max_hole_start; >>> u64 max_hole_size; >>> u64 extent_end; >>> - u64 search_start; >>> u64 search_end = device->total_bytes; >>> int ret; >>> int slot; >>> struct extent_buffer *l; >>> >>> - /* FIXME use last free of some kind */ >>> - >>> - /* we don't want to overwrite the superblock on the drive, >>> - * so we make sure to start at an offset of at least 1MB >>> - */ >>> - search_start = max(root->fs_info->alloc_start, 1024ull * 1024); >>> - >>> path = btrfs_alloc_path(); >>> if (!path) >>> return -ENOMEM; >>> @@ -1192,7 +1189,7 @@ again: >>> * Have to check before we set max_hole_start, otherwise >>> * we could end up sending back this offset anyway. >>> */ >>> - if (contains_pending_extent(trans, device, >>> + if (contains_pending_extent(transaction, device, >>> &search_start, >>> hole_size)) { >>> if (key.offset >= search_start) { >>> @@ -1241,7 +1238,7 @@ next: >>> if (search_end > search_start) { >>> hole_size = search_end - search_start; >>> >>> - if (contains_pending_extent(trans, device, &search_start, >>> + if (contains_pending_extent(transaction, device, &search_start, >>> hole_size)) { >>> btrfs_release_path(path); >>> goto again; >>> @@ -1267,6 +1264,24 @@ out: >>> return ret; >>> } >>> >>> +int find_free_dev_extent(struct btrfs_trans_handle *trans, >>> + struct btrfs_device *device, u64 num_bytes, >>> + u64 *start, u64 *len) >>> +{ >>> + struct btrfs_root *root = device->dev_root; >>> + u64 search_start; >>> + >>> + /* FIXME use last free of some kind */ >>> + >>> + /* >>> + * we don't want to overwrite the superblock on the drive, >>> + * so we make sure to start at an offset of at least 1MB >>> + */ >>> + search_start = max(root->fs_info->alloc_start, 1024ull * 1024); >>> + return find_free_dev_extent_start(trans->transaction, device, >>> + num_bytes, search_start, start, len); >>> +} >>> + >>> static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans, >>> struct btrfs_device *device, >>> u64 start, u64 *dev_extent_len) >>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h >>> index ebc3133..30918a8 100644 >>> --- a/fs/btrfs/volumes.h >>> +++ b/fs/btrfs/volumes.h >>> @@ -449,6 +449,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info); >>> int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info); >>> int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info); >>> int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset); >>> +int find_free_dev_extent_start(struct btrfs_transaction *transaction, >>> + struct btrfs_device *device, u64 num_bytes, >>> + u64 search_start, u64 *start, u64 *max_avail); >>> int find_free_dev_extent(struct btrfs_trans_handle *trans, >>> struct btrfs_device *device, u64 num_bytes, >>> u64 *start, u64 *max_avail); >>> -- >>> 1.8.5.6 >>> >>> >>> -- >>> Jeff Mahoney >>> SUSE Labs >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> >> -- >> Filipe David Manana, >> >> "Reasonable men adapt themselves to the world. >> Unreasonable men adapt the world to themselves. >> That's why all progress depends on unreasonable men." > > > > -- > Filipe David Manana, > > "Reasonable men adapt themselves to the world. > Unreasonable men adapt the world to themselves. > That's why all progress depends on unreasonable men." > -- Jeff Mahoney SUSE Labs