From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cantor2.suse.de ([195.135.220.15]:55819 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752970AbbEHWRb (ORCPT ); Fri, 8 May 2015 18:17:31 -0400 Message-ID: <554D35F6.8040601@suse.com> Date: Fri, 08 May 2015 18:17:26 -0400 From: Jeff Mahoney MIME-Version: 1.0 To: fdmanana@gmail.com CC: linux-btrfs Subject: Re: [PATCH v2] btrfs: iterate over unused chunk space in FITRIM References: <554909A5.4030307@suse.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Sender: linux-btrfs-owner@vger.kernel.org List-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 5/8/15 4:38 PM, Filipe David Manana wrote: > On Tue, May 5, 2015 at 7:19 PM, Jeff Mahoney > wrote: >> Since we now clean up block groups automatically as they become >> empty, iterating over block groups is no longer sufficient to >> discard unused space. >> >> This patch iterates over the unused chunk space and discards any >> regions that are unallocated, regardless of whether they were >> ever used. This is a change for btrfs but is consistent with >> other file systems. >> >> We do this in a transactionless manner since the discard process >> can take a substantial amount of time and a transaction would >> need to be started before the acquisition of the device list >> lock. That would mean a transaction would be held open across >> /all/ of the discards collectively. In order to prevent other >> threads from allocating or freeing chunks, we hold the chunks >> lock across the search and discard calls. We release it between >> searches to allow the file system to perform more-or-less >> normally. Since the running transaction can commit and disappear >> while we're using the transaction pointer, we take a reference to >> it and release it after the search. This is safe since it would >> happen normally at the end of the transaction commit after any >> locks are released anyway. >> >> Signed-off-by: Jeff Mahoney --- > > Hi Jeff, > > Missing changelog describing what changed between v1 and v2. > >> fs/btrfs/extent-tree.c | 96 >> ++++++++++++++++++++++++++++++++++++++++++++++++++ >> fs/btrfs/volumes.c | 61 ++++++++++++++++++++------------ >> fs/btrfs/volumes.h | 3 ++ 3 files changed, 137 >> insertions(+), 23 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 0ec8e22..6d1d74d 100644 --- a/fs/btrfs/extent-tree.c +++ >> b/fs/btrfs/extent-tree.c @@ -10031,10 +10031,94 @@ int >> btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 >> start, u64 end) return unpin_extent_range(root, start, end, >> false); } >> >> +/* + * It used to be that old block groups would be left around >> forever. + * Iterating over them would be enough to trim unused >> space. Since we + * now automatically remove them, we also need >> to iterate over unallocated + * space. + * + * We don't want a >> transaction for this since the discard may take a + * substantial >> amount of time. We don't require that a transaction be + * >> running, but we do need to take a running transaction into >> account + * to ensure that we're not discarding chunks that were >> released in + * the current transaction. + * + * Holding the >> chunks lock will prevent other threads from allocating + * or >> releasing chunks, but it won't prevent a running transaction + * >> from committing and releasing the memory that the pending chunks >> + * list head uses. For that, we need to take a reference to >> the + * transaction. + */ +static int >> btrfs_trim_free_extents(struct btrfs_device *device, + >> u64 minlen, u64 *trimmed) +{ + u64 start = 0, len = 0; + >> int ret; + + *trimmed = 0; + + /* Not writeable = >> nothing to do. */ + if (!device->writeable) + >> return 0; + + /* No free space = nothing to do. */ + >> if (device->total_bytes <= device->bytes_used) + >> return 0; + + ret = 0; + + while (1) { + >> struct btrfs_fs_info *fs_info = device->dev_root->fs_info; + >> struct btrfs_transaction *trans; + + ret = >> mutex_lock_interruptible(&fs_info->chunk_mutex); + >> if (ret) + return ret; + + >> spin_lock(&fs_info->trans_lock); + trans = >> fs_info->running_transaction; + if (trans) + >> atomic_inc(&trans->use_count); + >> spin_unlock(&fs_info->trans_lock); + + ret = >> find_free_dev_extent_start(trans, device, minlen, start, + >> &start, &len); + if (trans) + >> btrfs_put_transaction(trans); + + if (ret) { + >> mutex_unlock(&fs_info->chunk_mutex); + if >> (ret == -ENOSPC) + ret = 0; + >> break; + } + + ret = >> btrfs_issue_discard(device->bdev, start, len); + >> mutex_unlock(&fs_info->chunk_mutex); + + if (ret) + >> break; + + start += len; + *trimmed >> += len; + + if (fatal_signal_pending(current)) { + >> ret = -ERESTARTSYS; + break; + >> } + + cond_resched(); + } + + return >> ret; +} + int btrfs_trim_fs(struct btrfs_root *root, struct >> fstrim_range *range) { struct btrfs_fs_info *fs_info = >> root->fs_info; struct btrfs_block_group_cache *cache = NULL; + >> struct btrfs_device *device; + struct list_head *devices; >> u64 group_trimmed; u64 start; u64 end; @@ -10089,6 +10173,18 @@ >> int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range >> *range) cache = next_block_group(fs_info->tree_root, cache); } >> >> + >> mutex_lock(&root->fs_info->fs_devices->device_list_mutex); + >> devices = &root->fs_info->fs_devices->alloc_list; + >> list_for_each_entry(device, devices, dev_alloc_list) { + >> ret = btrfs_trim_free_extents(device, range->minlen, + >> &group_trimmed); + if (ret) + >> break; + + trimmed += group_trimmed; + } + >> mutex_unlock(&root->fs_info->fs_devices->device_list_mutex); + >> range->len = trimmed; return ret; } diff --git >> a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 96aebf3..ada8965 >> 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ >> -1051,15 +1051,19 @@ out: return ret; } >> >> -static int contains_pending_extent(struct btrfs_trans_handle >> *trans, +static int contains_pending_extent(struct >> btrfs_transaction *transaction, struct btrfs_device *device, u64 >> *start, u64 len) { + struct btrfs_fs_info *fs_info = >> device->dev_root->fs_info; struct extent_map *em; - struct >> list_head *search_list = &trans->transaction->pending_chunks; + >> struct list_head *search_list = &transaction->pending_chunks; > > transaction can be NULL so we need to check for that first (as you > do below). The ordering here is safe. We're not dereferencing the transaction, we're getting the address of one of the members. search_list will point to 0x100 or something. We don't ever dereference that without checking the transaction against NULL. - -Jeff >> int ret = 0; u64 physical_start = *start; >> >> + if (!transaction) + search_list = >> &fs_info->pinned_chunks; + again: list_for_each_entry(em, >> search_list, list) { struct map_lookup *map; @@ -1078,8 +1082,8 >> @@ again: ret = 1; } } - if (search_list == >> &trans->transaction->pending_chunks) { - >> search_list = &trans->root->fs_info->pinned_chunks; + if >> (search_list == &transaction->pending_chunks) { + >> search_list = &fs_info->pinned_chunks; goto again; } >> >> @@ -1088,12 +1092,13 @@ again: >> >> >> /* - * find_free_dev_extent - find free space in the specified >> device - * @device: the device which we search the free space >> in - * @num_bytes: the size of the free space that we need - * >> @start: store the start of the free space. - * @len: >> the size of the free space. that we find, or the size of the max >> - * free space if we don't find suitable free space + >> * find_free_dev_extent_start - find free space in the specified >> device + * @device: the device which we search the free >> space in + * @num_bytes: the size of the free space that we >> need + * @search_start: the position from which to begin the >> search + * @start: store the start of the free space. + * >> @len: the size of the free space. that we find, or the >> size + * of the max free space if we don't find >> suitable free space * * this uses a pretty simple search, the >> expectation is that it is * called very infrequently and that a >> given device has a small number @@ -1107,9 +1112,9 @@ again: * >> But if we don't find suitable free space, it is used to store the >> size of * the max free space. */ -int find_free_dev_extent(struct >> btrfs_trans_handle *trans, - struct >> btrfs_device *device, u64 num_bytes, - u64 >> *start, u64 *len) +int find_free_dev_extent_start(struct >> btrfs_transaction *transaction, + >> struct btrfs_device *device, u64 num_bytes, + >> u64 search_start, u64 *start, u64 *len) { struct btrfs_key key; >> struct btrfs_root *root = device->dev_root; @@ -1119,19 +1124,11 >> @@ int find_free_dev_extent(struct btrfs_trans_handle *trans, u64 >> max_hole_start; u64 max_hole_size; u64 extent_end; - u64 >> search_start; u64 search_end = device->total_bytes; int ret; int >> slot; struct extent_buffer *l; >> >> - /* FIXME use last free of some kind */ - - /* we >> don't want to overwrite the superblock on the drive, - * >> so we make sure to start at an offset of at least 1MB - >> */ - search_start = max(root->fs_info->alloc_start, 1024ull >> * 1024); - path = btrfs_alloc_path(); if (!path) return -ENOMEM; >> @@ -1192,7 +1189,7 @@ again: * Have to check before we set >> max_hole_start, otherwise * we could end up sending back this >> offset anyway. */ - if >> (contains_pending_extent(trans, device, + >> if (contains_pending_extent(transaction, device, &search_start, >> hole_size)) { if (key.offset >= search_start) { @@ -1241,7 >> +1238,7 @@ next: if (search_end > search_start) { hole_size = >> search_end - search_start; >> >> - if (contains_pending_extent(trans, device, >> &search_start, + if >> (contains_pending_extent(transaction, device, &search_start, >> hole_size)) { btrfs_release_path(path); goto again; @@ -1267,6 >> +1264,24 @@ out: return ret; } >> >> +int find_free_dev_extent(struct btrfs_trans_handle *trans, + >> struct btrfs_device *device, u64 num_bytes, + >> u64 *start, u64 *len) +{ + struct btrfs_root *root = >> device->dev_root; + u64 search_start; + + /* FIXME >> use last free of some kind */ + + /* + * we don't >> want to overwrite the superblock on the drive, + * so we >> make sure to start at an offset of at least 1MB + */ + >> search_start = max(root->fs_info->alloc_start, 1024ull * 1024); + >> return find_free_dev_extent_start(trans->transaction, device, + >> num_bytes, search_start, start, len); +} + static int >> btrfs_free_dev_extent(struct btrfs_trans_handle *trans, struct >> btrfs_device *device, u64 start, u64 *dev_extent_len) diff --git >> a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index ebc3133..30918a8 >> 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ >> -449,6 +449,9 @@ int btrfs_cancel_balance(struct btrfs_fs_info >> *fs_info); int btrfs_create_uuid_tree(struct btrfs_fs_info >> *fs_info); int btrfs_check_uuid_tree(struct btrfs_fs_info >> *fs_info); int btrfs_chunk_readonly(struct btrfs_root *root, u64 >> chunk_offset); +int find_free_dev_extent_start(struct >> btrfs_transaction *transaction, + struct >> btrfs_device *device, u64 num_bytes, + u64 >> search_start, u64 *start, u64 *max_avail); int >> find_free_dev_extent(struct btrfs_trans_handle *trans, struct >> btrfs_device *device, u64 num_bytes, u64 *start, u64 >> *max_avail); -- 1.8.5.6 >> >> >> -- Jeff Mahoney SUSE Labs -- To unsubscribe from this list: send >> the line "unsubscribe linux-btrfs" in the body of a message to >> majordomo@vger.kernel.org More majordomo info at >> http://vger.kernel.org/majordomo-info.html > > > - -- Jeff Mahoney SUSE Labs -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.19 (Darwin) iQIcBAEBAgAGBQJVTTX2AAoJEB57S2MheeWyUGYP/iIl5JchDH1JqFV/AWtdrf4B huxL1gFYH8hAVj0pr1av7+EnbPp9rPwIFZXanvbJC9KFM3SCGEnF9/IYHeTzxiUl ZH7UC6E4xnr4xXrgKVT5nhOIDwg5cSdvuBzGfG6HttdL7KxhiHNGATDAXIn/hU1h XbU+aze8RR5+AHmh4ZboMAJr0/2VRU7C5aHyzvnCJbBwY6IqaI+RLyJ5ClQsIK1H PTMAgNmKogieEmNcFDq4UKaMZDZ0lj4A3QDJ6XAGvXyFcecncz8vF9QrurusslDO b074P1QVnzCPlejiIxsalPjT441tvWTPc5tcIa7Is6Zuoe9ezjN3LFwUA1Yvr061 0/D667gN2/+Tjgak5zAvD/efkvzXdZV1GOm9fZdFeR6cBhXVYv2sZVBmf7llEtn6 EAk6gqqHmdJQT1SgG1hOqSGVAKH4FbGUGq40zuKpey8Yna0iOaYxMssYxEnJIeeP EPt56nIv84Np69pFIsembp1ZB6Evy5yFm6DzjFwUQ8kH8NHbz1GqMDKaOCuCAXPn hiYG0VK6tNrfwmzthxmgwlTwYtA2wx9ug/i8fnOLwkFyxnOJJxX3RGBRe5Wiqcaf RKP2a3ZbXSl+OyNguqr0Zg4GXcgtrccSqXj9FGvUqNSVTdYoo5B6ctbKPZtxA0u1 v5n8rcnwDvuLDKJuldG4 =HZMM -----END PGP SIGNATURE-----