From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Jan Ziak <0xe2.0x9a.0x9b@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
Date: Mon, 7 Mar 2022 15:31:17 +0800
Message-ID: <3c668ffe-edb0-bbbb-cfe0-e307bad79b1a@gmx.com>
In-Reply-To: <e5bb2e23-2101-dcc3-695e-f3a0f5a4aba7@gmx.com>
[-- Attachment #1: Type: text/plain, Size: 3611 bytes --]
On 2022/3/7 10:39, Qu Wenruo wrote:
>
>
> On 2022/3/7 10:23, Jan Ziak wrote:
>> On Mon, Mar 7, 2022 at 1:48 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> On 2022/3/6 23:59, Jan Ziak wrote:
>>>> I would like to report that btrfs in Linux kernel 5.16.12 mounted with
>>>> the autodefrag option wrote 5TB in a single day to a 1TB SSD that is
>>>> about 50% full.
>>>>
>>>> Defragmenting 0.5TB on a drive that is 50% full should write far
>>>> less than 5TB.
>>>
>>> If using defrag ioctl, that's a good and solid expectation.
>>>
>>> Autodefrag will mark any file which got smaller writes (<64K) for scan.
>>> For smaller extents than 64K, they will be re-dirtied for writeback.
>>
>> The NVMe device has 512-byte sectors, but has another namespace with
>> 4K sectors. Will it help btrfs-autodefrag to reformat the drive to 4K
>> sectors? I expect that it won't help - I am asking just in case my
>> expectation is wrong.
>
> The minimal sector size of btrfs is 4K, so I don't believe it would
> cause any difference.
>
>>
>>> So in theory, if the cleaner is triggered very frequently to do
>>> autodefrag, it can indeed easily amplify the writes.
>>
>> According to usr/bin/glances, the sqlite app is writing less than 1 MB
>> per second to the NVMe device. btrfs's autodefrag write amplification
>> is from the 1 MB/s to approximately 200 MB/s.
>
> This is definitely something wrong.
>
> Autodefrag by default should only get triggered every 300s, thus even
> all new bytes are re-dirtied, it should only cause a less than 300M
> write burst every 300s, not a consistent write.
>
>>
>>> Are you using commit= mount option? Which would reduce the commit
>>> interval thus trigger autodefrag more frequently.
>>
>> I am not using commit= mount option.
>>
>>>> CPU utilization on an otherwise idle machine is approximately 600% all
>>>> the time: btrfs-cleaner 100%, kworkers...btrfs 500%.
>>>
>>> The problem is why the CPU usage is at 100% for cleaner.
>>>
>>> Would you please apply this patch on your kernel?
>>> https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/
>>>
>>>
>>> Then enable the following trace events...
>>
>> I will try to apply the patch, collect the events and post the
>> results. First, I will wait for the sqlite file to gain about 1
>> million extents, which shouldn't take too long.
>
> Thank you very much for the future trace events log.
>
> That would be the determining data for us to solve it.

I forgot to mention that the patch relies on refactors from the earlier
patches in the series, so you may want to apply the whole patchset.

Alternatively, use the attached diff, which I manually backported to
v5.16.12.

Thanks,
Qu
>
>>
>> ----
>>
>> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds
>> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag
>> file-with-million-extents" doesn't finish even after several minutes
>> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5
>> ioctl syscalls per second - and appears to be slowing down as the
>> value of the "fm_start" ioctl argument grows; e2fsprogs version
>> 1.46.5). It would be nice if filefrag was faster than just a few
>> ioctls per second.
>
> This is mostly a race with autodefrag.
>
> Both are using file extent map, thus if autodefrag is still trying to
> redirty the file again and again, it would definitely cause problems for
> anything also using file extent map.
>
> Thanks,
> Qu
>>
>> ----
>>
>> Sincerely
>> Jan
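
As a rough sanity check on the numbers quoted above, the following C
sketch redoes the arithmetic. It assumes decimal units (1 TB =
1,000,000 MB) and takes the 1 MB/s application rate, the 200 MB/s device
rate, the 300 s interval and the 5 TB/day total from this thread. The
function names are made up for illustration and are not btrfs code.

```c
#include <stdio.h>

/* Average throughput in MB/s for a number of terabytes written per day.
 * Assumption: decimal units, 1 TB = 1,000,000 MB; one day = 86400 s. */
double tb_per_day_to_mb_per_s(double tb_per_day)
{
	return tb_per_day * 1e6 / 86400.0;
}

/* Upper bound on the data re-dirtied in one autodefrag interval, assuming
 * every byte the application wrote in that interval is rewritten once. */
double burst_mb_per_interval(double app_mb_per_s, double interval_s)
{
	return app_mb_per_s * interval_s;
}

/* Write amplification: device write rate divided by application write rate. */
double amplification(double device_mb_per_s, double app_mb_per_s)
{
	return device_mb_per_s / app_mb_per_s;
}

/* Print the three figures for the values reported in the thread. */
void print_report(void)
{
	printf("5 TB/day average:      %.1f MB/s\n", tb_per_day_to_mb_per_s(5.0));
	printf("expected burst bound:  %.0f MB per 300 s\n",
	       burst_mb_per_interval(1.0, 300.0));
	printf("observed amplification: %.0fx\n", amplification(200.0, 1.0));
}
```

Under these assumptions, 5 TB/day averages out to roughly 58 MB/s over
the whole day, while the expected behaviour would be a burst of at most
300 MB once per 300 s. A sustained 200 MB/s against 1 MB/s of
application writes is a 200x amplification, which is why the constant
write load points to a bug rather than to normal autodefrag behaviour.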
[-- Attachment #2: 0001-btrfs-add-trace-events-for-defrag.patch --]
[-- Type: text/x-patch, Size: 8043 bytes --]
From 757bf0aa39c44fc7c3e8e57f1c785ab6c7cffa8a Mon Sep 17 00:00:00 2001
Message-Id: <757bf0aa39c44fc7c3e8e57f1c785ab6c7cffa8a.1646638257.git.wqu@suse.com>
From: Qu Wenruo <wqu@suse.com>
Date: Sun, 13 Feb 2022 14:19:20 +0800
Subject: [PATCH] btrfs: add trace events for defrag

This is the backport for v5.16.12, without the dependency on the
btrfs_defrag_ctrl refactor.

This patch introduces the following trace events:

- trace_defrag_add_target()
- trace_defrag_one_locked_range()
- trace_defrag_file_start()
- trace_defrag_file_end()

In most cases, all of them are needed to debug policy-related defrag
bugs.

Example output (with TASK, CPU, TIMESTAMP and UUID omitted):

defrag_file_start: <UUID>: root=5 ino=257 start=0 len=131072 extent_thresh=262144 newer_than=7 flags=0x0 compress=0 max_sectors_to_defrag=1024
defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7
defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7
...
defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7
defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7
defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7
defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7
...
defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7
defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7
defrag_one_locked_range: <UUID>: root=5 ino=257 start=0 len=65536
defrag_file_end: <UUID>: root=5 ino=257 sectors_defragged=16 last_scanned=131072 ret=0

Although the defrag_add_target() output is lengthy, it shows the details
of each extent map we found.

Combined with the extra info from defrag_file_start(), we can check
whether each target em is correct for our defrag policy.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/ioctl.c | 6 ++
include/trace/events/btrfs.h | 128 +++++++++++++++++++++++++++++++++++
2 files changed, 134 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 541a4fbfd79e..622d10ac3e97 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1272,6 +1272,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
add:
last_is_target = true;
range_len = min(extent_map_end(em), start + len) - cur;
+ trace_defrag_add_target(inode, em, cur, range_len);
/*
* This one is a good target, check if it can be merged into
* last range of the target list.
@@ -1366,6 +1367,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
if (ret < 0)
return ret;
+ trace_defrag_one_locked_range(inode, start, (u32)len);
clear_extent_bit(&inode->io_tree, start, start + len - 1,
EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
EXTENT_DEFRAG, 0, 0, cached_state);
@@ -1591,6 +1593,9 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
/* Align the range */
cur = round_down(range->start, fs_info->sectorsize);
last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
+ trace_defrag_file_start(BTRFS_I(inode), cur, last_byte + 1 - cur,
+ extent_thresh, newer_than, max_to_defrag,
+ range->flags, range->compress_type);
/*
* If we were not given a ra, allocate a readahead context. As
@@ -1690,6 +1695,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
btrfs_inode_unlock(inode, 0);
}
+ trace_defrag_file_end(BTRFS_I(inode), ret, sectors_defragged, cur);
return ret;
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8f58fd95efc7..98eb8f4a04c6 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2263,6 +2263,134 @@ DEFINE_EVENT(btrfs__space_info_update, update_bytes_pinned,
TP_ARGS(fs_info, sinfo, old, diff)
);
+TRACE_EVENT(defrag_one_locked_range,
+
+ TP_PROTO(const struct btrfs_inode *inode, u64 start, u32 len),
+
+ TP_ARGS(inode, start, len),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, root )
+ __field( u64, ino )
+ __field( u64, start )
+ __field( u32, len )
+ ),
+
+ TP_fast_assign_btrfs(inode->root->fs_info,
+ __entry->root = inode->root->root_key.objectid;
+ __entry->ino = btrfs_ino(inode);
+ __entry->start = start;
+ __entry->len = len;
+ ),
+
+ TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%u",
+ __entry->root, __entry->ino, __entry->start, __entry->len)
+);
+
+TRACE_EVENT(defrag_add_target,
+
+ TP_PROTO(const struct btrfs_inode *inode, const struct extent_map *em,
+ u64 start, u32 len),
+
+ TP_ARGS(inode, em, start, len),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, root )
+ __field( u64, ino )
+ __field( u64, target_start )
+ __field( u32, target_len )
+ __field( u64, em_generation )
+ __field( u64, em_start )
+ __field( u64, em_len )
+ ),
+
+ TP_fast_assign_btrfs(inode->root->fs_info,
+ __entry->root = inode->root->root_key.objectid;
+ __entry->ino = btrfs_ino(inode);
+ __entry->target_start = start;
+ __entry->target_len = len;
+ __entry->em_generation = em->generation;
+ __entry->em_start = em->start;
+ __entry->em_len = em->len;
+ ),
+
+ TP_printk_btrfs("root=%llu ino=%llu target_start=%llu target_len=%u "
+ "found em=%llu len=%llu generation=%llu",
+ __entry->root, __entry->ino, __entry->target_start,
+ __entry->target_len, __entry->em_start, __entry->em_len,
+ __entry->em_generation)
+);
+
+TRACE_EVENT(defrag_file_start,
+
+ TP_PROTO(const struct btrfs_inode *inode,
+ u64 start, u64 len, u32 extent_thresh, u64 newer_than,
+ unsigned long max_sectors_to_defrag, u64 flags, u32 compress),
+
+ TP_ARGS(inode, start, len, extent_thresh, newer_than,
+ max_sectors_to_defrag, flags, compress),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, root )
+ __field( u64, ino )
+ __field( u64, start )
+ __field( u64, len )
+ __field( u64, newer_than )
+ __field( u64, max_sectors_to_defrag )
+ __field( u32, extent_thresh )
+ __field( u8, flags )
+ __field( u8, compress )
+ ),
+
+ TP_fast_assign_btrfs(inode->root->fs_info,
+ __entry->root = inode->root->root_key.objectid;
+ __entry->ino = btrfs_ino(inode);
+ __entry->start = start;
+ __entry->len = len;
+ __entry->extent_thresh = extent_thresh;
+ __entry->newer_than = newer_than;
+ __entry->max_sectors_to_defrag = max_sectors_to_defrag;
+ __entry->flags = flags;
+ __entry->compress = compress;
+ ),
+
+ TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%llu "
+ "extent_thresh=%u newer_than=%llu flags=0x%x compress=%u "
+ "max_sectors_to_defrag=%llu",
+ __entry->root, __entry->ino, __entry->start, __entry->len,
+ __entry->extent_thresh, __entry->newer_than, __entry->flags,
+ __entry->compress, __entry->max_sectors_to_defrag)
+);
+
+TRACE_EVENT(defrag_file_end,
+
+ TP_PROTO(const struct btrfs_inode *inode,
+ int ret, u64 sectors_defragged, u64 last_scanned),
+
+ TP_ARGS(inode, ret, sectors_defragged, last_scanned),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, root )
+ __field( u64, ino )
+ __field( u64, sectors_defragged )
+ __field( u64, last_scanned )
+ __field( int, ret )
+ ),
+
+ TP_fast_assign_btrfs(inode->root->fs_info,
+ __entry->root = inode->root->root_key.objectid;
+ __entry->ino = btrfs_ino(inode);
+ __entry->sectors_defragged = sectors_defragged;
+ __entry->last_scanned = last_scanned;
+ __entry->ret = ret;
+ ),
+
+ TP_printk_btrfs("root=%llu ino=%llu sectors_defragged=%llu "
+ "last_scanned=%llu ret=%d",
+ __entry->root, __entry->ino, __entry->sectors_defragged,
+ __entry->last_scanned, __entry->ret)
+);
+
#endif /* _TRACE_BTRFS_H */
/* This part must be outside protection */
--
2.35.1
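
Once the events are collected, the defrag_file_end lines can be summed
to estimate how much data each defrag pass rewrote. Below is a minimal
userspace C sketch, assuming the line format shown in the commit message
above and a 4K sector size (in the example output, sectors_defragged=16
corresponds to the 65536-byte locked range). The struct and function
names are hypothetical and not part of the patch.

```c
#include <stdio.h>
#include <string.h>

/* Fields printed by the defrag_file_end trace event. */
struct defrag_file_end {
	unsigned long long root;
	unsigned long long ino;
	unsigned long long sectors_defragged;
	unsigned long long last_scanned;
	int ret;
};

/* Return 1 if the line is a defrag_file_end event and all five fields
 * were parsed, 0 otherwise. Anything before "root=" (task, CPU,
 * timestamp, UUID) is skipped. */
int parse_defrag_file_end(const char *line, struct defrag_file_end *ev)
{
	const char *p = strstr(line, "defrag_file_end:");

	if (!p)
		return 0;
	p = strstr(p, "root=");
	if (!p)
		return 0;
	return sscanf(p,
		      "root=%llu ino=%llu sectors_defragged=%llu "
		      "last_scanned=%llu ret=%d",
		      &ev->root, &ev->ino, &ev->sectors_defragged,
		      &ev->last_scanned, &ev->ret) == 5;
}

/* Sum the defragged bytes over a stream of trace lines.
 * Assumption: sectorsize is supplied by the caller (4096 here matches
 * the example output; other filesystems may use a different value). */
unsigned long long sum_defragged_bytes(FILE *in, unsigned int sectorsize)
{
	char line[1024];
	struct defrag_file_end ev;
	unsigned long long total = 0;

	while (fgets(line, sizeof(line), in))
		if (parse_defrag_file_end(line, &ev))
			total += ev.sectors_defragged * sectorsize;
	return total;
}
```

Calling sum_defragged_bytes(stdin, 4096) on the saved trace text gives a
total that can be compared with the bytes the application itself wrote
over the same period.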