From: Theodore Ts'o <tytso@mit.edu>
To: Linux Filesystem Development List <linux-fsdevel@vger.kernel.org>
Cc: viro@ZenIV.linux.org.uk, Theodore Ts'o <tytso@mit.edu>
Subject: [PATCH-v9 0/3] add support for lazytime mount option
Date: Mon, 2 Feb 2015 00:36:59 -0500
Message-ID: <1422855422-7444-1-git-send-email-tytso@mit.edu>
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.
This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load. So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.
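As a quick check that a remount took effect, the following sketch lists the mount points whose options include lazytime. It assumes the option is reported as the literal string "lazytime" in the fourth field of /proc/mounts; the series touches fs/proc_namespace.c, but the exact string is an assumption here, and lazytime_mounts is a hypothetical helper name.

```shell
# List mount points whose comma-separated option field contains "lazytime".
# $1: a file in /proc/mounts format (defaults to /proc/mounts).
# Assumption: the kernel reports the option as the string "lazytime".
lazytime_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "lazytime") { print $2; break }
    }' "${1:-/proc/mounts}"
}

# Usage: lazytime_mounts            # inspect the running system
#        lazytime_mounts some_file  # inspect a saved copy
```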
There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code. It is purely optional,
however.
For people interested in seeing how timestamp updates are held back,
the following example commands, which enable the relevant debugging
tracepoints, may be helpful:
	mount -o remount,lazytime /
	cd /sys/kernel/debug/tracing
	echo 1 > events/writeback/writeback_lazytime/enable
	echo 1 > events/writeback/writeback_lazytime_iput/enable
	echo "state & 2048" > events/writeback/writeback_dirty_inode_enqueue/filter
	echo 1 > events/writeback/writeback_dirty_inode_enqueue/enable
	echo 1 > events/ext4/ext4_other_inode_update_time/enable
	cat trace_pipe
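The filter value 2048 in the commands above is 1 << 11, so the filter passes only events whose inode state has bit 11 set. That bit is presumably the I_DIRTY_TIME flag introduced by this series, though this letter does not spell out the value, so treat that mapping as an assumption. A minimal sketch of the same test in shell arithmetic, using a hypothetical helper name:

```shell
# Report whether an i_state value would pass the "state & 2048" filter.
# Assumption: bit 11 (2048) is the flag marking an inode whose only
# dirty state is its timestamps; the letter does not state the value.
filter_matches() {
    if [ $(( $1 & (1 << 11) )) -ne 0 ]; then
        echo match
    else
        echo "no match"
    fi
}

# Usage: filter_matches 2049   # bit 11 plus bit 0 set
#        filter_matches 7      # only the low three bits set
```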
You can also see how many lazytime inodes are in memory by looking in
/sys/kernel/debug/bdi/<bdi>/stats
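To look at every backing device at once, something like the following sketch may help. It assumes the counter appears on a line labelled "b_dirty_time" (named after the bdi list this series adds); the label is an assumption, so check the actual stats output on your kernel. bdi_dirty_time is a hypothetical helper name, and debugfs must be mounted.

```shell
# Print the lazytime-related line from each bdi stats file.
# $1: directory holding the per-bdi subdirectories
#     (defaults to /sys/kernel/debug/bdi).
# Assumption: the line of interest contains "b_dirty_time".
bdi_dirty_time() {
    for f in "${1:-/sys/kernel/debug/bdi}"/*/stats; do
        [ -r "$f" ] || continue
        printf '%s: ' "$(basename "$(dirname "$f")")"
        grep b_dirty_time "$f" || echo "(no b_dirty_time line)"
    done
}

# Usage (as root): bdi_dirty_time
```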
Changes since -v8:
- in ext4_update_other_inodes_time() clear I_DIRTY_TIME_EXPIRED as
well as I_DIRTY_TIME
- Fixed a bug which broke writeback in some cases (introduced in -v7)
Changes since -v7:
- Fix comment typos
- Clear the I_DIRTY_TIME flag if I_DIRTY_INODE gets added in
__mark_inode_dirty()
- Fix a bug accidentally introduced in -v7 which broke lazytime altogether
Changes since -v6:
- Add a new tracepoint writeback_dirty_inode_enqueue
- Move generic handling of update_time() to generic_update_time(),
so filesystems can more easily hook or modify update_time()
- The file system's dirty_inode() will now always get called with
I_DIRTY_TIME when the inode time is updated. (I_DIRTY_SYNC will
also be set if the inode should be updated right away.) This allows
file systems such as XFS to update its on-disk copy of the inode if
I_DIRTY_TIME is set.
Changes since -v5:
- Tweak move_expired_inodes to handle sync() and syncfs(), and drop
flush_sb_dirty_time().
- Move logic for handling the b_dirty_time list into
__mark_inode_dirty().
- Move I_DIRTY back to its original definition, and use I_DIRTY_ALL
for I_DIRTY plus I_DIRTY_TIME.
- Fold some patches together to make the first patch easier to
review (and modify/update).
- Use the pre-existing writeback tracepoints instead of creating new
fs tracepoints.
Changes since -v4:
- Fix ext4 optimization so it does not need to increment (and more
problematically, decrement) the inode reference count
- Per Christoph's suggestion, drop support for btrfs and xfs for now,
due to issues with how btrfs and xfs handle dirty inode tracking. We
can add btrfs and xfs support back later or at the end of this series
if we want to revisit this decision.
- Miscellaneous cleanups
Changes since -v3:
- inodes with I_DIRTY_TIME set are placed on a new bdi list,
b_dirty_time. This allows filesystem-level syncs to more
easily iterate over those inodes that need to have their
timestamps written to disk.
- dirty timestamps will be written out asynchronously on the final
iput, instead of when the inode gets evicted.
- separate the definition of the new function
find_active_inode_nowait() to a separate patch
- create separate flag masks: I_DIRTY_WB and I_DIRTY_INODE, which
indicate whether the inode needs to be on the write back lists,
or whether the inode itself is dirty, while I_DIRTY means any one
of the inode dirty flags is set. This simplifies the fs
writeback logic which needs to test for different combinations of
the inode dirty flags in different places.
Changes since -v2:
- If update_time() updates i_version, it will not use lazytime (i.e.,
the inode will be marked dirty so the change will be persisted to
disk sooner rather than later). Yes, this eliminates the
benefits of lazytime if the user is exporting the file system via
NFSv4. Sad, but NFS's requirements seem to mandate this.
- Fix time wrapping bug 49 days after the system boots (on a system
with a 32-bit jiffies). Use get_monotonic_boottime() instead.
- Clean up type warning in include/tracing/ext4.h
- Added explicit parentheses for stylistic reasons
- Added an is_readonly() inode operations method so btrfs doesn't
have to duplicate code in update_time().
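The 49-day figure in the jiffies item above is what an unsigned 32-bit tick counter gives when it advances 1000 times per second; the tick rate is not stated in this letter, so HZ=1000 is an assumption. A rough sketch of the arithmetic, using a hypothetical helper name and requiring a shell with 64-bit arithmetic:

```shell
# Whole days until an unsigned 32-bit tick counter wraps.
# $1: ticks per second (HZ); 86400 is the number of seconds in a day.
# Integer division truncates, so the result is rounded down.
jiffies_wrap_days() {
    echo $(( (1 << 32) / $1 / 86400 ))
}

# Usage: jiffies_wrap_days 1000
```

Lower tick rates push the wrap further out, which is why the bug would show up soonest on HZ=1000 kernels.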
Changes since -v1:
- Added explanatory comments in update_time() regarding i_ts_dirty_days
- Fix type used for days_since_boot
- Improve SMP scalability in update_time and ext4_update_other_inodes_time
- Added tracepoints to help test and characterize how often and under
what circumstances inodes have their timestamps lazily updated
Theodore Ts'o (3):
vfs: add support for a lazytime mount option
vfs: add find_inode_nowait() function
ext4: add optimization for the lazytime mount option
fs/ext4/inode.c | 70 +++++++++++++++++++++++++-
fs/ext4/super.c | 10 ++++
fs/fs-writeback.c | 62 +++++++++++++++++++----
fs/gfs2/file.c | 4 +-
fs/inode.c | 106 +++++++++++++++++++++++++++++++++------
fs/jfs/file.c | 2 +-
fs/libfs.c | 2 +-
fs/proc_namespace.c | 1 +
fs/sync.c | 8 +++
include/linux/backing-dev.h | 1 +
include/linux/fs.h | 10 ++++
include/trace/events/ext4.h | 30 +++++++++++
include/trace/events/writeback.h | 60 +++++++++++++++++++++-
include/uapi/linux/fs.h | 4 +-
mm/backing-dev.c | 10 +++-
15 files changed, 343 insertions(+), 37 deletions(-)
--
2.1.0