* Re: [PATCH-v9 0/3] add support for lazytime mount option
From: Michael Kerrisk @ 2015-02-02 6:03 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
Hi Ted,
Since this is an API change, linux-api@ should be CCed. Added.
Thanks,
Michael
On Mon, Feb 2, 2015 at 6:36 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> This is an updated version of what had originally been an
> ext4-specific patch which significantly improves performance by lazily
> writing timestamp updates (and in particular, mtime updates) to disk.
> The in-memory timestamps are always correct, but they are only written
> to disk when required for correctness.
>
> This provides a huge performance boost for ext4 due to how it handles
> journalling, but it's valuable for all file systems running on flash
> storage or drive-managed SMR disks by reducing the metadata write
> load. So upon request, I've moved the functionality to the VFS layer.
> Once the /sbin/mount program adds support for MS_LAZYTIME, all file
> systems should be able to benefit from this optimization.
>
> There is still an ext4-specific optimization, which may be applicable
> for other file systems which store more than one inode in a block, but
> it will require file system specific code. It is purely optional,
> however.
>
> For people interested in seeing how timestamp updates are held back, the
> following example commands, which enable the relevant tracepoints, may be
> helpful:
>
> mount -o remount,lazytime /
> cd /sys/kernel/debug/tracing
> echo 1 > events/writeback/writeback_lazytime/enable
> echo 1 > events/writeback/writeback_lazytime_iput/enable
> echo "state & 2048" > events/writeback/writeback_dirty_inode_enqueue/filter
> echo 1 > events/writeback/writeback_dirty_inode_enqueue/enable
> echo 1 > events/ext4/ext4_other_inode_update_time/enable
> cat trace_pipe
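The filter "state & 2048" above matches inodes whose state has bit 11 set. That bit is I_DIRTY_TIME, defined in patch 1/3 as (1 << 11) in include/linux/fs.h. Here is a small sketch of the same test, using a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>

/* From patch 1/3: #define I_DIRTY_TIME (1 << 11) */
#define I_DIRTY_TIME (1 << 11)

/* The literal used in the trace filter must equal I_DIRTY_TIME. */
_Static_assert(I_DIRTY_TIME == 2048, "trace filter value must equal I_DIRTY_TIME");

/* Mirrors the tracepoint filter expression "state & 2048". */
static bool enqueued_for_lazytime(unsigned long i_state)
{
    return (i_state & I_DIRTY_TIME) != 0;
}
```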
>
> You can also see how many lazytime inodes are in memory by looking in
> /sys/kernel/debug/bdi/<bdi>/stats
>
> Changes since -v8:
> - in ext4_update_other_inodes_time() clear I_DIRTY_TIME_EXPIRED as
> well as I_DIRTY_TIME
> - Fixed a bug which broke writeback in some cases (introduced in -v7)
>
> Changes since -v7:
> - Fix comment typos
> - Clear the I_DIRTY_TIME flag if I_DIRTY_INODE gets added in
> __mark_inode_dirty()
> - Fix a bug accidentally introduced in -v7 which broke lazytime altogether
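The rule in the second item can be sketched as a pure function over the inode state, modelled on __mark_inode_dirty() in patch 1/3. This is a simplification under stated assumptions. Locking, the ->dirty_inode() callback, the I_SYNC and unhashed-inode checks, and bdi wakeups are all omitted. The model_ names are hypothetical, and the low bit values are assumed to match fs.h of this era.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values assumed to match include/linux/fs.h of this era. */
#define I_DIRTY_SYNC     (1 << 0)
#define I_DIRTY_DATASYNC (1 << 1)
#define I_DIRTY_PAGES    (1 << 2)
#define I_DIRTY_TIME     (1 << 11)  /* from patch 1/3 */
#define I_DIRTY          (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
#define I_DIRTY_INODE    (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

enum model_list { MODEL_NONE, MODEL_B_DIRTY, MODEL_B_DIRTY_TIME };

/*
 * Apply @flags to @state following the rules in __mark_inode_dirty();
 * *list reports which writeback list the inode would be moved to.
 */
static int model_mark_inode_dirty(int state, int flags, enum model_list *list)
{
    bool dirtytime;
    int was_dirty;

    /* A real inode update subsumes a pending timestamp-only update. */
    if (flags & I_DIRTY_INODE)
        flags &= ~I_DIRTY_TIME;
    dirtytime = (flags & I_DIRTY_TIME) != 0;

    *list = MODEL_NONE;
    /* Nothing new, or timestamps will go out with the dirty inode anyway. */
    if ((state & flags) == flags ||
        (dirtytime && (state & I_DIRTY_INODE)))
        return state;

    was_dirty = state & I_DIRTY;   /* note: excludes I_DIRTY_TIME */
    if (flags & I_DIRTY_INODE)
        state &= ~I_DIRTY_TIME;
    state |= flags;

    /* Only an inode that was not already dirty is (re)queued. */
    if (!was_dirty)
        *list = dirtytime ? MODEL_B_DIRTY_TIME : MODEL_B_DIRTY;
    return state;
}
```

An inode that first gets I_DIRTY_TIME lands on b_dirty_time. A later I_DIRTY_SYNC drops I_DIRTY_TIME, and because I_DIRTY does not include I_DIRTY_TIME, the inode moves to b_dirty. Further timestamp-only updates are then absorbed.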
>
> Changes since -v6:
> - Add a new tracepoint writeback_dirty_inode_enqueue
> - Move generic handling of update_time() to generic_update_time(),
> so filesystems can more easily hook or modify update_time()
> - The file system's dirty_inode() will now always get called with
> I_DIRTY_TIME when the inode time is updated. (I_DIRTY_SYNC will
> also be set if the inode should be updated right away.) This allows
> file systems such as XFS to update their on-disk copy of the inode if
> I_DIRTY_TIME is set.
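For a file system like ext4, the resulting ->dirty_inode() contract reduces to a simple check. The sketch below follows the early return added to ext4_dirty_inode() in patch 1/3. The function name is hypothetical, and the I_DIRTY_SYNC and I_DIRTY_DATASYNC values are assumed to match fs.h of this era.

```c
#include <assert.h>
#include <stdbool.h>

#define I_DIRTY_SYNC     (1 << 0)   /* assumed, as in fs.h of this era */
#define I_DIRTY_DATASYNC (1 << 1)   /* assumed, as in fs.h of this era */
#define I_DIRTY_TIME     (1 << 11)  /* from patch 1/3 */

/*
 * Does a ->dirty_inode() call with @flags require journalling work?
 * A pure timestamp notification is skipped; any combination that also
 * carries I_DIRTY_SYNC or I_DIRTY_DATASYNC is handled as before.
 */
static bool model_ext4_dirty_inode_needs_work(int flags)
{
    assert(flags != 0);
    return flags != I_DIRTY_TIME;
}
```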
>
> Changes since -v5:
> - Tweak move_expired_inodes to handle sync() and syncfs(), and drop
> flush_sb_dirty_time().
> - Move logic for handling the b_dirty_time list into
> __mark_inode_dirty().
> - Move I_DIRTY back to its original definition, and use I_DIRTY_ALL
> for I_DIRTY plus I_DIRTY_TIME.
> - Fold some patches together to make the first patch easier to
> review (and modify/update).
> - Use the pre-existing writeback tracepoints instead of creating new
> fs tracepoints.
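Here is a sketch of the expiry rule that move_expired_inodes() applies to the b_dirty_time list in patch 1/3. For sync() and syncfs() (WB_REASON_SYNC), every lazytime inode is moved to b_io. Otherwise, only inodes dirtied more than 86400 * HZ jiffies (one day) ago are moved. The comparison is written in the wraparound-safe style of the kernel's time_after(). The model_ names are hypothetical, and the blockdev special case in inode_dirtied_after() is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is after b", in the style of the kernel's time_after(). */
static bool model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Should an inode on b_dirty_time be moved to b_io?  For sync writeback
 * older_than_this stays NULL in the patch, so every such inode qualifies.
 */
static bool model_dirty_time_expired(unsigned long now,
                                     unsigned long dirtied_when,
                                     unsigned long hz, bool reason_sync)
{
    unsigned long expire_time;

    assert(hz > 0);
    if (reason_sync)
        return true;
    expire_time = now - hz * 86400UL;   /* may wrap; that is fine */
    return !model_time_after(dirtied_when, expire_time);
}
```

Because the subtraction is done in unsigned arithmetic and compared as a signed difference, an inode dirtied just before the jiffies counter wrapped is still treated as recent.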
>
> Changes since -v4:
> - Fix ext4 optimization so it does not need to increment (and more
> problematically, decrement) the inode reference count
> - Per Christoph's suggestion, drop support for btrfs and xfs for now,
> because of issues with how btrfs and xfs handle dirty inode tracking. We
> can add btrfs and xfs support back later or at the end of this series if
> we want to revisit this decision.
> - Miscellaneous cleanups
>
> Changes since -v3:
> - inodes with I_DIRTY_TIME set are placed on a new bdi list,
> b_dirty_time. This allows filesystem-level syncs to more
> easily iterate over those inodes that need to have their
> timestamps written to disk.
> - dirty timestamps will be written out asynchronously on the final
> iput, instead of when the inode gets evicted.
> - move the definition of the new function
> find_active_inode_nowait() into a separate patch
> - create separate flag masks: I_DIRTY_WB and I_DIRTY_INODE, which
> indicate whether the inode needs to be on the write back lists,
> or whether the inode itself is dirty, while I_DIRTY means any one
> of the inode dirty flags is set. This simplifies the fs
> writeback logic which needs to test for different combinations of
> the inode dirty flags in different places.
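The masks described in this item were reworked in -v5. In the version of patch 1/3 in this thread, include/linux/fs.h defines I_DIRTY and I_DIRTY_ALL, and fs/fs-writeback.c defines I_DIRTY_INODE locally. The sketch below reproduces those masks, with the low bit values assumed to match fs.h of this era. It also shows why fsync paths such as __generic_file_fsync() switch from testing I_DIRTY to I_DIRTY_ALL: an inode with only lazy timestamps must not be treated as clean. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Low bits assumed to match include/linux/fs.h of this era. */
#define I_DIRTY_SYNC     (1 << 0)
#define I_DIRTY_DATASYNC (1 << 1)
#define I_DIRTY_PAGES    (1 << 2)
#define I_DIRTY_TIME     (1 << 11)  /* from patch 1/3 */

#define I_DIRTY       (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
#define I_DIRTY_ALL   (I_DIRTY | I_DIRTY_TIME)
/* Local to fs/fs-writeback.c in the patch: "the inode itself is dirty". */
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

/* The pre-patch cleanliness test: ignores lazy timestamps. */
static bool clean_by_old_test(unsigned long state)
{
    return !(state & I_DIRTY);
}

/* The post-patch test: pending lazy timestamps count as dirty. */
static bool clean_by_new_test(unsigned long state)
{
    return !(state & I_DIRTY_ALL);
}
```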
>
> Changes since -v2:
> - If update_time() updates i_version, it will not use lazytime (i.e.,
> the inode will be marked dirty so the change will be persisted to
> disk sooner rather than later). Yes, this eliminates the
> benefits of lazytime if the user is exporting the file system via
> NFSv4. Sad, but NFS's requirements seem to mandate this.
> - Fix time wrapping bug 49 days after the system boots (on a system
> with a 32-bit jiffies). Use get_monotonic_boottime() instead.
> - Clean up type warning in include/tracing/ext4.h
> - Added explicit parentheses for stylistic reasons
> - Added an is_readonly() inode operations method so btrfs doesn't
> have to duplicate code in update_time().
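The 49-day figure follows from the width of jiffies. A 32-bit counter wraps after 2^32 ticks, which at HZ=1000 is about 4,294,967 seconds, or roughly 49.7 days. A small sketch of the arithmetic follows. HZ=1000 is an assumption that matches the figure quoted; smaller HZ values give longer periods. The function names are hypothetical.

```c
#include <assert.h>

/* Seconds until a 32-bit counter ticking @hz times per second wraps. */
static double wrap_seconds_32(unsigned hz)
{
    assert(hz > 0);
    return 4294967296.0 / hz;   /* 2^32 ticks */
}

/* The same period expressed in days. */
static double wrap_days_32(unsigned hz)
{
    return wrap_seconds_32(hz) / 86400.0;
}
```

wrap_days_32(1000) is about 49.71, while wrap_days_32(100) is about 497.1. A 64-bit clock such as the one behind get_monotonic_boottime() does not wrap on any practical timescale.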
>
> Changes since -v1:
> - Added explanatory comments in update_time() regarding i_ts_dirty_days
> - Fix type used for days_since_boot
> - Improve SMP scalability in update_time and ext4_update_other_inodes_time
> - Added tracepoints to help test and characterize how often and under
> what circumstances inodes have their timestamps lazily updated
>
> Theodore Ts'o (3):
> vfs: add support for a lazytime mount option
> vfs: add find_inode_nowait() function
> ext4: add optimization for the lazytime mount option
>
> fs/ext4/inode.c | 70 +++++++++++++++++++++++++-
> fs/ext4/super.c | 10 ++++
> fs/fs-writeback.c | 62 +++++++++++++++++++----
> fs/gfs2/file.c | 4 +-
> fs/inode.c | 106 +++++++++++++++++++++++++++++++++------
> fs/jfs/file.c | 2 +-
> fs/libfs.c | 2 +-
> fs/proc_namespace.c | 1 +
> fs/sync.c | 8 +++
> include/linux/backing-dev.h | 1 +
> include/linux/fs.h | 10 ++++
> include/trace/events/ext4.h | 30 +++++++++++
> include/trace/events/writeback.h | 60 +++++++++++++++++++++-
> include/uapi/linux/fs.h | 4 +-
> mm/backing-dev.c | 10 +++-
> 15 files changed, 343 insertions(+), 37 deletions(-)
>
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
* Re: [PATCH-v9 1/3] vfs: add support for a lazytime mount option
From: Michael Kerrisk @ 2015-02-02 6:03 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
[CC += linux-api@]
On Mon, Feb 2, 2015 at 6:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> Add a new mount option which enables a new "lazytime" mode. This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode. The on-disk times will only get
> updated (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
>
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests them via an fsync(2) call.
>
> For workloads which feature a large number of random writes to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table. The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives. Even on conventional HDDs, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact long tail latencies --- which
> is a very big deal for web serving tiers (for example).
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
> fs/ext4/inode.c | 6 ++++
> fs/fs-writeback.c | 62 +++++++++++++++++++++++++++++++++-------
> fs/gfs2/file.c | 4 +--
> fs/inode.c | 56 +++++++++++++++++++++++++-----------
> fs/jfs/file.c | 2 +-
> fs/libfs.c | 2 +-
> fs/proc_namespace.c | 1 +
> fs/sync.c | 8 ++++++
> include/linux/backing-dev.h | 1 +
> include/linux/fs.h | 5 ++++
> include/trace/events/writeback.h | 60 +++++++++++++++++++++++++++++++++++++-
> include/uapi/linux/fs.h | 4 ++-
> mm/backing-dev.c | 10 +++++--
> 13 files changed, 186 insertions(+), 35 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5653fa4..628df5b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4840,11 +4840,17 @@ int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
> * If the inode is marked synchronous, we don't honour that here - doing
> * so would cause a commit on atime updates, which we don't bother doing.
> * We handle synchronous inodes at the highest possible level.
> + *
> + * If only the I_DIRTY_TIME flag is set, we can skip everything. If
> + * I_DIRTY_TIME and I_DIRTY_SYNC are set, the only inode fields we need
> + * to copy into the on-disk inode structure are the timestamp fields.
> */
> void ext4_dirty_inode(struct inode *inode, int flags)
> {
> handle_t *handle;
>
> + if (flags == I_DIRTY_TIME)
> + return;
> handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> if (IS_ERR(handle))
> goto out;
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2d609a5..0046861 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -247,14 +247,19 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
> return ret;
> }
>
> +#define EXPIRE_DIRTY_ATIME 0x0001
> +
> /*
> * Move expired (dirtied before work->older_than_this) dirty inodes from
> * @delaying_queue to @dispatch_queue.
> */
> static int move_expired_inodes(struct list_head *delaying_queue,
> struct list_head *dispatch_queue,
> + int flags,
> struct wb_writeback_work *work)
> {
> + unsigned long *older_than_this = NULL;
> + unsigned long expire_time;
> LIST_HEAD(tmp);
> struct list_head *pos, *node;
> struct super_block *sb = NULL;
> @@ -262,13 +267,21 @@ static int move_expired_inodes(struct list_head *delaying_queue,
> int do_sb_sort = 0;
> int moved = 0;
>
> + if ((flags & EXPIRE_DIRTY_ATIME) == 0)
> + older_than_this = work->older_than_this;
> + else if ((work->reason == WB_REASON_SYNC) == 0) {
> + expire_time = jiffies - (HZ * 86400);
> + older_than_this = &expire_time;
> + }
> while (!list_empty(delaying_queue)) {
> inode = wb_inode(delaying_queue->prev);
> - if (work->older_than_this &&
> - inode_dirtied_after(inode, *work->older_than_this))
> + if (older_than_this &&
> + inode_dirtied_after(inode, *older_than_this))
> break;
> list_move(&inode->i_wb_list, &tmp);
> moved++;
> + if (flags & EXPIRE_DIRTY_ATIME)
> + set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
> if (sb_is_blkdev_sb(inode->i_sb))
> continue;
> if (sb && sb != inode->i_sb)
> @@ -309,9 +322,12 @@ out:
> static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
> {
> int moved;
> +
> assert_spin_locked(&wb->list_lock);
> list_splice_init(&wb->b_more_io, &wb->b_io);
> - moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
> + moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work);
> + moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
> + EXPIRE_DIRTY_ATIME, work);
> trace_writeback_queue_io(wb, work, moved);
> }
>
> @@ -435,6 +451,8 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
> * updates after data IO completion.
> */
> redirty_tail(inode, wb);
> + } else if (inode->i_state & I_DIRTY_TIME) {
> + list_move(&inode->i_wb_list, &wb->b_dirty_time);
> } else {
> /* The inode is clean. Remove from writeback lists. */
> list_del_init(&inode->i_wb_list);
> @@ -481,7 +499,13 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> spin_lock(&inode->i_lock);
>
> dirty = inode->i_state & I_DIRTY;
> - inode->i_state &= ~I_DIRTY;
> + if (((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) &&
> + (inode->i_state & I_DIRTY_TIME)) ||
> + (inode->i_state & I_DIRTY_TIME_EXPIRED)) {
> + dirty |= I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED;
> + trace_writeback_lazytime(inode);
> + }
> + inode->i_state &= ~dirty;
>
> /*
> * Paired with smp_mb() in __mark_inode_dirty(). This allows
> @@ -501,8 +525,10 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>
> spin_unlock(&inode->i_lock);
>
> + if (dirty & I_DIRTY_TIME)
> + mark_inode_dirty_sync(inode);
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> - if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> + if (dirty & ~I_DIRTY_PAGES) {
> int err = write_inode(inode, wbc);
> if (ret == 0)
> ret = err;
> @@ -550,7 +576,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> * make sure inode is on some writeback list and leave it there unless
> * we have completely cleaned the inode.
> */
> - if (!(inode->i_state & I_DIRTY) &&
> + if (!(inode->i_state & I_DIRTY_ALL) &&
> (wbc->sync_mode != WB_SYNC_ALL ||
> !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
> goto out;
> @@ -565,7 +591,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> * If inode is clean, remove it from writeback lists. Otherwise don't
> * touch it. See comment above for explanation.
> */
> - if (!(inode->i_state & I_DIRTY))
> + if (!(inode->i_state & I_DIRTY_ALL))
> list_del_init(&inode->i_wb_list);
> spin_unlock(&wb->list_lock);
> inode_sync_complete(inode);
> @@ -707,7 +733,7 @@ static long writeback_sb_inodes(struct super_block *sb,
> wrote += write_chunk - wbc.nr_to_write;
> spin_lock(&wb->list_lock);
> spin_lock(&inode->i_lock);
> - if (!(inode->i_state & I_DIRTY))
> + if (!(inode->i_state & I_DIRTY_ALL))
> wrote++;
> requeue_inode(inode, wb, &wbc);
> inode_sync_complete(inode);
> @@ -1145,16 +1171,20 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
> * page->mapping->host, so the page-dirtying time is recorded in the internal
> * blockdev inode.
> */
> +#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> void __mark_inode_dirty(struct inode *inode, int flags)
> {
> struct super_block *sb = inode->i_sb;
> struct backing_dev_info *bdi = NULL;
> + int dirtytime;
> +
> + trace_writeback_mark_inode_dirty(inode, flags);
>
> /*
> * Don't do this for I_DIRTY_PAGES - that doesn't actually
> * dirty the inode itself
> */
> - if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> + if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME)) {
> trace_writeback_dirty_inode_start(inode, flags);
>
> if (sb->s_op->dirty_inode)
> @@ -1162,6 +1192,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>
> trace_writeback_dirty_inode(inode, flags);
> }
> + if (flags & I_DIRTY_INODE)
> + flags &= ~I_DIRTY_TIME;
> + dirtytime = flags & I_DIRTY_TIME;
>
> /*
> * Paired with smp_mb() in __writeback_single_inode() for the
> @@ -1169,16 +1202,21 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> */
> smp_mb();
>
> - if ((inode->i_state & flags) == flags)
> + if (((inode->i_state & flags) == flags) ||
> + (dirtytime && (inode->i_state & I_DIRTY_INODE)))
> return;
>
> if (unlikely(block_dump))
> block_dump___mark_inode_dirty(inode);
>
> spin_lock(&inode->i_lock);
> + if (dirtytime && (inode->i_state & I_DIRTY_INODE))
> + goto out_unlock_inode;
> if ((inode->i_state & flags) != flags) {
> const int was_dirty = inode->i_state & I_DIRTY;
>
> + if (flags & I_DIRTY_INODE)
> + inode->i_state &= ~I_DIRTY_TIME;
> inode->i_state |= flags;
>
> /*
> @@ -1225,8 +1263,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> }
>
> inode->dirtied_when = jiffies;
> - list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> + list_move(&inode->i_wb_list, dirtytime ?
> + &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
> spin_unlock(&bdi->wb.list_lock);
> + trace_writeback_dirty_inode_enqueue(inode);
>
> if (wakeup_bdi)
> bdi_wakeup_thread_delayed(bdi);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 6e600ab..15c44cf 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -655,7 +655,7 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
> {
> struct address_space *mapping = file->f_mapping;
> struct inode *inode = mapping->host;
> - int sync_state = inode->i_state & I_DIRTY;
> + int sync_state = inode->i_state & I_DIRTY_ALL;
> struct gfs2_inode *ip = GFS2_I(inode);
> int ret = 0, ret1 = 0;
>
> @@ -668,7 +668,7 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
> if (!gfs2_is_jdata(ip))
> sync_state &= ~I_DIRTY_PAGES;
> if (datasync)
> - sync_state &= ~I_DIRTY_SYNC;
> + sync_state &= ~(I_DIRTY_SYNC | I_DIRTY_TIME);
>
> if (sync_state) {
> ret = sync_inode_metadata(inode, 1);
> diff --git a/fs/inode.c b/fs/inode.c
> index aa149e7..4feb85c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -18,6 +18,7 @@
> #include <linux/buffer_head.h> /* for inode_has_buffers */
> #include <linux/ratelimit.h>
> #include <linux/list_lru.h>
> +#include <trace/events/writeback.h>
> #include "internal.h"
>
> /*
> @@ -30,7 +31,7 @@
> * inode_sb_list_lock protects:
> * sb->s_inodes, inode->i_sb_list
> * bdi->wb.list_lock protects:
> - * bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
> + * bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_wb_list
> * inode_hash_lock protects:
> * inode_hashtable, inode->i_hash
> *
> @@ -416,7 +417,8 @@ static void inode_lru_list_add(struct inode *inode)
> */
> void inode_add_lru(struct inode *inode)
> {
> - if (!(inode->i_state & (I_DIRTY | I_SYNC | I_FREEING | I_WILL_FREE)) &&
> + if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
> + I_FREEING | I_WILL_FREE)) &&
> !atomic_read(&inode->i_count) && inode->i_sb->s_flags & MS_ACTIVE)
> inode_lru_list_add(inode);
> }
> @@ -647,7 +649,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> spin_unlock(&inode->i_lock);
> continue;
> }
> - if (inode->i_state & I_DIRTY && !kill_dirty) {
> + if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
> spin_unlock(&inode->i_lock);
> busy = 1;
> continue;
> @@ -1432,11 +1434,20 @@ static void iput_final(struct inode *inode)
> */
> void iput(struct inode *inode)
> {
> - if (inode) {
> - BUG_ON(inode->i_state & I_CLEAR);
> -
> - if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
> - iput_final(inode);
> + if (!inode)
> + return;
> + BUG_ON(inode->i_state & I_CLEAR);
> +retry:
> + if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> + if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> + atomic_inc(&inode->i_count);
> + inode->i_state &= ~I_DIRTY_TIME;
> + spin_unlock(&inode->i_lock);
> + trace_writeback_lazytime_iput(inode);
> + mark_inode_dirty_sync(inode);
> + goto retry;
> + }
> + iput_final(inode);
> }
> }
> EXPORT_SYMBOL(iput);
> @@ -1495,14 +1506,9 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
> return 0;
> }
>
> -/*
> - * This does the actual work of updating an inodes time or version. Must have
> - * had called mnt_want_write() before calling this.
> - */
> -static int update_time(struct inode *inode, struct timespec *time, int flags)
> +int generic_update_time(struct inode *inode, struct timespec *time, int flags)
> {
> - if (inode->i_op->update_time)
> - return inode->i_op->update_time(inode, time, flags);
> + int iflags = I_DIRTY_TIME;
>
> if (flags & S_ATIME)
> inode->i_atime = *time;
> @@ -1512,9 +1518,27 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
> inode->i_ctime = *time;
> if (flags & S_MTIME)
> inode->i_mtime = *time;
> - mark_inode_dirty_sync(inode);
> +
> + if (!(inode->i_sb->s_flags & MS_LAZYTIME) || (flags & S_VERSION))
> + iflags |= I_DIRTY_SYNC;
> + __mark_inode_dirty(inode, iflags);
> return 0;
> }
> +EXPORT_SYMBOL(generic_update_time);
> +
> +/*
> + * This does the actual work of updating an inode's time or version. The
> + * caller must have called mnt_want_write() before calling this.
> + */
> +static int update_time(struct inode *inode, struct timespec *time, int flags)
> +{
> + int (*update_time)(struct inode *, struct timespec *, int);
> +
> + update_time = inode->i_op->update_time ? inode->i_op->update_time :
> + generic_update_time;
> +
> + return update_time(inode, time, flags);
> +}
>
> /**
> * touch_atime - update the access time
> diff --git a/fs/jfs/file.c b/fs/jfs/file.c
> index 33aa0cc..10815f8 100644
> --- a/fs/jfs/file.c
> +++ b/fs/jfs/file.c
> @@ -39,7 +39,7 @@ int jfs_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> return rc;
>
> mutex_lock(&inode->i_mutex);
> - if (!(inode->i_state & I_DIRTY) ||
> + if (!(inode->i_state & I_DIRTY_ALL) ||
> (datasync && !(inode->i_state & I_DIRTY_DATASYNC))) {
> /* Make sure committed changes hit the disk */
> jfs_flush_journal(JFS_SBI(inode->i_sb)->log, 1);
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 005843c..b2ffdb0 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -948,7 +948,7 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
>
> mutex_lock(&inode->i_mutex);
> ret = sync_mapping_buffers(inode->i_mapping);
> - if (!(inode->i_state & I_DIRTY))
> + if (!(inode->i_state & I_DIRTY_ALL))
> goto out;
> if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> goto out;
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 0f96f71..8db932d 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -44,6 +44,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
> { MS_SYNCHRONOUS, ",sync" },
> { MS_DIRSYNC, ",dirsync" },
> { MS_MANDLOCK, ",mand" },
> + { MS_LAZYTIME, ",lazytime" },
> { 0, NULL }
> };
> const struct proc_fs_info *fs_infop;
> diff --git a/fs/sync.c b/fs/sync.c
> index 01d9f18..fbc98ee 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -177,8 +177,16 @@ SYSCALL_DEFINE1(syncfs, int, fd)
> */
> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> {
> + struct inode *inode = file->f_mapping->host;
> +
> if (!file->f_op->fsync)
> return -EINVAL;
> + if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> + spin_lock(&inode->i_lock);
> + inode->i_state &= ~I_DIRTY_TIME;
> + spin_unlock(&inode->i_lock);
> + mark_inode_dirty_sync(inode);
> + }
> return file->f_op->fsync(file, start, end, datasync);
> }
> EXPORT_SYMBOL(vfs_fsync_range);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 5da6012..4cdf733 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -55,6 +55,7 @@ struct bdi_writeback {
> struct list_head b_dirty; /* dirty inodes */
> struct list_head b_io; /* parked for writeback */
> struct list_head b_more_io; /* parked for more writeback */
> + struct list_head b_dirty_time; /* time stamps are dirty */
> spinlock_t list_lock; /* protects the b_* lists */
> };
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f90c028..5ca285f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1746,8 +1746,12 @@ struct super_operations {
> #define __I_DIO_WAKEUP 9
> #define I_DIO_WAKEUP (1 << I_DIO_WAKEUP)
> #define I_LINKABLE (1 << 10)
> +#define I_DIRTY_TIME (1 << 11)
> +#define __I_DIRTY_TIME_EXPIRED 12
> +#define I_DIRTY_TIME_EXPIRED (1 << __I_DIRTY_TIME_EXPIRED)
>
> #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
> +#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
>
> extern void __mark_inode_dirty(struct inode *, int);
> static inline void mark_inode_dirty(struct inode *inode)
> @@ -1910,6 +1914,7 @@ extern int current_umask(void);
>
> extern void ihold(struct inode * inode);
> extern void iput(struct inode *);
> +extern int generic_update_time(struct inode *, struct timespec *, int);
>
> static inline struct inode *file_inode(const struct file *f)
> {
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index cee02d6..5ecb4c2 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -18,6 +18,8 @@
> {I_FREEING, "I_FREEING"}, \
> {I_CLEAR, "I_CLEAR"}, \
> {I_SYNC, "I_SYNC"}, \
> + {I_DIRTY_TIME, "I_DIRTY_TIME"}, \
> + {I_DIRTY_TIME_EXPIRED, "I_DIRTY_TIME_EXPIRED"}, \
> {I_REFERENCED, "I_REFERENCED"} \
> )
>
> @@ -68,6 +70,7 @@ DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
> TP_STRUCT__entry (
> __array(char, name, 32)
> __field(unsigned long, ino)
> + __field(unsigned long, state)
> __field(unsigned long, flags)
> ),
>
> @@ -78,16 +81,25 @@ DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
> strncpy(__entry->name,
> bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
> __entry->ino = inode->i_ino;
> + __entry->state = inode->i_state;
> __entry->flags = flags;
> ),
>
> - TP_printk("bdi %s: ino=%lu flags=%s",
> + TP_printk("bdi %s: ino=%lu state=%s flags=%s",
> __entry->name,
> __entry->ino,
> + show_inode_state(__entry->state),
> show_inode_state(__entry->flags)
> )
> );
>
> +DEFINE_EVENT(writeback_dirty_inode_template, writeback_mark_inode_dirty,
> +
> + TP_PROTO(struct inode *inode, int flags),
> +
> + TP_ARGS(inode, flags)
> +);
> +
> DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode_start,
>
> TP_PROTO(struct inode *inode, int flags),
> @@ -598,6 +610,52 @@ DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
> TP_ARGS(inode, wbc, nr_to_write)
> );
>
> +DECLARE_EVENT_CLASS(writeback_lazytime_template,
> + TP_PROTO(struct inode *inode),
> +
> + TP_ARGS(inode),
> +
> + TP_STRUCT__entry(
> + __field( dev_t, dev )
> + __field(unsigned long, ino )
> + __field(unsigned long, state )
> + __field( __u16, mode )
> + __field(unsigned long, dirtied_when )
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = inode->i_sb->s_dev;
> + __entry->ino = inode->i_ino;
> + __entry->state = inode->i_state;
> + __entry->mode = inode->i_mode;
> + __entry->dirtied_when = inode->dirtied_when;
> + ),
> +
> + TP_printk("dev %d,%d ino %lu dirtied %lu state %s mode 0%o",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + __entry->ino, __entry->dirtied_when,
> + show_inode_state(__entry->state), __entry->mode)
> +);
> +
> +DEFINE_EVENT(writeback_lazytime_template, writeback_lazytime,
> + TP_PROTO(struct inode *inode),
> +
> + TP_ARGS(inode)
> +);
> +
> +DEFINE_EVENT(writeback_lazytime_template, writeback_lazytime_iput,
> + TP_PROTO(struct inode *inode),
> +
> + TP_ARGS(inode)
> +);
> +
> +DEFINE_EVENT(writeback_lazytime_template, writeback_dirty_inode_enqueue,
> +
> + TP_PROTO(struct inode *inode),
> +
> + TP_ARGS(inode)
> +);
> +
> #endif /* _TRACE_WRITEBACK_H */
>
> /* This part must be outside protection */
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 3735fa0..9b964a5 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -90,6 +90,7 @@ struct inodes_stat_t {
> #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
> #define MS_I_VERSION (1<<23) /* Update inode I_version field */
> #define MS_STRICTATIME (1<<24) /* Always perform atime updates */
> +#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */
>
> /* These sb flags are internal to the kernel */
> #define MS_NOSEC (1<<28)
> @@ -100,7 +101,8 @@ struct inodes_stat_t {
> /*
> * Superblock flags that can be altered by MS_REMOUNT
> */
> -#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION)
> +#define MS_RMT_MASK (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
> + MS_LAZYTIME)
>
> /*
> * Old magic mount flag and mask
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 0ae0df5..915feea 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -69,10 +69,10 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> unsigned long background_thresh;
> unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> - unsigned long nr_dirty, nr_io, nr_more_io;
> + unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
> struct inode *inode;
>
> - nr_dirty = nr_io = nr_more_io = 0;
> + nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
> spin_lock(&wb->list_lock);
> list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
> nr_dirty++;
> @@ -80,6 +80,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> nr_io++;
> list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
> nr_more_io++;
> + list_for_each_entry(inode, &wb->b_dirty_time, i_wb_list)
> + if (inode->i_state & I_DIRTY_TIME)
> + nr_dirty_time++;
> spin_unlock(&wb->list_lock);
>
> global_dirty_limits(&background_thresh, &dirty_thresh);
> @@ -98,6 +101,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> "b_dirty: %10lu\n"
> "b_io: %10lu\n"
> "b_more_io: %10lu\n"
> + "b_dirty_time: %10lu\n"
> "bdi_list: %10u\n"
> "state: %10lx\n",
> (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
> @@ -111,6 +115,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> nr_dirty,
> nr_io,
> nr_more_io,
> + nr_dirty_time,
> !list_empty(&bdi->bdi_list), bdi->state);
> #undef K
>
> @@ -418,6 +423,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> INIT_LIST_HEAD(&wb->b_dirty);
> INIT_LIST_HEAD(&wb->b_io);
> INIT_LIST_HEAD(&wb->b_more_io);
> + INIT_LIST_HEAD(&wb->b_dirty_time);
> spin_lock_init(&wb->list_lock);
> INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
> }
> --
> 2.1.0
>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
* Re: [PATCH-v9 3/3] ext4: add optimization for the lazytime mount option
From: Michael Kerrisk @ 2015-02-02 6:03 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
[CC += linux-api@]
On Mon, Feb 2, 2015 at 6:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> Add an optimization for the MS_LAZYTIME mount option so that we will
> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> in a particular inode table block when we need to update some inode in
> that inode table block anyway.
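Here is a sketch of locating the other inodes that share an inode table block. ext4 inode numbers are one-based: inode 1 occupies the first slot of the first inode table block. So when inodes_per_block is a power of two, the first inode in a block is ((ino - 1) & ~(inodes_per_block - 1)) + 1. The loop below appears to assume that buf points at the start of the inode table block; the call site is outside this excerpt. Under that assumption, the masking in ext4_update_other_inodes_time() as posted (orig_ino & ~(inodes_per_block - 1)) appears to be off by one for this reason. The helper names are hypothetical.

```c
#include <assert.h>

/*
 * First inode number stored in the inode table block that holds @ino.
 * ext4 inode numbers start at 1, so shift to zero-based before masking.
 * @inodes_per_block must be a power of two (block size / inode size).
 */
static unsigned long first_ino_in_block(unsigned long ino,
                                        unsigned long inodes_per_block)
{
    assert(ino >= 1);
    assert(inodes_per_block != 0 &&
           (inodes_per_block & (inodes_per_block - 1)) == 0);
    return ((ino - 1) & ~(inodes_per_block - 1)) + 1;
}

/* Byte offset of @ino's raw inode within its inode table block. */
static unsigned long ino_offset_in_block(unsigned long ino,
                                         unsigned long inodes_per_block,
                                         unsigned long inode_size)
{
    return (ino - first_ino_in_block(ino, inodes_per_block)) * inode_size;
}
```

With 4k blocks and 256-byte inodes (16 per block), inodes 1 to 16 share the first block and inode 17 starts the next. Masking without the one-based adjustment would instead place inode 16 at the start of a block.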
>
> Also add some temporary code so that we can set the lazytime mount
> option without needing a modified /sbin/mount program which can set
> MS_LAZYTIME. We can eventually make this go away once util-linux has
> added support.
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
> fs/ext4/inode.c | 64 +++++++++++++++++++++++++++++++++++++++++++--
> fs/ext4/super.c | 10 +++++++
> include/trace/events/ext4.h | 30 +++++++++++++++++++++
> 3 files changed, 102 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 628df5b..9193ea1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4139,6 +4139,65 @@ static int ext4_inode_blocks_set(handle_t *handle,
> return 0;
> }
>
> +struct other_inode {
> + unsigned long orig_ino;
> + struct ext4_inode *raw_inode;
> +};
> +
> +static int other_inode_match(struct inode * inode, unsigned long ino,
> + void *data)
> +{
> + struct other_inode *oi = (struct other_inode *) data;
> +
> + if ((inode->i_ino != ino) ||
> + (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
> + I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
> + ((inode->i_state & I_DIRTY_TIME) == 0))
> + return 0;
> + spin_lock(&inode->i_lock);
> + if (((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
> + I_DIRTY_SYNC | I_DIRTY_DATASYNC)) == 0) &&
> + (inode->i_state & I_DIRTY_TIME)) {
> + struct ext4_inode_info *ei = EXT4_I(inode);
> +
> + inode->i_state &= ~(I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED);
> + spin_unlock(&inode->i_lock);
> +
> + spin_lock(&ei->i_raw_lock);
> + EXT4_INODE_SET_XTIME(i_ctime, inode, oi->raw_inode);
> + EXT4_INODE_SET_XTIME(i_mtime, inode, oi->raw_inode);
> + EXT4_INODE_SET_XTIME(i_atime, inode, oi->raw_inode);
> + ext4_inode_csum_set(inode, oi->raw_inode, ei);
> + spin_unlock(&ei->i_raw_lock);
> + trace_ext4_other_inode_update_time(inode, oi->orig_ino);
> + return -1;
> + }
> + spin_unlock(&inode->i_lock);
> + return -1;
> +}
> +
> +/*
> + * Opportunistically update the other time fields for other inodes in
> + * the same inode table block.
> + */
> +static void ext4_update_other_inodes_time(struct super_block *sb,
> + unsigned long orig_ino, char *buf)
> +{
> + struct other_inode oi;
> + unsigned long ino;
> + int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> + int inode_size = EXT4_INODE_SIZE(sb);
> +
> + oi.orig_ino = orig_ino;
> + ino = orig_ino & ~(inodes_per_block - 1);
> + for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> + if (ino == orig_ino)
> + continue;
> + oi.raw_inode = (struct ext4_inode *) buf;
> + (void) find_inode_nowait(sb, ino, other_inode_match, &oi);
> + }
> +}
> +
> /*
> * Post the struct inode info into an on-disk inode location in the
> * buffer-cache. This gobbles the caller's reference to the
> @@ -4248,10 +4307,11 @@ static int ext4_do_update_inode(handle_t *handle,
> cpu_to_le16(ei->i_extra_isize);
> }
> }
> -
> ext4_inode_csum_set(inode, raw_inode, ei);
> -
> spin_unlock(&ei->i_raw_lock);
> + if (inode->i_sb->s_flags & MS_LAZYTIME)
> + ext4_update_other_inodes_time(inode->i_sb, inode->i_ino,
> + bh->b_data);
>
> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 74c5f53..362b23c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1139,6 +1139,7 @@ enum {
> Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> Opt_usrquota, Opt_grpquota, Opt_i_version,
> Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
> + Opt_lazytime, Opt_nolazytime,
> Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
> Opt_inode_readahead_blks, Opt_journal_ioprio,
> Opt_dioread_nolock, Opt_dioread_lock,
> @@ -1202,6 +1203,8 @@ static const match_table_t tokens = {
> {Opt_i_version, "i_version"},
> {Opt_stripe, "stripe=%u"},
> {Opt_delalloc, "delalloc"},
> + {Opt_lazytime, "lazytime"},
> + {Opt_nolazytime, "nolazytime"},
> {Opt_nodelalloc, "nodelalloc"},
> {Opt_removed, "mblk_io_submit"},
> {Opt_removed, "nomblk_io_submit"},
> @@ -1459,6 +1462,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> case Opt_i_version:
> sb->s_flags |= MS_I_VERSION;
> return 1;
> + case Opt_lazytime:
> + sb->s_flags |= MS_LAZYTIME;
> + return 1;
> + case Opt_nolazytime:
> + sb->s_flags &= ~MS_LAZYTIME;
> + return 1;
> }
>
> for (m = ext4_mount_opts; m->token != Opt_err; m++)
> @@ -5020,6 +5029,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
> }
> #endif
>
> + *flags = (*flags & ~MS_LAZYTIME) | (sb->s_flags & MS_LAZYTIME);
> ext4_msg(sb, KERN_INFO, "re-mounted. Opts: %s", orig_data);
> kfree(orig_data);
> return 0;
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 6cfb841..6e5abd6 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -73,6 +73,36 @@ struct extent_status;
> { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"})
>
>
> +TRACE_EVENT(ext4_other_inode_update_time,
> + TP_PROTO(struct inode *inode, ino_t orig_ino),
> +
> + TP_ARGS(inode, orig_ino),
> +
> + TP_STRUCT__entry(
> + __field( dev_t, dev )
> + __field( ino_t, ino )
> + __field( ino_t, orig_ino )
> + __field( uid_t, uid )
> + __field( gid_t, gid )
> + __field( __u16, mode )
> + ),
> +
> + TP_fast_assign(
> + __entry->orig_ino = orig_ino;
> + __entry->dev = inode->i_sb->s_dev;
> + __entry->ino = inode->i_ino;
> + __entry->uid = i_uid_read(inode);
> + __entry->gid = i_gid_read(inode);
> + __entry->mode = inode->i_mode;
> + ),
> +
> + TP_printk("dev %d,%d orig_ino %lu ino %lu mode 0%o uid %u gid %u",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + (unsigned long) __entry->orig_ino,
> + (unsigned long) __entry->ino, __entry->mode,
> + __entry->uid, __entry->gid)
> +);
> +
> TRACE_EVENT(ext4_free_inode,
> TP_PROTO(struct inode *inode),
>
> --
> 2.1.0
>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
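Two pieces of bit arithmetic in the ext4 hunks above can be checked in isolation. This is a hypothetical user-space model, not kernel code, and all helper names are invented. `first_ino_as_patched()` reproduces the starting expression `orig_ino & ~(inodes_per_block - 1)` from ext4_update_other_inodes_time(). `first_ino_one_based()` computes which inode occupies slot 0 of the same block. It assumes, as ext4_get_inode_loc() does, that an inode's table index is `(ino - 1) % inodes_per_group`, and that inodes_per_block is a power of two dividing inodes_per_group. Under those assumptions the two expressions disagree. With 16 inodes per block, inode 17 occupies slot 0 of its block, but the patched expression starts the walk at 16. The loop's starting inode may therefore be worth double-checking. `merge_lazytime()` mirrors the ext4_remount() line that copies only the MS_LAZYTIME bit parsed by ext4 back into `*flags`. The fallback `MS_LAZYTIME` value is an assumption for C libraries whose <sys/mount.h> predates the flag.

```c
#include <assert.h>
#include <sys/mount.h>

#ifndef MS_LAZYTIME
#define MS_LAZYTIME (1 << 25) /* assumed fallback for older libc headers */
#endif

/* Hypothetical helper: true if n is a non-zero power of two. */
static int is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Start of the walk, as written in ext4_update_other_inodes_time(). */
static unsigned long first_ino_as_patched(unsigned long ino,
					  unsigned long inodes_per_block)
{
	assert(is_pow2(inodes_per_block));
	return ino & ~(inodes_per_block - 1);
}

/*
 * Inode number stored in slot 0 of the inode-table block holding @ino,
 * ASSUMING 1-based inode numbers (slot = (ino - 1) % inodes_per_block).
 */
static unsigned long first_ino_one_based(unsigned long ino,
					 unsigned long inodes_per_block)
{
	assert(ino >= 1 && is_pow2(inodes_per_block));
	return ((ino - 1) & ~(inodes_per_block - 1)) + 1;
}

/* Keep every bit of @flags except MS_LAZYTIME, which comes from @sb_flags. */
static unsigned long merge_lazytime(unsigned long flags, unsigned long sb_flags)
{
	const unsigned long lazy = MS_LAZYTIME;

	return (flags & ~lazy) | (sb_flags & lazy);
}
```

For example, `merge_lazytime(MS_RDONLY, MS_LAZYTIME)` yields `MS_RDONLY | MS_LAZYTIME`, and `merge_lazytime(MS_RDONLY | MS_LAZYTIME, 0)` yields `MS_RDONLY`. Every other requested bit passes through unchanged.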
* Re: [PATCH-v9 2/3] vfs: add find_inode_nowait() function
[not found] ` <1422855422-7444-3-git-send-email-tytso-3s7WtUTddSA@public.gmane.org>
@ 2015-02-02 6:04 ` Michael Kerrisk
0 siblings, 0 replies; 6+ messages in thread
From: Michael Kerrisk @ 2015-02-02 6:04 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
[CC += linux-api@]
On Mon, Feb 2, 2015 at 6:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> Add a new function find_inode_nowait() which is an even more general
> version of ilookup5_nowait(). It is designed for callers which need
> very fine-grained control over when the function is allowed to block
> or increment the inode's reference count.
>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
> fs/inode.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 5 +++++
> 2 files changed, 55 insertions(+)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 4feb85c..740cba7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1284,6 +1284,56 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
> }
> EXPORT_SYMBOL(ilookup);
>
> +/**
> + * find_inode_nowait - find an inode in the inode cache
> + * @sb: super block of file system to search
> + * @hashval: hash value (usually inode number) to search for
> + * @match: callback used for comparisons between inodes
> + * @data: opaque data pointer to pass to @match
> + *
> + * Search for the inode specified by @hashval and @data in the inode
> + * cache, where the helper function @match will return 0 if the inode
> + * does not match, 1 if the inode does match, and -1 if the search
> + * should be stopped. The @match function must be responsible for
> + * taking the i_lock spin_lock and checking i_state for an inode being
> + * freed or being initialized, and incrementing the reference count
> + * before returning 1. It also must not sleep, since it is called with
> + * the inode_hash_lock spinlock held.
> + *
> + * This is an even more generalized version of ilookup5(), for use when the
> + * function must never block --- find_inode() can block in
> + * __wait_on_freeing_inode() --- or when the caller cannot increment
> + * the reference count because the resulting iput() might cause an
> + * inode eviction. The tradeoff is that the @match function must be
> + * very carefully implemented.
> + */
> +struct inode *find_inode_nowait(struct super_block *sb,
> + unsigned long hashval,
> + int (*match)(struct inode *, unsigned long,
> + void *),
> + void *data)
> +{
> + struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> + struct inode *inode, *ret_inode = NULL;
> + int mval;
> +
> + spin_lock(&inode_hash_lock);
> + hlist_for_each_entry(inode, head, i_hash) {
> + if (inode->i_sb != sb)
> + continue;
> + mval = match(inode, hashval, data);
> + if (mval == 0)
> + continue;
> + if (mval == 1)
> + ret_inode = inode;
> + goto out;
> + }
> +out:
> + spin_unlock(&inode_hash_lock);
> + return ret_inode;
> +}
> +EXPORT_SYMBOL(find_inode_nowait);
> +
> int insert_inode_locked(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5ca285f..af810cc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2441,6 +2441,11 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
>
> extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
> extern struct inode * iget_locked(struct super_block *, unsigned long);
> +extern struct inode *find_inode_nowait(struct super_block *,
> + unsigned long,
> + int (*match)(struct inode *,
> + unsigned long, void *),
> + void *data);
> extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
> extern int insert_inode_locked(struct inode *);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> --
> 2.1.0
>
--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
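The kernel-doc comment above defines a three-way protocol for @match. A return of 0 means keep searching, 1 means return this inode, and -1 means stop and return NULL. A hypothetical user-space model makes the control flow concrete. `struct fake_inode`, `find_nowait_model()`, and both callbacks are invented for illustration. Locking, hashing, and reference counting are deliberately left out. In the kernel, the callback runs under inode_hash_lock and must take its own reference before returning 1. `touch_and_stop()` loosely follows the pattern of other_inode_match() in patch 3/3. It does its work as a side effect and returns -1, so no inode is handed back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode on one hash chain. */
struct fake_inode {
	unsigned long ino;
	int sb_id;                 /* models inode->i_sb */
	struct fake_inode *next;   /* models the i_hash chain */
};

typedef int (*match_fn)(struct fake_inode *inode, unsigned long hashval,
			void *data);

/* Same decision logic as the find_inode_nowait() loop, minus locking. */
static struct fake_inode *find_nowait_model(struct fake_inode *head, int sb_id,
					    unsigned long hashval,
					    match_fn match, void *data)
{
	struct fake_inode *inode;

	for (inode = head; inode != NULL; inode = inode->next) {
		int mval;

		if (inode->sb_id != sb_id)	/* other filesystem: skip */
			continue;
		mval = match(inode, hashval, data);
		if (mval == 0)			/* no match: keep walking */
			continue;
		return mval == 1 ? inode : NULL; /* 1: found, -1: give up */
	}
	return NULL;
}

/* Plain lookup callback: report the inode whose number matches. */
static int match_ino(struct fake_inode *inode, unsigned long ino, void *data)
{
	(void)data;
	return inode->ino == ino;
}

/* Side-effect callback: count the matching inode, then stop the search. */
static int touch_and_stop(struct fake_inode *inode, unsigned long ino,
			  void *data)
{
	int *touched = data;

	if (inode->ino != ino)
		return 0;
	(*touched)++;
	return -1;
}
```

Returning -1 instead of 1 is what lets a caller act on a cached inode without holding a reference. The kernel-doc comment names this case: the caller cannot increment the reference count because the resulting iput() might cause an inode eviction.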
* Re: [PATCH-v9 0/3] add support for lazytime mount option
[not found] ` <CAHO5Pa0ySnLb_UGUw3deVyZEr8gdzzdeyMP5rXcT1MLOeccLGg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-02-02 14:48 ` Theodore Ts'o
[not found] ` <20150202144833.GB2509-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2015-02-02 14:48 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: Linux Filesystem Development List, Al Viro, Linux API
On Mon, Feb 02, 2015 at 07:03:11AM +0100, Michael Kerrisk wrote:
> Hi Ted,
>
> Since this is an API change, linux-api@ should be CCed. Added.
I didn't realize a mount option would be considered an API change.
The man-pages project isn't documenting these things, is it?
- Ted
* Re: [PATCH-v9 0/3] add support for lazytime mount option
[not found] ` <20150202144833.GB2509-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
@ 2015-02-02 15:40 ` Michael Kerrisk (man-pages)
0 siblings, 0 replies; 6+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-02-02 15:40 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
Hi Ted,
On 2 February 2015 at 15:48, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Feb 02, 2015 at 07:03:11AM +0100, Michael Kerrisk wrote:
>> Hi Ted,
>>
>> Since this is an API change, linux-api@ should be CCed. Added.
>
> I didn't realize a mount option would be considered an API change.
Well, inasmuch as it's exposed via a system call, sure it is.
> The man-pages project isn't documenting these things, is it?
Indeed it is. See http://man7.org/linux/man-pages/man2/mount.2.html.
Thanks,
Michael
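Kerrisk's point is that a mount option like lazytime is reachable from user space through the mountflags argument of mount(2), independently of /sbin/mount. The sketch below is minimal and rests on several assumptions. It needs a kernel with this series applied and a caller holding CAP_SYS_ADMIN. The helper names are hypothetical. The fallback MS_LAZYTIME value covers older <sys/mount.h> headers that lack it. Per mount(2), a remount ignores source and filesystemtype, and mountflags should otherwise repeat the flags already in effect. For that reason the caller passes the current flags in.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_LAZYTIME
#define MS_LAZYTIME (1 << 25) /* assumed fallback for older libc headers */
#endif

/* Hypothetical helper: flags for a remount that turns lazytime on while
 * keeping the flags the filesystem is currently mounted with. */
static unsigned long lazytime_remount_flags(unsigned long current_flags)
{
	return current_flags | MS_REMOUNT | MS_LAZYTIME;
}

/* Hypothetical helper: remount @target with lazytime enabled.
 * Returns 0 on success, or -1 with errno set (and a message printed). */
static int remount_lazytime(const char *target, unsigned long current_flags)
{
	int ret = mount(NULL, target, NULL,
			lazytime_remount_flags(current_flags), NULL);

	if (ret == -1)
		perror("mount");
	return ret;
}
```

For a filesystem currently mounted read-only, `remount_lazytime("/mnt", MS_RDONLY)` would keep MS_RDONLY set and add MS_LAZYTIME. Leaving out the current flags would instead ask the kernel to change them.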
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Thread overview: 6+ messages
[not found] <1422855422-7444-1-git-send-email-tytso@mit.edu>
2015-02-02 6:03 ` [PATCH-v9 0/3] add support for lazytime mount option Michael Kerrisk
[not found] ` <CAHO5Pa0ySnLb_UGUw3deVyZEr8gdzzdeyMP5rXcT1MLOeccLGg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-02-02 14:48 ` Theodore Ts'o
[not found] ` <20150202144833.GB2509-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2015-02-02 15:40 ` Michael Kerrisk (man-pages)
[not found] ` <1422855422-7444-2-git-send-email-tytso@mit.edu>
[not found] ` <1422855422-7444-2-git-send-email-tytso-3s7WtUTddSA@public.gmane.org>
2015-02-02 6:03 ` [PATCH-v9 1/3] vfs: add support for a " Michael Kerrisk
[not found] ` <1422855422-7444-4-git-send-email-tytso@mit.edu>
2015-02-02 6:03 ` [PATCH-v9 3/3] ext4: add optimization for the " Michael Kerrisk
[not found] ` <1422855422-7444-3-git-send-email-tytso@mit.edu>
[not found] ` <1422855422-7444-3-git-send-email-tytso-3s7WtUTddSA@public.gmane.org>
2015-02-02 6:04 ` [PATCH-v9 2/3] vfs: add find_inode_nowait() function Michael Kerrisk