Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

public inbox for linux-xfs@vger.kernel.org

* Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option
       [not found]     ` <20141128181421.GA19461@google.com>
@ 2014-12-02 12:58       ` Jan Kara
  2014-12-02 17:55         ` Boaz Harrosh
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2014-12-02 12:58 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: linux-fsdevel, linux-ext4, Jan Kara, linux-btrfs, xfs

On Fri 28-11-14 13:14:21, Ted Tso wrote:
> On Fri, Nov 28, 2014 at 06:23:23PM +0100, Jan Kara wrote:
> > Hum, when someone calls fsync() for an inode, you likely want to sync
> > timestamps to disk even if everything else is clean. I think that doing
> > what you did in last version:
> > 	dirty = inode->i_state & I_DIRTY_INODE;
> > 	inode->i_state &= ~I_DIRTY_INODE;
> > 	spin_unlock(&inode->i_lock);
> > 	if (dirty & I_DIRTY_TIME)
> > 		mark_inode_dirty_sync(inode);
> > looks better to me. IMO when someone calls __writeback_single_inode() we
> > should write whatever we have...
> 
> Yes, but we also have to distinguish between what happens on an
> fsync() versus what happens on a periodic writeback if I_DIRTY_PAGES
> (but not I_DIRTY_SYNC or I_DIRTY_DATASYNC) is set.  So there is a
> check in the fsync() code path to handle the concern you raised above.
  Ah, this is likely the thing you have been talking about but which I was
constantly missing in my thoughts. You don't want to write times when the
inode has only dirty pages and timestamps - I was always thinking about a
situation where the inode has only dirty timestamps and no dirty pages. This
situation also complicates the writeback logic because when an inode has
dirty pages, you need to track it as a normal dirty inode for page writeback
(with dirtied_when corresponding to the time when the pages were dirtied) but
in parallel you now need to track the information that the inode has
timestamps that haven't been written for X long. And even if we stored how
old the timestamps are, it isn't easily possible to keep the list of inodes
with just dirty timestamps sorted by dirty time. So now I finally understand
why you did things the way you did them... Sorry for misleading you.

So let's restart the design so that things are clear:
1) We have a new inode bit I_DIRTY_TIME. This means that only timestamps in
the inode have changed. The desired behavior is that an inode with
I_DIRTY_TIME and without I_DIRTY_SYNC | I_DIRTY_DATASYNC is written by
background writeback only once per 24 hours. Such inodes do get written by
sync(2) and fsync(2) calls.

2) Inodes with only I_DIRTY_TIME are tracked in a new dirty list
b_dirty_time. We use i_wb_list list head for this. Unlike b_dirty list,
this list isn't kept sorted by dirtied_when. If queue_io() sees for_sync
bit set in the work item, it will call mark_inode_dirty_sync() for all
inodes in b_dirty_time before queuing io from b_dirty list. Once per hour
(or something like that) flusher thread scans the whole b_dirty_time list
and calls mark_inode_dirty_sync() for all inodes that have too old dirty
timestamps (to detect this we need a new time stamp in the inode).

3) When fsync() sees inode with I_DIRTY_TIME set, it calls
mark_inode_dirty_sync().

4) When we are dropping last inode reference and inode has I_DIRTY_TIME
set, we call mark_inode_dirty_sync().

And that should be it, right?
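
The four points above can be sketched as a small user-space model. This is
only an illustration, not the patch itself: the flag values, the
model_* helpers, the time_dirtied_when field, and the choice to clear
I_DIRTY_TIME on promotion are all assumptions made for the sketch. Only the
flag names, mark_inode_dirty_sync(), b_dirty_time, for_sync and the 24 hour
interval come from the discussion.

```c
/* User-space model of the proposed lazytime states; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names follow the thread; the bit values are arbitrary here. */
#define I_DIRTY_SYNC     (1u << 0)
#define I_DIRTY_DATASYNC (1u << 1)
#define I_DIRTY_PAGES    (1u << 2)
#define I_DIRTY_TIME     (1u << 3)

#define TIME_EXPIRE_SECS (24L * 60 * 60)	/* "once per 24 hours" */

struct model_inode {
	unsigned int i_state;
	long time_dirtied_when;	/* hypothetical "new time stamp in the inode" */
	int refcount;
};

/* Stand-in for mark_inode_dirty_sync(): lazy timestamps become a
 * normally dirty inode (clearing I_DIRTY_TIME is an assumption). */
void model_mark_inode_dirty_sync(struct model_inode *inode)
{
	inode->i_state &= ~I_DIRTY_TIME;
	inode->i_state |= I_DIRTY_SYNC;
}

/* Point 1: a timestamp-only update just sets I_DIRTY_TIME. */
void model_touch_time(struct model_inode *inode, long now)
{
	if (!(inode->i_state & I_DIRTY_TIME)) {
		inode->i_state |= I_DIRTY_TIME;
		inode->time_dirtied_when = now;
	}
}

/* Point 2: walk the unsorted b_dirty_time list and promote inodes whose
 * timestamps are too old, or all of them when for_sync is set.
 * Returns the number of inodes promoted. */
size_t model_scan_dirty_time(struct model_inode **list, size_t n,
			     long now, bool for_sync)
{
	size_t promoted = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_inode *inode = list[i];

		assert(inode != NULL);
		if (!(inode->i_state & I_DIRTY_TIME))
			continue;
		if (for_sync ||
		    now - inode->time_dirtied_when >= TIME_EXPIRE_SECS) {
			model_mark_inode_dirty_sync(inode);
			promoted++;
		}
	}
	return promoted;
}

/* Point 3: fsync() promotes lazy timestamps before writing. */
void model_fsync(struct model_inode *inode)
{
	if (inode->i_state & I_DIRTY_TIME)
		model_mark_inode_dirty_sync(inode);
}

/* Point 4: dropping the last reference promotes lazy timestamps. */
void model_iput(struct model_inode *inode)
{
	if (--inode->refcount == 0 && (inode->i_state & I_DIRTY_TIME))
		model_mark_inode_dirty_sync(inode);
}
```

Because the list is not sorted, the scan has to look at every entry, which
is why it only runs rarely (the "once per hour or something like that" in
point 2).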

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option
  2014-12-02 12:58       ` [PATCH-v5 1/5] vfs: add support for a lazytime mount option Jan Kara
@ 2014-12-02 17:55         ` Boaz Harrosh
  2014-12-02 19:23           ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Boaz Harrosh @ 2014-12-02 17:55 UTC (permalink / raw)
  To: Jan Kara, Ted Ts'o; +Cc: linux-fsdevel, linux-ext4, linux-btrfs, xfs

On 12/02/2014 02:58 PM, Jan Kara wrote:
> On Fri 28-11-14 13:14:21, Ted Tso wrote:
>> On Fri, Nov 28, 2014 at 06:23:23PM +0100, Jan Kara wrote:
>>> Hum, when someone calls fsync() for an inode, you likely want to sync
>>> timestamps to disk even if everything else is clean. I think that doing
>>> what you did in last version:
>>> 	dirty = inode->i_state & I_DIRTY_INODE;
>>> 	inode->i_state &= ~I_DIRTY_INODE;
>>> 	spin_unlock(&inode->i_lock);
>>> 	if (dirty & I_DIRTY_TIME)
>>> 		mark_inode_dirty_sync(inode);
>>> looks better to me. IMO when someone calls __writeback_single_inode() we
>>> should write whatever we have...
>>
>> Yes, but we also have to distinguish between what happens on an
>> fsync() versus what happens on a periodic writeback if I_DIRTY_PAGES
>> (but not I_DIRTY_SYNC or I_DIRTY_DATASYNC) is set.  So there is a
>> check in the fsync() code path to handle the concern you raised above.
>   Ah, this is the thing you have been likely talking about but which I was
> constantly missing in my thoughts. You don't want to write times when inode
> has only dirty pages and timestamps - 

This I do not understand. I thought that I_DIRTY_TIME, and the whole
lazytime mount option, is only for atime. So if there are dirty
pages then there are also m/ctime that changed and surely we want to
write these times to disk ASAP.

If we are lazytime also with m/ctime then I think I would like an
option for only atime lazy, because m/ctime is critical to some
operations even though I might want atime lazy.

Sorry for the slowness, I'm probably missing something
Thanks
Boaz


* Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option
  2014-12-02 17:55         ` Boaz Harrosh
@ 2014-12-02 19:23           ` Theodore Ts'o
  2014-12-02 20:37             ` Andreas Dilger
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2014-12-02 19:23 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, linux-ext4, Jan Kara, linux-btrfs, xfs

On Tue, Dec 02, 2014 at 07:55:48PM +0200, Boaz Harrosh wrote:
> 
> This I do not understand. I thought that I_DIRTY_TIME, and the whole
> lazytime mount option, is only for atime. So if there are dirty
> pages then there are also m/ctime that changed and surely we want to
> write these times to disk ASAP.

What are the situations where you are most concerned about mtime or
ctime being accurate after a crash?

I've been running with it on my laptop for a while now, and it's
certainly not a problem for build trees; remember, whenever you need
to update the inode to update i_blocks or i_size, the inode (with its
updated timestamps) will be flushed to disk anyway.

In actual practice, what happens in a build tree is that when make
decides that it needs to update a generated file, when the file is
created as a zero-length inode, m/ctime will be set to the time that
file is created, which is newer than its source files.  As the file is
written, the mtime is updated each time that we actually need to do an
allocating write.  In the case of the linker, it will seek to the
beginning of the file to update ELF header at the very end of its
operation, and *that* time will be left stale, such that the in-memory
mtime is perhaps a millisecond ahead of the on-disk mtime.  But in the
case of a crash, either time is such that make won't be confused.
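
As a rough illustration of why neither timestamp confuses make: a
simplified model of make's staleness test only asks whether the target is
older than its source. The function name needs_rebuild() is hypothetical
and real make implementations have more rules; this is only a sketch of
the comparison being relied on here.

```c
#include <stdbool.h>
#include <time.h>

/* Simplified make-style check: rebuild only when the target's mtime is
 * strictly older than the source's mtime. */
bool needs_rebuild(struct timespec src, struct timespec tgt)
{
	if (tgt.tv_sec != src.tv_sec)
		return tgt.tv_sec < src.tv_sec;
	return tgt.tv_nsec < src.tv_nsec;
}
```

With a source written at t=100s, a target whose on-disk mtime is 105s and
whose in-memory mtime is 105.001s is up to date in both cases, so losing
the last millisecond of mtime in a crash does not change the result.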

I'm not aware of an application which is doing a large number of
non-allocating random writes (for example, such as a database), where
said database actually cares about mtime being correct.  In fact, most
databases use fdatasync() to prevent the mtimes from being sync'ed out
to disk on each transaction, so they don't have guaranteed timestamp
accuracy after a crash anyway.  The problem is even if the database is
using fdatasync(), every five seconds we end up updating the mtime
anyway --- and in the case of ext4, we end up needing to take various
journal locks which on a sufficiently parallel workload and a
sufficiently fast disk, can actually cause measurable contention.
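
For reference, the database pattern described above looks roughly like
this from user space. It is a minimal sketch: overwrite_record() and its
arguments are hypothetical, and only the POSIX calls open(), pwrite(),
fdatasync() and close() are real APIs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite len bytes at offset off and flush the data only.
 * fdatasync() need not write metadata, such as mtime, that is not
 * required to read the data back.
 * Returns 0 on success, -1 on error. */
int overwrite_record(const char *path, const void *buf, size_t len, off_t off)
{
	int fd = open(path, O_WRONLY | O_CREAT, 0600);
	ssize_t n;
	int rc;

	if (fd < 0)
		return -1;
	n = pwrite(fd, buf, len, off);
	rc = (n == (ssize_t)len) ? fdatasync(fd) : -1;
	if (close(fd) != 0)
		rc = -1;
	return rc;
}
```

An application written this way has already given up on a guaranteed
mtime after a crash, which is the point being made above.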

Did you have such a use case or application in mind?

> If we are lazytime also with m/ctime then I think I would like an
> option for only atime lazy, because m/ctime is critical to some
> operations even though I might want atime lazy.

If there's a sufficiently compelling use case where we do actually
care about mtime/ctime being accurate, and the current semantics don't
provide enough of a guarantee, it's certainly something we could do.
I'd rather keep things simple unless it's really there.  (After all,
we did create the strictatime mount option, but I'm not sure anyone
ever ends up using it.  It would be a shame if we created a
strictcmtime, which had the same usage rate.)

I'll also note that if it's only about atime updates, with the default
relatime mount option, I'm not sure there's enough of a win to
justify a lazyatime-only option.  If you really need strict
c/mtime after a crash, maybe the best thing to do is to just simply
not use the lazytime mount option and be done with it.

Cheers,

					- Ted


* Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option
  2014-12-02 19:23           ` Theodore Ts'o
@ 2014-12-02 20:37             ` Andreas Dilger
  2014-12-02 21:01               ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Andreas Dilger @ 2014-12-02 20:37 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Boaz Harrosh, Jan Kara, xfs, linux-fsdevel, linux-ext4,
	linux-btrfs

On Dec 2, 2014, at 12:23 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Dec 02, 2014 at 07:55:48PM +0200, Boaz Harrosh wrote:
>> 
>> This I do not understand. I thought that I_DIRTY_TIME, and the whole
>> lazytime mount option, is only for atime. So if there are dirty
>> pages then there are also m/ctime that changed and surely we want to
>> write these times to disk ASAP.
> 
> What are the situations where you are most concerned about mtime or
> ctime being accurate after a crash?
> 
> I've been running with it on my laptop for a while now, and it's
> certainly not a problem for build trees; remember, whenever you need
> to update the inode to update i_blocks or i_size, the inode (with its
> updated timestamps) will be flushed to disk anyway.
[snip]
> I'm not aware of an application which is doing a large number of
> non-allocating random writes (for example, such as a database), where
> said database actually cares about mtime being correct.
[snip]
> Did you have such a use case or application in mind?


One thing that comes to mind is touch/utimes()/utimensat().  Those
should definitely not result in timestamps being kept only in memory
for 24h, since the whole point of those calls is to update the times.
It makes sense for these APIs to dirty the inode for proper writeout.

Cheers, Andreas

>> If we are lazytime also with m/ctime then I think I would like an
>> option for only atime lazy, because m/ctime is critical to some
>> operations even though I might want atime lazy.
> 
> If there's a sufficiently compelling use case where we do actually
> care about mtime/ctime being accurate, and the current semantics don't
> provide enough of a guarantee, it's certainly something we could do.
> I'd rather keep things simple unless it's really there.  (After all,
> we did create the strictatime mount option, but I'm not sure anyone
> ever ends up using it.  It would be a shame if we created a
> strictcmtime, which had the same usage rate.)
> 
> I'll also note that if it's only about atime updates, with the default
> relatime mount option, I'm not sure there's enough of a win to
> justify a lazyatime-only option.  If you really need strict
> c/mtime after a crash, maybe the best thing to do is to just simply
> not use the lazytime mount option and be done with it.
> 
> Cheers,
> 
> 					- Ted



* Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option
  2014-12-02 20:37             ` Andreas Dilger
@ 2014-12-02 21:01               ` Theodore Ts'o
  0 siblings, 0 replies; 5+ messages in thread
From: Theodore Ts'o @ 2014-12-02 21:01 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Boaz Harrosh, Jan Kara, xfs, linux-fsdevel, linux-ext4,
	linux-btrfs

On Tue, Dec 02, 2014 at 01:37:27PM -0700, Andreas Dilger wrote:
> 
> One thing that comes to mind is touch/utimes()/utimensat().  Those
> should definitely not result in timestamps being kept only in memory
> for 24h, since the whole point of those calls is to update the times.
> It makes sense for these APIs to dirty the inode for proper writeout.

Not a problem.  Touch/utimes* go through notify_change() and
->setattr, so they won't go through the I_DIRTY_TIME code path.
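
From user space, the explicit-timestamp case under discussion is just a
utimensat() call. The sketch below is illustrative only: touch_at() and
read_mtime() are hypothetical helpers, while open(), utimensat() and
stat() are the real POSIX interfaces.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Like "touch -d": create path if needed, then set atime and mtime to
 * sec. Returns 0 on success, -1 on error. */
int touch_at(const char *path, time_t sec)
{
	struct timespec ts[2] = {
		{ .tv_sec = sec, .tv_nsec = 0 },	/* atime */
		{ .tv_sec = sec, .tv_nsec = 0 },	/* mtime */
	};
	int fd = open(path, O_WRONLY | O_CREAT, 0644);

	if (fd < 0)
		return -1;
	close(fd);
	return utimensat(AT_FDCWD, path, ts, 0);
}

/* Read back the mtime in whole seconds, or -1 on error. */
long read_mtime(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return (long)st.st_mtime;
}
```

Since this is an explicit attribute change rather than a side effect of
read() or write(), it takes the setattr route described above and the
inode is dirtied in the normal way.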

					- Ted

