per inode fsync optimization question

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

* per inode fsync optimization question
@ 2013-04-03 14:21 Dmitry Monakhov
  2013-04-03 14:50 ` Jan Kara
  0 siblings, 1 reply; 6+ messages in thread
From: Dmitry Monakhov @ 2013-04-03 14:21 UTC (permalink / raw)
  To: ext4 development, Jan Kara

inode store i_sync_tid and i_datasync_tid  in order to optimize journal
flushes and wait for commits only when necessary, but
fields are declared as tid_t(not atomic_t as it done in ext3) so we
have not synchronization between readers and writers, so gcc and cpu
is allowed to perform prefetch, cache and other stuff.
Looks like a bug, right?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: per inode fsync optimization question
  2013-04-03 14:21 per inode fsync optimization question Dmitry Monakhov
@ 2013-04-03 14:50 ` Jan Kara
  2013-04-03 15:09   ` Dmitry Monakhov
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2013-04-03 14:50 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: ext4 development, Jan Kara

On Wed 03-04-13 18:21:46, Dmitry Monakhov wrote:
> inode store i_sync_tid and i_datasync_tid  in order to optimize journal
> flushes and wait for commits only when necessary, but
> fields are declared as tid_t(not atomic_t as it done in ext3) so we
> have not synchronization between readers and writers, so gcc and cpu
> is allowed to perform prefetch, cache and other stuff.
> Looks like a bug, right?
  Reads and writes to atomic_t aren't guaranteed to be any kind of a
barrier (if fact they are compiled as simple stores and loads on x86). Only
arithmetic operations on atomic types are special. So using tid_t is just
fine.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: per inode fsync optimization question
  2013-04-03 14:50 ` Jan Kara
@ 2013-04-03 15:09   ` Dmitry Monakhov
  2013-04-03 15:15     ` Jan Kara
  0 siblings, 1 reply; 6+ messages in thread
From: Dmitry Monakhov @ 2013-04-03 15:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: ext4 development, Jan Kara

On Wed, 3 Apr 2013 16:50:55 +0200, Jan Kara <jack@suse.cz> wrote:
> On Wed 03-04-13 18:21:46, Dmitry Monakhov wrote:
> > inode store i_sync_tid and i_datasync_tid  in order to optimize journal
> > flushes and wait for commits only when necessary, but
> > fields are declared as tid_t(not atomic_t as it done in ext3) so we
> > have not synchronization between readers and writers, so gcc and cpu
> > is allowed to perform prefetch, cache and other stuff.
> > Looks like a bug, right?
>   Reads and writes to atomic_t aren't guaranteed to be any kind of a
> barrier (if fact they are compiled as simple stores and loads on x86). Only
> arithmetic operations on atomic types are special. So using tid_t is just
> fine.
Ok but what about prefetching?
Compiler is allowed to prefetch on early stage ?
should we use ACCESS_ONCE() or wmb() and rmb() here?


> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: per inode fsync optimization question
  2013-04-03 15:09   ` Dmitry Monakhov
@ 2013-04-03 15:15     ` Jan Kara
  2013-04-03 15:41       ` Dmitry Monakhov
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2013-04-03 15:15 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, ext4 development

On Wed 03-04-13 19:09:33, Dmitry Monakhov wrote:
> On Wed, 3 Apr 2013 16:50:55 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Wed 03-04-13 18:21:46, Dmitry Monakhov wrote:
> > > inode store i_sync_tid and i_datasync_tid  in order to optimize journal
> > > flushes and wait for commits only when necessary, but
> > > fields are declared as tid_t(not atomic_t as it done in ext3) so we
> > > have not synchronization between readers and writers, so gcc and cpu
> > > is allowed to perform prefetch, cache and other stuff.
> > > Looks like a bug, right?
> >   Reads and writes to atomic_t aren't guaranteed to be any kind of a
> > barrier (if fact they are compiled as simple stores and loads on x86). Only
> > arithmetic operations on atomic types are special. So using tid_t is just
> > fine.
> Ok but what about prefetching?
> Compiler is allowed to prefetch on early stage ?
> should we use ACCESS_ONCE() or wmb() and rmb() here?
  Yes, but prefetch can hardly happen before the syscall is started and
value from that time is enough. We just have to be sure that if user can
prove write(2) happened before fsync(2), then data written by write(2) are
on disk. So I don't think we need any barriers there.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: per inode fsync optimization question
  2013-04-03 15:15     ` Jan Kara
@ 2013-04-03 15:41       ` Dmitry Monakhov
  2013-04-03 16:03         ` Jan Kara
  0 siblings, 1 reply; 6+ messages in thread
From: Dmitry Monakhov @ 2013-04-03 15:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jan Kara, ext4 development

On Wed, 3 Apr 2013 17:15:22 +0200, Jan Kara <jack@suse.cz> wrote:
> On Wed 03-04-13 19:09:33, Dmitry Monakhov wrote:
> > On Wed, 3 Apr 2013 16:50:55 +0200, Jan Kara <jack@suse.cz> wrote:
> > > On Wed 03-04-13 18:21:46, Dmitry Monakhov wrote:
> > > > inode store i_sync_tid and i_datasync_tid  in order to optimize journal
> > > > flushes and wait for commits only when necessary, but
> > > > fields are declared as tid_t(not atomic_t as it done in ext3) so we
> > > > have not synchronization between readers and writers, so gcc and cpu
> > > > is allowed to perform prefetch, cache and other stuff.
> > > > Looks like a bug, right?
> > >   Reads and writes to atomic_t aren't guaranteed to be any kind of a
> > > barrier (if fact they are compiled as simple stores and loads on x86). Only
> > > arithmetic operations on atomic types are special. So using tid_t is just
> > > fine.
> > Ok but what about prefetching?
> > Compiler is allowed to prefetch on early stage ?
> > should we use ACCESS_ONCE() or wmb() and rmb() here?
>   Yes, but prefetch can hardly happen before the syscall is started and
> value from that time is enough. We just have to be sure that if user can
> prove write(2) happened before fsync(2), then data written by write(2) are
> on disk. So I don't think we need any barriers there.
Sorry for be annoying but what prevents us from following situation?:
DD:
fallocate(2)
write(2)

fsync(2)                       
 {prefetch}commit_tid = ie->i_sync_tid (T1)
                                               [flushd]
                                                ->convert_extents
                                                  -> ei->i_sync_tid = current_tid (T2)
 
 Observe that commit_tid == T1 (too old)
 issue a barrier and exit but
 data still in transaction which is not yet committed
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: per inode fsync optimization question
  2013-04-03 15:41       ` Dmitry Monakhov
@ 2013-04-03 16:03         ` Jan Kara
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2013-04-03 16:03 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, ext4 development

On Wed 03-04-13 19:41:38, Dmitry Monakhov wrote:
> On Wed, 3 Apr 2013 17:15:22 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Wed 03-04-13 19:09:33, Dmitry Monakhov wrote:
> > > On Wed, 3 Apr 2013 16:50:55 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > On Wed 03-04-13 18:21:46, Dmitry Monakhov wrote:
> > > > > inode store i_sync_tid and i_datasync_tid  in order to optimize journal
> > > > > flushes and wait for commits only when necessary, but
> > > > > fields are declared as tid_t(not atomic_t as it done in ext3) so we
> > > > > have not synchronization between readers and writers, so gcc and cpu
> > > > > is allowed to perform prefetch, cache and other stuff.
> > > > > Looks like a bug, right?
> > > >   Reads and writes to atomic_t aren't guaranteed to be any kind of a
> > > > barrier (if fact they are compiled as simple stores and loads on x86). Only
> > > > arithmetic operations on atomic types are special. So using tid_t is just
> > > > fine.
> > > Ok but what about prefetching?
> > > Compiler is allowed to prefetch on early stage ?
> > > should we use ACCESS_ONCE() or wmb() and rmb() here?
> >   Yes, but prefetch can hardly happen before the syscall is started and
> > value from that time is enough. We just have to be sure that if user can
> > prove write(2) happened before fsync(2), then data written by write(2) are
> > on disk. So I don't think we need any barriers there.
> Sorry for be annoying but what prevents us from following situation?:
> DD:
> fallocate(2)
> write(2)
> 
> fsync(2)                       
>  {prefetch}commit_tid = ie->i_sync_tid (T1)
>                                                [flushd]
>                                                 ->convert_extents
>                                                   -> ei->i_sync_tid = current_tid (T2)
>  
>  Observe that commit_tid == T1 (too old)
>  issue a barrier and exit but
>  data still in transaction which is not yet committed
  Heh, ok, you are careful :). So I have to be as well. Races with extent
conversion specifically are prevented by the call to
ext4_flush_unwritten_io(). That call effectively establishes a full barrier
by taking and dropping a spinlock. So if that call saw empty list of
conversions, we must later see fresh value of i_sync_tid.

Now ext4_flush_unwritten_io() is going away in my cleanup patches but then
the synchronization of writeback (and extent conversion) from flusher
thread and fsync() happens in filemap_write_and_wait_range() which
establishes similar barrier by looking at PageWriteback bit - i.e. if it
sees PageWriteback cleared, we must also see new value of i_sync_tid.

Thanks for poking into this because it made me realize things aren't as
trivial as I thought they are.

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-04-03 16:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-03 14:21 per inode fsync optimization question Dmitry Monakhov
2013-04-03 14:50 ` Jan Kara
2013-04-03 15:09   ` Dmitry Monakhov
2013-04-03 15:15     ` Jan Kara
2013-04-03 15:41       ` Dmitry Monakhov
2013-04-03 16:03         ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox