Re: Data races in ext4 - Pedro Fonseca

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Pedro Fonseca <pfonseca@mpi-sws.org>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, adilger.kernel@dilger.ca
Subject: Re: Data races in ext4
Date: Thu, 10 Apr 2014 17:40:54 +0200	[thread overview]
Message-ID: <5346BB86.50001@mpi-sws.org> (raw)
In-Reply-To: <20140409213340.GB15303@thunk.org>

On 04/09/2014 11:33 PM, Theodore Ts'o wrote:
> The ones relating to ext4_do_update_inode() and generic_fillattr()
> look fine to me.  Basically the first may happen if there are two
> CPU's simultaneously trying to update the on-disk data structure for
> an inode from the in-memory data structure, and that's not a problem
> since the in-memory data structure is always authoratative.  The
> second happens if someone tries to stat(2) an inode while some inode
> field is getting updated.  Most of the inode field updates will be for
> things like mtime and atime updates, or a chown or chmod, where a
> single field is getting updated, and that's not a problem since there
> is no guarantee which version of the inode information you'll get when
> you call stat(2) while the inode is getting modified out from under
> you.  It is possible that the userspace might see the i_blocks and
> i_size be out of sync, but I really can't bring myself to care about
> that.

Regarding the generic_fillattr() function, apart from the inconsistency 
between i_blocks/i_size, it looks like it can also cause wrong values to 
be returned because several of those fields are 64 bits. This means that 
each assignment gets converted into multiple instructions (in 32-bit 
architectures), allowing other instructions to interleave in between. 
For example, this is the assembly code I get when compiling 
generic_fillattr():
>     stat->atime = inode->i_atime;
> c10aa017:   8b 48 3c                mov    0x3c(%eax),%ecx
> c10aa01a:   8b 58 40                mov    0x40(%eax),%ebx
> c10aa01d:   89 4a 28                mov    %ecx,0x28(%edx)
> c10aa020:   89 5a 2c                mov    %ebx,0x2c(%edx)
>     stat->mtime = inode->i_mtime;
> c10aa023:   8b 48 44                mov    0x44(%eax),%ecx
> c10aa026:   8b 58 48                mov    0x48(%eax),%ebx
> c10aa029:   89 4a 30                mov    %ecx,0x30(%edx)
> c10aa02c:   89 5a 34                mov    %ebx,0x34(%edx)
>     stat->ctime = inode->i_ctime;
> c10aa02f:   8b 48 4c                mov    0x4c(%eax),%ecx
> c10aa032:   8b 58 50                mov    0x50(%eax),%ebx
> c10aa035:   89 4a 38                mov    %ecx,0x38(%edx)
> c10aa038:   89 5a 3c                mov    %ebx,0x3c(%edx)
...
>     stat->blocks = inode->i_blocks;
> c10aa048:   8b 58 60                mov    0x60(%eax),%ebx
> c10aa04b:   8b 48 5c                mov    0x5c(%eax),%ecx
> c10aa04e:   89 5a 48                mov    %ebx,0x48(%edx)
> c10aa051:   89 4a 44                mov    %ecx,0x44(%edx)


I still don't understand why the races in ext4_do_update_inode() won't 
cause problems. Similarly to generic_fillattr(), in 
ext4_do_update_inode() wrong values can end up being written to the 
inode fields. Is there some mechanism to recover/ignore these values 
that I'm missing?


> As for the rest, some are obviously false positives.  For example, if
> you take a look at:
>
> Variable: journal->j_running_transaction	Addresses: c1138ca1 c1136b02 c1136ca9
> c1136ca9 jbd2_get_transaction /linux-3.13.5/fs/jbd2/transaction.c:103
> c1138ca1 jbd2_journal_commit_transaction /linux-3.13.5/fs/jbd2/commit.c:539
> c1136b02 start_this_handle /linux-3.13.5/fs/jbd2/transaction.c:280
>
> It's obvious from the code path that start_this_handle() is extremely
> careful to always revalidate j_running_transaction after either (a)
> taking a read lock on the j_state_lock for the first time, or (b)
> after dropping the read_lock and grabbing a write_lock on
> j_state_lock, and if things have changed, it loops and tries again.
You're right, start_this_handle() revalidates the value read so I also 
don't see harm in this race.


> There are a couple that require some closer examination; for some of
> them, for example the two reported cases of ext4_isize_set() vs
> ext4_isize(), having the stack trace would be really helpful.

Those two pairs of functions were actually also called from the 
ext4_do_update_inode() function:
> 4338     if (ei->i_disksize != ext4_isize(raw_inode)) {
> 4339         ext4_isize_set(raw_inode, ei->i_disksize);
> 4340         need_datasync = 1;
> 4341     }

So the top part of the stack trace is:
   ext4_mark_inode_dirty() -> ext4_mark_iloc_dirty() -> 
ext4_do_update_inode() -> ext4_isize()
   ext4_mark_inode_dirty() -> ext4_mark_iloc_dirty() -> 
ext4_do_update_inode() -> ext4_isize_set()

next prev parent reply	other threads:[~2014-04-10 15:41 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-08 20:58 Data races in ext4 Pedro Fonseca
2014-04-09 21:33 ` Theodore Ts'o
2014-04-10 15:40   ` Pedro Fonseca [this message]
2014-04-10 22:09     ` Theodore Ts'o
2014-04-11 13:08       ` Pedro Fonseca

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5346BB86.50001@mpi-sws.org \
    --to=pfonseca@mpi-sws.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).