From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"willy@linux.intel.com" <willy@linux.intel.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"ross.zwisler@linux.intel.com" <ross.zwisler@linux.intel.com>,
	"jack@suse.com" <jack@suse.com>
Subject: Re: [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
Date: Fri, 20 Nov 2015 10:17:04 +1100	[thread overview]
Message-ID: <20151119231704.GG19199@dastard> (raw)
In-Reply-To: <CAPcyv4iGYPGz6n6YxdEWFJ4c0BZRXrWpa2188K2z9zz4_ez0TA@mail.gmail.com>

On Thu, Nov 19, 2015 at 08:55:48AM -0800, Dan Williams wrote:
> On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote:
> >> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown
> >> From: Dan Williams <dan.j.williams@intel.com>
> >>
> >> Currently dax mappings leak past / survive block_device shutdown.  While
> >> page cache pages are permitted to be read/written after the block_device
> >> is torn down this is not acceptable in the dax case as all media access
> >> must end when the device is disabled.  The pfn backing a dax mapping is
> >> permitted to be invalidated after bdev shutdown and this is indeed the
> >> case with brd.
> >>
> >> When a dax capable block_device driver calls del_gendisk() in its
> >> shutdown path del_gendisk() needs to ensure that all DAX pfns are
> >> unmapped.  This is different than the pagecache backed case where the
> >> disk is protected by the queue being torn down which ends I/O to the
> >> device.  Since dax bypasses the page cache we need to unconditionally
> >> unmap the inode.
> > ...
> >> +static void unmap_list(struct list_head *head)
> >> +{
> >> +     struct inode *inode, *_i;
> >> +
> >> +     list_for_each_entry_safe(inode, _i, head, i_lru) {
> >> +             list_del_init(&inode->i_lru);
> >> +             unmap_mapping_range(&inode->i_data, 0, 0, 1);
> >> +             iput(inode);
> >> +             cond_resched();
> >> +     }
> >> +}
> >> +
> >>  /**
> >>   * evict_inodes      - evict all evictable inodes for a superblock
> >>   * @sb:              superblock to operate on
> >> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >>       int busy = 0;
> >>       struct inode *inode, *next;
> >>       LIST_HEAD(dispose);
> >> +     LIST_HEAD(unmap);
> >>
> >>       spin_lock(&sb->s_inode_list_lock);
> >>       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> >> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >>                       busy = 1;
> >>                       continue;
> >>               }
> >> +             if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
> >> +                     /*
> >> +                      * dax mappings can't live past this invalidation event
> >> +                      * as there is no page cache present to allow the data
> >> +                      * to remain accessible.
> >> +                      */
> >> +                     __iget(inode);
> >> +                     inode_lru_list_del(inode);
> >> +                     spin_unlock(&inode->i_lock);
> >> +                     list_add(&inode->i_lru, &unmap);
> >> +                     busy = 1;
> >> +                     continue;
> >> +             }
> >
> > I'm somewhat concerned about the abuse of i_lru list here. The inodes have
> > i_count > 0 and so the code generally assumes such inodes should be removed
> > from LRU list if they are there. Now I didn't find a place where this could
> > really hit you, but it is fragile. And when that does happen, your
> > unmap list gets corrupted (since you access it without any locks) and
> > the inode is never unmapped.
> 
> "i_lru" list abuse was my main concern with this patch, I'll look into
> a different way.

Yeah, you can't use i_lru at all for purposes like this - even if
there are active references to the inode, the shrinker could be
walking the LRU list and accessing this inode (e.g.
inode_lru_isolate()) at the same time this code is removing it from
the LRU. Then things will go real bad....

> > Also are you sure that your unmapping cannot race with other process
> > mapping the pfns again?
> 
> You're right, there is indeed a gap between the unmap and when
> get_blocks() will start returning errors in the fault path.

get_blocks() won't start returning errors until the filesystem has
had an IO error. Given they cache metadata and do async
transactions, it could be some time before the filesystem notices
that the device has gone away in a fatal way.

> I think I
> need to defer the unmapping until after blk_cleanup_queue() where we
> know that no new I/O to the device is possible.

Actually, I think we need to trigger a filesystem shutdown before
doing anything else (e.g. before unmapping the inodes). That way the
filesystem will error out new calls to get_blocks() and prevent any
new IO submission while the block device teardown and inode
unmapping is done. i.e. solving the problem at the block device
level is hard, but we already have all the necessary infrastructure
to shut off IO and new block mappings at the filesystem level....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 20+ messages
2015-11-17 20:15 [PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking Dan Williams
2015-11-17 20:15 ` [PATCH 1/8] ext2, ext4: warn when mounting with dax enabled Dan Williams
2015-11-17 20:16 ` [PATCH 2/8] dax: disable pmd mappings Dan Williams
2015-11-17 20:51   ` Ross Zwisler
2015-11-17 20:16 ` [PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault) Dan Williams
2015-11-17 20:16 ` [PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown Dan Williams
2015-11-18 15:09   ` Jan Kara
2015-11-19  0:22     ` Williams, Dan J
2015-11-19 12:55       ` Jan Kara
2015-11-19 16:55         ` Dan Williams
2015-11-19 17:12           ` Jan Kara
2015-11-19 23:17           ` Dave Chinner [this message]
2015-11-20  0:05             ` Williams, Dan J
2015-11-20  4:06               ` Dave Chinner
2015-11-20  4:25                 ` Dan Williams
2015-11-20 17:08                   ` Dan Williams
2015-11-17 20:16 ` [PATCH 5/8] pmem, dax: clean up clear_pmem() Dan Williams
2015-11-17 20:16 ` [PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations Dan Williams
2015-11-17 20:16 ` [PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access() Dan Williams
2015-11-17 20:16 ` [PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic() Dan Williams
