[RFC PATCH v3 0/7] btrfs: implement swap file support

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v3 0/7] btrfs: implement swap file support
@ 2014-12-10  1:45 Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 1/7] direct-io: don't dirty ITER_BVEC pages on read Omar Sandoval
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval

Hi, everyone,

This patch series, based on v3.18, implements support for swap files on BTRFS.
Patches 1, 3, and 4 are for the VFS folks, patch 2 is for NFS, and the rest is
all BTRFS.

The standard swap file implementation uses bmap() to get a list of physical
blocks to do I/O on. This doesn't work for BTRFS, which moves disk blocks around
as part of normal operation (COW, defragmentation, etc.).

Swap-over-NFS introduced an interface through which a filesystem can arbitrate
swap I/O through address space operations:

- swap_activate() is called by swapon() and informs the address space that the
  given file is going to be used for swap, so it should take adequate measures
  like reserving space on disk and pinning block lookup information in memory
- swap_deactivate() is used to clean up on swapoff()
- direct_IO() is used to page in and out (this no longer uses readpage as part
  of this patch series)

Patches 1-4 clean up the necessary infrastructure. There's more that can make
this better (like resurrecting kernel AIO), but that can be done as a follow-up
to the work here.

Patches 5 and 6 lay the groundwork needed for using a swap file on BTRFS, and
patch 7 implements the actual aops.

Version 3 incorporates a bunch of David Sterba's feedback, both style and design
issues. We now audit various ioctls to prevent them from interfering with swap
file operation and handle extents which can't be nocow'd.

After some discussion on the mailing list, I decided that for simplicity and
reliability, it's best to simply disallow COW files and files with shared
extents (like files with extents shared with a snapshot). From a user's
perspective, this means that a snapshotted subvolume cannot be used for a swap
file, but keeping the swap file in a separate subvolume that is never
snapshotted seems entirely reasonable to me. An alternative suggestion was to
allow swap files to be snapshotted and to do an implied COW on swap file
activation, which I was ready to implement until I realized that we can't permit
snapshotting a subvolume with an active swap file, so this creates a surprising
inconsistency for users (in my opinion).

As with before, this functionality is tenuously tested in a virtual machine with
some artificial workloads, but it "works for me". I'm pretty happy with the
results on my end, so please comment away.

Thanks!

Omar Sandoval (7):
  direct-io: don't dirty ITER_BVEC pages on read
  nfs: don't dirty ITER_BVEC pages read through direct I/O
  swap: use direct I/O for SWP_FILE swap_readpage
  vfs: update swap_{,de}activate documentation
  btrfs: prevent ioctls from interfering with a swap file
  btrfs: add EXTENT_FLAG_SWAPFILE
  btrfs: enable swap file support

 Documentation/filesystems/Locking |   7 +-
 Documentation/filesystems/vfs.txt |   7 +-
 fs/btrfs/ctree.h                  |   3 +
 fs/btrfs/disk-io.c                |   1 +
 fs/btrfs/extent_io.c              |   1 +
 fs/btrfs/extent_map.h             |   1 +
 fs/btrfs/inode.c                  | 132 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.c                  |  35 ++++++++--
 fs/direct-io.c                    |   8 ++-
 fs/nfs/direct.c                   |   5 +-
 include/trace/events/btrfs.h      |   3 +-
 mm/page_io.c                      |  32 +++++++--
 12 files changed, 216 insertions(+), 19 deletions(-)

-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v3 1/7] direct-io: don't dirty ITER_BVEC pages on read
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
@ 2014-12-10  1:45 ` Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 2/7] nfs: don't dirty ITER_BVEC pages read through direct I/O Omar Sandoval
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval, Ming Lei

Reads through the iov_iter infrastructure for kernel pages shouldn't be
dirtied by the direct I/O code.

This is based on Dave Kleikamp's and Ming Lei's previously posted
patches.

Cc: Ming Lei <ming.lei@canonical.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 fs/direct-io.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..e542ce4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -120,6 +120,7 @@ struct dio {
 	spinlock_t bio_lock;		/* protects BIO fields below */
 	int page_errors;		/* errno from get_user_pages() */
 	int is_async;			/* is IO async ? */
+	int should_dirty;		/* should we mark read pages dirty? */
 	bool defer_completion;		/* defer AIO completion to workqueue? */
 	int io_error;			/* IO error in completion path */
 	unsigned long refcount;		/* direct_io_worker() and bios */
@@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	dio->refcount++;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
-	if (dio->is_async && dio->rw == READ)
+	if (dio->is_async && dio->rw == READ && dio->should_dirty)
 		bio_set_pages_dirty(bio);
 
 	if (sdio->submit_io)
@@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	if (!uptodate)
 		dio->io_error = -EIO;
 
-	if (dio->is_async && dio->rw == READ) {
+	if (dio->is_async && dio->rw == READ && dio->should_dirty) {
 		bio_check_pages_dirty(bio);	/* transfers ownership */
 	} else {
 		bio_for_each_segment_all(bvec, bio, i) {
 			struct page *page = bvec->bv_page;
 
-			if (dio->rw == READ && !PageCompound(page))
+			if (dio->rw == READ && !PageCompound(page) && dio->should_dirty)
 				set_page_dirty_lock(page);
 			page_cache_release(page);
 		}
@@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 
 	dio->inode = inode;
 	dio->rw = rw;
+	dio->should_dirty = !(iter->type & ITER_BVEC);
 
 	/*
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v3 2/7] nfs: don't dirty ITER_BVEC pages read through direct I/O
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 1/7] direct-io: don't dirty ITER_BVEC pages on read Omar Sandoval
@ 2014-12-10  1:45 ` Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 3/7] swap: use direct I/O for SWP_FILE swap_readpage Omar Sandoval
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval

As with the generic blockdev code, kernel pages shouldn't be dirtied by
the direct I/O path.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 fs/nfs/direct.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 10bf072..a67fa2c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -88,6 +88,7 @@ struct nfs_direct_req {
 	struct pnfs_ds_commit_info ds_cinfo;	/* Storage for cinfo */
 	struct work_struct	work;
 	int			flags;
+	int			should_dirty;	/* should we mark read pages dirty? */
 #define NFS_ODIRECT_DO_COMMIT		(1)	/* an unstable reply was received */
 #define NFS_ODIRECT_RESCHED_WRITES	(2)	/* write verification failed */
 	struct nfs_writeverf	verf;		/* unstable write verifier */
@@ -370,7 +371,8 @@ static void nfs_direct_read_completion(struct nfs_pgio_header *hdr)
 		struct nfs_page *req = nfs_list_entry(hdr->pages.next);
 		struct page *page = req->wb_page;
 
-		if (!PageCompound(page) && bytes < hdr->good_bytes)
+		if (!PageCompound(page) && bytes < hdr->good_bytes &&
+		    dreq->should_dirty)
 			set_page_dirty(page);
 		bytes += req->wb_bytes;
 		nfs_list_remove_request(req);
@@ -542,6 +544,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
 	dreq->inode = inode;
 	dreq->bytes_left = count;
 	dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
+	dreq->should_dirty = !(iter->type & ITER_BVEC);
 	l_ctx = nfs_get_lock_context(dreq->ctx);
 	if (IS_ERR(l_ctx)) {
 		result = PTR_ERR(l_ctx);
-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v3 3/7] swap: use direct I/O for SWP_FILE swap_readpage
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 1/7] direct-io: don't dirty ITER_BVEC pages on read Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 2/7] nfs: don't dirty ITER_BVEC pages read through direct I/O Omar Sandoval
@ 2014-12-10  1:45 ` Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 4/7] vfs: update swap_{,de}activate documentation Omar Sandoval
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval, Mel Gorman

On Mon, Nov 17, 2014 at 07:48:17AM -0800, Christoph Hellwig wrote:
> With the new iov_iter infrastructure that supprots direct I/O to kernel
> pages please get rid of the ->readpage hack first.  I'm still utterly
> disapoined that this crap ever got merged.

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 mm/page_io.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..10715e0 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,8 +283,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 		set_page_writeback(page);
 		unlock_page(page);
-		ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
-						&kiocb, &from,
+		ret = mapping->a_ops->direct_IO(WRITE, &kiocb, &from,
 						kiocb.ki_pos);
 		if (ret == PAGE_SIZE) {
 			count_vm_event(PSWPOUT);
@@ -348,12 +347,37 @@ int swap_readpage(struct page *page)
 	}
 
 	if (sis->flags & SWP_FILE) {
+		struct kiocb kiocb;
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
+		struct bio_vec bv = {
+			.bv_page = page,
+			.bv_len = PAGE_SIZE,
+			.bv_offset = 0,
+		};
+		struct iov_iter to = {
+			.type = ITER_BVEC | READ,
+			.count = PAGE_SIZE,
+			.iov_offset = 0,
+			.nr_segs = 1,
+		};
+		to.bvec = &bv;	/* older gcc versions are broken */
+
+		init_sync_kiocb(&kiocb, swap_file);
+		kiocb.ki_pos = page_file_offset(page);
+		kiocb.ki_nbytes = PAGE_SIZE;
 
-		ret = mapping->a_ops->readpage(swap_file, page);
-		if (!ret)
+		ret = mapping->a_ops->direct_IO(READ, &kiocb, &to,
+						kiocb.ki_pos);
+		if (ret == PAGE_SIZE) {
+			SetPageUptodate(page);
 			count_vm_event(PSWPIN);
+			ret = 0;
+		} else {
+			ClearPageUptodate(page);
+			SetPageError(page);
+		}
+		unlock_page(page);
 		return ret;
 	}
 
-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v3 4/7] vfs: update swap_{,de}activate documentation
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
                   ` (2 preceding siblings ...)
  2014-12-10  1:45 ` [RFC PATCH v3 3/7] swap: use direct I/O for SWP_FILE swap_readpage Omar Sandoval
@ 2014-12-10  1:45 ` Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 5/7] btrfs: prevent ioctls from interfering with a swap file Omar Sandoval
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval

Parameters were added to swap_activate in the same patch series that
introduced it without updating the documentation. Additionally, the
documentation claims that non-existent address space operations
swap_{in,out} are used for swap I/O, but it's (now) direct_IO.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 Documentation/filesystems/Locking | 7 ++++---
 Documentation/filesystems/vfs.txt | 7 ++++---
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index b30753c..e72b4c3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -205,7 +205,8 @@ prototypes:
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
-	int (*swap_activate)(struct file *);
+	int (*swap_activate)(struct swap_info_struct *, struct file *,
+			     sector_t *);
 	int (*swap_deactivate)(struct file *);
 
 locking rules:
@@ -230,8 +231,8 @@ migratepage:		yes (both)
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
-swap_activate:		no
-swap_deactivate:	no
+swap_activate:					yes
+swap_deactivate:				no
 
 	->write_begin(), ->write_end(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 20bf204..46ef103 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -600,8 +600,9 @@ struct address_space_operations {
 					unsigned long);
 	void (*is_dirty_writeback) (struct page *, bool *, bool *);
 	int (*error_remove_page) (struct mapping *mapping, struct page *page);
-	int (*swap_activate)(struct file *);
-	int (*swap_deactivate)(struct file *);
+	int (*swap_activate)(struct swap_info_struct *, struct file *,
+			     sector_t *);
+	void (*swap_deactivate)(struct file *);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -788,7 +789,7 @@ struct address_space_operations {
 	memory. A return value of zero indicates success,
 	in which case this file can be used to back swapspace. The
 	swapspace operations will be proxied to this address space's
-	->swap_{out,in} methods.
+	->direct_IO method for both reads and writes.
 
   swap_deactivate: Called during swapoff on files where swap_activate
 	was successful.
-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v3 5/7] btrfs: prevent ioctls from interfering with a swap file
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
                   ` (3 preceding siblings ...)
  2014-12-10  1:45 ` [RFC PATCH v3 4/7] vfs: update swap_{,de}activate documentation Omar Sandoval
@ 2014-12-10  1:45 ` Omar Sandoval
  2014-12-10  1:45 ` [RFC PATCH v3 6/7] btrfs: add EXTENT_FLAG_SWAPFILE Omar Sandoval
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval

There are several ioctls which can work around constraints enforced by
btrfs_swap_activate and lead to an unsafe situation. We cannot do any of
the following on an active swap file in order to avoid creating
compressed or shared extents:

- chattr -C or +c
- snapshot create
- defrag
- clone
- dedup

Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/ioctl.c   | 35 +++++++++++++++++++++++++++++++----
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fe69edd..38979b9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1891,6 +1891,9 @@ struct btrfs_root {
 	int send_in_progress;
 	struct btrfs_subvolume_writers *subv_writers;
 	atomic_t will_be_snapshoted;
+
+	/* Number of active swapfiles */
+	atomic_t nr_swapfiles;
 };
 
 struct btrfs_ioctl_defrag_range_args {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1bf9f89..60094c4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1265,6 +1265,7 @@ static void __setup_root(u32 nodesize, u32 sectorsize, u32 stripesize,
 	atomic_set(&root->orphan_inodes, 0);
 	atomic_set(&root->refs, 1);
 	atomic_set(&root->will_be_snapshoted, 0);
+	atomic_set(&root->nr_swapfiles, 0);
 	root->log_transid = 0;
 	root->log_transid_committed = -1;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4399f0c..18fa95c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -291,9 +291,11 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		} else {
 			ip->flags |= BTRFS_INODE_NODATACOW;
 		}
-	} else {
+	} else if (!IS_SWAPFILE(inode)) {
 		/*
-		 * Revert back under same assuptions as above
+		 * Revert back under same assumptions as above. swap_activate
+		 * checks that we don't swapon a copy-on-write file, but we also
+		 * make sure that it doesn't become copy-on-write here.
 		 */
 		if (S_ISREG(mode)) {
 			if (inode->i_size == 0)
@@ -316,7 +318,12 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
 		if (ret && ret != -ENODATA)
 			goto out_drop;
-	} else if (flags & FS_COMPR_FL) {
+	} else if (flags & FS_COMPR_FL && !IS_SWAPFILE(inode)) {
+		/*
+		 * Like nodatacow, swap_activate checks that we don't swapon a
+		 * compressed file, so we shouldn't let it become compressed.
+		 */
+
 		const char *comp;
 
 		ip->flags |= BTRFS_INODE_COMPRESS;
@@ -330,7 +337,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 				     comp, strlen(comp), 0);
 		if (ret)
 			goto out_drop;
-
 	} else {
 		ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
 		if (ret && ret != -ENODATA)
@@ -647,6 +653,12 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
 		return -EINVAL;
 
+	if (atomic_read(&root->nr_swapfiles)) {
+		btrfs_err(root->fs_info,
+			  "cannot create snapshot with active swapfile");
+		return -ETXTBSY;
+	}
+
 	atomic_inc(&root->will_be_snapshoted);
 	smp_mb__after_atomic();
 	btrfs_wait_nocow_write(root);
@@ -1292,6 +1304,12 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			compress_type = range->compress_type;
 	}
 
+	mutex_lock(&inode->i_mutex);
+	ret = IS_SWAPFILE(inode) ? -ETXTBSY : 0;
+	mutex_unlock(&inode->i_mutex);
+	if (ret)
+		return ret;
+
 	if (extent_thresh == 0)
 		extent_thresh = 256 * 1024;
 
@@ -2927,6 +2945,11 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
 
 	btrfs_double_lock(src, loff, dst, dst_loff, len);
 
+	if (IS_SWAPFILE(src) || IS_SWAPFILE(dst)) {
+		ret = -ETXTBSY;
+		goto out_unlock;
+	}
+
 	ret = extent_same_check_offsets(src, loff, len);
 	if (ret)
 		goto out_unlock;
@@ -3644,6 +3667,10 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 		mutex_lock(&src->i_mutex);
 	}
 
+	ret = -ETXTBSY;
+	if (IS_SWAPFILE(src) || IS_SWAPFILE(inode))
+		goto out_unlock;
+
 	/* determine range to clone */
 	ret = -EINVAL;
 	if (off + len > src->i_size || off + len < off)
-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v3 6/7] btrfs: add EXTENT_FLAG_SWAPFILE
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
                   ` (4 preceding siblings ...)
  2014-12-10  1:45 ` [RFC PATCH v3 5/7] btrfs: prevent ioctls from interfering with a swap file Omar Sandoval
@ 2014-12-10  1:45 ` Omar Sandoval
  2014-12-12 10:32   ` David Sterba
  2014-12-10  1:45 ` [RFC PATCH v3 7/7] btrfs: enable swap file support Omar Sandoval
  2014-12-12 10:32 ` [RFC PATCH v3 0/7] btrfs: implement " David Sterba
  7 siblings, 1 reply; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval

Extents mapping a swap file should remain pinned in memory in order to
avoid doing allocations to look up an extent when we're already low on
memory. Rather than overloading EXTENT_FLAG_PINNED, add a new flag
specifically for this purpose.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 fs/btrfs/extent_io.c         | 1 +
 fs/btrfs/extent_map.h        | 1 +
 fs/btrfs/inode.c             | 1 +
 include/trace/events/btrfs.h | 3 ++-
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bf3f424..36166d0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4244,6 +4244,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
 				break;
 			}
 			if (test_bit(EXTENT_FLAG_PINNED, &em->flags) ||
+			    test_bit(EXTENT_FLAG_SWAPFILE, &em->flags) ||
 			    em->start != start) {
 				write_unlock(&map->lock);
 				free_extent_map(em);
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index b2991fd..93b9548 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -16,6 +16,7 @@
 #define EXTENT_FLAG_LOGGING 4 /* Logging this extent */
 #define EXTENT_FLAG_FILLING 5 /* Filling in a preallocated extent */
 #define EXTENT_FLAG_FS_MAPPING 6 /* filesystem extent mapping type */
+#define EXTENT_FLAG_SWAPFILE 7 /* this extent maps a swap file */
 
 struct extent_map {
 	struct rb_node rb_node;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..7c2dfb2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6353,6 +6353,7 @@ again:
 		else
 			goto out;
 	}
+	WARN_ON_ONCE(IS_SWAPFILE(inode));
 	em = alloc_extent_map();
 	if (!em) {
 		err = -ENOMEM;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 1faecea..5c5f9de 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -164,7 +164,8 @@ DEFINE_EVENT(btrfs__inode, btrfs_inode_evict,
 		{ (1 << EXTENT_FLAG_PREALLOC), 		"PREALLOC" 	},\
 		{ (1 << EXTENT_FLAG_LOGGING),	 	"LOGGING" 	},\
 		{ (1 << EXTENT_FLAG_FILLING),	 	"FILLING" 	},\
-		{ (1 << EXTENT_FLAG_FS_MAPPING),	"FS_MAPPING"	})
+		{ (1 << EXTENT_FLAG_FS_MAPPING),	"FS_MAPPING"	},\
+		{ (1 << EXTENT_FLAG_SWAPFILE),		"SWAPFILE"	})
 
 TRACE_EVENT_CONDITION(btrfs_get_extent,
 
-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v3 6/7] btrfs: add EXTENT_FLAG_SWAPFILE
  2014-12-10  1:45 ` [RFC PATCH v3 6/7] btrfs: add EXTENT_FLAG_SWAPFILE Omar Sandoval
@ 2014-12-12 10:32   ` David Sterba
  0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2014-12-12 10:32 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel

On Tue, Dec 09, 2014 at 05:45:47PM -0800, Omar Sandoval wrote:
> Extents mapping a swap file should remain pinned in memory in order to
> avoid doing allocations to look up an extent when we're already low on
> memory. Rather than overloading EXTENT_FLAG_PINNED, add a new flag
> specifically for this purpose.
> 
> Signed-off-by: Omar Sandoval <osandov@osandov.com>

Reviewed-by: David Sterba <dsterba@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v3 7/7] btrfs: enable swap file support
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
                   ` (5 preceding siblings ...)
  2014-12-10  1:45 ` [RFC PATCH v3 6/7] btrfs: add EXTENT_FLAG_SWAPFILE Omar Sandoval
@ 2014-12-10  1:45 ` Omar Sandoval
  2014-12-12 10:51   ` David Sterba
  2014-12-12 10:32 ` [RFC PATCH v3 0/7] btrfs: implement " David Sterba
  7 siblings, 1 reply; 13+ messages in thread
From: Omar Sandoval @ 2014-12-10  1:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel
  Cc: Omar Sandoval

Implement the swap file a_ops on btrfs. Activation does two things:

1. Checks for a usable swap file: it must be fully allocated (no holes),
   support direct I/O (so no compressed or inline extents) and must be
   eligible for nocow in its entirety in order to avoid doing a bunch of
   allocations for a COW when we're already low on memory
2. Pins the extent maps in memory with EXTENT_FLAG_SWAPFILE

Deactivation unpins all of the extent maps.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 fs/btrfs/inode.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7c2dfb2..76b58d7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7191,6 +7191,7 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	 * this will cow the extent, reset the len in case we changed
 	 * it above
 	 */
+	WARN_ON_ONCE(IS_SWAPFILE(inode));
 	len = bh_result->b_size;
 	free_extent_map(em);
 	em = btrfs_new_extent_direct(inode, start, len);
@@ -9443,6 +9444,134 @@ out_inode:
 
 }
 
+static void __clear_swapfile_extents(struct inode *inode)
+{
+	u64 isize = inode->i_size;
+	struct extent_map *em;
+	u64 start, len;
+
+	start = 0;
+	while (start < isize) {
+		len = isize - start;
+		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
+		if (IS_ERR(em))
+			return;
+
+		clear_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
+
+		start = extent_map_end(em);
+		free_extent_map(em);
+	}
+}
+
+static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+			       sector_t *span)
+{
+	struct inode *inode = file_inode(file);
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	int ret = 0;
+	u64 isize = inode->i_size;
+	struct extent_state *cached_state = NULL;
+	struct extent_map *em;
+	u64 start, len;
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
+		/* Can't do direct I/O on a compressed file. */
+		btrfs_err(fs_info, "swapfile is compressed");
+		return -EINVAL;
+	}
+	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
+		/*
+		 * Going through the copy-on-write path while swapping pages
+		 * in/out and doing a bunch of allocations could stress the
+		 * memory management code that got us there in the first place,
+		 * and that's sure to be a bad time.
+		 */
+		btrfs_err(fs_info, "swapfile is copy-on-write");
+		return -EINVAL;
+	}
+
+	lock_extent_bits(io_tree, 0, isize - 1, 0, &cached_state);
+
+	/*
+	 * All of the extents must be allocated and support direct I/O. Inline
+	 * extents and compressed extents fall back to buffered I/O, so those
+	 * are no good. Additionally, all of the extents must be safe for nocow.
+	 */
+	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
+	start = 0;
+	while (start < isize) {
+		len = isize - start;
+		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+			goto out;
+		}
+
+		if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
+		    em->block_start == EXTENT_MAP_HOLE) {
+			btrfs_err(fs_info, "swapfile has holes");
+			ret = -EINVAL;
+			goto out;
+		}
+		if (em->block_start == EXTENT_MAP_INLINE) {
+			/*
+			 * It's unlikely we'll ever actually find ourselves
+			 * here, as a file small enough to fit inline won't be
+			 * big enough to store more than the swap header, but in
+			 * case something changes in the future, let's catch it
+			 * here rather than later.
+			 */
+			btrfs_err(fs_info, "swapfile is inline");
+			ret = -EINVAL;
+			goto out;
+		}
+		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+			btrfs_err(fs_info, "swapfile is compresed");
+			ret = -EINVAL;
+			goto out;
+		}
+		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
+		if (ret < 0) {
+			goto out;
+		} else if (ret == 1) {
+			ret = 0;
+		} else {
+			btrfs_err(fs_info, "swapfile has extent requiring COW (%llu-%llu)",
+				  start, start + len - 1);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		set_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
+
+		start = extent_map_end(em);
+		free_extent_map(em);
+	}
+
+out:
+	if (ret) {
+		__clear_swapfile_extents(inode);
+		atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
+	}
+	unlock_extent_cached(io_tree, 0, isize - 1, &cached_state, GFP_NOFS);
+	return ret;
+}
+
+static void btrfs_swap_deactivate(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_state *cached_state = NULL;
+	u64 isize = inode->i_size;
+
+	lock_extent_bits(io_tree, 0, isize - 1, 0, &cached_state);
+	__clear_swapfile_extents(inode);
+	unlock_extent_cached(io_tree, 0, isize - 1, &cached_state, GFP_NOFS);
+	atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
+}
+
 static const struct inode_operations btrfs_dir_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.lookup		= btrfs_lookup,
@@ -9520,6 +9649,8 @@ static const struct address_space_operations btrfs_aops = {
 	.releasepage	= btrfs_releasepage,
 	.set_page_dirty	= btrfs_set_page_dirty,
 	.error_remove_page = generic_error_remove_page,
+	.swap_activate	= btrfs_swap_activate,
+	.swap_deactivate = btrfs_swap_deactivate,
 };
 
 static const struct address_space_operations btrfs_symlink_aops = {
-- 
2.1.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v3 7/7] btrfs: enable swap file support
  2014-12-10  1:45 ` [RFC PATCH v3 7/7] btrfs: enable swap file support Omar Sandoval
@ 2014-12-12 10:51   ` David Sterba
  2014-12-12 20:00     ` Omar Sandoval
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2014-12-12 10:51 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel

On Tue, Dec 09, 2014 at 05:45:48PM -0800, Omar Sandoval wrote:
> +static void __clear_swapfile_extents(struct inode *inode)
> +{
> +	u64 isize = inode->i_size;
> +	struct extent_map *em;
> +	u64 start, len;
> +
> +	start = 0;
> +	while (start < isize) {
> +		len = isize - start;
> +		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> +		if (IS_ERR(em))
> +			return;

This could transiently fail if there's no memory to allocate the em, and
would leak the following extents.

> +
> +		clear_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> +
> +		start = extent_map_end(em);
> +		free_extent_map(em);
> +	}
> +}
> +
> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> +			       sector_t *span)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> +	int ret = 0;
> +	u64 isize = inode->i_size;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_map *em;
> +	u64 start, len;
> +
> +	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> +		/* Can't do direct I/O on a compressed file. */
> +		btrfs_err(fs_info, "swapfile is compressed");
> +		return -EINVAL;
> +	}
> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> +		/*
> +		 * Going through the copy-on-write path while swapping pages
> +		 * in/out and doing a bunch of allocations could stress the
> +		 * memory management code that got us there in the first place,
> +		 * and that's sure to be a bad time.
> +		 */
> +		btrfs_err(fs_info, "swapfile is copy-on-write");
> +		return -EINVAL;
> +	}
> +
> +	lock_extent_bits(io_tree, 0, isize - 1, 0, &cached_state);
> +
> +	/*
> +	 * All of the extents must be allocated and support direct I/O. Inline
> +	 * extents and compressed extents fall back to buffered I/O, so those
> +	 * are no good. Additionally, all of the extents must be safe for nocow.
> +	 */
> +	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> +	start = 0;
> +	while (start < isize) {
> +		len = isize - start;
> +		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> +		if (IS_ERR(em)) {

		IS_ERR_OR_NULL(em)

>From now on the em is valid and has to be free_extent_map()ed ...

> +			ret = PTR_ERR(em);
> +			goto out;
> +		}
> +
> +		if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
> +		    em->block_start == EXTENT_MAP_HOLE) {
> +			btrfs_err(fs_info, "swapfile has holes");
> +			ret = -EINVAL;

... and all the error branches would miss it.

> +			goto out;
> +		}
> +		if (em->block_start == EXTENT_MAP_INLINE) {
> +			/*
> +			 * It's unlikely we'll ever actually find ourselves
> +			 * here, as a file small enough to fit inline won't be
> +			 * big enough to store more than the swap header, but in
> +			 * case something changes in the future, let's catch it
> +			 * here rather than later.
> +			 */
> +			btrfs_err(fs_info, "swapfile is inline");
> +			ret = -EINVAL;

here

> +			goto out;
> +		}
> +		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> +			btrfs_err(fs_info, "swapfile is compresed");
> +			ret = -EINVAL;

here

> +			goto out;
> +		}
> +		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
> +		if (ret < 0) {

here

> +			goto out;
> +		} else if (ret == 1) {
> +			ret = 0;
> +		} else {
> +			btrfs_err(fs_info, "swapfile has extent requiring COW (%llu-%llu)",
> +				  start, start + len - 1);
> +			ret = -EINVAL;

here

> +			goto out;
> +		}
> +
> +		set_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> +
> +		start = extent_map_end(em);
> +		free_extent_map(em);
> +	}
> +
> +out:
> +	if (ret) {

should be fixed by:

		if (!IS_ERR_OR_NULL(em))
			free_extent_map(em);

> +		__clear_swapfile_extents(inode);
> +		atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
> +	}
> +	unlock_extent_cached(io_tree, 0, isize - 1, &cached_state, GFP_NOFS);
> +	return ret;
> +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v3 7/7] btrfs: enable swap file support
  2014-12-12 10:51   ` David Sterba
@ 2014-12-12 20:00     ` Omar Sandoval
  0 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-12 20:00 UTC (permalink / raw)
  To: David Sterba
  Cc: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, linux-btrfs, linux-fsdevel,
	linux-mm, linux-nfs, linux-kernel

On Fri, Dec 12, 2014 at 11:51:22AM +0100, David Sterba wrote:
> On Tue, Dec 09, 2014 at 05:45:48PM -0800, Omar Sandoval wrote:
> > +static void __clear_swapfile_extents(struct inode *inode)
> > +{
> > +	u64 isize = inode->i_size;
> > +	struct extent_map *em;
> > +	u64 start, len;
> > +
> > +	start = 0;
> > +	while (start < isize) {
> > +		len = isize - start;
> > +		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> > +		if (IS_ERR(em))
> > +			return;
> 
> This could transiently fail if there's no memory to allocate the em, and
> would leak the following extents.
> 
This leak I was aware of, and at the time I didn't see a good way to get
around it. After all, if we can't get the current extent, there's no way
to iterate through the rest of them. Now I see that instead of doing
this at the btrfs_get_extent level, I can just go through all of the
extent_maps in the extent_map_tree.

> > +
> > +		clear_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> > +
> > +		start = extent_map_end(em);
> > +		free_extent_map(em);
> > +	}
> > +}
> > +
> > +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> > +			       sector_t *span)
> > +{
> > +	struct inode *inode = file_inode(file);
> > +	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	int ret = 0;
> > +	u64 isize = inode->i_size;
> > +	struct extent_state *cached_state = NULL;
> > +	struct extent_map *em;
> > +	u64 start, len;
> > +
> > +	if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> > +		/* Can't do direct I/O on a compressed file. */
> > +		btrfs_err(fs_info, "swapfile is compressed");
> > +		return -EINVAL;
> > +	}
> > +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> > +		/*
> > +		 * Going through the copy-on-write path while swapping pages
> > +		 * in/out and doing a bunch of allocations could stress the
> > +		 * memory management code that got us there in the first place,
> > +		 * and that's sure to be a bad time.
> > +		 */
> > +		btrfs_err(fs_info, "swapfile is copy-on-write");
> > +		return -EINVAL;
> > +	}
> > +
> > +	lock_extent_bits(io_tree, 0, isize - 1, 0, &cached_state);
> > +
> > +	/*
> > +	 * All of the extents must be allocated and support direct I/O. Inline
> > +	 * extents and compressed extents fall back to buffered I/O, so those
> > +	 * are no good. Additionally, all of the extents must be safe for nocow.
> > +	 */
> > +	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> > +	start = 0;
> > +	while (start < isize) {
> > +		len = isize - start;
> > +		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> > +		if (IS_ERR(em)) {
> 
> 		IS_ERR_OR_NULL(em)
> 
> From now on the em is valid and has to be free_extent_map()ed ...
> 
> > +			ret = PTR_ERR(em);
> > +			goto out;
> > +		}
> > +
> > +		if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
> > +		    em->block_start == EXTENT_MAP_HOLE) {
> > +			btrfs_err(fs_info, "swapfile has holes");
> > +			ret = -EINVAL;
> 
> ... and all the error branches would miss it.
> 
> > +			goto out;
> > +		}
> > +		if (em->block_start == EXTENT_MAP_INLINE) {
> > +			/*
> > +			 * It's unlikely we'll ever actually find ourselves
> > +			 * here, as a file small enough to fit inline won't be
> > +			 * big enough to store more than the swap header, but in
> > +			 * case something changes in the future, let's catch it
> > +			 * here rather than later.
> > +			 */
> > +			btrfs_err(fs_info, "swapfile is inline");
> > +			ret = -EINVAL;
> 
> here
> 
> > +			goto out;
> > +		}
> > +		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > +			btrfs_err(fs_info, "swapfile is compresed");
> > +			ret = -EINVAL;
> 
> here
> 
> > +			goto out;
> > +		}
> > +		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
> > +		if (ret < 0) {
> 
> here
> 
> > +			goto out;
> > +		} else if (ret == 1) {
> > +			ret = 0;
> > +		} else {
> > +			btrfs_err(fs_info, "swapfile has extent requiring COW (%llu-%llu)",
> > +				  start, start + len - 1);
> > +			ret = -EINVAL;
> 
> here
> 
> > +			goto out;
> > +		}
> > +
> > +		set_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> > +
> > +		start = extent_map_end(em);
> > +		free_extent_map(em);
> > +	}
> > +
> > +out:
> > +	if (ret) {
> 
> should be fixed by:
> 
> 		if (!IS_ERR_OR_NULL(em))
> 			free_extent_map(em);
> 
> > +		__clear_swapfile_extents(inode);
> > +		atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
> > +	}
> > +	unlock_extent_cached(io_tree, 0, isize - 1, &cached_state, GFP_NOFS);
> > +	return ret;
> > +}
This leak I completely missed. Thanks.

-- 
Omar

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v3 0/7] btrfs: implement swap file support
  2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
                   ` (6 preceding siblings ...)
  2014-12-10  1:45 ` [RFC PATCH v3 7/7] btrfs: enable swap file support Omar Sandoval
@ 2014-12-12 10:32 ` David Sterba
  2014-12-12 20:15   ` Omar Sandoval
  7 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2014-12-12 10:32 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, David Sterba, linux-btrfs,
	linux-fsdevel, linux-mm, linux-nfs, linux-kernel

On Tue, Dec 09, 2014 at 05:45:41PM -0800, Omar Sandoval wrote:
> After some discussion on the mailing list, I decided that for simplicity and
> reliability, it's best to simply disallow COW files and files with shared
> extents (like files with extents shared with a snapshot). From a user's
> perspective, this means that a snapshotted subvolume cannot be used for a swap
> file, but keeping the swap file in a separate subvolume that is never
> snapshotted seems entirely reasonable to me.

Well, there are enough special cases how to do things on btrfs and I'd
like to avoid introducing another one.

> An alternative suggestion was to
> allow swap files to be snapshotted and to do an implied COW on swap file
> activation, which I was ready to implement until I realized that we can't permit
> snapshotting a subvolume with an active swap file, so this creates a surprising
> inconsistency for users (in my opinion).

I still don't see why it's not possible to do the snapshot with an
active swapfile.

> As with before, this functionality is tenuously tested in a virtual machine with
> some artificial workloads, but it "works for me". I'm pretty happy with the
> results on my end, so please comment away.

The non-btrfs changes can go independently and do not have to wait until
we resolve the swap vs snapshot problem.

I did a simple test and it crashed instantly, lockep complains:

memory: 2G
swap file: 1G
kernel: 3.17 + v3

[  739.790731] Adding 1054716k swap on /mnt/test-swap/mnt/swapfile.  Priority:-1 extents:1 across:1054716k
[  751.848607]
[  751.851852] =====================================
[  751.852161] [ BUG: bad unlock balance detected! ]
[  751.852161] 3.17.0-default+ #199 Not tainted
[  751.852161] -------------------------------------
[  751.852161] heavy_swap/4119 is trying to release lock (&sb->s_type->i_mutex_key) at:
[  751.852161] [<ffffffff81a4f0ce>] mutex_unlock+0xe/0x10
[  751.852161] but there are no more locks to release!
[  751.852161]
[  751.852161] other info that might help us debug this:
[  751.852161] 1 lock held by heavy_swap/4119:
[  751.852161]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81043ba9>] __do_page_fault+0x149/0x560
[  751.852161]
[  751.852161] stack backtrace:
[  751.852161] CPU: 1 PID: 4119 Comm: heavy_swap Not tainted 3.17.0-default+ #199
[  751.852161] Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
[  751.852161]  ffffffff81a4f0ce ffff880075dbb3d8 ffffffff81a4b268 0000000000000001
[  751.852161]  ffff8800775e0000 ffff880075dbb408 ffffffff810b51a9 0000000000000000
[  751.852161]  ffff8800775e0000 00000000ffffffff ffff8800763c9d00 ffff880075dbb4a8
[  751.852161] Call Trace:
[  751.852161]  [<ffffffff81a4f0ce>] ? mutex_unlock+0xe/0x10
[  751.852161]  [<ffffffff81a4b268>] dump_stack+0x51/0x71
[  751.852161]  [<ffffffff810b51a9>] print_unlock_imbalance_bug+0xf9/0x100
[  751.852161]  [<ffffffff810b8bcf>] lock_release_non_nested+0x2cf/0x3e0
[  751.852161]  [<ffffffff81a541ed>] ? ftrace_call+0x5/0x2f
[  751.852161]  [<ffffffff81a4f0ce>] ? mutex_unlock+0xe/0x10
[  751.852161]  [<ffffffff81a4f0ce>] ? mutex_unlock+0xe/0x10
[  751.852161]  [<ffffffff810b8da9>] lock_release+0xc9/0x240
[  751.852161]  [<ffffffff81a4eef0>] __mutex_unlock_slowpath+0x80/0x190
[  751.852161]  [<ffffffff81a4f0c9>] ? mutex_unlock+0x9/0x10
[  751.852161]  [<ffffffff81a4f0ce>] mutex_unlock+0xe/0x10
[  751.852161]  [<ffffffffa0037c88>] btrfs_direct_IO+0x2b8/0x310 [btrfs]
[  751.852161]  [<ffffffff810acf4d>] ? __wake_up_bit+0xd/0x50
[  751.852161]  [<ffffffff8117e03b>] __swap_writepage+0x10b/0x270
[  751.852161]  [<ffffffff81180723>] ? page_swapcount+0x53/0x70
[  751.852161]  [<ffffffff8117e1d7>] swap_writepage+0x37/0x60
[  751.852161]  [<ffffffff8115a072>] shmem_writepage+0x2a2/0x2e0
[  751.852161]  [<ffffffff811554ae>] shrink_page_list+0x44e/0x9d0
[  751.852161]  [<ffffffff81a51b50>] ? _raw_spin_unlock_irq+0x30/0x40
[  751.852161]  [<ffffffff811560cd>] shrink_inactive_list+0x26d/0x4f0
[  751.852161]  [<ffffffff813adcc9>] ? blk_start_plug+0x9/0x50
[  751.852161]  [<ffffffff81156918>] shrink_lruvec+0x5c8/0x6c0
[  751.852161]  [<ffffffff81165f09>] ? compaction_suitable+0x19/0xc0
[  751.852161]  [<ffffffff81165f09>] ? compaction_suitable+0x19/0xc0
[  751.852161]  [<ffffffff81156a5d>] shrink_zone+0x4d/0x120
[  751.852161]  [<ffffffff811577ea>] do_try_to_free_pages+0x19a/0x3a0
[  751.852161]  [<ffffffff81152a7d>] ? pfmemalloc_watermark_ok+0xd/0xc0
[  751.852161]  [<ffffffff81157b42>] try_to_free_pages+0xb2/0x160
[  751.852161]  [<ffffffff81a4bb79>] ? _cond_resched+0x9/0x30
[  751.852161]  [<ffffffff8114adfb>] __alloc_pages_nodemask+0x5eb/0xa90
[  751.852161]  [<ffffffff81a541ed>] ? ftrace_call+0x5/0x2f
[  751.852161]  [<ffffffff811784f1>] ? anon_vma_prepare+0x21/0x190
[  751.852161]  [<ffffffff811916a8>] do_huge_pmd_anonymous_page+0xe8/0x330
[  751.852161]  [<ffffffff811783c9>] ? is_vma_temporary_stack+0x9/0x30
[  751.852161]  [<ffffffff8116edd5>] handle_mm_fault+0x135/0xb60
[  751.852161]  [<ffffffff81172015>] ? find_vma+0x15/0x80
[  751.852161]  [<ffffffff81166c6d>] ? vmacache_find+0xd/0xd0
[  751.852161]  [<ffffffff81097c2e>] ? __might_sleep+0xe/0x110
[  751.852161]  [<ffffffff81043c0d>] __do_page_fault+0x1ad/0x560
[  751.852161]  [<ffffffff81073000>] ? do_fork+0xe0/0x420
[  751.852161]  [<ffffffff81a53f43>] ? error_sti+0x5/0x6
[  751.852161]  [<ffffffff813e660d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  751.852161]  [<ffffffff810440cc>] do_page_fault+0xc/0x10
[  751.852161]  [<ffffffff81a53d42>] page_fault+0x22/0x30

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v3 0/7] btrfs: implement swap file support
  2014-12-12 10:32 ` [RFC PATCH v3 0/7] btrfs: implement " David Sterba
@ 2014-12-12 20:15   ` Omar Sandoval
  0 siblings, 0 replies; 13+ messages in thread
From: Omar Sandoval @ 2014-12-12 20:15 UTC (permalink / raw)
  To: David Sterba
  Cc: Alexander Viro, Andrew Morton, Chris Mason, Josef Bacik,
	Trond Myklebust, Christoph Hellwig, linux-btrfs, linux-fsdevel,
	linux-mm, linux-nfs, linux-kernel

On Fri, Dec 12, 2014 at 11:32:13AM +0100, David Sterba wrote:
> On Tue, Dec 09, 2014 at 05:45:41PM -0800, Omar Sandoval wrote:
> > After some discussion on the mailing list, I decided that for simplicity and
> > reliability, it's best to simply disallow COW files and files with shared
> > extents (like files with extents shared with a snapshot). From a user's
> > perspective, this means that a snapshotted subvolume cannot be used for a swap
> > file, but keeping the swap file in a separate subvolume that is never
> > snapshotted seems entirely reasonable to me.
> 
> Well, there are enough special cases how to do things on btrfs and I'd
> like to avoid introducing another one.
> 
> > An alternative suggestion was to
> > allow swap files to be snapshotted and to do an implied COW on swap file
> > activation, which I was ready to implement until I realized that we can't permit
> > snapshotting a subvolume with an active swap file, so this creates a surprising
> > inconsistency for users (in my opinion).
> 
> I still don't see why it's not possible to do the snapshot with an
> active swapfile.
> 
Creating a snapshot of an active swapfile would create shared extents,
so the next time we have to swap out a page, we'd have to do a COW,
which we're already trying pretty hard to avoid. We could allow it, but
it might lead to some unreliable behavior and unhappy emails to the
mailing list. However, I do see your point about wanting to avoid
special cases, so I'd like to get some more input from others on this as
well.

> > As with before, this functionality is tenuously tested in a virtual machine with
> > some artificial workloads, but it "works for me". I'm pretty happy with the
> > results on my end, so please comment away.
> 
> The non-btrfs changes can go independently and do not have to wait until
> we resolve the swap vs snapshot problem.
> 
> I did a simple test and it crashed instantly, lockep complains:
> 
> memory: 2G
> swap file: 1G
> kernel: 3.17 + v3
> 
[snip]

That's my fault for not running with lockdep enabled. The problem here
is that swap-over-NFS is the only caller of nfs_direct_IO, so
nfs_direct_IO doesn't observe the normal direct_IO locking conventions
and neither does swap_writepage. I'll have to shuffle around some code
on the NFS side to fix that.

It looks like the non-btrfs parts of this might get a bit bigger, so
I'll look into getting that in separately.

Thanks!
-- 
Omar

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-12-12 20:15 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-10  1:45 [RFC PATCH v3 0/7] btrfs: implement swap file support Omar Sandoval
2014-12-10  1:45 ` [RFC PATCH v3 1/7] direct-io: don't dirty ITER_BVEC pages on read Omar Sandoval
2014-12-10  1:45 ` [RFC PATCH v3 2/7] nfs: don't dirty ITER_BVEC pages read through direct I/O Omar Sandoval
2014-12-10  1:45 ` [RFC PATCH v3 3/7] swap: use direct I/O for SWP_FILE swap_readpage Omar Sandoval
2014-12-10  1:45 ` [RFC PATCH v3 4/7] vfs: update swap_{,de}activate documentation Omar Sandoval
2014-12-10  1:45 ` [RFC PATCH v3 5/7] btrfs: prevent ioctls from interfering with a swap file Omar Sandoval
2014-12-10  1:45 ` [RFC PATCH v3 6/7] btrfs: add EXTENT_FLAG_SWAPFILE Omar Sandoval
2014-12-12 10:32   ` David Sterba
2014-12-10  1:45 ` [RFC PATCH v3 7/7] btrfs: enable swap file support Omar Sandoval
2014-12-12 10:51   ` David Sterba
2014-12-12 20:00     ` Omar Sandoval
2014-12-12 10:32 ` [RFC PATCH v3 0/7] btrfs: implement " David Sterba
2014-12-12 20:15   ` Omar Sandoval

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).