All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Nick Piggin <npiggin@suse.de>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	rob@landley.net, Jens Axboe <axboe@kernel.dk>
Subject: Re: [patch] rewrite rd
Date: Mon, 3 Dec 2007 22:29:03 -0800	[thread overview]
Message-ID: <20071203222903.c820707b.akpm@linux-foundation.org> (raw)
In-Reply-To: <20071204042628.GA26636@wotan.suse.de>

On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin <npiggin@suse.de> wrote:

> Hi,
> 
> This is my proposal for a (hopefully) backwards compatible rd driver.
> The motivation for me is not pressing, except that I have this code
> sitting here that is either going to rot or get merged. I'm happy to
> make myself maintainer of this code, but if any real block device
> driver writer would like to, that would be fine by me ;)
> 
> Comments?
> 
> --
> 
> This is a rewrite of the ramdisk block device driver.
> 
> The old one is really difficult because it effectively implements a block
> device which serves data out of its own buffer cache. It relies on the dirty
> bit being set, to pin its backing store in cache, however there are non
> trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
> which had recently lead to data corruption. And in general it is completely
> wrong for a block device driver to do this.
> 
> The new one is more like a regular block device driver. It has no idea
> about vm/vfs stuff. It's backing store is similar to the buffer cache
> (a simple radix-tree of pages), but it doesn't know anything about page
> cache (the pages in the radix tree are not pagecache pages).
> 
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

what about mmap(/dev/ram0)?

> The fact that it now goes through all the regular vm/fs paths makes it
> much more useful for testing, too.
> 
>    text    data     bss     dec     hex filename
>    2837     849     384    4070     fe6 drivers/block/rd.o
>    3528     371      12    3911     f47 drivers/block/brd.o
> 
> Text is larger, but data and bss are smaller, making total size smaller.
> 
> A few other nice things about it:
> - Similar structure and layout to the new loop device handlinag.
> - Dynamic ramdisk creation.
> - Runtime flexible buffer head size (because it is no longer part of the
>   ramdisk code).
> - Boot / load time flexible ramdisk size, which could easily be extended
>   to a per-ramdisk runtime changeable size (eg. with an ioctl).

This ramdisk driver can use highmem whereas the existing one can't (yes?). 
That's a pretty major difference.  Plus look at the revoltingness in rd.c's
mapping_set_gfp_mask()s.

> +++ linux-2.6/drivers/block/brd.c
> @@ -0,0 +1,536 @@
> +/*
> + * Ram backed block device driver.
> + *
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + *
> + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
> + * of their respective owners.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +#include <linux/major.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/highmem.h>
> +#include <linux/gfp.h>
> +#include <linux/radix-tree.h>
> +#include <linux/buffer_head.h> /* invalidate_bh_lrus() */
> +
> +#include <asm/uaccess.h>
> +
> +#define SECTOR_SHIFT		9

That's our third definition of SECTOR_SHIFT.

> +#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
> +#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
> +
> +/*
> + * Each block ramdisk device has a radix_tree brd_pages of pages that stores
> + * the pages containing the block device's contents. A brd page's ->index is
> + * its offset in PAGE_SIZE units. This is similar to, but in no way connected
> + * with, the kernel's pagecache or buffer cache (which sit above our block
> + * device).
> + */
> +struct brd_device {
> +	int		brd_number;
> +	int		brd_refcnt;
> +	loff_t		brd_offset;
> +	loff_t		brd_sizelimit;
> +	unsigned	brd_blocksize;
> +
> +	struct request_queue	*brd_queue;
> +	struct gendisk		*brd_disk;
> +	struct list_head	brd_list;
> +
> +	/*
> +	 * Backing store of pages and lock to protect it. This is the contents
> +	 * of the block device.
> +	 */
> +	spinlock_t		brd_lock;
> +	struct radix_tree_root	brd_pages;
> +};
> +
> +/*
> + * Look up and return a brd's page for a given sector.
> + */
> +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
> +{
> +	unsigned long idx;

Could use pgoff_t here if you think that's clearer.

> +	struct page *page;
> +
> +	/*
> +	 * The page lifetime is protected by the fact that we have opened the
> +	 * device node -- brd pages will never be deleted under us, so we
> +	 * don't need any further locking or refcounting.
> +	 *
> +	 * This is strictly true for the radix-tree nodes as well (ie. we
> +	 * don't actually need the rcu_read_lock()), however that is not a
> +	 * documented feature of the radix-tree API so it is better to be
> +	 * safe here (we don't have total exclusion from radix tree updates
> +	 * here, only deletes).
> +	 */
> +	rcu_read_lock();
> +	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
> +	page = radix_tree_lookup(&brd->brd_pages, idx);
> +	rcu_read_unlock();
> +
> +	BUG_ON(page && page->index != idx);
> +
> +	return page;
> +}
> +
> +/*
> + * Look up and return a brd's page for a given sector.
> + * If one does not exist, allocate an empty page, and insert that. Then
> + * return it.
> + */
> +static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
> +{
> +	unsigned long idx;
> +	struct page *page;
> +
> +	page = brd_lookup_page(brd, sector);
> +	if (page)
> +		return page;
> +
> +	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);

Why GFP_NOIO?

Have you thought about __GFP_MOVABLE treatment here?  Seems pretty
desirable as this sucker can be large.

> +	if (!page)
> +		return NULL;
> +
> +	if (radix_tree_preload(GFP_NOIO)) {
> +		__free_page(page);
> +		return NULL;
> +	}
> +
> +	spin_lock(&brd->brd_lock);
> +	idx = sector >> PAGE_SECTORS_SHIFT;
> +	if (radix_tree_insert(&brd->brd_pages, idx, page)) {
> +		__free_page(page);
> +		page = radix_tree_lookup(&brd->brd_pages, idx);
> +		BUG_ON(!page);
> +		BUG_ON(page->index != idx);
> +	} else
> +		page->index = idx;
> +	spin_unlock(&brd->brd_lock);
> +
> +	radix_tree_preload_end();
> +
> +	return page;
> +}
> +
> +/*
> + * Free all backing store pages and radix tree. This must only be called when
> + * there are no other users of the device.
> + */
> +#define FREE_BATCH 16
> +static void brd_free_pages(struct brd_device *brd)
> +{
> +	unsigned long pos = 0;
> +	struct page *pages[FREE_BATCH];
> +	int nr_pages;
> +
> +	do {
> +		int i;
> +
> +		nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
> +				(void **)pages, pos, FREE_BATCH);
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			void *ret;
> +
> +			BUG_ON(pages[i]->index < pos);
> +			pos = pages[i]->index;
> +			ret = radix_tree_delete(&brd->brd_pages, pos);
> +			BUG_ON(!ret || ret != pages[i]);
> +			__free_page(pages[i]);
> +		}
> +
> +		pos++;
> +
> +	} while (nr_pages == FREE_BATCH);
> +}

I have vague memories that radix_tree_gang_lookup()'s "naive"
implementation may return fewer items than you asked for even when there
are more items remaining - when it hits certain boundaries.

> +/*
> + * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
> + */
> +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
> +{
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min((unsigned long)n, PAGE_SIZE - offset);

use min_t.  Or, better, sort out the types.

> +	if (!brd_insert_page(brd, sector))
> +		return -ENOMEM;
> +	if (copy < n) {
> +		sector += copy >> SECTOR_SHIFT;
> +		if (!brd_insert_page(brd, sector))
> +			return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Copy n bytes from src to the brd starting at sector. Does not sleep.
> + */
> +static void copy_to_brd(struct brd_device *brd, const void *src,
> +			sector_t sector, size_t n)
> +{
> +	struct page *page;
> +	void *dst;
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min((unsigned long)n, PAGE_SIZE - offset);

Ditto.

> +	page = brd_lookup_page(brd, sector);
> +	BUG_ON(!page);
> +
> +	dst = kmap_atomic(page, KM_USER1);
> +	memcpy(dst + offset, src, copy);
> +	kunmap_atomic(dst, KM_USER1);

Might need flush_dcache_page() if mmap gets sorted out.

> +	if (copy < n) {
> +		src += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		page = brd_lookup_page(brd, sector);
> +		BUG_ON(!page);
> +
> +		dst = kmap_atomic(page, KM_USER1);
> +		memcpy(dst, src, copy);
> +		kunmap_atomic(dst, KM_USER1);
> +	}
> +}
> +
> +/*
> + * Copy n bytes to dst from the brd starting at sector. Does not sleep.
> + */
> +static void copy_from_brd(void *dst, struct brd_device *brd,
> +			sector_t sector, size_t n)
> +{
> +	struct page *page;
> +	const void *src;
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min((unsigned long)n, PAGE_SIZE - offset);

tritto.

> +	page = brd_lookup_page(brd, sector);
> +	if (page) {
> +		src = kmap_atomic(page, KM_USER1);
> +		memcpy(dst, src + offset, copy);
> +		kunmap_atomic(src, KM_USER1);
> +	} else
> +		memset(dst, 0, copy);
> +
> +	if (copy < n) {
> +		dst += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		page = brd_lookup_page(brd, sector);
> +		if (page) {
> +			src = kmap_atomic(page, KM_USER1);
> +			memcpy(dst, src, copy);
> +			kunmap_atomic(src, KM_USER1);
> +		} else
> +			memset(dst, 0, copy);
> +	}
> +}
> +
> +/*
> + * Process a single bvec of a bio.
> + */
> +static int brd_do_bvec(struct brd_device *brd, struct page *page,
> +			unsigned int len, unsigned int off, int rw,
> +			sector_t sector)
> +{
> +	void *mem;
> +	int err = 0;
> +
> +	if (rw != READ) {
> +		err = copy_to_brd_setup(brd, sector, len);
> +		if (err)
> +			goto out;
> +	}
> +
> +	mem = kmap_atomic(page, KM_USER0);
> +	if (rw == READ) {
> +		copy_from_brd(mem + off, brd, sector, len);
> +		flush_dcache_page(page);

hm, there's a flush_dcache_page().  I guess you've throught it through ;)

> +	} else
> +		copy_to_brd(brd, mem + off, sector, len);
> +	kunmap_atomic(mem, KM_USER0);
> +
> +out:
> +	return err;
> +}
> +
>
> ...
>
> +static int brd_ioctl(struct inode *inode, struct file *file,
> +			unsigned int cmd, unsigned long arg)
> +{
> +	int error;
> +	struct block_device *bdev = inode->i_bdev;
> +	struct brd_device *brd = bdev->bd_disk->private_data;
> +
> +	if (cmd != BLKFLSBUF)
> +		return -ENOTTY;
> +
> +	/*
> +	 * ram device BLKFLSBUF has special semantics, we want to actually
> +	 * release and destroy the ramdisk data.
> +	 */
> +	mutex_lock(&bdev->bd_mutex);
> +	error = -EBUSY;
> +	if (bdev->bd_openers <= 1) {
> +		/*
> +		 * Invalidate the cache first, so it isn't written
> +		 * back to the device.
> +		 */
> +		invalidate_bh_lrus();
> +		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);

hm, some other thread can instantiate pagecache here.  I guess it's always
been like that and there's not a lot we can (or should) do about it.

> +		brd_free_pages(brd);
> +		error = 0;
> +	}
> +	mutex_unlock(&bdev->bd_mutex);
> +
> +	return error;
> +}
> +
>
> ...
>
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -1436,6 +1436,7 @@ void invalidate_bh_lrus(void)
>  {
>  	on_each_cpu(invalidate_bh_lru, NULL, 1, 1);
>  }
> +EXPORT_SYMBOL_GPL(invalidate_bh_lrus);

Maybe create a new helper function which does
invalidate_bh_lrus()+truncate_inode_pages(), call that from kill_bdev() and
here, make invalidate_bh_lrus() static.

That's a separate patch, I guess.

  reply	other threads:[~2007-12-04  6:30 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-12-04  4:26 [patch] rewrite rd Nick Piggin
2007-12-04  4:26 ` Nick Piggin
2007-12-04  6:29 ` Andrew Morton [this message]
2007-12-04  7:01   ` Nick Piggin
2007-12-04  7:08     ` Nick Piggin
2007-12-04  7:55 ` Rob Landley
2007-12-04  9:29   ` Nick Piggin
2007-12-04 19:53     ` Rob Landley
2007-12-04  9:54 ` Christian Borntraeger
2007-12-04 10:10   ` Nick Piggin
2007-12-04 11:21     ` [patch] rd: support XIP Nick Piggin
2007-12-04 11:23       ` [patch] ext2: xip check fix Nick Piggin
2007-12-05 15:43         ` Carsten Otte
2007-12-05 23:33           ` Nick Piggin
2007-12-06  8:43             ` Carsten Otte
2007-12-06  8:52               ` Nick Piggin
2007-12-06  9:59                 ` Carsten Otte
2007-12-06 10:18                   ` Nick Piggin
2007-12-06 10:24                     ` Carsten Otte
2007-12-06 18:11                       ` Rob Landley
2007-12-07  3:22                         ` Jared Hulbert
2007-12-07  4:17                           ` Rob Landley
2007-12-07  4:23                             ` Nick Piggin
2007-12-07  4:40                             ` Jared Hulbert
2007-12-07  8:59                           ` Carsten Otte
2007-12-07  9:52                             ` Jared Hulbert
2007-12-04 11:26       ` [patch] rd: support XIP Andrew Morton
2007-12-04 11:35         ` Nick Piggin
2007-12-04 13:00           ` [patch] mm: fix XIP file writes Nick Piggin
2007-12-10 14:38             ` Christian Borntraeger
2007-12-12  4:03               ` Nick Piggin
2007-12-04 12:06       ` [patch] rd: support XIP Duane Griffin
2007-12-04 13:03         ` [patch] rd: support XIP (updated) Nick Piggin
2008-01-14 16:47 ` [patch] rewrite rd Matthew Wilcox
2008-01-14 17:21   ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20071203222903.c820707b.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=borntraeger@de.ibm.com \
    --cc=ebiederm@xmission.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=rob@landley.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.