Re: AZFS file system proposal

linuxppc-dev.lists.ozlabs.org archive mirror

* Re: AZFS file system proposal
       [not found] <20080618160629.6cd749a8@mercedes-benz.boeblingen.de.ibm.com>
@ 2008-07-01 14:59 ` Arnd Bergmann
  2008-07-07 15:39   ` Maxim Shchetynin
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Arnd Bergmann @ 2008-07-01 14:59 UTC
  To: Maxim Shchetynin; +Cc: linux-fsdevel, linuxppc-dev, linux-kernel

On Wednesday 18 June 2008, Maxim Shchetynin wrote:
> AZFS patch updated accordingly to comments of Christoph Hellwig and Dmitri Vorobiev.

Sorry for not commenting on this earlier. I'm finally collecting my
2.6.27 patches and stumbled over it again. There are a few details
that I hope we can fix up quickly; other than that, it looks good now.
Great work!

> Subject: azfs: initial submit of azfs, a non-buffered filesystem
 
Please make the patch subject the actual subject of your email next time,
and put the introductory text below the Signed-off-by: lines, separated
by a "---" line. That will make the standard tools work without extra
effort on my side. Also, please always Cc the person you want to merge
the patch, in this case probably me.

> diff -Nuar linux-2.6.26-rc6/fs/Makefile linux-2.6.26-rc6-azfs/fs/Makefile
> --- linux-2.6.26-rc6/fs/Makefile	2008-06-12 23:22:24.000000000 +0200
> +++ linux-2.6.26-rc6-azfs/fs/Makefile	2008-06-16 11:17:50.000000000 +0200
> @@ -119,3 +119,4 @@
>  obj-$(CONFIG_DEBUG_FS)		+= debugfs/
>  obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
>  obj-$(CONFIG_GFS2_FS)           += gfs2/
> +obj-$(CONFIG_AZ_FS)		+= azfs.o
> diff -Nuar linux-2.6.26-rc6/fs/azfs.c linux-2.6.26-rc6-azfs/fs/azfs.c
> --- linux-2.6.26-rc6/fs/azfs.c	1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.26-rc6-azfs/fs/azfs.c	2008-06-18 15:56:13.252266896 +0200

All other file systems are in separate directories, so it would be better
to rename fs/azfs.c to fs/azfs/inode.c

> +#define AZFS_FILESYSTEM_NAME		"azfs"
> +#define AZFS_FILESYSTEM_FLAGS		FS_REQUIRES_DEV
> +
> +#define AZFS_SUPERBLOCK_MAGIC		0xABBA1972
> +#define AZFS_SUPERBLOCK_FLAGS		MS_NOEXEC | \
> +					MS_SYNCHRONOUS | \
> +					MS_DIRSYNC | \
> +					MS_ACTIVE

Why MS_NOEXEC? What happens on a remount if the user does not specify
-o remount,exec?

> +/**
> + * azfs_block_find - get real address of a part of a file
> + * @inode: inode
> + * @direction: data direction
> + * @from: offset for read/write operation
> + * @size: pointer to a value of the amount of data to be read/written
> + */
> +static unsigned long
> +azfs_block_find(struct inode *inode, enum azfs_direction direction,
> +		unsigned long from, unsigned long *size)
> +{
> +	struct azfs_super *super;
> +	struct azfs_znode *znode;
> +	struct azfs_block *block;
> +	unsigned long block_id, west, east;
> +
> +	super = inode->i_sb->s_fs_info;
> +	znode = I2Z(inode);
> +
> +	if (from + *size > znode->size) {
> +		i_size_write(inode, from + *size);
> +		inode->i_op->truncate(inode);
> +	}
> +
> +	read_lock(&znode->lock);
> +
> +	if (list_empty(&znode->block_list)) {
> +		read_unlock(&znode->lock);
> +		return 0;
> +	}
> +
> +	block_id = from >> super->block_shift;
> +
> +	for_each_block(block, &znode->block_list) {
> +		if (block->count > block_id)
> +			break;
> +		block_id -= block->count;
> +	}
> +
> +	west = from % super->block_size;
> +	east = ((block->count - block_id) << super->block_shift) - west;
> +
> +	if (*size > east)
> +		*size = east;
> +
> +	block_id = ((block->id + block_id) << super->block_shift) + west;
> +
> +	read_unlock(&znode->lock);
> +
> +	block_id += direction == AZFS_MMAP ? super->ph_addr : super->io_addr;
> +
> +	return block_id;
> +}

This overloading of the return type to mean either a pointer or an offset
on the block device is rather confusing. Why not just return the raw block_id
before the last += and leave that part up to the caller?

static void __iomem *
azfs_block_addr(struct inode *inode, enum azfs_direction direction,
		unsigned long from, unsigned long *size)
{
	struct azfs_super *super;
	unsigned long offset;
	void __iomem *p;

	super = inode->i_sb->s_fs_info;
	offset = azfs_block_find(inode, super, 0, from, size);
	p = super->ph_addr + offset;

	return p;
}
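
A minimal userspace sketch of the separation suggested above, assuming simplified stand-ins for the patch's structures: the names `struct extent`, `struct layout`, `extent_find`, `mmap_addr` and `rw_addr` are hypothetical and do not appear in the patch. The lookup walks the extent list and yields only a raw device offset; each caller then adds the base address it needs. Unlike the quoted function, this sketch reports an error when the offset lies past the last extent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the patch's block list and super. */
struct extent {
	unsigned long id;	/* first device block of the extent */
	unsigned long count;	/* number of contiguous blocks */
};

struct layout {
	const struct extent *extents;
	size_t nr_extents;
	unsigned int block_shift;	/* log2 of the block size */
	unsigned long ph_addr;		/* base for mmap (illustrative value) */
	unsigned long io_addr;		/* base for read/write (illustrative value) */
};

/*
 * Translate a file offset into a raw offset on the device.
 * *size is clamped to the bytes left in the containing extent.
 * Returns 0 on success, -1 if the offset lies beyond the last extent.
 */
static int extent_find(const struct layout *l, unsigned long from,
		       unsigned long *size, unsigned long *dev_off)
{
	unsigned long block_size = 1UL << l->block_shift;
	unsigned long block_id = from >> l->block_shift;
	unsigned long west = from & (block_size - 1);
	size_t i;

	assert(size != NULL && dev_off != NULL);

	for (i = 0; i < l->nr_extents; i++) {
		const struct extent *e = &l->extents[i];
		unsigned long east;

		if (e->count <= block_id) {
			block_id -= e->count;	/* skip this extent */
			continue;
		}
		east = ((e->count - block_id) << l->block_shift) - west;
		if (*size > east)
			*size = east;
		*dev_off = ((e->id + block_id) << l->block_shift) + west;
		return 0;
	}
	return -1;
}

/* The callers choose the base, so the lookup result has one meaning only. */
static unsigned long mmap_addr(const struct layout *l, unsigned long dev_off)
{
	return l->ph_addr + dev_off;
}

static unsigned long rw_addr(const struct layout *l, unsigned long dev_off)
{
	return l->io_addr + dev_off;
}
```

Keeping the base address out of the lookup also means that a zero return no longer has to double as an error marker, which is why the sketch uses a separate status code.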

> +	target = iov->iov_base;
> +	todo = min((loff_t) iov->iov_len, i_size_read(inode) - pos);
> +
> +	for (step = todo; step; step -= size) {
> +		size = step;
> +		pin = azfs_block_find(inode, AZFS_READ, pos, &size);
> +		if (!pin) {
> +			rc = -ENOSPC;
> +			goto out;
> +		}
> +		if (copy_to_user(target, (void*) pin, size)) {
> +			rc = -EFAULT;
> +			goto out;
> +		}

Question to the powerpc folks: is copy_to_user safe for an __iomem source?
Should there be two copies (memcpy_fromio and copy_to_user) instead?
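
To make the "two copies" alternative concrete, here is a userspace model of a bounce-buffer read loop. It is only a sketch: `model_memcpy_fromio` and `model_copy_to_user` are hypothetical stand-ins implemented with plain `memcpy()` so the code runs anywhere, and `BOUNCE_SIZE` is an arbitrary illustrative value. In the kernel the two steps would be memcpy_fromio() and copy_to_user(); whether the extra copy is actually required on powerpc is the open question above, and this sketch does not answer it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 64	/* illustrative chunk size, not from the patch */

/* Stand-in for memcpy_fromio(): device memory -> kernel buffer. */
static void model_memcpy_fromio(void *dst, const volatile void *src, size_t n)
{
	memcpy(dst, (const void *)src, n);
}

/*
 * Stand-in for copy_to_user(): returns the number of bytes NOT copied,
 * so 0 means success. This model never faults.
 */
static unsigned long model_copy_to_user(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/*
 * Copy len bytes from device memory to a user buffer in two steps per
 * chunk: device -> bounce buffer, then bounce buffer -> user.
 * Returns 0 on success, -1 if the second step reports a fault.
 */
static int read_via_bounce(void *user, const volatile void *dev, size_t len)
{
	unsigned char bounce[BOUNCE_SIZE];
	const volatile unsigned char *src = dev;
	unsigned char *dst = user;

	assert(user != NULL && dev != NULL);

	while (len) {
		size_t chunk = len < BOUNCE_SIZE ? len : BOUNCE_SIZE;

		model_memcpy_fromio(bounce, src, chunk);
		if (model_copy_to_user(dst, bounce, chunk))
			return -1;
		src += chunk;
		dst += chunk;
		len -= chunk;
	}
	return 0;
}
```

The cost of this approach is one extra copy per chunk, which is the trade-off the question is weighing against a direct copy_to_user() from the mapped region.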

> +	page_prot = pgprot_val(vma->vm_page_prot);
> +	page_prot |= (_PAGE_NO_CACHE | _PAGE_RW);
> +	page_prot &= ~_PAGE_GUARDED;
> +	vma->vm_page_prot = __pgprot(page_prot);

The pgprot modifications rely on powerpc specific flags, but the
file system should not really need to be powerpc only.

The flags we want are more or less the same as PAGE_AGP, because
both are I/O mapped memory that needs to be uncached but should
not be guarded, for performance reasons.

Maybe we can introduce a new PAGE_IOMEM here that we can use
in all places that need something like this. In spufs we need
the same flags for the local store mappings.

I wouldn't hold up merging the file system for this problem, but
until it is solved, the Kconfig entry should probably have
a "depends on PPC".
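
As a rough illustration of what a shared helper for such mappings could look like, the following userspace model wraps the three flag operations in one function. Everything here is an assumption: the `MODEL_PAGE_*` bit values are invented for the example (the real _PAGE_* constants are architecture specific), and `model_pgprot_iomem` is a hypothetical name, not an existing kernel interface.

```c
#include <assert.h>

/* Invented bit values for illustration; real _PAGE_* values differ per arch. */
#define MODEL_PAGE_RW		0x001UL
#define MODEL_PAGE_NO_CACHE	0x002UL
#define MODEL_PAGE_GUARDED	0x004UL

/*
 * Hypothetical helper: make a protection value uncached and writable,
 * but not guarded. All other bits are passed through unchanged.
 */
static unsigned long model_pgprot_iomem(unsigned long prot)
{
	prot |= MODEL_PAGE_NO_CACHE | MODEL_PAGE_RW;
	prot &= ~MODEL_PAGE_GUARDED;

	assert((prot & MODEL_PAGE_GUARDED) == 0);
	return prot;
}
```

With a helper of this shape, each architecture could supply its own definition and callers such as an mmap method would not need to name architecture-specific flags directly.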

	Arnd <><


* Re: AZFS file system proposal
  2008-07-01 14:59 ` AZFS file system proposal Arnd Bergmann
@ 2008-07-07 15:39   ` Maxim Shchetynin
  2008-07-08 14:42     ` Arnd Bergmann
  2008-07-07 15:42   ` azfs: initial submit of azfs, a non-buffered filesystem Maxim Shchetynin
  2008-07-09  8:58   ` AZFS file system proposal Benjamin Herrenschmidt
  2 siblings, 1 reply; 11+ messages in thread
From: Maxim Shchetynin @ 2008-07-07 15:39 UTC
  To: linux-fsdevel, linux-kernel, linuxppc-dev; +Cc: Arnd Bergmann

Thank you Arnd for your comments. I have changed my patch accordingly (I will
send it in a few minutes).

> > Subject: azfs: initial submit of azfs, a non-buffered filesystem
>
> Please make the patch subject the actual subject of your email next time,
> and put the introductory text below the Signed-off-by: lines, separated
> by a "---" line. That will make the standard tools work without extra
> effort on my side. Also, please always Cc the person you want to merge
> the patch, in this case probably me.

Done.

> All other file systems are in separate directories, so it would be better
> to rename fs/azfs.c to fs/azfs/inode.c

Done.

> > +#define AZFS_SUPERBLOCK_FLAGS		MS_NOEXEC | \
> > +					MS_SYNCHRONOUS | \
> > +					MS_DIRSYNC | \
> > +					MS_ACTIVE
>
> Why MS_NOEXEC? What happens on a remount if the user does not specify
> -o remount,exec?

I also don't see any reason to keep MS_NOEXEC, so I have just removed it.

> > +static unsigned long
> > +azfs_block_find(struct inode *inode, enum azfs_direction direction,
> > +		unsigned long from, unsigned long *size)
> > +{
> > ...
> > +}
>
> This overloading of the return type to mean either a pointer or an offset
> on the block device is rather confusing. Why not just return the raw block_id
> before the last += and leave that part up to the caller?

Changed.

> > +		if (copy_to_user(target, (void*) pin, size)) {
> > +			rc = -EFAULT;
> > +			goto out;
> > +		}
>
> Question to the powerpc folks: is copy_to_user safe for an __iomem source?
> Should there be two copies (memcpy_fromio and copy_to_user) instead?

I leave this question open.

> > +	page_prot = pgprot_val(vma->vm_page_prot);
> > +	page_prot |= (_PAGE_NO_CACHE | _PAGE_RW);
> > +	page_prot &= ~_PAGE_GUARDED;
> > +	vma->vm_page_prot = __pgprot(page_prot);
>
> The pgprot modifications rely on powerpc specific flags, but the
> file system should not really need to be powerpc only.
>
> The flags we want are more or less the same as PAGE_AGP, because
> both are I/O mapped memory that needs to be uncached but should
> not be guarded, for performance reasons.
>
> Maybe we can introduce a new PAGE_IOMEM here that we can use
> in all places that need something like this. In spufs we need
> the same flags for the local store mappings.
>
> I wouldn't hold up merging the file system for this problem, but
> until it is solved, the Kconfig entry should probably have
> a "depends on PPC".

Done.

-- 
Mit freundlichen Grüßen / met vriendelijke groeten / avec regards

    Maxim V. Shchetynin
    Linux Kernel Entwicklung
    IBM Deutschland Entwicklung GmbH
    Linux für Cell, Abteilung 3250
    Schönaicher Straße 220
    71032 Böblingen

Vorsitzender des Aufsichtsrats: Johann Weihen
Geschäftsführung: Herbert Kircher
Sitz der Gesellschaft: Böblingen
Registriergericht: Amtsgericht Stuttgart, HRB 243294

Fahr nur so schnell wie dein Schutzengel fliegen kann!


* azfs: initial submit of azfs, a non-buffered filesystem
  2008-07-01 14:59 ` AZFS file system proposal Arnd Bergmann
  2008-07-07 15:39   ` Maxim Shchetynin
@ 2008-07-07 15:42   ` Maxim Shchetynin
  2008-07-07 19:37     ` Uli Luckas
  2008-07-09  8:58   ` AZFS file system proposal Benjamin Herrenschmidt
  2 siblings, 1 reply; 11+ messages in thread
From: Maxim Shchetynin @ 2008-07-07 15:42 UTC
  To: linux-fsdevel, linux-kernel, linuxppc-dev; +Cc: Arnd Bergmann

AZFS is a file system which keeps all files on memory mapped random
access storage. It was designed to work on the axonram device driver
for IBM QS2x blade servers, but can operate on any block device
that exports a direct_access method.

Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
---

diff -Nuar linux-2.6.26-rc9/Documentation/filesystems/azfs.txt linux-2.6.26-rc9-azfs/Documentation/filesystems/azfs.txt
--- linux-2.6.26-rc9/Documentation/filesystems/azfs.txt	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.26-rc9-azfs/Documentation/filesystems/azfs.txt	2008-07-07 13:43:45.235739896 +0200
@@ -0,0 +1,22 @@
+AZFS is a file system which keeps all files on memory mapped random
+access storage. It was designed to work on the axonram device driver
+for IBM QS2x blade servers, but can operate on any block device
+that exports a direct_access method.
+
+Everything in AZFS is temporary in the sense that all the data stored
+therein is lost when you switch off or reboot a system. If you unmount
+an AZFS instance, all the data will be kept on device as long your system
+is not shut down or rebooted. You can later mount AZFS on from device again
+to get access to your files.
+
+AZFS uses a block device only for data but not for file information.
+All inodes (file and directory information) is kept in RAM.
+
+When you mount AZFS you are able to specify a file system block size with
+'-o bs=<size in bytes>' option. There are no software limitations for
+a block size but you would not be able to mmap files on AZFS if block size
+is less than a system page size. If no '-o bs' option is specified on mount
+a block size of the used block device is used as a default block size for AZFS.
+
+Other available mount options for AZFS are '-o uid=<id>' and '-o gid=<id>',
+which allow you to set the owner and group of the root of the file system.
diff -Nuar linux-2.6.26-rc9/arch/powerpc/configs/cell_defconfig linux-2.6.26-rc9-azfs/arch/powerpc/configs/cell_defconfig
--- linux-2.6.26-rc9/arch/powerpc/configs/cell_defconfig	2008-07-06 00:53:22.000000000 +0200
+++ linux-2.6.26-rc9-azfs/arch/powerpc/configs/cell_defconfig	2008-07-07 13:43:45.244738607 +0200
@@ -240,6 +240,7 @@
 # CPU Frequency drivers
 #
 CONFIG_AXON_RAM=m
+CONFIG_AZ_FS=m
 # CONFIG_FSL_ULI1575 is not set
 
 #
diff -Nuar linux-2.6.26-rc9/fs/Kconfig linux-2.6.26-rc9-azfs/fs/Kconfig
--- linux-2.6.26-rc9/fs/Kconfig	2008-07-06 00:53:22.000000000 +0200
+++ linux-2.6.26-rc9-azfs/fs/Kconfig	2008-07-07 13:45:29.397644341 +0200
@@ -1017,6 +1017,22 @@
 config HUGETLB_PAGE
 	def_bool HUGETLBFS
 
+config AZ_FS
+	tristate "AZFS filesystem support"
+	depends on PPC
+	help
+	  azfs is a file system for I/O attached memory backing. It requires
+	  a block device with direct_access capability, e.g. axonram.
+	  Mounting such device with azfs gives memory mapped access to the
+	  underlying memory to user space.
+
+	  Read <file:Documentation/filesystems/azfs.txt> for details.
+
+	  To compile this file system support as a module, choose M here: the
+	  module will be called azfs.
+
+	  If unsure, say N.
+
 config CONFIGFS_FS
 	tristate "Userspace-driven configuration filesystem"
 	depends on SYSFS
diff -Nuar linux-2.6.26-rc9/fs/Makefile linux-2.6.26-rc9-azfs/fs/Makefile
--- linux-2.6.26-rc9/fs/Makefile	2008-07-06 00:53:22.000000000 +0200
+++ linux-2.6.26-rc9-azfs/fs/Makefile	2008-07-07 13:45:49.436832234 +0200
@@ -119,3 +119,4 @@
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_GFS2_FS)           += gfs2/
+obj-$(CONFIG_AZ_FS)		+= azfs/
diff -Nuar linux-2.6.26-rc9/fs/azfs/Makefile linux-2.6.26-rc9-azfs/fs/azfs/Makefile
--- linux-2.6.26-rc9/fs/azfs/Makefile	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.26-rc9-azfs/fs/azfs/Makefile	2008-07-07 13:46:38.413264402 +0200
@@ -0,0 +1,7 @@
+#
+# Makefile for azfs routines
+#
+
+obj-$(CONFIG_AZ_FS) += azfs.o
+
+azfs-y := inode.o
diff -Nuar linux-2.6.26-rc9/fs/azfs/inode.c linux-2.6.26-rc9-azfs/fs/azfs/inode.c
--- linux-2.6.26-rc9/fs/azfs/inode.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.26-rc9-azfs/fs/azfs/inode.c	2008-07-07 17:31:06.183098986 +0200
@@ -0,0 +1,1176 @@
+/*
+ * (C) Copyright IBM Deutschland Entwicklung GmbH 2007
+ *
+ * Author: Maxim Shchetynin <maxim@de.ibm.com>
+ *
+ * Non-buffered filesystem driver.
+ * It registers a filesystem which may be used for all kind of block devices
+ * which have a direct_access() method in block_device_operations.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
+#include <linux/cache.h>
+#include <linux/dcache.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/kernel.h>
+#include <linux/limits.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/mutex.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
+#include <linux/parser.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/stat.h>
+#include <linux/statfs.h>
+#include <linux/string.h>
+#include <linux/time.h>
+#include <linux/types.h>
+#include <linux/aio.h>
+#include <linux/uio.h>
+#include <asm/bug.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/string.h>
+
+#define AZFS_FILESYSTEM_NAME		"azfs"
+#define AZFS_FILESYSTEM_FLAGS		FS_REQUIRES_DEV
+
+#define AZFS_SUPERBLOCK_MAGIC		0xABBA1972
+#define AZFS_SUPERBLOCK_FLAGS		MS_SYNCHRONOUS | \
+					MS_DIRSYNC | \
+					MS_ACTIVE
+
+#define AZFS_BDI_CAPABILITIES		BDI_CAP_NO_ACCT_DIRTY | \
+					BDI_CAP_NO_WRITEBACK | \
+					BDI_CAP_MAP_COPY | \
+					BDI_CAP_MAP_DIRECT | \
+					BDI_CAP_VMFLAGS
+
+#define AZFS_CACHE_FLAGS		SLAB_HWCACHE_ALIGN | \
+					SLAB_RECLAIM_ACCOUNT | \
+					SLAB_MEM_SPREAD
+
+struct azfs_super {
+	struct list_head		list;
+	unsigned long			media_size;
+	unsigned long			block_size;
+	unsigned short			block_shift;
+	unsigned long			sector_size;
+	unsigned short			sector_shift;
+	uid_t				uid;
+	gid_t				gid;
+	unsigned long			ph_addr;
+	unsigned long			io_addr;
+	struct block_device		*blkdev;
+	struct dentry			*root;
+	struct list_head		block_list;
+	rwlock_t			lock;
+};
+
+struct azfs_super_list {
+	struct list_head		head;
+	spinlock_t			lock;
+};
+
+struct azfs_block {
+	struct list_head		list;
+	unsigned long			id;
+	unsigned long			count;
+};
+
+struct azfs_znode {
+	struct list_head		block_list;
+	rwlock_t			lock;
+	loff_t				size;
+	struct inode			vfs_inode;
+};
+
+static struct azfs_super_list		super_list;
+static struct kmem_cache		*azfs_znode_cache __read_mostly = NULL;
+static struct kmem_cache		*azfs_block_cache __read_mostly = NULL;
+
+#define I2S(inode) \
+	inode->i_sb->s_fs_info
+#define I2Z(inode) \
+	container_of(inode, struct azfs_znode, vfs_inode)
+
+#define for_each_block(block, block_list) \
+	list_for_each_entry(block, block_list, list)
+#define for_each_block_reverse(block, block_list) \
+	list_for_each_entry_reverse(block, block_list, list)
+#define for_each_block_safe(block, temp, block_list) \
+	list_for_each_entry_safe(block, temp, block_list, list)
+#define for_each_block_safe_reverse(block, temp, block_list) \
+	list_for_each_entry_safe_reverse(block, temp, block_list, list)
+
+/**
+ * azfs_block_init - create and initialise a new block in a list
+ * @block_list: destination list
+ * @id: block id
+ * @count: size of a block
+ */
+static inline struct azfs_block*
+azfs_block_init(struct list_head *block_list,
+		unsigned long id, unsigned long count)
+{
+	struct azfs_block *block;
+
+	block = kmem_cache_alloc(azfs_block_cache, GFP_KERNEL);
+	if (!block)
+		return NULL;
+
+	block->id = id;
+	block->count = count;
+
+	INIT_LIST_HEAD(&block->list);
+	list_add_tail(&block->list, block_list);
+
+	return block;
+}
+
+/**
+ * azfs_block_free - remove block from a list and free it back in cache
+ * @block: block to be removed
+ */
+static inline void
+azfs_block_free(struct azfs_block *block)
+{
+	list_del(&block->list);
+	kmem_cache_free(azfs_block_cache, block);
+}
+
+/**
+ * azfs_block_move - move block to another list
+ * @block: block to be moved
+ * @block_list: destination list
+ */
+static inline void
+azfs_block_move(struct azfs_block *block, struct list_head *block_list)
+{
+	list_move_tail(&block->list, block_list);
+}
+
+/**
+ * azfs_block_find - get a block id of a part of a file
+ * @inode: inode
+ * @from: offset for read/write operation
+ * @size: pointer to a value of the amount of data to be read/written
+ */
+static unsigned long
+azfs_block_find(struct inode *inode, unsigned long from, unsigned long *size)
+{
+	struct azfs_super *super;
+	struct azfs_znode *znode;
+	struct azfs_block *block;
+	unsigned long block_id, west, east;
+
+	super = I2S(inode);
+	znode = I2Z(inode);
+
+	read_lock(&znode->lock);
+
+	while (from + *size > znode->size) {
+		read_unlock(&znode->lock);
+		i_size_write(inode, from + *size);
+		inode->i_op->truncate(inode);
+		read_lock(&znode->lock);
+	}
+
+	if (list_empty(&znode->block_list)) {
+		read_unlock(&znode->lock);
+		*size = 0;
+		return 0;
+	}
+
+	block_id = from >> super->block_shift;
+
+	for_each_block(block, &znode->block_list) {
+		if (block->count > block_id)
+			break;
+		block_id -= block->count;
+	}
+
+	west = from % super->block_size;
+	east = ((block->count - block_id) << super->block_shift) - west;
+
+	if (*size > east)
+		*size = east;
+
+	block_id = ((block->id + block_id) << super->block_shift) + west;
+
+	read_unlock(&znode->lock);
+
+	return block_id;
+}
+
+static struct inode*
+azfs_new_inode(struct super_block *, struct inode *, int, dev_t);
+
+/**
+ * azfs_mknod - mknod() method for inode_operations
+ * @dir, @dentry, @mode, @dev: see inode_operations methods
+ */
+static int
+azfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
+{
+	struct inode *inode;
+
+	inode = azfs_new_inode(dir->i_sb, dir, mode, dev);
+	if (!inode)
+		return -ENOSPC;
+
+	if (S_ISREG(mode))
+		I2Z(inode)->size = 0;
+
+	dget(dentry);
+	d_instantiate(dentry, inode);
+
+	return 0;
+}
+
+/**
+ * azfs_create - create() method for inode_operations
+ * @dir, @dentry, @mode, @nd: see inode_operations methods
+ */
+static int
+azfs_create(struct inode *dir, struct dentry *dentry, int mode,
+	    struct nameidata *nd)
+{
+	return azfs_mknod(dir, dentry, mode | S_IFREG, 0);
+}
+
+/**
+ * azfs_mkdir - mkdir() method for inode_operations
+ * @dir, @dentry, @mode: see inode_operations methods
+ */
+static int
+azfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+	int rc;
+
+	rc = azfs_mknod(dir, dentry, mode | S_IFDIR, 0);
+	if (!rc)
+		inc_nlink(dir);
+
+	return rc;
+}
+
+/**
+ * azfs_symlink - symlink() method for inode_operations
+ * @dir, @dentry, @name: see inode_operations methods
+ */
+static int
+azfs_symlink(struct inode *dir, struct dentry *dentry, const char *name)
+{
+	struct inode *inode;
+	int rc;
+
+	inode = azfs_new_inode(dir->i_sb, dir, S_IFLNK | S_IRWXUGO, 0);
+	if (!inode)
+		return -ENOSPC;
+
+	rc = page_symlink(inode, name, strlen(name) + 1);
+	if (rc) {
+		iput(inode);
+		return rc;
+	}
+
+	dget(dentry);
+	d_instantiate(dentry, inode);
+
+	return 0;
+}
+
+/**
+ * azfs_aio_read - aio_read() method for file_operations
+ * @iocb, @iov, @nr_segs, @pos: see file_operations methods
+ */
+static ssize_t
+azfs_aio_read(struct kiocb *iocb, const struct iovec *iov,
+	      unsigned long nr_segs, loff_t pos)
+{
+	struct azfs_super *super;
+	struct inode *inode;
+	void *target;
+	unsigned long pin;
+	unsigned long size, todo, step;
+	ssize_t rc;
+
+	inode = iocb->ki_filp->f_mapping->host;
+	super = I2S(inode);
+
+	mutex_lock(&inode->i_mutex);
+
+	if (pos >= i_size_read(inode)) {
+		rc = 0;
+		goto out;
+	}
+
+	target = iov->iov_base;
+	todo = min((loff_t) iov->iov_len, i_size_read(inode) - pos);
+
+	for (step = todo; step; step -= size) {
+		size = step;
+		pin = azfs_block_find(inode, pos, &size);
+		if (!size) {
+			rc = -ENOSPC;
+			goto out;
+		}
+		pin += super->io_addr;
+		if (copy_to_user(target, (void*) pin, size)) {
+			rc = -EFAULT;
+			goto out;
+		}
+
+		iocb->ki_pos += size;
+		pos += size;
+		target += size;
+	}
+
+	rc = todo;
+
+out:
+	mutex_unlock(&inode->i_mutex);
+
+	return rc;
+}
+
+/**
+ * azfs_aio_write - aio_write() method for file_operations
+ * @iocb, @iov, @nr_segs, @pos: see file_operations methods
+ */
+static ssize_t
+azfs_aio_write(struct kiocb *iocb, const struct iovec *iov,
+	       unsigned long nr_segs, loff_t pos)
+{
+	struct azfs_super *super;
+	struct inode *inode;
+	void *source;
+	unsigned long pin;
+	unsigned long size, todo, step;
+	ssize_t rc;
+
+	inode = iocb->ki_filp->f_mapping->host;
+	super = I2S(inode);
+
+	source = iov->iov_base;
+	todo = iov->iov_len;
+
+	mutex_lock(&inode->i_mutex);
+
+	for (step = todo; step; step -= size) {
+		size = step;
+		pin = azfs_block_find(inode, pos, &size);
+		if (!size) {
+			rc = -ENOSPC;
+			goto out;
+		}
+		pin += super->io_addr;
+		if (copy_from_user((void*) pin, source, size)) {
+			rc = -EFAULT;
+			goto out;
+		}
+
+		iocb->ki_pos += size;
+		pos += size;
+		source += size;
+	}
+
+	rc = todo;
+
+out:
+	mutex_unlock(&inode->i_mutex);
+
+	return rc;
+}
+
+/**
+ * azfs_open - open() method for file_operations
+ * @inode, @file: see file_operations methods
+ */
+static int
+azfs_open(struct inode *inode, struct file *file)
+{
+	if (file->f_flags & O_TRUNC) {
+		i_size_write(inode, 0);
+		inode->i_op->truncate(inode);
+	}
+	if (file->f_flags & O_APPEND)
+		inode->i_fop->llseek(file, 0, SEEK_END);
+
+	return 0;
+}
+
+/**
+ * azfs_mmap - mmap() method for file_operations
+ * @file, @vm: see file_operations methods
+ */
+static int
+azfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct azfs_super *super;
+	struct azfs_znode *znode;
+	struct inode *inode;
+	unsigned long cursor, pin;
+	unsigned long todo, size, vm_start;
+	pgprot_t page_prot;
+
+	inode = file->f_dentry->d_inode;
+	znode = I2Z(inode);
+	super = I2S(inode);
+
+	if (super->block_size < PAGE_SIZE)
+		return -EINVAL;
+
+	cursor = vma->vm_pgoff << super->block_shift;
+	todo = vma->vm_end - vma->vm_start;
+
+	if (cursor + todo > i_size_read(inode))
+		return -EINVAL;
+
+	page_prot = pgprot_val(vma->vm_page_prot);
+#ifdef CONFIG_PPC
+	page_prot |= (_PAGE_NO_CACHE | _PAGE_RW);
+	page_prot &= ~_PAGE_GUARDED;
+#else
+#warning You need to set in pgprot the PAGE_* flags specific to you architecture
+#endif
+	vma->vm_page_prot = __pgprot(page_prot);
+
+	vm_start = vma->vm_start;
+	for (size = todo; todo; todo -= size, size = todo) {
+		pin = azfs_block_find(inode, cursor, &size);
+		if (!size)
+			return -EAGAIN;
+		pin += super->ph_addr;
+		pin >>= PAGE_SHIFT;
+		if (remap_pfn_range(vma, vm_start, pin, size, vma->vm_page_prot))
+			return -EAGAIN;
+
+		vm_start += size;
+		cursor += size;
+	}
+
+	return 0;
+}
+
+/**
+ * azfs_truncate - truncate() method for inode_operations
+ * @inode: see inode_operations methods
+ */
+static void
+azfs_truncate(struct inode *inode)
+{
+	struct azfs_super *super;
+	struct azfs_znode *znode;
+	struct azfs_block *block, *tmp_block, *temp, *west, *east;
+	unsigned long id, count;
+	signed long delta;
+
+	super = I2S(inode);
+	znode = I2Z(inode);
+
+	delta = i_size_read(inode) + (super->block_size - 1);
+	delta >>= super->block_shift;
+	delta -= inode->i_blocks;
+
+	if (delta == 0) {
+		znode->size = i_size_read(inode);
+		return;
+	}
+
+	write_lock(&znode->lock);
+
+	while (delta > 0) {
+		west = east = NULL;
+
+		write_lock(&super->lock);
+
+		if (list_empty(&super->block_list)) {
+			write_unlock(&super->lock);
+			break;
+		}
+
+		for (count = delta; count; count--) {
+			for_each_block(block, &super->block_list)
+				if (block->count >= count) {
+					east = block;
+					break;
+				}
+			if (east)
+				break;
+		}
+
+		for_each_block_reverse(block, &znode->block_list) {
+			if (block->id + block->count == east->id)
+				west = block;
+			break;
+		}
+
+		if (east->count == count) {
+			if (west) {
+				west->count += east->count;
+				azfs_block_free(east);
+			} else {
+				azfs_block_move(east, &znode->block_list);
+			}
+		} else {
+			if (west) {
+				west->count += count;
+			} else {
+				if (!azfs_block_init(&znode->block_list,
+						east->id, count)) {
+					write_unlock(&super->lock);
+					break;
+				}
+			}
+
+			east->id += count;
+			east->count -= count;
+		}
+
+		write_unlock(&super->lock);
+
+		inode->i_blocks += count;
+
+		delta -= count;
+	}
+
+	while (delta < 0) {
+		for_each_block_safe_reverse(block, tmp_block, &znode->block_list) {
+			id = block->id;
+			count = block->count;
+			if ((signed long) count + delta > 0) {
+				block->count += delta;
+				id += block->count;
+				count -= block->count;
+				block = NULL;
+			}
+
+			west = east = NULL;
+
+			write_lock(&super->lock);
+
+			for_each_block(temp, &super->block_list) {
+				if (!west && (temp->id + temp->count == id))
+					west = temp;
+				else if (!east && (id + count == temp->id))
+					east = temp;
+				if (west && east)
+					break;
+			}
+
+			if (west && east) {
+				west->count += count + east->count;
+				azfs_block_free(east);
+				if (block)
+					azfs_block_free(block);
+			} else if (west) {
+				west->count += count;
+				if (block)
+					azfs_block_free(block);
+			} else if (east) {
+				east->id -= count;
+				east->count += count;
+				if (block)
+					azfs_block_free(block);
+			} else {
+				if (!block) {
+					if (!azfs_block_init(&super->block_list,
+							id, count)) {
+						write_unlock(&super->lock);
+						break;
+					}
+				} else {
+					azfs_block_move(block, &super->block_list);
+				}
+			}
+
+			write_unlock(&super->lock);
+
+			inode->i_blocks -= count;
+
+			delta += count;
+
+			break;
+		}
+	}
+
+	write_unlock(&znode->lock);
+
+	znode->size = min(i_size_read(inode),
+			(loff_t) inode->i_blocks << super->block_shift);
+}
+
+/**
+ * azfs_getattr - getattr() method for inode_operations
+ * @mnt, @dentry, @stat: see inode_operations methods
+ */
+static int
+azfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+	struct azfs_super *super;
+	struct inode *inode;
+	unsigned short shift;
+
+	inode = dentry->d_inode;
+	super = I2S(inode);
+
+	generic_fillattr(inode, stat);
+	stat->blocks = inode->i_blocks;
+	shift = super->block_shift - super->sector_shift;
+	if (shift)
+		stat->blocks <<= shift;
+
+	return 0;
+}
+
+static const struct address_space_operations azfs_aops = {
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end
+};
+
+static struct backing_dev_info azfs_bdi = {
+	.ra_pages	= 0,
+	.capabilities	= AZFS_BDI_CAPABILITIES
+};
+
+static struct inode_operations azfs_dir_iops = {
+	.create		= azfs_create,
+	.lookup		= simple_lookup,
+	.link		= simple_link,
+	.unlink		= simple_unlink,
+	.symlink	= azfs_symlink,
+	.mkdir		= azfs_mkdir,
+	.rmdir		= simple_rmdir,
+	.mknod		= azfs_mknod,
+	.rename		= simple_rename
+};
+
+static const struct file_operations azfs_reg_fops = {
+	.llseek		= generic_file_llseek,
+	.aio_read	= azfs_aio_read,
+	.aio_write	= azfs_aio_write,
+	.open		= azfs_open,
+	.mmap		= azfs_mmap,
+	.fsync		= simple_sync_file,
+};
+
+static struct inode_operations azfs_reg_iops = {
+	.truncate	= azfs_truncate,
+	.getattr	= azfs_getattr
+};
+
+/**
+ * azfs_new_inode - cook a new inode
+ * @sb: super-block
+ * @dir: parent directory
+ * @mode: file mode
+ * @dev: to be forwarded to init_special_inode()
+ */
+static struct inode*
+azfs_new_inode(struct super_block *sb, struct inode *dir, int mode, dev_t dev)
+{
+	struct azfs_super *super;
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+	inode->i_mode = mode;
+	if (dir) {
+		dir->i_mtime = dir->i_ctime = inode->i_mtime;
+		inode->i_uid = current->fsuid;
+		if (dir->i_mode & S_ISGID) {
+			if (S_ISDIR(mode))
+				inode->i_mode |= S_ISGID;
+			inode->i_gid = dir->i_gid;
+		} else {
+			inode->i_gid = current->fsgid;
+		}
+	} else {
+		super = sb->s_fs_info;
+		inode->i_uid = super->uid;
+		inode->i_gid = super->gid;
+	}
+
+	inode->i_blocks = 0;
+	inode->i_mapping->a_ops = &azfs_aops;
+	inode->i_mapping->backing_dev_info = &azfs_bdi;
+
+	switch (mode & S_IFMT) {
+	case S_IFDIR:
+		inode->i_op = &azfs_dir_iops;
+		inode->i_fop = &simple_dir_operations;
+		inc_nlink(inode);
+		break;
+
+	case S_IFREG:
+		inode->i_op = &azfs_reg_iops;
+		inode->i_fop = &azfs_reg_fops;
+		break;
+
+	case S_IFLNK:
+		inode->i_op = &page_symlink_inode_operations;
+		break;
+
+	default:
+		init_special_inode(inode, mode, dev);
+		break;
+	}
+
+	return inode;
+}
+
+/**
+ * azfs_alloc_inode - alloc_inode() method for super_operations
+ * @sb: see super_operations methods
+ */
+static struct inode*
+azfs_alloc_inode(struct super_block *sb)
+{
+	struct azfs_znode *znode;
+
+	znode = kmem_cache_alloc(azfs_znode_cache, GFP_KERNEL);
+	if (znode) {
+		INIT_LIST_HEAD(&znode->block_list);
+		rwlock_init(&znode->lock);
+
+		inode_init_once(&znode->vfs_inode);
+
+		return &znode->vfs_inode;
+	}
+
+	return NULL;
+}
+
+/**
+ * azfs_destroy_inode - destroy_inode() method for super_operations
+ * @inode: see super_operations methods
+ */
+static void
+azfs_destroy_inode(struct inode *inode)
+{
+	kmem_cache_free(azfs_znode_cache, I2Z(inode));
+}
+
+/**
+ * azfs_delete_inode - delete_inode() method for super_operations
+ * @inode: see super_operations methods
+ */
+static void
+azfs_delete_inode(struct inode *inode)
+{
+	if (S_ISREG(inode->i_mode)) {
+		i_size_write(inode, 0);
+		azfs_truncate(inode);
+	}
+	truncate_inode_pages(&inode->i_data, 0);
+	clear_inode(inode);
+}
+
+/**
+ * azfs_statfs - statfs() method for super_operations
+ * @dentry, @stat: see super_operations methods
+ */
+static int
+azfs_statfs(struct dentry *dentry, struct kstatfs *stat)
+{
+	struct super_block *sb;
+	struct azfs_super *super;
+	struct inode *inode;
+	unsigned long inodes, blocks;
+
+	sb = dentry->d_sb;
+	super = sb->s_fs_info;
+
+	inodes = blocks = 0;
+	mutex_lock(&sb->s_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		inodes++;
+		blocks += inode->i_blocks;
+	}
+	mutex_unlock(&sb->s_lock);
+
+	stat->f_type = AZFS_SUPERBLOCK_MAGIC;
+	stat->f_bsize = super->block_size;
+	stat->f_blocks = super->media_size >> super->block_shift;
+	stat->f_bfree = stat->f_blocks - blocks;
+	stat->f_bavail = stat->f_blocks - blocks;
+	stat->f_files = inodes + blocks;
+	stat->f_ffree = blocks + 1;
+	stat->f_namelen = NAME_MAX;
+
+	return 0;
+}
+
+static struct super_operations azfs_ops = {
+	.alloc_inode	= azfs_alloc_inode,
+	.destroy_inode	= azfs_destroy_inode,
+	.drop_inode	= generic_delete_inode,
+	.delete_inode	= azfs_delete_inode,
+	.statfs		= azfs_statfs
+};
+
+enum {
+	Opt_blocksize_short,
+	Opt_blocksize_long,
+	Opt_uid,
+	Opt_gid,
+	Opt_err
+};
+
+static match_table_t tokens = {
+	{Opt_blocksize_short, "bs=%u"},
+	{Opt_blocksize_long, "blocksize=%u"},
+	{Opt_uid, "uid=%u"},
+	{Opt_gid, "gid=%u"},
+	{Opt_err, NULL}
+};
+
+/**
+ * azfs_parse_mount_parameters - parse options given to mount with -o
+ * @super: azfs super block extension
+ * @options: comma separated options
+ */
+static int
+azfs_parse_mount_parameters(struct azfs_super *super, char *options)
+{
+	char *option;
+	int token, value;
+	substring_t args[MAX_OPT_ARGS];
+
+	if (!options)
+		return 1;
+
+	while ((option = strsep(&options, ",")) != NULL) {
+		if (!*option)
+			continue;
+
+		token = match_token(option, tokens, args);
+		switch (token) {
+		case Opt_blocksize_short:
+		case Opt_blocksize_long:
+			if (match_int(&args[0], &value))
+				goto syntax_error;
+			super->block_size = value;
+			break;
+
+		case Opt_uid:
+			if (match_int(&args[0], &value))
+				goto syntax_error;
+			super->uid = value;
+			break;
+
+		case Opt_gid:
+			if (match_int(&args[0], &value))
+				goto syntax_error;
+			super->gid = value;
+			break;
+
+		default:
+			goto syntax_error;
+		}
+	}
+
+	return 1;
+
+syntax_error:
+	printk(KERN_ERR "%s: invalid mount option\n",
+			AZFS_FILESYSTEM_NAME);
+
+	return 0;
+}
+
+/**
+ * azfs_fill_super - fill_super routine for get_sb
+ * @sb, @data, @silent: see file_system_type methods
+ */
+static int
+azfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct gendisk *disk;
+	struct azfs_super *super = NULL, *tmp_super;
+	struct azfs_block *block = NULL;
+	struct inode *inode = NULL;
+	void *kaddr;
+	unsigned long pfn;
+	int rc;
+
+	BUG_ON(!sb->s_bdev);
+
+	disk = sb->s_bdev->bd_disk;
+
+	BUG_ON(!disk || !disk->queue);
+
+	if (!disk->fops->direct_access) {
+		printk(KERN_ERR "%s needs a block device with a "
+				"direct_access() method\n",
+				AZFS_FILESYSTEM_NAME);
+		return -ENOSYS;
+	}
+
+	get_device(disk->driverfs_dev);
+
+	sb->s_magic = AZFS_SUPERBLOCK_MAGIC;
+	sb->s_flags = AZFS_SUPERBLOCK_FLAGS;
+	sb->s_op = &azfs_ops;
+	sb->s_maxbytes = get_capacity(disk) * disk->queue->hardsect_size;
+	sb->s_time_gran = 1;
+
+	spin_lock(&super_list.lock);
+	list_for_each_entry(tmp_super, &super_list.head, list)
+		if (tmp_super->blkdev == sb->s_bdev) {
+			super = tmp_super;
+			break;
+		}
+	spin_unlock(&super_list.lock);
+
+	if (super) {
+		if (data && strlen((char*) data))
+			printk(KERN_WARNING "/dev/%s was already mounted with "
+					"%s before, it will be mounted with "
+					"mount options used last time, "
+					"options just given would be ignored\n",
+					disk->disk_name, AZFS_FILESYSTEM_NAME);
+		sb->s_fs_info = super;
+	} else {
+		super = kzalloc(sizeof(struct azfs_super), GFP_KERNEL);
+		if (!super) {
+			rc = -ENOMEM;
+			goto failed;
+		}
+		sb->s_fs_info = super;
+
+		if (!azfs_parse_mount_parameters(super, (char*) data)) {
+			rc = -EINVAL;
+			goto failed;
+		}
+
+		inode = azfs_new_inode(sb, NULL, S_IFDIR | S_IRWXUGO, 0);
+		if (!inode) {
+			rc = -ENOMEM;
+			goto failed;
+		}
+
+		super->root = d_alloc_root(inode);
+		if (!super->root) {
+			rc = -ENOMEM;
+			goto failed;
+		}
+		dget(super->root);
+
+		INIT_LIST_HEAD(&super->list);
+		INIT_LIST_HEAD(&super->block_list);
+		rwlock_init(&super->lock);
+
+		super->media_size = sb->s_maxbytes;
+
+		if (!super->block_size)
+			super->block_size = sb->s_blocksize;
+		super->block_shift = blksize_bits(super->block_size);
+
+		super->sector_size = disk->queue->hardsect_size;
+		super->sector_shift = blksize_bits(super->sector_size);
+
+		super->blkdev = sb->s_bdev;
+
+		block = azfs_block_init(&super->block_list,
+				0, super->media_size >> super->block_shift);
+		if (!block) {
+			rc = -ENOMEM;
+			goto failed;
+		}
+
+		rc = disk->fops->direct_access(super->blkdev, 0, &kaddr, &pfn);
+		if (rc < 0) {
+			rc = -EFAULT;
+			goto failed;
+		}
+		super->ph_addr = (unsigned long) kaddr;
+
+		super->io_addr = (unsigned long) ioremap_flags(
+				super->ph_addr, super->media_size, _PAGE_NO_CACHE);
+		if (!super->io_addr) {
+			rc = -EFAULT;
+			goto failed;
+		}
+
+		spin_lock(&super_list.lock);
+		list_add(&super->list, &super_list.head);
+		spin_unlock(&super_list.lock);
+	}
+
+	sb->s_root = super->root;
+	disk->driverfs_dev->driver_data = super;
+	disk->driverfs_dev->platform_data = sb;
+
+	if (super->block_size < PAGE_SIZE)
+		printk(KERN_INFO "Block size on %s is smaller than system "
+				"page size: mmap() would not be supported\n",
+				disk->disk_name);
+
+	return 0;
+
+failed:
+	if (super) {
+		sb->s_root = NULL;
+		sb->s_fs_info = NULL;
+		if (block)
+			azfs_block_free(block);
+		if (super->root)
+			dput(super->root);
+		if (inode)
+			iput(inode);
+		disk->driverfs_dev->driver_data = NULL;
+		kfree(super);
+		disk->driverfs_dev->platform_data = NULL;
+		put_device(disk->driverfs_dev);
+	}
+
+	return rc;
+}
+
+/**
+ * azfs_get_sb - get_sb() method for file_system_type
+ * @fs_type, @flags, @dev_name, @data, @mount: see file_system_type methods
+ */
+static int
+azfs_get_sb(struct file_system_type *fs_type, int flags,
+	    const char *dev_name, void *data, struct vfsmount *mount)
+{
+	return get_sb_bdev(fs_type, flags,
+			dev_name, data, azfs_fill_super, mount);
+}
+
+/**
+ * azfs_kill_sb - kill_sb() method for file_system_type
+ * @sb: see file_system_type methods
+ */
+static void
+azfs_kill_sb(struct super_block *sb)
+{
+	sb->s_root = NULL;
+	kill_block_super(sb);
+}
+
+static struct file_system_type azfs_fs = {
+	.owner		= THIS_MODULE,
+	.name		= AZFS_FILESYSTEM_NAME,
+	.get_sb		= azfs_get_sb,
+	.kill_sb	= azfs_kill_sb,
+	.fs_flags	= AZFS_FILESYSTEM_FLAGS
+};
+
+/**
+ * azfs_init
+ */
+static int __init
+azfs_init(void)
+{
+	int rc;
+
+	INIT_LIST_HEAD(&super_list.head);
+	spin_lock_init(&super_list.lock);
+
+	azfs_znode_cache = kmem_cache_create("azfs_znode_cache",
+			sizeof(struct azfs_znode), 0, AZFS_CACHE_FLAGS, NULL);
+	if (!azfs_znode_cache) {
+		printk(KERN_ERR "Could not allocate inode cache for %s\n",
+				AZFS_FILESYSTEM_NAME);
+		rc = -ENOMEM;
+		goto failed;
+	}
+
+	azfs_block_cache = kmem_cache_create("azfs_block_cache",
+			sizeof(struct azfs_block), 0, AZFS_CACHE_FLAGS, NULL);
+	if (!azfs_block_cache) {
+		printk(KERN_ERR "Could not allocate block cache for %s\n",
+				AZFS_FILESYSTEM_NAME);
+		rc = -ENOMEM;
+		goto failed;
+	}
+
+	rc = register_filesystem(&azfs_fs);
+	if (rc != 0) {
+		printk(KERN_ERR "Could not register %s\n",
+				AZFS_FILESYSTEM_NAME);
+		goto failed;
+	}
+
+	return 0;
+
+failed:
+	if (azfs_block_cache)
+		kmem_cache_destroy(azfs_block_cache);
+
+	if (azfs_znode_cache)
+		kmem_cache_destroy(azfs_znode_cache);
+
+	return rc;
+}
+
+/**
+ * azfs_exit
+ */
+static void __exit
+azfs_exit(void)
+{
+	struct azfs_super *super, *tmp_super;
+	struct azfs_block *block, *tmp_block;
+	struct gendisk *disk;
+
+	spin_lock(&super_list.lock);
+	list_for_each_entry_safe(super, tmp_super, &super_list.head, list) {
+		disk = super->blkdev->bd_disk;
+		list_del(&super->list);
+		iounmap((void*) super->io_addr);
+		write_lock(&super->lock);
+		for_each_block_safe(block, tmp_block, &super->block_list)
+			azfs_block_free(block);
+		write_unlock(&super->lock);
+		disk->driverfs_dev->driver_data = NULL;
+		disk->driverfs_dev->platform_data = NULL;
+		kfree(super);
+		put_device(disk->driverfs_dev);
+	}
+	spin_unlock(&super_list.lock);
+
+	unregister_filesystem(&azfs_fs);
+
+	kmem_cache_destroy(azfs_block_cache);
+	kmem_cache_destroy(azfs_znode_cache);
+}
+
+module_init(azfs_init);
+module_exit(azfs_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Maxim Shchetynin <maxim@de.ibm.com>");
+MODULE_DESCRIPTION("Non-buffered file system for IO devices");

-- 
Mit freundlichen Grüßen / met vriendelijke groeten / avec regards

    Maxim V. Shchetynin
    Linux Kernel Entwicklung
    IBM Deutschland Entwicklung GmbH
    Linux für Cell, Abteilung 3250
    Schönaicher Straße 220
    71032 Böblingen

Vorsitzender des Aufsichtsrats: Johann Weihen
Geschäftsführung: Herbert Kircher
Sitz der Gesellschaft: Böblingen
Registriergericht: Amtsgericht Stuttgart, HRB 243294

Fahr nur so schnell wie dein Schutzengel fliegen kann!


* Re: azfs: initial submit of azfs, a non-buffered filesystem
  2008-07-07 15:42   ` azfs: initial submit of azfs, a non-buffered filesystem Maxim Shchetynin
@ 2008-07-07 19:37     ` Uli Luckas
  2008-07-08  9:10       ` Maxim Shchetynin
  0 siblings, 1 reply; 11+ messages in thread
From: Uli Luckas @ 2008-07-07 19:37 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linuxppc-dev, Maxim Shchetynin, Arnd Bergmann

On Monday, 7. July 2008, Maxim Shchetynin wrote:
> AZFS is a file system which keeps all files on memory mapped random
> access storage.
Hi Maxim, 
do you mean "memory backed" instead of "memory mapped"?

regards
Uli

-- 

------- ROAD ...the handyPC Company - - -  ) ) )

Uli Luckas
Software Development

ROAD GmbH
Bennigsenstr. 14 | 12159 Berlin | Germany
fon: +49 (30) 230069 - 64 | fax: +49 (30) 230069 - 69
url: www.road.de

Amtsgericht Charlottenburg: HRB 96688 B
Managing directors: Hans-Peter Constien, Hubertus von Streit


* Re: azfs: initial submit of azfs, a non-buffered filesystem
  2008-07-07 19:37     ` Uli Luckas
@ 2008-07-08  9:10       ` Maxim Shchetynin
  0 siblings, 0 replies; 11+ messages in thread
From: Maxim Shchetynin @ 2008-07-08  9:10 UTC (permalink / raw)
  To: LKML, linux-fsdevel, linuxppc-dev; +Cc: Uli Luckas, Arnd Bergmann

On Mon, 7 Jul 2008 21:37:43 +0200
Uli Luckas <u.luckas@road.de> wrote:

> > AZFS is a file system which keeps all files on memory mapped random
> > access storage.
> Hi Maxim,
> do you mean "memory backed" instead of "memory mapped"?

Right, I have corrected this already in my patch.
Thank you.



* Re: AZFS file system proposal
  2008-07-07 15:39   ` Maxim Shchetynin
@ 2008-07-08 14:42     ` Arnd Bergmann
  2008-07-09  6:48       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 11+ messages in thread
From: Arnd Bergmann @ 2008-07-08 14:42 UTC (permalink / raw)
  To: Maxim Shchetynin
  Cc: Mark Nelson, Gunnar von Boehn, linux-kernel, linuxppc-dev,
	Paul Mackerras, linux-fsdevel

On Monday 07 July 2008, Maxim Shchetynin wrote:
> > > +		if (copy_to_user(target, (void*) pin, size)) {
> > > +			rc = -EFAULT;
> > > +			goto out;
> > > +		}
> >
> > Question to the powerpc folks: is copy_to_user safe for an __iomem source?
> > Should there be two copies (memcpy_fromio and copy_to_user) instead?
>
> I leave this question open.
>

Cc:'ing some more people that might have more of a clue on this question.
_memcpy_fromio does a "sync" at the start and an "eieio" at the end.
AFAICT, neither is needed here because the source is always memory.

It also handles unaligned memory accesses, which copy_to_user should
also do correctly, so it *looks* like it should work with just a
copy_to_user, but it still feels wrong to use an __iomem pointer
as the source for a copy_to_user.

Any ideas?
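
For reference, a minimal sketch of the two-copy variant asked about above, modelled in plain userspace C. Everything here is an assumption for illustration: fake_memcpy_fromio() and fake_copy_to_user() are stand-ins for the kernel's memcpy_fromio() and copy_to_user(), two_step_copy() is a hypothetical helper name, and BOUNCE_SIZE is an arbitrary chunk size. The point is only that the user copy never reads from the I/O mapping directly.

```c
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 64	/* hypothetical chunk size, chosen arbitrarily */

/* Stand-in for memcpy_fromio(): byte-wise read from a volatile source. */
static void fake_memcpy_fromio(void *dst, const volatile void *src, size_t n)
{
	const volatile unsigned char *s = src;
	unsigned char *d = dst;

	while (n--)
		*d++ = *s++;
}

/*
 * Stand-in for copy_to_user(): like the kernel function, it returns the
 * number of bytes that could NOT be copied (always 0 in this model).
 */
static size_t fake_copy_to_user(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/*
 * Copy size bytes from an I/O-style source to a user-style destination
 * through an ordinary memory bounce buffer, one chunk at a time.
 * Returns 0 on success, -1 if the second copy reports a fault.
 */
static int two_step_copy(void *user_dst, const volatile void *io_src,
			 size_t size)
{
	unsigned char bounce[BOUNCE_SIZE];
	const volatile unsigned char *src = io_src;
	unsigned char *dst = user_dst;

	while (size) {
		size_t chunk = size < BOUNCE_SIZE ? size : BOUNCE_SIZE;

		fake_memcpy_fromio(bounce, src, chunk);
		if (fake_copy_to_user(dst, bounce, chunk))
			return -1;

		src += chunk;
		dst += chunk;
		size -= chunk;
	}

	return 0;
}
```

The cost of this shape is one extra pass over the data per chunk; whether that is acceptable for azfs is exactly the open question in this thread.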

	Arnd <><


* Re: AZFS file system proposal
  2008-07-08 14:42     ` Arnd Bergmann
@ 2008-07-09  6:48       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2008-07-09  6:48 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Mark Nelson, Gunnar von Boehn, linux-kernel, linuxppc-dev,
	Maxim Shchetynin, Paul Mackerras, linux-fsdevel

> Cc:'ing some more people that might have more of a clue on this question.
> _memcpy_fromio does a "sync" at the start and an "eieio" at the end.
> AFAICT, neither is needed here because the source is always memory.
> 
> It also handles unaligned memory accesses, which copy_to_user should
> also do correctly, so it *looks* like it should work with just a
> copy_to_user, but it still feels wrong to use an __iomem pointer
> as the source for a copy_to_user.
> 
> Any ideas?

It's a bit nasty yes. The problem is that copy_to/from_user might
do cache tricks which will blow up if the area is non-cacheable.

We have a similar problem with Mark's work on faster copy functions
since things like sys_read() can be called on userspace non-cacheable
memory such as spu local stores.

So I'm not 100% sure what the right approach is here. Our copy_tofrom_user
today does dcbt on the source for example, which I hope only turns into
a no-op... The risk is if we start using dcbz.

Cheers,
Ben.


* Re: AZFS file system proposal
  2008-07-01 14:59 ` AZFS file system proposal Arnd Bergmann
  2008-07-07 15:39   ` Maxim Shchetynin
  2008-07-07 15:42   ` azfs: initial submit of azfs, a non-buffered filesystem Maxim Shchetynin
@ 2008-07-09  8:58   ` Benjamin Herrenschmidt
  2008-07-09  9:14     ` Maxim Shchetynin
  2 siblings, 1 reply; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2008-07-09  8:58 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-fsdevel, linuxppc-dev, Maxim Shchetynin, linux-kernel

On Tue, 2008-07-01 at 16:59 +0200, Arnd Bergmann wrote:
> I wouldn't hold up merging the file system for this problem, but
> until it is solved, the Kconfig entry should probably have
> a "depends on PPC".

Better, use an ifdef for powerpc flags, and #else to pgprot_noncached.

Ben.


* Re: AZFS file system proposal
  2008-07-09  8:58   ` AZFS file system proposal Benjamin Herrenschmidt
@ 2008-07-09  9:14     ` Maxim Shchetynin
  2008-07-09  9:23       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 11+ messages in thread
From: Maxim Shchetynin @ 2008-07-09  9:14 UTC (permalink / raw)
  To: linux-fsdevel, linuxppc-dev, linux-kernel; +Cc: Arnd Bergmann

On Wed, 09 Jul 2008 18:58:38 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2008-07-01 at 16:59 +0200, Arnd Bergmann wrote:
> > I wouldn't hold up merging the file system for this problem, but
> > until it is solved, the Kconfig entry should probably have
> > a "depends on PPC".
>
> Better, use an ifdef for powerpc flags, and #else to pgprot_noncached.

Thank you Ben. Then, how about this?

azfs_mmap(struct file *file, struct vm_area_struct *vma)
{
...
...
...
#ifdef CONFIG_PPC
	pgprot_t page_prot;
#endif
...
...
...
#ifdef CONFIG_PPC
	page_prot = pgprot_val(vma->vm_page_prot);
	page_prot |= (_PAGE_NO_CACHE | _PAGE_RW);
	page_prot &= ~_PAGE_GUARDED;
	vma->vm_page_prot = __pgprot(page_prot);
#else
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
#endif
...
...
...

-- 
Mit freundlichen Grüßen / met vriendelijke groeten / avec regards

    Maxim V. Shchetynin
    Linux Kernel Entwicklung
    IBM Deutschland Research & Development GmbH
    Linux für Cell, Abteilung 3250
    Schönaicher Straße 220
    71032 Böblingen

Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher
Sitz der Gesellschaft: Böblingen
Registriergericht: Amtsgericht Stuttgart, HRB 243294

Fahr nur so schnell wie dein Schutzengel fliegen kann!


* Re: AZFS file system proposal
  2008-07-09  9:14     ` Maxim Shchetynin
@ 2008-07-09  9:23       ` Benjamin Herrenschmidt
  2008-07-09 10:58         ` Maxim Shchetynin
  0 siblings, 1 reply; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2008-07-09  9:23 UTC (permalink / raw)
  To: Maxim Shchetynin; +Cc: linux-fsdevel, linuxppc-dev, linux-kernel, Arnd Bergmann

On Wed, 2008-07-09 at 11:14 +0200, Maxim Shchetynin wrote:
> On Wed, 09 Jul 2008 18:58:38 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Tue, 2008-07-01 at 16:59 +0200, Arnd Bergmann wrote:
> > > I wouldn't hold up merging the file system for this problem, but
> > > until it is solved, the Kconfig entry should probably have
> > > a "depends on PPC".
> > 
> > Better, use an ifdef for powerpc flags, and #else to pgprot_noncached.
> 
> Thank you Ben. Then, how about this?
> 
> azfs_mmap(struct file *file, struct vm_area_struct *vma)
> {
> ...
> ...
> ...
> #ifdef CONFIG_PPC
> 	pgprot_t page_prot;
> #endif
> ...
> ...
> ...
> #ifdef CONFIG_PPC
> 	page_prot = pgprot_val(vma->vm_page_prot);
> 	page_prot |= (_PAGE_NO_CACHE | _PAGE_RW);
> 	page_prot &= ~_PAGE_GUARDED;
> 	vma->vm_page_prot = __pgprot(page_prot);
> #else
> 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> #endif
> ...

I'd rather do

	pgprot_t  prot;

#ifdef CONFIG_PPC
	prot = <whatever>
#else
	prot = pgprot_noncached(...)
#endif
	vma->vm_page_prot = prot;

To limit the number of ifdef's
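
Spelled out a little further, the single-#ifdef shape above might look like the sketch below. It is a userspace model under stated assumptions, not the azfs code: pgprot_t, pgprot_val(), __pgprot() and pgprot_noncached() are simplified stand-ins for the kernel definitions, the _PAGE_* bit values are placeholders rather than the real powerpc ones, and azfs_mmap_prot() is a hypothetical helper name. The raw value is kept in an unsigned long because pgprot_val() yields the underlying bits rather than a pgprot_t.

```c
/* Simplified stand-ins for the kernel's page protection type and accessors. */
typedef struct { unsigned long pgprot; } pgprot_t;
#define pgprot_val(x)	((x).pgprot)
#define __pgprot(x)	((pgprot_t) { (x) })

/* Placeholder bit values for illustration only. */
#define _PAGE_RW	0x002UL
#define _PAGE_GUARDED	0x008UL
#define _PAGE_NO_CACHE	0x020UL

#ifndef CONFIG_PPC
/* Stand-in for the generic helper: mark the mapping as non-cacheable. */
static pgprot_t pgprot_noncached(pgprot_t prot)
{
	return __pgprot(pgprot_val(prot) | _PAGE_NO_CACHE);
}
#endif

/*
 * Hypothetical helper: compute the protection bits for an azfs mapping.
 * Only the computation differs per architecture; the caller assigns the
 * result to vma->vm_page_prot in one place.
 */
static pgprot_t azfs_mmap_prot(pgprot_t old)
{
	pgprot_t prot;

#ifdef CONFIG_PPC
	unsigned long val = pgprot_val(old);

	val |= _PAGE_NO_CACHE | _PAGE_RW;
	val &= ~_PAGE_GUARDED;
	prot = __pgprot(val);
#else
	prot = pgprot_noncached(old);
#endif

	return prot;
}
```

In azfs_mmap() the call site would then reduce to a single unconditional assignment such as vma->vm_page_prot = azfs_mmap_prot(vma->vm_page_prot);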

Cheers,
Ben.


* Re: AZFS file system proposal
  2008-07-09  9:23       ` Benjamin Herrenschmidt
@ 2008-07-09 10:58         ` Maxim Shchetynin
  0 siblings, 0 replies; 11+ messages in thread
From: Maxim Shchetynin @ 2008-07-09 10:58 UTC (permalink / raw)
  To: linux-fsdevel, linuxppc-dev, linux-kernel; +Cc: Arnd Bergmann

> I'd rather do
>
> 	pgprot_t  prot;
>
> #ifdef CONFIG_PPC
> 	prot = <whatever>
> #else
> 	prot = pgprot_noncached(...)
> #endif
> 	vma->vm_page_prot = prot;

I have changed my patch accordingly. Thank you.

