* [patch 1/9] powerpc/cell/edac: log a syndrome code in case of correctable error
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
@ 2008-07-15 19:51 ` arnd
2008-07-17 5:59 ` Benjamin Herrenschmidt
2008-07-15 19:51 ` [patch 2/9] powerpc/axonram: use only one block device major number arnd
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Maxim Shchetynin
From: Maxim Shchetynin <maxim@de.ibm.com>
If a correctable error occurs, the syndrome code is logged as 0. This patch
lets EDAC log the correct syndrome code to make problem investigation
easier.
Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
drivers/edac/cell_edac.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/edac/cell_edac.c b/drivers/edac/cell_edac.c
index b54112f..0e024fe 100644
--- a/drivers/edac/cell_edac.c
+++ b/drivers/edac/cell_edac.c
@@ -33,7 +33,7 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
{
struct cell_edac_priv *priv = mci->pvt_info;
struct csrow_info *csrow = &mci->csrows[0];
- unsigned long address, pfn, offset;
+ unsigned long address, pfn, offset, syndrome;
dev_dbg(mci->dev, "ECC CE err on node %d, channel %d, ar = 0x%016lx\n",
priv->node, chan, ar);
@@ -44,10 +44,11 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
address = (address << 1) | chan;
pfn = address >> PAGE_SHIFT;
offset = address & ~PAGE_MASK;
+ syndrome = (ar & 0x000000001fe00000ul) >> 21;
/* TODO: Decoding of the error addresss */
edac_mc_handle_ce(mci, csrow->first_page + pfn, offset,
- 0, 0, chan, "");
+ syndrome, 0, chan, "");
}
static void cell_edac_count_ue(struct mem_ctl_info *mci, int chan, u64 ar)
--
1.5.4.3
--
* Re: [patch 1/9] powerpc/cell/edac: log a syndrome code in case of correctable error
2008-07-15 19:51 ` [patch 1/9] powerpc/cell/edac: log a syndrome code in case of correctable error arnd
@ 2008-07-17 5:59 ` Benjamin Herrenschmidt
2008-07-17 18:35 ` Doug Thompson
From: Benjamin Herrenschmidt @ 2008-07-17 5:59 UTC (permalink / raw)
To: arnd, Doug Thompson
Cc: Maxim Shchetynin, linuxppc-dev, cbe-oss-dev, bluesmoke-devel
Arnd, Maxim, please, next time, send that patch or at least CC the
bluesmoke-devel list for EDAC related bits.
Doug, if you are ok with this patch, I'll merge it via the powerpc
tree.
Cheers,
Ben.
On Tue, 2008-07-15 at 21:51 +0200, arnd@arndb.de wrote:
> From: Maxim Shchetynin <maxim@de.ibm.com>
>
> If a correctable error occurs, the syndrome code is logged as 0. This patch
> lets EDAC log the correct syndrome code to make problem investigation
> easier.
>
> Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> drivers/edac/cell_edac.c | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/edac/cell_edac.c b/drivers/edac/cell_edac.c
> index b54112f..0e024fe 100644
> --- a/drivers/edac/cell_edac.c
> +++ b/drivers/edac/cell_edac.c
> @@ -33,7 +33,7 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
> {
> struct cell_edac_priv *priv = mci->pvt_info;
> struct csrow_info *csrow = &mci->csrows[0];
> - unsigned long address, pfn, offset;
> + unsigned long address, pfn, offset, syndrome;
>
> dev_dbg(mci->dev, "ECC CE err on node %d, channel %d, ar = 0x%016lx\n",
> priv->node, chan, ar);
> @@ -44,10 +44,11 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
> address = (address << 1) | chan;
> pfn = address >> PAGE_SHIFT;
> offset = address & ~PAGE_MASK;
> + syndrome = (ar & 0x000000001fe00000ul) >> 21;
>
> /* TODO: Decoding of the error addresss */
> edac_mc_handle_ce(mci, csrow->first_page + pfn, offset,
> - 0, 0, chan, "");
> + syndrome, 0, chan, "");
> }
>
> static void cell_edac_count_ue(struct mem_ctl_info *mci, int chan, u64 ar)
> --
> 1.5.4.3
>
* Re: [patch 1/9] powerpc/cell/edac: log a syndrome code in case of correctable error
2008-07-17 5:59 ` Benjamin Herrenschmidt
@ 2008-07-17 18:35 ` Doug Thompson
From: Doug Thompson @ 2008-07-17 18:35 UTC (permalink / raw)
To: benh, arnd; +Cc: Maxim Shchetynin, linuxppc-dev, cbe-oss-dev, bluesmoke-devel
--- Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> Arnd, Maxim, please, next time, send that patch or at least CC the
> bluesmoke-devel list for EDAC related bits.
>
> Doug, if you are ok with this patch, I'll merge it via the powerpc
fine with me, acked below
doug t
> tree.
>
> Cheers,
> Ben.
>
> On Tue, 2008-07-15 at 21:51 +0200, arnd@arndb.de wrote:
>
> > From: Maxim Shchetynin <maxim@de.ibm.com>
> >
> > If a correctable error occurs, the syndrome code is logged as 0. This patch
> > lets EDAC log the correct syndrome code to make problem investigation
> > easier.
> >
> > Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Doug Thompson <dougthompson@xmission.com>
> > ---
> > drivers/edac/cell_edac.c | 5 +++--
> > 1 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/edac/cell_edac.c b/drivers/edac/cell_edac.c
> > index b54112f..0e024fe 100644
> > --- a/drivers/edac/cell_edac.c
> > +++ b/drivers/edac/cell_edac.c
> > @@ -33,7 +33,7 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
> > {
> > struct cell_edac_priv *priv = mci->pvt_info;
> > struct csrow_info *csrow = &mci->csrows[0];
> > - unsigned long address, pfn, offset;
> > + unsigned long address, pfn, offset, syndrome;
> >
> > dev_dbg(mci->dev, "ECC CE err on node %d, channel %d, ar = 0x%016lx\n",
> > priv->node, chan, ar);
> > @@ -44,10 +44,11 @@ static void cell_edac_count_ce(struct mem_ctl_info *mci, int chan, u64 ar)
> > address = (address << 1) | chan;
> > pfn = address >> PAGE_SHIFT;
> > offset = address & ~PAGE_MASK;
> > + syndrome = (ar & 0x000000001fe00000ul) >> 21;
> >
> > /* TODO: Decoding of the error addresss */
> > edac_mc_handle_ce(mci, csrow->first_page + pfn, offset,
> > - 0, 0, chan, "");
> > + syndrome, 0, chan, "");
> > }
> >
> > static void cell_edac_count_ue(struct mem_ctl_info *mci, int chan, u64 ar)
> > --
> > 1.5.4.3
> >
>
>
W1DUG
* [patch 2/9] powerpc/axonram: use only one block device major number
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
2008-07-15 19:51 ` [patch 1/9] powerpc/cell/edac: log a syndrome code in case of correctable error arnd
@ 2008-07-15 19:51 ` arnd
2008-07-15 19:51 ` [patch 3/9] powerpc/axonram: enable partitioning of the Axons DDR2 DIMMs arnd
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Maxim Shchetynin
From: Maxim Shchetynin <maxim@de.ibm.com>
The axonram module registers one block device for each DDR2 DIMM found
on a system, which means that each DDR2 DIMM occupies its own block device
major number. This patch lets the axonram module register only one
block device major for all DDR2 DIMMs, which also spares kernel resources.
Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
arch/powerpc/sysdev/axonram.c | 23 +++++++++++++++--------
1 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 7f59188..9b639ed 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -57,6 +57,8 @@
#define AXON_RAM_SECTOR_SIZE 1 << AXON_RAM_SECTOR_SHIFT
#define AXON_RAM_IRQ_FLAGS IRQF_SHARED | IRQF_TRIGGER_RISING
+static int azfs_major, azfs_minor;
+
struct axon_ram_bank {
struct of_device *device;
struct gendisk *disk;
@@ -227,19 +229,14 @@ axon_ram_probe(struct of_device *device, const struct of_device_id *device_id)
goto failed;
}
- bank->disk->first_minor = 0;
+ bank->disk->major = azfs_major;
+ bank->disk->first_minor = azfs_minor;
bank->disk->fops = &axon_ram_devops;
bank->disk->private_data = bank;
bank->disk->driverfs_dev = &device->dev;
sprintf(bank->disk->disk_name, "%s%d",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id);
- bank->disk->major = register_blkdev(0, bank->disk->disk_name);
- if (bank->disk->major < 0) {
- dev_err(&device->dev, "Cannot register block device\n");
- rc = -EFAULT;
- goto failed;
- }
bank->disk->queue = blk_alloc_queue(GFP_KERNEL);
if (bank->disk->queue == NULL) {
@@ -276,6 +273,8 @@ axon_ram_probe(struct of_device *device, const struct of_device_id *device_id)
goto failed;
}
+ azfs_minor += bank->disk->minors;
+
return 0;
failed:
@@ -310,7 +309,6 @@ axon_ram_remove(struct of_device *device)
device_remove_file(&device->dev, &dev_attr_ecc);
free_irq(bank->irq_id, device);
- unregister_blkdev(bank->disk->major, bank->disk->disk_name);
del_gendisk(bank->disk);
iounmap((void __iomem *) bank->io_addr);
kfree(bank);
@@ -341,6 +339,14 @@ static struct of_platform_driver axon_ram_driver = {
static int __init
axon_ram_init(void)
{
+ azfs_major = register_blkdev(azfs_major, AXON_RAM_DEVICE_NAME);
+ if (azfs_major < 0) {
+ printk(KERN_ERR "%s cannot become block device major number\n",
+ AXON_RAM_MODULE_NAME);
+ return -EFAULT;
+ }
+ azfs_minor = 0;
+
return of_register_platform_driver(&axon_ram_driver);
}
@@ -351,6 +357,7 @@ static void __exit
axon_ram_exit(void)
{
of_unregister_platform_driver(&axon_ram_driver);
+ unregister_blkdev(azfs_major, AXON_RAM_DEVICE_NAME);
}
module_init(axon_ram_init);
--
1.5.4.3
--
* [patch 3/9] powerpc/axonram: enable partitioning of the Axons DDR2 DIMMs
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
2008-07-15 19:51 ` [patch 1/9] powerpc/cell/edac: log a syndrome code in case of correctable error arnd
2008-07-15 19:51 ` [patch 2/9] powerpc/axonram: use only one block device major number arnd
@ 2008-07-15 19:51 ` arnd
2008-07-15 19:51 ` [patch 4/9] powerpc/cell/cpufreq: add spu aware cpufreq governor arnd
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Maxim Shchetynin
From: Maxim Shchetynin <maxim@de.ibm.com>
DDR2 memory DIMMs on the Axon could previously be accessed only as one
partition by file system drivers that use the direct_access() method.
This patch enables such file system drivers to access the Axon's DDR2 memory
even if it is split into several partitions.
Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
arch/powerpc/sysdev/axonram.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 9b639ed..9e105cb 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -150,7 +150,10 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset;
- offset = sector << AXON_RAM_SECTOR_SHIFT;
+ offset = sector;
+ if (device->bd_part != NULL)
+ offset += device->bd_part->start_sect;
+ offset <<= AXON_RAM_SECTOR_SHIFT;
if (offset >= bank->size) {
dev_err(&bank->device->dev, "Access outside of address space\n");
return -ERANGE;
--
1.5.4.3
--
* [patch 4/9] powerpc/cell/cpufreq: add spu aware cpufreq governor
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
2008-07-15 19:51 ` [patch 3/9] powerpc/axonram: enable partitioning of the Axons DDR2 DIMMs arnd
@ 2008-07-15 19:51 ` arnd
2008-07-15 19:51 ` [patch 5/9] powerpc/cell: cleanup sysreset_hack for IBM cell blades arnd
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Christian Krafft, Jeremy Kerr, Dave Jones
From: Christian Krafft <krafft@de.ibm.com>
This patch adds a cpufreq governor that takes the number of running SPUs
into account. It is very similar to the ondemand governor, but not as complex.
Instead of hacking SPU load into the ondemand governor, it might be easier to
have cpufreq accept multiple governors per CPU in the future.
I don't know if this is the right way, but it would keep the governors simple.
Signed-off-by: Christian Krafft <krafft@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dave Jones <davej@redhat.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
---
arch/powerpc/platforms/cell/Kconfig | 9 +
arch/powerpc/platforms/cell/Makefile | 1 +
arch/powerpc/platforms/cell/cpufreq_spudemand.c | 184 +++++++++++++++++++++++
3 files changed, 194 insertions(+), 0 deletions(-)
create mode 100644 arch/powerpc/platforms/cell/cpufreq_spudemand.c
diff --git a/arch/powerpc/platforms/cell/Kconfig b/arch/powerpc/platforms/cell/Kconfig
index 3959fcf..6ee571f 100644
--- a/arch/powerpc/platforms/cell/Kconfig
+++ b/arch/powerpc/platforms/cell/Kconfig
@@ -107,6 +107,15 @@ config CBE_CPUFREQ_PMI
processor will not only be able to run at lower speed,
but also at lower core voltage.
+config CBE_CPUFREQ_SPU_GOVERNOR
+ tristate "CBE frequency scaling based on SPU usage"
+ depends on SPU_FS && CPU_FREQ
+ default m
+ help
+ This governor checks for spu usage to adjust the cpu frequency.
+ If no spu is running on a given cpu, that cpu will be throttled to
+ the minimal possible frequency.
+
endmenu
config OPROFILE_CELL
diff --git a/arch/powerpc/platforms/cell/Makefile b/arch/powerpc/platforms/cell/Makefile
index c2a7e4e..7fcf09b 100644
--- a/arch/powerpc/platforms/cell/Makefile
+++ b/arch/powerpc/platforms/cell/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_CBE_THERM) += cbe_thermal.o
obj-$(CONFIG_CBE_CPUFREQ_PMI) += cbe_cpufreq_pmi.o
obj-$(CONFIG_CBE_CPUFREQ) += cbe-cpufreq.o
cbe-cpufreq-y += cbe_cpufreq_pervasive.o cbe_cpufreq.o
+obj-$(CONFIG_CBE_CPUFREQ_SPU_GOVERNOR) += cpufreq_spudemand.o
ifeq ($(CONFIG_SMP),y)
obj-$(CONFIG_PPC_CELL_NATIVE) += smp.o
diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
new file mode 100644
index 0000000..a3c6c01
--- /dev/null
+++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -0,0 +1,184 @@
+/*
+ * spu aware cpufreq governor for the cell processor
+ *
+ * © Copyright IBM Corporation 2006-2008
+ *
+ * Author: Christian Krafft <krafft@de.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <asm/atomic.h>
+#include <asm/machdep.h>
+#include <asm/spu.h>
+
+#define POLL_TIME 100000 /* in µs */
+#define EXP 753 /* exp(-1) in fixed-point */
+
+struct spu_gov_info_struct {
+ unsigned long busy_spus; /* fixed-point */
+ struct cpufreq_policy *policy;
+ struct delayed_work work;
+ unsigned int poll_int; /* µs */
+};
+static DEFINE_PER_CPU(struct spu_gov_info_struct, spu_gov_info);
+
+static struct workqueue_struct *kspugov_wq;
+
+static int calc_freq(struct spu_gov_info_struct *info)
+{
+ int cpu;
+ int busy_spus;
+
+ cpu = info->policy->cpu;
+ busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
+
+ CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1);
+ pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n",
+ cpu, busy_spus, info->busy_spus);
+
+ return info->policy->max * info->busy_spus / FIXED_1;
+}
+
+static void spu_gov_work(struct work_struct *work)
+{
+ struct spu_gov_info_struct *info;
+ int delay;
+ unsigned long target_freq;
+
+ info = container_of(work, struct spu_gov_info_struct, work.work);
+
+ /* after cancel_delayed_work_sync we unset info->policy */
+ BUG_ON(info->policy == NULL);
+
+ target_freq = calc_freq(info);
+ __cpufreq_driver_target(info->policy, target_freq, CPUFREQ_RELATION_H);
+
+ delay = usecs_to_jiffies(info->poll_int);
+ queue_delayed_work_on(info->policy->cpu, kspugov_wq, &info->work, delay);
+}
+
+static void spu_gov_init_work(struct spu_gov_info_struct *info)
+{
+ int delay = usecs_to_jiffies(info->poll_int);
+ INIT_DELAYED_WORK_DEFERRABLE(&info->work, spu_gov_work);
+ queue_delayed_work_on(info->policy->cpu, kspugov_wq, &info->work, delay);
+}
+
+static void spu_gov_cancel_work(struct spu_gov_info_struct *info)
+{
+ cancel_delayed_work_sync(&info->work);
+}
+
+static int spu_gov_govern(struct cpufreq_policy *policy, unsigned int event)
+{
+ unsigned int cpu = policy->cpu;
+ struct spu_gov_info_struct *info, *affected_info;
+ int i;
+ int ret = 0;
+
+ info = &per_cpu(spu_gov_info, cpu);
+
+ switch (event) {
+ case CPUFREQ_GOV_START:
+ if (!cpu_online(cpu)) {
+ printk(KERN_ERR "cpu %d is not online\n", cpu);
+ ret = -EINVAL;
+ break;
+ }
+
+ if (!policy->cur) {
+ printk(KERN_ERR "no cpu specified in policy\n");
+ ret = -EINVAL;
+ break;
+ }
+
+ /* initialize spu_gov_info for all affected cpus */
+ for_each_cpu_mask(i, policy->cpus) {
+ affected_info = &per_cpu(spu_gov_info, i);
+ affected_info->policy = policy;
+ }
+
+ info->poll_int = POLL_TIME;
+
+ /* setup timer */
+ spu_gov_init_work(info);
+
+ break;
+
+ case CPUFREQ_GOV_STOP:
+ /* cancel timer */
+ spu_gov_cancel_work(info);
+
+ /* clean spu_gov_info for all affected cpus */
+ for_each_cpu_mask (i, policy->cpus) {
+ info = &per_cpu(spu_gov_info, i);
+ info->policy = NULL;
+ }
+
+ break;
+ }
+
+ return ret;
+}
+
+static struct cpufreq_governor spu_governor = {
+ .name = "spudemand",
+ .governor = spu_gov_govern,
+ .owner = THIS_MODULE,
+};
+
+/*
+ * module init and destroy
+ */
+
+static int __init spu_gov_init(void)
+{
+ int ret;
+
+ kspugov_wq = create_workqueue("kspugov");
+ if (!kspugov_wq) {
+ printk(KERN_ERR "creation of kspugov failed\n");
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = cpufreq_register_governor(&spu_governor);
+ if (ret) {
+ printk(KERN_ERR "registration of governor failed\n");
+ destroy_workqueue(kspugov_wq);
+ goto out;
+ }
+out:
+ return ret;
+}
+
+static void __exit spu_gov_exit(void)
+{
+ cpufreq_unregister_governor(&spu_governor);
+ destroy_workqueue(kspugov_wq);
+}
+
+
+module_init(spu_gov_init);
+module_exit(spu_gov_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Christian Krafft <krafft@de.ibm.com>");
+
--
1.5.4.3
--
* [patch 5/9] powerpc/cell: cleanup sysreset_hack for IBM cell blades
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
2008-07-15 19:51 ` [patch 4/9] powerpc/cell/cpufreq: add spu aware cpufreq governor arnd
@ 2008-07-15 19:51 ` arnd
2008-07-15 19:51 ` [patch 6/9] powerpc/cell: add support for power button of future " arnd
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Christian Krafft
From: Christian Krafft <krafft@de.ibm.com>
This patch adds a config option for the sysreset_hack used for
IBM Cell blades. The code is moved from pervasive.c into ras.c and
gets its own init method.
Signed-off-by: Christian Krafft <krafft@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
arch/powerpc/platforms/cell/Kconfig | 8 +++++
arch/powerpc/platforms/cell/pervasive.c | 27 +-----------------
arch/powerpc/platforms/cell/pervasive.h | 9 ++++++
arch/powerpc/platforms/cell/ras.c | 46 +++++++++++++++++++++++++++++++
4 files changed, 64 insertions(+), 26 deletions(-)
diff --git a/arch/powerpc/platforms/cell/Kconfig b/arch/powerpc/platforms/cell/Kconfig
index 6ee571f..2d1957b 100644
--- a/arch/powerpc/platforms/cell/Kconfig
+++ b/arch/powerpc/platforms/cell/Kconfig
@@ -83,6 +83,14 @@ config CBE_RAS
depends on PPC_CELL_NATIVE
default y
+config PPC_IBM_CELL_RESETBUTTON
+ bool "IBM Cell Blade Pinhole reset button"
+ depends on CBE_RAS && PPC_IBM_CELL_BLADE
+ default y
+ help
+ Support Pinhole Resetbutton on IBM Cell blades.
+ This adds a method to trigger system reset via front panel pinhole button.
+
config CBE_THERM
tristate "CBE thermal support"
default m
diff --git a/arch/powerpc/platforms/cell/pervasive.c b/arch/powerpc/platforms/cell/pervasive.c
index 8a3631c..efdacc8 100644
--- a/arch/powerpc/platforms/cell/pervasive.c
+++ b/arch/powerpc/platforms/cell/pervasive.c
@@ -38,8 +38,6 @@
#include "pervasive.h"
-static int sysreset_hack;
-
static void cbe_power_save(void)
{
unsigned long ctrl, thread_switch_control;
@@ -87,9 +85,6 @@ static void cbe_power_save(void)
static int cbe_system_reset_exception(struct pt_regs *regs)
{
- int cpu;
- struct cbe_pmd_regs __iomem *pmd;
-
switch (regs->msr & SRR1_WAKEMASK) {
case SRR1_WAKEEE:
do_IRQ(regs);
@@ -98,19 +93,7 @@ static int cbe_system_reset_exception(struct pt_regs *regs)
timer_interrupt(regs);
break;
case SRR1_WAKEMT:
- /*
- * The BMC can inject user triggered system reset exceptions,
- * but cannot set the system reset reason in srr1,
- * so check an extra register here.
- */
- if (sysreset_hack && (cpu = smp_processor_id()) == 0) {
- pmd = cbe_get_cpu_pmd_regs(cpu);
- if (in_be64(&pmd->ras_esc_0) & 0xffff) {
- out_be64(&pmd->ras_esc_0, 0);
- return 0;
- }
- }
- break;
+ return cbe_sysreset_hack();
#ifdef CONFIG_CBE_RAS
case SRR1_WAKESYSERR:
cbe_system_error_exception(regs);
@@ -134,8 +117,6 @@ void __init cbe_pervasive_init(void)
if (!cpu_has_feature(CPU_FTR_PAUSE_ZERO))
return;
- sysreset_hack = machine_is_compatible("IBM,CBPLUS-1.0");
-
for_each_possible_cpu(cpu) {
struct cbe_pmd_regs __iomem *regs = cbe_get_cpu_pmd_regs(cpu);
if (!regs)
@@ -144,12 +125,6 @@ void __init cbe_pervasive_init(void)
/* Enable Pause(0) control bit */
out_be64(&regs->pmcr, in_be64(&regs->pmcr) |
CBE_PMD_PAUSE_ZERO_CONTROL);
-
- /* Enable JTAG system-reset hack */
- if (sysreset_hack)
- out_be32(&regs->fir_mode_reg,
- in_be32(&regs->fir_mode_reg) |
- CBE_PMD_FIR_MODE_M8);
}
ppc_md.power_save = cbe_power_save;
diff --git a/arch/powerpc/platforms/cell/pervasive.h b/arch/powerpc/platforms/cell/pervasive.h
index 7b50947..fd4d7b7 100644
--- a/arch/powerpc/platforms/cell/pervasive.h
+++ b/arch/powerpc/platforms/cell/pervasive.h
@@ -30,4 +30,13 @@ extern void cbe_system_error_exception(struct pt_regs *regs);
extern void cbe_maintenance_exception(struct pt_regs *regs);
extern void cbe_thermal_exception(struct pt_regs *regs);
+#ifdef CONFIG_PPC_IBM_CELL_RESETBUTTON
+extern int cbe_sysreset_hack(void);
+#else
+static inline int cbe_sysreset_hack(void)
+{
+ return 1;
+}
+#endif /* CONFIG_PPC_IBM_CELL_RESETBUTTON */
+
#endif
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 505f9b9..2a14b05 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -236,6 +236,52 @@ static struct notifier_block cbe_ptcal_reboot_notifier = {
.notifier_call = cbe_ptcal_notify_reboot
};
+#ifdef CONFIG_PPC_IBM_CELL_RESETBUTTON
+static int sysreset_hack;
+
+static int __init cbe_sysreset_init(void)
+{
+ struct cbe_pmd_regs __iomem *regs;
+
+ sysreset_hack = machine_is_compatible("IBM,CBPLUS-1.0");
+ if (!sysreset_hack)
+ return 0;
+
+ regs = cbe_get_cpu_pmd_regs(0);
+ if (!regs)
+ return 0;
+
+ /* Enable JTAG system-reset hack */
+ out_be32(&regs->fir_mode_reg,
+ in_be32(&regs->fir_mode_reg) |
+ CBE_PMD_FIR_MODE_M8);
+
+ return 0;
+}
+device_initcall(cbe_sysreset_init);
+
+int cbe_sysreset_hack(void)
+{
+ struct cbe_pmd_regs __iomem *regs;
+
+ /*
+ * The BMC can inject user triggered system reset exceptions,
+ * but cannot set the system reset reason in srr1,
+ * so check an extra register here.
+ */
+ if (sysreset_hack && (smp_processor_id() == 0)) {
+ regs = cbe_get_cpu_pmd_regs(0);
+ if (!regs)
+ return 0;
+ if (in_be64(&regs->ras_esc_0) & 0x0000ffff) {
+ out_be64(&regs->ras_esc_0, 0);
+ return 0;
+ }
+ }
+ return 1;
+}
+#endif /* CONFIG_PPC_IBM_CELL_RESETBUTTON */
+
int __init cbe_ptcal_init(void)
{
int ret;
--
1.5.4.3
--
* [patch 6/9] powerpc/cell: add support for power button of future IBM cell blades
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
2008-07-15 19:51 ` [patch 5/9] powerpc/cell: cleanup sysreset_hack for IBM cell blades arnd
@ 2008-07-15 19:51 ` arnd
2008-07-15 19:51 ` [patch 7/9] azfs: initial submit of azfs, a non-buffered filesystem arnd
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Christian Krafft
From: Christian Krafft <krafft@de.ibm.com>
This patch adds support for the power button on future IBM Cell blades.
It does not actually shut down the machine. Instead it exposes an
input device /dev/input/event0 to userspace which sends KEY_POWER
when the power button has been pressed.
haldaemon recognizes the button, so a platform-independent acpid
replacement should handle it correctly.
Signed-off-by: Christian Krafft <krafft@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
arch/powerpc/platforms/cell/Kconfig | 8 ++
arch/powerpc/platforms/cell/Makefile | 2 +
arch/powerpc/platforms/cell/cbe_powerbutton.c | 117 +++++++++++++++++++++++++
include/asm-powerpc/pmi.h | 1 +
4 files changed, 128 insertions(+), 0 deletions(-)
create mode 100644 arch/powerpc/platforms/cell/cbe_powerbutton.c
diff --git a/arch/powerpc/platforms/cell/Kconfig b/arch/powerpc/platforms/cell/Kconfig
index 2d1957b..c14d7d8 100644
--- a/arch/powerpc/platforms/cell/Kconfig
+++ b/arch/powerpc/platforms/cell/Kconfig
@@ -91,6 +91,14 @@ config PPC_IBM_CELL_RESETBUTTON
Support Pinhole Resetbutton on IBM Cell blades.
This adds a method to trigger system reset via front panel pinhole button.
+config PPC_IBM_CELL_POWERBUTTON
+ tristate "IBM Cell Blade power button"
+ depends on PPC_IBM_CELL_BLADE && PPC_PMI && INPUT_EVDEV
+ default y
+ help
+ Support Powerbutton on IBM Cell blades.
+ This will enable the powerbutton as an input device.
+
config CBE_THERM
tristate "CBE thermal support"
default m
diff --git a/arch/powerpc/platforms/cell/Makefile b/arch/powerpc/platforms/cell/Makefile
index 7fcf09b..7fd8308 100644
--- a/arch/powerpc/platforms/cell/Makefile
+++ b/arch/powerpc/platforms/cell/Makefile
@@ -10,6 +10,8 @@ obj-$(CONFIG_CBE_CPUFREQ) += cbe-cpufreq.o
cbe-cpufreq-y += cbe_cpufreq_pervasive.o cbe_cpufreq.o
obj-$(CONFIG_CBE_CPUFREQ_SPU_GOVERNOR) += cpufreq_spudemand.o
+obj-$(CONFIG_PPC_IBM_CELL_POWERBUTTON) += cbe_powerbutton.o
+
ifeq ($(CONFIG_SMP),y)
obj-$(CONFIG_PPC_CELL_NATIVE) += smp.o
endif
diff --git a/arch/powerpc/platforms/cell/cbe_powerbutton.c b/arch/powerpc/platforms/cell/cbe_powerbutton.c
new file mode 100644
index 0000000..dcddaa5
--- /dev/null
+++ b/arch/powerpc/platforms/cell/cbe_powerbutton.c
@@ -0,0 +1,117 @@
+/*
+ * driver for powerbutton on IBM cell blades
+ *
+ * (C) Copyright IBM Corp. 2005-2008
+ *
+ * Author: Christian Krafft <krafft@de.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/input.h>
+#include <linux/platform_device.h>
+#include <asm/pmi.h>
+#include <asm/prom.h>
+
+static struct input_dev *button_dev;
+static struct platform_device *button_pdev;
+
+static void cbe_powerbutton_handle_pmi(pmi_message_t pmi_msg)
+{
+ BUG_ON(pmi_msg.type != PMI_TYPE_POWER_BUTTON);
+
+ input_report_key(button_dev, KEY_POWER, 1);
+ input_sync(button_dev);
+ input_report_key(button_dev, KEY_POWER, 0);
+ input_sync(button_dev);
+}
+
+static struct pmi_handler cbe_pmi_handler = {
+ .type = PMI_TYPE_POWER_BUTTON,
+ .handle_pmi_message = cbe_powerbutton_handle_pmi,
+};
+
+static int __init cbe_powerbutton_init(void)
+{
+ int ret = 0;
+ struct input_dev *dev;
+
+ if (!machine_is_compatible("IBM,CBPLUS-1.0")) {
+ printk(KERN_ERR "%s: Not a cell blade.\n", __func__);
+ ret = -ENODEV;
+ goto out;
+ }
+
+ dev = input_allocate_device();
+ if (!dev) {
+ ret = -ENOMEM;
+ printk(KERN_ERR "%s: Not enough memory.\n", __func__);
+ goto out;
+ }
+
+ set_bit(EV_KEY, dev->evbit);
+ set_bit(KEY_POWER, dev->keybit);
+
+ dev->name = "Power Button";
+ dev->id.bustype = BUS_HOST;
+
+ /* this makes the button look like an acpi power button
+ * no clue whether anyone relies on that though */
+ dev->id.product = 0x02;
+ dev->phys = "LNXPWRBN/button/input0";
+
+ button_pdev = platform_device_register_simple("power_button", 0, NULL, 0);
+ if (IS_ERR(button_pdev)) {
+ ret = PTR_ERR(button_pdev);
+ goto out_free_input;
+ }
+
+ dev->dev.parent = &button_pdev->dev;
+ ret = input_register_device(dev);
+ if (ret) {
+ printk(KERN_ERR "%s: Failed to register device\n", __func__);
+ goto out_free_pdev;
+ }
+
+ button_dev = dev;
+
+ ret = pmi_register_handler(&cbe_pmi_handler);
+ if (ret) {
+ printk(KERN_ERR "%s: Failed to register with pmi.\n", __func__);
+ goto out_free_pdev;
+ }
+
+ goto out;
+
+out_free_pdev:
+ platform_device_unregister(button_pdev);
+out_free_input:
+ input_free_device(dev);
+out:
+ return ret;
+}
+
+static void __exit cbe_powerbutton_exit(void)
+{
+ pmi_unregister_handler(&cbe_pmi_handler);
+ platform_device_unregister(button_pdev);
+ input_free_device(button_dev);
+}
+
+module_init(cbe_powerbutton_init);
+module_exit(cbe_powerbutton_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Christian Krafft <krafft@de.ibm.com>");
diff --git a/include/asm-powerpc/pmi.h b/include/asm-powerpc/pmi.h
index e1dc090..b4e91fb 100644
--- a/include/asm-powerpc/pmi.h
+++ b/include/asm-powerpc/pmi.h
@@ -30,6 +30,7 @@
#ifdef __KERNEL__
#define PMI_TYPE_FREQ_CHANGE 0x01
+#define PMI_TYPE_POWER_BUTTON 0x02
#define PMI_READ_TYPE 0
#define PMI_READ_DATA0 1
#define PMI_READ_DATA1 2
--
1.5.4.3
--
* [patch 7/9] azfs: initial submit of azfs, a non-buffered filesystem
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
2008-07-15 19:51 ` [patch 6/9] powerpc/cell: add support for power button of future " arnd
@ 2008-07-15 19:51 ` arnd
2008-07-17 6:13 ` Benjamin Herrenschmidt
2008-07-22 9:49 ` [Cbe-oss-dev] " Christoph Hellwig
2008-07-15 19:51 ` [patch 8/9] powerpc/dma: use the struct dma_attrs in iommu code arnd
2008-07-15 19:51 ` [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code arnd
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Maxim Shchetynin
From: Maxim Shchetynin <maxim@linux.vnet.ibm.com>
AZFS is a file system which keeps all files on memory mapped random
access storage. It was designed to work on the axonram device driver
for IBM QS2x blade servers, but can operate on any block device
that exports a direct_access method.
Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Documentation/filesystems/azfs.txt | 22 +
arch/powerpc/configs/cell_defconfig | 1 +
fs/Kconfig | 15 +
fs/Makefile | 1 +
fs/azfs/Makefile | 7 +
fs/azfs/inode.c | 1184 +++++++++++++++++++++++++++++++++++
6 files changed, 1230 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/azfs.txt
create mode 100644 fs/azfs/Makefile
create mode 100644 fs/azfs/inode.c
diff --git a/Documentation/filesystems/azfs.txt b/Documentation/filesystems/azfs.txt
new file mode 100644
index 0000000..c4bf659
--- /dev/null
+++ b/Documentation/filesystems/azfs.txt
@@ -0,0 +1,22 @@
+AZFS is a file system which keeps all files on memory backed random
+access storage. It was designed to work on the axonram device driver
+for IBM QS2x blade servers, but can operate on any block device
+that exports a direct_access method.
+
+Everything in AZFS is temporary in the sense that all the data stored
+therein is lost when you switch off or reboot the system. If you unmount
+an AZFS instance, the data is kept on the device as long as your system
+is not shut down or rebooted. You can later mount AZFS from the device
+again to regain access to your files.
+
+AZFS uses the block device only for file data, not for file information.
+All inodes (file and directory information) are kept in RAM.
+
+When you mount AZFS you can specify a file system block size with the
+'-o bs=<size in bytes>' option. There is no software limit on the block
+size, but you will not be able to mmap files on AZFS if the block size is
+smaller than the system page size. If no '-o bs' option is given on mount,
+the block size of the underlying block device is used as the default for AZFS.
+
+Other available mount options for AZFS are '-o uid=<id>' and '-o gid=<id>',
+which allow you to set the owner and group of the root of the file system.
diff --git a/arch/powerpc/configs/cell_defconfig b/arch/powerpc/configs/cell_defconfig
index c420e47..235a0c8 100644
--- a/arch/powerpc/configs/cell_defconfig
+++ b/arch/powerpc/configs/cell_defconfig
@@ -240,6 +240,7 @@ CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
# CPU Frequency drivers
#
CONFIG_AXON_RAM=m
+CONFIG_AZ_FS=m
# CONFIG_FSL_ULI1575 is not set
#
diff --git a/fs/Kconfig b/fs/Kconfig
index 2694648..2d4e42b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1017,6 +1017,21 @@ config HUGETLBFS
config HUGETLB_PAGE
def_bool HUGETLBFS
+config AZ_FS
+ tristate "AZFS filesystem support"
+ help
+ azfs is a file system for I/O-attached memory backing. It requires
+ a block device with direct_access capability, e.g. axonram.
+ Mounting such a device with azfs gives user space memory-mapped
+ access to the underlying memory.
+
+ Read <file:Documentation/filesystems/azfs.txt> for details.
+
+ To compile this file system support as a module, choose M here: the
+ module will be called azfs.
+
+ If unsure, say N.
+
config CONFIGFS_FS
tristate "Userspace-driven configuration filesystem"
depends on SYSFS
diff --git a/fs/Makefile b/fs/Makefile
index 1e7a11b..20e3253 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -119,3 +119,4 @@ obj-$(CONFIG_HPPFS) += hppfs/
obj-$(CONFIG_DEBUG_FS) += debugfs/
obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_GFS2_FS) += gfs2/
+obj-$(CONFIG_AZ_FS) += azfs/
diff --git a/fs/azfs/Makefile b/fs/azfs/Makefile
new file mode 100644
index 0000000..ff04d41
--- /dev/null
+++ b/fs/azfs/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for azfs routines
+#
+
+obj-$(CONFIG_AZ_FS) += azfs.o
+
+azfs-y := inode.o
diff --git a/fs/azfs/inode.c b/fs/azfs/inode.c
new file mode 100644
index 0000000..00dc2af
--- /dev/null
+++ b/fs/azfs/inode.c
@@ -0,0 +1,1184 @@
+/*
+ * (C) Copyright IBM Deutschland Entwicklung GmbH 2007
+ *
+ * Author: Maxim Shchetynin <maxim@de.ibm.com>
+ *
+ * Non-buffered filesystem driver.
+ * It registers a filesystem which may be used with all kinds of block
+ * devices that have a direct_access() method in block_device_operations.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
+#include <linux/cache.h>
+#include <linux/dcache.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/kernel.h>
+#include <linux/limits.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/mutex.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
+#include <linux/parser.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/stat.h>
+#include <linux/statfs.h>
+#include <linux/string.h>
+#include <linux/time.h>
+#include <linux/types.h>
+#include <linux/aio.h>
+#include <linux/uio.h>
+#include <asm/bug.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/string.h>
+
+#define AZFS_FILESYSTEM_NAME "azfs"
+#define AZFS_FILESYSTEM_FLAGS FS_REQUIRES_DEV
+
+#define AZFS_SUPERBLOCK_MAGIC 0xABBA1972
+#define AZFS_SUPERBLOCK_FLAGS MS_SYNCHRONOUS | \
+ MS_DIRSYNC | \
+ MS_ACTIVE
+
+#define AZFS_BDI_CAPABILITIES BDI_CAP_NO_ACCT_DIRTY | \
+ BDI_CAP_NO_WRITEBACK | \
+ BDI_CAP_MAP_COPY | \
+ BDI_CAP_MAP_DIRECT | \
+ BDI_CAP_VMFLAGS
+
+#define AZFS_CACHE_FLAGS SLAB_HWCACHE_ALIGN | \
+ SLAB_RECLAIM_ACCOUNT | \
+ SLAB_MEM_SPREAD
+
+struct azfs_super {
+ struct list_head list;
+ unsigned long media_size;
+ unsigned long block_size;
+ unsigned short block_shift;
+ unsigned long sector_size;
+ unsigned short sector_shift;
+ uid_t uid;
+ gid_t gid;
+ unsigned long ph_addr;
+ unsigned long io_addr;
+ struct block_device *blkdev;
+ struct dentry *root;
+ struct list_head block_list;
+ rwlock_t lock;
+};
+
+struct azfs_super_list {
+ struct list_head head;
+ spinlock_t lock;
+};
+
+struct azfs_block {
+ struct list_head list;
+ unsigned long id;
+ unsigned long count;
+};
+
+struct azfs_znode {
+ struct list_head block_list;
+ rwlock_t lock;
+ loff_t size;
+ struct inode vfs_inode;
+};
+
+static struct azfs_super_list super_list;
+static struct kmem_cache *azfs_znode_cache __read_mostly = NULL;
+static struct kmem_cache *azfs_block_cache __read_mostly = NULL;
+
+#define I2S(inode) \
+ inode->i_sb->s_fs_info
+#define I2Z(inode) \
+ container_of(inode, struct azfs_znode, vfs_inode)
+
+#define for_each_block(block, block_list) \
+ list_for_each_entry(block, block_list, list)
+#define for_each_block_reverse(block, block_list) \
+ list_for_each_entry_reverse(block, block_list, list)
+#define for_each_block_safe(block, temp, block_list) \
+ list_for_each_entry_safe(block, temp, block_list, list)
+#define for_each_block_safe_reverse(block, temp, block_list) \
+ list_for_each_entry_safe_reverse(block, temp, block_list, list)
+
+/**
+ * azfs_block_init - create and initialise a new block in a list
+ * @block_list: destination list
+ * @id: block id
+ * @count: size of a block
+ */
+static inline struct azfs_block*
+azfs_block_init(struct list_head *block_list,
+ unsigned long id, unsigned long count)
+{
+ struct azfs_block *block;
+
+ block = kmem_cache_alloc(azfs_block_cache, GFP_KERNEL);
+ if (!block)
+ return NULL;
+
+ block->id = id;
+ block->count = count;
+
+ INIT_LIST_HEAD(&block->list);
+ list_add_tail(&block->list, block_list);
+
+ return block;
+}
+
+/**
+ * azfs_block_free - remove block from a list and free it back in cache
+ * @block: block to be removed
+ */
+static inline void
+azfs_block_free(struct azfs_block *block)
+{
+ list_del(&block->list);
+ kmem_cache_free(azfs_block_cache, block);
+}
+
+/**
+ * azfs_block_move - move block to another list
+ * @block: block to be moved
+ * @block_list: destination list
+ */
+static inline void
+azfs_block_move(struct azfs_block *block, struct list_head *block_list)
+{
+ list_move_tail(&block->list, block_list);
+}
+
+/**
+ * azfs_block_find - get a block id of a part of a file
+ * @inode: inode
+ * @from: offset for read/write operation
+ * @size: pointer to a value of the amount of data to be read/written
+ */
+static unsigned long
+azfs_block_find(struct inode *inode, unsigned long from, unsigned long *size)
+{
+ struct azfs_super *super;
+ struct azfs_znode *znode;
+ struct azfs_block *block;
+ unsigned long block_id, west, east;
+
+ super = I2S(inode);
+ znode = I2Z(inode);
+
+ read_lock(&znode->lock);
+
+ while (from + *size > znode->size) {
+ read_unlock(&znode->lock);
+ i_size_write(inode, from + *size);
+ inode->i_op->truncate(inode);
+ read_lock(&znode->lock);
+ }
+
+ if (list_empty(&znode->block_list)) {
+ read_unlock(&znode->lock);
+ *size = 0;
+ return 0;
+ }
+
+ block_id = from >> super->block_shift;
+
+ for_each_block(block, &znode->block_list) {
+ if (block->count > block_id)
+ break;
+ block_id -= block->count;
+ }
+
+ west = from % super->block_size;
+ east = ((block->count - block_id) << super->block_shift) - west;
+
+ if (*size > east)
+ *size = east;
+
+ block_id = ((block->id + block_id) << super->block_shift) + west;
+
+ read_unlock(&znode->lock);
+
+ return block_id;
+}
+
+static struct inode*
+azfs_new_inode(struct super_block *, struct inode *, int, dev_t);
+
+/**
+ * azfs_mknod - mknod() method for inode_operations
+ * @dir, @dentry, @mode, @dev: see inode_operations methods
+ */
+static int
+azfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
+{
+ struct inode *inode;
+
+ inode = azfs_new_inode(dir->i_sb, dir, mode, dev);
+ if (!inode)
+ return -ENOSPC;
+
+ if (S_ISREG(mode))
+ I2Z(inode)->size = 0;
+
+ dget(dentry);
+ d_instantiate(dentry, inode);
+
+ return 0;
+}
+
+/**
+ * azfs_create - create() method for inode_operations
+ * @dir, @dentry, @mode, @nd: see inode_operations methods
+ */
+static int
+azfs_create(struct inode *dir, struct dentry *dentry, int mode,
+ struct nameidata *nd)
+{
+ return azfs_mknod(dir, dentry, mode | S_IFREG, 0);
+}
+
+/**
+ * azfs_mkdir - mkdir() method for inode_operations
+ * @dir, @dentry, @mode: see inode_operations methods
+ */
+static int
+azfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+ int rc;
+
+ rc = azfs_mknod(dir, dentry, mode | S_IFDIR, 0);
+ if (!rc)
+ inc_nlink(dir);
+
+ return rc;
+}
+
+/**
+ * azfs_symlink - symlink() method for inode_operations
+ * @dir, @dentry, @name: see inode_operations methods
+ */
+static int
+azfs_symlink(struct inode *dir, struct dentry *dentry, const char *name)
+{
+ struct inode *inode;
+ int rc;
+
+ inode = azfs_new_inode(dir->i_sb, dir, S_IFLNK | S_IRWXUGO, 0);
+ if (!inode)
+ return -ENOSPC;
+
+ rc = page_symlink(inode, name, strlen(name) + 1);
+ if (rc) {
+ iput(inode);
+ return rc;
+ }
+
+ dget(dentry);
+ d_instantiate(dentry, inode);
+
+ return 0;
+}
+
+/**
+ * azfs_aio_read - aio_read() method for file_operations
+ * @iocb, @iov, @nr_segs, @pos: see file_operations methods
+ */
+static ssize_t
+azfs_aio_read(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct azfs_super *super;
+ struct inode *inode;
+ void *target;
+ unsigned long pin;
+ unsigned long size, todo, step;
+ ssize_t rc;
+
+ inode = iocb->ki_filp->f_mapping->host;
+ super = I2S(inode);
+
+ mutex_lock(&inode->i_mutex);
+
+ if (pos >= i_size_read(inode)) {
+ rc = 0;
+ goto out;
+ }
+
+ target = iov->iov_base;
+ todo = min((loff_t) iov->iov_len, i_size_read(inode) - pos);
+
+ for (step = todo; step; step -= size) {
+ size = step;
+ pin = azfs_block_find(inode, pos, &size);
+ if (!size) {
+ rc = -ENOSPC;
+ goto out;
+ }
+ pin += super->io_addr;
+ /*
+ * FIXME: pin is actually an __iomem pointer, is
+ * that safe? -arnd
+ */
+ if (copy_to_user(target, (void*) pin, size)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ iocb->ki_pos += size;
+ pos += size;
+ target += size;
+ }
+
+ rc = todo;
+
+out:
+ mutex_unlock(&inode->i_mutex);
+
+ return rc;
+}
+
+/**
+ * azfs_aio_write - aio_write() method for file_operations
+ * @iocb, @iov, @nr_segs, @pos: see file_operations methods
+ */
+static ssize_t
+azfs_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct azfs_super *super;
+ struct inode *inode;
+ void *source;
+ unsigned long pin;
+ unsigned long size, todo, step;
+ ssize_t rc;
+
+ inode = iocb->ki_filp->f_mapping->host;
+ super = I2S(inode);
+
+ source = iov->iov_base;
+ todo = iov->iov_len;
+
+ mutex_lock(&inode->i_mutex);
+
+ for (step = todo; step; step -= size) {
+ size = step;
+ pin = azfs_block_find(inode, pos, &size);
+ if (!size) {
+ rc = -ENOSPC;
+ goto out;
+ }
+ pin += super->io_addr;
+ /*
+ * FIXME: pin is actually an __iomem pointer, is
+ * that safe? -arnd
+ */
+ if (copy_from_user((void*) pin, source, size)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ iocb->ki_pos += size;
+ pos += size;
+ source += size;
+ }
+
+ rc = todo;
+
+out:
+ mutex_unlock(&inode->i_mutex);
+
+ return rc;
+}
+
+/**
+ * azfs_open - open() method for file_operations
+ * @inode, @file: see file_operations methods
+ */
+static int
+azfs_open(struct inode *inode, struct file *file)
+{
+ if (file->f_flags & O_TRUNC) {
+ i_size_write(inode, 0);
+ inode->i_op->truncate(inode);
+ }
+ if (file->f_flags & O_APPEND)
+ inode->i_fop->llseek(file, 0, SEEK_END);
+
+ return 0;
+}
+
+/**
+ * azfs_mmap - mmap() method for file_operations
+ * @file, @vm: see file_operations methods
+ */
+static int
+azfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct azfs_super *super;
+ struct azfs_znode *znode;
+ struct inode *inode;
+ unsigned long cursor, pin;
+ unsigned long todo, size, vm_start;
+ pgprot_t page_prot;
+
+ inode = file->f_dentry->d_inode;
+ znode = I2Z(inode);
+ super = I2S(inode);
+
+ if (super->block_size < PAGE_SIZE)
+ return -EINVAL;
+
+ cursor = vma->vm_pgoff << super->block_shift;
+ todo = vma->vm_end - vma->vm_start;
+
+ if (cursor + todo > i_size_read(inode))
+ return -EINVAL;
+
+ page_prot = pgprot_val(vma->vm_page_prot);
+#ifdef CONFIG_PPC
+ page_prot |= (_PAGE_NO_CACHE | _PAGE_RW);
+ page_prot &= ~_PAGE_GUARDED;
+#else
+ page_prot = pgprot_noncached(page_prot);
+#endif
+ vma->vm_page_prot = __pgprot(page_prot);
+
+ vm_start = vma->vm_start;
+ for (size = todo; todo; todo -= size, size = todo) {
+ pin = azfs_block_find(inode, cursor, &size);
+ if (!size)
+ return -EAGAIN;
+ pin += super->ph_addr;
+ pin >>= PAGE_SHIFT;
+ if (remap_pfn_range(vma, vm_start, pin, size, vma->vm_page_prot))
+ return -EAGAIN;
+
+ vm_start += size;
+ cursor += size;
+ }
+
+ return 0;
+}
+
+/**
+ * azfs_truncate - truncate() method for inode_operations
+ * @inode: see inode_operations methods
+ */
+static void
+azfs_truncate(struct inode *inode)
+{
+ struct azfs_super *super;
+ struct azfs_znode *znode;
+ struct azfs_block *block, *tmp_block, *temp, *west, *east;
+ unsigned long id, count;
+ signed long delta;
+
+ super = I2S(inode);
+ znode = I2Z(inode);
+
+ delta = i_size_read(inode) + (super->block_size - 1);
+ delta >>= super->block_shift;
+ delta -= inode->i_blocks;
+
+ if (delta == 0) {
+ znode->size = i_size_read(inode);
+ return;
+ }
+
+ write_lock(&znode->lock);
+
+ while (delta > 0) {
+ west = east = NULL;
+
+ write_lock(&super->lock);
+
+ if (list_empty(&super->block_list)) {
+ write_unlock(&super->lock);
+ break;
+ }
+
+ for (count = delta; count; count--) {
+ for_each_block(block, &super->block_list)
+ if (block->count >= count) {
+ east = block;
+ break;
+ }
+ if (east)
+ break;
+ }
+
+ for_each_block_reverse(block, &znode->block_list) {
+ if (block->id + block->count == east->id)
+ west = block;
+ break;
+ }
+
+ if (east->count == count) {
+ if (west) {
+ west->count += east->count;
+ azfs_block_free(east);
+ } else {
+ azfs_block_move(east, &znode->block_list);
+ }
+ } else {
+ if (west) {
+ west->count += count;
+ } else {
+ if (!azfs_block_init(&znode->block_list,
+ east->id, count)) {
+ write_unlock(&super->lock);
+ break;
+ }
+ }
+
+ east->id += count;
+ east->count -= count;
+ }
+
+ write_unlock(&super->lock);
+
+ inode->i_blocks += count;
+
+ delta -= count;
+ }
+
+ while (delta < 0) {
+ for_each_block_safe_reverse(block, tmp_block, &znode->block_list) {
+ id = block->id;
+ count = block->count;
+ if ((signed long) count + delta > 0) {
+ block->count += delta;
+ id += block->count;
+ count -= block->count;
+ block = NULL;
+ }
+
+ west = east = NULL;
+
+ write_lock(&super->lock);
+
+ for_each_block(temp, &super->block_list) {
+ if (!west && (temp->id + temp->count == id))
+ west = temp;
+ else if (!east && (id + count == temp->id))
+ east = temp;
+ if (west && east)
+ break;
+ }
+
+ if (west && east) {
+ west->count += count + east->count;
+ azfs_block_free(east);
+ if (block)
+ azfs_block_free(block);
+ } else if (west) {
+ west->count += count;
+ if (block)
+ azfs_block_free(block);
+ } else if (east) {
+ east->id -= count;
+ east->count += count;
+ if (block)
+ azfs_block_free(block);
+ } else {
+ if (!block) {
+ if (!azfs_block_init(&super->block_list,
+ id, count)) {
+ write_unlock(&super->lock);
+ break;
+ }
+ } else {
+ azfs_block_move(block, &super->block_list);
+ }
+ }
+
+ write_unlock(&super->lock);
+
+ inode->i_blocks -= count;
+
+ delta += count;
+
+ break;
+ }
+ }
+
+ write_unlock(&znode->lock);
+
+ znode->size = min(i_size_read(inode),
+ (loff_t) inode->i_blocks << super->block_shift);
+}
+
+/**
+ * azfs_getattr - getattr() method for inode_operations
+ * @mnt, @dentry, @stat: see inode_operations methods
+ */
+static int
+azfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+ struct azfs_super *super;
+ struct inode *inode;
+ unsigned short shift;
+
+ inode = dentry->d_inode;
+ super = I2S(inode);
+
+ generic_fillattr(inode, stat);
+ stat->blocks = inode->i_blocks;
+ shift = super->block_shift - super->sector_shift;
+ if (shift)
+ stat->blocks <<= shift;
+
+ return 0;
+}
+
+static const struct address_space_operations azfs_aops = {
+ .write_begin = simple_write_begin,
+ .write_end = simple_write_end
+};
+
+static struct backing_dev_info azfs_bdi = {
+ .ra_pages = 0,
+ .capabilities = AZFS_BDI_CAPABILITIES
+};
+
+static struct inode_operations azfs_dir_iops = {
+ .create = azfs_create,
+ .lookup = simple_lookup,
+ .link = simple_link,
+ .unlink = simple_unlink,
+ .symlink = azfs_symlink,
+ .mkdir = azfs_mkdir,
+ .rmdir = simple_rmdir,
+ .mknod = azfs_mknod,
+ .rename = simple_rename
+};
+
+static const struct file_operations azfs_reg_fops = {
+ .llseek = generic_file_llseek,
+ .aio_read = azfs_aio_read,
+ .aio_write = azfs_aio_write,
+ .open = azfs_open,
+ .mmap = azfs_mmap,
+ .fsync = simple_sync_file,
+};
+
+static struct inode_operations azfs_reg_iops = {
+ .truncate = azfs_truncate,
+ .getattr = azfs_getattr
+};
+
+/**
+ * azfs_new_inode - cook a new inode
+ * @sb: super-block
+ * @dir: parent directory
+ * @mode: file mode
+ * @dev: to be forwarded to init_special_inode()
+ */
+static struct inode*
+azfs_new_inode(struct super_block *sb, struct inode *dir, int mode, dev_t dev)
+{
+ struct azfs_super *super;
+ struct inode *inode;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return NULL;
+
+ inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+ inode->i_mode = mode;
+ if (dir) {
+ dir->i_mtime = dir->i_ctime = inode->i_mtime;
+ inode->i_uid = current->fsuid;
+ if (dir->i_mode & S_ISGID) {
+ if (S_ISDIR(mode))
+ inode->i_mode |= S_ISGID;
+ inode->i_gid = dir->i_gid;
+ } else {
+ inode->i_gid = current->fsgid;
+ }
+ } else {
+ super = sb->s_fs_info;
+ inode->i_uid = super->uid;
+ inode->i_gid = super->gid;
+ }
+
+ inode->i_blocks = 0;
+ inode->i_mapping->a_ops = &azfs_aops;
+ inode->i_mapping->backing_dev_info = &azfs_bdi;
+
+ switch (mode & S_IFMT) {
+ case S_IFDIR:
+ inode->i_op = &azfs_dir_iops;
+ inode->i_fop = &simple_dir_operations;
+ inc_nlink(inode);
+ break;
+
+ case S_IFREG:
+ inode->i_op = &azfs_reg_iops;
+ inode->i_fop = &azfs_reg_fops;
+ break;
+
+ case S_IFLNK:
+ inode->i_op = &page_symlink_inode_operations;
+ break;
+
+ default:
+ init_special_inode(inode, mode, dev);
+ break;
+ }
+
+ return inode;
+}
+
+/**
+ * azfs_alloc_inode - alloc_inode() method for super_operations
+ * @sb: see super_operations methods
+ */
+static struct inode*
+azfs_alloc_inode(struct super_block *sb)
+{
+ struct azfs_znode *znode;
+
+ znode = kmem_cache_alloc(azfs_znode_cache, GFP_KERNEL);
+ if (znode) {
+ INIT_LIST_HEAD(&znode->block_list);
+ rwlock_init(&znode->lock);
+
+ inode_init_once(&znode->vfs_inode);
+
+ return &znode->vfs_inode;
+ }
+
+ return NULL;
+}
+
+/**
+ * azfs_destroy_inode - destroy_inode() method for super_operations
+ * @inode: see super_operations methods
+ */
+static void
+azfs_destroy_inode(struct inode *inode)
+{
+ kmem_cache_free(azfs_znode_cache, I2Z(inode));
+}
+
+/**
+ * azfs_delete_inode - delete_inode() method for super_operations
+ * @inode: see super_operations methods
+ */
+static void
+azfs_delete_inode(struct inode *inode)
+{
+ if (S_ISREG(inode->i_mode)) {
+ i_size_write(inode, 0);
+ azfs_truncate(inode);
+ }
+ truncate_inode_pages(&inode->i_data, 0);
+ clear_inode(inode);
+}
+
+/**
+ * azfs_statfs - statfs() method for super_operations
+ * @dentry, @stat: see super_operations methods
+ */
+static int
+azfs_statfs(struct dentry *dentry, struct kstatfs *stat)
+{
+ struct super_block *sb;
+ struct azfs_super *super;
+ struct inode *inode;
+ unsigned long inodes, blocks;
+
+ sb = dentry->d_sb;
+ super = sb->s_fs_info;
+
+ inodes = blocks = 0;
+ mutex_lock(&sb->s_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ inodes++;
+ blocks += inode->i_blocks;
+ }
+ mutex_unlock(&sb->s_lock);
+
+ stat->f_type = AZFS_SUPERBLOCK_MAGIC;
+ stat->f_bsize = super->block_size;
+ stat->f_blocks = super->media_size >> super->block_shift;
+ stat->f_bfree = stat->f_blocks - blocks;
+ stat->f_bavail = stat->f_blocks - blocks;
+ stat->f_files = inodes + blocks;
+ stat->f_ffree = blocks + 1;
+ stat->f_namelen = NAME_MAX;
+
+ return 0;
+}
+
+static struct super_operations azfs_ops = {
+ .alloc_inode = azfs_alloc_inode,
+ .destroy_inode = azfs_destroy_inode,
+ .drop_inode = generic_delete_inode,
+ .delete_inode = azfs_delete_inode,
+ .statfs = azfs_statfs
+};
+
+enum {
+ Opt_blocksize_short,
+ Opt_blocksize_long,
+ Opt_uid,
+ Opt_gid,
+ Opt_err
+};
+
+static match_table_t tokens = {
+ {Opt_blocksize_short, "bs=%u"},
+ {Opt_blocksize_long, "blocksize=%u"},
+ {Opt_uid, "uid=%u"},
+ {Opt_gid, "gid=%u"},
+ {Opt_err, NULL}
+};
+
+/**
+ * azfs_parse_mount_parameters - parse options given to mount with -o
+ * @super: azfs super block extension
+ * @options: comma separated options
+ */
+static int
+azfs_parse_mount_parameters(struct azfs_super *super, char *options)
+{
+ char *option;
+ int token, value;
+ substring_t args[MAX_OPT_ARGS];
+
+ if (!options)
+ return 1;
+
+ while ((option = strsep(&options, ",")) != NULL) {
+ if (!*option)
+ continue;
+
+ token = match_token(option, tokens, args);
+ switch (token) {
+ case Opt_blocksize_short:
+ case Opt_blocksize_long:
+ if (match_int(&args[0], &value))
+ goto syntax_error;
+ super->block_size = value;
+ break;
+
+ case Opt_uid:
+ if (match_int(&args[0], &value))
+ goto syntax_error;
+ super->uid = value;
+ break;
+
+ case Opt_gid:
+ if (match_int(&args[0], &value))
+ goto syntax_error;
+ super->gid = value;
+ break;
+
+ default:
+ goto syntax_error;
+ }
+ }
+
+ return 1;
+
+syntax_error:
+ printk(KERN_ERR "%s: invalid mount option\n",
+ AZFS_FILESYSTEM_NAME);
+
+ return 0;
+}
+
+/**
+ * azfs_fill_super - fill_super routine for get_sb
+ * @sb, @data, @silent: see file_system_type methods
+ */
+static int
+azfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct gendisk *disk;
+ struct azfs_super *super = NULL, *tmp_super;
+ struct azfs_block *block = NULL;
+ struct inode *inode = NULL;
+ void *kaddr;
+ unsigned long pfn;
+ int rc;
+
+ BUG_ON(!sb->s_bdev);
+
+ disk = sb->s_bdev->bd_disk;
+
+ BUG_ON(!disk || !disk->queue);
+
+ if (!disk->fops->direct_access) {
+ printk(KERN_ERR "%s needs a block device with a "
+ "direct_access() method\n",
+ AZFS_FILESYSTEM_NAME);
+ return -ENOSYS;
+ }
+
+ get_device(disk->driverfs_dev);
+
+ sb->s_magic = AZFS_SUPERBLOCK_MAGIC;
+ sb->s_flags = AZFS_SUPERBLOCK_FLAGS;
+ sb->s_op = &azfs_ops;
+ sb->s_maxbytes = get_capacity(disk) * disk->queue->hardsect_size;
+ sb->s_time_gran = 1;
+
+ spin_lock(&super_list.lock);
+ list_for_each_entry(tmp_super, &super_list.head, list)
+ if (tmp_super->blkdev == sb->s_bdev) {
+ super = tmp_super;
+ break;
+ }
+ spin_unlock(&super_list.lock);
+
+ if (super) {
+ if (data && strlen((char*) data))
+ printk(KERN_WARNING "/dev/%s was already mounted with "
+ "%s before; it will be mounted with the "
+ "mount options used last time, and the "
+ "options given now will be ignored\n",
+ disk->disk_name, AZFS_FILESYSTEM_NAME);
+ sb->s_fs_info = super;
+ } else {
+ super = kzalloc(sizeof(struct azfs_super), GFP_KERNEL);
+ if (!super) {
+ rc = -ENOMEM;
+ goto failed;
+ }
+ sb->s_fs_info = super;
+
+ if (!azfs_parse_mount_parameters(super, (char*) data)) {
+ rc = -EINVAL;
+ goto failed;
+ }
+
+ inode = azfs_new_inode(sb, NULL, S_IFDIR | S_IRWXUGO, 0);
+ if (!inode) {
+ rc = -ENOMEM;
+ goto failed;
+ }
+
+ super->root = d_alloc_root(inode);
+ if (!super->root) {
+ rc = -ENOMEM;
+ goto failed;
+ }
+ dget(super->root);
+
+ INIT_LIST_HEAD(&super->list);
+ INIT_LIST_HEAD(&super->block_list);
+ rwlock_init(&super->lock);
+
+ super->media_size = sb->s_maxbytes;
+
+ if (!super->block_size)
+ super->block_size = sb->s_blocksize;
+ super->block_shift = blksize_bits(super->block_size);
+
+ super->sector_size = disk->queue->hardsect_size;
+ super->sector_shift = blksize_bits(super->sector_size);
+
+ super->blkdev = sb->s_bdev;
+
+ block = azfs_block_init(&super->block_list,
+ 0, super->media_size >> super->block_shift);
+ if (!block) {
+ rc = -ENOMEM;
+ goto failed;
+ }
+
+ rc = disk->fops->direct_access(super->blkdev, 0, &kaddr, &pfn);
+ if (rc < 0) {
+ rc = -EFAULT;
+ goto failed;
+ }
+ super->ph_addr = (unsigned long) kaddr;
+
+ super->io_addr = (unsigned long) ioremap_flags(
+ super->ph_addr, super->media_size, _PAGE_NO_CACHE);
+ if (!super->io_addr) {
+ rc = -EFAULT;
+ goto failed;
+ }
+
+ spin_lock(&super_list.lock);
+ list_add(&super->list, &super_list.head);
+ spin_unlock(&super_list.lock);
+ }
+
+ sb->s_root = super->root;
+ disk->driverfs_dev->driver_data = super;
+ disk->driverfs_dev->platform_data = sb;
+
+ if (super->block_size < PAGE_SIZE)
+ printk(KERN_INFO "Block size on %s is smaller than the system "
+ "page size: mmap() will not be supported\n",
+ disk->disk_name);
+
+ return 0;
+
+failed:
+ if (super) {
+ sb->s_root = NULL;
+ sb->s_fs_info = NULL;
+ if (block)
+ azfs_block_free(block);
+ if (super->root)
+ dput(super->root);
+ if (inode)
+ iput(inode);
+ disk->driverfs_dev->driver_data = NULL;
+ kfree(super);
+ disk->driverfs_dev->platform_data = NULL;
+ put_device(disk->driverfs_dev);
+ }
+
+ return rc;
+}
+
+/**
+ * azfs_get_sb - get_sb() method for file_system_type
+ * @fs_type, @flags, @dev_name, @data, @mount: see file_system_type methods
+ */
+static int
+azfs_get_sb(struct file_system_type *fs_type, int flags,
+ const char *dev_name, void *data, struct vfsmount *mount)
+{
+ return get_sb_bdev(fs_type, flags,
+ dev_name, data, azfs_fill_super, mount);
+}
+
+/**
+ * azfs_kill_sb - kill_sb() method for file_system_type
+ * @sb: see file_system_type methods
+ */
+static void
+azfs_kill_sb(struct super_block *sb)
+{
+ sb->s_root = NULL;
+ kill_block_super(sb);
+}
+
+static struct file_system_type azfs_fs = {
+ .owner = THIS_MODULE,
+ .name = AZFS_FILESYSTEM_NAME,
+ .get_sb = azfs_get_sb,
+ .kill_sb = azfs_kill_sb,
+ .fs_flags = AZFS_FILESYSTEM_FLAGS
+};
+
+/**
+ * azfs_init
+ */
+static int __init
+azfs_init(void)
+{
+ int rc;
+
+ INIT_LIST_HEAD(&super_list.head);
+ spin_lock_init(&super_list.lock);
+
+ azfs_znode_cache = kmem_cache_create("azfs_znode_cache",
+ sizeof(struct azfs_znode), 0, AZFS_CACHE_FLAGS, NULL);
+ if (!azfs_znode_cache) {
+ printk(KERN_ERR "Could not allocate inode cache for %s\n",
+ AZFS_FILESYSTEM_NAME);
+ rc = -ENOMEM;
+ goto failed;
+ }
+
+ azfs_block_cache = kmem_cache_create("azfs_block_cache",
+ sizeof(struct azfs_block), 0, AZFS_CACHE_FLAGS, NULL);
+ if (!azfs_block_cache) {
+ printk(KERN_ERR "Could not allocate block cache for %s\n",
+ AZFS_FILESYSTEM_NAME);
+ rc = -ENOMEM;
+ goto failed;
+ }
+
+ rc = register_filesystem(&azfs_fs);
+ if (rc != 0) {
+ printk(KERN_ERR "Could not register %s\n",
+ AZFS_FILESYSTEM_NAME);
+ goto failed;
+ }
+
+ return 0;
+
+failed:
+ if (azfs_block_cache)
+ kmem_cache_destroy(azfs_block_cache);
+
+ if (azfs_znode_cache)
+ kmem_cache_destroy(azfs_znode_cache);
+
+ return rc;
+}
+
+/**
+ * azfs_exit
+ */
+static void __exit
+azfs_exit(void)
+{
+ struct azfs_super *super, *tmp_super;
+ struct azfs_block *block, *tmp_block;
+ struct gendisk *disk;
+
+ spin_lock(&super_list.lock);
+ list_for_each_entry_safe(super, tmp_super, &super_list.head, list) {
+ disk = super->blkdev->bd_disk;
+ list_del(&super->list);
+ iounmap((void*) super->io_addr);
+ write_lock(&super->lock);
+ for_each_block_safe(block, tmp_block, &super->block_list)
+ azfs_block_free(block);
+ write_unlock(&super->lock);
+ disk->driverfs_dev->driver_data = NULL;
+ disk->driverfs_dev->platform_data = NULL;
+ kfree(super);
+ put_device(disk->driverfs_dev);
+ }
+ spin_unlock(&super_list.lock);
+
+ unregister_filesystem(&azfs_fs);
+
+ kmem_cache_destroy(azfs_block_cache);
+ kmem_cache_destroy(azfs_znode_cache);
+}
+
+module_init(azfs_init);
+module_exit(azfs_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Maxim Shchetynin <maxim@de.ibm.com>");
+MODULE_DESCRIPTION("Non-buffered file system for IO devices");
--
1.5.4.3
--
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [patch 7/9] azfs: initial submit of azfs, a non-buffered filesystem
2008-07-15 19:51 ` [patch 7/9] azfs: initial submit of azfs, a non-buffered filesystem arnd
@ 2008-07-17 6:13 ` Benjamin Herrenschmidt
2008-07-22 9:49 ` [Cbe-oss-dev] " Christoph Hellwig
1 sibling, 0 replies; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-07-17 6:13 UTC (permalink / raw)
To: arnd; +Cc: Maxim Shchetynin, linuxppc-dev, cbe-oss-dev
On Tue, 2008-07-15 at 21:51 +0200, arnd@arndb.de wrote:
> plain text document attachment
> (0007-azfs-initial-submit-of-azfs-a-non-buffered-filesys.patch)
> From: Maxim Shchetynin <maxim@linux.vnet.ibm.com>
>
> AZFS is a file system which keeps all files on memory mapped random
> access storage. It was designed to work on the axonram device driver
> for IBM QS2x blade servers, but can operate on any block device
> that exports a direct_access method.
>
> Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
Can I get this acked by some filesystem person like Al ?
Cheers,
Ben.
> Documentation/filesystems/azfs.txt | 22 +
> arch/powerpc/configs/cell_defconfig | 1 +
> fs/Kconfig | 15 +
> fs/Makefile | 1 +
> fs/azfs/Makefile | 7 +
> fs/azfs/inode.c | 1184 +++++++++++++++++++++++++++++++++++
> 6 files changed, 1230 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/filesystems/azfs.txt
> create mode 100644 fs/azfs/Makefile
> create mode 100644 fs/azfs/inode.c
>
> diff --git a/Documentation/filesystems/azfs.txt b/Documentation/filesystems/azfs.txt
> new file mode 100644
> index 0000000..c4bf659
> --- /dev/null
> +++ b/Documentation/filesystems/azfs.txt
> @@ -0,0 +1,22 @@
> +AZFS is a file system which keeps all files on memory backed random
> +access storage. It was designed to work on the axonram device driver
> +for IBM QS2x blade servers, but can operate on any block device
> +that exports a direct_access method.
> +
> +Everything in AZFS is temporary in the sense that all the data stored
> +therein is lost when you switch off or reboot a system. If you unmount
> +an AZFS instance, all the data will be kept on the device as long as your
> +system is not shut down or rebooted. You can later mount AZFS from the device
> +to get access to your files.
> +
> +AZFS uses a block device only for data but not for file information.
> +All inodes (file and directory information) are kept in RAM.
> +
> +When you mount AZFS you are able to specify a file system block size with
> +'-o bs=<size in bytes>' option. There is no software limit on the block
> +size, but you will not be able to mmap files on AZFS if the block size
> +is less than the system page size. If no '-o bs' option is given on mount,
> +the block size of the underlying block device is used as the default.
> +
> +Other available mount options for AZFS are '-o uid=<id>' and '-o gid=<id>',
> +which allow you to set the owner and group of the root of the file system.
> diff --git a/arch/powerpc/configs/cell_defconfig b/arch/powerpc/configs/cell_defconfig
> index c420e47..235a0c8 100644
> --- a/arch/powerpc/configs/cell_defconfig
> +++ b/arch/powerpc/configs/cell_defconfig
> @@ -240,6 +240,7 @@ CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
> # CPU Frequency drivers
> #
> CONFIG_AXON_RAM=m
> +CONFIG_AZ_FS=m
> # CONFIG_FSL_ULI1575 is not set
>
> #
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 2694648..2d4e42b 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -1017,6 +1017,21 @@ config HUGETLBFS
> config HUGETLB_PAGE
> def_bool HUGETLBFS
>
> +config AZ_FS
> + tristate "AZFS filesystem support"
> + help
> + azfs is a file system for I/O attached memory backing. It requires
> + a block device with direct_access capability, e.g. axonram.
> + Mounting such a device with azfs gives user space memory-mapped
> + access to the underlying memory.
> +
> + Read <file:Documentation/filesystems/azfs.txt> for details.
> +
> + To compile this file system support as a module, choose M here: the
> + module will be called azfs.
> +
> + If unsure, say N.
> +
> config CONFIGFS_FS
> tristate "Userspace-driven configuration filesystem"
> depends on SYSFS
> diff --git a/fs/Makefile b/fs/Makefile
> index 1e7a11b..20e3253 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -119,3 +119,4 @@ obj-$(CONFIG_HPPFS) += hppfs/
> obj-$(CONFIG_DEBUG_FS) += debugfs/
> obj-$(CONFIG_OCFS2_FS) += ocfs2/
> obj-$(CONFIG_GFS2_FS) += gfs2/
> +obj-$(CONFIG_AZ_FS) += azfs/
> diff --git a/fs/azfs/Makefile b/fs/azfs/Makefile
> new file mode 100644
> index 0000000..ff04d41
> --- /dev/null
> +++ b/fs/azfs/Makefile
> @@ -0,0 +1,7 @@
> +#
> +# Makefile for azfs routines
> +#
> +
> +obj-$(CONFIG_AZ_FS) += azfs.o
> +
> +azfs-y := inode.o
> diff --git a/fs/azfs/inode.c b/fs/azfs/inode.c
> new file mode 100644
> index 0000000..00dc2af
> --- /dev/null
> +++ b/fs/azfs/inode.c
> @@ -0,0 +1,1184 @@
> +/*
> + * (C) Copyright IBM Deutschland Entwicklung GmbH 2007
> + *
> + * Author: Maxim Shchetynin <maxim@de.ibm.com>
> + *
> + * Non-buffered filesystem driver.
> + * It registers a filesystem which may be used for all kinds of block devices
> + * which have a direct_access() method in block_device_operations.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2, or (at your option)
> + * any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +
> +#include <linux/backing-dev.h>
> +#include <linux/blkdev.h>
> +#include <linux/cache.h>
> +#include <linux/dcache.h>
> +#include <linux/device.h>
> +#include <linux/err.h>
> +#include <linux/fs.h>
> +#include <linux/genhd.h>
> +#include <linux/kernel.h>
> +#include <linux/limits.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/mount.h>
> +#include <linux/mm.h>
> +#include <linux/mm_types.h>
> +#include <linux/mutex.h>
> +#include <linux/namei.h>
> +#include <linux/pagemap.h>
> +#include <linux/parser.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/stat.h>
> +#include <linux/statfs.h>
> +#include <linux/string.h>
> +#include <linux/time.h>
> +#include <linux/types.h>
> +#include <linux/aio.h>
> +#include <linux/uio.h>
> +#include <asm/bug.h>
> +#include <asm/page.h>
> +#include <asm/pgtable.h>
> +#include <asm/string.h>
> +
> +#define AZFS_FILESYSTEM_NAME "azfs"
> +#define AZFS_FILESYSTEM_FLAGS FS_REQUIRES_DEV
> +
> +#define AZFS_SUPERBLOCK_MAGIC 0xABBA1972
> +#define AZFS_SUPERBLOCK_FLAGS MS_SYNCHRONOUS | \
> + MS_DIRSYNC | \
> + MS_ACTIVE
> +
> +#define AZFS_BDI_CAPABILITIES BDI_CAP_NO_ACCT_DIRTY | \
> + BDI_CAP_NO_WRITEBACK | \
> + BDI_CAP_MAP_COPY | \
> + BDI_CAP_MAP_DIRECT | \
> + BDI_CAP_VMFLAGS
> +
> +#define AZFS_CACHE_FLAGS SLAB_HWCACHE_ALIGN | \
> + SLAB_RECLAIM_ACCOUNT | \
> + SLAB_MEM_SPREAD
> +
> +struct azfs_super {
> + struct list_head list;
> + unsigned long media_size;
> + unsigned long block_size;
> + unsigned short block_shift;
> + unsigned long sector_size;
> + unsigned short sector_shift;
> + uid_t uid;
> + gid_t gid;
> + unsigned long ph_addr;
> + unsigned long io_addr;
> + struct block_device *blkdev;
> + struct dentry *root;
> + struct list_head block_list;
> + rwlock_t lock;
> +};
> +
> +struct azfs_super_list {
> + struct list_head head;
> + spinlock_t lock;
> +};
> +
> +struct azfs_block {
> + struct list_head list;
> + unsigned long id;
> + unsigned long count;
> +};
> +
> +struct azfs_znode {
> + struct list_head block_list;
> + rwlock_t lock;
> + loff_t size;
> + struct inode vfs_inode;
> +};
> +
> +static struct azfs_super_list super_list;
> +static struct kmem_cache *azfs_znode_cache __read_mostly = NULL;
> +static struct kmem_cache *azfs_block_cache __read_mostly = NULL;
> +
> +#define I2S(inode) \
> + inode->i_sb->s_fs_info
> +#define I2Z(inode) \
> + container_of(inode, struct azfs_znode, vfs_inode)
> +
> +#define for_each_block(block, block_list) \
> + list_for_each_entry(block, block_list, list)
> +#define for_each_block_reverse(block, block_list) \
> + list_for_each_entry_reverse(block, block_list, list)
> +#define for_each_block_safe(block, temp, block_list) \
> + list_for_each_entry_safe(block, temp, block_list, list)
> +#define for_each_block_safe_reverse(block, temp, block_list) \
> + list_for_each_entry_safe_reverse(block, temp, block_list, list)
> +
> +/**
> + * azfs_block_init - create and initialise a new block in a list
> + * @block_list: destination list
> + * @id: block id
> + * @count: size of a block
> + */
> +static inline struct azfs_block*
> +azfs_block_init(struct list_head *block_list,
> + unsigned long id, unsigned long count)
> +{
> + struct azfs_block *block;
> +
> + block = kmem_cache_alloc(azfs_block_cache, GFP_KERNEL);
> + if (!block)
> + return NULL;
> +
> + block->id = id;
> + block->count = count;
> +
> + INIT_LIST_HEAD(&block->list);
> + list_add_tail(&block->list, block_list);
> +
> + return block;
> +}
> +
> +/**
> + * azfs_block_free - remove block from a list and free it back in cache
> + * @block: block to be removed
> + */
> +static inline void
> +azfs_block_free(struct azfs_block *block)
> +{
> + list_del(&block->list);
> + kmem_cache_free(azfs_block_cache, block);
> +}
> +
> +/**
> + * azfs_block_move - move block to another list
> + * @block: block to be moved
> + * @block_list: destination list
> + */
> +static inline void
> +azfs_block_move(struct azfs_block *block, struct list_head *block_list)
> +{
> + list_move_tail(&block->list, block_list);
> +}
> +
> +/**
> + * azfs_block_find - get a block id of a part of a file
> + * @inode: inode
> + * @from: offset for read/write operation
> + * @size: pointer to a value of the amount of data to be read/written
> + */
> +static unsigned long
> +azfs_block_find(struct inode *inode, unsigned long from, unsigned long *size)
> +{
> + struct azfs_super *super;
> + struct azfs_znode *znode;
> + struct azfs_block *block;
> + unsigned long block_id, west, east;
> +
> + super = I2S(inode);
> + znode = I2Z(inode);
> +
> + read_lock(&znode->lock);
> +
> + while (from + *size > znode->size) {
> + read_unlock(&znode->lock);
> + i_size_write(inode, from + *size);
> + inode->i_op->truncate(inode);
> + read_lock(&znode->lock);
> + }
> +
> + if (list_empty(&znode->block_list)) {
> + read_unlock(&znode->lock);
> + *size = 0;
> + return 0;
> + }
> +
> + block_id = from >> super->block_shift;
> +
> + for_each_block(block, &znode->block_list) {
> + if (block->count > block_id)
> + break;
> + block_id -= block->count;
> + }
> +
> + west = from % super->block_size;
> + east = ((block->count - block_id) << super->block_shift) - west;
> +
> + if (*size > east)
> + *size = east;
> +
> + block_id = ((block->id + block_id) << super->block_shift) + west;
> +
> + read_unlock(&znode->lock);
> +
> + return block_id;
> +}
> +
> +static struct inode*
> +azfs_new_inode(struct super_block *, struct inode *, int, dev_t);
> +
> +/**
> + * azfs_mknod - mknod() method for inode_operations
> + * @dir, @dentry, @mode, @dev: see inode_operations methods
> + */
> +static int
> +azfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
> +{
> + struct inode *inode;
> +
> + inode = azfs_new_inode(dir->i_sb, dir, mode, dev);
> + if (!inode)
> + return -ENOSPC;
> +
> + if (S_ISREG(mode))
> + I2Z(inode)->size = 0;
> +
> + dget(dentry);
> + d_instantiate(dentry, inode);
> +
> + return 0;
> +}
> +
> +/**
> + * azfs_create - create() method for inode_operations
> + * @dir, @dentry, @mode, @nd: see inode_operations methods
> + */
> +static int
> +azfs_create(struct inode *dir, struct dentry *dentry, int mode,
> + struct nameidata *nd)
> +{
> + return azfs_mknod(dir, dentry, mode | S_IFREG, 0);
> +}
> +
> +/**
> + * azfs_mkdir - mkdir() method for inode_operations
> + * @dir, @dentry, @mode: see inode_operations methods
> + */
> +static int
> +azfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> +{
> + int rc;
> +
> + rc = azfs_mknod(dir, dentry, mode | S_IFDIR, 0);
> + if (!rc)
> + inc_nlink(dir);
> +
> + return rc;
> +}
> +
> +/**
> + * azfs_symlink - symlink() method for inode_operations
> + * @dir, @dentry, @name: see inode_operations methods
> + */
> +static int
> +azfs_symlink(struct inode *dir, struct dentry *dentry, const char *name)
> +{
> + struct inode *inode;
> + int rc;
> +
> + inode = azfs_new_inode(dir->i_sb, dir, S_IFLNK | S_IRWXUGO, 0);
> + if (!inode)
> + return -ENOSPC;
> +
> + rc = page_symlink(inode, name, strlen(name) + 1);
> + if (rc) {
> + iput(inode);
> + return rc;
> + }
> +
> + dget(dentry);
> + d_instantiate(dentry, inode);
> +
> + return 0;
> +}
> +
> +/**
> + * azfs_aio_read - aio_read() method for file_operations
> + * @iocb, @iov, @nr_segs, @pos: see file_operations methods
> + */
> +static ssize_t
> +azfs_aio_read(struct kiocb *iocb, const struct iovec *iov,
> + unsigned long nr_segs, loff_t pos)
> +{
> + struct azfs_super *super;
> + struct inode *inode;
> + void *target;
> + unsigned long pin;
> + unsigned long size, todo, step;
> + ssize_t rc;
> +
> + inode = iocb->ki_filp->f_mapping->host;
> + super = I2S(inode);
> +
> + mutex_lock(&inode->i_mutex);
> +
> + if (pos >= i_size_read(inode)) {
> + rc = 0;
> + goto out;
> + }
> +
> + target = iov->iov_base;
> + todo = min((loff_t) iov->iov_len, i_size_read(inode) - pos);
> +
> + for (step = todo; step; step -= size) {
> + size = step;
> + pin = azfs_block_find(inode, pos, &size);
> + if (!size) {
> + rc = -ENOSPC;
> + goto out;
> + }
> + pin += super->io_addr;
> + /*
> + * FIXME: pin is actually an __iomem pointer, is
> + * that safe? -arnd
> + */
> + if (copy_to_user(target, (void*) pin, size)) {
> + rc = -EFAULT;
> + goto out;
> + }
> +
> + iocb->ki_pos += size;
> + pos += size;
> + target += size;
> + }
> +
> + rc = todo;
> +
> +out:
> + mutex_unlock(&inode->i_mutex);
> +
> + return rc;
> +}
> +
> +/**
> + * azfs_aio_write - aio_write() method for file_operations
> + * @iocb, @iov, @nr_segs, @pos: see file_operations methods
> + */
> +static ssize_t
> +azfs_aio_write(struct kiocb *iocb, const struct iovec *iov,
> + unsigned long nr_segs, loff_t pos)
> +{
> + struct azfs_super *super;
> + struct inode *inode;
> + void *source;
> + unsigned long pin;
> + unsigned long size, todo, step;
> + ssize_t rc;
> +
> + inode = iocb->ki_filp->f_mapping->host;
> + super = I2S(inode);
> +
> + source = iov->iov_base;
> + todo = iov->iov_len;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + for (step = todo; step; step -= size) {
> + size = step;
> + pin = azfs_block_find(inode, pos, &size);
> + if (!size) {
> + rc = -ENOSPC;
> + goto out;
> + }
> + pin += super->io_addr;
> + /*
> + * FIXME: pin is actually an __iomem pointer, is
> + * that safe? -arnd
> + */
> + if (copy_from_user((void*) pin, source, size)) {
> + rc = -EFAULT;
> + goto out;
> + }
> +
> + iocb->ki_pos += size;
> + pos += size;
> + source += size;
> + }
> +
> + rc = todo;
> +
> +out:
> + mutex_unlock(&inode->i_mutex);
> +
> + return rc;
> +}
> +
> +/**
> + * azfs_open - open() method for file_operations
> + * @inode, @file: see file_operations methods
> + */
> +static int
> +azfs_open(struct inode *inode, struct file *file)
> +{
> + if (file->f_flags & O_TRUNC) {
> + i_size_write(inode, 0);
> + inode->i_op->truncate(inode);
> + }
> + if (file->f_flags & O_APPEND)
> + inode->i_fop->llseek(file, 0, SEEK_END);
> +
> + return 0;
> +}
> +
> +/**
> + * azfs_mmap - mmap() method for file_operations
> + * @file, @vma: see file_operations methods
> + */
> +static int
> +azfs_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct azfs_super *super;
> + struct azfs_znode *znode;
> + struct inode *inode;
> + unsigned long cursor, pin;
> + unsigned long todo, size, vm_start;
> + pgprot_t page_prot;
> +
> + inode = file->f_dentry->d_inode;
> + znode = I2Z(inode);
> + super = I2S(inode);
> +
> + if (super->block_size < PAGE_SIZE)
> + return -EINVAL;
> +
> + cursor = vma->vm_pgoff << super->block_shift;
> + todo = vma->vm_end - vma->vm_start;
> +
> + if (cursor + todo > i_size_read(inode))
> + return -EINVAL;
> +
> + page_prot = pgprot_val(vma->vm_page_prot);
> +#ifdef CONFIG_PPC
> + page_prot |= (_PAGE_NO_CACHE | _PAGE_RW);
> + page_prot &= ~_PAGE_GUARDED;
> +#else
> + page_prot = pgprot_noncached(page_prot);
> +#endif
> + vma->vm_page_prot = __pgprot(page_prot);
> +
> + vm_start = vma->vm_start;
> + for (size = todo; todo; todo -= size, size = todo) {
> + pin = azfs_block_find(inode, cursor, &size);
> + if (!size)
> + return -EAGAIN;
> + pin += super->ph_addr;
> + pin >>= PAGE_SHIFT;
> + if (remap_pfn_range(vma, vm_start, pin, size, vma->vm_page_prot))
> + return -EAGAIN;
> +
> + vm_start += size;
> + cursor += size;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * azfs_truncate - truncate() method for inode_operations
> + * @inode: see inode_operations methods
> + */
> +static void
> +azfs_truncate(struct inode *inode)
> +{
> + struct azfs_super *super;
> + struct azfs_znode *znode;
> + struct azfs_block *block, *tmp_block, *temp, *west, *east;
> + unsigned long id, count;
> + signed long delta;
> +
> + super = I2S(inode);
> + znode = I2Z(inode);
> +
> + delta = i_size_read(inode) + (super->block_size - 1);
> + delta >>= super->block_shift;
> + delta -= inode->i_blocks;
> +
> + if (delta == 0) {
> + znode->size = i_size_read(inode);
> + return;
> + }
> +
> + write_lock(&znode->lock);
> +
> + while (delta > 0) {
> + west = east = NULL;
> +
> + write_lock(&super->lock);
> +
> + if (list_empty(&super->block_list)) {
> + write_unlock(&super->lock);
> + break;
> + }
> +
> + for (count = delta; count; count--) {
> + for_each_block(block, &super->block_list)
> + if (block->count >= count) {
> + east = block;
> + break;
> + }
> + if (east)
> + break;
> + }
> +
> + for_each_block_reverse(block, &znode->block_list) {
> + if (block->id + block->count == east->id)
> + west = block;
> + break;
> + }
> +
> + if (east->count == count) {
> + if (west) {
> + west->count += east->count;
> + azfs_block_free(east);
> + } else {
> + azfs_block_move(east, &znode->block_list);
> + }
> + } else {
> + if (west) {
> + west->count += count;
> + } else {
> + if (!azfs_block_init(&znode->block_list,
> + east->id, count)) {
> + write_unlock(&super->lock);
> + break;
> + }
> + }
> +
> + east->id += count;
> + east->count -= count;
> + }
> +
> + write_unlock(&super->lock);
> +
> + inode->i_blocks += count;
> +
> + delta -= count;
> + }
> +
> + while (delta < 0) {
> + for_each_block_safe_reverse(block, tmp_block, &znode->block_list) {
> + id = block->id;
> + count = block->count;
> + if ((signed long) count + delta > 0) {
> + block->count += delta;
> + id += block->count;
> + count -= block->count;
> + block = NULL;
> + }
> +
> + west = east = NULL;
> +
> + write_lock(&super->lock);
> +
> + for_each_block(temp, &super->block_list) {
> + if (!west && (temp->id + temp->count == id))
> + west = temp;
> + else if (!east && (id + count == temp->id))
> + east = temp;
> + if (west && east)
> + break;
> + }
> +
> + if (west && east) {
> + west->count += count + east->count;
> + azfs_block_free(east);
> + if (block)
> + azfs_block_free(block);
> + } else if (west) {
> + west->count += count;
> + if (block)
> + azfs_block_free(block);
> + } else if (east) {
> + east->id -= count;
> + east->count += count;
> + if (block)
> + azfs_block_free(block);
> + } else {
> + if (!block) {
> + if (!azfs_block_init(&super->block_list,
> + id, count)) {
> + write_unlock(&super->lock);
> + break;
> + }
> + } else {
> + azfs_block_move(block, &super->block_list);
> + }
> + }
> +
> + write_unlock(&super->lock);
> +
> + inode->i_blocks -= count;
> +
> + delta += count;
> +
> + break;
> + }
> + }
> +
> + write_unlock(&znode->lock);
> +
> + znode->size = min(i_size_read(inode),
> + (loff_t) inode->i_blocks << super->block_shift);
> +}
> +
> +/**
> + * azfs_getattr - getattr() method for inode_operations
> + * @mnt, @dentry, @stat: see inode_operations methods
> + */
> +static int
> +azfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> +{
> + struct azfs_super *super;
> + struct inode *inode;
> + unsigned short shift;
> +
> + inode = dentry->d_inode;
> + super = I2S(inode);
> +
> + generic_fillattr(inode, stat);
> + stat->blocks = inode->i_blocks;
> + shift = super->block_shift - super->sector_shift;
> + if (shift)
> + stat->blocks <<= shift;
> +
> + return 0;
> +}
> +
> +static const struct address_space_operations azfs_aops = {
> + .write_begin = simple_write_begin,
> + .write_end = simple_write_end
> +};
> +
> +static struct backing_dev_info azfs_bdi = {
> + .ra_pages = 0,
> + .capabilities = AZFS_BDI_CAPABILITIES
> +};
> +
> +static struct inode_operations azfs_dir_iops = {
> + .create = azfs_create,
> + .lookup = simple_lookup,
> + .link = simple_link,
> + .unlink = simple_unlink,
> + .symlink = azfs_symlink,
> + .mkdir = azfs_mkdir,
> + .rmdir = simple_rmdir,
> + .mknod = azfs_mknod,
> + .rename = simple_rename
> +};
> +
> +static const struct file_operations azfs_reg_fops = {
> + .llseek = generic_file_llseek,
> + .aio_read = azfs_aio_read,
> + .aio_write = azfs_aio_write,
> + .open = azfs_open,
> + .mmap = azfs_mmap,
> + .fsync = simple_sync_file,
> +};
> +
> +static struct inode_operations azfs_reg_iops = {
> + .truncate = azfs_truncate,
> + .getattr = azfs_getattr
> +};
> +
> +/**
> + * azfs_new_inode - cook a new inode
> + * @sb: super-block
> + * @dir: parent directory
> + * @mode: file mode
> + * @dev: to be forwarded to init_special_inode()
> + */
> +static struct inode*
> +azfs_new_inode(struct super_block *sb, struct inode *dir, int mode, dev_t dev)
> +{
> + struct azfs_super *super;
> + struct inode *inode;
> +
> + inode = new_inode(sb);
> + if (!inode)
> + return NULL;
> +
> + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> +
> + inode->i_mode = mode;
> + if (dir) {
> + dir->i_mtime = dir->i_ctime = inode->i_mtime;
> + inode->i_uid = current->fsuid;
> + if (dir->i_mode & S_ISGID) {
> + if (S_ISDIR(mode))
> + inode->i_mode |= S_ISGID;
> + inode->i_gid = dir->i_gid;
> + } else {
> + inode->i_gid = current->fsgid;
> + }
> + } else {
> + super = sb->s_fs_info;
> + inode->i_uid = super->uid;
> + inode->i_gid = super->gid;
> + }
> +
> + inode->i_blocks = 0;
> + inode->i_mapping->a_ops = &azfs_aops;
> + inode->i_mapping->backing_dev_info = &azfs_bdi;
> +
> + switch (mode & S_IFMT) {
> + case S_IFDIR:
> + inode->i_op = &azfs_dir_iops;
> + inode->i_fop = &simple_dir_operations;
> + inc_nlink(inode);
> + break;
> +
> + case S_IFREG:
> + inode->i_op = &azfs_reg_iops;
> + inode->i_fop = &azfs_reg_fops;
> + break;
> +
> + case S_IFLNK:
> + inode->i_op = &page_symlink_inode_operations;
> + break;
> +
> + default:
> + init_special_inode(inode, mode, dev);
> + break;
> + }
> +
> + return inode;
> +}
> +
> +/**
> + * azfs_alloc_inode - alloc_inode() method for super_operations
> + * @sb: see super_operations methods
> + */
> +static struct inode*
> +azfs_alloc_inode(struct super_block *sb)
> +{
> + struct azfs_znode *znode;
> +
> + znode = kmem_cache_alloc(azfs_znode_cache, GFP_KERNEL);
> + if (znode) {
> + INIT_LIST_HEAD(&znode->block_list);
> + rwlock_init(&znode->lock);
> +
> + inode_init_once(&znode->vfs_inode);
> +
> + return &znode->vfs_inode;
> + }
> +
> + return NULL;
> +}
> +
> +/**
> + * azfs_destroy_inode - destroy_inode() method for super_operations
> + * @inode: see super_operations methods
> + */
> +static void
> +azfs_destroy_inode(struct inode *inode)
> +{
> + kmem_cache_free(azfs_znode_cache, I2Z(inode));
> +}
> +
> +/**
> + * azfs_delete_inode - delete_inode() method for super_operations
> + * @inode: see super_operations methods
> + */
> +static void
> +azfs_delete_inode(struct inode *inode)
> +{
> + if (S_ISREG(inode->i_mode)) {
> + i_size_write(inode, 0);
> + azfs_truncate(inode);
> + }
> + truncate_inode_pages(&inode->i_data, 0);
> + clear_inode(inode);
> +}
> +
> +/**
> + * azfs_statfs - statfs() method for super_operations
> + * @dentry, @stat: see super_operations methods
> + */
> +static int
> +azfs_statfs(struct dentry *dentry, struct kstatfs *stat)
> +{
> + struct super_block *sb;
> + struct azfs_super *super;
> + struct inode *inode;
> + unsigned long inodes, blocks;
> +
> + sb = dentry->d_sb;
> + super = sb->s_fs_info;
> +
> + inodes = blocks = 0;
> + mutex_lock(&sb->s_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + inodes++;
> + blocks += inode->i_blocks;
> + }
> + mutex_unlock(&sb->s_lock);
> +
> + stat->f_type = AZFS_SUPERBLOCK_MAGIC;
> + stat->f_bsize = super->block_size;
> + stat->f_blocks = super->media_size >> super->block_shift;
> + stat->f_bfree = stat->f_blocks - blocks;
> + stat->f_bavail = stat->f_blocks - blocks;
> + stat->f_files = inodes + blocks;
> + stat->f_ffree = blocks + 1;
> + stat->f_namelen = NAME_MAX;
> +
> + return 0;
> +}
> +
> +static struct super_operations azfs_ops = {
> + .alloc_inode = azfs_alloc_inode,
> + .destroy_inode = azfs_destroy_inode,
> + .drop_inode = generic_delete_inode,
> + .delete_inode = azfs_delete_inode,
> + .statfs = azfs_statfs
> +};
> +
> +enum {
> + Opt_blocksize_short,
> + Opt_blocksize_long,
> + Opt_uid,
> + Opt_gid,
> + Opt_err
> +};
> +
> +static match_table_t tokens = {
> + {Opt_blocksize_short, "bs=%u"},
> + {Opt_blocksize_long, "blocksize=%u"},
> + {Opt_uid, "uid=%u"},
> + {Opt_gid, "gid=%u"},
> + {Opt_err, NULL}
> +};
> +
> +/**
> + * azfs_parse_mount_parameters - parse options given to mount with -o
> + * @super: azfs super block extension
> + * @options: comma separated options
> + */
> +static int
> +azfs_parse_mount_parameters(struct azfs_super *super, char *options)
> +{
> + char *option;
> + int token, value;
> + substring_t args[MAX_OPT_ARGS];
> +
> + if (!options)
> + return 1;
> +
> + while ((option = strsep(&options, ",")) != NULL) {
> + if (!*option)
> + continue;
> +
> + token = match_token(option, tokens, args);
> + switch (token) {
> + case Opt_blocksize_short:
> + case Opt_blocksize_long:
> + if (match_int(&args[0], &value))
> + goto syntax_error;
> + super->block_size = value;
> + break;
> +
> + case Opt_uid:
> + if (match_int(&args[0], &value))
> + goto syntax_error;
> + super->uid = value;
> + break;
> +
> + case Opt_gid:
> + if (match_int(&args[0], &value))
> + goto syntax_error;
> + super->gid = value;
> + break;
> +
> + default:
> + goto syntax_error;
> + }
> + }
> +
> + return 1;
> +
> +syntax_error:
> + printk(KERN_ERR "%s: invalid mount option\n",
> + AZFS_FILESYSTEM_NAME);
> +
> + return 0;
> +}
> +
> +/**
> + * azfs_fill_super - fill_super routine for get_sb
> + * @sb, @data, @silent: see file_system_type methods
> + */
> +static int
> +azfs_fill_super(struct super_block *sb, void *data, int silent)
> +{
> + struct gendisk *disk;
> + struct azfs_super *super = NULL, *tmp_super;
> + struct azfs_block *block = NULL;
> + struct inode *inode = NULL;
> + void *kaddr;
> + unsigned long pfn;
> + int rc;
> +
> + BUG_ON(!sb->s_bdev);
> +
> + disk = sb->s_bdev->bd_disk;
> +
> + BUG_ON(!disk || !disk->queue);
> +
> + if (!disk->fops->direct_access) {
> + printk(KERN_ERR "%s needs a block device with a "
> + "direct_access() method\n",
> + AZFS_FILESYSTEM_NAME);
> + return -ENOSYS;
> + }
> +
> + get_device(disk->driverfs_dev);
> +
> + sb->s_magic = AZFS_SUPERBLOCK_MAGIC;
> + sb->s_flags = AZFS_SUPERBLOCK_FLAGS;
> + sb->s_op = &azfs_ops;
> + sb->s_maxbytes = get_capacity(disk) * disk->queue->hardsect_size;
> + sb->s_time_gran = 1;
> +
> + spin_lock(&super_list.lock);
> + list_for_each_entry(tmp_super, &super_list.head, list)
> + if (tmp_super->blkdev == sb->s_bdev) {
> + super = tmp_super;
> + break;
> + }
> + spin_unlock(&super_list.lock);
> +
> + if (super) {
> + if (data && strlen((char*) data))
> + printk(KERN_WARNING "/dev/%s was already mounted with "
> + "%s before; it will be mounted with the "
> + "mount options used last time, and the "
> + "options just given will be ignored\n",
> + disk->disk_name, AZFS_FILESYSTEM_NAME);
> + sb->s_fs_info = super;
> + } else {
> + super = kzalloc(sizeof(struct azfs_super), GFP_KERNEL);
> + if (!super) {
> + rc = -ENOMEM;
> + goto failed;
> + }
> + sb->s_fs_info = super;
> +
> + if (!azfs_parse_mount_parameters(super, (char*) data)) {
> + rc = -EINVAL;
> + goto failed;
> + }
> +
> + inode = azfs_new_inode(sb, NULL, S_IFDIR | S_IRWXUGO, 0);
> + if (!inode) {
> + rc = -ENOMEM;
> + goto failed;
> + }
> +
> + super->root = d_alloc_root(inode);
> + if (!super->root) {
> + rc = -ENOMEM;
> + goto failed;
> + }
> + dget(super->root);
> +
> + INIT_LIST_HEAD(&super->list);
> + INIT_LIST_HEAD(&super->block_list);
> + rwlock_init(&super->lock);
> +
> + super->media_size = sb->s_maxbytes;
> +
> + if (!super->block_size)
> + super->block_size = sb->s_blocksize;
> + super->block_shift = blksize_bits(super->block_size);
> +
> + super->sector_size = disk->queue->hardsect_size;
> + super->sector_shift = blksize_bits(super->sector_size);
> +
> + super->blkdev = sb->s_bdev;
> +
> + block = azfs_block_init(&super->block_list,
> + 0, super->media_size >> super->block_shift);
> + if (!block) {
> + rc = -ENOMEM;
> + goto failed;
> + }
> +
> + rc = disk->fops->direct_access(super->blkdev, 0, &kaddr, &pfn);
> + if (rc < 0) {
> + rc = -EFAULT;
> + goto failed;
> + }
> + super->ph_addr = (unsigned long) kaddr;
> +
> + super->io_addr = (unsigned long) ioremap_flags(
> + super->ph_addr, super->media_size, _PAGE_NO_CACHE);
> + if (!super->io_addr) {
> + rc = -EFAULT;
> + goto failed;
> + }
> +
> + spin_lock(&super_list.lock);
> + list_add(&super->list, &super_list.head);
> + spin_unlock(&super_list.lock);
> + }
> +
> + sb->s_root = super->root;
> + disk->driverfs_dev->driver_data = super;
> + disk->driverfs_dev->platform_data = sb;
> +
> + if (super->block_size < PAGE_SIZE)
> + printk(KERN_INFO "Block size on %s is smaller than the system "
> + "page size: mmap() will not be supported\n",
> + disk->disk_name);
> +
> + return 0;
> +
> +failed:
> + if (super) {
> + sb->s_root = NULL;
> + sb->s_fs_info = NULL;
> + if (block)
> + azfs_block_free(block);
> + if (super->root)
> + dput(super->root);
> + if (inode)
> + iput(inode);
> + disk->driverfs_dev->driver_data = NULL;
> + kfree(super);
> + disk->driverfs_dev->platform_data = NULL;
> + put_device(disk->driverfs_dev);
> + }
> +
> + return rc;
> +}
> +
> +/**
> + * azfs_get_sb - get_sb() method for file_system_type
> + * @fs_type, @flags, @dev_name, @data, @mount: see file_system_type methods
> + */
> +static int
> +azfs_get_sb(struct file_system_type *fs_type, int flags,
> + const char *dev_name, void *data, struct vfsmount *mount)
> +{
> + return get_sb_bdev(fs_type, flags,
> + dev_name, data, azfs_fill_super, mount);
> +}
> +
> +/**
> + * azfs_kill_sb - kill_sb() method for file_system_type
> + * @sb: see file_system_type methods
> + */
> +static void
> +azfs_kill_sb(struct super_block *sb)
> +{
> + sb->s_root = NULL;
> + kill_block_super(sb);
> +}
> +
> +static struct file_system_type azfs_fs = {
> + .owner = THIS_MODULE,
> + .name = AZFS_FILESYSTEM_NAME,
> + .get_sb = azfs_get_sb,
> + .kill_sb = azfs_kill_sb,
> + .fs_flags = AZFS_FILESYSTEM_FLAGS
> +};
> +
> +/**
> + * azfs_init
> + */
> +static int __init
> +azfs_init(void)
> +{
> + int rc;
> +
> + INIT_LIST_HEAD(&super_list.head);
> + spin_lock_init(&super_list.lock);
> +
> + azfs_znode_cache = kmem_cache_create("azfs_znode_cache",
> + sizeof(struct azfs_znode), 0, AZFS_CACHE_FLAGS, NULL);
> + if (!azfs_znode_cache) {
> + printk(KERN_ERR "Could not allocate inode cache for %s\n",
> + AZFS_FILESYSTEM_NAME);
> + rc = -ENOMEM;
> + goto failed;
> + }
> +
> + azfs_block_cache = kmem_cache_create("azfs_block_cache",
> + sizeof(struct azfs_block), 0, AZFS_CACHE_FLAGS, NULL);
> + if (!azfs_block_cache) {
> + printk(KERN_ERR "Could not allocate block cache for %s\n",
> + AZFS_FILESYSTEM_NAME);
> + rc = -ENOMEM;
> + goto failed;
> + }
> +
> + rc = register_filesystem(&azfs_fs);
> + if (rc != 0) {
> + printk(KERN_ERR "Could not register %s\n",
> + AZFS_FILESYSTEM_NAME);
> + goto failed;
> + }
> +
> + return 0;
> +
> +failed:
> + if (azfs_block_cache)
> + kmem_cache_destroy(azfs_block_cache);
> +
> + if (azfs_znode_cache)
> + kmem_cache_destroy(azfs_znode_cache);
> +
> + return rc;
> +}
> +
> +/**
> + * azfs_exit
> + */
> +static void __exit
> +azfs_exit(void)
> +{
> + struct azfs_super *super, *tmp_super;
> + struct azfs_block *block, *tmp_block;
> + struct gendisk *disk;
> +
> + spin_lock(&super_list.lock);
> + list_for_each_entry_safe(super, tmp_super, &super_list.head, list) {
> + disk = super->blkdev->bd_disk;
> + list_del(&super->list);
> + iounmap((void*) super->io_addr);
> + write_lock(&super->lock);
> + for_each_block_safe(block, tmp_block, &super->block_list)
> + azfs_block_free(block);
> + write_unlock(&super->lock);
> + disk->driverfs_dev->driver_data = NULL;
> + disk->driverfs_dev->platform_data = NULL;
> + kfree(super);
> + put_device(disk->driverfs_dev);
> + }
> + spin_unlock(&super_list.lock);
> +
> + unregister_filesystem(&azfs_fs);
> +
> + kmem_cache_destroy(azfs_block_cache);
> + kmem_cache_destroy(azfs_znode_cache);
> +}
> +
> +module_init(azfs_init);
> +module_exit(azfs_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Maxim Shchetynin <maxim@de.ibm.com>");
> +MODULE_DESCRIPTION("Non-buffered file system for IO devices");
> --
> 1.5.4.3
>
* Re: [Cbe-oss-dev] [patch 7/9] azfs: initial submit of azfs, a non-buffered filesystem
2008-07-15 19:51 ` [patch 7/9] azfs: initial submit of azfs, a non-buffered filesystem arnd
2008-07-17 6:13 ` Benjamin Herrenschmidt
@ 2008-07-22 9:49 ` Christoph Hellwig
1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2008-07-22 9:49 UTC (permalink / raw)
To: arnd; +Cc: cbe-oss-dev, linuxppc-dev
On Tue, Jul 15, 2008 at 09:51:46PM +0200, arnd@arndb.de wrote:
> From: Maxim Shchetynin <maxim@linux.vnet.ibm.com>
>
> AZFS is a file system which keeps all files on memory mapped random
> access storage. It was designed to work on the axonram device driver
> for IBM QS2x blade servers, but can operate on any block device
> that exports a direct_access method.
I don't think it's quite ready yet. I've had another look through the
code and here are some issues I came up with:
- first thing is that it's not sparse clean, which is a bit of a red
flag. The __iomem and __user annotations are there for a reason.
- then even with annotations we still have the issue of
copy_{from,to}_user into mmio regions. According to benh that's
still an open issue, but it at least needs very good explanations in
comments.
- the aio_read and aio_write methods don't handle actually vectored
writes. They need to iterate over all iovecs. Or just implement
plain read/write given that it's not actually asynchronous.
- the persistent superblock hack needs to go away. Just clean up
everything in ->put_super. If we want a fully persistent fs
we should just be using ext2 + xip
- azfs_open looks very fishy. there's never a need to do seeks
inside open. if O_APPEND is set the VFS makes sure the read and
write methods get the right ppos pointer passed.
And truncation is done by the VFS for O_TRUNC opens through
->setattr
- azfs_znode should not have a size field of its own; the
filesystem should only use the inode one
- the lists and inode_init_once should be called from the slab
constructor as in other filesystems
- I don't think there is any point of having a slab cache
for the azfs_block structures
- disk->driverfs_dev is not writable by the filesystem, but for
driver use. The information azfs stores in there is not used
anyway, so it could easily be removed.
- lots of duplicated fields in azfs_super where the superblock
ones should be used:
media_size -> sb->s_maxbytes
sector_size -> not needed at all
blkdev -> sb->s_bdev
root -> sb->s_root
* [patch 8/9] powerpc/dma: use the struct dma_attrs in iommu code
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
` (6 preceding siblings ...)
2008-07-15 19:51 ` [patch 7/9] azfs: initial submit of azfs, a non-buffered filesystem arnd
@ 2008-07-15 19:51 ` arnd
2008-07-15 19:51 ` [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code arnd
8 siblings, 0 replies; 25+ messages in thread
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Mark Nelson
From: Mark Nelson <markn@au1.ibm.com>
Update iommu_alloc() to take the struct dma_attrs and pass them on to
tce_build(). This change propagates down to the tce_build functions of
all the platforms.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
arch/powerpc/kernel/iommu.c | 13 ++++++++-----
arch/powerpc/platforms/cell/iommu.c | 5 +++--
arch/powerpc/platforms/iseries/iommu.c | 3 ++-
arch/powerpc/platforms/pasemi/iommu.c | 3 ++-
arch/powerpc/platforms/pseries/iommu.c | 14 +++++++++-----
arch/powerpc/sysdev/dart_iommu.c | 3 ++-
include/asm-powerpc/machdep.h | 3 ++-
7 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8c68ee9..2385f68 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -186,7 +186,8 @@ static unsigned long iommu_range_alloc(struct device *dev,
static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
void *page, unsigned int npages,
enum dma_data_direction direction,
- unsigned long mask, unsigned int align_order)
+ unsigned long mask, unsigned int align_order,
+ struct dma_attrs *attrs)
{
unsigned long entry, flags;
dma_addr_t ret = DMA_ERROR_CODE;
@@ -205,7 +206,7 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
/* Put the TCEs in the HW table */
ppc_md.tce_build(tbl, entry, npages, (unsigned long)page & IOMMU_PAGE_MASK,
- direction);
+ direction, attrs);
/* Flush/invalidate TLB caches if necessary */
@@ -336,7 +337,8 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
npages, entry, dma_addr);
/* Insert into HW table */
- ppc_md.tce_build(tbl, entry, npages, vaddr & IOMMU_PAGE_MASK, direction);
+ ppc_md.tce_build(tbl, entry, npages, vaddr & IOMMU_PAGE_MASK,
+ direction, attrs);
/* If we are in an open segment, try merging */
if (segstart != s) {
@@ -573,7 +575,8 @@ dma_addr_t iommu_map_single(struct device *dev, struct iommu_table *tbl,
align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
dma_handle = iommu_alloc(dev, tbl, vaddr, npages, direction,
- mask >> IOMMU_PAGE_SHIFT, align);
+ mask >> IOMMU_PAGE_SHIFT, align,
+ attrs);
if (dma_handle == DMA_ERROR_CODE) {
if (printk_ratelimit()) {
printk(KERN_INFO "iommu_alloc failed, "
@@ -642,7 +645,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
nio_pages = size >> IOMMU_PAGE_SHIFT;
io_order = get_iommu_order(size);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
- mask >> IOMMU_PAGE_SHIFT, io_order);
+ mask >> IOMMU_PAGE_SHIFT, io_order, NULL);
if (mapping == DMA_ERROR_CODE) {
free_pages((unsigned long)ret, order);
return NULL;
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index eeacb3a..3b70784 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -173,7 +173,8 @@ static void invalidate_tce_cache(struct cbe_iommu *iommu, unsigned long *pte,
}
static void tce_build_cell(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction)
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
int i;
unsigned long *io_pte, base_pte;
@@ -519,7 +520,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
__set_bit(0, window->table.it_map);
tce_build_cell(&window->table, window->table.it_offset, 1,
- (unsigned long)iommu->pad_page, DMA_TO_DEVICE);
+ (unsigned long)iommu->pad_page, DMA_TO_DEVICE, NULL);
window->table.it_hint = window->table.it_blocksize;
return window;
diff --git a/arch/powerpc/platforms/iseries/iommu.c b/arch/powerpc/platforms/iseries/iommu.c
index ab5d868..bc818e4 100644
--- a/arch/powerpc/platforms/iseries/iommu.c
+++ b/arch/powerpc/platforms/iseries/iommu.c
@@ -42,7 +42,8 @@
#include <asm/iseries/iommu.h>
static void tce_build_iSeries(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction)
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
u64 rc;
u64 tce, rpn;
diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index 86967bd..70541b7 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -85,7 +85,8 @@ static int iommu_table_iobmap_inited;
static void iobmap_build(struct iommu_table *tbl, long index,
long npages, unsigned long uaddr,
- enum dma_data_direction direction)
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
u32 *ip;
u32 rpn;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 9a12908..5377dd4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -50,7 +50,8 @@
static void tce_build_pSeries(struct iommu_table *tbl, long index,
long npages, unsigned long uaddr,
- enum dma_data_direction direction)
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
u64 proto_tce;
u64 *tcep;
@@ -95,7 +96,8 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
static void tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
long npages, unsigned long uaddr,
- enum dma_data_direction direction)
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
u64 rc;
u64 proto_tce, tce;
@@ -127,7 +129,8 @@ static DEFINE_PER_CPU(u64 *, tce_page) = NULL;
static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long npages, unsigned long uaddr,
- enum dma_data_direction direction)
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
u64 rc;
u64 proto_tce;
@@ -136,7 +139,8 @@ static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long l, limit;
if (npages == 1) {
- tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, direction);
+ tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+ direction, attrs);
return;
}
@@ -150,7 +154,7 @@ static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
/* If allocation fails, fall back to the loop implementation */
if (!tcep) {
tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
- direction);
+ direction, attrs);
return;
}
__get_cpu_var(tce_page) = tcep;
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index 005c2ec..de8c8b5 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -149,7 +149,8 @@ static void dart_flush(struct iommu_table *tbl)
static void dart_build(struct iommu_table *tbl, long index,
long npages, unsigned long uaddr,
- enum dma_data_direction direction)
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
unsigned int *dp;
unsigned int rpn;
diff --git a/include/asm-powerpc/machdep.h b/include/asm-powerpc/machdep.h
index 9899226..1233d73 100644
--- a/include/asm-powerpc/machdep.h
+++ b/include/asm-powerpc/machdep.h
@@ -80,7 +80,8 @@ struct machdep_calls {
long index,
long npages,
unsigned long uaddr,
- enum dma_data_direction direction);
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs);
void (*tce_free)(struct iommu_table *tbl,
long index,
long npages);
--
1.5.4.3
--
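For readers unfamiliar with the pattern, the shape of this change — threading an opaque `struct dma_attrs *` from the top-level mapping call down through `iommu_alloc()` to the platform `tce_build` hook — can be sketched in plain C. All names below are illustrative stand-ins, not the kernel's real types or functions:

```c
/* Illustrative stand-in for the kernel's struct dma_attrs. */
struct dma_attrs { unsigned long flags; };

static unsigned long last_attrs_seen;

/* Platform hook, analogous to ppc_md.tce_build(): now also receives attrs. */
static void tce_build_stub(long index, long npages, struct dma_attrs *attrs)
{
	(void)index; (void)npages;	/* unused in this sketch */
	/* A platform may inspect attrs; NULL means "no special attributes". */
	last_attrs_seen = attrs ? attrs->flags : 0;
}

/* Mid layer, analogous to iommu_alloc(): forwards attrs unchanged. */
static int iommu_alloc_stub(long npages, struct dma_attrs *attrs)
{
	tce_build_stub(0, npages, attrs);
	return 0;
}

/* Top level, analogous to iommu_map_single(): the caller supplies attrs. */
static int map_single_stub(struct dma_attrs *attrs)
{
	return iommu_alloc_stub(1, attrs);
}
```

Callers that have no attributes to convey simply pass NULL, as iommu_alloc_coherent() does in the patch above.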
* [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-15 19:51 [patch 0/9] Cell patches for 2.6.27, version 2 arnd
` (7 preceding siblings ...)
2008-07-15 19:51 ` [patch 8/9] powerpc/dma: use the struct dma_attrs in iommu code arnd
@ 2008-07-15 19:51 ` arnd
2008-07-15 20:34 ` Roland Dreier
8 siblings, 1 reply; 25+ messages in thread
From: arnd @ 2008-07-15 19:51 UTC (permalink / raw)
To: benh, cbe-oss-dev, linuxppc-dev; +Cc: Mark Nelson
From: Mark Nelson <markn@au1.ibm.com>
Introduce a new dma attribute DMA_ATTR_STRONG_ORDERING to use strong ordering
on DMA mappings in the Cell processor. Add the code to the Cell's IOMMU
implementation to use this.
The current Cell IOMMU implementation sets the IOPTE_SO_RW bits in all IOPTEs
(for both the dynamic and fixed mappings), which enforces strong ordering of
both reads and writes. This patch makes weak ordering (the IOPTE_SO_RW bits
not set) the default behaviour; to request a strongly ordered mapping, the
new DMA_ATTR_STRONG_ORDERING attribute needs to be used.
Dynamic mappings can be weakly or strongly ordered on an individual basis
but the fixed mapping is always weakly ordered.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Documentation/DMA-attributes.txt | 12 +++++
arch/powerpc/platforms/cell/iommu.c | 93 ++++++++++++++++++++++++++++++++---
include/linux/dma-attrs.h | 1 +
3 files changed, 99 insertions(+), 7 deletions(-)
diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt
index 6d772f8..f2d2800 100644
--- a/Documentation/DMA-attributes.txt
+++ b/Documentation/DMA-attributes.txt
@@ -22,3 +22,15 @@ ready and available in memory. The DMA of the "completion indication"
could race with data DMA. Mapping the memory used for completion
indications with DMA_ATTR_WRITE_BARRIER would prevent the race.
+
+DMA_ATTR_STRONG_ORDERING
+----------------------
+
+DMA_ATTR_STRONG_ORDERING specifies that previous reads and writes are
+performed in the order in which they're received by the IOMMU; thus
+reads and writes may not pass each other.
+
+Platforms that are strongly ordered by default will ignore this new
+attribute but platforms that are weakly ordered by default should not
+ignore this new attribute. Instead, they should return an error if a
+strongly ordered mapping cannot be used when one is requested.
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 3b70784..7f6ed20 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -194,11 +194,13 @@ static void tce_build_cell(struct iommu_table *tbl, long index, long npages,
const unsigned long prot = 0xc48;
base_pte =
((prot << (52 + 4 * direction)) & (IOPTE_PP_W | IOPTE_PP_R))
- | IOPTE_M | IOPTE_SO_RW | (window->ioid & IOPTE_IOID_Mask);
+ | IOPTE_M | (window->ioid & IOPTE_IOID_Mask);
#else
- base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M | IOPTE_SO_RW |
+ base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M |
(window->ioid & IOPTE_IOID_Mask);
#endif
+ if (unlikely(dma_get_attr(DMA_ATTR_STRONG_ORDERING, attrs)))
+ base_pte |= IOPTE_SO_RW;
io_pte = (unsigned long *)tbl->it_base + (index - tbl->it_offset);
@@ -539,7 +541,6 @@ static struct cbe_iommu *cell_iommu_for_node(int nid)
static unsigned long cell_dma_direct_offset;
static unsigned long dma_iommu_fixed_base;
-struct dma_mapping_ops dma_iommu_fixed_ops;
static struct iommu_table *cell_get_iommu_table(struct device *dev)
{
@@ -563,6 +564,85 @@ static struct iommu_table *cell_get_iommu_table(struct device *dev)
return &window->table;
}
+static void *dma_fixed_alloc_coherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flag)
+{
+ return dma_direct_ops.alloc_coherent(dev, size, dma_handle, flag);
+}
+
+static void dma_fixed_free_coherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_handle)
+{
+ dma_direct_ops.free_coherent(dev, size, vaddr, dma_handle);
+}
+
+static dma_addr_t dma_fixed_map_single(struct device *dev, void *ptr,
+ size_t size,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (dma_get_attr(DMA_ATTR_STRONG_ORDERING, attrs))
+ return iommu_map_single(dev, cell_get_iommu_table(dev), ptr,
+ size, device_to_mask(dev), direction,
+ attrs);
+ else
+ return dma_direct_ops.map_single(dev, ptr, size, direction,
+ attrs);
+}
+
+static void dma_fixed_unmap_single(struct device *dev, dma_addr_t dma_addr,
+ size_t size,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (dma_get_attr(DMA_ATTR_STRONG_ORDERING, attrs))
+ iommu_unmap_single(cell_get_iommu_table(dev), dma_addr, size,
+ direction, attrs);
+ else
+ dma_direct_ops.unmap_single(dev, dma_addr, size, direction,
+ attrs);
+}
+
+static int dma_fixed_map_sg(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (dma_get_attr(DMA_ATTR_STRONG_ORDERING, attrs))
+ return iommu_map_sg(dev, cell_get_iommu_table(dev), sg, nents,
+ device_to_mask(dev), direction, attrs);
+ else
+ return dma_direct_ops.map_sg(dev, sg, nents, direction, attrs);
+}
+
+static void dma_fixed_unmap_sg(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (dma_get_attr(DMA_ATTR_STRONG_ORDERING, attrs))
+ iommu_unmap_sg(cell_get_iommu_table(dev), sg, nents, direction,
+ attrs);
+ else
+ dma_direct_ops.unmap_sg(dev, sg, nents, direction, attrs);
+}
+
+static int dma_fixed_dma_supported(struct device *dev, u64 mask)
+{
+ return mask == DMA_64BIT_MASK;
+}
+
+static int dma_set_mask_and_switch(struct device *dev, u64 dma_mask);
+
+struct dma_mapping_ops dma_iommu_fixed_ops = {
+ .alloc_coherent = dma_fixed_alloc_coherent,
+ .free_coherent = dma_fixed_free_coherent,
+ .map_single = dma_fixed_map_single,
+ .unmap_single = dma_fixed_unmap_single,
+ .map_sg = dma_fixed_map_sg,
+ .unmap_sg = dma_fixed_unmap_sg,
+ .dma_supported = dma_fixed_dma_supported,
+ .set_dma_mask = dma_set_mask_and_switch,
+};
+
static void cell_dma_dev_setup_fixed(struct device *dev);
static void cell_dma_dev_setup(struct device *dev)
@@ -919,9 +999,11 @@ static void cell_iommu_setup_fixed_ptab(struct cbe_iommu *iommu,
pr_debug("iommu: mapping 0x%lx pages from 0x%lx\n", fsize, fbase);
- base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M | IOPTE_SO_RW
+ base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M
| (cell_iommu_get_ioid(np) & IOPTE_IOID_Mask);
+ pr_info("IOMMU: Using weak ordering for fixed mapping\n");
+
for (uaddr = 0; uaddr < fsize; uaddr += (1 << 24)) {
/* Don't touch the dynamic region */
ioaddr = uaddr + fbase;
@@ -1037,9 +1119,6 @@ static int __init cell_iommu_fixed_mapping_init(void)
cell_iommu_setup_window(iommu, np, dbase, dsize, 0);
}
- dma_iommu_fixed_ops = dma_direct_ops;
- dma_iommu_fixed_ops.set_dma_mask = dma_set_mask_and_switch;
-
dma_iommu_ops.set_dma_mask = dma_set_mask_and_switch;
set_pci_dma_ops(&dma_iommu_ops);
diff --git a/include/linux/dma-attrs.h b/include/linux/dma-attrs.h
index 1677e2b..600fea7 100644
--- a/include/linux/dma-attrs.h
+++ b/include/linux/dma-attrs.h
@@ -12,6 +12,7 @@
*/
enum dma_attr {
DMA_ATTR_WRITE_BARRIER,
+ DMA_ATTR_STRONG_ORDERING,
DMA_ATTR_MAX,
};
--
1.5.4.3
--
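The core of the IOMMU-side change can be modelled in isolation: the strong-ordering bit is left out of the base PTE and OR-ed back in only when the caller asked for it. The bit values and helper below are made-up stand-ins for illustration; the real IOPTE_* masks and dma_get_attr() live in the kernel:

```c
/* Made-up bit positions; the real IOPTE_* masks are Cell-specific. */
#define IOPTE_PP_W	(1ul << 0)
#define IOPTE_PP_R	(1ul << 1)
#define IOPTE_M		(1ul << 2)
#define IOPTE_SO_RW	(1ul << 3)	/* strong ordering, reads and writes */

#define ATTR_STRONG_ORDERING	(1ul << 1)	/* stand-in for the dma attr */

struct dma_attrs { unsigned long flags; };

static int dma_get_attr_stub(unsigned long attr, struct dma_attrs *attrs)
{
	return attrs && (attrs->flags & attr);
}

/* Mirrors the patch: SO_RW is no longer part of the default PTE and is
 * set only for mappings that explicitly request strong ordering. */
static unsigned long build_base_pte(struct dma_attrs *attrs)
{
	unsigned long base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M;

	if (dma_get_attr_stub(ATTR_STRONG_ORDERING, attrs))
		base_pte |= IOPTE_SO_RW;
	return base_pte;
}
```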
* Re: [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-15 19:51 ` [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code arnd
@ 2008-07-15 20:34 ` Roland Dreier
2008-07-15 21:27 ` Arnd Bergmann
0 siblings, 1 reply; 25+ messages in thread
From: Roland Dreier @ 2008-07-15 20:34 UTC (permalink / raw)
To: arnd; +Cc: Mark Nelson, cbe-oss-dev, linuxppc-dev
Sorry for the late comments, I missed this when it went by before.
> +DMA_ATTR_STRONG_ORDERING
> +----------------------
> +
> +DMA_ATTR_STRONG_ORDERING specifies that previous reads and writes are
> +performed in the order in which they're received by the IOMMU; thus
> +reads and writes may not pass each other.
I don't understand what this is trying to say. What is "previous"
referring to? What does "received by the IOMMU" mean -- do you mean
issued onto the bus by the CPU? When you say "reads and writes may not
pass each other," do you mean just that reads may not pass writes and
writes may not pass reads, or do you mean that reads also can't pass
reads and writes can't pass writes?
Since I don't know exactly what this attribute does, I can't be sure,
but it seems that making weak ordering the default is dangerous in that
it breaks drivers that expect usual memory ordering semantics. Would it
be safer/better to make strong ordering the default and then add a
"WEAK_ORDERING" attribute that drivers can use as an optimization?
- R.
* Re: [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-15 20:34 ` Roland Dreier
@ 2008-07-15 21:27 ` Arnd Bergmann
2008-07-16 2:18 ` Roland Dreier
0 siblings, 1 reply; 25+ messages in thread
From: Arnd Bergmann @ 2008-07-15 21:27 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Mark Nelson, Roland Dreier, cbe-oss-dev
On Tuesday 15 July 2008, Roland Dreier wrote:
> Sorry for the late comments, I missed this when it went by before.
>
>  > +DMA_ATTR_STRONG_ORDERING
>  > +----------------------
>  > +
>  > +DMA_ATTR_STRONG_ORDERING specifies that previous reads and writes are
>  > +performed in the order in which they're received by the IOMMU; thus
>  > +reads and writes may not pass each other.
>
> I don't understand what this is trying to say.  What is "previous"
> referring to? What does "received by the IOMMU" mean -- do you mean
> issued onto the bus by the CPU?
This is all about inbound transfers, i.e. DMAs coming from the I/O bridge
into the CPU, both DMA read and DMA write.
The relevant paragraph in the specification is
"If the SO bits in the I/O page table entry = '11' and the IOIF S-bit is '1',
this READ or WRITE cannot be placed on the EIB until all previous READs and
WRITEs from this CVCID and IOID have gotten an ACK or NULL type snoop response."
Normally, this is only true for accesses going to the same cache line,
accesses from one device to different cache lines that are issued in order
also send their response in-order (unless you get an I/O exception, which means
you're toast), but can arrive at the I/O location out of order.
My interpretation is that strong ordering basically turns our whole I/O
subsystem into single-issue non-posted accesses (all devices are currently
configured to use the same CVCID and IOID), so we really should not
do that.
> When you say "reads and writes may not
> pass each other," do you mean just that reads may not pass writes and
> writes may not pass reads, or do you mean that reads also can't pass
> reads and writes can't pass writes?
>
> Since I don't know exactly what this attribute does, I can't be sure,
> but it seems that making weak ordering the default is dangerous in that
> it breaks drivers that expect usual memory ordering semantics.  Would it
> be safer/better to make strong ordering the default and then add a
> "WEAK_ORDERING" attribute that drivers can use as an optimization?
With all our existing hardware, the I/O bridge overrides the setting
to select weak ordering. The purpose of this patch is to change the
default so that the bridge does not force weak ordering any more and
some drivers are free to set strong ordering without impacting performance
on the other drivers.
Strong ordering is only active when both the bridge and the IOMMU enable
it, but for correctly written drivers, this only results in a slowdown.
Arnd <><
* Re: [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-15 21:27 ` Arnd Bergmann
@ 2008-07-16 2:18 ` Roland Dreier
2008-07-16 7:54 ` Arnd Bergmann
0 siblings, 1 reply; 25+ messages in thread
From: Roland Dreier @ 2008-07-16 2:18 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Mark Nelson, linuxppc-dev, cbe-oss-dev
> This is all about inbound transfers, i.e. DMAs coming from the I/O bridge
> into the CPU, both DMA read and DMA write.
OK -- at a minimum I think the documentation should make that clear. As
it stands the description of what this does is pretty much impossible to
parse and understand.
> Strong ordering is only active when both the bridge and the IOMMU enable
> it, but for correctly written drivers, this only results in a slowdown.
So when would someone use this dma attribute? As a hack to fix drivers
where the real fix is too complicated?
- R.
* Re: [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-16 2:18 ` Roland Dreier
@ 2008-07-16 7:54 ` Arnd Bergmann
2008-07-17 6:20 ` [Cbe-oss-dev] " Benjamin Herrenschmidt
0 siblings, 1 reply; 25+ messages in thread
From: Arnd Bergmann @ 2008-07-16 7:54 UTC (permalink / raw)
To: Roland Dreier; +Cc: Mark Nelson, linuxppc-dev, cbe-oss-dev
On Wednesday 16 July 2008, Roland Dreier wrote:
>  > Strong ordering is only active when both the bridge and the IOMMU enable
>  > it, but for correctly written drivers, this only results in a slowdown.
>
> So when would someone use this dma attribute?  As a hack to fix drivers
> where the real fix is too complicated?
This is used in the Axon PCIe endpoint drivers, e.g. in the Roadrunner
machine. The reason was to improve roundtrip latency by doing only
mmio stores, not loads, on each side of the PCIe connection, which
turn into posted DMA operations on the other end. With relaxed ordering,
the posted writes may be observed out of order. Strong ordering makes
sure they arrive in-order without having to do a non-posted mmio read
or eieio operation on the receiver side.
Arnd <><
* Re: [Cbe-oss-dev] [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-16 7:54 ` Arnd Bergmann
@ 2008-07-17 6:20 ` Benjamin Herrenschmidt
2008-07-17 14:53 ` Arnd Bergmann
0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-07-17 6:20 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linuxppc-dev, Roland Dreier, cbe-oss-dev
On Wed, 2008-07-16 at 09:54 +0200, Arnd Bergmann wrote:
> On Wednesday 16 July 2008, Roland Dreier wrote:
> > > Strong ordering is only active when both the bridge and the IOMMU enable
> > > it, but for correctly written drivers, this only results in a slowdown.
> >
> > So when would someone use this dma attribute? As a hack to fix drivers
> > where the real fix is too complicated?
>
> This is used in the Axon PCIe endpoint drivers, e.g. in the Roadrunner
> machine. The reason was to improve roundtrip latency by doing only
> mmio stores, not loads, on each side of the PCIe connection, which
> turn into posted DMA operations on the other end. With relaxed ordering,
> the posted writes may be observed out of order. Strong ordering makes
> sure they arrive in-order without having to do a non-posted mmio read
> or eieio operation on the receiver side.
I don't think it's legal for writes from a given initiator to arrive at
memory out of order.
Some drivers, notably network drivers, for example, rely on the "OWN"
bit being written last in memory when writing back ring buffer status.
If the bit arrives before the actual data, then data corruption will
occur.
Ben.
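The dependency Ben describes can be sketched as a minimal ring-descriptor consumer: the host treats the payload as valid the moment OWN flips, which is only safe if the device's payload writes cannot overtake its final OWN-bit write on the way to memory. The descriptor layout and names here are invented for illustration:

```c
#include <string.h>

/* Simplified RX descriptor: the device writes the payload first, then
 * clears OWN as its very last store -- the ordering drivers rely on. */
struct rx_desc {
	volatile unsigned int own;	/* 1 = device owns it, 0 = host may read */
	char data[64];
};

/* Host-side poll: consuming data here is safe only if the device's
 * payload writes arrived in memory before its OWN-bit write. */
static int rx_poll(struct rx_desc *d, char *out, size_t len)
{
	if (d->own)
		return 0;	/* device still owns the descriptor */
	memcpy(out, d->data, len);
	return 1;
}
```

With weakly ordered inbound DMA, the OWN write could become visible before the payload, and rx_poll() would hand back stale data.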
* Re: [Cbe-oss-dev] [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-17 6:20 ` [Cbe-oss-dev] " Benjamin Herrenschmidt
@ 2008-07-17 14:53 ` Arnd Bergmann
2008-07-17 20:10 ` Benjamin Herrenschmidt
2008-07-17 20:10 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 25+ messages in thread
From: Arnd Bergmann @ 2008-07-17 14:53 UTC (permalink / raw)
To: cbe-oss-dev, benh
Cc: linuxppc-dev, Roland Dreier, Peter Altevogt, Hans Boettiger
On Thursday 17 July 2008, Benjamin Herrenschmidt wrote:
> On Wed, 2008-07-16 at 09:54 +0200, Arnd Bergmann wrote:
> > On Wednesday 16 July 2008, Roland Dreier wrote:
> > > > Strong ordering is only active when both the bridge and the IOMMU
> > > > enable it, but for correctly written drivers, this only results in a
> > > > slowdown.
> > >
> > > So when would someone use this dma attribute? As a hack to fix drivers
> > > where the real fix is too complicated?
> >
> > This is used in the Axon PCIe endpoint drivers, e.g. in the Roadrunner
> > machine. The reason was to improve roundtrip latency by doing only
> > mmio stores, not loads, on each side of the PCIe connection, which
> > turn into posted DMA operations on the other end. With relaxed ordering,
> > the posted writes may be observed out of order. Strong ordering makes
> > sure they arrive in-order without having to do a non-posted mmio read
> > or eieio operation on the receiver side.
>
> I don't think it's legal for writes from a given initiator to arrive to
> memory out of order.
>
> Some drivers, notably network drivers, for example, rely on the "OWN"
> bit being written last in memory when writing back ring buffer status.
>
> If the bit arrives before the actual data, then data corruption will
> occur.
Ok, this makes sense. I've followed the bit down in the specification,
and now it seems like we can't just set relaxed ordering in the IOMMU
but should use the value that comes from the PCIe device.
The flow of the order bit in this machine is as follows:
1. The device can select relaxed (weak) or non-relaxed (strong) ordering
for a DMA transfer. PCI-X is always strong, DMAx can be configured globally,
and PCIe is device specific.
2. The PCIe root complex can override the order bit and force it to strong
ordering (which we don't).
3. The PLB5-to-C3PO bridge can override the bit and force it to weak or
strong or leave it alone (we force it to weak).
4. The IOMMU can force the bit to weak on a per-page base (we don't without
the patch, but do with the patch).
Peter and Hans were involved in the discussion that led to the decision
to change step 3 from per-transfer default to always weak ordering.
I think they verified that this is safe for all the peripherals that we
have on the QS21 and QS22 blades (tg3, ehci, mthca, mptsas), but that
doesn't mean that it is safe in general, so I guess you are right that
we should not make it the default in the kernel for Cell systems.
Hans, can you confirm this?
Arnd <><
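The four-stage flow above amounts to a chain of overrides applied to the device's requested ordering. A simplified model (ignoring the per-stage restrictions on which direction each hop may actually force) makes the QS2x situation easy to see:

```c
enum ord { WEAK = 0, STRONG = 1 };
enum override { PASS, FORCE_WEAK, FORCE_STRONG };

static enum ord apply(enum ord in, enum override ov)
{
	if (ov == FORCE_WEAK)
		return WEAK;
	if (ov == FORCE_STRONG)
		return STRONG;
	return in;
}

/* Stages 1-4 from the mail: device -> root complex -> bridge -> IOMMU page. */
static enum ord effective_ordering(enum ord device,
				   enum override root_complex,
				   enum override bridge,
				   enum override iommu_page)
{
	enum ord o = device;

	o = apply(o, root_complex);
	o = apply(o, bridge);
	o = apply(o, iommu_page);
	return o;
}
```

In the configuration described, the bridge stage is FORCE_WEAK, so a device requesting strong ordering still ends up weakly ordered regardless of what the IOMMU page entry says.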
* Re: [Cbe-oss-dev] [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-17 14:53 ` Arnd Bergmann
@ 2008-07-17 20:10 ` Benjamin Herrenschmidt
2008-07-17 20:10 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-07-17 20:10 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linuxppc-dev, Roland Dreier, Peter Altevogt, cbe-oss-dev,
Hans Boettiger
On Thu, 2008-07-17 at 16:53 +0200, Arnd Bergmann wrote:
>
> Peter and Hans were involved in the discussion that led to the
> decision
> to change step 3 from per-transfer default to always weak ordering.
> I think they verified that this is safe for all the peripherals that
> we
> have on the QS21 and QS22 blades (tg3, ehci, mthca, mptsas), but that
> doesn't mean that it is safe in general, so I guess you are right that
> we should not make it the default in the kernel for Cell systems.
> Hans, can you confirm this?
I'm surprised with tg3. We need to make sure that updates to the
hw status block are properly ordered vs writes to the ring, among
other things.
Ben.
* Re: [Cbe-oss-dev] [patch 9/9] powerpc/cell: Add DMA_ATTR_STRONG_ORDERING dma attribute and use in IOMMU code
2008-07-17 14:53 ` Arnd Bergmann
2008-07-17 20:10 ` Benjamin Herrenschmidt
@ 2008-07-17 20:10 ` Benjamin Herrenschmidt
2008-07-18 13:03 ` [PATCH] Add DMA_ATTR_WEAK_ORDERING dma attribute and use in Cell " Arnd Bergmann
1 sibling, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-07-17 20:10 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linuxppc-dev, Roland Dreier, Peter Altevogt, cbe-oss-dev,
Hans Boettiger
> Ok, this makes sense. I've followed the bit down in the specification,
> and now it seems like we can't just set relaxed ordering in the IOMMU
> but should use the value that comes from the PCIe device.
>
> The flow of the order bit in this machine is as follows:
>
> 1. The device can select relaxed (weak) or non-relaxed (strong) ordering
> for a DMA transfer. PCI-X is always strong, DMAx can be configured globally,
> and PCIe is device specific.
> 2. The PCIe root complex can override the order bit and force it to strong
> ordering (which we don't).
> 3. The PLB5-to-C3PO bridge can override the bit and force it to weak or
> strong or leave it alone (we force it to weak).
> 4. The IOMMU can force the bit to weak on a per-page base (we don't without
> the patch, but do with the patch).
>
> Peter and Hans were involved in the discussion that led to the decision
> to change step 3 from per-transfer default to always weak ordering.
> I think they verified that this is safe for all the peripherals that we
> have on the QS21 and QS22 blades (tg3, ehci, mthca, mptsas), but that
> doesn't mean that it is safe in general, so I guess you are right that
> we should not make it the default in the kernel for Cell systems.
> Hans, can you confirm this?
In the meantime, send a patch that defaults to strong with explicit
weak; we can easily fix it up after that.
Ben.
* [PATCH] Add DMA_ATTR_WEAK_ORDERING dma attribute and use in Cell IOMMU code
2008-07-17 20:10 ` Benjamin Herrenschmidt
@ 2008-07-18 13:03 ` Arnd Bergmann
2008-07-19 7:29 ` [Cbe-oss-dev] " Jeremy Kerr
0 siblings, 1 reply; 25+ messages in thread
From: Arnd Bergmann @ 2008-07-18 13:03 UTC (permalink / raw)
To: benh
Cc: Mark Nelson, Roland Dreier, Hans Boettiger, linuxppc-dev,
Peter Altevogt, cbe-oss-dev
Introduce a new dma attribute, DMA_ATTR_WEAK_ORDERING, to request weak
ordering on DMA mappings on the Cell processor, and add code to the Cell
IOMMU implementation to honour the attribute.
Dynamic mappings can be weakly or strongly ordered on an individual basis,
but the fixed mapping has to be either completely strong or completely weak.
This is currently decided by a kernel boot option (pass iommu_fixed=weak
for a weakly ordered fixed linear mapping; strongly ordered is the default).
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
On Thursday 17 July 2008, Benjamin Herrenschmidt wrote:
> In the meantime, send a patch that defaults to strong with explicit
> weak, we can easily fixup after that.
>
Ok, this is the previous version of the patch from Mark, which
does exactly that. Please use this one instead.
Documentation/DMA-attributes.txt | 9 ++
arch/powerpc/platforms/cell/iommu.c | 113 ++++++++++++++++++++++++++++++++++--
include/linux/dma-attrs.h | 1
3 files changed, 118 insertions(+), 5 deletions(-)
Index: linux-2.6/arch/powerpc/platforms/cell/iommu.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/iommu.c
+++ linux-2.6/arch/powerpc/platforms/cell/iommu.c
@@ -198,6 +198,8 @@ static void tce_build_cell(struct iommu_
base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M | IOPTE_SO_RW |
(window->ioid & IOPTE_IOID_Mask);
#endif
+ if (unlikely(dma_get_attr(DMA_ATTR_WEAK_ORDERING, attrs)))
+ base_pte &= ~IOPTE_SO_RW;
io_pte = (unsigned long *)tbl->it_base + (index - tbl->it_offset);
@@ -538,7 +540,9 @@ static struct cbe_iommu *cell_iommu_for_
static unsigned long cell_dma_direct_offset;
static unsigned long dma_iommu_fixed_base;
-struct dma_mapping_ops dma_iommu_fixed_ops;
+
+/* iommu_fixed_is_weak is set if booted with iommu_fixed=weak */
+static int iommu_fixed_is_weak;
static struct iommu_table *cell_get_iommu_table(struct device *dev)
{
@@ -562,6 +566,98 @@ static struct iommu_table *cell_get_iomm
return &window->table;
}
+/* A coherent allocation implies strong ordering */
+
+static void *dma_fixed_alloc_coherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flag)
+{
+ if (iommu_fixed_is_weak)
+ return iommu_alloc_coherent(dev, cell_get_iommu_table(dev),
+ size, dma_handle,
+ device_to_mask(dev), flag,
+ dev->archdata.numa_node);
+ else
+ return dma_direct_ops.alloc_coherent(dev, size, dma_handle,
+ flag);
+}
+
+static void dma_fixed_free_coherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_handle)
+{
+ if (iommu_fixed_is_weak)
+ iommu_free_coherent(cell_get_iommu_table(dev), size, vaddr,
+ dma_handle);
+ else
+ dma_direct_ops.free_coherent(dev, size, vaddr, dma_handle);
+}
+
+static dma_addr_t dma_fixed_map_single(struct device *dev, void *ptr,
+ size_t size,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (iommu_fixed_is_weak == dma_get_attr(DMA_ATTR_WEAK_ORDERING, attrs))
+ return dma_direct_ops.map_single(dev, ptr, size, direction,
+ attrs);
+ else
+ return iommu_map_single(dev, cell_get_iommu_table(dev), ptr,
+ size, device_to_mask(dev), direction,
+ attrs);
+}
+
+static void dma_fixed_unmap_single(struct device *dev, dma_addr_t dma_addr,
+ size_t size,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (iommu_fixed_is_weak == dma_get_attr(DMA_ATTR_WEAK_ORDERING, attrs))
+ dma_direct_ops.unmap_single(dev, dma_addr, size, direction,
+ attrs);
+ else
+ iommu_unmap_single(cell_get_iommu_table(dev), dma_addr, size,
+ direction, attrs);
+}
+
+static int dma_fixed_map_sg(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (iommu_fixed_is_weak == dma_get_attr(DMA_ATTR_WEAK_ORDERING, attrs))
+ return dma_direct_ops.map_sg(dev, sg, nents, direction, attrs);
+ else
+ return iommu_map_sg(dev, cell_get_iommu_table(dev), sg, nents,
+ device_to_mask(dev), direction, attrs);
+}
+
+static void dma_fixed_unmap_sg(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ if (iommu_fixed_is_weak == dma_get_attr(DMA_ATTR_WEAK_ORDERING, attrs))
+ dma_direct_ops.unmap_sg(dev, sg, nents, direction, attrs);
+ else
+ iommu_unmap_sg(cell_get_iommu_table(dev), sg, nents, direction,
+ attrs);
+}
+
+static int dma_fixed_dma_supported(struct device *dev, u64 mask)
+{
+ return mask == DMA_64BIT_MASK;
+}
+
+static int dma_set_mask_and_switch(struct device *dev, u64 dma_mask);
+
+struct dma_mapping_ops dma_iommu_fixed_ops = {
+ .alloc_coherent = dma_fixed_alloc_coherent,
+ .free_coherent = dma_fixed_free_coherent,
+ .map_single = dma_fixed_map_single,
+ .unmap_single = dma_fixed_unmap_single,
+ .map_sg = dma_fixed_map_sg,
+ .unmap_sg = dma_fixed_unmap_sg,
+ .dma_supported = dma_fixed_dma_supported,
+ .set_dma_mask = dma_set_mask_and_switch,
+};
+
static void cell_dma_dev_setup_fixed(struct device *dev);
static void cell_dma_dev_setup(struct device *dev)
@@ -918,9 +1014,16 @@ static void cell_iommu_setup_fixed_ptab(
pr_debug("iommu: mapping 0x%lx pages from 0x%lx\n", fsize, fbase);
- base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M | IOPTE_SO_RW
+ base_pte = IOPTE_PP_W | IOPTE_PP_R | IOPTE_M
| (cell_iommu_get_ioid(np) & IOPTE_IOID_Mask);
+ if (iommu_fixed_is_weak)
+ pr_info("IOMMU: Using weak ordering for fixed mapping\n");
+ else {
+ pr_info("IOMMU: Using strong ordering for fixed mapping\n");
+ base_pte |= IOPTE_SO_RW;
+ }
+
for (uaddr = 0; uaddr < fsize; uaddr += (1 << 24)) {
/* Don't touch the dynamic region */
ioaddr = uaddr + fbase;
@@ -1036,9 +1139,6 @@ static int __init cell_iommu_fixed_mappi
cell_iommu_setup_window(iommu, np, dbase, dsize, 0);
}
- dma_iommu_fixed_ops = dma_direct_ops;
- dma_iommu_fixed_ops.set_dma_mask = dma_set_mask_and_switch;
-
dma_iommu_ops.set_dma_mask = dma_set_mask_and_switch;
set_pci_dma_ops(&dma_iommu_ops);
@@ -1052,6 +1152,9 @@ static int __init setup_iommu_fixed(char
if (strcmp(str, "off") == 0)
iommu_fixed_disabled = 1;
+ else if (strcmp(str, "weak") == 0)
+ iommu_fixed_is_weak = 1;
+
return 1;
}
__setup("iommu_fixed=", setup_iommu_fixed);
Index: linux-2.6/Documentation/DMA-attributes.txt
===================================================================
--- linux-2.6.orig/Documentation/DMA-attributes.txt
+++ linux-2.6/Documentation/DMA-attributes.txt
@@ -22,3 +22,12 @@ ready and available in memory. The DMA
could race with data DMA. Mapping the memory used for completion
indications with DMA_ATTR_WRITE_BARRIER would prevent the race.
+DMA_ATTR_WEAK_ORDERING
+----------------------
+
+DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
+may be weakly ordered, that is, reads and writes may pass each other.
+
+Since it is optional for platforms to implement DMA_ATTR_WEAK_ORDERING,
+those that do not will simply ignore the attribute and exhibit default
+behavior.
Index: linux-2.6/include/linux/dma-attrs.h
===================================================================
--- linux-2.6.orig/include/linux/dma-attrs.h
+++ linux-2.6/include/linux/dma-attrs.h
@@ -12,6 +12,7 @@
*/
enum dma_attr {
DMA_ATTR_WRITE_BARRIER,
+ DMA_ATTR_WEAK_ORDERING,
DMA_ATTR_MAX,
};
* Re: [Cbe-oss-dev] [PATCH] Add DMA_ATTR_WEAK_ORDERING dma attribute and use in Cell IOMMU code
2008-07-18 13:03 ` [PATCH] Add DMA_ATTR_WEAK_ORDERING dma attribute and use in Cell " Arnd Bergmann
@ 2008-07-19 7:29 ` Jeremy Kerr
2008-07-19 8:36 ` Arnd Bergmann
0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Kerr @ 2008-07-19 7:29 UTC (permalink / raw)
To: cbe-oss-dev
Cc: Arnd Bergmann, Roland Dreier, linuxppc-dev, Peter Altevogt,
Hans Boettiger
> Ok, this is the previous version of the patch from Mark, which
> does exactly that. Please use this one instead.
So, should we add a:
From: Mark Nelson <markn@au1.ibm.com>
?
Jeremy
* Re: [Cbe-oss-dev] [PATCH] Add DMA_ATTR_WEAK_ORDERING dma attribute and use in Cell IOMMU code
2008-07-19 7:29 ` [Cbe-oss-dev] " Jeremy Kerr
@ 2008-07-19 8:36 ` Arnd Bergmann
0 siblings, 0 replies; 25+ messages in thread
From: Arnd Bergmann @ 2008-07-19 8:36 UTC (permalink / raw)
To: Jeremy Kerr
Cc: Roland Dreier, Hans Boettiger, linuxppc-dev, Peter Altevogt,
cbe-oss-dev
On Saturday 19 July 2008, Jeremy Kerr wrote:
>
> > Ok, this is the previous version of the patch from Mark, which
> > does exactly that. Please use this one instead.
>
> So, should we add a:
>
> From: Mark Nelson <markn@au1.ibm.com>
>
Yes, of course. I keep losing this line when sending patches through
quilt and kmail.
Arnd <><