linux-fsdevel.vger.kernel.org archive mirror
* [RFC/PATCH 1/5] bdev: execute in place (V2)
       [not found] <1116422644.2202.1.camel@cotte.boeblingen.de.ibm.com>
@ 2005-05-18 13:53 ` Carsten Otte
  2005-05-18 14:27   ` Christoph Hellwig
  2005-05-18 13:53 ` [RFC/PATCH 2/5] mm/fs: " Carsten Otte
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 13:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, schwidefsky, akpm

[RFC/PATCH 1/5] bdev: execute in place (V2)
This patch introduces a new block device operation called direct_access.
It is used to retrieve a reference to the data on disk behind a given
sector. This reference is supposed to be a CPU-addressable physical
address that remains valid until release is called.
This patch also implements the operation for our dcssblk device driver.
Changes from previous version: none
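
The sector-to-address translation described above can be modelled in
userspace. The following is only a sketch: the 4 KiB page size, the
sketch_segment structure and the sketch_direct_access name are assumptions
made for illustration, not part of the patch. It mirrors the three checks
the dcssblk implementation performs (page alignment, range, address
computation).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096UL  /* assumption: 4 KiB pages */
#define SKETCH_SECTOR_SIZE 512UL

/* Hypothetical stand-in for the driver's per-device information. */
struct sketch_segment {
	uintptr_t start;  /* first byte of the memory segment */
	uintptr_t end;    /* last byte of the memory segment (inclusive) */
};

/*
 * Translate a sector number into the address of the page backing it.
 * Returns 0 on success, -EINVAL for a sector that is not page aligned,
 * and -ERANGE if the page would extend past the end of the segment.
 */
static int sketch_direct_access(const struct sketch_segment *seg,
				uint64_t secnum, uintptr_t *data)
{
	const uint64_t sectors_per_page = SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE;
	uint64_t pgoff;

	assert(seg != NULL && data != NULL);
	if (secnum % sectors_per_page)
		return -EINVAL;
	pgoff = secnum / sectors_per_page;
	if ((pgoff + 1) * SKETCH_PAGE_SIZE - 1 > seg->end - seg->start)
		return -ERANGE;
	*data = seg->start + (uintptr_t)(pgoff * SKETCH_PAGE_SIZE);
	return 0;
}
```

With a two-page segment, sectors 0 and 8 map to the first and second page,
sector 3 is rejected as unaligned, and sector 16 is out of range.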

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
--- 
diff -ruN linux-git/drivers/s390/block/dcssblk.c linux-git-xip/drivers/s390/block/dcssblk.c
--- linux-git/drivers/s390/block/dcssblk.c	2005-05-17 14:23:24.000000000 +0200
+++ linux-git-xip/drivers/s390/block/dcssblk.c	2005-05-17 16:57:07.306779600 +0200
@@ -35,14 +35,17 @@
 static int dcssblk_open(struct inode *inode, struct file *filp);
 static int dcssblk_release(struct inode *inode, struct file *filp);
 static int dcssblk_make_request(struct request_queue *q, struct bio *bio);
+static int dcssblk_direct_access(struct inode *inode, sector_t secnum,
+				 unsigned long *data);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
 static int dcssblk_major;
 static struct block_device_operations dcssblk_devops = {
-	.owner   = THIS_MODULE,
-	.open    = dcssblk_open,
-	.release = dcssblk_release,
+	.owner   	= THIS_MODULE,
+	.open    	= dcssblk_open,
+	.release 	= dcssblk_release,
+	.direct_access 	= dcssblk_direct_access,
 };
 
 static ssize_t dcssblk_add_store(struct device * dev, const char * buf,
@@ -641,6 +644,20 @@
 		/* Request beyond end of DCSS segment. */
 		goto fail;
 	}
+	/* verify data transfer direction */
+	if (dev_info->is_shared) {
+		switch (dev_info->segment_type) {
+		case SEG_TYPE_SR:
+		case SEG_TYPE_ER:
+		case SEG_TYPE_SC:
+			/* cannot write to these segments */
+			if (bio_data_dir(bio) == WRITE) {
+				PRINT_WARN("rejecting write to ro segment %s\n", dev_info->dev.bus_id);
+				goto fail;
+			}
+		}
+	}
+
 	index = (bio->bi_sector >> 3);
 	bio_for_each_segment(bvec, bio, i) {
 		page_addr = (unsigned long)
@@ -661,7 +678,26 @@
 	bio_endio(bio, bytes_done, 0);
 	return 0;
 fail:
-	bio_io_error(bio, bytes_done);
+	bio_io_error(bio, bio->bi_size);
+	return 0;
+}
+
+static int
+dcssblk_direct_access (struct inode *inode, sector_t secnum,
+			unsigned long *data)
+{
+	struct dcssblk_dev_info *dev_info;
+	unsigned long pgoff;
+
+	dev_info = inode->i_sb->s_bdev->bd_disk->private_data;
+	if (!dev_info)
+		return -ENODEV;
+	if (secnum % (PAGE_SIZE/512))
+		return -EINVAL;
+	pgoff = secnum / (PAGE_SIZE / 512);
+	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+		return -ERANGE;
+	*data = (unsigned long) (dev_info->start+pgoff*PAGE_SIZE);
 	return 0;
 }
 
diff -ruN linux-git/include/linux/fs.h linux-git-xip/include/linux/fs.h
--- linux-git/include/linux/fs.h	2005-05-17 14:23:35.000000000 +0200
+++ linux-git-xip/include/linux/fs.h	2005-05-17 16:57:07.308779296 +0200
@@ -884,6 +884,7 @@
 	int (*release) (struct inode *, struct file *);
 	int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned, unsigned long);
+	int (*direct_access) (struct inode *, sector_t, unsigned long *);
 	int (*media_changed) (struct gendisk *);
 	int (*revalidate_disk) (struct gendisk *);
 	struct module *owner;


* [RFC/PATCH 2/5] mm/fs: execute in place (V2)
       [not found] <1116422644.2202.1.camel@cotte.boeblingen.de.ibm.com>
  2005-05-18 13:53 ` [RFC/PATCH 1/5] bdev: execute in place (V2) Carsten Otte
@ 2005-05-18 13:53 ` Carsten Otte
  2005-05-18 14:27   ` Christoph Hellwig
  2005-05-18 13:53 ` [RFC/PATCH 3/5] ext2: " Carsten Otte
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 13:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, schwidefsky, akpm

[RFC/PATCH 2/5] mm/fs: execute in place (V2) 
This patch adds a new address space operation called get_xip_page, which
works similarly to readpage/writepage but returns a reference to a struct
page for the on-disk data for the given page. The page is supposed to be
up-to-date.
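
The callers in this patch treat an error pointer carrying -ENODATA as a
hole and, on the write path, retry with the create flag set. A rough
userspace model of that calling convention might look as follows. All
sketch_* names are hypothetical, the error-pointer helpers only imitate the
kernel's ERR_PTR/PTR_ERR/IS_ERR, and the model is indexed by block rather
than by sector for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095
#define SKETCH_NBLOCKS   4

/* Userspace stand-ins for the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers. */
static void *sketch_err_ptr(long err) { return (void *)(intptr_t)err; }
static long sketch_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_page { unsigned char data[64]; };

/* Hypothetical mapping: a NULL slot in blocks[] represents a hole. */
struct sketch_mapping {
	struct sketch_page *blocks[SKETCH_NBLOCKS];
	struct sketch_page pool[SKETCH_NBLOCKS];
};

/* Model of get_xip_page: look up a block, optionally allocating it. */
static struct sketch_page *sketch_get_xip_page(struct sketch_mapping *m,
					       unsigned long index, int create)
{
	assert(m != NULL);
	if (index >= SKETCH_NBLOCKS)
		return sketch_err_ptr(-EIO);
	if (!m->blocks[index]) {
		if (!create)
			return sketch_err_ptr(-ENODATA);  /* sparse */
		m->blocks[index] = &m->pool[index];       /* "allocate" */
	}
	return m->blocks[index];
}

/* Write-path pattern: try without create first, allocate only on a hole. */
static struct sketch_page *sketch_page_for_write(struct sketch_mapping *m,
						 unsigned long index)
{
	struct sketch_page *page = sketch_get_xip_page(m, index, 0);

	if (sketch_is_err(page) && sketch_ptr_err(page) == -ENODATA)
		page = sketch_get_xip_page(m, index, 1);
	return page;
}
```

A read of a hole reports -ENODATA, whereas a write to the same index
allocates the block, so later lookups return the same page.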

In response to feedback on the previous version, filemap.c has now been
split into three files:
- mm/filemap.h contains some inline functions moved here from mm/filemap.c
  that are called in both filemap.c and filemap_xip.c. Macros have been
  defined that check if execute in place should be used for a given object.
  If no filesystems with xip support are compiled for the kernel
  (CONFIG_FS_XIP not set) those expand to 0. Otherwise they expand to the
  corresponding checks...
- mm/filemap.c now contains more or less its "classic" functionality.
  However, the macros above are used to call xip functions if xip is
  enabled at compile time and if the address space has get_xip_page. In
  addition, some inline functions have been moved to filemap.h.
- mm/filemap_xip.c now contains all xip-related functions, which were
  in filemap.c in the previous version of the patch.
This addresses two issues:
- code path is unchanged for kernels that do not have any xip filesystems
  enabled at compile time
- filemap.c stays as readable as it was to avoid headaches reading it
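
The compile-time switch described in the list above can be illustrated with
a small standalone sketch. SKETCH_FS_XIP stands in for CONFIG_FS_XIP, and
all demo_* and sketch_* names are hypothetical. When the symbol is not
defined, the check is the constant 0 and the compiler can drop the xip
branch entirely, which is how the code path stays unchanged for kernels
without xip filesystems.

```c
#include <assert.h>
#include <stddef.h>

struct demo_aops {
	void *(*get_xip_page)(void);
};

struct demo_mapping {
	const struct demo_aops *a_ops;
};

#ifdef SKETCH_FS_XIP
#define sketch_mapping_is_xip(map) ((map)->a_ops->get_xip_page != NULL)
#else
#define sketch_mapping_is_xip(map) 0  /* xip branches become dead code */
#endif

enum demo_path { DEMO_PATH_GENERIC, DEMO_PATH_XIP };

/* Dummy operation so a mapping can advertise xip support. */
static void *demo_get_xip_page(void) { return NULL; }

/* Pick the read path the way the patched filemap code does. */
static enum demo_path demo_pick_read_path(const struct demo_mapping *map)
{
	assert(map != NULL && map->a_ops != NULL);
	if (sketch_mapping_is_xip(map))
		return DEMO_PATH_XIP;
	return DEMO_PATH_GENERIC;
}
```

Built without -DSKETCH_FS_XIP, demo_pick_read_path() returns
DEMO_PATH_GENERIC even for a mapping whose a_ops provides get_xip_page.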

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
diff -ruN linux-git/fs/open.c linux-git-xip/fs/open.c
--- linux-git/fs/open.c	2005-05-17 14:23:32.000000000 +0200
+++ linux-git-xip/fs/open.c	2005-05-17 18:33:57.750457896 +0200
@@ -807,7 +807,9 @@
 
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) {
+		if (!f->f_mapping->a_ops || 
+		    ((!f->f_mapping->a_ops->direct_IO) &&
+		    (!f->f_mapping->a_ops->get_xip_page))) {
 			fput(f);
 			f = ERR_PTR(-EINVAL);
 		}
diff -ruN linux-git/include/linux/fs.h linux-git-xip/include/linux/fs.h
--- linux-git/include/linux/fs.h	2005-05-17 18:01:33.000000000 +0200
+++ linux-git-xip/include/linux/fs.h	2005-05-17 18:33:57.753457440 +0200
@@ -330,6 +330,8 @@
 	int (*releasepage) (struct page *, int);
 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
+	struct page* (*get_xip_page)(struct address_space *, sector_t,
+			int);
 };
 
 struct backing_dev_info;
@@ -1473,14 +1475,19 @@
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern ssize_t generic_file_xip_write(struct kiocb *, const struct iovec *,
+		unsigned long, loff_t, loff_t *, size_t);
 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
 				unsigned long nr_segs, loff_t *ppos);
 extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void *);
 extern void do_generic_mapping_read(struct address_space *mapping,
-				    struct file_ra_state *, struct file *,
-				    loff_t *, read_descriptor_t *, read_actor_t);
+				   struct file_ra_state *, struct file *,
+				   loff_t *, read_descriptor_t *, read_actor_t);
+extern void do_xip_mapping_read   (struct address_space *mapping,
+				   struct file_ra_state *, struct file *,
+				   loff_t *, read_descriptor_t *, read_actor_t);
 extern void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
 extern ssize_t generic_file_direct_IO(int rw, struct kiocb *iocb,
@@ -1494,17 +1501,32 @@
 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
+extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+
+#ifdef CONFIG_FS_XIP
+#define file_is_xip(file)   unlikely(file->f_mapping->a_ops->get_xip_page)
+#else
+#define file_is_xip(file)   0
+#endif
 
 static inline void do_generic_file_read(struct file * filp, loff_t *ppos,
 					read_descriptor_t * desc,
 					read_actor_t actor)
 {
-	do_generic_mapping_read(filp->f_mapping,
-				&filp->f_ra,
-				filp,
-				ppos,
-				desc,
-				actor);
+	if (file_is_xip(filp))
+		do_xip_mapping_read(filp->f_mapping,
+					&filp->f_ra,
+					filp,
+					ppos,
+					desc,
+					actor);
+	else
+		do_generic_mapping_read(filp->f_mapping,
+					&filp->f_ra,
+					filp,
+					ppos,
+					desc,
+					actor);
 }
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
diff -ruN linux-git/mm/Makefile linux-git-xip/mm/Makefile
--- linux-git/mm/Makefile	2005-05-17 14:23:36.000000000 +0200
+++ linux-git-xip/mm/Makefile	2005-05-17 18:33:57.754457288 +0200
@@ -18,3 +18,4 @@
 obj-$(CONFIG_SHMEM) += shmem.o
 obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
 
+obj-$(CONFIG_FS_XIP) += filemap_xip.o
diff -ruN linux-git/mm/filemap.c linux-git-xip/mm/filemap.c
--- linux-git/mm/filemap.c	2005-05-17 14:23:36.000000000 +0200
+++ linux-git-xip/mm/filemap.c	2005-05-17 18:33:57.757456832 +0200
@@ -28,6 +28,7 @@
 #include <linux/blkdev.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include "filemap.h"
 /*
  * FIXME: remove all knowledge of the buffer layer from the core VM
  */
@@ -968,6 +969,7 @@
 	ssize_t retval;
 	unsigned long seg;
 	size_t count;
+	int xip = file_is_xip(filp) ? 1 : 0;
 
 	count = 0;
 	for (seg = 0; seg < nr_segs; seg++) {
@@ -990,7 +992,9 @@
 	}
 
 	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (filp->f_flags & O_DIRECT) {
+	/* do not use generic_file_direct_IO on xip files, xip IO is
+	   implicitly direct as well */
+	if (filp->f_flags & O_DIRECT && !xip) {
 		loff_t pos = *ppos, size;
 		struct address_space *mapping;
 		struct inode *inode;
@@ -1110,6 +1114,9 @@
 do_readahead(struct address_space *mapping, struct file *filp,
 	     unsigned long index, unsigned long nr)
 {
+	if (mapping_is_xip_save(mapping))
+		return 0;
+
 	if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
 		return -EINVAL;
 
@@ -1538,10 +1545,13 @@
 {
 	struct address_space *mapping = file->f_mapping;
 
-	if (!mapping->a_ops->readpage)
+	if ((!mapping->a_ops->readpage) && (!mapping_is_xip(mapping)))
 		return -ENOEXEC;
 	file_accessed(file);
-	vma->vm_ops = &generic_file_vm_ops;
+	if (mapping_is_xip(mapping))
+		vma->vm_ops = &xip_file_vm_ops;
+	else
+		vma->vm_ops = &generic_file_vm_ops;
 	return 0;
 }
 EXPORT_SYMBOL(filemap_populate);
@@ -1714,32 +1724,7 @@
 }
 EXPORT_SYMBOL(remove_suid);
 
-/*
- * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- */
-static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
-			const char __user *buf, unsigned bytes)
-{
-	char *kaddr;
-	int left;
-
-	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
-	kunmap_atomic(kaddr, KM_USER0);
-
-	if (left != 0) {
-		/* Do it the slow way */
-		kaddr = kmap(page);
-		left = __copy_from_user(kaddr + offset, buf, bytes);
-		kunmap(page);
-	}
-	return bytes - left;
-}
-
-static size_t
+size_t
 __filemap_copy_from_user_iovec(char *vaddr, 
 			const struct iovec *iov, size_t base, size_t bytes)
 {
@@ -1767,52 +1752,6 @@
 }
 
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
- */
-static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
-			const struct iovec *iov, size_t base, size_t bytes)
-{
-	char *kaddr;
-	size_t copied;
-
-	kaddr = kmap_atomic(page, KM_USER0);
-	copied = __filemap_copy_from_user_iovec(kaddr + offset, iov,
-						base, bytes);
-	kunmap_atomic(kaddr, KM_USER0);
-	if (copied != bytes) {
-		kaddr = kmap(page);
-		copied = __filemap_copy_from_user_iovec(kaddr + offset, iov,
-							base, bytes);
-		kunmap(page);
-	}
-	return copied;
-}
-
-static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
-{
-	const struct iovec *iov = *iovp;
-	size_t base = *basep;
-
-	while (bytes) {
-		int copy = min(bytes, iov->iov_len - base);
-
-		bytes -= copy;
-		base += copy;
-		if (iov->iov_len == base) {
-			iov++;
-			base = 0;
-		}
-	}
-	*iovp = iov;
-	*basep = base;
-}
-
-/*
  * Performs necessary checks before doing a write
  *
  * Can adjust writing position aor amount of bytes to write.
@@ -2123,6 +2062,13 @@
 
 	inode_update_time(inode, 1);
 
+	if (file_is_xip(file)) {
+		/* use execute in place to copy directly to disk */
+		written = generic_file_xip_write (iocb, iov,
+			        nr_segs, pos, ppos, count);
+		goto out;
+	}
+
 	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (unlikely(file->f_flags & O_DIRECT)) {
 		written = generic_file_direct_write(iocb, iov,
diff -ruN linux-git/mm/filemap.h linux-git-xip/mm/filemap.h
--- linux-git/mm/filemap.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-git-xip/mm/filemap.h	2005-05-17 18:33:57.792451512 +0200
@@ -0,0 +1,141 @@
+/*
+ *	linux/mm/filemap.h
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte <cotte@de.ibm.com>
+ *
+ * derived from linux/mm/filemap.c
+ *        Copyright (C) Linus Torvalds
+ */
+
+#ifndef __FILEMAP_H
+#define __FILEMAP_H
+
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/uio.h>
+#include <linux/config.h>
+#include <asm/uaccess.h>
+
+#ifdef CONFIG_FS_XIP
+extern struct vm_operations_struct xip_file_vm_ops;
+
+void
+do_xip_mapping_read(struct address_space *mapping,
+		    struct file_ra_state *_ra,
+		    struct file *filp,
+		    loff_t *ppos,
+		    read_descriptor_t *desc,
+		    read_actor_t actor);
+
+void
+__filemap_xip_unmap (struct address_space * mapping,
+		     unsigned long pgoff);
+
+struct page *
+filemap_xip_nopage(struct vm_area_struct * area,
+		   unsigned long address,
+		   int *type);
+
+ssize_t
+generic_file_xip_write(struct kiocb *iocb, const struct iovec *iov,
+		       unsigned long nr_segs, loff_t pos, loff_t *ppos,
+		       size_t count);
+
+int
+xip_truncate_page(struct address_space *mapping,
+		  loff_t from);
+
+#define mapping_is_xip(map)      unlikely(map->a_ops->get_xip_page)
+#define mapping_is_xip_save(map) unlikely(map && map->a_ops \
+                                          && map->a_ops->get_xip_page)
+#else /* not defined CONFIG_FS_XIP */
+#define mapping_is_xip(map)      0
+#define mapping_is_xip_save(map) 0
+#define do_xip_mapping_read(arg1, arg2, arg3, arg4, arg5, arg6) BUG()
+#define xip_truncate_page(map, from) BUG()
+#endif /* defined CONFIG_FS_XIP */
+
+extern struct vm_operations_struct xip_file_vm_ops;
+
+size_t
+__filemap_copy_from_user_iovec(char *vaddr, 
+			       const struct iovec *iov,
+			       size_t base,
+			       size_t bytes);
+
+/*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were sucessfully copied.  If a fault is encountered then clear the page
+ * out to (offset+bytes) and return the number of bytes which were copied.
+ */
+static inline size_t
+filemap_copy_from_user(struct page *page, unsigned long offset,
+			const char __user *buf, unsigned bytes)
+{
+	char *kaddr;
+	int left;
+
+	kaddr = kmap_atomic(page, KM_USER0);
+	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+	kunmap_atomic(kaddr, KM_USER0);
+
+	if (left != 0) {
+		/* Do it the slow way */
+		kaddr = kmap(page);
+		left = __copy_from_user(kaddr + offset, buf, bytes);
+		kunmap(page);
+	}
+	return bytes - left;
+}
+
+/*
+ * This has the same sideeffects and return value as filemap_copy_from_user().
+ * The difference is that on a fault we need to memset the remainder of the
+ * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
+ * single-segment behaviour.
+ */
+static inline size_t
+filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
+			const struct iovec *iov, size_t base, size_t bytes)
+{
+	char *kaddr;
+	size_t copied;
+
+	kaddr = kmap_atomic(page, KM_USER0);
+	copied = __filemap_copy_from_user_iovec(kaddr + offset, iov,
+						base, bytes);
+	kunmap_atomic(kaddr, KM_USER0);
+	if (copied != bytes) {
+		kaddr = kmap(page);
+		copied = __filemap_copy_from_user_iovec(kaddr + offset, iov,
+							base, bytes);
+		kunmap(page);
+	}
+	return copied;
+}
+
+static inline void
+filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+{
+	const struct iovec *iov = *iovp;
+	size_t base = *basep;
+
+	while (bytes) {
+		int copy = min(bytes, iov->iov_len - base);
+
+		bytes -= copy;
+		base += copy;
+		if (iov->iov_len == base) {
+			iov++;
+			base = 0;
+		}
+	}
+	*iovp = iov;
+	*basep = base;
+}
+
+
+#endif
diff -ruN linux-git/mm/filemap_xip.c linux-git-xip/mm/filemap_xip.c
--- linux-git/mm/filemap_xip.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-git-xip/mm/filemap_xip.c	2005-05-17 18:33:57.794451208 +0200
@@ -0,0 +1,388 @@
+/*
+ *	linux/mm/filemap_xip.c
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte <cotte@de.ibm.com>
+ *
+ * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
+ *
+ */
+
+#include <linux/pagemap.h>
+#include <linux/module.h>
+#include <linux/uio.h>
+#include "filemap.h"
+
+struct vm_operations_struct xip_file_vm_ops = {
+	.nopage         = filemap_xip_nopage,
+};
+
+
+/*
+ * This is a generic file read routine for execute in place files, and uses 
+ * the mapping->a_ops->get_xip_page() function for the actual low-level
+ * stuff.
+ *
+ * Note the struct file* is not used at all.  It may be NULL.
+ */
+void
+do_xip_mapping_read(struct address_space *mapping,
+		    struct file_ra_state *_ra,
+		    struct file *filp,
+		    loff_t *ppos,
+		    read_descriptor_t *desc,
+		    read_actor_t actor)
+{
+	struct inode *inode = mapping->host;
+	unsigned long index, end_index, offset;
+	loff_t isize;
+
+	BUG_ON(!mapping->a_ops->get_xip_page);
+
+	index = *ppos >> PAGE_CACHE_SHIFT;
+	offset = *ppos & ~PAGE_CACHE_MASK;
+
+	isize = i_size_read(inode);
+	if (!isize)
+		goto out;
+
+	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+	for (;;) {
+		struct page *page;
+		unsigned long nr, ret;
+
+		/* nr is the maximum number of bytes to copy from this page */
+		nr = PAGE_CACHE_SIZE;
+		if (index >= end_index) {
+			if (index > end_index)
+				goto out;
+			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			if (nr <= offset) {
+				goto out;
+			}
+		}
+		nr = nr - offset;
+
+		page = mapping->a_ops->get_xip_page(mapping,
+			index*(PAGE_SIZE/512), 0);
+		if (!page)
+			goto no_xip_page;
+		if (unlikely(IS_ERR(page))) {
+			if (PTR_ERR(page) == -ENODATA) {
+				/* sparse */
+				page = virt_to_page(empty_zero_page);
+			} else {
+				desc->error = PTR_ERR(page);
+				goto out;
+			}
+		} else
+			BUG_ON(!PageUptodate(page));
+
+		/* If users can be writing to this page using arbitrary
+		 * virtual addresses, take care about potential aliasing
+		 * before reading the page on the kernel side.
+		 */
+		if (mapping_writably_mapped(mapping))
+			flush_dcache_page(page);
+
+		/*
+		 * Ok, we have the page, and it's up-to-date, so
+		 * now we can copy it to user space...
+		 *
+		 * The actor routine returns how many bytes were actually used..
+		 * NOTE! This may not be the same as how much of a user buffer
+		 * we filled up (we may be padding etc), so we can only update
+		 * "pos" here (the actor routine has to update the user buffer
+		 * pointers and the remaining count).
+		 */
+		ret = actor(desc, page, offset, nr);
+		offset += ret;
+		index += offset >> PAGE_CACHE_SHIFT;
+		offset &= ~PAGE_CACHE_MASK;
+
+		if (ret == nr && desc->count)
+			continue;
+		goto out;
+
+no_xip_page:
+		/* Did not get the page. Report it */
+		desc->error = -EIO;
+		goto out;
+	}
+
+out:
+	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	if (filp)
+		file_accessed(filp);
+}
+
+EXPORT_SYMBOL(do_xip_mapping_read);
+
+
+/*
+ * __filemap_xip_unmap is invoked from filemap_xip_unmap and
+ * generic_file_xip_write
+ *
+ * This function walks all vmas of the address_space and unmaps the
+ * empty_zero_page when found at pgoff. Should it go in rmap.c?
+ */
+void
+__filemap_xip_unmap (struct address_space * mapping,
+		     unsigned long pgoff)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct prio_tree_iter iter;
+	unsigned long address;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	pte_t pteval;
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		mm = vma->vm_mm;
+		address = vma->vm_start +
+			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+		/*
+		 * We need the page_table_lock to protect us from page faults,
+		 * munmap, fork, etc...
+		 */
+		spin_lock(&mm->page_table_lock);
+		pgd = pgd_offset(mm, address);
+		if (!pgd_present(*pgd))
+			goto next_unlock;
+		pud = pud_offset(pgd, address);
+		if (!pud_present(*pud))
+			goto next_unlock;
+		pmd = pmd_offset(pud, address);
+		if (!pmd_present(*pmd))
+			goto next_unlock;
+		
+		pte = pte_offset_map(pmd, address);
+		if (!pte_present(*pte))
+			goto next_unmap;
+		if ((page_to_pfn(virt_to_page(empty_zero_page)))
+			!= pte_pfn(*pte))
+			/* pte does already reference new xip block here */
+			goto next_unmap;
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		pteval = ptep_clear_flush(vma, address, pte);
+		BUG_ON(pte_dirty(pteval));
+	next_unmap:
+		pte_unmap(pte);
+	next_unlock:
+		spin_unlock(&mm->page_table_lock);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+}
+
+
+/*
+ * filemap_xip_nopage() is invoked via the vma operations vector for a
+ * mapped memory region to read in file data during a page fault.
+ *
+ * This function is derived from filemap_nopage, but used for execute in place
+ */
+struct page *
+filemap_xip_nopage(struct vm_area_struct * area,
+		   unsigned long address,
+		   int *type)
+{
+	struct file *file = area->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct page *page;
+	unsigned long size, pgoff, endoff;
+
+	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT)
+		+ area->vm_pgoff;
+	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT)
+		+ area->vm_pgoff;
+
+	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	if (pgoff >= size) {
+		return NULL;
+	}
+
+	page = mapping->a_ops->get_xip_page(mapping, pgoff*(PAGE_SIZE/512), 0);
+	if (!IS_ERR(page)) {
+		BUG_ON(!PageUptodate(page));
+		return page;
+	}
+	if (PTR_ERR(page) != -ENODATA)
+		return NULL;
+
+	/* sparse block */
+	if ((area->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
+	    (area->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
+	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
+		/* maybe shared writable, allocate new block */
+		page = mapping->a_ops->get_xip_page (mapping,
+			pgoff*(PAGE_SIZE/512), 1);
+		if (IS_ERR(page))
+			return NULL;
+		BUG_ON(!PageUptodate(page));
+		/* unmap page at pgoff from all other vmas */
+		__filemap_xip_unmap(mapping, pgoff);
+	} else {
+		/* not shared and writable, use empty_zero_page */
+		page = virt_to_page(empty_zero_page);
+	}
+
+	return page;
+}
+
+
+ssize_t
+generic_file_xip_write(struct kiocb *iocb, const struct iovec *iov,
+		       unsigned long nr_segs, loff_t pos, loff_t *ppos,
+		       size_t count)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space * mapping = file->f_mapping;
+	struct address_space_operations *a_ops = mapping->a_ops;
+	struct inode 	*inode = mapping->host;
+	long		status = 0;
+	struct page	*page;
+	size_t		bytes;
+	const struct iovec *cur_iov = iov; /* current iovec */
+	size_t		iov_base = 0;	   /* offset in the current iovec */
+	char __user	*buf;
+	ssize_t		written = 0;
+
+	BUG_ON(!mapping->a_ops->get_xip_page);
+
+	buf = iov->iov_base;
+	do {
+		unsigned long index;
+		unsigned long offset;
+		size_t copied;
+
+		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		index = pos >> PAGE_CACHE_SHIFT;
+		bytes = PAGE_CACHE_SIZE - offset;
+		if (bytes > count)
+			bytes = count;
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 */
+		fault_in_pages_readable(buf, bytes);
+
+		page = a_ops->get_xip_page(mapping,
+						    index*(PAGE_SIZE/512), 0);
+		if (IS_ERR(page) && (PTR_ERR(page) == -ENODATA)) {
+			/* hole: allocate a new block, then unmap the zero page */
+			page = a_ops->get_xip_page(mapping,
+				index*(PAGE_SIZE/512), 1);
+			if (!IS_ERR(page))
+				/* unmap page at pgoff from all other vmas */
+				__filemap_xip_unmap(mapping, index);
+
+		}
+
+		if (IS_ERR(page)) {
+			status = PTR_ERR(page);
+			break;
+		}
+
+		BUG_ON(!PageUptodate(page));
+
+		if (likely(nr_segs == 1))
+			copied = filemap_copy_from_user(page, offset,
+							buf, bytes);
+		else
+			copied = filemap_copy_from_user_iovec(page, offset,
+						cur_iov, iov_base, bytes);
+		flush_dcache_page(page);
+		if (likely(copied > 0)) {
+			status = copied;
+
+			if (status >= 0) {
+				written += status;
+				count -= status;
+				pos += status;
+				buf += status;
+				if (unlikely(nr_segs > 1))
+					filemap_set_next_iovec(&cur_iov,
+							&iov_base, status);
+			}
+		}
+		if (unlikely(copied != bytes))
+			if (status >= 0)
+				status = -EFAULT;
+		if (status < 0)
+			break;
+	} while (count);
+	*ppos = pos;
+	/*
+	 * No need to use i_size_read() here, the i_size
+	 * cannot change under us because we hold i_sem.
+	 */
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		mark_inode_dirty(inode);
+	}
+
+	return written ? written : status;
+}
+EXPORT_SYMBOL(generic_file_xip_write);
+
+
+/*
+ * truncate a page used for execute in place
+ * functionality is analogous to block_truncate_page but uses get_xip_page
+ * to get the page instead of page cache
+ */
+int
+xip_truncate_page(struct address_space *mapping, loff_t from)
+{
+	pgoff_t index = from >> PAGE_CACHE_SHIFT;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned blocksize;
+	unsigned length;
+	struct page *page;
+	void *kaddr;
+	int err;
+
+	blocksize = 1 << mapping->host->i_blkbits;
+	length = offset & (blocksize - 1);
+
+	/* Block boundary? Nothing to do */
+	if (!length)
+		return 0;
+
+	length = blocksize - length;
+
+	page = mapping->a_ops->get_xip_page(mapping,
+					    index*(PAGE_SIZE/512), 0);
+	err = -ENOMEM;
+	if (!page)
+		goto out;
+	if (unlikely(IS_ERR(page))) {
+		if (PTR_ERR(page) == -ENODATA) {
+			/* Hole? No need to truncate */
+			return 0;
+		} else {
+			err = PTR_ERR(page);
+			goto out;
+		}
+	} else
+		BUG_ON(!PageUptodate(page));
+	kaddr = kmap_atomic(page, KM_USER0);
+	memset(kaddr + offset, 0, length);
+	kunmap_atomic(kaddr, KM_USER0);
+
+	flush_dcache_page(page);
+	err = 0;
+out:
+	return err;
+}
+EXPORT_SYMBOL(xip_truncate_page);




* [RFC/PATCH 3/5] ext2: execute in place (V2)
       [not found] <1116422644.2202.1.camel@cotte.boeblingen.de.ibm.com>
  2005-05-18 13:53 ` [RFC/PATCH 1/5] bdev: execute in place (V2) Carsten Otte
  2005-05-18 13:53 ` [RFC/PATCH 2/5] mm/fs: " Carsten Otte
@ 2005-05-18 13:53 ` Carsten Otte
  2005-05-18 13:53 ` [RFC/PATCH 4/5] loop: " Carsten Otte
  2005-05-18 13:54 ` [RFC/PATCH 5/5] madvice/fadvice: " Carsten Otte
  4 siblings, 0 replies; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 13:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, schwidefsky, akpm

[RFC/PATCH 3/5] ext2: execute in place (V2)
This patch adds support for execute in place in ext2. I've chosen this
filesystem because it is simple enough for me stupido to understand,
and well written. Afaics, other filesystems could be changed similarly.
A new config option EXT2_FS_XIP is used to switch execute in place
support on/off. If turned on, a new mount option -o xip is available.
If ext2 is used with a block device that supports the direct_access
block device operation, and -o xip is used, ext2 provides different
address space operations for regular files: ext2_aops_xip, which
implements get_xip_page but not readpage(s)/writepage(s).

This patch is fixed now; the previous version did not build without
CONFIG_EXT2_FS_XIP.
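
The order in which the address space operations are chosen for regular
files can be sketched as below. This is a simplified model with
hypothetical names. In particular, falling back to the non-xip operations
when -o xip is given but the device lacks direct_access is an assumption of
the sketch, not something stated in this mail.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_aops_kind {
	SKETCH_AOPS_DEFAULT,  /* models ext2_aops */
	SKETCH_AOPS_NOBH,     /* models ext2_nobh_aops */
	SKETCH_AOPS_XIP       /* models ext2_aops_xip */
};

/* Hypothetical summary of the mount state relevant to the decision. */
struct sketch_mount {
	int opt_xip;                 /* mounted with -o xip */
	int opt_nobh;                /* mounted with -o nobh */
	int bdev_has_direct_access;  /* device implements direct_access */
};

/* xip is checked first, then nobh, then the default operations. */
static enum sketch_aops_kind sketch_select_aops(const struct sketch_mount *mnt)
{
	assert(mnt != NULL);
	if (mnt->opt_xip && mnt->bdev_has_direct_access)
		return SKETCH_AOPS_XIP;
	if (mnt->opt_nobh)
		return SKETCH_AOPS_NOBH;
	return SKETCH_AOPS_DEFAULT;
}
```

In this model xip takes precedence over nobh when both options are set,
matching the order of the checks in the inode.c and namei.c hunks.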

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
--- 
diff -ruN linux-git/fs/Kconfig linux-git-xip/fs/Kconfig
--- linux-git/fs/Kconfig	2005-05-17 14:23:26.000000000 +0200
+++ linux-git-xip/fs/Kconfig	2005-05-17 18:44:32.037031680 +0200
@@ -50,6 +50,23 @@
 	  If you are not using a security module that requires using
 	  extended attributes for file security labels, say N.
 
+config EXT2_FS_XIP
+	bool "Ext2 execute in place support"
+	depends on EXT2_FS
+	help
+	  Execute in place can be used on memory-backed block devices. If you
+	  enable this option, you can select to mount block devices which are
+	  capable of this feature without using the page cache.
+
+	  If you do not use a block device that is capable of using this, 
+	  or if unsure, say N.
+
+config FS_XIP
+# execute in place
+	bool
+	depends on EXT2_FS_XIP
+	default y
+
 config EXT3_FS
 	tristate "Ext3 journalling file system support"
 	help
diff -ruN linux-git/fs/ext2/Makefile linux-git-xip/fs/ext2/Makefile
--- linux-git/fs/ext2/Makefile	2005-05-17 14:23:29.000000000 +0200
+++ linux-git-xip/fs/ext2/Makefile	2005-05-17 18:44:32.113020128 +0200
@@ -10,3 +10,4 @@
 ext2-$(CONFIG_EXT2_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
 ext2-$(CONFIG_EXT2_FS_SECURITY)	 += xattr_security.o
+ext2-$(CONFIG_EXT2_FS_XIP)	 += xip.o
diff -ruN linux-git/fs/ext2/ext2.h linux-git-xip/fs/ext2/ext2.h
--- linux-git/fs/ext2/ext2.h	2005-05-17 14:23:29.000000000 +0200
+++ linux-git-xip/fs/ext2/ext2.h	2005-05-17 18:44:32.180009944 +0200
@@ -150,6 +150,7 @@
 
 /* inode.c */
 extern struct address_space_operations ext2_aops;
+extern struct address_space_operations ext2_aops_xip;
 extern struct address_space_operations ext2_nobh_aops;
 
 /* namei.c */
diff -ruN linux-git/fs/ext2/inode.c linux-git-xip/fs/ext2/inode.c
--- linux-git/fs/ext2/inode.c	2005-05-17 14:23:29.000000000 +0200
+++ linux-git-xip/fs/ext2/inode.c	2005-05-17 18:44:32.182009640 +0200
@@ -33,6 +33,7 @@
 #include <linux/mpage.h>
 #include "ext2.h"
 #include "acl.h"
+#include "xip.h"
 
 MODULE_AUTHOR("Remy Card and others");
 MODULE_DESCRIPTION("Second Extended Filesystem");
@@ -594,6 +595,16 @@
 	if (err)
 		goto cleanup;
 
+	if (ext2_use_xip(inode->i_sb)) {
+		/*
+		 * we need to clear the block
+		 */
+		err = ext2_clear_xip_target (inode,
+			le32_to_cpu(chain[depth-1].key));
+		if (err)
+			goto cleanup;
+	}
+
 	if (ext2_splice_branch(inode, iblock, chain, partial, left) < 0)
 		goto changed;
 
@@ -691,6 +702,11 @@
 	.writepages		= ext2_writepages,
 };
 
+struct address_space_operations ext2_aops_xip = {
+	.bmap			= ext2_bmap,
+	.get_xip_page		= ext2_get_xip_page,
+};
+
 struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -910,7 +926,9 @@
 	iblock = (inode->i_size + blocksize-1)
 					>> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
 
-	if (test_opt(inode->i_sb, NOBH))
+	if (mapping_is_xip(inode->i_mapping))
+		xip_truncate_page(inode->i_mapping, inode->i_size);
+	else if (test_opt(inode->i_sb, NOBH))
 		nobh_truncate_page(inode->i_mapping, inode->i_size);
 	else
 		block_truncate_page(inode->i_mapping,
@@ -1111,7 +1129,9 @@
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		inode->i_fop = &ext2_file_operations;
-		if (test_opt(inode->i_sb, NOBH))
+		if (ext2_use_xip(inode->i_sb))
+			inode->i_mapping->a_ops = &ext2_aops_xip;
+		else if (test_opt(inode->i_sb, NOBH))
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 		else
 			inode->i_mapping->a_ops = &ext2_aops;
diff -ruN linux-git/fs/ext2/namei.c linux-git-xip/fs/ext2/namei.c
--- linux-git/fs/ext2/namei.c	2005-05-17 14:23:29.000000000 +0200
+++ linux-git-xip/fs/ext2/namei.c	2005-05-17 18:44:32.183009488 +0200
@@ -34,6 +34,7 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
+#include "xip.h"
 
 /*
  * Couple of helper functions - make the code slightly cleaner.
@@ -128,7 +129,9 @@
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		inode->i_fop = &ext2_file_operations;
-		if (test_opt(inode->i_sb, NOBH))
+		if (ext2_use_xip(inode->i_sb))
+			inode->i_mapping->a_ops = &ext2_aops_xip;
+		else if (test_opt(inode->i_sb, NOBH))
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
 		else
 			inode->i_mapping->a_ops = &ext2_aops;
diff -ruN linux-git/fs/ext2/super.c linux-git-xip/fs/ext2/super.c
--- linux-git/fs/ext2/super.c	2005-05-17 14:23:29.000000000 +0200
+++ linux-git-xip/fs/ext2/super.c	2005-05-17 18:44:32.185009184 +0200
@@ -31,6 +31,7 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
+#include "xip.h"
 
 static void ext2_sync_super(struct super_block *sb,
 			    struct ext2_super_block *es);
@@ -257,7 +258,7 @@
 	Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid,
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
 	Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov, Opt_nobh,
-	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
+	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_xip,
 	Opt_ignore, Opt_err,
 };
 
@@ -286,6 +287,7 @@
 	{Opt_nouser_xattr, "nouser_xattr"},
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
+	{Opt_xip, "xip"},
 	{Opt_ignore, "grpquota"},
 	{Opt_ignore, "noquota"},
 	{Opt_ignore, "quota"},
@@ -397,6 +399,13 @@
 			printk("EXT2 (no)acl options not supported\n");
 			break;
 #endif
+		case Opt_xip:
+#ifdef CONFIG_EXT2_FS_XIP
+			set_opt (sbi->s_mount_opt, XIP);
+#else
+			printk("EXT2 xip option not supported\n");
+#endif
+			break;
 		case Opt_ignore:
 			break;
 		default:
@@ -640,6 +649,9 @@
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
+	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
+				    EXT2_MOUNT_XIP if not */
+
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -668,6 +680,13 @@
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
+	if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
+				  (sb->s_blocksize != blocksize))) {
+		if (!silent)
+			printk("XIP: Unsupported blocksize\n");
+		goto failed_mount;
+	}
+
 	/* If the blocksize doesn't match, re-read the thing.. */
 	if (sb->s_blocksize != blocksize) {
 		brelse(bh);
@@ -916,6 +935,7 @@
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
+	unsigned long old_mount_opt = sbi->s_mount_opt;
 
 	/*
 	 * Allow the "check" option to be passed as a remount option.
@@ -927,6 +947,11 @@
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
 	es = sbi->s_es;
+	if (((sbi->s_mount_opt & EXT2_MOUNT_XIP) != 
+	    (old_mount_opt & EXT2_MOUNT_XIP)) &&
+	    invalidate_inodes(sb))
+		ext2_warning(sb, __FUNCTION__, "busy inodes while remounting "\
+			     "xip remain in cache (no functional problem)");
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY))
 		return 0;
 	if (*flags & MS_RDONLY) {
diff -ruN linux-git/fs/ext2/xip.c linux-git-xip/fs/ext2/xip.c
--- linux-git/fs/ext2/xip.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-git-xip/fs/ext2/xip.c	2005-05-17 18:44:32.186009032 +0200
@@ -0,0 +1,80 @@
+/*
+ *  linux/fs/ext2/xip.c
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte (cotte@de.ibm.com)
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/buffer_head.h>
+#include <linux/ext2_fs_sb.h>
+#include <linux/ext2_fs.h>
+#include "ext2.h"
+#include "xip.h"
+
+static inline int
+__inode_direct_access(struct inode *inode, sector_t sector, unsigned long *data) {
+	BUG_ON(!inode->i_sb->s_bdev->bd_disk->fops->direct_access);
+	return inode->i_sb->s_bdev->bd_disk->fops
+		->direct_access(inode,sector,data);
+}
+
+int
+ext2_clear_xip_target(struct inode *inode, int block) {
+	sector_t sector = block*(PAGE_SIZE/512);
+	unsigned long data;
+	int rc;
+
+	rc = __inode_direct_access(inode, sector, &data);
+	if (rc)
+		return rc;
+	clear_page((void*)data);
+	return 0;
+}
+
+void ext2_xip_verify_sb(struct super_block *sb)
+{
+	struct ext2_sb_info *sbi = EXT2_SB(sb);
+	
+	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP)) {
+		if ((sb->s_bdev == NULL) ||
+			sb->s_bdev->bd_disk == NULL ||
+			sb->s_bdev->bd_disk->fops == NULL ||
+			sb->s_bdev->bd_disk->fops->direct_access == NULL) {
+			sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
+			ext2_warning(sb, __FUNCTION__,
+				"ignoring xip option - not supported by bdev");
+		}
+	}
+}
+
+struct page* 
+ext2_get_xip_page(struct address_space *mapping, sector_t blockno,
+		   int create)
+{
+	int rc;
+	unsigned long data;
+	struct buffer_head tmp;
+
+	tmp.b_state = 0;
+	tmp.b_blocknr = 0;
+	rc = ext2_get_block(mapping->host, blockno/(PAGE_SIZE/512) , &tmp,
+				create);
+	if (rc)
+		return ERR_PTR(rc);
+	if (tmp.b_blocknr == 0) {
+		/* SPARSE block */
+		BUG_ON(create);
+		return ERR_PTR(-ENODATA);
+	}
+
+	rc = __inode_direct_access
+		(mapping->host,tmp.b_blocknr*(PAGE_SIZE/512) ,&data);
+	if (rc)
+		return ERR_PTR(rc);
+
+	SetPageUptodate(virt_to_page(data));
+	return virt_to_page(data);
+}
diff -ruN linux-git/fs/ext2/xip.h linux-git-xip/fs/ext2/xip.h
--- linux-git/fs/ext2/xip.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-git-xip/fs/ext2/xip.h	2005-05-17 18:44:32.186009032 +0200
@@ -0,0 +1,25 @@
+/*
+ *  linux/fs/ext2/xip.h
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte (cotte@de.ibm.com)
+ */
+
+#ifdef CONFIG_EXT2_FS_XIP
+extern void ext2_xip_verify_sb (struct super_block *);
+extern int ext2_clear_xip_target (struct inode *, int);
+
+static inline int ext2_use_xip (struct super_block *sb)
+{
+	struct ext2_sb_info *sbi = EXT2_SB(sb);
+	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
+}
+struct page* ext2_get_xip_page (struct address_space *, sector_t, int);
+#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_page)
+#else
+#define mapping_is_xip(map) 0
+#define ext2_xip_verify_sb(sb)              do { } while (0)
+#define ext2_use_xip(sb)	            0
+#define ext2_clear_xip_target(inode, chain) 0
+#define ext2_get_xip_page                   NULL
+#endif
diff -ruN linux-git/include/linux/ext2_fs.h linux-git-xip/include/linux/ext2_fs.h
--- linux-git/include/linux/ext2_fs.h	2005-05-17 14:23:35.000000000 +0200
+++ linux-git-xip/include/linux/ext2_fs.h	2005-05-17 18:44:32.195007664 +0200
@@ -300,18 +300,19 @@
 /*
  * Mount flags
  */
-#define EXT2_MOUNT_CHECK		0x0001	/* Do mount-time checks */
-#define EXT2_MOUNT_OLDALLOC		0x0002  /* Don't use the new Orlov allocator */
-#define EXT2_MOUNT_GRPID		0x0004	/* Create files with directory's group */
-#define EXT2_MOUNT_DEBUG		0x0008	/* Some debugging messages */
-#define EXT2_MOUNT_ERRORS_CONT		0x0010	/* Continue on errors */
-#define EXT2_MOUNT_ERRORS_RO		0x0020	/* Remount fs ro on errors */
-#define EXT2_MOUNT_ERRORS_PANIC		0x0040	/* Panic on errors */
-#define EXT2_MOUNT_MINIX_DF		0x0080	/* Mimics the Minix statfs */
-#define EXT2_MOUNT_NOBH			0x0100	/* No buffer_heads */
-#define EXT2_MOUNT_NO_UID32		0x0200  /* Disable 32-bit UIDs */
-#define EXT2_MOUNT_XATTR_USER		0x4000	/* Extended user attributes */
-#define EXT2_MOUNT_POSIX_ACL		0x8000	/* POSIX Access Control Lists */
+#define EXT2_MOUNT_CHECK		0x000001  /* Do mount-time checks */
+#define EXT2_MOUNT_OLDALLOC		0x000002  /* Don't use the new Orlov allocator */
+#define EXT2_MOUNT_GRPID		0x000004  /* Create files with directory's group */
+#define EXT2_MOUNT_DEBUG		0x000008  /* Some debugging messages */
+#define EXT2_MOUNT_ERRORS_CONT		0x000010  /* Continue on errors */
+#define EXT2_MOUNT_ERRORS_RO		0x000020  /* Remount fs ro on errors */
+#define EXT2_MOUNT_ERRORS_PANIC		0x000040  /* Panic on errors */
+#define EXT2_MOUNT_MINIX_DF		0x000080  /* Mimics the Minix statfs */
+#define EXT2_MOUNT_NOBH			0x000100  /* No buffer_heads */
+#define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
+#define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
+#define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
 
 #define clear_opt(o, opt)		o &= ~EXT2_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT2_MOUNT_##opt



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC/PATCH 4/5] loop: execute in place (V2)
       [not found] <1116422644.2202.1.camel@cotte.boeblingen.de.ibm.com>
                   ` (2 preceding siblings ...)
  2005-05-18 13:53 ` [RFC/PATCH 3/5] ext2: " Carsten Otte
@ 2005-05-18 13:53 ` Carsten Otte
  2005-05-18 14:28   ` Christoph Hellwig
  2005-05-18 13:54 ` [RFC/PATCH 5/5] madvice/fadvice: " Carsten Otte
  4 siblings, 1 reply; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 13:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, schwidefsky, akpm

[RFC/PATCH 4/5] loop: execute in place (V2)
The old loop driver in 2.6.11 used the readpage/writepage aops to
transfer data. Now loop can also use read/write and direct_IO on the
file if readpage/writepage are not available. Unlike the old 2.6.11
version, today's loop driver does work with files that do not have
readpage/writepage. Therefore, this patch is optional.
This patch adds one more transport method to loop that uses the new
address space operation get_xip_page if available.

This patch is unchanged from previous version.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
--- 
diff -ruN linux-git/drivers/block/loop.c linux-git-xip/drivers/block/loop.c
--- linux-git/drivers/block/loop.c	2005-05-17 14:23:16.000000000 +0200
+++ linux-git-xip/drivers/block/loop.c	2005-05-17 19:12:50.707794472 +0200
@@ -275,6 +275,83 @@
 	goto out;
 }
 
+ 
+static int
+do_lo_send_xip(struct loop_device *lo, struct bio_vec *bvec, int bsize, loff_t pos,
+		struct page* ignored)
+{
+	struct file *file = lo->lo_backing_file; /* kudos to NFsckingS */
+	struct address_space *mapping = file->f_mapping;
+	struct address_space_operations *aops = mapping->a_ops;
+	struct page *page;
+	pgoff_t index;
+	unsigned size, offset, bv_offs;
+	int len;
+	int ret = 0;
+
+	down(&mapping->host->i_sem);
+	index = pos >> PAGE_CACHE_SHIFT;
+	offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
+	bv_offs = bvec->bv_offset;
+	len = bvec->bv_len;
+	while (len > 0) {
+		sector_t IV;
+		int transfer_result;
+
+		IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
+
+		size = PAGE_CACHE_SIZE - offset;
+		if (size > len)
+			size = len;
+
+		page = aops->get_xip_page(mapping,
+			index*(PAGE_SIZE/512), 0);
+		if (!page)
+			goto fail;
+		if (unlikely(IS_ERR(page))) {
+			if (PTR_ERR(page) == -ENODATA) {
+				/* sparse */
+				page = virt_to_page(empty_zero_page);
+			} else
+				goto fail;
+		} else
+			BUG_ON(!PageUptodate(page));
+
+		transfer_result = lo_do_transfer(lo, WRITE, page, offset,
+						 bvec->bv_page, bv_offs,
+						 size, IV);
+		if (transfer_result) {
+			char *kaddr;
+
+			/*
+			 * The transfer failed, but we still write the data to
+			 * keep prepare/commit calls balanced.
+			 */
+			printk(KERN_ERR "loop: transfer error block %llu\n",
+			       (unsigned long long)index);
+			kaddr = kmap_atomic(page, KM_USER0);
+			memset(kaddr + offset, 0, size);
+			kunmap_atomic(kaddr, KM_USER0);
+		}
+		flush_dcache_page(page);
+		if (transfer_result)
+			goto fail;
+		bv_offs += size;
+		len -= size;
+		offset = 0;
+		index++;
+		pos += size;
+	}
+	up(&mapping->host->i_sem);
+out:
+	return ret;
+
+fail:
+	up(&mapping->host->i_sem);
+	ret = -1;
+	goto out;
+}
+
 /**
  * __do_lo_send_write - helper for writing data to a loop device
  *
@@ -356,8 +433,11 @@
 	struct page *page = NULL;
 	int i, ret = 0;
 
-	do_lo_send = do_lo_send_aops;
-	if (!(lo->lo_flags & LO_FLAGS_USE_AOPS)) {
+	if (lo->lo_flags & LO_FLAGS_USE_AOPS)
+		do_lo_send = do_lo_send_aops;
+	else if (lo->lo_flags & LO_FLAGS_USE_XIP)
+		do_lo_send = do_lo_send_xip;
+	else {
 		do_lo_send = do_lo_send_direct_write;
 		if (lo->transfer != transfer_none) {
 			page = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
@@ -787,11 +867,13 @@
 		 */
 		if (!file->f_op->sendfile)
 			goto out_putf;
-		if (aops->prepare_write && aops->commit_write)
+		if (aops->get_xip_page)
+			lo_flags |= LO_FLAGS_USE_XIP;
+		else if (aops->prepare_write && aops->commit_write)
 			lo_flags |= LO_FLAGS_USE_AOPS;
-		if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
+		if (!(lo_flags & (LO_FLAGS_USE_AOPS | LO_FLAGS_USE_XIP)) 
+		    && !file->f_op->write)
 			lo_flags |= LO_FLAGS_READ_ONLY;
-
 		lo_blocksize = inode->i_blksize;
 		error = 0;
 	} else {
diff -ruN linux-git/include/linux/loop.h linux-git-xip/include/linux/loop.h
--- linux-git/include/linux/loop.h	2005-05-17 14:23:35.000000000 +0200
+++ linux-git-xip/include/linux/loop.h	2005-05-17 19:12:50.717792952 +0200
@@ -74,6 +74,7 @@
 enum {
 	LO_FLAGS_READ_ONLY	= 1,
 	LO_FLAGS_USE_AOPS	= 2,
+	LO_FLAGS_USE_XIP        = 4,
 };
 
 #include <asm/posix_types.h>	/* for __kernel_old_dev_t */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC/PATCH 5/5] madvice/fadvice: execute in place (V2)
       [not found] <1116422644.2202.1.camel@cotte.boeblingen.de.ibm.com>
                   ` (3 preceding siblings ...)
  2005-05-18 13:53 ` [RFC/PATCH 4/5] loop: " Carsten Otte
@ 2005-05-18 13:54 ` Carsten Otte
  4 siblings, 0 replies; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 13:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, schwidefsky, akpm

[RFC/PATCH 5/5] madvice/fadvice: execute in place (V2)
This patch makes the madvise and fadvise system calls return 0 on advice to
cache data associated with files that do have get_xip_page. Since the
data for those is in memory anyway, we can just ignore the advice.

This patch is unchanged from previous version.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
--- 
diff -ruN linux-git/mm/fadvise.c linux-git-xip/mm/fadvise.c
--- linux-git/mm/fadvise.c	2005-05-17 14:23:36.000000000 +0200
+++ linux-git-xip/mm/fadvise.c	2005-05-17 20:16:40.000000000 +0200
@@ -43,6 +43,10 @@
 		goto out;
 	}
 
+	if (mapping->a_ops->get_xip_page)
+		/* no bad return value, but ignore advice */
+		goto out;
+
 	/* Careful about overflows. Len == 0 means "as much as possible" */
 	endbyte = offset + len;
 	if (!len || endbyte < len)
diff -ruN linux-git/mm/madvise.c linux-git-xip/mm/madvise.c
--- linux-git/mm/madvise.c	2005-05-17 14:23:36.000000000 +0200
+++ linux-git-xip/mm/madvise.c	2005-05-17 20:16:40.000000000 +0200
@@ -65,6 +65,10 @@
 	if (!file)
 		return -EBADF;
 
+	if (file->f_mapping->a_ops->get_xip_page)
+		/* no bad return value, but ignore advice */
+		return 0;
+
 	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 	if (end > vma->vm_end)
 		end = vma->vm_end;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 2/5] mm/fs: execute in place (V2)
  2005-05-18 13:53 ` [RFC/PATCH 2/5] mm/fs: " Carsten Otte
@ 2005-05-18 14:27   ` Christoph Hellwig
  2005-05-18 14:56     ` Carsten Otte
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2005-05-18 14:27 UTC (permalink / raw)
  To: cotte; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

>  static inline void do_generic_file_read(struct file * filp, loff_t *ppos,
>  					read_descriptor_t * desc,
>  					read_actor_t actor)
>  {
> -	do_generic_mapping_read(filp->f_mapping,
> -				&filp->f_ra,
> -				filp,
> -				ppos,
> -				desc,
> -				actor);
> +	if (file_is_xip(filp))
> +		do_xip_mapping_read(filp->f_mapping,
> +					&filp->f_ra,
> +					filp,
> +					ppos,
> +					desc,
> +					actor);
> +	else
> +		do_generic_mapping_read(filp->f_mapping,
> +					&filp->f_ra,
> +					filp,
> +					ppos,
> +					desc,
> +					actor);
>  }

>  	size_t count;
> +	int xip = file_is_xip(filp) ? 1 : 0;
>  
>  	count = 0;
>  	for (seg = 0; seg < nr_segs; seg++) {
> @@ -990,7 +992,9 @@
>  	}
>  
>  	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
> -	if (filp->f_flags & O_DIRECT) {
> +	/* do not use generic_file_direct_IO on xip files, xip IO is
> +	   implicitly direct as well */
> +	if (filp->f_flags & O_DIRECT && !xip) {
>  		loff_t pos = *ppos, size;
>  		struct address_space *mapping;
>  		struct inode *inode;

I don't like this read split at all.  Please just define a completely
separate entry point for read, xip_file_read; except for verifying the
iovecs you don't share any code.

> @@ -1538,10 +1545,13 @@
>  {
>  	struct address_space *mapping = file->f_mapping;
>  
> -	if (!mapping->a_ops->readpage)
> +	if ((!mapping->a_ops->readpage) && (!mapping_is_xip(mapping)))
>  		return -ENOEXEC;
>  	file_accessed(file);
> -	vma->vm_ops = &generic_file_vm_ops;
> +	if (mapping_is_xip(mapping))
> +		vma->vm_ops = &xip_file_vm_ops;
> +	else
> +		vma->vm_ops = &generic_file_vm_ops;
>  	return 0;
>  }

Similarly, please add a separate xip_file_mmap.

> @@ -2123,6 +2062,13 @@
>  
>  	inode_update_time(inode, 1);
>  
> +	if (file_is_xip(file)) {
> +		/* use execute in place to copy directly to disk */
> +		written = generic_file_xip_write (iocb, iov,
> +			        nr_segs, pos, ppos, count);
> +		goto out;
> +	}
> +

Ditto with xip_file_write.

> diff -ruN linux-git/mm/filemap.h linux-git-xip/mm/filemap.h
> --- linux-git/mm/filemap.h	1970-01-01 01:00:00.000000000 +0100
> +++ linux-git-xip/mm/filemap.h	2005-05-17 18:33:57.792451512 +0200
> @@ -0,0 +1,141 @@
> +/*
> + *	linux/mm/filemap.h
> + *
> + * Copyright (C) 2005 IBM Corporation

I think just adding an IBM copyright isn't fair.  Just copy it from
filemap.c

> +		/*
> +		 * We need the page_table_lock to protect us from page faults,
> +		 * munmap, fork, etc...
> +		 */
> +		spin_lock(&mm->page_table_lock);
> +		pgd = pgd_offset(mm, address);
> +		if (!pgd_present(*pgd))
> +			goto next_unlock;
> +		pud = pud_offset(pgd, address);
> +		if (!pud_present(*pud))
> +			goto next_unlock;
> +		pmd = pmd_offset(pud, address);
> +		if (!pmd_present(*pmd))
> +			goto next_unlock;
> +		
> +		pte = pte_offset_map(pmd, address);
> +		if (!pte_present(*pte))
> +			goto next_unmap;
> +		if ((page_to_pfn(virt_to_page(empty_zero_page)))
> +			!= pte_pfn(*pte))
> +			/* pte does already reference new xip block here */

You should probably use page_check_address().  Currently it's static in rmap.c,
but that could be changed.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 1/5] bdev: execute in place (V2)
  2005-05-18 13:53 ` [RFC/PATCH 1/5] bdev: execute in place (V2) Carsten Otte
@ 2005-05-18 14:27   ` Christoph Hellwig
  2005-05-18 15:36     ` Carsten Otte
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2005-05-18 14:27 UTC (permalink / raw)
  To: cotte; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

> +	int (*direct_access) (struct inode *, sector_t, unsigned long *);

this should have a block_device * first argument.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 4/5] loop: execute in place (V2)
  2005-05-18 13:53 ` [RFC/PATCH 4/5] loop: " Carsten Otte
@ 2005-05-18 14:28   ` Christoph Hellwig
  2005-05-18 14:38     ` Carsten Otte
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2005-05-18 14:28 UTC (permalink / raw)
  To: cotte; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

On Wed, May 18, 2005 at 03:53:52PM +0200, Carsten Otte wrote:
> [RFC/PATCH 4/5] loop: execute in place (V2)
> The old loop driver in 2.6.11 used the readpage/writepage aops to
> transfer data. Now loop can also use read/write and direct_IO on the
> file if readpage/writepage are not available. Unlike the old 2.6.11
> version, today's loop driver does work with files that do not have
> readpage/writepage. Therefore, this patch is optional.
> This patch adds one more transport method to loop that uses the new
> address space operation get_xip_page if available.
> 
> This patch is unchanged from previous version.

This should be ifdef'ed to avoid bloat for non-XIP builds.  Or just be dropped
completely.  How much difference does it make over read/write and where does
loop performance matter?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 4/5] loop: execute in place (V2)
  2005-05-18 14:28   ` Christoph Hellwig
@ 2005-05-18 14:38     ` Carsten Otte
  0 siblings, 0 replies; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 14:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

Christoph Hellwig wrote:

>On Wed, May 18, 2005 at 03:53:52PM +0200, Carsten Otte wrote:
>  
>
>>[RFC/PATCH 4/5] loop: execute in place (V2)
>>
>>    
>>
>
>This should be ifdef'ed to avoid bloat for non-XIP builds.  Or just be dropped
>completely.  How much difference does it make over read/write and where does
>loop performance matter?
>  
>
I don't think loop on xip is performance critical. For page cache lookup
I see a performance difference of factor 2 on our platform because we
have decent memory bandwidth and lock contention slows things down
with many CPUs. Given that even without this patch we don't do page
cache lookups, I don't think there's much difference. Initially this patch
was written for the old loop driver that won't work without this patch...
Guess that dropping it is a good idea.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 2/5] mm/fs: execute in place (V2)
  2005-05-18 14:27   ` Christoph Hellwig
@ 2005-05-18 14:56     ` Carsten Otte
  2005-05-18 15:00       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 14:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

Christoph Hellwig wrote:

>I don't like this read split at all. Please just define a completely
>separate entry point for read, xip_file_read; except for verifying the
>iovecs you don't share any code.
>Similarly, please add a separate xip_file_mmap.
>Ditto with xip_file_write.
>  
>
I do plainly agree that this would make the code more readable here.
But it has a significant downside:
Once you have a different set of file operations for either case, you
also need to have a different file_operations struct in each individual
filesystem using this. Also, this moves the check "do we have xip today?"
from here to the filesystem that needs to decide which file operations
struct to use.
Looking forward, there may be multiple filesystems using this which
leads to duplicating the need for this check.

>  
>
>>diff -ruN linux-git/mm/filemap.h linux-git-xip/mm/filemap.h
>>--- linux-git/mm/filemap.h	1970-01-01 01:00:00.000000000 +0100
>>+++ linux-git-xip/mm/filemap.h	2005-05-17 18:33:57.792451512 +0200
>>@@ -0,0 +1,141 @@
>>+/*
>>+ *	linux/mm/filemap.h
>>+ *
>>+ * Copyright (C) 2005 IBM Corporation
>>    
>>
>
>I think just adding an IBM copyright isn't fair.  Just copy it from
>filemap.c
>  
>
Agreed.

>
>You should probably use page_check_address().  Currently it's static in rmap.c,
>but that could be changed.
>  
>
Good point. This was derived from try_to_unmap_one before that one
was added. Btw: Shouldn't this function move to rmap.c anyway?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 2/5] mm/fs: execute in place (V2)
  2005-05-18 14:56     ` Carsten Otte
@ 2005-05-18 15:00       ` Christoph Hellwig
  2005-05-18 15:31         ` Carsten Otte
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2005-05-18 15:00 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, schwidefsky, akpm

On Wed, May 18, 2005 at 04:56:42PM +0200, Carsten Otte wrote:
> I do plainly agree that this would make the code more readable here.
> But it has a significant downside:
> Once you have a different set of file operations for either case, you
> also need to have a different file_operations struct in each individual
> filesystem using this. Also, this moves the check "do we have xip today?"
> from here to the filesystem that needs to decide which file operations
> struct to use.
> Looking forward, there may be multiple filesystems using this which
> leads to duplicating the need for this check.

I don't think that's much of a problem.  The filesystem has a new file_operations
instance and decides at read_inode time which one to use.  You already have different
address_space operations and a different truncate anyway.
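
A rough sketch of what that choice at read_inode time might look like is
given below. It is a standalone userspace model, not kernel code: the
struct layouts, pick_file_ops() and ext2_xip_file_operations are
hypothetical stand-ins, and only the idea of picking one of two
file_operations instances per inode comes from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures (hypothetical, userspace only). */
struct file_operations { const char *name; };
struct super_block { unsigned long s_mount_opt; };
struct inode {
	struct super_block *i_sb;
	const struct file_operations *i_fop;
};

#define EXT2_MOUNT_XIP 0x010000 /* value taken from the ext2 patch */

static const struct file_operations ext2_file_operations = { "generic" };
/* Hypothetical second instance; its methods would be the xip_file_* ones. */
static const struct file_operations ext2_xip_file_operations = { "xip" };

/* Decide once, when the inode is set up, which instance to use. */
static void pick_file_ops(struct inode *inode)
{
	assert(inode != NULL && inode->i_sb != NULL);

	if (inode->i_sb->s_mount_opt & EXT2_MOUNT_XIP)
		inode->i_fop = &ext2_xip_file_operations;
	else
		inode->i_fop = &ext2_file_operations;
}
```

In the posted ext2 patch the same kind of branch already exists for
a_ops (ext2_aops_xip vs. ext2_nobh_aops vs. ext2_aops), so the i_fop
assignment would presumably sit right next to it.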

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 2/5] mm/fs: execute in place (V2)
  2005-05-18 15:00       ` Christoph Hellwig
@ 2005-05-18 15:31         ` Carsten Otte
  2005-05-18 15:36           ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 15:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

Christoph Hellwig wrote:

>On Wed, May 18, 2005 at 04:56:42PM +0200, Carsten Otte wrote:
>  
>
>>I do plainly agree that this would make the code more readable here.
>>But it has a significant downside:
>>Once you have a different set of file operations for either case, you
>>also need to have a different file_operations struct in each individual
>>filesystem using this. Also, this moves the check "do we have xip today?"
>>from here to the filesystem that needs to decide which file operations
>>struct to use.
>>Looking forward, there may be multiple filesystems using this which
>>leads to duplicating the need for this check.
>>    
>>
>
>I don't think that's much of a problem.  The filesystem has a new file_operations
>instance and decides at read_inode time which one to use.  You already have different
>address_space operations and a different truncate anyway.
>
>  
>
Yeah, but in addition to multiplying the check, it would duplicate a
significant part of filemap:
- generic_file_read           => xip_file_read
- generic_file_aio_read    => xip_file_aio_read
- __generic_file_aio_read => __xip_file_aio_read
- generic_file_sendfile     => xip_file_sendfile
- generic_file_readv          => xip_file_readv
- generic_file_write          => xip_file_write
- generic_file_aio_write_nolock => xip_file_aio_write_nolock
- __generic_file_write_nolock => __xip_file_write_nolock
- generic_file_write_nolock => xip_file_write_nolock
- generic_file_aio_write => xip_file_aio_write
- generic_file_mmap => xip_file_mmap
- generic_file_readonly_mmap => xip_file_readonly_mmap

All changes to these functions would need to be mirrored, and the binary
kernel images with xip enabled would grow by the size of those functions.

But given that the copies of those functions would be equivalent to their
originals, I honestly think that duplicating them is worse than splitting
the read/write paths where handling is in fact different:
- mapping_read
- nopage
- generic_file_write
- truncate page


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 1/5] bdev: execute in place (V2)
  2005-05-18 14:27   ` Christoph Hellwig
@ 2005-05-18 15:36     ` Carsten Otte
  2005-05-18 15:37       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 15:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

Christoph Hellwig wrote:

>>+	int (*direct_access) (struct inode *, sector_t, unsigned long *);
>>    
>>
>
>this should have a block_device * first argument.
>  
>
While I agree that (block_device *) would be a good thing to address
the target block device, the inode *  is consistent with other
operations in this vector: open, release, & ioctl use the same scheme.
The reason for inode * here is that the caller has no easy way to get
to the block_device *. How would the filesystem do that?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 2/5] mm/fs: execute in place (V2)
  2005-05-18 15:31         ` Carsten Otte
@ 2005-05-18 15:36           ` Christoph Hellwig
  2005-05-18 15:50             ` Carsten Otte
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2005-05-18 15:36 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, schwidefsky, akpm

On Wed, May 18, 2005 at 05:31:13PM +0200, Carsten Otte wrote:
> - generic_file_read           => xip_file_read

no need to have that one if you implement aio_read -> use do_sync_read

> - generic_file_aio_read    => xip_file_aio_read
> - __generic_file_aio_read => __xip_file_aio_read

readv and aio_read are just wrappers around this one.

> - generic_file_sendfile     => xip_file_sendfile

pretty trivial

> - generic_file_readv          => xip_file_readv
> - generic_file_write          => xip_file_write

just use do_sync_write

> - generic_file_aio_write_nolock => xip_file_aio_write_nolock
> - __generic_file_write_nolock => __xip_file_write_nolock
> - generic_file_write_nolock => xip_file_write_nolock
> - generic_file_aio_write => xip_file_aio_write

you don't need all these.  Just writev and aio_write as wrappers around a common one

> - generic_file_mmap => xip_file_mmap

this one doesn't share code anyway

> - generic_file_readonly_mmap => xip_file_readonly_mmap

unless you want to implement a readonly filesystem with xip support you
don't need this one.
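
The "wrappers around a common one" idea can be illustrated with a small
userspace model. Nothing here is kernel API: struct xip_iovec,
xip_read_common(), xip_read() and xip_readv() are hypothetical names,
and the "file" is just a byte array standing in for memory-backed XIP
data. The point is only that the single-buffer and vectored entry points
can both be thin wrappers over one copy loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct iovec (hypothetical, userspace only). */
struct xip_iovec { void *base; size_t len; };

/* Common worker: copy from the in-memory "file" into each segment. */
static size_t xip_read_common(const char *file, size_t file_size,
			      const struct xip_iovec *iov, size_t nr_segs,
			      size_t *ppos)
{
	size_t done = 0;

	assert(ppos != NULL);
	for (size_t seg = 0; seg < nr_segs && *ppos < file_size; seg++) {
		size_t n = iov[seg].len;

		/* Clamp the last copy at end of file. */
		if (n > file_size - *ppos)
			n = file_size - *ppos;
		memcpy(iov[seg].base, file + *ppos, n);
		*ppos += n;
		done += n;
	}
	return done;
}

/* Single-buffer entry point: wraps the buffer in one segment. */
static size_t xip_read(const char *file, size_t file_size,
		       void *buf, size_t len, size_t *ppos)
{
	struct xip_iovec iov = { buf, len };

	return xip_read_common(file, file_size, &iov, 1, ppos);
}

/* Vectored entry point: passes the segments straight through. */
static size_t xip_readv(const char *file, size_t file_size,
			const struct xip_iovec *iov, size_t nr_segs,
			size_t *ppos)
{
	return xip_read_common(file, file_size, iov, nr_segs, ppos);
}
```

Since the data is copied with a plain memcpy, there is no separate
asynchronous path to model; an aio-style wrapper would look the same as
the two above.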

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 1/5] bdev: execute in place (V2)
  2005-05-18 15:36     ` Carsten Otte
@ 2005-05-18 15:37       ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2005-05-18 15:37 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, schwidefsky, akpm

On Wed, May 18, 2005 at 05:36:49PM +0200, Carsten Otte wrote:
> Christoph Hellwig wrote:
> 
> >>+	int (*direct_access) (struct inode *, sector_t, unsigned long *);
> >>    
> >>
> >
> >this should have a block_device * first argument.
> >  
> >
> While I agree that (block_device *) would be a good thing to address
> the target block device, the inode *  is consistent with other
> operations in this vector: open, release, & ioctl use the same scheme.

That's going to change real soon.

> The reason for inode * here is that the caller has no easy way to get
> to the block_device *. How would the filesystem do that?

sb->s_bdev
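
As a sketch of how the two suggestions fit together (a block_device *
first argument, reached through sb->s_bdev), consider the userspace model
below. The struct definitions are pared-down stand-ins, and
inode_direct_access() and toy_direct_access() are hypothetical; the
pointer chain s_bdev->bd_disk->fops->direct_access is the one the posted
xip.c already walks.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long sector_t; /* stand-in for the kernel typedef */

struct block_device;
struct block_device_operations {
	/* Variant of the proposed hook taking a block_device * first. */
	int (*direct_access)(struct block_device *, sector_t, unsigned long *);
};
struct gendisk { struct block_device_operations *fops; };
struct block_device { struct gendisk *bd_disk; };
struct super_block { struct block_device *s_bdev; };
struct inode { struct super_block *i_sb; };

/* Filesystem-side helper: go from the inode to the bdev via sb->s_bdev. */
static int inode_direct_access(struct inode *inode, sector_t sector,
			       unsigned long *data)
{
	struct block_device *bdev = inode->i_sb->s_bdev;

	assert(bdev->bd_disk->fops->direct_access != NULL);
	return bdev->bd_disk->fops->direct_access(bdev, sector, data);
}

/* Toy driver: pretends sector N lives at address 0x1000 + N * 512. */
static int toy_direct_access(struct block_device *bdev, sector_t sector,
			     unsigned long *data)
{
	(void)bdev;
	*data = 0x1000UL + sector * 512UL;
	return 0;
}
```

With this shape the driver never needs the inode, and the filesystem
pays one extra pointer dereference per call.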

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC/PATCH 2/5] mm/fs: execute in place (V2)
  2005-05-18 15:36           ` Christoph Hellwig
@ 2005-05-18 15:50             ` Carsten Otte
  2005-05-18 15:53               ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Carsten Otte @ 2005-05-18 15:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, schwidefsky, akpm

Christoph Hellwig wrote:

>On Wed, May 18, 2005 at 05:31:13PM +0200, Carsten Otte wrote:
>  
>
>>- generic_file_read           => xip_file_read
>>    
>>
>
>no need to have that one if you implement aio_read -> use do_sync_read
>
>  
>
>>- generic_file_aio_read    => xip_file_aio_read
>>- __generic_file_aio_read => __xip_file_aio_read
>>    
>>
>
>readv and aio_read are just wrappers around this one.
>
>  
>
>>- generic_file_sendfile     => xip_file_sendfile
>>    
>>
>
>pretty trivial
>
>  
>
>>- generic_file_readv          => xip_file_readv
>>- generic_file_write          => xip_file_write
>>    
>>
>
>just use do_sync_write
>
>  
>
>>- generic_file_aio_write_nolock => xip_file_write_nolock
>>- __generic_file_write_nolock => __xip_file_write_nolock
>>- generic_file_write_nolock => xip_file_write_nolock
>>- generic_file_aio_write => xip_file_aio_write
>>    
>>
>
>you don't need all these.  Just writev and aio_write as wrappers around a common one
>
>  
>
>>- generic_file_mmap => xip_file_mmap
>>    
>>
>
>this one doesn't share code anyway
>
>  
>
>>- generic_file_readonly_mmap => xip_file_readonly_mmap
>>    
>>
>
>unless you want to implement a readonly filesystem with xip support you
>don't need this one.
>
>  
>
I agree that sync/async is not too much of a difference when you do a memcpy
behind, so you can just have wrappers. I am still not convinced that it will
stay reasonably small with all that duplicated stuff, but since it's easy to
do I'm just gonna give it a try to see how it'll look. Bet the patch size
will double.


* Re: [RFC/PATCH 2/5] mm/fs: execute in place (V2)
  2005-05-18 15:50             ` Carsten Otte
@ 2005-05-18 15:53               ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2005-05-18 15:53 UTC (permalink / raw)
  To: Carsten Otte
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, schwidefsky, akpm

On Wed, May 18, 2005 at 05:50:59PM +0200, Carsten Otte wrote:
> I agree that sync/async is not too much of a difference when you do a memcpy
> behind, so you can just have wrappers.

They already are wrappers in filemap.c.  In fact one of my planned projects
is to kill all that silly duplication and have aio_readv/aio_writev entry
points for filesystems and read/write for drivers and nothing else.  That
would clean up the mess considerably.
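
As a rough illustration of a synchronous entry point that is only a wrapper, here is a userspace sketch in the spirit of do_sync_read. It is not the filemap.c code: `struct demo_kiocb`, `struct demo_file`, and every `demo_*` name are hypothetical, reduced stand-ins, and the step where the real helper waits for a queued request is left out because a memcpy-backed read completes immediately.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

struct demo_file;

/* Hypothetical, much reduced stand-in for the kernel's struct kiocb. */
struct demo_kiocb {
	struct demo_file *ki_filp;
	off_t ki_pos;
};

struct demo_file_ops {
	ssize_t (*aio_read)(struct demo_kiocb *iocb, char *buf,
			    size_t count, off_t pos);
};

struct demo_file {
	const struct demo_file_ops *f_op;
	const char *data;	/* directly addressable file contents */
	size_t size;
};

/* The only read path the "filesystem" implements. */
static ssize_t demo_xip_aio_read(struct demo_kiocb *iocb, char *buf,
				 size_t count, off_t pos)
{
	struct demo_file *f = iocb->ki_filp;

	if (pos < 0 || (size_t)pos >= f->size)
		return 0;			/* at or past end of file */
	if (count > f->size - (size_t)pos)
		count = f->size - (size_t)pos;	/* short read at end of file */
	memcpy(buf, f->data + pos, count);
	iocb->ki_pos = pos + (off_t)count;
	return (ssize_t)count;
}

/* Generic synchronous wrapper: set up a request, call the aio method. */
static ssize_t demo_sync_read(struct demo_file *f, char *buf, size_t count,
			      off_t *ppos)
{
	struct demo_kiocb kiocb = { .ki_filp = f, .ki_pos = *ppos };
	ssize_t ret;

	assert(f->f_op != NULL && f->f_op->aio_read != NULL);
	ret = f->f_op->aio_read(&kiocb, buf, count, kiocb.ki_pos);
	*ppos = kiocb.ki_pos;	/* publish the advanced file position */
	return ret;
}
```

A filesystem that fills in only `aio_read` gets the synchronous path from the generic wrapper, so it needs no second copy of the read logic.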

> I am still not convinced that it will stay reasonably small with all that
> duplicated stuff, but since it's easy to do I'm just gonna give it a try
> to see how it'll look. Bet the patch size will double.

I think that's okay.  XIP is a total minority feature, and while we should
avoid duplication where possible, not making filemap.c even more messy is
by far preferable.



Thread overview: 17+ messages
     [not found] <1116422644.2202.1.camel@cotte.boeblingen.de.ibm.com>
2005-05-18 13:53 ` [RFC/PATCH 1/5] bdev: execute in place (V2) Carsten Otte
2005-05-18 14:27   ` Christoph Hellwig
2005-05-18 15:36     ` Carsten Otte
2005-05-18 15:37       ` Christoph Hellwig
2005-05-18 13:53 ` [RFC/PATCH 2/5] mm/fs: " Carsten Otte
2005-05-18 14:27   ` Christoph Hellwig
2005-05-18 14:56     ` Carsten Otte
2005-05-18 15:00       ` Christoph Hellwig
2005-05-18 15:31         ` Carsten Otte
2005-05-18 15:36           ` Christoph Hellwig
2005-05-18 15:50             ` Carsten Otte
2005-05-18 15:53               ` Christoph Hellwig
2005-05-18 13:53 ` [RFC/PATCH 3/5] ext2: " Carsten Otte
2005-05-18 13:53 ` [RFC/PATCH 4/5] loop: " Carsten Otte
2005-05-18 14:28   ` Christoph Hellwig
2005-05-18 14:38     ` Carsten Otte
2005-05-18 13:54 ` [RFC/PATCH 5/5] madvice/fadvice: " Carsten Otte
