* [rfc][patch 1/2] mm: introduce VM_MIXEDMAP mappings
From: Nick Piggin @ 2007-12-14 13:38 UTC
To: Jared Hulbert, Linux Memory Management List, Carsten Otte
Hi,
These 2 patches are an RFC only at this point. I've done some basic testing
of them with my brd device driver, but it will probably take more work to
get a real setup working...
This is one of the things Jared might need to better support some of his
non-volatile memory work in the pipeline; we're hoping that it
might be of some use to s390 systems too.
Thanks,
Nick
---
From: Jared Hulbert <jaredeh@gmail.com>
mm: introduce VM_MIXEDMAP
Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in
that it can support COW mappings of arbitrary ranges including ranges without
struct page (PFNMAP can only support COW in those cases where the un-COW-ed
translations are mapped linearly in the virtual address space).
VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because
it needs to avoid refcounting pfn_valid pages eg. for /dev/mem mappings).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -106,6 +106,7 @@ extern unsigned int kobjsize(const void
#define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */
#define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */
+#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -361,35 +361,65 @@ static inline int is_cow_mapping(unsigne
}
/*
- * This function gets the "struct page" associated with a pte.
+ * This function gets the "struct page" associated with a pte or returns
+ * NULL if no "struct page" is associated with the pte.
*
- * NOTE! Some mappings do not have "struct pages". A raw PFN mapping
- * will have each page table entry just pointing to a raw page frame
- * number, and as far as the VM layer is concerned, those do not have
- * pages associated with them - even if the PFN might point to memory
+ * A raw VM_PFNMAP mapping (ie. one that is not COWed) may not have any "struct
+ * page" backing; even when it does, those pages are not refcounted. COWed
+ * pages of a VM_PFNMAP do always have a struct page, and they are normally
+ * refcounted (they are _normal_ pages).
+ *
+ * So a raw PFNMAP mapping will have each page table entry just pointing
+ * to a page frame number, and as far as the VM layer is concerned, those do
+ * not have pages associated with them - even if the PFN might point to memory
* that otherwise is perfectly fine and has a "struct page".
*
- * The way we recognize those mappings is through the rules set up
- * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
- * and the vm_pgoff will point to the first PFN mapped: thus every
+ * The way we recognize COWed pages within VM_PFNMAP mappings is through the
+ * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
+ * set, and the vm_pgoff will point to the first PFN mapped: thus every
* page that is a raw mapping will always honor the rule
*
* pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
*
- * and if that isn't true, the page has been COW'ed (in which case it
- * _does_ have a "struct page" associated with it even if it is in a
- * VM_PFNMAP range).
+ * A call to vm_normal_page() will return NULL for such a page.
+ *
+ * If the page doesn't follow the "remap_pfn_range()" rule in a VM_PFNMAP
+ * then the page has been COW'ed. A COW'ed page _does_ have a "struct page"
+ * associated with it even if it is in a VM_PFNMAP range. Calling
+ * vm_normal_page() on such a page will therefore return the "struct page".
+ *
+ *
+ * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
+ * page" backing, however the difference is that _all_ pages with a struct
+ * page (that is, those where pfn_valid is true) are refcounted and considered
+ * normal pages by the VM. The disadvantage is that pages are refcounted
+ * (which can be slower and simply not an option for some PFNMAP users). The
+ * advantage is that we don't have to follow the strict linearity rule of
+ * PFNMAP mappings in order to support COWable mappings.
+ *
+ * A call to vm_normal_page() with a VM_MIXEDMAP mapping will return the
+ * associated "struct page" or NULL for memory not backed by a "struct page".
+ *
+ *
+ * All other mappings should have a valid struct page, which will be
+ * returned by a call to vm_normal_page().
*/
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
unsigned long pfn = pte_pfn(pte);
- if (unlikely(vma->vm_flags & VM_PFNMAP)) {
- unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
+ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+ if (vma->vm_flags & VM_MIXEDMAP) {
+ if (!pfn_valid(pfn))
+ return NULL;
+ goto out;
+ } else {
+ unsigned long off = (addr-vma->vm_start) >> PAGE_SHIFT;
+ if (pfn == vma->vm_pgoff + off)
+ return NULL;
+ if (!is_cow_mapping(vma->vm_flags))
+ return NULL;
+ }
}
/*
@@ -410,6 +440,7 @@ struct page *vm_normal_page(struct vm_ar
* The ZERO_PAGE() pages and various VDSO mappings can
* cause them to exist.
*/
+out:
return pfn_to_page(pfn);
}
@@ -1211,8 +1242,11 @@ int vm_insert_pfn(struct vm_area_struct
pte_t *pte, entry;
spinlock_t *ptl;
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
+ (VM_PFNMAP|VM_MIXEDMAP));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+ BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
retval = -ENOMEM;
pte = get_locked_pte(mm, addr, &ptl);
@@ -2386,10 +2420,13 @@ static noinline int do_no_pfn(struct mm_
unsigned long pfn;
pte_unmap(page_table);
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+
+ BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+
if (unlikely(pfn == NOPFN_OOM))
return VM_FAULT_OOM;
else if (unlikely(pfn == NOPFN_SIGBUS))
* [rfc][patch 2/2] xip: support non-struct page memory
From: Nick Piggin @ 2007-12-14 13:41 UTC
To: Jared Hulbert, Linux Memory Management List, Carsten Otte
This is just a prototype for one possible way of supporting this. I may
be missing some important detail or eg. have missed some requirement of the
s390 XIP block device that makes the idea infeasible... comments?
---
Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP
for the user mappings.
This requires the get_xip_page API to be changed to an address-based one.
(The kaddr->pfn conversion may not be quite right for all architectures or XIP
memory mappings, and the cacheflushing may need to be updated for some archs).
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -800,7 +800,7 @@ const struct address_space_operations ex
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_page = ext2_get_xip_page,
+ .get_xip_address = ext2_get_xip_address,
};
const struct address_space_operations ext2_nobh_aops = {
Index: linux-2.6/fs/ext2/xip.c
===================================================================
--- linux-2.6.orig/fs/ext2/xip.c
+++ linux-2.6/fs/ext2/xip.c
@@ -15,24 +15,25 @@
#include "xip.h"
static inline int
-__inode_direct_access(struct inode *inode, sector_t sector,
- unsigned long *data)
+__inode_direct_access(struct inode *inode, sector_t block, unsigned long *data)
{
+ sector_t sector;
BUG_ON(!inode->i_sb->s_bdev->bd_disk->fops->direct_access);
+
+ sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
return inode->i_sb->s_bdev->bd_disk->fops
- ->direct_access(inode->i_sb->s_bdev,sector,data);
+ ->direct_access(inode->i_sb->s_bdev, sector, data);
}
static inline int
-__ext2_get_sector(struct inode *inode, sector_t offset, int create,
+__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
sector_t *result)
{
struct buffer_head tmp;
int rc;
memset(&tmp, 0, sizeof(struct buffer_head));
- rc = ext2_get_block(inode, offset/ (PAGE_SIZE/512), &tmp,
- create);
+ rc = ext2_get_block(inode, pgoff, &tmp, create);
*result = tmp.b_blocknr;
/* did we get a sparse block (hole in the file)? */
@@ -45,13 +46,12 @@ __ext2_get_sector(struct inode *inode, s
}
int
-ext2_clear_xip_target(struct inode *inode, int block)
+ext2_clear_xip_target(struct inode *inode, sector_t block)
{
- sector_t sector = block * (PAGE_SIZE/512);
unsigned long data;
int rc;
- rc = __inode_direct_access(inode, sector, &data);
+ rc = __inode_direct_access(inode, block, &data);
if (!rc)
clear_page((void*)data);
return rc;
@@ -69,24 +69,24 @@ void ext2_xip_verify_sb(struct super_blo
}
}
-struct page *
-ext2_get_xip_page(struct address_space *mapping, sector_t offset,
- int create)
+void *
+ext2_get_xip_address(struct address_space *mapping, pgoff_t pgoff, int create)
{
int rc;
unsigned long data;
- sector_t sector;
+ sector_t block;
/* first, retrieve the sector number */
- rc = __ext2_get_sector(mapping->host, offset, create, &sector);
+ rc = __ext2_get_block(mapping->host, pgoff, create, &block);
if (rc)
goto error;
/* retrieve address of the target data */
- rc = __inode_direct_access
- (mapping->host, sector * (PAGE_SIZE/512), &data);
- if (!rc)
- return virt_to_page(data);
+ rc = __inode_direct_access(mapping->host, block, &data);
+ if (rc)
+ goto error;
+
+ return (void *)data;
error:
return ERR_PTR(rc);
Index: linux-2.6/fs/ext2/xip.h
===================================================================
--- linux-2.6.orig/fs/ext2/xip.h
+++ linux-2.6/fs/ext2/xip.h
@@ -7,15 +7,15 @@
#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, int);
+extern int ext2_clear_xip_target (struct inode *, sector_t);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-struct page* ext2_get_xip_page (struct address_space *, sector_t, int);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_page)
+void *ext2_get_xip_address(struct address_space *, sector_t, int);
+#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_address)
#else
#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -778,7 +778,7 @@ static struct file *__dentry_open(struct
if (f->f_flags & O_DIRECT) {
if (!f->f_mapping->a_ops ||
((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_page))) {
+ (!f->f_mapping->a_ops->get_xip_address))) {
fput(f);
f = ERR_PTR(-EINVAL);
}
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -473,8 +473,7 @@ struct address_space_operations {
int (*releasepage) (struct page *, gfp_t);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- struct page* (*get_xip_page)(struct address_space *, sector_t,
- int);
+ void * (*get_xip_address)(struct address_space *, pgoff_t, int);
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct address_space *,
struct page *, struct page *);
Index: linux-2.6/mm/fadvise.c
===================================================================
--- linux-2.6.orig/mm/fadvise.c
+++ linux-2.6/mm/fadvise.c
@@ -49,7 +49,7 @@ asmlinkage long sys_fadvise64_64(int fd,
goto out;
}
- if (mapping->a_ops->get_xip_page)
+ if (mapping->a_ops->get_xip_address)
/* no bad return value, but ignore advice */
goto out;
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/sched.h>
#include <asm/tlbflush.h>
+#include <asm/io.h>
/*
* We do use our own empty page to avoid interference with other users
@@ -41,36 +42,39 @@ static struct page *xip_sparse_page(void
/*
* This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_page() function for the actual low-level
+ * the mapping->a_ops->get_xip_address() function for the actual low-level
* stuff.
*
* Note the struct file* is not used at all. It may be NULL.
*/
-static void
+static ssize_t
do_xip_mapping_read(struct address_space *mapping,
struct file_ra_state *_ra,
struct file *filp,
- loff_t *ppos,
- read_descriptor_t *desc,
- read_actor_t actor)
+ char __user *buf,
+ size_t len,
+ loff_t *ppos)
{
struct inode *inode = mapping->host;
unsigned long index, end_index, offset;
- loff_t isize;
+ loff_t isize, pos;
+ size_t copied = 0, error = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
- index = *ppos >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;
+ pos = *ppos;
+ index = pos >> PAGE_CACHE_SHIFT;
+ offset = pos & ~PAGE_CACHE_MASK;
isize = i_size_read(inode);
if (!isize)
goto out;
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- for (;;) {
- struct page *page;
- unsigned long nr, ret;
+ do {
+ unsigned long nr, left;
+ void *xip_mem;
+ int zero = 0;
/* nr is the maximum number of bytes to copy from this page */
nr = PAGE_CACHE_SIZE;
@@ -83,17 +87,20 @@ do_xip_mapping_read(struct address_space
}
}
nr = nr - offset;
+ if (nr > len)
+ nr = len;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
- goto no_xip_page;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA) {
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem) {
+ error = -EIO;
+ goto out;
+ }
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA) {
/* sparse */
- page = ZERO_PAGE(0);
+ zero = 1;
} else {
- desc->error = PTR_ERR(page);
+ error = PTR_ERR(xip_mem);
goto out;
}
}
@@ -103,10 +110,10 @@ do_xip_mapping_read(struct address_space
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ /* address based flush */ ;
/*
- * Ok, we have the page, so now we can copy it to user space...
+ * Ok, we have the mem, so now we can copy it to user space...
*
* The actor routine returns how many bytes were actually used..
* NOTE! This may not be the same as how much of a user buffer
@@ -114,47 +121,38 @@ do_xip_mapping_read(struct address_space
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
- offset += ret;
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
+ if (!zero)
+ left = __copy_to_user(buf+copied, xip_mem+offset, nr);
+ else
+ left = __clear_user(buf + copied, nr);
- if (ret == nr && desc->count)
- continue;
- goto out;
+ if (left) {
+ error = -EFAULT;
+ goto out;
+ }
-no_xip_page:
- /* Did not get the page. Report it */
- desc->error = -EIO;
- goto out;
- }
+ copied += (nr - left);
+ offset += (nr - left);
+ index += offset >> PAGE_CACHE_SHIFT;
+ offset &= ~PAGE_CACHE_MASK;
+ } while (copied < len);
out:
- *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+ *ppos = pos + copied;
if (filp)
file_accessed(filp);
+
+ return (copied ? copied : error);
}
ssize_t
xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
- read_descriptor_t desc;
-
if (!access_ok(VERIFY_WRITE, buf, len))
return -EFAULT;
- desc.written = 0;
- desc.arg.buf = buf;
- desc.count = len;
- desc.error = 0;
-
- do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- ppos, &desc, file_read_actor);
-
- if (desc.written)
- return desc.written;
- else
- return desc.error;
+ return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
+ buf, len, ppos);
}
EXPORT_SYMBOL_GPL(xip_file_read);
@@ -209,13 +207,14 @@ __xip_unmap (struct address_space * mapp
*
* This function is derived from filemap_fault, but used for execute in place
*/
-static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf)
+static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- struct file *file = area->vm_file;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
- struct page *page;
pgoff_t size;
+ void *xip_mem;
+ struct page *page;
/* XXX: are VM_FAULT_ codes OK? */
@@ -223,24 +222,32 @@ static int xip_file_fault(struct vm_area
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 0);
- if (!IS_ERR(page))
- goto out;
- if (PTR_ERR(page) != -ENODATA)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, vmf->pgoff, 0);
+ if (!IS_ERR(xip_mem))
+ goto found;
+ if (PTR_ERR(xip_mem) != -ENODATA)
return VM_FAULT_OOM;
/* sparse block */
- if ((area->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (area->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
+ if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
+ (vma->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
(!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
+ unsigned long pfn;
+
/* maybe shared writable, allocate new block */
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 1);
- if (IS_ERR(page))
+ xip_mem = mapping->a_ops->get_xip_address(mapping,vmf->pgoff,1);
+ if (IS_ERR(xip_mem))
return VM_FAULT_SIGBUS;
- /* unmap page at pgoff from all other vmas */
+ /* unmap sparse mappings at pgoff from all other vmas */
__xip_unmap(mapping, vmf->pgoff);
+
+found:
+ pfn = virt_to_phys(xip_mem) >> PAGE_SHIFT;
+ if (!pfn_valid(pfn)) {
+ vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn);
+ return VM_FAULT_NOPAGE;
+ }
+ page = pfn_to_page(pfn);
} else {
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
@@ -248,7 +255,6 @@ static int xip_file_fault(struct vm_area
return VM_FAULT_OOM;
}
-out:
page_cache_get(page);
vmf->page = page;
return 0;
@@ -260,11 +266,11 @@ static struct vm_operations_struct xip_f
int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
{
- BUG_ON(!file->f_mapping->a_ops->get_xip_page);
+ BUG_ON(!file->f_mapping->a_ops->get_xip_address);
file_accessed(file);
vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_CAN_NONLINEAR;
+ vma->vm_flags |= VM_CAN_NONLINEAR | VM_MIXEDMAP;
return 0;
}
EXPORT_SYMBOL_GPL(xip_file_mmap);
@@ -277,17 +283,16 @@ __xip_file_write(struct file *filp, cons
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- struct page *page;
size_t bytes;
ssize_t written = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
do {
unsigned long index;
unsigned long offset;
size_t copied;
- char *kaddr;
+ void *xip_mem;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
@@ -295,28 +300,22 @@ __xip_file_write(struct file *filp, cons
if (bytes > count)
bytes = count;
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (IS_ERR(page) && (PTR_ERR(page) == -ENODATA)) {
+ xip_mem = a_ops->get_xip_address(mapping, index, 0);
+ if (IS_ERR(xip_mem) && (PTR_ERR(xip_mem) == -ENODATA)) {
/* we allocate a new page unmap it */
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 1);
- if (!IS_ERR(page))
+ xip_mem = a_ops->get_xip_address(mapping, index, 1);
+ if (!IS_ERR(xip_mem))
/* unmap page at pgoff from all other vmas */
__xip_unmap(mapping, index);
}
- if (IS_ERR(page)) {
- status = PTR_ERR(page);
+ if (IS_ERR(xip_mem)) {
+ status = PTR_ERR(xip_mem);
break;
}
- fault_in_pages_readable(buf, bytes);
- kaddr = kmap_atomic(page, KM_USER0);
copied = bytes -
- __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- flush_dcache_page(page);
+ __copy_from_user_nocache(xip_mem + offset, buf, bytes);
if (likely(copied > 0)) {
status = copied;
@@ -396,7 +395,7 @@ EXPORT_SYMBOL_GPL(xip_file_write);
/*
* truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_page
+ * functionality is analogous to block_truncate_page but does use get_xip_address
* to get the page instead of page cache
*/
int
@@ -406,9 +405,9 @@ xip_truncate_page(struct address_space *
unsigned offset = from & (PAGE_CACHE_SIZE-1);
unsigned blocksize;
unsigned length;
- struct page *page;
+ void *xip_mem;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
blocksize = 1 << mapping->host->i_blkbits;
length = offset & (blocksize - 1);
@@ -419,18 +418,17 @@ xip_truncate_page(struct address_space *
length = blocksize - length;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem)
return -ENOMEM;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA)
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA)
/* Hole? No need to truncate */
return 0;
else
- return PTR_ERR(page);
+ return PTR_ERR(xip_mem);
}
- zero_user_page(page, offset, length, KM_USER0);
+ memset(xip_mem + offset, 0, length);
return 0;
}
EXPORT_SYMBOL_GPL(xip_truncate_page);
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -112,7 +112,7 @@ static long madvise_willneed(struct vm_a
if (!file)
return -EBADF;
- if (file->f_mapping->a_ops->get_xip_page) {
+ if (file->f_mapping->a_ops->get_xip_address) {
/* no bad return value, but ignore advice */
return 0;
}
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-14 13:46 UTC
To: Nick Piggin; +Cc: Jared Hulbert, Linux Memory Management List
Nick Piggin wrote:
> This is just a prototype for one possible way of supporting this. I may
> be missing some important detail or eg. have missed some requirement of the
> s390 XIP block device that makes the idea infeasible... comments?
Seems to be Christmas time; I get a feature that has been on my most
wanted list for years :-). Will play with it and test it asap :-).
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Jared Hulbert @ 2007-12-15 1:07 UTC
To: carsteno; +Cc: Nick Piggin, Linux Memory Management List
> Nick Piggin wrote:
> > This is just a prototype for one possible way of supporting this. I may
> > be missing some important detail or eg. have missed some requirement of the
> > s390 XIP block device that makes the idea infeasible... comments?
> Seems to be Christmas time; I get a feature that has been on my most
> wanted list for years :-). Will play with it and test it asap :-).
That's exactly how I feel. I'm testing it out right now.
One thing I would love is a way for get_xip_address to be able to
punt. To be able to tell filemap_xip.c functions that the filemap.c
or generic functions should be used instead. For example
xip_file_fault() calls filemap_fault() when get_xip_address() returns
NULL. Can we do that for a return value of NULL?
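Roughly what I mean, as a sketch (assuming a NULL return gets reserved
for "fall back to the generic path"; the patch as posted treats NULL as
-EIO in the read path). The wrapper is equivalent to doing the check at
the top of xip_file_fault():

/* sketch: NULL from get_xip_address() punts to the page cache */
static int xip_or_filemap_fault(struct vm_area_struct *vma,
				struct vm_fault *vmf)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	void *xip_mem;

	xip_mem = mapping->a_ops->get_xip_address(mapping, vmf->pgoff, 0);
	if (!xip_mem)
		return filemap_fault(vma, vmf);	/* not XIP-backed */
	return xip_file_fault(vma, vmf);	/* the path from the patch */
}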
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Nick Piggin @ 2007-12-15 1:17 UTC
To: Jared Hulbert; +Cc: carsteno, Linux Memory Management List
On Fri, Dec 14, 2007 at 05:07:38PM -0800, Jared Hulbert wrote:
> > Nick Piggin wrote:
> > > This is just a prototype for one possible way of supporting this. I may
> > > be missing some important detail or eg. have missed some requirement of the
> > > s390 XIP block device that makes the idea infeasible... comments?
> > Seems to be Christmas time; I get a feature that has been on my most
> > wanted list for years :-). Will play with it and test it asap :-).
>
> That's exactly how I feel. I'm testing it out right now.
Well, then call me Saint Nick ;)
> One thing I would love is a way for get_xip_address to be able to
> punt. To be able to tell filemap_xip.c functions that the filemap.c
> or generic functions should be used instead. For example
> xip_file_fault() calls filemap_fault() when get_xip_address() returns
> NULL. Can we do that for a return value of NULL?
I was thinking about that, but I wonder if it shouldn't be done in
the filesystem. Eg. if your filesystem mixes both pagecache and XIP,
then it would call into either filemap or filemap_xip...
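Ie. roughly this in the filesystem, rather than in filemap_xip.c (a
sketch only: the axfs_* names and the axfs_page_is_xip() helper are
invented, and xip_file_fault() would need to be exported):

static int axfs_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	/* axfs_page_is_xip() is hypothetical: is this offset directly
	 * addressable, or does it live in the page cache? */
	if (axfs_page_is_xip(vma->vm_file, vmf->pgoff))
		return xip_file_fault(vma, vmf);
	return filemap_fault(vma, vmf);
}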
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Jared Hulbert @ 2007-12-15 6:47 UTC
To: Nick Piggin; +Cc: carsteno, Linux Memory Management List
> > One thing I would love is a way for get_xip_address to be able to
> > punt. To be able to tell filemap_xip.c functions that the filemap.c
> > or generic functions should be used instead. For example
> > xip_file_fault() calls filemap_fault() when get_xip_address() returns
> > NULL. Can we do that for a return value of NULL?
>
> I was thinking about that, but I wonder if it shouldn't be done in
> the filesystem. Eg. if your filesystem mixes both pagecache and XIP,
> then it would call into either filemap or filemap_xip...
Well yeah it can be done in the filesystem. I just hate to have an
axfs_mmap() that is identical to xip_file_mmap() if it can be avoided.
Is there some reason not to do the NULL thing?
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-19 14:04 UTC
To: Nick Piggin; +Cc: Jared Hulbert, Linux Memory Management List
Nick Piggin wrote:
> This is just a prototype for one possible way of supporting this. I may
> be missing some important detail or eg. have missed some requirement of the
> s390 XIP block device that makes the idea infeasible... comments?
I've tested your patch series on s390 with dcssblk block device and
ext2 file system with -o xip. Everything seems to work fine. I will
now patch my kernel not to build struct page for the shared segment
and see if that works too.
so long,
Carsten
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Jared Hulbert @ 2007-12-20 9:23 UTC
To: carsteno; +Cc: Nick Piggin, Linux Memory Management List
On Dec 19, 2007 6:04 AM, Carsten Otte <cotte@de.ibm.com> wrote:
> Nick Piggin wrote:
> > This is just a prototype for one possible way of supporting this. I may
> > be missing some important detail or eg. have missed some requirement of the
> > s390 XIP block device that makes the idea infeasible... comments?
> I've tested your patch series on s390 with dcssblk block device and
> ext2 file system with -o xip. Everything seems to work fine. I will
> now patch my kernel not to build struct page for the shared segment
> and see if that works too.
I tested it with AXFS for ARM on NOR flash (pfn) and with a UML build
on x86 using the UML iomem interface (struct page). Works slick.
Cleans up the nastiest part of AXFS and makes an MTD patch unnecessary.
Very nice.
So we've got some documentation to do, and you missed this: it won't
compile with EXT2 XIP off.
diff -r e677a09f65e2 fs/ext2/xip.h
--- a/fs/ext2/xip.h Thu Dec 20 00:53:18 2007 -0800
+++ b/fs/ext2/xip.h Thu Dec 20 01:14:41 2007 -0800
@@ -21,5 +21,5 @@ void *ext2_get_xip_address(struct addres
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
-#define ext2_get_xip_page NULL
+#define ext2_get_xip_address NULL
#endif
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-20 13:53 UTC
To: Nick Piggin
Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky
Nick Piggin wrote:
> This is just a prototype for one possible way of supporting this. I may
> be missing some important detail or eg. have missed some requirement of the
> s390 XIP block device that makes the idea infeasible... comments?
I've tested things now without initialization of our struct page
entries for s390. This doesn't work out, as you can see below.
free_hot_cold_page apparently still uses the struct page behind our
shared memory segment.
Please don't get confused by the process name "mount": this is _not_
the mount that has mounted the xip file system, but rather an elf
binary of /bin/mount [on ext3] which is linked against a library in
/usr/lib64 [on ext2 -o xip].
I'll drill down deeper here to see why it doesn't work as expected...
<6>extmem info:segment_load: loaded segment COTTE range
0000000020000000 .. 000000007fe00fff type SW in shared mode
<6>dcssblk info: Loaded segment COTTE, size = 1608519680 Byte,
capacity = 3141640 (512 Byte) sectors
<4>EXT2-fs warning: checktime reached, running e2fsck is recommended
<0>Bad page state in process 'mount'
<0>page:000003fffedd7a50 flags:0x0000000000000000
mapping:0000000000000000 mapcount:1 count:0
<0>Trying to fix it up, but a reboot is needed
<0>Backtrace:
<4>0000000000000000 000000000fbd5b58 0000000000000002
0000000000000000
<4> 000000000fbd5bf8 000000000fbd5b70 000000000fbd5b70
000000000012b882
<4> 0000000000000000 0000000000000000 000003fffe6f76f8
0000000000000000
<4> 0000000000000000 000000000fbd5b58 000000000000000d
000000000fbd5bc8
<4> 0000000000415f30 00000000001037b8 000000000fbd5b58
000000000fbd5ba0
<4>Call Trace:
<4>([<0000000000103736>] show_trace+0x12e/0x148)
<4> [<0000000000171e10>] bad_page+0x94/0xd0
<4> [<0000000000172c80>] free_hot_cold_page+0x218/0x230
<4> [<0000000000180082>] unmap_vmas+0x4e6/0xc50
<4> [<0000000000185fa0>] exit_mmap+0x128/0x408
<4> [<0000000000127e90>] mmput+0x70/0xe4
<4> [<000000000012f606>] do_exit+0x1b6/0x8ac
<4> [<000000000012fd48>] do_group_exit+0x4c/0xa4
<4> [<00000000001102b8>] sysc_noemu+0x10/0x16
<4> [<0000020000108272>] 0x20000108272
...and here is the patch I use to get rid of the struct page entries in
our vmem_map array:
---
Index: linux-2.6/arch/s390/mm/vmem.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/vmem.c
+++ linux-2.6/arch/s390/mm/vmem.c
@@ -310,8 +310,6 @@ out:
int add_shared_memory(unsigned long start, unsigned long size)
{
struct memory_segment *seg;
- struct page *page;
- unsigned long pfn, num_pfn, end_pfn;
int ret;
mutex_lock(&vmem_mutex);
@@ -330,20 +328,6 @@ int add_shared_memory(unsigned long star
if (ret)
goto out_remove;
- pfn = PFN_DOWN(start);
- num_pfn = PFN_DOWN(size);
- end_pfn = pfn + num_pfn;
-
- page = pfn_to_page(pfn);
- memset(page, 0, num_pfn * sizeof(struct page));
-
- for (; pfn < end_pfn; pfn++) {
- page = pfn_to_page(pfn);
- init_page_count(page);
- reset_page_mapcount(page);
- SetPageReserved(page);
- INIT_LIST_HEAD(&page->lru);
- }
goto out;
out_remove:
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-20 14:33 UTC
To: Nick Piggin
Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky
Carsten Otte wrote:
> I'll drill down deeper here to see why it doesn't work as expected...
Apparently pfn_valid() is true for our shared memory segment. The s390
implementation checks if the pfn is within max_pfn, which reflects the
size of the kernel page table 1:1 mapping. If that is the case, we use
one of our many magic instructions, "lra", to ask our mmu if there is
memory we can access at the subject address. Both are true for our shared
memory segment. Thus, the page gets refcounted as a regular page on a
struct page entry that is not initialized.
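In rough pseudo-C, the check behaves like this (an illustration of the
above only, not the literal s390 source; lra_succeeds() stands in for
the "lra" probe):

/* illustration: why pfn_valid() is true for the shared segment */
static inline int s390_pfn_valid(unsigned long pfn)
{
	if (pfn >= max_pfn)	/* beyond the kernel 1:1 mapping? */
		return 0;
	/* "lra" asks the mmu whether storage is actually accessible
	 * there - which it is for our shared segment, too */
	return lra_succeeds(pfn << PAGE_SHIFT);
}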
Even worse, changing the semantics of pfn_valid() on s390 to be false
for shared segments is not an option. We'll want to use the same memory
segment for memory hotplug, and in that case we do want refcounting
because it becomes regular linux memory.
So bottom line, I think we do need a different trigger than pfn_valid()
to select which pages within VM_MIXEDMAP get refcounted and which don't.
cheers,
Carsten
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-20 14:50 UTC
To: carsteno
Cc: Nick Piggin, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky
Carsten Otte wrote:
> So bottom line, I think we do need a different trigger than pfn_valid()
> to select which pages within VM_MIXEDMAP get refcounted and which don't.
A poor man's solution could be to store a pfn range of the flash chip
and/or shared memory segment inside the vm_area_struct, and in case of
VM_MIXEDMAP check if the pfn matches that range. If so: no
refcounting. If not: regular refcounting. Is that an option?
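As a sketch, with the two vm_area_struct fields invented for
illustration; the driver's mmap() would fill them in together with
VM_MIXEDMAP:

/* sketch: vm_raw_pfn_start/vm_raw_pfn_end are hypothetical new fields */
static inline int vma_pfn_is_raw(struct vm_area_struct *vma,
				 unsigned long pfn)
{
	return pfn >= vma->vm_raw_pfn_start && pfn < vma->vm_raw_pfn_end;
}

/* vm_normal_page() would then return NULL (no refcounting) when
 * vma_pfn_is_raw() is true, and refcount everything else as normal */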
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Jared Hulbert @ 2007-12-20 17:24 UTC
To: carsteno; +Cc: Nick Piggin, Linux Memory Management List, Martin Schwidefsky
> A poor man's solution could be to store a pfn range of the flash chip
> and/or shared memory segment inside vm_area_struct, and in case of
> VM_MIXEDMAP we check if the pfn matches that range. If so: no
> refcounting. If not: regular refcounting. Is that an option?
I'm not picturing what is responsible for configuring this stored pfn
range. Does the fs do it on mount? Does the MTD or your funky
direct_access block driver do it?
What if you use VM_PFNMAP instead of VM_MIXEDMAP?
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Jared Hulbert @ 2007-12-21 0:12 UTC
To: carsteno; +Cc: Nick Piggin, Linux Memory Management List, Martin Schwidefsky
On Dec 20, 2007 9:24 AM, Jared Hulbert <jaredeh@gmail.com> wrote:
> > A poor man's solution could be to store a pfn range of the flash chip
> > and/or shared memory segment inside vm_area_struct, and in case of
> > VM_MIXEDMAP we check if the pfn matches that range. If so: no
> > refcounting. If not: regular refcounting. Is that an option?
>
> I'm not picturing what is responsible for configuring this stored pfn
> range. Does the fs do it on mount? Does the MTD or your funky
> direct_access block driver do it?
>
> What if you use VM_PFNMAP instead of VM_MIXEDMAP?
Though that might _work_ for ext2, it doesn't fix VM_MIXEDMAP.
vm_normal_page() needs to know if a VM_MIXEDMAP pfn has a struct page
or not. Somebody had suggested we'd need a pfn_normal() or something.
Maybe it should be called pfn_has_page() instead. For ARM
pfn_has_page() == pfn_valid() near as I can tell. What about on s390?
If pfn_valid() doesn't work, then can you check if the pfn is
hotplugged in? What would pfn_to_page() return if the associated
struct page entry was not initialized? Can we use what is returned to
check if the pfn has no page?
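Ie. a hypothetical hook along these lines, which on ARM would simply
collapse to pfn_valid():

/* hypothetical: does this pfn have a usable struct page behind it? */
#ifndef pfn_has_page
#define pfn_has_page(pfn)	pfn_valid(pfn)
#endif

/* vm_normal_page() would then test pfn_has_page() instead of
 * pfn_valid() in the VM_MIXEDMAP case before refcounting */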
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Nick Piggin @ 2007-12-21 0:40 UTC
To: Jared Hulbert; +Cc: carsteno, Linux Memory Management List
On Thu, Dec 20, 2007 at 01:23:53AM -0800, Jared Hulbert wrote:
> On Dec 19, 2007 6:04 AM, Carsten Otte <cotte@de.ibm.com> wrote:
> > Nick Piggin wrote:
> > > This is just a prototype for one possible way of supporting this. I may
> > > be missing some important detail or eg. have missed some requirement of the
> > > s390 XIP block device that makes the idea infeasible... comments?
> > I've tested your patch series on s390 with dcssblk block device and
> > ext2 file system with -o xip. Everything seems to work fine. I will
> > now patch my kernel not to build struct page for the shared segment
> > and see if that works too.
>
> I tested it with AXFS for ARM on NOR flash (pfn) and with a UML build
> on x86 using the UML iomem interface (struct page). Works slick.
> Cleans up the nastiest part of AXFS and makes an MTD patch unnecessary.
> Very nice.
Ah, excellent. Thanks for testing! I may not have a lot more time to spend
on this before next year, but feel free to do what you please with it until
then.
> So we've got some documentation to do, and you missed this: it won't
> compile with EXT2 XIP off.
Yep, thanks.
>
> diff -r e677a09f65e2 fs/ext2/xip.h
> --- a/fs/ext2/xip.h Thu Dec 20 00:53:18 2007 -0800
> +++ b/fs/ext2/xip.h Thu Dec 20 01:14:41 2007 -0800
> @@ -21,5 +21,5 @@ void *ext2_get_xip_address(struct addres
> #define ext2_xip_verify_sb(sb) do { } while (0)
> #define ext2_use_xip(sb) 0
> #define ext2_clear_xip_target(inode, chain) 0
> -#define ext2_get_xip_page NULL
> +#define ext2_get_xip_address NULL
> #endif
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Nick Piggin @ 2007-12-21 0:45 UTC
To: carsteno; +Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky
On Thu, Dec 20, 2007 at 03:33:05PM +0100, Carsten Otte wrote:
> Carsten Otte wrote:
> >I'll drill down deeper here to see why it does'nt work as expected...
> Apparently pfn_valid() is true for our shared memory segment. The s390
> implementation checks if the pfn is within max_pfn, which reflects the
> size of the kernel page table 1:1 mapping. If that is the case, we use
> one of our many magic instructions "lra" to ask our mmu if there is
> memory we can access at subject address. Both is true for our shared
> memory segment. Thus, the page gets refcounted regular on a struct
> page entry that is not initialized.
>
> Even worse, changing the semantic of pfn_valid() on s390 to be false
> for shared segments is no option. We'll want to use the same memory
> segment for memory hotplug. And in that case we do want refcounting
> because it becomes regular linux memory.
So then you're back to needing struct pages again. Do you allocate
them at hotplug time?
AFAIK, sparsemem keeps track of all sections for pfn_valid(), which would
work. Any plans to convert s390 to it? ;)
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Nick Piggin @ 2007-12-21 0:50 UTC
To: carsteno; +Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky
On Thu, Dec 20, 2007 at 03:50:27PM +0100, Carsten Otte wrote:
> Carsten Otte wrote:
> >So bottom line, I think we do need a different trigger than pfn_valid()
> >to select which pages within VM_MIXEDMAP get refcounted and which don't.
> A poor man's solution could be to store a pfn range of the flash chip
> and/or shared memory segment inside vm_area_struct, and in case of
> VM_MIXEDMAP we check if the pfn matches that range. If so: no
> refcounting. If not: regular refcounting. Is that an option?
Yeah, although I'd not particularly like to touch generic code for such a
thing (except of course we could add an extra test to VM_MIXEDMAP, which
would be a noop for all other architectures).
You wouldn't even need to store it in the vm_area_struct -- you could just
set up eg. an rb tree of flash extents, and have a function that looks up
that tree for you.
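The lookup side could look roughly like this, whatever structure ends
up underneath (a sketch: the names are invented, and registration and
locking are left out):

#include <linux/list.h>

/* sketch: arch-private list of extents whose pfns are not refcounted */
struct raw_extent {
	struct list_head list;
	unsigned long start_pfn;
	unsigned long nr_pfn;
};

static LIST_HEAD(raw_extents);

static int pfn_in_raw_extent(unsigned long pfn)
{
	struct raw_extent *ext;

	list_for_each_entry(ext, &raw_extents, list)
		if (pfn - ext->start_pfn < ext->nr_pfn)
			return 1;
	return 0;
}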
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Nick Piggin @ 2007-12-21 0:56 UTC
To: Jared Hulbert; +Cc: carsteno, Linux Memory Management List, Martin Schwidefsky
On Thu, Dec 20, 2007 at 04:12:52PM -0800, Jared Hulbert wrote:
> On Dec 20, 2007 9:24 AM, Jared Hulbert <jaredeh@gmail.com> wrote:
> > > A poor man's solution could be to store a pfn range of the flash chip
> > > and/or shared memory segment inside vm_area_struct, and in case of
> > > VM_MIXEDMAP we check if the pfn matches that range. If so: no
> > > refcounting. If not: regular refcounting. Is that an option?
> >
> > I'm not picturing what is responsible for configuring this stored pfn
> > range. Does the fs do it on mount? Does the MTD or your funky
> > direct_access block driver do it?
> >
> > What if you use VM_PFNMAP instead of VM_MIXEDMAP?
>
> Though that might _work_ for ext2 it doesn't fix VM_MIXEDMAP.
Yeah, I guess they have the same problem as you: they want to be able
to support COW of non-contiguous physical memory mappings as well (which
PFNMAP can't do).
> vm_normal_page() needs to know if a VM_MIXEDMAP pfn has a struct page
> or not. Somebody had suggested we'd need a pfn_normal() or something.
> Maybe it should be called pfn_has_page() instead. For ARM
> pfn_has_page() == pfn_valid() near as I can tell. What about on s390?
> If pfn_valid() doesn't work, then can you check if the pfn is
> hotplugged in? What would pfn_to_page() return if the associated
> struct page entry was not initialized? Can we use what is returned to
> check if the pfn has no page?
As far as I know, that's what pfn_valid() should tell you (ie. that you
have a struct page allocated). So I think this is kind of a quirk of the
s390 memory model and I'd rather not "legitimize" it by calling it pfn_normal
(because then what's pfn_valid for?).
But definitely I think we could support a hack for them one way or the other.
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-21 9:49 UTC
To: Jared Hulbert
Cc: carsteno, Nick Piggin, Linux Memory Management List,
Martin Schwidefsky
Jared Hulbert wrote:
> I'm not picturing what is responsible for configuring this stored pfn
> range. Does the fs do it on mount? Does the MTD or your funky
> direct_access block driver do it?
We could set it up at mmap() time, where we do set VM_MIXEDMAP
altogether.
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-21 9:56 UTC
To: Jared Hulbert
Cc: carsteno, Nick Piggin, Linux Memory Management List,
Martin Schwidefsky
Jared Hulbert wrote:
> vm_normal_page() needs to know if a VM_MIXEDMAP pfn has a struct page
> or not. Somebody had suggested we'd need a pfn_normal() or something.
> Maybe it should be called pfn_has_page() instead. For ARM
> pfn_has_page() == pfn_valid() near as I can tell. What about on s390?
Well, pfn_valid() doesn't work for us, as I pointed out before.
> If pfn_valid() doesn't work, then can you check if the pfn is
> hotplugged in?
Since the same memory segment may either be used as hotplug memory or
as a shared segment for xip, and since we'd want regular refcounting in
one scenario and not in the other, I don't see an easy way. And
walking a list of ranges to figure it out is definitely too slow.
> What would pfn_to_page() return if the associated
> struct page entry was not initialized?
A pointer to the entry that is not initialized.
> Can we use what is returned to
> check if the pfn has no page?
As far as I understand Heiko's vmem_map magic, when we do access the
vmem_map array to check, a struct page entry is created in reaction to
the page fault. Therefore, this scenario gets us back the disadvantage
of having struct page in the first place: memory consumption.
I think pfn_valid() or pfn_has_page() or a similar arch callback doesn't
work. We need a place to store the information whether or not a page
needs refcounting. Either in the pte, or in vm_area_struct.
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-21 10:02 UTC
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky
Nick Piggin wrote:
> You wouldn't even need to store it in the vm_area_struct -- you could just
> set up eg. an rb tree of flash extents, and have a function that looks up
> that tree for you.
We have a list already, and I don't see the number of plugged extents
getting so large that an rb tree saves us CPU cycles over a list
implementation.
Martin Schwidefsky suggested using a bit in the page table entry to
prevent refcounting. fault() could set it up properly for xip pages.
That would be way faster than walking a list. Would that be an option?
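As a sketch (the bit value is invented; it would be one of the
OS-available bits in a valid s390 pte):

/* sketch: software pte bit meaning "do not refcount this page" */
#define _PAGE_NO_REFCOUNT	0x400UL		/* invented value */

static inline int pte_no_refcount(pte_t pte)
{
	return (pte_val(pte) & _PAGE_NO_REFCOUNT) != 0;
}

/* fault() would set the bit when installing an xip pte, and
 * vm_normal_page() would return NULL when it sees the bit set */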
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-21 10:05 UTC
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Nick Piggin wrote:
> So then you're back to needing struct pages again. Do you allocate
> them at hotplug time?
They get allocated by catching kernel page faults when accessing the
mem_map array and filling in pages on demand. This happens at hotplug
time, where we initialize the content of struct page.
> AFAIK, sparsemem keeps track of all sections for pfn_valid(), which would
> work. Any plans to convert s390 to it? ;)
I think vmem_map is superior to sparsemem, because a
single-dimensional mem_map array is faster to work with (single-step
lookup). And we've got plenty of virtual address space for the
vmem_map array on 64bit.
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Nick Piggin @ 2007-12-21 10:14 UTC
To: carsteno; +Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky
On Fri, Dec 21, 2007 at 11:02:19AM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >You wouldn't even need to store it in the vm_area_struct -- you could just
> >set up eg. an rb tree of flash extents, and have a function that looks up
> >that tree for you.
> We have a list already, and I don't see the number of plugged extents
> getting so large that an rb tree saves us CPU cycles over a list
> implementation.
> Martin Schwidefsky suggested using a bit in the page table entry to
> prevent refcounting. fault() could set it up properly for xip pages.
> That would be way faster than walking a list. Would that be an option?
I thought s390 was short on OS-available pte bits. There are a couple of other
nice things to use them for, so I'd rather not for this if possible (it is
not so critical if you can use a list, I would have thought)
* Re: [rfc][patch 2/2] xip: support non-struct page memory
From: Carsten Otte @ 2007-12-21 10:17 UTC
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky
Nick Piggin wrote:
> I thought s390 was short on OS-available pte bits. There are a couple of other
> nice things to use them for, so I'd rather not for this if possible (it is
> not so critical if you can use a list, I would have thought)
OS-available bits are only short for invalid ptes. For valid ptes
however, there are quite a few spare.
* Re: [rfc][patch 2/2] xip: support non-struct page memory
2007-12-21 10:05 ` Carsten Otte
@ 2007-12-21 10:20 ` Nick Piggin
2007-12-21 10:35 ` Carsten Otte
0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2007-12-21 10:20 UTC (permalink / raw)
To: carsteno
Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky,
Heiko Carstens
On Fri, Dec 21, 2007 at 11:05:52AM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >So then you're back to needing struct pages again. Do you allocate
> >them at hotplug time?
> They get allocated by catching kernel page faults when accessing the
> mem_map array and filling in pages on demand. This happens at hotplug
> time, when we initialize the contents of the struct pages.
Yep OK.
> >AFAIK, sparsemem keeps track of all sections for pfn_valid(), which would
> >work. Any plans to convert s390 to it? ;)
> I think vmem_map is superior to sparsemem, because a
> single-dimensional mem_map array is faster to work with (single-step
> lookup). And we've got plenty of virtual address space for the
> vmem_map array on 64bit.
But it doesn't still retain sparsemem sections behind that? Ie. so that
pfn_valid could be used? (I admittedly don't know enough about the memory
model code).
* Re: [rfc][patch 2/2] xip: support non-struct page memory
2007-12-21 10:17 ` Carsten Otte
@ 2007-12-21 10:23 ` Nick Piggin
2007-12-21 10:31 ` Carsten Otte
0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2007-12-21 10:23 UTC (permalink / raw)
To: carsteno; +Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky
On Fri, Dec 21, 2007 at 11:17:14AM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >I thought s390 was short on OS-available pte bits. There are a couple of
> >other
> >nice things to use them for, so I'd rather not for this if possible (it is
> >not so critical if you can use a list, I would have thought)
> OS-available bits are only short for invalid ptes. For valid ptes
> however, there are quite a few spare.
OK, that's good news for my lockless get_user_pages ;)
And also potentially good news for the whole vm_normal_page scheme...
though I'd prefer to start simple (ie. don't use the pte bit, rather
walk the list), and see if it works first.
But whatever you think I guess, either way it would go in arch specific
code where your opinion outweighs mine ;)
* Re: [rfc][patch 2/2] xip: support non-struct page memory
2007-12-21 10:23 ` Nick Piggin
@ 2007-12-21 10:31 ` Carsten Otte
0 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2007-12-21 10:31 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky
Nick Piggin wrote:
> OK, that's good news for my lockless get_user_pages ;)
>
> And also potentially good news for the whole vm_normal_page scheme...
> though I'd prefer to start simple (ie. don't use the pte bit, rather
> walk the list), and see if it works first.
>
> But whatever you think I guess, either way it would go in arch specific
> code where your opinion outweighs mine ;)
You clearly overestimate my influence on Martin. I'd rather keep my
fingers off the memory management backend there.
But either way, what we'd need is an arch callback that can map to
pfn_valid() for ARM and maybe others, and that we could map differently
elsewhere. I'll try to come up with a patch that implements such a
callback using a list-walk for s390. Hopefully we can safely grab the
list lock everywhere we need to check.
Btw: I will also continue to work on this next year, after taking two
weeks of Christmas vacation.
* Re: [rfc][patch 2/2] xip: support non-struct page memory
2007-12-21 10:20 ` Nick Piggin
@ 2007-12-21 10:35 ` Carsten Otte
2007-12-21 10:47 ` Nick Piggin
0 siblings, 1 reply; 79+ messages in thread
From: Carsten Otte @ 2007-12-21 10:35 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Nick Piggin wrote:
>>> AFAIK, sparsemem keeps track of all sections for pfn_valid(), which would
>>> work. Any plans to convert s390 to it? ;)
>> I think vmem_map is superior to sparsemem, because a
>> single-dimensional mem_map array is faster to work with (single-step
>> lookup). And we've got plenty of virtual address space for the
>> vmem_map array on 64bit.
>
> But it doesn't still retain sparsemem sections behind that? Ie. so that
> pfn_valid could be used? (I admittedly don't know enough about the memory
> model code).
Not as far as I know. But arch/s390/mm/vmem.c has:
struct memory_segment {
struct list_head list;
unsigned long start;
unsigned long size;
};
static LIST_HEAD(mem_segs);
This is maintained every time we map a segment/unmap a segment. And we
could add a bit to struct memory_segment meaning "refcount this one".
This way, we could tell core mm whether or not a pfn should be refcounted.
* Re: [rfc][patch 2/2] xip: support non-struct page memory
2007-12-21 10:35 ` Carsten Otte
@ 2007-12-21 10:47 ` Nick Piggin
2007-12-21 19:29 ` Martin Schwidefsky
` (2 more replies)
0 siblings, 3 replies; 79+ messages in thread
From: Nick Piggin @ 2007-12-21 10:47 UTC (permalink / raw)
To: carsteno
Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky,
Heiko Carstens
On Fri, Dec 21, 2007 at 11:35:02AM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >>>AFAIK, sparsemem keeps track of all sections for pfn_valid(), which would
> >>>work. Any plans to convert s390 to it? ;)
> >>I think vmem_map is superior to sparsemem, because a
> >>single-dimensional mem_map array is faster to work with (single-step
> >>lookup). And we've got plenty of virtual address space for the
> >>vmem_map array on 64bit.
> >
> >But it doesn't still retain sparsemem sections behind that? Ie. so that
> >pfn_valid could be used? (I admittedly don't know enough about the memory
> >model code).
> Not as far as I know. But arch/s390/mm/vmem.c has:
>
> struct memory_segment {
> struct list_head list;
> unsigned long start;
> unsigned long size;
> };
>
> static LIST_HEAD(mem_segs);
>
> This is maintained every time we map a segment/unmap a segment. And we
> could add a bit to struct memory_segment meaning "refcount this one".
> This way, we could tell core mm whether or not a pfn should be refcounted.
Right, this should work.
BTW. having a per-arch function sounds reasonable for a start. I'd just give
it a long name, so that people don't start using it for weird things ;)
mixedmap_refcount_pfn() or something.
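A minimal sketch of what such a list-walk could look like on s390, using the
mem_segs list quoted above (mem_segs_lock and the refcount flag in struct
memory_segment are assumptions here, not existing fields):

	static int mixedmap_refcount_pfn(unsigned long pfn)
	{
		unsigned long addr = pfn << PAGE_SHIFT;
		struct memory_segment *seg;
		int ret = 1;	/* ordinary RAM: refcount it */

		spin_lock(&mem_segs_lock);
		list_for_each_entry(seg, &mem_segs, list) {
			if (addr >= seg->start &&
			    addr - seg->start < seg->size) {
				ret = seg->refcount; /* proposed new flag */
				break;
			}
		}
		spin_unlock(&mem_segs_lock);
		return ret;
	}

vm_normal_page() would then consult this instead of pfn_valid() for
VM_MIXEDMAP vmas.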
* Re: [rfc][patch 2/2] xip: support non-struct page memory
2007-12-21 10:47 ` Nick Piggin
@ 2007-12-21 19:29 ` Martin Schwidefsky
2008-01-07 4:43 ` [rfc][patch] mm: use a pte bit to flag normal pages Nick Piggin
2008-01-08 9:35 ` [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend Carsten Otte
[not found] ` <1199784196.25114.11.camel@cotte.boeblingen.de.ibm.com>
2 siblings, 1 reply; 79+ messages in thread
From: Martin Schwidefsky @ 2007-12-21 19:29 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Heiko Carstens, Jared Hulbert,
Linux Memory Management List
Nick Piggin <npiggin@suse.de> wrote on 12/21/2007 11:47:01 AM:
> On Fri, Dec 21, 2007 at 11:35:02AM +0100, Carsten Otte wrote:
> > Nick Piggin wrote:
> > >But it doesn't still retain sparsemem sections behind that? Ie. so that
> > >pfn_valid could be used? (I admittedly don't know enough about the
> > >memory model code).
> > Not as far as I know. But arch/s390/mm/vmem.c has:
> >
> > struct memory_segment {
> > struct list_head list;
> > unsigned long start;
> > unsigned long size;
> > };
> >
> > static LIST_HEAD(mem_segs);
> >
> > This is maintained every time we map a segment/unmap a segment. And we
> > could add a bit to struct memory_segment meaning "refcount this one".
> > This way, we could tell core mm whether or not a pfn should be refcounted.
>
> Right, this should work.
>
> BTW. having a per-arch function sounds reasonable for a start. I'd just give
> it a long name, so that people don't start using it for weird things ;)
> mixedmap_refcount_pfn() or something.
Hmm, I would prefer to have a pte bit, it seems much more natural to me.
We know that this is a special pte when it gets mapped, but we "forgot"
that fact when the pte is picked up again in vm_normal_page. To search a
list when a simple bit in the pte gets the job done just feels wrong.
By the way, for s390 the lower 8 bits of the pte are OS defined. The lowest
two bits are used in addition to the hardware invalid and the hardware
read-only bit to define the pte type. For valid ptes the remaining 6 bits
are unused. Pick one, e.g. 2**2 for the bit that says
"don't-refcount-this-pte".
blue skies,
Martin
* [rfc][patch] mm: use a pte bit to flag normal pages
2007-12-21 19:29 ` Martin Schwidefsky
@ 2008-01-07 4:43 ` Nick Piggin
2008-01-07 10:30 ` Russell King
2008-01-10 13:33 ` Carsten Otte
0 siblings, 2 replies; 79+ messages in thread
From: Nick Piggin @ 2008-01-07 4:43 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: carsteno, Heiko Carstens, Jared Hulbert,
Linux Memory Management List, linux-arch
On Fri, Dec 21, 2007 at 08:29:50PM +0100, Martin Schwidefsky wrote:
> Nick Piggin <npiggin@suse.de> wrote on 12/21/2007 11:47:01 AM:
> > On Fri, Dec 21, 2007 at 11:35:02AM +0100, Carsten Otte wrote:
> > > Nick Piggin wrote:
> > > >But it doesn't still retain sparsemem sections behind that? Ie. so that
> > > >pfn_valid could be used? (I admittedly don't know enough about the
> > > >memory model code).
> > > Not as far as I know. But arch/s390/mm/vmem.c has:
> > >
> > > struct memory_segment {
> > > struct list_head list;
> > > unsigned long start;
> > > unsigned long size;
> > > };
> > >
> > > static LIST_HEAD(mem_segs);
> > >
> > > This is maintained every time we map a segment/unmap a segment. And we
> > > could add a bit to struct memory_segment meaning "refcount this one".
> > > This way, we could tell core mm whether or not a pfn should be refcounted.
> >
> > Right, this should work.
> >
> > BTW. having a per-arch function sounds reasonable for a start. I'd just give
> > it a long name, so that people don't start using it for weird things ;)
> > mixedmap_refcount_pfn() or something.
>
> Hmm, I would prefer to have a pte bit, it seems much more natural to me.
> We know that this is a special pte when it gets mapped, but we "forgot"
> that fact when the pte is picked up again in vm_normal_page. To search a
> list when a simple bit in the pte gets the job done just feels wrong.
> By the way, for s390 the lower 8 bits of the pte are OS defined. The lowest
> two bits are used in addition to the hardware invalid and the hardware
> read-only bit to define the pte type. For valid ptes the remaining 6 bits
> are unused. Pick one, e.g. 2**2 for the bit that says
> "don't-refcount-this-pte".
This would be nice if we can do it, although I would prefer to make everything
work without any pte bits first, in order to make sure all architectures have a
chance at implementing it (although I guess for s390 specific memory map stuff,
it is reasonable for you to do your own thing there...).
We initially wanted to do the whole vm_normal_page thing this way, with another
pte bit, but we thought there were one or two archs with no spare bits. BTW. I
also need this bit in order to implement my lockless get_user_pages, so I do hope
to get it in. I'd like to know what architectures cannot spare a software bit in
their pte_present ptes...
---
Rather than play interesting games with vmas to work out whether the mapped page
should be refcounted or not, use a new bit in the "present" pte to distinguish
such pages.
This allows a much simpler "vm_normal_page" implementation, and more flexible
rules for COW pages in pfn mappings (eg. our proposed VM_MIXEDMAP mode would
become a no-op).
It also provides one of the required pieces for the lockless get_user_pages.
Unfortunately, maybe not all architectures can spare a bit in the pte for this.
So we probably have to end up with some ifdefs (if we even want to add this
approach at all). For this reason, I would prefer for now to avoid using a new pte
bit to implement any of this stuff, and get VM_MIXEDMAP and its callers working
nicely on all architectures first.
Thanks,
Nick
---
Index: linux-2.6/include/asm-powerpc/pgtable-ppc64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-ppc64.h
+++ linux-2.6/include/asm-powerpc/pgtable-ppc64.h
@@ -93,6 +93,7 @@
#define _PAGE_RW 0x0200 /* software: user write access allowed */
#define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */
#define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */
+#define _PAGE_SPECIAL 0x1000 /* software: pte associated with special page */
#define _PAGE_BASE (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_COHERENT)
@@ -233,12 +234,13 @@ static inline pte_t pfn_pte(unsigned lon
/*
* The following only work if pte_present() is true.
- * Undefined behaviour if not..
+ * Undefined behaviour if not.. (XXX: comment wrong eg. for pte_file())
*/
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW;}
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -257,6 +259,8 @@ static inline pte_t pte_mkyoung(pte_t pt
pte_val(pte) |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkhuge(pte_t pte) {
return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) {
+ pte_val(pte) |= _PAGE_SPECIAL; return pte; }
/* Atomic PTE updates */
static inline unsigned long pte_update(struct mm_struct *mm,
Index: linux-2.6/include/asm-um/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-um/pgtable.h
+++ linux-2.6/include/asm-um/pgtable.h
@@ -21,6 +21,7 @@
#define _PAGE_USER 0x040
#define _PAGE_ACCESSED 0x080
#define _PAGE_DIRTY 0x100
+#define _PAGE_SPECIAL 0x200
/* If _PAGE_PRESENT is clear, we use these: */
#define _PAGE_FILE 0x008 /* nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_PROTNONE 0x010 /* if the user mapped it with PROT_NONE;
@@ -220,6 +221,11 @@ static inline int pte_newprot(pte_t pte)
return(pte_present(pte) && (pte_get_bits(pte, _PAGE_NEWPROT)));
}
+static inline int pte_special(pte_t pte)
+{
+ return pte_get_bits(pte, _PAGE_SPECIAL);
+}
+
/*
* =================================
* Flags setting section.
@@ -288,6 +294,12 @@ static inline pte_t pte_mknewpage(pte_t
return(pte);
}
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ pte_set_bits(pte, _PAGE_SPECIAL);
+ return(pte);
+}
+
static inline void set_pte(pte_t *pteptr, pte_t pteval)
{
pte_copy(*pteptr, pteval);
Index: linux-2.6/include/asm-x86/pgtable_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_32.h
+++ linux-2.6/include/asm-x86/pgtable_32.h
@@ -102,6 +102,7 @@ void paging_init(void);
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
#define _PAGE_BIT_NX 63
+#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
#define _PAGE_PRESENT 0x001
#define _PAGE_RW 0x002
@@ -115,6 +116,7 @@ void paging_init(void);
#define _PAGE_UNUSED1 0x200 /* available for programmer */
#define _PAGE_UNUSED2 0x400
#define _PAGE_UNUSED3 0x800
+#define _PAGE_SPECIAL _PAGE_UNUSED1
/* If _PAGE_PRESENT is clear, we use these: */
#define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */
@@ -219,6 +221,7 @@ static inline int pte_dirty(pte_t pte)
static inline int pte_young(pte_t pte) { return (pte).pte_low & _PAGE_ACCESSED; }
static inline int pte_write(pte_t pte) { return (pte).pte_low & _PAGE_RW; }
static inline int pte_huge(pte_t pte) { return (pte).pte_low & _PAGE_PSE; }
+static inline int pte_special(pte_t pte) { return (pte).pte_low & _PAGE_SPECIAL; }
/*
* The following only works if pte_present() is not true.
@@ -232,6 +235,7 @@ static inline pte_t pte_mkdirty(pte_t pt
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { (pte).pte_low |= _PAGE_SPECIAL; return pte; }
#ifdef CONFIG_X86_PAE
# include <asm/pgtable-3level.h>
Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h
+++ linux-2.6/include/asm-x86/pgtable_64.h
@@ -151,6 +151,7 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
+#define _PAGE_BIT_SPECIAL 9
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
#define _PAGE_PRESENT 0x001
@@ -163,6 +164,7 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_PSE 0x080 /* 2MB page */
#define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry */
+#define _PAGE_SPECIAL 0x200
#define _PAGE_PROTNONE 0x080 /* If not present */
#define _PAGE_NX (_AC(1,UL)<<_PAGE_BIT_NX)
@@ -272,6 +274,7 @@ static inline int pte_young(pte_t pte)
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_PSE; }
+static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
static inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; }
@@ -282,6 +285,7 @@ static inline pte_t pte_mkyoung(pte_t pt
static inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; }
static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_SPECIAL)); return pte; }
struct vm_area_struct;
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -698,7 +698,20 @@ struct zap_details {
unsigned long truncate_count; /* Compare vm_truncate_count */
};
-struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
+/*
+ * This function gets the "struct page" associated with a pte.
+ *
+ * "Special" mappings do not wish to be associated with a "struct page" (either
+ * it doesn't exist, or it exists but they don't want to touch it). In this
+ * case, NULL is returned here.
+ */
+static inline struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
+{
+ if (likely(!pte_special(pte)))
+ return pte_page(pte);
+ return NULL;
+}
+
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
unsigned long unmap_vmas(struct mmu_gather **tlb,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -361,64 +361,10 @@ static inline int is_cow_mapping(unsigne
}
/*
- * This function gets the "struct page" associated with a pte.
- *
- * NOTE! Some mappings do not have "struct pages". A raw PFN mapping
- * will have each page table entry just pointing to a raw page frame
- * number, and as far as the VM layer is concerned, those do not have
- * pages associated with them - even if the PFN might point to memory
- * that otherwise is perfectly fine and has a "struct page".
- *
- * The way we recognize those mappings is through the rules set up
- * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
- * and the vm_pgoff will point to the first PFN mapped: thus every
- * page that is a raw mapping will always honor the rule
- *
- * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
- *
- * and if that isn't true, the page has been COW'ed (in which case it
- * _does_ have a "struct page" associated with it even if it is in a
- * VM_PFNMAP range).
- */
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
-{
- unsigned long pfn = pte_pfn(pte);
-
- if (unlikely(vma->vm_flags & VM_PFNMAP)) {
- unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
- }
-
- /*
- * Add some anal sanity checks for now. Eventually,
- * we should just do "return pfn_to_page(pfn)", but
- * in the meantime we check that we get a valid pfn,
- * and that the resulting page looks ok.
- */
- if (unlikely(!pfn_valid(pfn))) {
- print_bad_pte(vma, pte, addr);
- return NULL;
- }
-
- /*
- * NOTE! We still have PageReserved() pages in the page
- * tables.
- *
- * The PAGE_ZERO() pages and various VDSO mappings can
- * cause them to exist.
- */
- return pfn_to_page(pfn);
-}
-
-/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
* covered by this vma.
*/
-
static inline void
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
@@ -1212,7 +1158,6 @@ int vm_insert_pfn(struct vm_area_struct
spinlock_t *ptl;
BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
retval = -ENOMEM;
pte = get_locked_pte(mm, addr, &ptl);
@@ -1223,7 +1168,7 @@ int vm_insert_pfn(struct vm_area_struct
goto out_unlock;
/* Ok, finally just insert the thing.. */
- entry = pfn_pte(pfn, vma->vm_page_prot);
+ entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
set_pte_at(mm, addr, pte, entry);
update_mmu_cache(vma, addr, entry);
@@ -1254,7 +1199,7 @@ static int remap_pte_range(struct mm_str
arch_enter_lazy_mmu_mode();
do {
BUG_ON(!pte_none(*pte));
- set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
+ set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -1321,30 +1266,6 @@ int remap_pfn_range(struct vm_area_struc
struct mm_struct *mm = vma->vm_mm;
int err;
- /*
- * Physically remapped pages are special. Tell the
- * rest of the world about it:
- * VM_IO tells people not to look at these pages
- * (accesses can have side effects).
- * VM_RESERVED is specified all over the place, because
- * in 2.4 it kept swapout's vma scan off this vma; but
- * in 2.6 the LRU scan won't even find its pages, so this
- * flag means no more than count its pages in reserved_vm,
- * and omit it from core dump, even when VM_IO turned off.
- * VM_PFNMAP tells the core MM that the base pages are just
- * raw PFN mappings, and do not have a "struct page" associated
- * with them.
- *
- * There's a horrible special case to handle copy-on-write
- * behaviour that some programs depend on. We mark the "original"
- * un-COW'ed pages by matching them up with "vma->vm_pgoff".
- */
- if (is_cow_mapping(vma->vm_flags)) {
- if (addr != vma->vm_start || end != vma->vm_end)
- return -EINVAL;
- vma->vm_pgoff = pfn;
- }
-
vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
BUG_ON(addr >= end);
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 4:43 ` [rfc][patch] mm: use a pte bit to flag normal pages Nick Piggin
@ 2008-01-07 10:30 ` Russell King
2008-01-07 11:14 ` Nick Piggin
2008-01-07 18:49 ` Jared Hulbert
2008-01-10 13:33 ` Carsten Otte
1 sibling, 2 replies; 79+ messages in thread
From: Russell King @ 2008-01-07 10:30 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin Schwidefsky, carsteno, Heiko Carstens, Jared Hulbert,
Linux Memory Management List, linux-arch
On Mon, Jan 07, 2008 at 05:43:55AM +0100, Nick Piggin wrote:
> We initially wanted to do the whole vm_normal_page thing this way, with
> another pte bit, but we thought there were one or two archs with no spare
> bits. BTW. I also need this bit in order to implement my lockless
> get_user_pages, so I do hope to get it in. I'd like to know what
> architectures cannot spare a software bit in their pte_present ptes...
ARM is going to have to use the three remaining bits we have in the PTE
to store the memory type to resolve bugs on later platforms. Once they're
used, ARM will no longer have any room for any further PTE expansion.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 10:30 ` Russell King
@ 2008-01-07 11:14 ` Nick Piggin
2008-01-07 18:49 ` Jared Hulbert
1 sibling, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2008-01-07 11:14 UTC (permalink / raw)
To: Martin Schwidefsky, carsteno, Heiko Carstens, Jared Hulbert,
Linux Memory Management List, linux-arch
On Mon, Jan 07, 2008 at 10:30:29AM +0000, Russell King wrote:
> On Mon, Jan 07, 2008 at 05:43:55AM +0100, Nick Piggin wrote:
> > We initially wanted to do the whole vm_normal_page thing this way, with
> > another pte bit, but we thought there were one or two archs with no spare
> > bits. BTW. I also need this bit in order to implement my lockless
> > get_user_pages, so I do hope to get it in. I'd like to know what
> > architectures cannot spare a software bit in their pte_present ptes...
>
> ARM is going to have to use the three remaining bits we have in the PTE
> to store the memory type to resolve bugs on later platforms. Once they're
> used, ARM will no longer have any room for any further PTE expansion.
OK, it is good to have a negative confirmed. So I think we should definitely
get the non-pte-bit based mapping schemes working and tested on all platforms
before using a pte bit mapping...
FWIW, it might be possible for platforms to implement lockless get_user_pages
in other ways too. But that's getting ahead of myself.
Thanks.
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 10:30 ` Russell King
2008-01-07 11:14 ` Nick Piggin
@ 2008-01-07 18:49 ` Jared Hulbert
2008-01-07 19:45 ` Russell King
1 sibling, 1 reply; 79+ messages in thread
From: Jared Hulbert @ 2008-01-07 18:49 UTC (permalink / raw)
To: Nick Piggin, Martin Schwidefsky, carsteno, Heiko Carstens,
Jared Hulbert, Linux Memory Management List, linux-arch
> ARM is going to have to use the three remaining bits we have in the PTE
> to store the memory type to resolve bugs on later platforms. Once they're
> used, ARM will no longer have any room for any further PTE expansion.
Russell,
Can you explain this a little more.
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 18:49 ` Jared Hulbert
@ 2008-01-07 19:45 ` Russell King
2008-01-07 22:52 ` Jared Hulbert
` (2 more replies)
0 siblings, 3 replies; 79+ messages in thread
From: Russell King @ 2008-01-07 19:45 UTC (permalink / raw)
To: Jared Hulbert
Cc: Nick Piggin, Martin Schwidefsky, carsteno, Heiko Carstens,
Linux Memory Management List, linux-arch
On Mon, Jan 07, 2008 at 10:49:57AM -0800, Jared Hulbert wrote:
> > ARM is going to have to use the three remaining bits we have in the PTE
> > to store the memory type to resolve bugs on later platforms. Once they're
> > used, ARM will no longer have any room for any further PTE expansion.
>
> Russell,
>
> Can you explain this a little more.
In old ARM CPUs, there were two bits that defined the characteristics of
the mapping - the C and B bits (C = cacheable, B = bufferable).
Some ARMv5 (particularly Xscale-based) and all ARMv6 CPUs extend this to
five bits and introduce "memory types" - 3 bits of TEX, and C and B.
Between these bits, it defines:
- strongly ordered
- bufferable only *
- device, sharable *
- device, unsharable
- memory, bufferable and cacheable, write through, no write allocate
- memory, bufferable and cacheable, write back, no write allocate
- memory, bufferable and cacheable, write back, write allocate
- implementation defined combinations (eg, selecting "minicache")
- and a set of 16 states to allow the policy of inner and outer levels
of cache to be defined (two bits per level).
Of course, not all CPUs support all the above - for example, if write
back caches aren't supported then the result is a write through cache.
The write allocation setting is a "hint" - if the hardware doesn't
support write allocate, it'll just be read allocate.
There are now CPUs out there where the old combinations (TEX=0) are
broken - and causes nasty effects like writes to bypass the write
protection under certain circumstances, or the data cache to hang if
you're using a strongly ordered mapping.
The "workaround" for these is to avoid the problematical mapping mode -
which is CPU specific, and depends on knowledge of what's being mapped.
For instance, you might use a sharable device mapping instead of
strongly ordered for devices. However, you might want to use an
outer cacheable but inner uncacheable mapping instead of strongly
ordered for memory.
Now, couple this with the fix for shared mmaps - where we normally turn
a cacheable mapping into a bufferable mapping, or if the write buffer has
visible side effects, a strongly ordered mapping, or if strongly ordered
mappings are buggy... etc.
Also note that there are devices (typically "unshared" devices) on some
ARM CPUs that you can only access if you set the TEX bits correctly.
Currently, Linux is able to setup mappings in kernel space to cover
any combination of settings. However, userspace is much more limited
because we don't carry the additional bits around in the Linux version
of the PTE - and as such shared mmaps on some systems can end up locking
the CPU.
A few attempts have been made at solving these without using the
additional PTE bits, but they've been less than robust.
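For reference, the ARMv6 TEX/C/B encodings behind the list above look roughly
as follows, per the architecture manual (hypothetical macro names, not the
kernel's; TEX is bits [8:6] of the hardware pte, C is bit 3, B is bit 2):

	/*			value	   TEX C B			*/
	#define MT_STRONGLY_ORDERED	0x00	/* 000 0 0 */
	#define MT_SHARED_DEVICE	0x04	/* 000 0 1 */
	#define MT_WRITETHROUGH		0x08	/* 000 1 0, no write allocate */
	#define MT_WRITEBACK		0x0c	/* 000 1 1, no write allocate */
	#define MT_UNCACHED_NORMAL	0x40	/* 001 0 0 */
	#define MT_NONSHARED_DEVICE	0x80	/* 010 0 0 */
	/* TEX=1OO with C,B=II: cached normal memory, where OO selects the
	 * outer cache policy and II the inner policy (two bits per level) */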
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 19:45 ` Russell King
@ 2008-01-07 22:52 ` Jared Hulbert
2008-01-08 2:37 ` Andi Kleen
2008-01-08 10:11 ` Catalin Marinas
2 siblings, 0 replies; 79+ messages in thread
From: Jared Hulbert @ 2008-01-07 22:52 UTC (permalink / raw)
To: Jared Hulbert, Nick Piggin, Martin Schwidefsky, carsteno,
Heiko Carstens, Linux Memory Management List, linux-arch
> Currently, Linux is able to setup mappings in kernel space to cover
> any combination of settings. However, userspace is much more limited
> because we don't carry the additional bits around in the Linux version
> of the PTE - and as such shared mmaps on some systems can end up locking
> the CPU.
>
> A few attempts have been made at solving these without using the
> additional PTE bits, but they've been less that robust.
Do these new ARM implementations use more bits than most archs?
Most ARM implementations can spare a PTE bit for this, right? Is the
use of these 3 extra bits to cover a few buggy processors or is this
caused by consolidating the needs of widely differing architectures?
I just can't get over the idea that you _have_ to use up all available
bits. Oh well.
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 19:45 ` Russell King
2008-01-07 22:52 ` Jared Hulbert
@ 2008-01-08 2:37 ` Andi Kleen
2008-01-08 2:49 ` Nick Piggin
2008-01-08 10:11 ` Catalin Marinas
2 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2008-01-08 2:37 UTC (permalink / raw)
To: Jared Hulbert, Nick Piggin, Martin Schwidefsky, carsteno,
Heiko Carstens, Linux Memory Management List, linux-arch
> - strongly ordered
> - bufferable only *
> - device, sharable *
> - device, unsharable
> - memory, bufferable and cacheable, write through, no write allocate
> - memory, bufferable and cacheable, write back, no write allocate
> - memory, bufferable and cacheable, write back, write allocate
> - implementation defined combinations (eg, selecting "minicache")
> - and a set of 16 states to allow the policy of inner and outer levels
> of cache to be defined (two bits per level).
Do you need all of those in user space? Perhaps you could give
the bits different meanings depending on user or kernel space.
I think Nick et.al. just need the bits for user space; they won't
care about kernel mappings.
-Andi
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-08 2:37 ` Andi Kleen
@ 2008-01-08 2:49 ` Nick Piggin
2008-01-08 3:31 ` Andi Kleen
0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2008-01-08 2:49 UTC (permalink / raw)
To: Andi Kleen
Cc: Jared Hulbert, Martin Schwidefsky, carsteno, Heiko Carstens,
Linux Memory Management List, linux-arch
On Tue, Jan 08, 2008 at 03:37:46AM +0100, Andi Kleen wrote:
> > - strongly ordered
> > - bufferable only *
> > - device, sharable *
> > - device, unsharable
> > - memory, bufferable and cacheable, write through, no write allocate
> > - memory, bufferable and cacheable, write back, no write allocate
> > - memory, bufferable and cacheable, write back, write allocate
> > - implementation defined combinations (eg, selecting "minicache")
> > - and a set of 16 states to allow the policy of inner and outer levels
> > of cache to be defined (two bits per level).
>
> Do you need all of those in user space? Perhaps you could give
> the bits different meanings depending on user or kernel space.
> I think Nick et.al. just need the bits for user space; they won't
> care about kernel mappings.
Yes correct -- they are only for userspace mappings. Though that includes mmaps
of /dev/mem and device drivers etc.
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-08 2:49 ` Nick Piggin
@ 2008-01-08 3:31 ` Andi Kleen
2008-01-08 3:52 ` Nick Piggin
0 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2008-01-08 3:31 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Jared Hulbert, Martin Schwidefsky, carsteno,
Heiko Carstens, Linux Memory Management List, linux-arch
On Tue, Jan 08, 2008 at 03:49:07AM +0100, Nick Piggin wrote:
> On Tue, Jan 08, 2008 at 03:37:46AM +0100, Andi Kleen wrote:
> > > - strongly ordered
> > > - bufferable only *
> > > - device, sharable *
> > > - device, unsharable
> > > - memory, bufferable and cacheable, write through, no write allocate
> > > - memory, bufferable and cacheable, write back, no write allocate
> > > - memory, bufferable and cacheable, write back, write allocate
> > > - implementation defined combinations (eg, selecting "minicache")
> > > - and a set of 16 states to allow the policy of inner and outer levels
> > > of cache to be defined (two bits per level).
> >
> > Do you need all of those in user space? Perhaps you could give
> > the bits different meanings depending on user or kernel space.
> > I think Nick et.al. just need the bits for user space; they won't
> > care about kernel mappings.
>
> Yes correct -- they are only for userspace mappings. Though that includes mmaps
> of /dev/mem and device drivers etc.
/dev/mem can be always special cased by checking the VMA flags, can't it?
-Andi
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-08 3:31 ` Andi Kleen
@ 2008-01-08 3:52 ` Nick Piggin
0 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2008-01-08 3:52 UTC (permalink / raw)
To: Andi Kleen
Cc: Jared Hulbert, Martin Schwidefsky, carsteno, Heiko Carstens,
Linux Memory Management List, linux-arch
On Tue, Jan 08, 2008 at 04:31:03AM +0100, Andi Kleen wrote:
> On Tue, Jan 08, 2008 at 03:49:07AM +0100, Nick Piggin wrote:
> > On Tue, Jan 08, 2008 at 03:37:46AM +0100, Andi Kleen wrote:
> > > > - strongly ordered
> > > > - bufferable only *
> > > > - device, sharable *
> > > > - device, unsharable
> > > > - memory, bufferable and cacheable, write through, no write allocate
> > > > - memory, bufferable and cacheable, write back, no write allocate
> > > > - memory, bufferable and cacheable, write back, write allocate
> > > > - implementation defined combinations (eg, selecting "minicache")
> > > > - and a set of 16 states to allow the policy of inner and outer levels
> > > > of cache to be defined (two bits per level).
> > >
> > > Do you need all of those in user space? Perhaps you could give
> > > the bits different meanings depending on user or kernel space.
> > > I think Nick et.al. just need the bits for user space; they won't
> > > care about kernel mappings.
> >
> > Yes correct -- they are only for userspace mappings. Though that includes mmaps
> > of /dev/mem and device drivers etc.
>
> /dev/mem can be always special cased by checking the VMA flags, can't it?
That's basically what we do today with COW support for VM_PFNMAP. Once you have
that, I don't think there is a huge reason to _also_ use the pte bit for other
mappings (because you need to have the VM_PFNMAP support there anyway).
For lockless get_user_pages, I don't take mmap_sem, look up any vmas, or even
take any page table locks, so it doesn't help there either. (though in the case
of lockless gup, architectures that cannot support it can simply revert to the
regular gup).
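A sketch of the fallback shape that implies, with hypothetical names
(__HAVE_ARCH_FAST_GUP and fast_gup are illustrative; the slow path is just
the existing get_user_pages under mmap_sem):

	#ifndef __HAVE_ARCH_FAST_GUP
	static inline int fast_gup(unsigned long start, int nr_pages,
				   int write, struct page **pages)
	{
		struct mm_struct *mm = current->mm;
		int ret;

		/* no arch support: fall back to the regular, locked gup */
		down_read(&mm->mmap_sem);
		ret = get_user_pages(current, mm, start, nr_pages,
				     write, 0, pages, NULL);
		up_read(&mm->mmap_sem);
		return ret;
	}
	#endif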
* [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2007-12-21 10:47 ` Nick Piggin
2007-12-21 19:29 ` Martin Schwidefsky
@ 2008-01-08 9:35 ` Carsten Otte
2008-01-08 10:08 ` Nick Piggin
` (2 more replies)
[not found] ` <1199784196.25114.11.camel@cotte.boeblingen.de.ibm.com>
2 siblings, 3 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-08 9:35 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
On Friday, 21.12.2007, at 11:47 +0100, Nick Piggin wrote:
> BTW. having a per-arch function sounds reasonable for a start. I'd just give
> it a long name, so that people don't start using it for weird things ;)
> mixedmap_refcount_pfn() or something.
Based on our previous discussion, and based on previous patches by Jared
and Nick, this patch series makes XIP without struct page backing usable
on the s390 architecture.
This patch set includes:
1/4: mm: introduce VM_MIXEDMAP mappings from Jared Hulbert, modified to
use an arch-callback to identify whether or not a pfn needs refcounting
2/4: xip: support non-struct page memory from Nick Piggin, modified to
use an arch-callback to identify whether or not a pfn needs refcounting
3/4: s390: remove struct page entries for z/VM DCSS memory segments
4/4: s390: proof of concept implementation of mixedmap_refcount_pfn()
for s390 using list-walk
The above stack seems to work well; I did some sniff-testing with it applied
on top of Linus' current git tree. We do want to spend a precious pte bit to
speed up this callback, therefore patch 4/4 will get replaced.
* [rfc][patch 1/4] mm: introduce VM_MIXEDMAP
[not found] ` <1199784196.25114.11.camel@cotte.boeblingen.de.ibm.com>
@ 2008-01-08 9:35 ` Jared Hulbert, Carsten Otte
2008-01-08 9:35 ` [rfc][patch 2/4] xip: support non-struct page memory Nick Piggin, Carsten Otte
` (2 subsequent siblings)
3 siblings, 0 replies; 79+ messages in thread
From: Jared Hulbert, Carsten Otte @ 2008-01-08 9:35 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
mm: introduce VM_MIXEDMAP
Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in
that it can support COW mappings of arbitrary ranges including ranges without
struct page (PFNMAP can only support COW in those cases where the un-COW-ed
translations are mapped linearly in the virtual address).
VM_MIXEDMAP achieves this by refcounting pages with mixedmap_refcount_pfn(pfn)
being non-zero, and not refcounting !mixedmap_refcount_pfn(pfn) pages
(which is not an option for VM_PFNMAP, because it needs to avoid refcounting
pfn_valid pages eg. for /dev/mem mappings).
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -106,6 +106,7 @@ extern unsigned int kobjsize(const void
#define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */
#define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */
+#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -361,35 +361,66 @@ static inline int is_cow_mapping(unsigne
}
/*
- * This function gets the "struct page" associated with a pte.
+ * This function gets the "struct page" associated with a pte or returns
+ * NULL if no "struct page" is associated with the pte.
*
- * NOTE! Some mappings do not have "struct pages". A raw PFN mapping
- * will have each page table entry just pointing to a raw page frame
- * number, and as far as the VM layer is concerned, those do not have
- * pages associated with them - even if the PFN might point to memory
+ * A raw VM_PFNMAP mapping (ie. one that is not COWed) may not have any "struct
+ * page" backing, and even if they do, they are not refcounted. COWed pages of
+ * a VM_PFNMAP do always have a struct page, and they are normally refcounted
+ * (they are _normal_ pages).
+ *
+ * So a raw PFNMAP mapping will have each page table entry just pointing
+ * to a page frame number, and as far as the VM layer is concerned, those do
+ * not have pages associated with them - even if the PFN might point to memory
* that otherwise is perfectly fine and has a "struct page".
*
- * The way we recognize those mappings is through the rules set up
- * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
- * and the vm_pgoff will point to the first PFN mapped: thus every
+ * The way we recognize COWed pages within VM_PFNMAP mappings is through the
+ * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
+ * set, and the vm_pgoff will point to the first PFN mapped: thus every
* page that is a raw mapping will always honor the rule
*
* pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
*
- * and if that isn't true, the page has been COW'ed (in which case it
- * _does_ have a "struct page" associated with it even if it is in a
- * VM_PFNMAP range).
+ * A call to vm_normal_page() will return NULL for such a page.
+ *
+ * If the page doesn't follow the "remap_pfn_range()" rule in a VM_PFNMAP
+ * then the page has been COW'ed. A COW'ed page _does_ have a "struct page"
+ * associated with it even if it is in a VM_PFNMAP range. Calling
+ * vm_normal_page() on such a page will therefore return the "struct page".
+ *
+ *
+ * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
+ * page" backing, however the difference is that _all_ pages with a struct
+ * page (that is, those where mixedmap_refcount_pfn is true) are refcounted
+ * and considered normal pages by the VM. The disadvantage is that pages are
+ * refcounted (which can be slower and simply not an option for some PFNMAP
+ * users). The advantage is that we don't have to follow the strict linearity
+ * rule of PFNMAP mappings in order to support COWable mappings.
+ *
+ * A call to vm_normal_page() with a VM_MIXEDMAP mapping will return the
+ * associated "struct page" or NULL for memory not backed by a "struct page".
+ *
+ *
+ * All other mappings should have a valid struct page, which will be
+ * returned by a call to vm_normal_page().
*/
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
unsigned long pfn = pte_pfn(pte);
- if (unlikely(vma->vm_flags & VM_PFNMAP)) {
- unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
+ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+ if (vma->vm_flags & VM_MIXEDMAP) {
+ if (!mixedmap_refcount_pfn(pfn))
+ return NULL;
+ goto out;
+ } else {
+ unsigned long off = (addr-vma->vm_start) >> PAGE_SHIFT;
+ if (pfn == vma->vm_pgoff + off)
+ return NULL;
+ if (!is_cow_mapping(vma->vm_flags))
+ return NULL;
+ }
}
/*
@@ -410,6 +441,7 @@ struct page *vm_normal_page(struct vm_ar
* The PAGE_ZERO() pages and various VDSO mappings can
* cause them to exist.
*/
+out:
return pfn_to_page(pfn);
}
@@ -1211,8 +1243,11 @@ int vm_insert_pfn(struct vm_area_struct
pte_t *pte, entry;
spinlock_t *ptl;
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
+ (VM_PFNMAP|VM_MIXEDMAP));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+ BUG_ON((vma->vm_flags & VM_MIXEDMAP) && mixedmap_refcount_pfn(pfn));
retval = -ENOMEM;
pte = get_locked_pte(mm, addr, &ptl);
@@ -2386,10 +2421,13 @@ static noinline int do_no_pfn(struct mm_
unsigned long pfn;
pte_unmap(page_table);
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+
+ BUG_ON((vma->vm_flags & VM_MIXEDMAP) && mixedmap_refcount_pfn(pfn));
+
if (unlikely(pfn == NOPFN_OOM))
return VM_FAULT_OOM;
else if (unlikely(pfn == NOPFN_SIGBUS))
* [rfc][patch 2/4] xip: support non-struct page memory
[not found] ` <1199784196.25114.11.camel@cotte.boeblingen.de.ibm.com>
2008-01-08 9:35 ` [rfc][patch 1/4] mm: introduce VM_MIXEDMAP Jared Hulbert, Carsten Otte
@ 2008-01-08 9:35 ` Nick Piggin, Carsten Otte
2008-01-08 9:36 ` [rfc][patch 3/4] s390: remove struct page entries for z/VM DCSS memory segments Carsten Otte
2008-01-08 9:36 ` [rfc][patch 4/4] s390: mixedmap_refcount_pfn implementation using list walk Carsten Otte
3 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin, Carsten Otte @ 2008-01-08 9:35 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP
for the user mappings.
This requires the get_xip_page API to be changed to an address based one.
(The kaddr->pfn conversion may not be quite right for all architectures or XIP
memory mappings, and the cacheflushing may need to be updated for some archs).
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -800,7 +800,7 @@ const struct address_space_operations ex
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_page = ext2_get_xip_page,
+ .get_xip_address = ext2_get_xip_address,
};
const struct address_space_operations ext2_nobh_aops = {
Index: linux-2.6/fs/ext2/xip.c
===================================================================
--- linux-2.6.orig/fs/ext2/xip.c
+++ linux-2.6/fs/ext2/xip.c
@@ -15,24 +15,25 @@
#include "xip.h"
static inline int
-__inode_direct_access(struct inode *inode, sector_t sector,
- unsigned long *data)
+__inode_direct_access(struct inode *inode, sector_t block, unsigned long *data)
{
+ sector_t sector;
BUG_ON(!inode->i_sb->s_bdev->bd_disk->fops->direct_access);
+
+ sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
return inode->i_sb->s_bdev->bd_disk->fops
- ->direct_access(inode->i_sb->s_bdev,sector,data);
+ ->direct_access(inode->i_sb->s_bdev, sector, data);
}
static inline int
-__ext2_get_sector(struct inode *inode, sector_t offset, int create,
+__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
sector_t *result)
{
struct buffer_head tmp;
int rc;
memset(&tmp, 0, sizeof(struct buffer_head));
- rc = ext2_get_block(inode, offset/ (PAGE_SIZE/512), &tmp,
- create);
+ rc = ext2_get_block(inode, pgoff, &tmp, create);
*result = tmp.b_blocknr;
/* did we get a sparse block (hole in the file)? */
@@ -45,13 +46,12 @@ __ext2_get_sector(struct inode *inode, s
}
int
-ext2_clear_xip_target(struct inode *inode, int block)
+ext2_clear_xip_target(struct inode *inode, sector_t block)
{
- sector_t sector = block * (PAGE_SIZE/512);
unsigned long data;
int rc;
- rc = __inode_direct_access(inode, sector, &data);
+ rc = __inode_direct_access(inode, block, &data);
if (!rc)
clear_page((void*)data);
return rc;
@@ -69,24 +69,24 @@ void ext2_xip_verify_sb(struct super_blo
}
}
-struct page *
-ext2_get_xip_page(struct address_space *mapping, sector_t offset,
- int create)
+void *
+ext2_get_xip_address(struct address_space *mapping, pgoff_t pgoff, int create)
{
int rc;
unsigned long data;
- sector_t sector;
+ sector_t block;
/* first, retrieve the sector number */
- rc = __ext2_get_sector(mapping->host, offset, create, &sector);
+ rc = __ext2_get_block(mapping->host, pgoff, create, &block);
if (rc)
goto error;
/* retrieve address of the target data */
- rc = __inode_direct_access
- (mapping->host, sector * (PAGE_SIZE/512), &data);
- if (!rc)
- return virt_to_page(data);
+ rc = __inode_direct_access(mapping->host, block, &data);
+ if (rc)
+ goto error;
+
+ return (void *)data;
error:
return ERR_PTR(rc);
Index: linux-2.6/fs/ext2/xip.h
===================================================================
--- linux-2.6.orig/fs/ext2/xip.h
+++ linux-2.6/fs/ext2/xip.h
@@ -7,15 +7,15 @@
#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, int);
+extern int ext2_clear_xip_target (struct inode *, sector_t);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-struct page* ext2_get_xip_page (struct address_space *, sector_t, int);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_page)
+void *ext2_get_xip_address(struct address_space *, sector_t, int);
+#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_address)
#else
#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -778,7 +778,7 @@ static struct file *__dentry_open(struct
if (f->f_flags & O_DIRECT) {
if (!f->f_mapping->a_ops ||
((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_page))) {
+ (!f->f_mapping->a_ops->get_xip_address))) {
fput(f);
f = ERR_PTR(-EINVAL);
}
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -473,8 +473,7 @@ struct address_space_operations {
int (*releasepage) (struct page *, gfp_t);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- struct page* (*get_xip_page)(struct address_space *, sector_t,
- int);
+ void * (*get_xip_address)(struct address_space *, pgoff_t, int);
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct address_space *,
struct page *, struct page *);
Index: linux-2.6/mm/fadvise.c
===================================================================
--- linux-2.6.orig/mm/fadvise.c
+++ linux-2.6/mm/fadvise.c
@@ -49,7 +49,7 @@ asmlinkage long sys_fadvise64_64(int fd,
goto out;
}
- if (mapping->a_ops->get_xip_page)
+ if (mapping->a_ops->get_xip_address)
/* no bad return value, but ignore advice */
goto out;
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/sched.h>
#include <asm/tlbflush.h>
+#include <asm/io.h>
/*
* We do use our own empty page to avoid interference with other users
@@ -41,36 +42,39 @@ static struct page *xip_sparse_page(void
/*
* This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_page() function for the actual low-level
+ * the mapping->a_ops->get_xip_address() function for the actual low-level
* stuff.
*
* Note the struct file* is not used at all. It may be NULL.
*/
-static void
+static ssize_t
do_xip_mapping_read(struct address_space *mapping,
struct file_ra_state *_ra,
struct file *filp,
- loff_t *ppos,
- read_descriptor_t *desc,
- read_actor_t actor)
+ char __user *buf,
+ size_t len,
+ loff_t *ppos)
{
struct inode *inode = mapping->host;
unsigned long index, end_index, offset;
- loff_t isize;
+ loff_t isize, pos;
+ size_t copied = 0, error = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
- index = *ppos >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;
+ pos = *ppos;
+ index = pos >> PAGE_CACHE_SHIFT;
+ offset = pos & ~PAGE_CACHE_MASK;
isize = i_size_read(inode);
if (!isize)
goto out;
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- for (;;) {
- struct page *page;
- unsigned long nr, ret;
+ do {
+ unsigned long nr, left;
+ void *xip_mem;
+ int zero = 0;
/* nr is the maximum number of bytes to copy from this page */
nr = PAGE_CACHE_SIZE;
@@ -83,17 +87,20 @@ do_xip_mapping_read(struct address_space
}
}
nr = nr - offset;
+ if (nr > len)
+ nr = len;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
- goto no_xip_page;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA) {
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem) {
+ error = -EIO;
+ goto out;
+ }
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA) {
/* sparse */
- page = ZERO_PAGE(0);
+ zero = 1;
} else {
- desc->error = PTR_ERR(page);
+ error = PTR_ERR(xip_mem);
goto out;
}
}
@@ -103,10 +110,10 @@ do_xip_mapping_read(struct address_space
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ /* address based flush */ ;
/*
- * Ok, we have the page, so now we can copy it to user space...
+ * Ok, we have the mem, so now we can copy it to user space...
*
* The actor routine returns how many bytes were actually used..
* NOTE! This may not be the same as how much of a user buffer
@@ -114,47 +121,38 @@ do_xip_mapping_read(struct address_space
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
- offset += ret;
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
+ if (!zero)
+ left = __copy_to_user(buf+copied, xip_mem+offset, nr);
+ else
+ left = __clear_user(buf + copied, nr);
- if (ret == nr && desc->count)
- continue;
- goto out;
+ if (left) {
+ error = -EFAULT;
+ goto out;
+ }
-no_xip_page:
- /* Did not get the page. Report it */
- desc->error = -EIO;
- goto out;
- }
+ copied += (nr - left);
+ offset += (nr - left);
+ index += offset >> PAGE_CACHE_SHIFT;
+ offset &= ~PAGE_CACHE_MASK;
+ } while (copied < len);
out:
- *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+ *ppos = pos + copied;
if (filp)
file_accessed(filp);
+
+ return (copied ? copied : error);
}
ssize_t
xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
- read_descriptor_t desc;
-
if (!access_ok(VERIFY_WRITE, buf, len))
return -EFAULT;
- desc.written = 0;
- desc.arg.buf = buf;
- desc.count = len;
- desc.error = 0;
-
- do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- ppos, &desc, file_read_actor);
-
- if (desc.written)
- return desc.written;
- else
- return desc.error;
+ return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
+ buf, len, ppos);
}
EXPORT_SYMBOL_GPL(xip_file_read);
@@ -209,13 +207,14 @@ __xip_unmap (struct address_space * mapp
*
* This function is derived from filemap_fault, but used for execute in place
*/
-static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf)
+static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- struct file *file = area->vm_file;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
- struct page *page;
pgoff_t size;
+ void *xip_mem;
+ struct page *page;
/* XXX: are VM_FAULT_ codes OK? */
@@ -223,24 +222,32 @@ static int xip_file_fault(struct vm_area
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 0);
- if (!IS_ERR(page))
- goto out;
- if (PTR_ERR(page) != -ENODATA)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, vmf->pgoff, 0);
+ if (!IS_ERR(xip_mem))
+ goto found;
+ if (PTR_ERR(xip_mem) != -ENODATA)
return VM_FAULT_OOM;
/* sparse block */
- if ((area->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (area->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
+ if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
+ (vma->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
(!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
+ unsigned long pfn;
+
/* maybe shared writable, allocate new block */
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 1);
- if (IS_ERR(page))
+ xip_mem = mapping->a_ops->get_xip_address(mapping, vmf->pgoff, 1);
+ if (IS_ERR(xip_mem))
return VM_FAULT_SIGBUS;
- /* unmap page at pgoff from all other vmas */
+ /* unmap sparse mappings at pgoff from all other vmas */
__xip_unmap(mapping, vmf->pgoff);
+
+found:
+ pfn = virt_to_phys(xip_mem) >> PAGE_SHIFT;
+ if (!mixedmap_refcount_pfn(pfn)) {
+ vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn);
+ return VM_FAULT_NOPAGE;
+ }
+ page = pfn_to_page(pfn);
} else {
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
@@ -248,7 +255,6 @@ static int xip_file_fault(struct vm_area
return VM_FAULT_OOM;
}
-out:
page_cache_get(page);
vmf->page = page;
return 0;
@@ -260,11 +266,11 @@ static struct vm_operations_struct xip_f
int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
{
- BUG_ON(!file->f_mapping->a_ops->get_xip_page);
+ BUG_ON(!file->f_mapping->a_ops->get_xip_address);
file_accessed(file);
vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_CAN_NONLINEAR;
+ vma->vm_flags |= VM_CAN_NONLINEAR | VM_MIXEDMAP;
return 0;
}
EXPORT_SYMBOL_GPL(xip_file_mmap);
@@ -277,17 +283,16 @@ __xip_file_write(struct file *filp, cons
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- struct page *page;
size_t bytes;
ssize_t written = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
do {
unsigned long index;
unsigned long offset;
size_t copied;
- char *kaddr;
+ void *xip_mem;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
@@ -295,28 +300,22 @@ __xip_file_write(struct file *filp, cons
if (bytes > count)
bytes = count;
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (IS_ERR(page) && (PTR_ERR(page) == -ENODATA)) {
+ xip_mem = a_ops->get_xip_address(mapping, index, 0);
+ if (IS_ERR(xip_mem) && (PTR_ERR(xip_mem) == -ENODATA)) {
/* we allocate a new page and unmap it */
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 1);
- if (!IS_ERR(page))
+ xip_mem = a_ops->get_xip_address(mapping, index, 1);
+ if (!IS_ERR(xip_mem))
/* unmap page at pgoff from all other vmas */
__xip_unmap(mapping, index);
}
- if (IS_ERR(page)) {
- status = PTR_ERR(page);
+ if (IS_ERR(xip_mem)) {
+ status = PTR_ERR(xip_mem);
break;
}
- fault_in_pages_readable(buf, bytes);
- kaddr = kmap_atomic(page, KM_USER0);
copied = bytes -
- __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- flush_dcache_page(page);
+ __copy_from_user_nocache(xip_mem + offset, buf, bytes);
if (likely(copied > 0)) {
status = copied;
@@ -396,7 +395,7 @@ EXPORT_SYMBOL_GPL(xip_file_write);
/*
* truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_page
+ * functionality is analogous to block_truncate_page but uses get_xip_address
* to get the page instead of page cache
*/
int
@@ -406,9 +405,9 @@ xip_truncate_page(struct address_space *
unsigned offset = from & (PAGE_CACHE_SIZE-1);
unsigned blocksize;
unsigned length;
- struct page *page;
+ void *xip_mem;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
blocksize = 1 << mapping->host->i_blkbits;
length = offset & (blocksize - 1);
@@ -419,18 +418,17 @@ xip_truncate_page(struct address_space *
length = blocksize - length;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem)
return -ENOMEM;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA)
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA)
/* Hole? No need to truncate */
return 0;
else
- return PTR_ERR(page);
+ return PTR_ERR(xip_mem);
}
- zero_user_page(page, offset, length, KM_USER0);
+ memset(xip_mem + offset, 0, length);
return 0;
}
EXPORT_SYMBOL_GPL(xip_truncate_page);
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -112,7 +112,7 @@ static long madvise_willneed(struct vm_a
if (!file)
return -EBADF;
- if (file->f_mapping->a_ops->get_xip_page) {
+ if (file->f_mapping->a_ops->get_xip_address) {
/* no bad return value, but ignore advice */
return 0;
}
* [rfc][patch 3/4] s390: remove struct page entries for z/VM DCSS memory segments
[not found] ` <1199784196.25114.11.camel@cotte.boeblingen.de.ibm.com>
2008-01-08 9:35 ` [rfc][patch 1/4] mm: introduce VM_MIXEDMAP Carsten Otte, Jared Hulbert
2008-01-08 9:35 ` [rfc][patch 2/4] xip: support non-struct page memory Carsten Otte, Nick Piggin
@ 2008-01-08 9:36 ` Carsten Otte
2008-01-08 9:36 ` [rfc][patch 4/4] s390: mixedmap_refcount_pfn implementation using list walk Carsten Otte
3 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-08 9:36 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
This patch removes the creation of struct page entries for z/VM DCSS memory segments
that are being loaded.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/arch/s390/mm/vmem.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/vmem.c
+++ linux-2.6/arch/s390/mm/vmem.c
@@ -310,8 +310,6 @@ out:
int add_shared_memory(unsigned long start, unsigned long size)
{
struct memory_segment *seg;
- struct page *page;
- unsigned long pfn, num_pfn, end_pfn;
int ret;
mutex_lock(&vmem_mutex);
@@ -326,24 +324,10 @@ int add_shared_memory(unsigned long star
if (ret)
goto out_free;
- ret = vmem_add_mem(start, size);
+ ret = vmem_add_range(start, size);
if (ret)
goto out_remove;
- pfn = PFN_DOWN(start);
- num_pfn = PFN_DOWN(size);
- end_pfn = pfn + num_pfn;
-
- page = pfn_to_page(pfn);
- memset(page, 0, num_pfn * sizeof(struct page));
-
- for (; pfn < end_pfn; pfn++) {
- page = pfn_to_page(pfn);
- init_page_count(page);
- reset_page_mapcount(page);
- SetPageReserved(page);
- INIT_LIST_HEAD(&page->lru);
- }
goto out;
out_remove:
* [rfc][patch 4/4] s390: mixedmap_refcount_pfn implementation using list walk
[not found] ` <1199784196.25114.11.camel@cotte.boeblingen.de.ibm.com>
` (2 preceding siblings ...)
2008-01-08 9:36 ` [rfc][patch 3/4] s390: remove struct page entries for z/VM DCSS memory segments Carsten Otte
@ 2008-01-08 9:36 ` Carsten Otte
3 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-08 9:36 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
This patch implements mixedmap_refcount_pfn() for the s390 architecture
using a list walk. This is merely meant to be a proof of concept, because
we would prefer to spend one valuable pte bit to speed this up.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/arch/s390/mm/vmem.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/vmem.c
+++ linux-2.6/arch/s390/mm/vmem.c
@@ -339,6 +339,26 @@ out:
return ret;
}
+int mixedmap_refcount_pfn(unsigned long pfn)
+{
+ int rc;
+ struct memory_segment *tmp;
+
+ mutex_lock(&vmem_mutex);
+
+ list_for_each_entry(tmp, &mem_segs, list) {
+ if ((pfn << PAGE_SHIFT) >= tmp->start &&
+ (pfn << PAGE_SHIFT) <= tmp->start + tmp->size - 1) {
+ rc = 0;
+ goto out;
+ }
+ }
+ rc = 1;
+out:
+ mutex_unlock(&vmem_mutex);
+ return rc;
+}
+
/*
* map whole physical memory to virtual memory (identity mapping)
*/
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -28,6 +28,13 @@ extern unsigned long num_physpages;
extern void * high_memory;
extern int page_cluster;
+/*
+ * This callback is only needed when using VM_MIXEDMAP. It is used by common
+ * code to check if a pfn needs refcounting in the corresponding struct page.
+ */
+extern int mixedmap_refcount_pfn(unsigned long pfn);
+
+
#ifdef CONFIG_SYSCTL
extern int sysctl_legacy_va_layout;
#else
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 9:35 ` [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend Carsten Otte
@ 2008-01-08 10:08 ` Nick Piggin
2008-01-08 11:34 ` Carsten Otte
2008-01-09 15:14 ` [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend v2 Carsten Otte
[not found] ` <1199891032.28689.9.camel@cotte.boeblingen.de.ibm.com>
2 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2008-01-08 10:08 UTC (permalink / raw)
To: Carsten Otte
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
On Tue, Jan 08, 2008 at 10:35:54AM +0100, Carsten Otte wrote:
> Am Freitag, den 21.12.2007, 11:47 +0100 schrieb Nick Piggin:
> > BTW. having a per-arch function sounds reasonable for a start. I'd just give
> > it a long name, so that people don't start using it for weird things ;)
> > mixedmap_refcount_pfn() or something.
> Based on our previous discussion, and based on previous patches by Jared
> and Nick, this patch series makes XIP without struct page backing usable
> on s390 architecture.
> This patch set includes:
> 1/4: mm: introduce VM_MIXEDMAP mappings from Jared Hulbert, modified to
> use an arch-callback to identify whether or not a pfn needs refcounting
> 2/4: xip: support non-struct page memory from Nick Piggin, modified to
> use an arch-callback to identify whether or not a pfn needs refcounting
> 3/4: s390: remove struct page entries for z/VM DCSS memory segments
> 4/4: s390: proof of concept implementation of mixedmap_refcount_pfn()
> for s390 using list-walk
Nice! I'm glad that the xip support didn't need anything further than
the mixedmap_refcount_pfn for s390. Hopefully it proves to be stable
under further testing.
I'm just curious (or forgetful) as to why s390's pfn_valid does not walk
your memory segments? (That would allow the s390 proof of concept to be
basically a noop, and mixedmap_refcount_pfn will only be required when
we start using another pte bit.)
> Above stack seems to work well, I did some sniff-testing applied on top
> of Linus' current git tree. We do want to spend a precious pte bit to
> speed up this callback, therefore patch 4/4 will get replaced.
I think using another bit in the pte for special mappings is reasonable.
As I posted in my earlier patch, we can also use it to simplify vm_normal_page,
and it facilitates a lock free get_user_pages.
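Roughly, such a bit would let vm_normal_page collapse to something like the
following sketch (pte_special() is a placeholder name for querying that bit,
not an interface defined by the patches in this thread):

struct page *vm_normal_page(struct vm_area_struct *vma,
                            unsigned long addr, pte_t pte)
{
        if (pte_special(pte))
                return NULL;    /* no refcounted struct page behind this pte */
        /* everything else is a normal, refcounted page */
        return pfn_to_page(pte_pfn(pte));
}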
Anyway, hmm... I guess we should probably get these patches into -mm and
then upstream soon. Any objections from anyone? Do you guys have performance /
stress testing for xip?
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 19:45 ` Russell King
2008-01-07 22:52 ` Jared Hulbert
2008-01-08 2:37 ` Andi Kleen
@ 2008-01-08 10:11 ` Catalin Marinas
2008-01-08 10:52 ` Russell King
2 siblings, 1 reply; 79+ messages in thread
From: Catalin Marinas @ 2008-01-08 10:11 UTC (permalink / raw)
To: Russell King
Cc: Jared Hulbert, Nick Piggin, Martin Schwidefsky, carsteno,
Heiko Carstens, Linux Memory Management List, linux-arch
On Mon, 2008-01-07 at 19:45 +0000, Russell King wrote:
> In old ARM CPUs, there were two bits that defined the characteristics of
> the mapping - the C and B bits (C = cacheable, B = bufferable)
>
> Some ARMv5 (particularly Xscale-based) and all ARMv6 CPUs extend this to
> five bits and introduce "memory types" - 3 bits of TEX, and C and B.
>
> Between these bits, it defines:
>
> - strongly ordered
> - bufferable only *
> - device, sharable *
> - device, unsharable
> - memory, bufferable and cacheable, write through, no write allocate
> - memory, bufferable and cacheable, write back, no write allocate
> - memory, bufferable and cacheable, write back, write allocate
> - implementation defined combinations (eg, selecting "minicache")
> - and a set of 16 states to allow the policy of inner and outer levels
> of cache to be defined (two bits per level).
Can we not restrict these to a maximum of 8 base types at run-time? If
yes, we need only use 3 bits for the encoding and can also benefit from the
automatic remapping in later ARM CPUs. For those not familiar with ARM,
8 combinations of the TEX, C, B and S (shared) bits can be specified in
separate registers and the pte would only use 3 bits to refer to those.
Even older cores would benefit from this as I think it is faster to read
the encoding from an array in set_pte than doing all the bit comparisons
to calculate the hardware pte in the current implementation.
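As a rough sketch only (MT_SHIFT, HW_TEXCB_MASK and the table contents are
placeholder names for illustration, not actual ARM code):

/* eight TEXCB(+S) encodings, chosen once at boot for the running CPU */
static u32 mem_type_lut[8];

static inline u32 linux_pte_to_hw(u32 lpte)
{
        unsigned int idx = (lpte >> MT_SHIFT) & 7;      /* 3-bit type index */

        return (lpte & ~HW_TEXCB_MASK) | mem_type_lut[idx];
}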
--
Catalin
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-08 10:11 ` Catalin Marinas
@ 2008-01-08 10:52 ` Russell King
2008-01-08 13:54 ` Catalin Marinas
0 siblings, 1 reply; 79+ messages in thread
From: Russell King @ 2008-01-08 10:52 UTC (permalink / raw)
To: Catalin Marinas
Cc: Jared Hulbert, Nick Piggin, Martin Schwidefsky, carsteno,
Heiko Carstens, Linux Memory Management List, linux-arch
On Tue, Jan 08, 2008 at 10:11:15AM +0000, Catalin Marinas wrote:
> On Mon, 2008-01-07 at 19:45 +0000, Russell King wrote:
> > In old ARM CPUs, there were two bits that defined the characteristics of
> > the mapping - the C and B bits (C = cacheable, B = bufferable)
> >
> > Some ARMv5 (particularly Xscale-based) and all ARMv6 CPUs extend this to
> > five bits and introduce "memory types" - 3 bits of TEX, and C and B.
> >
> > Between these bits, it defines:
> >
> > - strongly ordered
> > - bufferable only *
> > - device, sharable *
> > - device, unsharable
> > - memory, bufferable and cacheable, write through, no write allocate
> > - memory, bufferable and cacheable, write back, no write allocate
> > - memory, bufferable and cacheable, write back, write allocate
> > - implementation defined combinations (eg, selecting "minicache")
> > - and a set of 16 states to allow the policy of inner and outer levels
> > of cache to be defined (two bits per level).
>
> Can we not restrict these to a maximum of 8 base types at run-time? If
> yes, we can only use 3 bits for encoding and also benefit from the
> automatic remapping in later ARM CPUs. For those not familiar with ARM,
> 8 combinations of the TEX, C, B and S (shared) bits can be specified in
> separate registers and the pte would only use 3 bits to refer to those.
> Even older cores would benefit from this as I think it is faster to read
> the encoding from an array in set_pte than doing all the bit comparisons
> to calculate the hardware pte in the current implementation.
So basically that gives us the following combinations:
TEXCB
00000 - /dev/mem and device uncachable mappings (strongly ordered)
00001 - frame buffers
00010 - write through mappings (selectable via kernel command line)
and also work-around for user read-only write-back mappings
on PXA2.
00011 - normal write back mappings
00101 - Xscale3 "shared device" work-around for strongly ordered mappings
00110 - PXA3 mini-cache or other "implementation defined features"
00111 - write back write allocate mappings
01000 - non-shared device (will be required to map some devices to userspace)
and also Xscale3 work-around for strongly ordered mappings
10111 - Xscale3 L2 cache-enabled mappings
It's unclear at present what circumstances you'd use each of the two
Xscale3 work-around bit combinations - or indeed whether there's a
printing error in the documentation concerning TEXCB=00101.
It's also unclear how to squeeze these down into a bit pattern in such
a way that we avoid picking out bits from the Linux PTE, and recombining
them so we can look them up in a table or whatever - especially given
that set_pte is a fast path and extra cycles there have a VERY noticable
impact on overall system performance.
However, until we get around to sorting out the implementation of the
Xscale3 strongly ordered work-around which seems to be the highest
priority (and hardest to resolve) I don't think there's much more to
discuss; we don't have a clear way ahead on these issues at the moment.
All we current have is the errata entry, and we know people are seeing
data corruption on Xscale3 platforms.
And no, I don't think we can keep it contained within the Xscale3 support
file - the set_pte method isn't passed sufficient information for that.
Conversely, setting the TEX bits behind set_pte's back by using set_pte_ext
results in loss of that information when the page is aged - again resulting
in data corruption.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 10:08 ` Nick Piggin
@ 2008-01-08 11:34 ` Carsten Otte
2008-01-08 11:55 ` Nick Piggin
2008-01-08 13:56 ` Jörn Engel
0 siblings, 2 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-08 11:34 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Nick Piggin wrote:
> I'm just curious (or forgetful) as to why s390's pfn_valid does not walk
> your memory segments? (That would allow the s390 proof of concept to be
> basically a noop, and mixedmap_refcount_pfn will only be required when
> we start using another pte bit.)
Our pfn_valid uses a hardware instruction, which checks whether there
is memory behind a pfn that we can access. And we'd like to use the
very same memory segment for both regular memory hotplug where the
memory ends up in ZONE_NORMAL (in this case the memory would be
read+write, and not shared with other guests), and for backing xip
file systems (in this case the memory would be read-only, and shared).
And in both cases, our instruction does consider the pfn to be valid.
Thus, pfn_valid is not the right indicator for us to check if we need
refcounting or not.
> I think using another bit in the pte for special mappings is reasonable.
> As I posted in my earlier patch, we can also use it to simplify vm_normal_page,
> and it facilitates a lock free get_user_pages.
That patch looks very nice. I am going to define PTE_SPECIAL for s390
arch next...
> Anyway, hmm... I guess we should probably get these patches into -mm and
> then upstream soon. Any objections from anyone? Do you guys have performance /
> stress testing for xip?
I think it is mature enough to push upstream, I've booted a distro
with /usr on it. But I really really want to exchange patch #4 with a
pte-bit based one before pushing this.
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 11:34 ` Carsten Otte
@ 2008-01-08 11:55 ` Nick Piggin
2008-01-08 12:03 ` Carsten Otte
2008-01-08 13:56 ` Jörn Engel
1 sibling, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2008-01-08 11:55 UTC (permalink / raw)
To: carsteno
Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky,
Heiko Carstens
On Tue, Jan 08, 2008 at 12:34:22PM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >I'm just curious (or forgetful) as to why s390's pfn_valid does not walk
> >your memory segments? (That would allow the s390 proof of concept to be
> >basically a noop, and mixedmap_refcount_pfn will only be required when
> >we start using another pte bit.)
> Our pfn_valid uses a hardware instruction, which does check if there
> is memory behind a pfn which we can access. And we'd like to use the
> very same memory segment for both regular memory hotplug where the
> memory ends up in ZONE_NORMAL (in this case the memory would be
> read+write, and not shared with other guests), and for backing xip
> file systems (in this case the memory would be read-only, and shared).
> And in both cases, our instruction does consider the pfn to be valid.
> Thus, pfn_valid is not the right indicator for us to check if we need
> refcounting or not.
Sure, but I think pfn_valid is _supposed_ to indicate whether the pfn has
a struct page, isn't it? I think that is the case with the other standard
memory models. That would make it applicable for you (ie. if it were
implemented as mixedmap_refcount_pfn).
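That is, on such memory models the callback could presumably be reduced to
the following sketch (assuming pfn_valid really does mean "has a struct
page" there):

int mixedmap_refcount_pfn(unsigned long pfn)
{
        /* refcount exactly those pfns that carry a struct page */
        return pfn_valid(pfn);
}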
Is it used in any performance-critical paths where it needs to be really
fast?
(Anyway, this is just an aside -- I have no real problem with
mixedmap_refcount_pfn, or using a special pte bit...)
> >I think using another bit in the pte for special mappings is reasonable.
> >As I posted in my earlier patch, we can also use it to simplify
> >vm_normal_page,
> >and it facilitates a lock free get_user_pages.
> That patch looks very nice. I am going to define PTE_SPECIAL for s390
> arch next...
Great.
> >Anyway, hmm... I guess we should probably get these patches into -mm and
> >then upstream soon. Any objections from anyone? Do you guys have
> >performance /
> >stress testing for xip?
> I think it is mature enough to push upstream, I've booted a distro
> with /usr on it.
Oh good. So just to clarify -- I guess you guys have a readonly filesystem
containing the distro on the host, and mount it XIP on each guest... avoiding
struct page means you save a bit of memory on each guest?
> But I really really want to exchange patch #4 with a
> pte-bit based one before pushing this.
OK fair enough, let's do that.
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 11:55 ` Nick Piggin
@ 2008-01-08 12:03 ` Carsten Otte
0 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-08 12:03 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Nick Piggin wrote:
> Oh good. So just to clarify -- I guess you guys have a readonly filesystem
> containing the distro on the host, and mount it XIP on each guest... avoiding
> struct page means you save a bit of memory on each guest?
That's right. It's quite a bit of memory for struct page entries,
because we'd love to have an entire distro with a superset of packages
for each guest installed on the filesystem (a large shared segment).
And we're talking about a three-digit number of guests here. This is a
real benefit in our scenario.
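(For a rough sense of scale: assuming 4 KB pages and on the order of 64
bytes per struct page, a 1 GB shared segment costs roughly 16 MB of struct
page memory per guest, and replicated across several hundred guests that
is gigabytes of host memory.)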
>> But I really really want to exchange patch #4 with a
>> pte-bit based one before pushing this.
> OK fair enough, let's do that.
Thanks, am on it...
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-08 10:52 ` Russell King
@ 2008-01-08 13:54 ` Catalin Marinas
2008-01-08 14:08 ` Russell King
0 siblings, 1 reply; 79+ messages in thread
From: Catalin Marinas @ 2008-01-08 13:54 UTC (permalink / raw)
To: Russell King
Cc: Jared Hulbert, Nick Piggin, Martin Schwidefsky, carsteno,
Heiko Carstens, Linux Memory Management List, linux-arch
On Tue, 2008-01-08 at 10:52 +0000, Russell King wrote:
> On Tue, Jan 08, 2008 at 10:11:15AM +0000, Catalin Marinas wrote:
> > Can we not restrict these to a maximum of 8 base types at run-time? If
> > yes, we can only use 3 bits for encoding and also benefit from the
> > automatic remapping in later ARM CPUs. For those not familiar with ARM,
> > 8 combinations of the TEX, C, B and S (shared) bits can be specified in
> > separate registers and the pte would only use 3 bits to refer to those.
> > Even older cores would benefit from this as I think it is faster to read
> > the encoding from an array in set_pte than doing all the bit comparisons
> > to calculate the hardware pte in the current implementation.
>
> So basically that gives us the following combinations:
I reordered them a bit for easier commenting.
> TEXCB
> 00010 - write through mappings (selectable via kernel command line)
> and also work-around for user read-only write-back mappings
> on PXA2.
> 00011 - normal write back mappings
> 00111 - write back write allocate mappings
Do you need to use all of the above at the same time? We could have only
one type, "normal memory", and configure the desired TEX encoding at
boot time.
> 00000 - /dev/mem and device uncachable mappings (strongly ordered)
> 00101 - Xscale3 "shared device" work-around for strongly ordered mappings
> 01000 - non-shared device (will be required to map some devices to
> userspace)
> and also Xscale3 work-around for strongly ordered mappings
I don't know the details of the Xscale3 bug but would you need all of
these encodings at run-time? Do you need both "strongly ordered" and the
workaround? We could only have the "strongly ordered" type and configure
the TEX bits at boot time to be "shared device" if the workaround is
needed.
For the last one, we could have the "non-shared device" type.
> 00001 - frame buffers
This would be "shared device" on newer CPUs.
> 00110 - PXA3 mini-cache or other "implementation defined features"
> 10111 - Xscale3 L2 cache-enabled mappings
It depends on how many of these you would need at run-time. If the base
types are "normal", "strongly ordered", "shared device", "non-shared
device", you still have 4 more left (or 3 on ARMv6 with TEX remapping
enabled since one encoding is implementation defined).
> It's unclear at present what circumstances you'd use each of the two
> Xscale3 work-around bit combinations - or indeed whether there's a
> printing error in the documentation concerning TEXCB=00101.
As I said, I don't know the details of this bug and can't comment.
> It's also unclear how to squeeze these down into a bit pattern in such
> a way that we avoid picking out bits from the Linux PTE, and recombining
> them so we can look them up in a table or whatever - especially given
> that set_pte is a fast path and extra cycles there have a VERY noticable
> impact on overall system performance.
As with the automatic remapping on ARMv6, we could use TEX[0], C and B
to form the 3-bit index into the table. For pre-ARMv6 hardware, we need a
bit of shifting and masking before looking up the eight-entry table of 32-bit words
but, for subsequent calls to set_pte, it is likely that the table would
be in cache anyway. There is also the option of choosing 3 consecutive
bits to avoid shifting on pre-ARMv6.
I agree there would be a delay on pre-ARMv6 CPUs but the impact might
not be that big since the current set_pte implementations still do
additional bit shifting/comparison for the access permissions. The
advantage is that we free 2 bits from the TEXCB encoding.
I haven't run any benchmarks and I can't say how big the impact is but,
based on some past discussions, 3-4 more cycles in set_pte might go
unnoticed because of other, bigger overheads.
--
Catalin
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 11:34 ` Carsten Otte
2008-01-08 11:55 ` Nick Piggin
@ 2008-01-08 13:56 ` Jörn Engel
2008-01-08 14:51 ` Carsten Otte
1 sibling, 1 reply; 79+ messages in thread
From: Jörn Engel @ 2008-01-08 13:56 UTC (permalink / raw)
To: Carsten Otte
Cc: Nick Piggin, carsteno, Jared Hulbert,
Linux Memory Management List, Martin Schwidefsky, Heiko Carstens
On Tue, 8 January 2008 12:34:22 +0100, Carsten Otte wrote:
>
> That patch looks very nice. I am going to define PTE_SPECIAL for s390
> arch next...
"PTE_SPECIAL" does not sound too descriptive. Maybe PTE_MIXEDMAP? It
may not be great, but at least it give a hint in the right direction.
Jörn
--
Good warriors cause others to come to them and do not go to others.
-- Sun Tzu
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-08 13:54 ` Catalin Marinas
@ 2008-01-08 14:08 ` Russell King
0 siblings, 0 replies; 79+ messages in thread
From: Russell King @ 2008-01-08 14:08 UTC (permalink / raw)
To: Catalin Marinas
Cc: Jared Hulbert, Nick Piggin, Martin Schwidefsky, carsteno,
Heiko Carstens, Linux Memory Management List, linux-arch
On Tue, Jan 08, 2008 at 01:54:15PM +0000, Catalin Marinas wrote:
> On Tue, 2008-01-08 at 10:52 +0000, Russell King wrote:
> > It's unclear at present what circumstances you'd use each of the two
> > Xscale3 work-around bit combinations - or indeed whether there's a
> > printing error in the documentation concerning TEXCB=00101.
>
> As I said, I don't know the details of this bug and can't comment.
As I said I don't think there's anything further that can be usefully
added to this discussion until we're further down the road with this.
Even though you don't know the details of the bug report, I've mentioned
as much as I know about it at present - and that includes with access to
Marvells spec update document. When I'm further down the line with PXA3
work maybe I'll know more, but my priority at the moment on PXA3 is
suspend/resume support.
> I haven't run any benchmarks and I can't say how big the impact is but,
> based on some past discussions, 3-4 more cycles in set_pte might go
> unnoticed because of other, bigger overheads.
Except when you're clearing out page tables - for instance when a
thread exits. It's very noticable and shows up rather well in
fork+exit tests - even shell scripts.
This was certainly the case with 2.2 kernels. Whether 2.6 kernels
are so heavyweight that it's been swapped into non-existence, I
don't know.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 13:56 ` Jörn Engel
@ 2008-01-08 14:51 ` Carsten Otte
2008-01-08 18:09 ` Jared Hulbert
0 siblings, 1 reply; 79+ messages in thread
From: Carsten Otte @ 2008-01-08 14:51 UTC (permalink / raw)
To: Jörn Engel
Cc: Nick Piggin, carsteno, Jared Hulbert,
Linux Memory Management List, Martin Schwidefsky, Heiko Carstens
Jörn Engel wrote:
> "PTE_SPECIAL" does not sound too descriptive. Maybe PTE_MIXEDMAP? It
> may not be great, but at least it give a hint in the right direction.
True, I've chosen a different name. PTE_SPECIAL is the name in Nick's
original patch (see patch in this thread).
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 14:51 ` Carsten Otte
@ 2008-01-08 18:09 ` Jared Hulbert
2008-01-08 22:12 ` Nick Piggin
0 siblings, 1 reply; 79+ messages in thread
From: Jared Hulbert @ 2008-01-08 18:09 UTC (permalink / raw)
To: carsteno
Cc: Jörn Engel, Nick Piggin, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
> Jörn Engel wrote:
> > "PTE_SPECIAL" does not sound too descriptive. Maybe PTE_MIXEDMAP? It
> > may not be great, but at least it give a hint in the right direction.
> True, I've chosen a different name. PTE_SPECIAL is the name in Nick's
> original patch (see patch in this thread).
Nick also wants to use that bit to "implement my lockless
get_user_page". I assume that's why the name is a little vague.
* Re: [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend
2008-01-08 18:09 ` Jared Hulbert
@ 2008-01-08 22:12 ` Nick Piggin
0 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2008-01-08 22:12 UTC (permalink / raw)
To: Jared Hulbert
Cc: carsteno, Jörn Engel, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
On Tue, Jan 08, 2008 at 10:09:52AM -0800, Jared Hulbert wrote:
> > Jörn Engel wrote:
> > > "PTE_SPECIAL" does not sound too descriptive. Maybe PTE_MIXEDMAP? It
> > > may not be great, but at least it give a hint in the right direction.
> > True, I've chosen a different name. PTE_SPECIAL is the name in Nick's
> > original patch (see patch in this thread).
>
> Nick also want's to use that bit to "implement my lockless
> get_user_page" I assume that's why the name is a little vague.
Yeah, and to simplify vm_normal_page on those architectures which provide it.
So it isn't just for VM_MIXEDMAP mappings (unless you're implementing a bit
specifically for that as an s390 specific thing -- which is reasonable for now).
We have 2 types of user mappings in the VM: "normal" and not-normal. I don't think
the latter have a name, so I call them special. If you're used to "normal" then
I think special is descriptive enough ;)
* [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend v2
2008-01-08 9:35 ` [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend Carsten Otte
2008-01-08 10:08 ` Nick Piggin
@ 2008-01-09 15:14 ` Carsten Otte
[not found] ` <1199891032.28689.9.camel@cotte.boeblingen.de.ibm.com>
2 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-09 15:14 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
This patchset is an improved version of yesterday's patchset and
contains the following patches:
1/4: add arch callbacks to toggle reference counting for VM_MIXEDMAP
pages
2/4: patch from Jared Hulbert that introduces VM_MIXEDMAP
3/4: patch from Nick Piggin, which uses VM_MIXEDMAP for XIP mappings
4/4: remove struct page entries for z/VM DCSS memory segments
This patch series is tested on top of Linus' git tree with ext2 -o xip
and dcssblk on s390x.
* [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
[not found] ` <1199891032.28689.9.camel@cotte.boeblingen.de.ibm.com>
@ 2008-01-09 15:14 ` Carsten Otte
2008-01-09 17:31 ` Martin Schwidefsky
` (2 more replies)
2008-01-09 15:14 ` [rfc][patch 2/4] mm: introduce VM_MIXEDMAP Carsten Otte, Jared Hulbert
` (2 subsequent siblings)
3 siblings, 3 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-09 15:14 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
This patch introduces two arch callbacks, which may optionally be implemented
in case the architecture defines __HAVE_ARCH_PTEP_NOREFCOUNT.
The first callback, pte_set_norefcount(__pte), is called by core-vm to indicate
that the subject page table entry is going to be inserted into a VM_MIXEDMAP vma.
default implementation: noop
s390 implementation: set sw defined bit in pte
proposed arm implementation: noop
The second callback, mixedmap_refcount_pte(__pte) is called by core-vm to
figure out whether or not subject pte requires reference counting in the
corresponding struct page entry. A non-zero result indicates reference counting
is required.
default implementation: (1)
s390 implementation: query sw defined bit in pte
proposed arm implementation: convert pte_t to pfn, use pfn_valid()
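As a sketch, the proposed arm variant would then presumably reduce to
something like the following (the exact macro bodies are illustrative and
not part of this patch):

#define __HAVE_ARCH_PTEP_NOREFCOUNT
#define pte_set_norefcount(__pte)       (__pte) /* noop: no sw bit spent */
#define mixedmap_refcount_pte(__pte)    pfn_valid(pte_pfn(__pte))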
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-generic/pgtable.h
+++ linux-2.6/include/asm-generic/pgtable.h
@@ -99,6 +99,11 @@ static inline void ptep_set_wrprotect(st
}
#endif
+#ifndef __HAVE_ARCH_PTEP_NOREFCOUNT
+#define pte_set_norefcount(__pte) (__pte)
+#define mixedmap_refcount_pte(__pte) (1)
+#endif
+
#ifndef __HAVE_ARCH_PTE_SAME
#define pte_same(A,B) (pte_val(A) == pte_val(B))
#endif
Index: linux-2.6/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-s390/pgtable.h
+++ linux-2.6/include/asm-s390/pgtable.h
@@ -228,6 +228,7 @@ extern unsigned long vmalloc_end;
/* Software bits in the page table entry */
#define _PAGE_SWT 0x001 /* SW pte type bit t */
#define _PAGE_SWX 0x002 /* SW pte type bit x */
+#define _PAGE_NOREFCNT 0x004 /* SW prevent refcount for xip */
/* Six different types of pages. */
#define _PAGE_TYPE_EMPTY 0x400
@@ -773,6 +774,14 @@ static inline pte_t ptep_get_and_clear_f
__changed; \
})
+#define __HAVE_ARCH_PTEP_NOREFCOUNT
+#define pte_set_norefcount(__pte) \
+({ \
+ pte_val(__pte) |= _PAGE_NOREFCNT; \
+ __pte; \
+})
+#define mixedmap_refcount_pte(__pte) (!(pte_val(__pte) & _PAGE_NOREFCNT))
+
/*
* Test and clear dirty bit in storage key.
* We can't clear the changed bit atomically. This is a potential
* [rfc][patch 2/4] mm: introduce VM_MIXEDMAP
[not found] ` <1199891032.28689.9.camel@cotte.boeblingen.de.ibm.com>
2008-01-09 15:14 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Carsten Otte
@ 2008-01-09 15:14 ` Carsten Otte, Jared Hulbert
2008-01-09 15:14 ` [rfc][patch 3/4] Convert XIP to support non-struct page backed memory Carsten Otte, Nick Piggin
2008-01-09 15:14 ` [rfc][patch 4/4] s390: remove struct page entries for DCSS memory segments Carsten Otte
3 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte, Jared Hulbert @ 2008-01-09 15:14 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
mm: introduce VM_MIXEDMAP
Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in
that it can support COW mappings of arbitrary ranges including ranges without
struct page (PFNMAP can only support COW in those cases where the un-COW-ed
translations are mapped linearly in the virtual address).
VM_MIXEDMAP achieves this by refcounting pages with mixedmap_refcount_pte(pte)
being non-zero, and not refcounting !mixedmap_refcount_pte(pte) pages
(which is not an option for VM_PFNMAP, because it needs to avoid refcounting
pfn_valid pages eg. for /dev/mem mappings).
The core VM calls pte_set_norefcount(__pte) for each PTE that is being created
by vm_insert_pfn and do_no_pfn for a vma that has VM_MIXEDMAP set.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -106,6 +106,7 @@ extern unsigned int kobjsize(const void
#define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */
#define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */
+#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -361,35 +361,65 @@ static inline int is_cow_mapping(unsigne
}
/*
- * This function gets the "struct page" associated with a pte.
+ * This function gets the "struct page" associated with a pte or returns
+ * NULL if no "struct page" is associated with the pte.
*
- * NOTE! Some mappings do not have "struct pages". A raw PFN mapping
- * will have each page table entry just pointing to a raw page frame
- * number, and as far as the VM layer is concerned, those do not have
- * pages associated with them - even if the PFN might point to memory
+ * A raw VM_PFNMAP mapping (ie. one that is not COWed) may not have any "struct
+ * page" backing, and even if they do, they are not refcounted. COWed pages of
+ * a VM_PFNMAP do always have a struct page, and they are normally refcounted
+ * (they are _normal_ pages).
+ *
+ * So a raw PFNMAP mapping will have each page table entry just pointing
+ * to a page frame number, and as far as the VM layer is concerned, those do
+ * not have pages associated with them - even if the PFN might point to memory
* that otherwise is perfectly fine and has a "struct page".
*
- * The way we recognize those mappings is through the rules set up
- * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
- * and the vm_pgoff will point to the first PFN mapped: thus every
+ * The way we recognize COWed pages within VM_PFNMAP mappings is through the
+ * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
+ * set, and the vm_pgoff will point to the first PFN mapped: thus every
* page that is a raw mapping will always honor the rule
*
* pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
*
- * and if that isn't true, the page has been COW'ed (in which case it
- * _does_ have a "struct page" associated with it even if it is in a
- * VM_PFNMAP range).
+ * A call to vm_normal_page() will return NULL for such a page.
+ *
+ * If the page doesn't follow the "remap_pfn_range()" rule in a VM_PFNMAP
+ * then the page has been COW'ed. A COW'ed page _does_ have a "struct page"
+ * associated with it even if it is in a VM_PFNMAP range. Calling
+ * vm_normal_page() on such a page will therefore return the "struct page".
+ *
+ *
+ * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
+ * page" backing, however the difference is that _all_ pages with a struct
+ * page (that is, those where mixedmap_refcount_pte is true) are refcounted
+ * and considered normal pages by the VM. The disadvantage is that pages are
+ * refcounted (which can be slower and simply not an option for some PFNMAP
+ * users). The advantage is that we don't have to follow the strict linearity
+ * rule of PFNMAP mappings in order to support COWable mappings.
+ *
+ * A call to vm_normal_page() with a VM_MIXEDMAP mapping will return the
+ * associated "struct page" or NULL for memory not backed by a "struct page".
+ *
+ *
+ * All other mappings should have a valid struct page, which will be
+ * returned by a call to vm_normal_page().
*/
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
unsigned long pfn = pte_pfn(pte);
- if (unlikely(vma->vm_flags & VM_PFNMAP)) {
- unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
+ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+ if (vma->vm_flags & VM_MIXEDMAP) {
+ if (!mixedmap_refcount_pte(pte))
+ return NULL;
+ goto out;
+ } else {
+ unsigned long off = (addr-vma->vm_start) >> PAGE_SHIFT;
+ if (pfn == vma->vm_pgoff + off)
+ return NULL;
+ if (!is_cow_mapping(vma->vm_flags))
+ return NULL;
+ }
}
/*
@@ -410,6 +440,7 @@ struct page *vm_normal_page(struct vm_ar
* The PAGE_ZERO() pages and various VDSO mappings can
* cause them to exist.
*/
+out:
return pfn_to_page(pfn);
}
@@ -1211,8 +1242,10 @@ int vm_insert_pfn(struct vm_area_struct
pte_t *pte, entry;
spinlock_t *ptl;
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
+ (VM_PFNMAP|VM_MIXEDMAP));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
retval = -ENOMEM;
pte = get_locked_pte(mm, addr, &ptl);
@@ -1222,8 +1255,11 @@ int vm_insert_pfn(struct vm_area_struct
if (!pte_none(*pte))
goto out_unlock;
+
/* Ok, finally just insert the thing.. */
entry = pfn_pte(pfn, vma->vm_page_prot);
+ if (vma->vm_flags & VM_MIXEDMAP)
+ entry = pte_set_norefcount(entry);
set_pte_at(mm, addr, pte, entry);
update_mmu_cache(vma, addr, entry);
@@ -2386,10 +2422,11 @@ static noinline int do_no_pfn(struct mm_
unsigned long pfn;
pte_unmap(page_table);
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+
if (unlikely(pfn == NOPFN_OOM))
return VM_FAULT_OOM;
else if (unlikely(pfn == NOPFN_SIGBUS))
@@ -2404,6 +2441,8 @@ static noinline int do_no_pfn(struct mm_
entry = pfn_pte(pfn, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (vma->vm_flags & VM_MIXEDMAP)
+ entry = pte_set_norefcount(entry);
set_pte_at(mm, address, page_table, entry);
}
pte_unmap_unlock(page_table, ptl);
* [rfc][patch 3/4] Convert XIP to support non-struct page backed memory
[not found] ` <1199891032.28689.9.camel@cotte.boeblingen.de.ibm.com>
2008-01-09 15:14 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Carsten Otte
2008-01-09 15:14 ` [rfc][patch 2/4] mm: introduce VM_MIXEDMAP Carsten Otte, Jared Hulbert
@ 2008-01-09 15:14 ` Carsten Otte, Nick Piggin
2008-01-09 15:14 ` [rfc][patch 4/4] s390: remove struct page entries for DCSS memory segments Carsten Otte
3 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte, Nick Piggin @ 2008-01-09 15:14 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP
for the user mappings.
This requires the get_xip_page API to be changed to an address based one.
(The kaddr->pfn conversion may not be quite right for all architectures or XIP
memory mappings, and the cacheflushing may need to be updated for some archs).
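To make the new calling convention concrete, a condensed fragment (adapted
from the do_xip_mapping_read changes in this series; declarations and error
handling elided) indexes by file page offset rather than by sector:

        pgoff_t index = pos >> PAGE_CACHE_SHIFT;        /* file page index */
        unsigned long offset = pos & ~PAGE_CACHE_MASK;  /* byte within page */
        void *xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);

        if (!IS_ERR(xip_mem))   /* direct copy, no struct page involved */
                left = __copy_to_user(buf, xip_mem + offset, nr);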
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -800,7 +800,7 @@ const struct address_space_operations ex
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_page = ext2_get_xip_page,
+ .get_xip_address = ext2_get_xip_address,
};
const struct address_space_operations ext2_nobh_aops = {
Index: linux-2.6/fs/ext2/xip.c
===================================================================
--- linux-2.6.orig/fs/ext2/xip.c
+++ linux-2.6/fs/ext2/xip.c
@@ -15,24 +15,25 @@
#include "xip.h"
static inline int
-__inode_direct_access(struct inode *inode, sector_t sector,
- unsigned long *data)
+__inode_direct_access(struct inode *inode, sector_t block, unsigned long *data)
{
+ sector_t sector;
BUG_ON(!inode->i_sb->s_bdev->bd_disk->fops->direct_access);
+
+ sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
return inode->i_sb->s_bdev->bd_disk->fops
- ->direct_access(inode->i_sb->s_bdev,sector,data);
+ ->direct_access(inode->i_sb->s_bdev, sector, data);
}
static inline int
-__ext2_get_sector(struct inode *inode, sector_t offset, int create,
+__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
sector_t *result)
{
struct buffer_head tmp;
int rc;
memset(&tmp, 0, sizeof(struct buffer_head));
- rc = ext2_get_block(inode, offset/ (PAGE_SIZE/512), &tmp,
- create);
+ rc = ext2_get_block(inode, pgoff, &tmp, create);
*result = tmp.b_blocknr;
/* did we get a sparse block (hole in the file)? */
@@ -45,13 +46,12 @@ __ext2_get_sector(struct inode *inode, s
}
int
-ext2_clear_xip_target(struct inode *inode, int block)
+ext2_clear_xip_target(struct inode *inode, sector_t block)
{
- sector_t sector = block * (PAGE_SIZE/512);
unsigned long data;
int rc;
- rc = __inode_direct_access(inode, sector, &data);
+ rc = __inode_direct_access(inode, block, &data);
if (!rc)
clear_page((void*)data);
return rc;
@@ -69,24 +69,24 @@ void ext2_xip_verify_sb(struct super_blo
}
}
-struct page *
-ext2_get_xip_page(struct address_space *mapping, sector_t offset,
- int create)
+void *
+ext2_get_xip_address(struct address_space *mapping, pgoff_t pgoff, int create)
{
int rc;
unsigned long data;
- sector_t sector;
+ sector_t block;
/* first, retrieve the sector number */
- rc = __ext2_get_sector(mapping->host, offset, create, &sector);
+ rc = __ext2_get_block(mapping->host, pgoff, create, &block);
if (rc)
goto error;
/* retrieve address of the target data */
- rc = __inode_direct_access
- (mapping->host, sector * (PAGE_SIZE/512), &data);
- if (!rc)
- return virt_to_page(data);
+ rc = __inode_direct_access(mapping->host, block, &data);
+ if (rc)
+ goto error;
+
+ return (void *)data;
error:
return ERR_PTR(rc);
Index: linux-2.6/fs/ext2/xip.h
===================================================================
--- linux-2.6.orig/fs/ext2/xip.h
+++ linux-2.6/fs/ext2/xip.h
@@ -7,15 +7,15 @@
#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, int);
+extern int ext2_clear_xip_target (struct inode *, sector_t);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-struct page* ext2_get_xip_page (struct address_space *, sector_t, int);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_page)
+void *ext2_get_xip_address(struct address_space *, sector_t, int);
+#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_address)
#else
#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -778,7 +778,7 @@ static struct file *__dentry_open(struct
if (f->f_flags & O_DIRECT) {
if (!f->f_mapping->a_ops ||
((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_page))) {
+ (!f->f_mapping->a_ops->get_xip_address))) {
fput(f);
f = ERR_PTR(-EINVAL);
}
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -473,8 +473,7 @@ struct address_space_operations {
int (*releasepage) (struct page *, gfp_t);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- struct page* (*get_xip_page)(struct address_space *, sector_t,
- int);
+ void * (*get_xip_address)(struct address_space *, pgoff_t, int);
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct address_space *,
struct page *, struct page *);
Index: linux-2.6/mm/fadvise.c
===================================================================
--- linux-2.6.orig/mm/fadvise.c
+++ linux-2.6/mm/fadvise.c
@@ -49,7 +49,7 @@ asmlinkage long sys_fadvise64_64(int fd,
goto out;
}
- if (mapping->a_ops->get_xip_page)
+ if (mapping->a_ops->get_xip_address)
/* no bad return value, but ignore advice */
goto out;
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/sched.h>
#include <asm/tlbflush.h>
+#include <asm/io.h>
/*
* We do use our own empty page to avoid interference with other users
@@ -42,36 +43,39 @@ static struct page *xip_sparse_page(void
/*
* This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_page() function for the actual low-level
+ * the mapping->a_ops->get_xip_address() function for the actual low-level
* stuff.
*
* Note the struct file* is not used at all. It may be NULL.
*/
-static void
+static ssize_t
do_xip_mapping_read(struct address_space *mapping,
struct file_ra_state *_ra,
struct file *filp,
- loff_t *ppos,
- read_descriptor_t *desc,
- read_actor_t actor)
+ char __user *buf,
+ size_t len,
+ loff_t *ppos)
{
struct inode *inode = mapping->host;
unsigned long index, end_index, offset;
- loff_t isize;
+ loff_t isize, pos;
+ size_t copied = 0, error = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
- index = *ppos >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;
+ pos = *ppos;
+ index = pos >> PAGE_CACHE_SHIFT;
+ offset = pos & ~PAGE_CACHE_MASK;
isize = i_size_read(inode);
if (!isize)
goto out;
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- for (;;) {
- struct page *page;
- unsigned long nr, ret;
+ do {
+ unsigned long nr, left;
+ void *xip_mem;
+ int zero = 0;
/* nr is the maximum number of bytes to copy from this page */
nr = PAGE_CACHE_SIZE;
@@ -84,17 +88,20 @@ do_xip_mapping_read(struct address_space
}
}
nr = nr - offset;
+ if (nr > len)
+ nr = len;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
- goto no_xip_page;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA) {
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem) {
+ error = -EIO;
+ goto out;
+ }
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA) {
/* sparse */
- page = ZERO_PAGE(0);
+ zero = 1;
} else {
- desc->error = PTR_ERR(page);
+ error = PTR_ERR(xip_mem);
goto out;
}
}
@@ -104,10 +111,10 @@ do_xip_mapping_read(struct address_space
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ /* address based flush */ ;
/*
- * Ok, we have the page, so now we can copy it to user space...
+ * Ok, we have the mem, so now we can copy it to user space...
*
* The actor routine returns how many bytes were actually used..
* NOTE! This may not be the same as how much of a user buffer
@@ -115,47 +122,38 @@ do_xip_mapping_read(struct address_space
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
- offset += ret;
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
+ if (!zero)
+ left = __copy_to_user(buf+copied, xip_mem+offset, nr);
+ else
+ left = __clear_user(buf + copied, nr);
- if (ret == nr && desc->count)
- continue;
- goto out;
+ if (left) {
+ error = -EFAULT;
+ goto out;
+ }
-no_xip_page:
- /* Did not get the page. Report it */
- desc->error = -EIO;
- goto out;
- }
+ copied += (nr - left);
+ offset += (nr - left);
+ index += offset >> PAGE_CACHE_SHIFT;
+ offset &= ~PAGE_CACHE_MASK;
+ } while (copied < len);
out:
- *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+ *ppos = pos + copied;
if (filp)
file_accessed(filp);
+
+ return (copied ? copied : error);
}
ssize_t
xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
- read_descriptor_t desc;
-
if (!access_ok(VERIFY_WRITE, buf, len))
return -EFAULT;
- desc.written = 0;
- desc.arg.buf = buf;
- desc.count = len;
- desc.error = 0;
-
- do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- ppos, &desc, file_read_actor);
-
- if (desc.written)
- return desc.written;
- else
- return desc.error;
+ return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
+ buf, len, ppos);
}
EXPORT_SYMBOL_GPL(xip_file_read);
@@ -210,13 +208,14 @@ __xip_unmap (struct address_space * mapp
*
* This function is derived from filemap_fault, but used for execute in place
*/
-static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf)
+static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- struct file *file = area->vm_file;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
- struct page *page;
pgoff_t size;
+ void *xip_mem;
+ struct page *page;
/* XXX: are VM_FAULT_ codes OK? */
@@ -224,24 +223,29 @@ static int xip_file_fault(struct vm_area
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 0);
- if (!IS_ERR(page))
- goto out;
- if (PTR_ERR(page) != -ENODATA)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, vmf->pgoff, 0);
+ if (!IS_ERR(xip_mem))
+ goto found;
+ if (PTR_ERR(xip_mem) != -ENODATA)
return VM_FAULT_OOM;
/* sparse block */
- if ((area->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (area->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
+ if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
+ (vma->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
(!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
+ unsigned long pfn;
+
/* maybe shared writable, allocate new block */
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 1);
- if (IS_ERR(page))
+ xip_mem = mapping->a_ops->get_xip_address(mapping, vmf->pgoff, 1);
+ if (IS_ERR(xip_mem))
return VM_FAULT_SIGBUS;
- /* unmap page at pgoff from all other vmas */
+ /* unmap sparse mappings at pgoff from all other vmas */
__xip_unmap(mapping, vmf->pgoff);
+
+found:
+ pfn = virt_to_phys(xip_mem) >> PAGE_SHIFT;
+ vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn);
+ return VM_FAULT_NOPAGE;
} else {
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
@@ -249,7 +253,6 @@ static int xip_file_fault(struct vm_area
return VM_FAULT_OOM;
}
-out:
page_cache_get(page);
vmf->page = page;
return 0;
@@ -261,11 +264,11 @@ static struct vm_operations_struct xip_f
int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
{
- BUG_ON(!file->f_mapping->a_ops->get_xip_page);
+ BUG_ON(!file->f_mapping->a_ops->get_xip_address);
file_accessed(file);
vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_CAN_NONLINEAR;
+ vma->vm_flags |= VM_CAN_NONLINEAR | VM_MIXEDMAP;
return 0;
}
EXPORT_SYMBOL_GPL(xip_file_mmap);
@@ -278,17 +281,16 @@ __xip_file_write(struct file *filp, cons
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- struct page *page;
size_t bytes;
ssize_t written = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
do {
unsigned long index;
unsigned long offset;
size_t copied;
- char *kaddr;
+ void *xip_mem;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
@@ -296,28 +298,22 @@ __xip_file_write(struct file *filp, cons
if (bytes > count)
bytes = count;
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (IS_ERR(page) && (PTR_ERR(page) == -ENODATA)) {
+ xip_mem = a_ops->get_xip_address(mapping, index, 0);
+ if (IS_ERR(xip_mem) && (PTR_ERR(xip_mem) == -ENODATA)) {
/* we allocate a new page and unmap it */
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 1);
- if (!IS_ERR(page))
+ xip_mem = a_ops->get_xip_address(mapping, index, 1);
+ if (!IS_ERR(xip_mem))
/* unmap page at pgoff from all other vmas */
__xip_unmap(mapping, index);
}
- if (IS_ERR(page)) {
- status = PTR_ERR(page);
+ if (IS_ERR(xip_mem)) {
+ status = PTR_ERR(xip_mem);
break;
}
- fault_in_pages_readable(buf, bytes);
- kaddr = kmap_atomic(page, KM_USER0);
copied = bytes -
- __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- flush_dcache_page(page);
+ __copy_from_user_nocache(xip_mem + offset, buf, bytes);
if (likely(copied > 0)) {
status = copied;
@@ -397,7 +393,7 @@ EXPORT_SYMBOL_GPL(xip_file_write);
/*
* truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_page
+ * functionality is analog to block_truncate_page but does use get_xip_address
* to get the page instead of page cache
*/
int
@@ -407,9 +403,9 @@ xip_truncate_page(struct address_space *
unsigned offset = from & (PAGE_CACHE_SIZE-1);
unsigned blocksize;
unsigned length;
- struct page *page;
+ void *xip_mem;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
blocksize = 1 << mapping->host->i_blkbits;
length = offset & (blocksize - 1);
@@ -420,18 +416,17 @@ xip_truncate_page(struct address_space *
length = blocksize - length;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem)
return -ENOMEM;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA)
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA)
/* Hole? No need to truncate */
return 0;
else
- return PTR_ERR(page);
+ return PTR_ERR(xip_mem);
}
- zero_user_page(page, offset, length, KM_USER0);
+ memset(xip_mem + offset, 0, length);
return 0;
}
EXPORT_SYMBOL_GPL(xip_truncate_page);
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -112,7 +112,7 @@ static long madvise_willneed(struct vm_a
if (!file)
return -EBADF;
- if (file->f_mapping->a_ops->get_xip_page) {
+ if (file->f_mapping->a_ops->get_xip_address) {
/* no bad return value, but ignore advice */
return 0;
}
* [rfc][patch 4/4] s390: remove struct page entries for DCSS memory segments
[not found] ` <1199891032.28689.9.camel@cotte.boeblingen.de.ibm.com>
` (2 preceding siblings ...)
2008-01-09 15:14 ` [rfc][patch 3/4] Convert XIP to support non-struct page backed memory Carsten Otte, Nick Piggin
@ 2008-01-09 15:14 ` Carsten Otte, Carsten Otte
3 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-09 15:14 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
s390: remove struct page entries for DCSS memory segments
This patch removes struct page entries for DCSS segments that are being loaded.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
---
Index: linux-2.6/arch/s390/mm/vmem.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/vmem.c
+++ linux-2.6/arch/s390/mm/vmem.c
@@ -310,8 +310,6 @@ out:
int add_shared_memory(unsigned long start, unsigned long size)
{
struct memory_segment *seg;
- struct page *page;
- unsigned long pfn, num_pfn, end_pfn;
int ret;
mutex_lock(&vmem_mutex);
@@ -326,24 +324,10 @@ int add_shared_memory(unsigned long star
if (ret)
goto out_free;
- ret = vmem_add_mem(start, size);
+ ret = vmem_add_range(start, size);
if (ret)
goto out_remove;
- pfn = PFN_DOWN(start);
- num_pfn = PFN_DOWN(size);
- end_pfn = pfn + num_pfn;
-
- page = pfn_to_page(pfn);
- memset(page, 0, num_pfn * sizeof(struct page));
-
- for (; pfn < end_pfn; pfn++) {
- page = pfn_to_page(pfn);
- init_page_count(page);
- reset_page_mapcount(page);
- SetPageReserved(page);
- INIT_LIST_HEAD(&page->lru);
- }
goto out;
out_remove:
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-09 15:14 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Carsten Otte, Carsten Otte
@ 2008-01-09 17:31 ` Martin Schwidefsky
2008-01-09 18:17 ` Jared Hulbert
2008-01-10 0:20 ` Nick Piggin
2 siblings, 0 replies; 79+ messages in thread
From: Martin Schwidefsky @ 2008-01-09 17:31 UTC (permalink / raw)
To: Carsten Otte
Cc: Nick Piggin, carsteno, Jared Hulbert,
Linux Memory Management List, Heiko Carstens
On Wed, 2008-01-09 at 16:14 +0100, Carsten Otte wrote:
> include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
>
> This patch introduces two arch callbacks, which may optionally be implemented
> in case the architecture defines __HAVE_ARCH_PTEP_NOREFCOUNT.
>
> The first callback, pte_set_norefcount(__pte) is called by core-vm to indicate
> that subject page table entry is going to be inserted into a VM_MIXEDMAP vma.
> default implementation: noop
> s390 implementation: set sw defined bit in pte
> proposed arm implementation: noop
>
> The second callback, mixedmap_refcount_pte(__pte) is called by core-vm to
> figure out whether or not subject pte requires reference counting in the
> corresponding struct page entry. A non-zero result indicates reference counting
> is required.
> default implementation: (1)
> s390 implementation: query sw defined bit in pte
> proposed arm implementation: convert pte_t to pfn, use pfn_valid()
>
> Signed-off-by: Carsten Otte <cotte@de.ibm.com>
For the s390 pieces of this patch:
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-09 15:14 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Carsten Otte, Carsten Otte
2008-01-09 17:31 ` Martin Schwidefsky
@ 2008-01-09 18:17 ` Jared Hulbert
2008-01-10 7:59 ` Carsten Otte
2008-01-10 0:20 ` Nick Piggin
2 siblings, 1 reply; 79+ messages in thread
From: Jared Hulbert @ 2008-01-09 18:17 UTC (permalink / raw)
To: Carsten Otte
Cc: Nick Piggin, carsteno, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
On Jan 9, 2008 7:14 AM, Carsten Otte <cotte@de.ibm.com> wrote:
> From: Carsten Otte <cotte@de.ibm.com>
>
> include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
>
> This patch introduces two arch callbacks, which may optionally be implemented
> in case the architecture defines __HAVE_ARCH_PTEP_NOREFCOUNT.
>
> The first callback, pte_set_norefcount(__pte) is called by core-vm to indicate
> that subject page table entry is going to be inserted into a VM_MIXEDMAP vma.
> default implementation: noop
> s390 implementation: set sw defined bit in pte
> proposed arm implementation: noop
>
> The second callback, mixedmap_refcount_pte(__pte) is called by core-vm to
> figure out whether or not subject pte requires reference counting in the
> corresponding struct page entry. A non-zero result indicates reference counting
> is required.
> default implementation: (1)
I think this should be:
default implementation: convert pte_t to pfn, use pfn_valid()
Keep in mind the reason we are talking about using anything other than
pfn_valid() in vm_normal_page() is because s390 has a non-standard
pfn_valid() implementation. It's s390 that's broken, not the rest of
the world. So let's not break everything else to fix s390 :) Or am I
missing something?
> s390 implementation: query sw defined bit in pte
> proposed arm implementation: convert pte_t to pfn, use pfn_valid()
proposed arm implementation: default
> Signed-off-by: Carsten Otte <cotte@de.ibm.com>
> ---
> Index: linux-2.6/include/asm-generic/pgtable.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/pgtable.h
> +++ linux-2.6/include/asm-generic/pgtable.h
> @@ -99,6 +99,11 @@ static inline void ptep_set_wrprotect(st
> }
> #endif
>
> +#ifndef __HAVE_ARCH_PTEP_NOREFCOUNT
> +#define pte_set_norefcount(__pte) (__pte)
> +#define mixedmap_refcount_pte(__pte) (1)
+#define mixedmap_refcount_pte(__pte) pfn_valid(pte_pfn(__pte))
Should we rename "mixedmap_refcount_pte" to "mixedmap_normal_pte" or
something else more neutral? To me "mixedmap_refcount_pte" sounds
like it's altering the pte.
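For illustration only (a sketch, not part of any posted patch; the
non-VM_MIXEDMAP handling is elided), core-vm could consult the callback
in vm_normal_page() along these lines:

struct page *vm_normal_page(struct vm_area_struct *vma,
                            unsigned long addr, pte_t pte)
{
        unsigned long pfn = pte_pfn(pte);

        if (vma->vm_flags & VM_MIXEDMAP) {
                /* the arch decides whether this pte has a refcounted page */
                if (!mixedmap_refcount_pte(pte))
                        return NULL;
                return pfn_to_page(pfn);
        }

        /* ... existing VM_PFNMAP and normal-page handling ... */
        return pfn_to_page(pfn);
}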
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-09 15:14 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Carsten Otte, Carsten Otte
2008-01-09 17:31 ` Martin Schwidefsky
2008-01-09 18:17 ` Jared Hulbert
@ 2008-01-10 0:20 ` Nick Piggin
2008-01-10 8:06 ` Carsten Otte
2 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2008-01-10 0:20 UTC (permalink / raw)
To: Carsten Otte
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
On Wed, Jan 09, 2008 at 04:14:05PM +0100, Carsten Otte wrote:
> From: Carsten Otte <cotte@de.ibm.com>
>
> include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
>
> This patch introduces two arch callbacks, which may optionally be implemented
> in case the architecture defines __HAVE_ARCH_PTEP_NOREFCOUNT.
>
> The first callback, pte_set_norefcount(__pte) is called by core-vm to indicate
> that subject page table entry is going to be inserted into a VM_MIXEDMAP vma.
> default implementation: noop
> s390 implementation: set sw defined bit in pte
> proposed arm implementation: noop
>
> The second callback, mixedmap_refcount_pte(__pte) is called by core-vm to
> figure out whether or not subject pte requires reference counting in the
> corresponding struct page entry. A non-zero result indicates reference counting
> is required.
> default implementation: (1)
> s390 implementation: query sw defined bit in pte
> proposed arm implementation: convert pte_t to pfn, use pfn_valid()
>
> Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Hmm, I had it in my mind that this would be entirely hidden in the s390's
mixedmap_refcount_pfn, but of course you actually need to set the pte too....
In that case, I would rather prefer to go along the lines of my pte_special
patch, which would replace all of vm_normal_page (on a per-arch basis), and
you wouldn't need this mixedmap_refcount_* stuff (it can stay pfn_valid for
those architectures that don't implement pte_special).
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-09 18:17 ` Jared Hulbert
@ 2008-01-10 7:59 ` Carsten Otte
2008-01-10 20:01 ` Jared Hulbert
2008-01-10 20:23 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Jared Hulbert
0 siblings, 2 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-10 7:59 UTC (permalink / raw)
To: Jared Hulbert
Cc: Nick Piggin, carsteno, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Jared Hulbert wrote:
> I think this should be:
>
> default implementation: convert pte_t to pfn, use pfn_valid()
>
> Keep in mind the reason we are talking about using anything other than
> pfn_valid() in vm_normal_page() is because s390 has a non-standard
> pfn_valid() implementation. It's s390 that's broken, not the rest of
> the world. So let's not break everything else to fix s390 :) Or am I
> missing something?
I think you're bending the original meaning of pfn_valid() in this
case: it is supposed to be true when a pfn refers to an accessible
mapping. In fact, I consider pfn_valid() broken on arm if it returns
false for a pfn that is perfectly valid for use in a pfnmap/mixedmap
mapping. I think you're looking for
pfn_has_struct_page_entry_for_it(), and that's different from the
original meaning described above.
I think it would be plain wrong to assume all architectures have this
meaning of pfn_valid() that arm has today.
>> s390 implementation: query sw defined bit in pte
>> proposed arm implementation: convert pte_t to pfn, use pfn_valid()
>
> proposed arm implementation: default
>
>> Signed-off-by: Carsten Otte <cotte@de.ibm.com>
>> ---
>> Index: linux-2.6/include/asm-generic/pgtable.h
>> ===================================================================
>> --- linux-2.6.orig/include/asm-generic/pgtable.h
>> +++ linux-2.6/include/asm-generic/pgtable.h
>> @@ -99,6 +99,11 @@ static inline void ptep_set_wrprotect(st
>> }
>> #endif
>>
>> +#ifndef __HAVE_ARCH_PTEP_NOREFCOUNT
>> +#define pte_set_norefcount(__pte) (__pte)
>> +#define mixedmap_refcount_pte(__pte) (1)
>
> +#define mixedmap_refcount_pte(__pte) pfn_valid(pte_pfn(__pte))
>
> Should we rename "mixedmap_refcount_pte" to "mixedmap_normal_pte" or
> something else more neutral? To me "mixedmap_refcount_pte" sounds
> like it's altering the pte.
Hmmmmh. Indeed, the wording is confusing here.
But anyway, I do want to play with Nick's PTE_SPECIAL thing next.
Therefore, I'm not going to change that unless we conclude we want to
go down this path.
Jared, did you try this on arm? Did it work for you with my proposed
callback implementation?
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-10 0:20 ` Nick Piggin
@ 2008-01-10 8:06 ` Carsten Otte
0 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-10 8:06 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List, mschwid2,
heicars2
Nick Piggin wrote:
> Hmm, I had it in my mind that this would be entirely hidden in the s390's
> mixedmap_refcount_pfn, but of course you actually need to set the pte too....
I didn't think about that upfront either.
> In that case, I would rather prefer to go along the lines of my pte_special
> patch, which would replace all of vm_normal_page (on a per-arch basis), and
> you wouldn't need this mixedmap_refcount_* stuff (it can stay pfn_valid for
> those architectures that don't implement pte_special).
I am going to play with PTE_SPECIAL next. I tend to agree with you
that the PTE_SPECIAL path looks more promising than the one implemented
in this patch series, because it offers a more generic meaning for our
valuable pte bit which can be used for various purposes by core-vm.
Let's just implement them all, and figure out the best one after that ;-).
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-07 4:43 ` [rfc][patch] mm: use a pte bit to flag normal pages Nick Piggin
2008-01-07 10:30 ` Russell King
@ 2008-01-10 13:33 ` Carsten Otte
2008-01-10 23:18 ` Nick Piggin
1 sibling, 1 reply; 79+ messages in thread
From: Carsten Otte @ 2008-01-10 13:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Martin Schwidefsky, carsteno, Heiko Carstens, Jared Hulbert,
Linux Memory Management List, linux-arch
Nick Piggin wrote:
> We initially wanted to do the whole vm_normal_page thing this way, with another
> pte bit, but we thought there were one or two archs with no spare bits. BTW. I
> also need this bit in order to implement my lockless get_user_pages, so I do hope
> to get it in. I'd like to know what architectures cannot spare a software bit in
> their pte_present ptes...
I've been playing with the original PAGE_SPECIAL patch a little bit, and
you can find the corresponding s390 definition below that you might want
to add to your patch queue.
It is a little unclear to me how you'd like to proceed from here:
- with PTE_SPECIAL, do we still have VM_MIXEDMAP or similar flag to
distinguish our new type of mapping from VM_PFNMAP? Which vma flags are
we supposed to use for xip mappings?
- does VM_PFNMAP work as before, or do you intend to replace it?
- what about vm_normal_page? Do you intend to have one per arch? The one
proposed by this patch breaks Jared's pfn_valid() thing and VM_PFNMAP
for archs that don't have PAGE_SPECIAL as far as I can tell.
---
Index: linux-2.6/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-s390/pgtable.h
+++ linux-2.6/include/asm-s390/pgtable.h
@@ -228,6 +228,7 @@ extern unsigned long vmalloc_end;
/* Software bits in the page table entry */
#define _PAGE_SWT 0x001 /* SW pte type bit t */
#define _PAGE_SWX 0x002 /* SW pte type bit x */
+#define _PAGE_SPECIAL 0x004 /* SW associated with special page */
/* Six different types of pages. */
#define _PAGE_TYPE_EMPTY 0x400
@@ -504,6 +505,12 @@ static inline int pte_file(pte_t pte)
return (pte_val(pte) & mask) == _PAGE_TYPE_FILE;
}
+static inline int pte_special(pte_t pte)
+{
+ BUG_ON(!pte_present(pte));
+ return (pte_val(pte) & _PAGE_SPECIAL);
+}
+
#define __HAVE_ARCH_PTE_SAME
#define pte_same(a,b) (pte_val(a) == pte_val(b))
@@ -654,6 +661,13 @@ static inline pte_t pte_mkyoung(pte_t pt
return pte;
}
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ BUG_ON(!pte_present(pte));
+ pte_val(pte) |= _PAGE_SPECIAL;
+ return pte;
+}
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-10 7:59 ` Carsten Otte
@ 2008-01-10 20:01 ` Jared Hulbert
2008-01-11 8:45 ` Carsten Otte
2008-01-10 20:23 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Jared Hulbert
1 sibling, 1 reply; 79+ messages in thread
From: Jared Hulbert @ 2008-01-10 20:01 UTC (permalink / raw)
To: carsteno
Cc: Nick Piggin, Linux Memory Management List, Martin Schwidefsky,
Heiko Carstens
> I think you're looking for
> pfn_has_struct_page_entry_for_it(), and that's different from the
> original meaning described above.
Yes. That's what I'm looking for.
Carsten,
I think I get the problem now. You've been saying over and over, I
just didn't hear it. We are not using the same assumptions for what
VM_MIXEDMAP means.
Look's like today most architectures just use pfn_valid() to see if a
pfn is in a valid RAM segment. The assumption used in
vm_normal_page() is that valid_RAM == has_page_struct. That's fine by
me for VM_MIXEDMAP because I'm only assuming 2 states a page can be
in: (1) page struct RAM (2) pfn only Flash memory ioremap()'ed in.
You are wanting to add a third: (3) valid RAM, pfn only mapping with
the ability to add a page struct when needed.
Is this right?
> Jared, did you try this on arm?
No. I'm not sure where we stand. Shall I bother or do I wait for the
next patch?
> Did it work for you with my proposed
> callback implementation?
I'm sure I can make a callback work kind of like I proposed above.
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-10 7:59 ` Carsten Otte
2008-01-10 20:01 ` Jared Hulbert
@ 2008-01-10 20:23 ` Jared Hulbert
2008-01-11 8:32 ` Carsten Otte
1 sibling, 1 reply; 79+ messages in thread
From: Jared Hulbert @ 2008-01-10 20:23 UTC (permalink / raw)
To: carsteno
Cc: Nick Piggin, Linux Memory Management List, Martin Schwidefsky,
Heiko Carstens
> In fact, I consider pfn_valid() broken on arm if it returns
> false for a pfn that is perfectly valid for use in a pfnmap/mixedmap
> mapping.
Remember, my interest in creating VM_MIXEDMAP is in mapping Flash into
these pfnmap/mixedmap regions. I don't think it's fair to let
pfn_valid() work for Flash pages, at least for now, because there are
many things you can't do with them that you can do with RAM.
* Re: [rfc][patch] mm: use a pte bit to flag normal pages
2008-01-10 13:33 ` Carsten Otte
@ 2008-01-10 23:18 ` Nick Piggin
0 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2008-01-10 23:18 UTC (permalink / raw)
To: Carsten Otte
Cc: Martin Schwidefsky, carsteno, Heiko Carstens, Jared Hulbert,
Linux Memory Management List, linux-arch
On Thu, Jan 10, 2008 at 02:33:27PM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> > We initially wanted to do the whole vm_normal_page thing this way, with another
> > pte bit, but we thought there were one or two archs with no spare bits. BTW. I
> > also need this bit in order to implement my lockless get_user_pages, so I do hope
> > to get it in. I'd like to know what architectures cannot spare a software bit in
> > their pte_present ptes...
> I've been playing with the original PAGE_SPECIAL patch a little bit, and
> you can find the corresponding s390 definition below that you might want
> to add to your patch queue.
> It is a little unclear to me how you'd like to proceed from here:
> - with PTE_SPECIAL, do we still have VM_MIXEDMAP or similar flag to
> distinguish our new type of mapping from VM_PFNMAP? Which vma flags are
> we supposed to use for xip mappings?
We should not need anything in the VMA, because the vm can get all the
required information from the pte. However, we still need to keep the
MIXEMAP and PFNMAP stuff around for architectures that don't provide a
pte_special.
> - does VM_PFNMAP work as before, or do you intend to replace it?
PFNMAP can be replaced with pte_special as well. They are all schemes
used to exempt a pte from having its struct page refcounted... if we
use a bit per pte, then we need nothing else.
> - what about vm_normal_page? Do you intend to have one per arch? The one
> proposed by this patch breaks Jared's pfn_valid() thing and VM_PFNMAP
> for archs that don't have PAGE_SPECIAL as far as I can tell.
I think just have 2 in the core code. Switched by ifdef. I'll work on a
more polished patch for that.
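Roughly along these lines (a sketch of the direction, with assumed
details, not the polished patch):

#ifdef __HAVE_ARCH_PTE_SPECIAL
struct page *vm_normal_page(struct vm_area_struct *vma,
                            unsigned long addr, pte_t pte)
{
        /* a single software pte bit says whether to refcount */
        if (pte_special(pte))
                return NULL;
        return pfn_to_page(pte_pfn(pte));
}
#else
/* keep today's VM_PFNMAP/VM_MIXEDMAP vma-flag based vm_normal_page() */
#endif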
>
> ---
> Index: linux-2.6/include/asm-s390/pgtable.h
> ===================================================================
> --- linux-2.6.orig/include/asm-s390/pgtable.h
> +++ linux-2.6/include/asm-s390/pgtable.h
> @@ -228,6 +228,7 @@ extern unsigned long vmalloc_end;
> /* Software bits in the page table entry */
> #define _PAGE_SWT 0x001 /* SW pte type bit t */
> #define _PAGE_SWX 0x002 /* SW pte type bit x */
> +#define _PAGE_SPECIAL 0x004 /* SW associated with special page */
>
> /* Six different types of pages. */
> #define _PAGE_TYPE_EMPTY 0x400
> @@ -504,6 +505,12 @@ static inline int pte_file(pte_t pte)
> return (pte_val(pte) & mask) == _PAGE_TYPE_FILE;
> }
>
> +static inline int pte_special(pte_t pte)
> +{
> + BUG_ON(!pte_present(pte));
> + return (pte_val(pte) & _PAGE_SPECIAL);
> +}
> +
> #define __HAVE_ARCH_PTE_SAME
> #define pte_same(a,b) (pte_val(a) == pte_val(b))
>
> @@ -654,6 +661,13 @@ static inline pte_t pte_mkyoung(pte_t pt
> return pte;
> }
>
> +static inline pte_t pte_mkspecial(pte_t pte)
> +{
> + BUG_ON(!pte_present(pte));
> + pte_val(pte) |= _PAGE_SPECIAL;
> + return pte;
> +}
> +
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
>
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-10 20:23 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Jared Hulbert
@ 2008-01-11 8:32 ` Carsten Otte
0 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-11 8:32 UTC (permalink / raw)
To: Jared Hulbert
Cc: carsteno, Nick Piggin, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Jared Hulbert wrote:
>> In fact, I consider pfn_valid() broken on arm if it returns
>> false for a pfn that is perfectly valid for use in a pfnmap/mixedmap
>> mapping.
>
> Remember, my interest in creating VM_MIXEDMAP is in mapping Flash into
> these pfnmap/mixedmap regions. I don't think it's fair to let
> pfn_valid() work for Flash pages, at least for now, because there are
> many things you can't do with them that you can do with RAM.
You've got a point there. Our memory segments don't differ from
regular RAM too much, other than Flash. I think I have to withdraw my
statement.
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-10 20:01 ` Jared Hulbert
@ 2008-01-11 8:45 ` Carsten Otte
2008-01-13 2:44 ` Nick Piggin
0 siblings, 1 reply; 79+ messages in thread
From: Carsten Otte @ 2008-01-11 8:45 UTC (permalink / raw)
To: Jared Hulbert
Cc: carsteno, Nick Piggin, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Jared Hulbert wrote:
>> I think you're looking for
>> pfn_has_struct_page_entry_for_it(), and that's different from the
>> original meaning described above.
>
> Yes. That's what I'm looking for.
>
> Carsten,
>
> I think I get the problem now. You've been saying over and over, I
> just didn't hear it. We are not using the same assumptions for what
> VM_MIXEDMAP means.
>
> Looks like today most architectures just use pfn_valid() to see if a
> pfn is in a valid RAM segment. The assumption used in
> vm_normal_page() is that valid_RAM == has_page_struct. That's fine by
> me for VM_MIXEDMAP because I'm only assuming 2 states a page can be
> in: (1) page struct RAM (2) pfn only Flash memory ioremap()'ed in.
> You are wanting to add a third: (3) valid RAM, pfn only mapping with
> the ability to add a page struct when needed.
>
> Is this right?
About right. There are a few differences between "valid ram" and our
DCSS segments, but yes. Our segments are not present at system
startup, and can be "loaded" afterwards by hypercall. Thus, they're
not detected and initialized as regular memory.
We have the option to add struct page entries for them. In case of
using the segment for xip, we don't want struct page entries and
rather prefer VM_MIXEDMAP, but with regular memory (with struct page)
being used after cow.
The segments can either be exclusive for one Linux image, or shared
between multiple. And they can be read-only or read+write. A memory
store to a read-only segment would fail. For xip, we either use
"shared, read-only" or "exclusive, read+write". I think in your
categories we're like
(3) valid RAM that may be read-only, pfn only mapping, no struct page
>> Jared, did you try this on arm?
>
> No. I'm not sure where we stand. Shall I bother or do I wait for the
> next patch?
I guess we should wait for Nick's patch. He has already decided not to
go down this path.
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-11 8:45 ` Carsten Otte
@ 2008-01-13 2:44 ` Nick Piggin
2008-01-14 11:36 ` Carsten Otte
2008-01-15 13:05 ` Carsten Otte
0 siblings, 2 replies; 79+ messages in thread
From: Nick Piggin @ 2008-01-13 2:44 UTC (permalink / raw)
To: carsteno
Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky,
Heiko Carstens
On Fri, Jan 11, 2008 at 09:45:27AM +0100, Carsten Otte wrote:
> Jared Hulbert wrote:
> >>I think you're looking for
> >>pfn_has_struct_page_entry_for_it(), and that's different from the
> >>original meaning described above.
> >
> >Yes. That's what I'm looking for.
> >
> >Carsten,
> >
> >I think I get the problem now. You've been saying over and over, I
> >just didn't hear it. We are not using the same assumptions for what
> >VM_MIXEDMAP means.
> >
> >Looks like today most architectures just use pfn_valid() to see if a
> >pfn is in a valid RAM segment. The assumption used in
> >vm_normal_page() is that valid_RAM == has_page_struct. That's fine by
> >me for VM_MIXEDMAP because I'm only assuming 2 states a page can be
> >in: (1) page struct RAM (2) pfn only Flash memory ioremap()'ed in.
> >You are wanting to add a third: (3) valid RAM, pfn only mapping with
> >the ability to add a page struct when needed.
> >
> >Is this right?
> About right. There are a few differences between "valid ram" and our
> DCSS segments, but yes. Our segments are not present at system
> startup, and can be "loaded" afterwards by hypercall. Thus, they're
> not detected and initialized as regular memory.
> We have the option to add struct page entries for them. In case of
> using the segment for xip, we don't want struct page entries and
> rather prefer VM_MIXEDMAP, but with regular memory (with struct page)
> being used after cow.
You know that pfn_valid() can be changed at runtime depending on what
your intentions are for that page. It can remain false if you don't
want struct pages for it, then you can switch a flag...
> >>Jared, did you try this on arm?
> >
> >No. I'm not sure where we stand. Shall I bother or do I wait for the
> >next patch?
> I guess we should wait for Nick's patch. He has already decided not to
> go down this path.
I've just been looking at putting everything together (including the
pte_special patch). I still hit one problem with your required modification
to the filemap_xip patch.
You need to unconditionally do a vm_insert_pfn in xip_file_fault, and rely
on the pte bit to tell the rest of the VM that the page has not been
refcounted. For architectures without such a bit, this breaks VM_MIXEDMAP,
because it relies on testing pfn_valid() rather than a pte bit here.
We can go 2 ways here: either s390 can make pfn_valid() work like we'd
like; or we can have a vm_insert_mixedmap_pfn(), which has
#ifdef __HAVE_ARCH_PTE_SPECIAL
in order to do the right thing (ie. those architectures which do have pte
special can just do vm_insert_pfn, and those that don't will either do a
vm_insert_pfn or vm_insert_page depending on the result of pfn_valid).
The latter I guess is more efficient for those that do implement pte_special,
however if anything I would rather investigate that as an incremental patch
after the basics are working. It would also break the dependency of the
xip stuff on the pte_special patch, and basically make everything much
more likely to get merged IMO.
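Something like the following sketch of the second option (only the name
vm_insert_mixedmap_pfn is from this mail; the body is assumed):

int vm_insert_mixedmap_pfn(struct vm_area_struct *vma,
                           unsigned long addr, unsigned long pfn)
{
        BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
#ifdef __HAVE_ARCH_PTE_SPECIAL
        /* the pte bit marks the entry as not refcounted */
        return vm_insert_pfn(vma, addr, pfn);
#else
        /* no spare pte bit: decide by pfn_valid() at insert time */
        if (pfn_valid(pfn))
                return vm_insert_page(vma, addr, pfn_to_page(pfn));
        return vm_insert_pfn(vma, addr, pfn);
#endif
}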
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-13 2:44 ` Nick Piggin
@ 2008-01-14 11:36 ` Carsten Otte
2008-01-16 4:04 ` Nick Piggin
2008-01-15 13:05 ` Carsten Otte
1 sibling, 1 reply; 79+ messages in thread
From: Carsten Otte @ 2008-01-14 11:36 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Nick Piggin wrote:
> You know that pfn_valid() can be changed at runtime depending on what
> your intentions are for that page. It can remain false if you don't
> want struct pages for it, then you can switch a flag...
We wouldn't need to switch at runtime: it is sufficient to make that
decision when the segment gets attached.
> I've just been looking at putting everything together (including the
> pte_special patch).
Yippieh. I am going to try it out next :-).
> I still hit one problem with your required modification
> to the filemap_xip patch.
>
> You need to unconditionally do a vm_insert_pfn in xip_file_fault, and rely
> on the pte bit to tell the rest of the VM that the page has not been
> refcounted. For architectures without such a bit, this breaks VM_MIXEDMAP,
> because it relies on testing pfn_valid() rather than a pte bit here.
> We can go 2 ways here: either s390 can make pfn_valid() work like we'd
> like; or we can have a vm_insert_mixedmap_pfn(), which has
> #ifdef __HAVE_ARCH_PTE_SPECIAL
> in order to do the right thing (ie. those architectures which do have pte
> special can just do vm_insert_pfn, and those that don't will either do a
> vm_insert_pfn or vm_insert_page depending on the result of pfn_valid).
Of those two choices, I'd clearly favor vm_insert_mixedmap_pfn(). But
we can #ifdef __HAVE_ARCH_PTE_SPECIAL in vm_insert_pfn() too, can't
we? We can safely set the bit for both VM_MIXEDMAP and VM_PFNMAP. Did
I miss something?
> The latter I guess is more efficient for those that do implement pte_special,
> however if anything I would rather investigate that as an incremental patch
> after the basics are working. It would also break the dependency of the
> xip stuff on the pte_special patch, and basically make everything much
> more likely to get merged IMO.
I'll talk to Martin and see what he thinks. I really hate doing a list
walk in pfn_valid(); it just doesn't feel right.
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-13 2:44 ` Nick Piggin
2008-01-14 11:36 ` Carsten Otte
@ 2008-01-15 13:05 ` Carsten Otte
2008-01-16 4:22 ` Nick Piggin
1 sibling, 1 reply; 79+ messages in thread
From: Carsten Otte @ 2008-01-15 13:05 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
Am Sonntag, den 13.01.2008, 03:44 +0100 schrieb Nick Piggin:
> I've just been looking at putting everything together (including the
> pte_special patch). I still hit one problem with your required modification
> to the filemap_xip patch.
>
> You need to unconditionally do a vm_insert_pfn in xip_file_fault, and rely
> on the pte bit to tell the rest of the VM that the page has not been
> refcounted. For architectures without such a bit, this breaks VM_MIXEDMAP,
> because it relies on testing pfn_valid() rather than a pte bit here.
> We can go 2 ways here: either s390 can make pfn_valid() work like we'd
> like; or we can have a vm_insert_mixedmap_pfn(), which has
> #ifdef __HAVE_ARCH_PTE_SPECIAL
> in order to do the right thing (ie. those architectures which do have pte
> special can just do vm_insert_pfn, and those that don't will either do a
> vm_insert_pfn or vm_insert_page depending on the result of pfn_valid).
>
> The latter I guess is more efficient for those that do implement pte_special,
> however if anything I would rather investigate that as an incremental patch
> after the basics are working. It would also break the dependency of the
> xip stuff on the pte_special patch, and basically make everything much
> more likely to get merged IMO.
The change in the semantics of pfn_valid() for VM_MIXEDMAP keeps coming up,
and I keep saying it's a bad idea. To see how it really looks,
I've done the patch (at the end of this mail) to make pfn_valid() walk
the list of dcss segments. I ran into a few issues:
a) it doesn't work, because we need to grab a mutex in atomic context
This sanity check in vm_normal_page uses pfn_valid() in the fast path:
/*
* Add some anal sanity checks for now. Eventually, we should just do
* "return pfn_to_page(pfn)", but in the meantime we check thaclarificationt we get
* a valid pfn, and that the resulting page looks ok.
*/
if (unlikely(!pfn_valid(pfn))) {
print_bad_pte(vma, pte, addr);
return NULL;
}
And that is evaluated in context of get_user_pages() where we may not
grab our list mutex. The result looks like this:
<3>BUG: sleeping function called from invalid context at
kernel/mutex.c:87
<4>in_atomic():1, irqs_disabled():0
<4>Call Trace:
<4>([<0000000000103556>] show_trace+0x12e/0x148)
<4> [<00000000001208da>] __might_sleep+0x10a/0x118
<4> [<0000000000409024>] mutex_lock+0x30/0x6c
<4> [<0000000000102158>] pfn_in_shared_memory+0x38/0xcc
<4> [<000000000017f1be>] vm_normal_page+0xa2/0x140
<4> [<000000000017fc9e>] follow_page+0x1da/0x274
<4> [<0000000000182030>] get_user_pages+0x144/0x488
<4> [<00000000001a2926>] get_arg_page+0x5a/0xc4
<4> [<00000000001a2c60>] copy_strings+0x164/0x274
<4> [<00000000001a2dcc>] copy_strings_kernel+0x5c/0xb0
<4> [<00000000001a47a8>] do_execve+0x194/0x214
<4> [<0000000000110262>] kernel_execve+0x28/0x70
<4> [<0000000000100112>] init_post+0x72/0x114
<4> [<000000000064e3f0>] kernel_init+0x288/0x398
<4> [<0000000000107366>] kernel_thread_starter+0x6/0xc
<4> [<0000000000107360>] kernel_thread_starter+0x0/0xc
The list protection could be changed to a spinlock to make this work.
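E.g. (sketch only; assumes the mem_segs walk is guarded by a spinlock
instead of vmem_mutex):

static DEFINE_SPINLOCK(vmem_lock);

int pfn_in_shared_memory(unsigned long pfn)
{
        struct memory_segment *tmp;
        int rc = 0;

        spin_lock(&vmem_lock);
        list_for_each_entry(tmp, &mem_segs, list) {
                if (pfn << PAGE_SHIFT >= tmp->start &&
                    pfn << PAGE_SHIFT < tmp->start + tmp->size) {
                        rc = 1;
                        break;
                }
        }
        spin_unlock(&vmem_lock);
        return rc;
}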
b) it is a big performance penalty in the fast path
Because pfn_valid() is checked on regular minor faults even
without VM_MIXEDMAP, we'd take a lock and walk a potentially long
list on a critical path.
c) the patch looks ugly
Primitives like pfn_valid() should be a small check or a bit of inline
assembly. The need to call back into high-level kernel code from core-vm
looks wrong to me. Read the patch, and I think you'll come to the same
conclusion.
d) timing
pfn_valid() is evaluated before our dcss list is initialized. We could
circumvent this by adding an extra check like "if the list was not
initialized, and we have memory behind the pfn, we assume that the pfn is
valid without reading the list", but that would make this thing even
more ugly.
I've talked this over with Martin, and we concluded that:
- the semantics of pfn_valid() are unclear and need to be clarified
- using pfn_valid() to tell which pages have struct page backing is
not an option for s390. We'd rather keep our struct page entries,
which we'd love to get rid of, than accept this ugly hack.
Thus, I think we have a dependency on pte_special as a prerequisite to
VM_PFNMAP for xip.
---
Index: linux-2.6/arch/s390/mm/vmem.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/vmem.c
+++ linux-2.6/arch/s390/mm/vmem.c
@@ -339,6 +339,27 @@ out:
return ret;
}
+int pfn_in_shared_memory(unsigned long pfn)
+{
+ int rc;
+ struct memory_segment *tmp;
+
+ mutex_lock(&vmem_mutex);
+
+ list_for_each_entry(tmp, &mem_segs, list) {
+ if ((pfn << PAGE_SHIFT >= tmp->start) &&
+ (pfn << PAGE_SHIFT <= tmp->start + tmp->size - 1)) {
+ rc = 1;
+ goto out;
+ }
+ }
+ rc = 0;
+out:
+ mutex_unlock(&vmem_mutex);
+ return rc;
+}
+
+
/*
* map whole physical memory to virtual memory (identity mapping)
*/
Index: linux-2.6/include/asm-s390/page.h
===================================================================
--- linux-2.6.orig/include/asm-s390/page.h
+++ linux-2.6/include/asm-s390/page.h
@@ -135,7 +135,11 @@ page_get_storage_key(unsigned long addr)
extern unsigned long max_pfn;
-static inline int pfn_valid(unsigned long pfn)
+extern int add_shared_memory(unsigned long start, unsigned long size);
+extern int remove_shared_memory(unsigned long start, unsigned long size);
+extern int pfn_in_shared_memory(unsigned long pfn);
+
+static inline int __pfn_in_kmap(unsigned long pfn)
{
unsigned long dummy;
int ccode;
@@ -153,6 +157,13 @@ static inline int pfn_valid(unsigned lon
return !ccode;
}
+static inline int pfn_valid(unsigned long pfn)
+{
+ if (__pfn_in_kmap(pfn) && !pfn_in_shared_memory(pfn))
+ return 1;
+ return 0;
+}
+
#endif /* !__ASSEMBLY__ */
/* to align the pointer to the (next) page boundary */
@@ -164,7 +175,7 @@ static inline int pfn_valid(unsigned lon
#define __va(x) (void *)(unsigned long)(x)
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_addr_valid(kaddr) __pfn_in_kmap(__pa(kaddr) >> PAGE_SHIFT)
#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
Index: linux-2.6/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-s390/pgtable.h
+++ linux-2.6/include/asm-s390/pgtable.h
@@ -957,9 +957,6 @@ static inline pte_t mk_swap_pte(unsigned
#define kern_addr_valid(addr) (1)
-extern int add_shared_memory(unsigned long start, unsigned long size);
-extern int remove_shared_memory(unsigned long start, unsigned long size);
-
/*
* No page table caches to initialise
*/
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-14 11:36 ` Carsten Otte
@ 2008-01-16 4:04 ` Nick Piggin
0 siblings, 0 replies; 79+ messages in thread
From: Nick Piggin @ 2008-01-16 4:04 UTC (permalink / raw)
To: carsteno
Cc: Jared Hulbert, Linux Memory Management List, Martin Schwidefsky,
Heiko Carstens
On Mon, Jan 14, 2008 at 12:36:34PM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >You know that pfn_valid() can be changed at runtime depending on what
> >your intentions are for that page. It can remain false if you don't
> >want struct pages for it, then you can switch a flag...
> We wouldn't need to switch at runtime: it is sufficient to make that
> decision when the segment gets attached.
>
> >I've just been looking at putting everything together (including the
> >pte_special patch).
> Yippieh. I am going to try it out next :-).
Just had a few things come up so I've been unable to finish this off
earlier, but I'll have a look again tonight if I manage to make some
progress on one more bug I'm working on.
> >I still hit one problem with your required modification
> >to the filemap_xip patch.
> >
> >You need to unconditionally do a vm_insert_pfn in xip_file_fault, and rely
> >on the pte bit to tell the rest of the VM that the page has not been
> >refcounted. For architectures without such a bit, this breaks VM_MIXEDMAP,
> >because it relies on testing pfn_valid() rather than a pte bit here.
> >We can go 2 ways here: either s390 can make pfn_valid() work like we'd
> >like; or we can have a vm_insert_mixedmap_pfn(), which has
> >#ifdef __HAVE_ARCH_PTE_SPECIAL
> >in order to do the right thing (ie. those architectures which do have pte
> >special can just do vm_insert_pfn, and those that don't will either do a
> >vm_insert_pfn or vm_insert_page depending on the result of pfn_valid).
> Of those two choices, I'd clearly favor vm_insert_mixedmap_pfn(). But
> we can #ifdef __HAVE_ARCH_PTE_SPECIAL in vm_insert_pfn() too, can't
> we? We can safely set the bit for both VM_MIXEDMAP and VM_PFNMAP. Did
> I miss something?
I guess we could, but we have some good debug checking in vm_insert_pfn,
and I'd like to make it clear that using it will not result in the
struct page being touched.
vm_insert_mixedmap_pfn (maybe we could rename it to vm_insert_mixed)
would be quite a specialised thing which is easier to audit if it is
on its own IMO.
But you're right in that there is nothing technical preventing it.
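For concreteness, the !pte_special fallback I have in mind is just
something like this (untested sketch; insert_page/insert_pfn are the
internal helpers, __HAVE_PTE_SPECIAL a 0/1 constant derived from
__HAVE_ARCH_PTE_SPECIAL, and the final naming is still open):

	int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
			unsigned long pfn)
	{
		BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));

		/*
		 * Without a pte_special bit, vm_normal_page has to rely on
		 * pfn_valid(), so anything that does have a struct page must
		 * be refcounted via insert_page; everything else goes in as
		 * a raw pfn via insert_pfn.
		 */
		if (!__HAVE_PTE_SPECIAL && pfn_valid(pfn))
			return insert_page(vma, addr, pfn_to_page(pfn),
						vma->vm_page_prot);
		return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
	}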
Anyway, I'll post some patches soon.
* Re: [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages
2008-01-15 13:05 ` Carsten Otte
@ 2008-01-16 4:22 ` Nick Piggin
2008-01-16 14:29 ` [rft] updated xip patch rollup Nick Piggin
0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2008-01-16 4:22 UTC (permalink / raw)
To: Carsten Otte
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
On Tue, Jan 15, 2008 at 02:05:50PM +0100, Carsten Otte wrote:
> Am Sonntag, den 13.01.2008, 03:44 +0100 schrieb Nick Piggin:
> > I've just been looking at putting everything together (including the
> > pte_special patch). I still hit one problem with your required modification
> > to the filemap_xip patch.
> >
> > You need to unconditionally do a vm_insert_pfn in xip_file_fault, and rely
> > on the pte bit to tell the rest of the VM that the page has not been
> > refcounted. For architectures without such a bit, this breaks VM_MIXEDMAP,
> > because it relies on testing pfn_valid() rather than a pte bit here.
> > We can go 2 ways here: either s390 can make pfn_valid() work like we'd
> > like; or we can have a vm_insert_mixedmap_pfn(), which has
> > #ifdef __HAVE_ARCH_PTE_SPECIAL
> > in order to do the right thing (ie. those architectures which do have pte
> > special can just do vm_insert_pfn, and those that don't will either do a
> > vm_insert_pfn or vm_insert_page depending on the result of pfn_valid).
> >
> > The latter I guess is more efficient for those that do implement pte_special,
> > however if anything I would rather investigate that as an incremental patch
> > after the basics are working. It would also break the dependency of the
> > xip stuff on the pte_special patch, and basically make everything much
> > more likely to get merged IMO.
> The change in semantics of pfn_valid() for VM_MIXEDMAP keeps coming up,
> and I keep saying it's a bad idea. To see how it would really look,
> I've done the patch (at the end of this mail) to make pfn_valid() walk
> the list of dcss segments. I ran into a few issues:
OK, thanks. That's very thorough of you ;)
> a) it doesn't work because we need to grab a mutex in atomic context
> This sanity check in vm_normal_page uses pfn_valid() in the fast path:
> /*
> * Add some anal sanity checks for now. Eventually, we should just do
> * "return pfn_to_page(pfn)", but in the meantime we check thaclarificationt we get
> * a valid pfn, and that the resulting page looks ok.
> */
> if (unlikely(!pfn_valid(pfn))) {
> print_bad_pte(vma, pte, addr);
> return NULL;
> }
> And that is evaluated in context of get_user_pages() where we may not
> grab our list mutex. The result looks like this:
> <3>BUG: sleeping function called from invalid context at
> kernel/mutex.c:87
> <4>in_atomic():1, irqs_disabled():0
> <4>Call Trace:
> <4>([<0000000000103556>] show_trace+0x12e/0x148)
> <4> [<00000000001208da>] __might_sleep+0x10a/0x118
> <4> [<0000000000409024>] mutex_lock+0x30/0x6c
> <4> [<0000000000102158>] pfn_in_shared_memory+0x38/0xcc
> <4> [<000000000017f1be>] vm_normal_page+0xa2/0x140
> <4> [<000000000017fc9e>] follow_page+0x1da/0x274
> <4> [<0000000000182030>] get_user_pages+0x144/0x488
> <4> [<00000000001a2926>] get_arg_page+0x5a/0xc4
> <4> [<00000000001a2c60>] copy_strings+0x164/0x274
> <4> [<00000000001a2dcc>] copy_strings_kernel+0x5c/0xb0
> <4> [<00000000001a47a8>] do_execve+0x194/0x214
> <4> [<0000000000110262>] kernel_execve+0x28/0x70
> <4> [<0000000000100112>] init_post+0x72/0x114
> <4> [<000000000064e3f0>] kernel_init+0x288/0x398
> <4> [<0000000000107366>] kernel_thread_starter+0x6/0xc
> <4> [<0000000000107360>] kernel_thread_starter+0x0/0xc
> The list protection could be changed to a spinlock to make this work.
>
> b) it is a big performance penalty in the fast path
> Because pfn_valid() is checked on regular minor faults even without
> VM_MIXEDMAP, we'd take a lock and walk a potentially long list on a
> critical path.
We could put that under CONFIG_DEBUG_VM; with sparsemem, it is relatively
expensive too, and I have seen it cost nearly 5% kernel time on x86 in
fork/exec/exit stuff...
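Concretely, the check in vm_normal_page would become:

	#ifdef CONFIG_DEBUG_VM
		/* Check that we get a valid pfn. */
		if (unlikely(!pfn_valid(pfn))) {
			print_bad_pte(vma, pte, addr);
			return NULL;
		}
	#endif

(which is what the rollup later in this thread ends up doing).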
> c) the patch looks ugly
> Primitives like pfn_valid() should be a small check or a small inline
> assembly. The need to call back to high level kernel code from core-vm
> looks wrong to me. Read the patch, and I think you'll come to the same
> conclusion.
Well, in the sparsemem model, it needs to call in and evaluate
whether the segment is valid. Granted, it is an array lookup
rather than a list walk AFAIKS, but the idea is the same (query
the "memory model" to ask whether the pfn is valid).
> d) timing
> pfn_valid() is evaluated before our dcss list has been initialized. We
> could circumvent this by adding an extra check like "if the list is not
> initialized yet, and there is memory behind the pfn, assume the pfn is
> valid without reading the list", but that would make this thing even
> more ugly.
Hmm, this is interesting. Do you know how eg. SPARSEMEM architectures
can get away with this? I'd say it would be possible to init your
structures before pfn_valid is called, but I'm not too familiar with
memory models or the setup code...
> I've talked this over with Martin, and we concluded that:
> - the semantics of pfn_valid() are unclear and need to be clarified
Most of the core code seems to use it just in debug checking. Some
parts of mm/page_alloc.c use it in a way that definitely looks like
it means "is there a valid struct page" (and that's also what
implementations do, eg. flatmem can easily have pfn_valid even if
there is no real memory). However most of those callers are setup
things, which s390 maybe avoids.
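(For comparison, a flatmem pfn_valid() is typically just

	#define pfn_valid(pfn)		((pfn) < max_mapnr)

which says nothing about whether there is real memory behind the pfn.)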
> - using pfn_valid() to tell which pages have a struct page backing is
> not an option for s390. We'd rather prefer to keep our struct page
> entries that we'd love to get rid of over this ugly hack.
>
> Thus, I think we have a dependency on pte_special as a prerequisite to
> VM_PFNMAP for xip.
Although it could be *possible* to implement pfn_valid as such, I
agree that we should allow the option of using pte_special. I think it
is quite reasonable to want to have a runtime-dynamic data structure
of memory regions like s390, and I don't think VM_MIXEDMAP is such a
slowpath that we can just say "it's fine to take a global lock and
search a long list for each fault". Eg. because if you have your
distro running out of there, then every exec()/exit()/etc is going to
do this hundreds of times.
So I'm convinced. And thanks for spending the time to help me with
that ;)
* [rft] updated xip patch rollup
2008-01-16 4:22 ` Nick Piggin
@ 2008-01-16 14:29 ` Nick Piggin
2008-01-17 10:24 ` Carsten Otte
0 siblings, 1 reply; 79+ messages in thread
From: Nick Piggin @ 2008-01-16 14:29 UTC (permalink / raw)
To: Carsten Otte
Cc: carsteno, Jared Hulbert, Linux Memory Management List,
Martin Schwidefsky, Heiko Carstens
On Wed, Jan 16, 2008 at 05:22:06AM +0100, Nick Piggin wrote:
>
> Although it could be *possible* to implement pfn_valid as such, I
> agree that we should allow the option of using pte_special. I think it
> is quite reasonable to want to have a runtime-dynamic data structure
> of memory regions like s390, and I don't think VM_MIXEDMAP is such a
> slowpath that we can just say "it's fine to take a global lock and
> search a long list for each fault". Eg. because if you have your
> distro running out of there, then every exec()/exit()/etc is going to
> do this hundreds of times.
>
> So I'm convinced. And thanks for spending the time to help me with
> that ;)
Hi guys,
I'm just lumping all these patches together, sorry... I just want to
get something out that can be tested, and it is time for bed so I didn't
get around to making a proper patchset. It's against mainline.
Nothing major changed since you've last seen it. It is a rollup of
everything, with vm_normal_page and vm_insert_mixed etc. stuff in
mm/memory.c, and the vm_insert_mixed caller in mm/filemap_xip.c having
all the real changes.
I've tested it with XIP on brd on x86, both with and without pte_special.
This covers many (but not all) cases of refcounting.
Anyway, here it is... assuming no problems, I'll work on making the
patchset. I'm still hoping we can convince Linus to like it ;)
Is this all still looking OK for you, Jared?
Thanks,
Nick
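For anyone wanting to try this from their own driver: the intended usage
is to set VM_MIXEDMAP at mmap time and insert pfns from the ->fault
handler with vm_insert_mixed, much as the filemap_xip changes below do.
An illustrative sketch only -- the my_dev_* names are made up:

	static int my_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		unsigned long pfn;
		int err;

		/* my_dev_pgoff_to_pfn() is a hypothetical helper; it may
		 * return a pfn with or without a struct page behind it. */
		pfn = my_dev_pgoff_to_pfn(vmf->pgoff);

		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
		if (err == -ENOMEM)
			return VM_FAULT_OOM;
		BUG_ON(err);
		return VM_FAULT_NOPAGE;
	}

	static struct vm_operations_struct my_dev_vm_ops = {
		.fault	= my_dev_fault,
	};

	static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
	{
		vma->vm_ops = &my_dev_vm_ops;
		vma->vm_flags |= VM_MIXEDMAP;
		return 0;
	}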
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -361,55 +361,97 @@ static inline int is_cow_mapping(unsigne
}
/*
- * This function gets the "struct page" associated with a pte.
+ * vm_normal_page -- This function gets the "struct page" associated with a pte.
*
- * NOTE! Some mappings do not have "struct pages". A raw PFN mapping
- * will have each page table entry just pointing to a raw page frame
- * number, and as far as the VM layer is concerned, those do not have
- * pages associated with them - even if the PFN might point to memory
- * that otherwise is perfectly fine and has a "struct page".
- *
- * The way we recognize those mappings is through the rules set up
- * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
- * and the vm_pgoff will point to the first PFN mapped: thus every
- * page that is a raw mapping will always honor the rule
+ * "Special" mappings do not wish to be associated with a "struct page" (either
+ * it doesn't exist, or it exists but they don't want to touch it). In this
+ * case, NULL is returned here. "Normal" mappings do have a struct page.
+ *
+ * There are 2 broad cases. Firstly, an architecture may define a pte_special()
+ * pte bit, in which case this function is trivial. Secondly, an architecture
+ * may not have a spare pte bit, which requires a more complicated scheme,
+ * described below.
+ *
+ * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
+ * special mapping (even if there are underlying and valid "struct pages").
+ * COWed pages of a VM_PFNMAP are always normal.
+ *
+ * The way we recognize COWed pages within VM_PFNMAP mappings is through the
+ * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
+ * set, and the vm_pgoff will point to the first PFN mapped: thus every special
+ * mapping will always honor the rule
*
* pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
*
- * and if that isn't true, the page has been COW'ed (in which case it
- * _does_ have a "struct page" associated with it even if it is in a
- * VM_PFNMAP range).
+ * And for normal mappings this is false.
+ *
+ * This restricts such mappings to be a linear translation from virtual address
+ * to pfn. To get around this restriction, we allow arbitrary mappings so long
+ * as the vma is not a COW mapping; in that case, we know that all ptes are
+ * special (because none can have been COWed).
+ *
+ *
+ * In order to support COW of arbitrary special mappings, we have VM_MIXEDMAP.
+ *
+ * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
+ * page" backing, however the difference is that _all_ pages with a struct
+ * page (that is, those where pfn_valid is true) are refcounted and considered
+ * normal pages by the VM. The disadvantage is that pages are refcounted
+ * (which can be slower and simply not an option for some PFNMAP users). The
+ * advantage is that we don't have to follow the strict linearity rule of
+ * PFNMAP mappings in order to support COWable mappings.
+ *
*/
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+# define __HAVE_PTE_SPECIAL 1
+#else
+# define __HAVE_PTE_SPECIAL 0
+#endif
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- unsigned long pfn = pte_pfn(pte);
+ unsigned long pfn;
+
+ if (__HAVE_PTE_SPECIAL) {
+ if (likely(!pte_special(pte))) {
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ return pte_page(pte);
+ }
+ VM_BUG_ON(!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
+ return NULL;
+ }
+
+ /* !__HAVE_PTE_SPECIAL case follows: */
+
+ pfn = pte_pfn(pte);
- if (unlikely(vma->vm_flags & VM_PFNMAP)) {
- unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
+ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+ if (vma->vm_flags & VM_MIXEDMAP) {
+ if (!pfn_valid(pfn))
+ return NULL;
+ goto out;
+ } else {
+ unsigned long off = (addr-vma->vm_start) >> PAGE_SHIFT;
+ if (pfn == vma->vm_pgoff + off)
+ return NULL;
+ if (!is_cow_mapping(vma->vm_flags))
+ return NULL;
+ }
}
- /*
- * Add some anal sanity checks for now. Eventually,
- * we should just do "return pfn_to_page(pfn)", but
- * in the meantime we check that we get a valid pfn,
- * and that the resulting page looks ok.
- */
+#ifdef CONFIG_DEBUG_VM
+ /* Check that we get a valid pfn. */
if (unlikely(!pfn_valid(pfn))) {
print_bad_pte(vma, pte, addr);
return NULL;
}
+#endif
/*
- * NOTE! We still have PageReserved() pages in the page
- * tables.
+ * NOTE! We still have PageReserved() pages in the page tables.
*
- * The PAGE_ZERO() pages and various VDSO mappings can
- * cause them to exist.
+ * eg. VDSO mappings can cause them to exist.
*/
+out:
return pfn_to_page(pfn);
}
@@ -1127,8 +1169,9 @@ pte_t * fastcall get_locked_pte(struct m
* old drivers should use this, and they needed to mark their
* pages reserved for the old functions anyway.
*/
-static int insert_page(struct mm_struct *mm, unsigned long addr, struct page *page, pgprot_t prot)
+static int insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot)
{
+ struct mm_struct *mm = vma->vm_mm;
int retval;
pte_t *pte;
spinlock_t *ptl;
@@ -1187,33 +1230,17 @@ int vm_insert_page(struct vm_area_struct
if (!page_count(page))
return -EINVAL;
vma->vm_flags |= VM_INSERTPAGE;
- return insert_page(vma->vm_mm, addr, page, vma->vm_page_prot);
+ return insert_page(vma, addr, page, vma->vm_page_prot);
}
EXPORT_SYMBOL(vm_insert_page);
-/**
- * vm_insert_pfn - insert single pfn into user vma
- * @vma: user vma to map to
- * @addr: target user address of this page
- * @pfn: source kernel pfn
- *
- * Similar to vm_inert_page, this allows drivers to insert individual pages
- * they've allocated into a user vma. Same comments apply.
- *
- * This function should only be called from a vm_ops->fault handler, and
- * in that case the handler should return NULL.
- */
-int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn)
+static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t prot)
{
struct mm_struct *mm = vma->vm_mm;
int retval;
pte_t *pte, entry;
spinlock_t *ptl;
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
-
retval = -ENOMEM;
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
@@ -1223,19 +1250,74 @@ int vm_insert_pfn(struct vm_area_struct
goto out_unlock;
/* Ok, finally just insert the thing.. */
- entry = pfn_pte(pfn, vma->vm_page_prot);
+ entry = pte_mkspecial(pfn_pte(pfn, prot));
set_pte_at(mm, addr, pte, entry);
- update_mmu_cache(vma, addr, entry);
+ update_mmu_cache(vma, addr, entry); /* XXX: why not for insert_page? */
retval = 0;
out_unlock:
pte_unmap_unlock(pte, ptl);
-
out:
return retval;
}
+
+/**
+ * vm_insert_pfn - insert single pfn into user vma
+ * @vma: user vma to map to
+ * @addr: target user address of this page
+ * @pfn: source kernel pfn
+ *
+ * Similar to vm_insert_page, this allows drivers to insert individual pages
+ * they've allocated into a user vma. Same comments apply.
+ *
+ * This function should only be called from a vm_ops->fault handler, and
+ * in that case the handler should return NULL.
+ */
+int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn)
+{
+ /*
+ * Technically, architectures with pte_special can avoid all these
+ * restrictions (same for remap_pfn_range). However we would like
+ * consistency in testing and feature parity among all, so we should
+ * try to keep these invariants in place for everybody.
+ */
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
+ (VM_PFNMAP|VM_MIXEDMAP));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+ BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return -EFAULT;
+ return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+}
EXPORT_SYMBOL(vm_insert_pfn);
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn)
+{
+ BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return -EFAULT;
+
+ /*
+ * If we don't have pte special, then we have to use the pfn_valid()
+ * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
+ * refcount the page if pfn_valid is true (hence insert_page rather
+ * than insert_pfn).
+ */
+ if (!__HAVE_PTE_SPECIAL && pfn_valid(pfn)) {
+ struct page *page;
+
+ page = pfn_to_page(pfn);
+ return insert_page(vma, addr, page, vma->vm_page_prot);
+ }
+ return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+}
+EXPORT_SYMBOL(vm_insert_mixed);
+
/*
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results
@@ -1254,7 +1336,7 @@ static int remap_pte_range(struct mm_str
arch_enter_lazy_mmu_mode();
do {
BUG_ON(!pte_none(*pte));
- set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
+ set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -2386,10 +2468,13 @@ static noinline int do_no_pfn(struct mm_
unsigned long pfn;
pte_unmap(page_table);
- BUG_ON(!(vma->vm_flags & VM_PFNMAP));
- BUG_ON(is_cow_mapping(vma->vm_flags));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
+
+ BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+
if (unlikely(pfn == NOPFN_OOM))
return VM_FAULT_OOM;
else if (unlikely(pfn == NOPFN_SIGBUS))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -106,6 +106,7 @@ extern unsigned int kobjsize(const void
#define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */
#define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */
+#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -698,7 +699,8 @@ struct zap_details {
unsigned long truncate_count; /* Compare vm_truncate_count */
};
-struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
+struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte);
+
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
unsigned long unmap_vmas(struct mmu_gather **tlb,
@@ -1095,6 +1097,8 @@ int remap_pfn_range(struct vm_area_struc
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn);
struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
Index: linux-2.6/include/asm-um/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-um/pgtable.h
+++ linux-2.6/include/asm-um/pgtable.h
@@ -21,6 +21,8 @@
#define _PAGE_USER 0x040
#define _PAGE_ACCESSED 0x080
#define _PAGE_DIRTY 0x100
+#define _PAGE_SPECIAL 0x200
+#define __HAVE_ARCH_PTE_SPECIAL
/* If _PAGE_PRESENT is clear, we use these: */
#define _PAGE_FILE 0x008 /* nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_PROTNONE 0x010 /* if the user mapped it with PROT_NONE;
@@ -220,6 +222,11 @@ static inline int pte_newprot(pte_t pte)
return(pte_present(pte) && (pte_get_bits(pte, _PAGE_NEWPROT)));
}
+static inline int pte_special(pte_t pte)
+{
+ return pte_get_bits(pte, _PAGE_SPECIAL);
+}
+
/*
* =================================
* Flags setting section.
@@ -288,6 +295,12 @@ static inline pte_t pte_mknewpage(pte_t
return(pte);
}
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ pte_set_bits(pte, _PAGE_SPECIAL);
+ return(pte);
+}
+
static inline void set_pte(pte_t *pteptr, pte_t pteval)
{
pte_copy(*pteptr, pteval);
Index: linux-2.6/include/asm-x86/pgtable_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_32.h
+++ linux-2.6/include/asm-x86/pgtable_32.h
@@ -102,6 +102,7 @@ void paging_init(void);
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
#define _PAGE_BIT_NX 63
+#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
#define _PAGE_PRESENT 0x001
#define _PAGE_RW 0x002
@@ -115,6 +116,8 @@ void paging_init(void);
#define _PAGE_UNUSED1 0x200 /* available for programmer */
#define _PAGE_UNUSED2 0x400
#define _PAGE_UNUSED3 0x800
+#define _PAGE_SPECIAL _PAGE_UNUSED1
+#define __HAVE_ARCH_PTE_SPECIAL
/* If _PAGE_PRESENT is clear, we use these: */
#define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */
@@ -219,6 +222,7 @@ static inline int pte_dirty(pte_t pte)
static inline int pte_young(pte_t pte) { return (pte).pte_low & _PAGE_ACCESSED; }
static inline int pte_write(pte_t pte) { return (pte).pte_low & _PAGE_RW; }
static inline int pte_huge(pte_t pte) { return (pte).pte_low & _PAGE_PSE; }
+static inline int pte_special(pte_t pte) { return (pte).pte_low & _PAGE_SPECIAL; }
/*
* The following only works if pte_present() is not true.
@@ -232,6 +236,7 @@ static inline pte_t pte_mkdirty(pte_t pt
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { (pte).pte_low |= _PAGE_SPECIAL; return pte; }
#ifdef CONFIG_X86_PAE
# include <asm/pgtable-3level.h>
Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h
+++ linux-2.6/include/asm-x86/pgtable_64.h
@@ -151,6 +151,7 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
+#define _PAGE_BIT_SPECIAL 9
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
#define _PAGE_PRESENT 0x001
@@ -163,6 +164,8 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_PSE 0x080 /* 2MB page */
#define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry */
+#define _PAGE_SPECIAL 0x200
+#define __HAVE_ARCH_PTE_SPECIAL
#define _PAGE_PROTNONE 0x080 /* If not present */
#define _PAGE_NX (_AC(1,UL)<<_PAGE_BIT_NX)
@@ -272,6 +275,7 @@ static inline int pte_young(pte_t pte)
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_PSE; }
+static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
static inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; }
@@ -282,6 +286,7 @@ static inline pte_t pte_mkyoung(pte_t pt
static inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; }
static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_SPECIAL)); return pte; }
struct vm_area_struct;
Index: linux-2.6/include/asm-alpha/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/pgtable.h
+++ linux-2.6/include/asm-alpha/pgtable.h
@@ -268,6 +268,7 @@ extern inline int pte_write(pte_t pte)
extern inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
extern inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+extern inline int pte_special(pte_t pte) { return 0; }
extern inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_FOW; return pte; }
extern inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~(__DIRTY_BITS); return pte; }
@@ -275,6 +276,7 @@ extern inline pte_t pte_mkold(pte_t pte)
extern inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &= ~_PAGE_FOW; return pte; }
extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; }
extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; }
+extern inline pte_t pte_mkspecial(pte_t pte) { return pte; }
#define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address))
Index: linux-2.6/include/asm-avr32/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-avr32/pgtable.h
+++ linux-2.6/include/asm-avr32/pgtable.h
@@ -211,6 +211,10 @@ static inline int pte_young(pte_t pte)
{
return pte_val(pte) & _PAGE_ACCESSED;
}
+static inline int pte_special(pte_t pte)
+{
+ return 0;
+}
/*
* The following only work if pte_present() is not true.
@@ -251,6 +255,10 @@ static inline pte_t pte_mkyoung(pte_t pt
set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED));
return pte;
}
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ return pte;
+}
#define pmd_none(x) (!pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
Index: linux-2.6/include/asm-cris/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-cris/pgtable.h
+++ linux-2.6/include/asm-cris/pgtable.h
@@ -115,6 +115,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_MODIFIED; }
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte)
{
@@ -162,6 +163,7 @@ static inline pte_t pte_mkyoung(pte_t pt
}
return pte;
}
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
/*
* Conversion functions: convert a page and protection to a page entry,
Index: linux-2.6/include/asm-frv/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-frv/pgtable.h
+++ linux-2.6/include/asm-frv/pgtable.h
@@ -380,6 +380,7 @@ static inline pmd_t *pmd_offset(pud_t *d
static inline int pte_dirty(pte_t pte) { return (pte).pte & _PAGE_DIRTY; }
static inline int pte_young(pte_t pte) { return (pte).pte & _PAGE_ACCESSED; }
static inline int pte_write(pte_t pte) { return !((pte).pte & _PAGE_WP); }
+static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_mkclean(pte_t pte) { (pte).pte &= ~_PAGE_DIRTY; return pte; }
static inline pte_t pte_mkold(pte_t pte) { (pte).pte &= ~_PAGE_ACCESSED; return pte; }
@@ -387,6 +388,7 @@ static inline pte_t pte_wrprotect(pte_t
static inline pte_t pte_mkdirty(pte_t pte) { (pte).pte |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte &= ~_PAGE_WP; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
Index: linux-2.6/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/pgtable.h
+++ linux-2.6/include/asm-ia64/pgtable.h
@@ -302,6 +302,8 @@ ia64_phys_addr_valid (unsigned long addr
#define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0)
#define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0)
#define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0)
+#define pte_special(pte) 0
+
/*
* Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the
* access rights:
@@ -313,6 +315,7 @@ ia64_phys_addr_valid (unsigned long addr
#define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
#define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D))
#define pte_mkhuge(pte) (__pte(pte_val(pte)))
+#define pte_mkspecial(pte) (pte)
/*
* Because ia64's Icache and Dcache is not coherent (on a cpu), we need to
Index: linux-2.6/include/asm-m32r/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-m32r/pgtable.h
+++ linux-2.6/include/asm-m32r/pgtable.h
@@ -214,6 +214,11 @@ static inline int pte_file(pte_t pte)
return pte_val(pte) & _PAGE_FILE;
}
+static inline int pte_special(pte_t pte)
+{
+ return 0;
+}
+
static inline pte_t pte_mkclean(pte_t pte)
{
pte_val(pte) &= ~_PAGE_DIRTY;
@@ -250,6 +255,11 @@ static inline pte_t pte_mkwrite(pte_t pt
return pte;
}
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ return pte;
+}
+
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep);
Index: linux-2.6/include/asm-m68k/motorola_pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-m68k/motorola_pgtable.h
+++ linux-2.6/include/asm-m68k/motorola_pgtable.h
@@ -168,6 +168,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_RONLY; return pte; }
static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
@@ -185,6 +186,7 @@ static inline pte_t pte_mkcache(pte_t pt
pte_val(pte) = (pte_val(pte) & _CACHEMASK040) | m68k_supervisor_cachemode;
return pte;
}
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
#define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address))
Index: linux-2.6/include/asm-m68k/sun3_pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-m68k/sun3_pgtable.h
+++ linux-2.6/include/asm-m68k/sun3_pgtable.h
@@ -169,6 +169,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & SUN3_PAGE_MODIFIED; }
static inline int pte_young(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESSED; }
+static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; }
static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_MODIFIED; return pte; }
@@ -181,6 +182,7 @@ static inline pte_t pte_mknocache(pte_t
//static inline pte_t pte_mkcache(pte_t pte) { pte_val(pte) &= SUN3_PAGE_NOCACHE; return pte; }
// until then, use:
static inline pte_t pte_mkcache(pte_t pte) { return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
extern pgd_t kernel_pg_dir[PTRS_PER_PGD];
Index: linux-2.6/include/asm-mips/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-mips/pgtable.h
+++ linux-2.6/include/asm-mips/pgtable.h
@@ -285,6 +285,8 @@ static inline pte_t pte_mkyoung(pte_t pt
return pte;
}
#endif
+static inline int pte_special(pte_t pte) { return 0; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
/*
* Macro to make mark a page protection value as "uncacheable". Note
Index: linux-2.6/include/asm-parisc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/pgtable.h
+++ linux-2.6/include/asm-parisc/pgtable.h
@@ -331,6 +331,7 @@ static inline int pte_dirty(pte_t pte)
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_WRITE; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~_PAGE_ACCESSED; return pte; }
@@ -338,6 +339,7 @@ static inline pte_t pte_wrprotect(pte_t
static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |= _PAGE_WRITE; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
/*
* Conversion functions: convert a page and protection to a page entry,
Index: linux-2.6/include/asm-powerpc/pgtable-ppc32.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-ppc32.h
+++ linux-2.6/include/asm-powerpc/pgtable-ppc32.h
@@ -514,6 +514,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -531,6 +532,8 @@ static inline pte_t pte_mkdirty(pte_t pt
pte_val(pte) |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) {
pte_val(pte) |= _PAGE_ACCESSED; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) {
+ return pte; }
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
Index: linux-2.6/include/asm-powerpc/pgtable-ppc64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-ppc64.h
+++ linux-2.6/include/asm-powerpc/pgtable-ppc64.h
@@ -239,6 +239,7 @@ static inline int pte_write(pte_t pte) {
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+static inline int pte_special(pte_t pte) { return 0; }
static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -257,6 +258,8 @@ static inline pte_t pte_mkyoung(pte_t pt
pte_val(pte) |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkhuge(pte_t pte) {
return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) {
+ return pte; }
/* Atomic PTE updates */
static inline unsigned long pte_update(struct mm_struct *mm,
Index: linux-2.6/include/asm-ppc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-ppc/pgtable.h
+++ linux-2.6/include/asm-ppc/pgtable.h
@@ -537,6 +537,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -554,6 +555,8 @@ static inline pte_t pte_mkdirty(pte_t pt
pte_val(pte) |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) {
pte_val(pte) |= _PAGE_ACCESSED; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) {
+ return pte; }
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
Index: linux-2.6/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-s390/pgtable.h
+++ linux-2.6/include/asm-s390/pgtable.h
@@ -228,6 +228,8 @@ extern unsigned long vmalloc_end;
/* Software bits in the page table entry */
#define _PAGE_SWT 0x001 /* SW pte type bit t */
#define _PAGE_SWX 0x002 /* SW pte type bit x */
+#define _PAGE_SPECIAL 0x004 /* SW associated with special page */
+#define __HAVE_ARCH_PTE_SPECIAL
/* Six different types of pages. */
#define _PAGE_TYPE_EMPTY 0x400
@@ -504,6 +506,11 @@ static inline int pte_file(pte_t pte)
return (pte_val(pte) & mask) == _PAGE_TYPE_FILE;
}
+static inline int pte_special(pte_t pte)
+{
+ return (pte_val(pte) & _PAGE_SPECIAL);
+}
+
#define __HAVE_ARCH_PTE_SAME
#define pte_same(a,b) (pte_val(a) == pte_val(b))
@@ -654,6 +661,12 @@ static inline pte_t pte_mkyoung(pte_t pt
return pte;
}
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ pte_val(pte) |= _PAGE_SPECIAL;
+ return pte;
+}
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
Index: linux-2.6/include/asm-sh64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sh64/pgtable.h
+++ linux-2.6/include/asm-sh64/pgtable.h
@@ -419,6 +419,7 @@ static inline int pte_dirty(pte_t pte){
static inline int pte_young(pte_t pte){ return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline int pte_write(pte_t pte){ return pte_val(pte) & _PAGE_WRITE; }
+static inline int pte_special(pte_t pte) { return 0; }
static inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_WRITE)); return pte; }
static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
@@ -427,6 +428,7 @@ static inline pte_t pte_mkwrite(pte_t pt
static inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
static inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_SZHUGE)); return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
/*
Index: linux-2.6/include/asm-sparc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sparc/pgtable.h
+++ linux-2.6/include/asm-sparc/pgtable.h
@@ -219,6 +219,11 @@ static inline int pte_file(pte_t pte)
return pte_val(pte) & BTFIXUP_HALF(pte_filei);
}
+static inline int pte_special(pte_t pte)
+{
+ return 0;
+}
+
/*
*/
BTFIXUPDEF_HALF(pte_wrprotecti)
@@ -251,6 +256,8 @@ BTFIXUPDEF_CALL_CONST(pte_t, pte_mkyoung
#define pte_mkdirty(pte) BTFIXUP_CALL(pte_mkdirty)(pte)
#define pte_mkyoung(pte) BTFIXUP_CALL(pte_mkyoung)(pte)
+#define pte_mkspecial(pte) (pte)
+
#define pfn_pte(pfn, prot) mk_pte(pfn_to_page(pfn), prot)
BTFIXUPDEF_CALL(unsigned long, pte_pfn, pte_t)
Index: linux-2.6/include/asm-sparc64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/pgtable.h
+++ linux-2.6/include/asm-sparc64/pgtable.h
@@ -506,6 +506,11 @@ static inline pte_t pte_mkyoung(pte_t pt
return __pte(pte_val(pte) | mask);
}
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ return pte;
+}
+
static inline unsigned long pte_young(pte_t pte)
{
unsigned long mask;
@@ -608,6 +613,11 @@ static inline unsigned long pte_present(
return val;
}
+static inline int pte_special(pte_t pte)
+{
+ return 0;
+}
+
#define pmd_set(pmdp, ptep) \
(pmd_val(*(pmdp)) = (__pa((unsigned long) (ptep)) >> 11UL))
#define pud_set(pudp, pmdp) \
Index: linux-2.6/include/asm-xtensa/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-xtensa/pgtable.h
+++ linux-2.6/include/asm-xtensa/pgtable.h
@@ -212,6 +212,8 @@ static inline int pte_write(pte_t pte) {
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
+static inline int pte_special(pte_t pte) { return 0; }
+
static inline pte_t pte_wrprotect(pte_t pte)
{ pte_val(pte) &= ~(_PAGE_WRITABLE | _PAGE_HW_WRITE); return pte; }
static inline pte_t pte_mkclean(pte_t pte)
@@ -224,6 +226,8 @@ static inline pte_t pte_mkyoung(pte_t pt
{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte)
{ pte_val(pte) |= _PAGE_WRITABLE; return pte; }
+static inline pte_t pte_mkspecial(pte_t pte)
+ { return pte; }
/*
* Conversion functions: convert a page and protection to a page entry,
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -844,8 +844,7 @@ static int ext2_fill_super(struct super_
blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
- if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
- (sb->s_blocksize != blocksize))) {
+ if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
if (!silent)
printk("XIP: Unsupported blocksize\n");
goto failed_mount;
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -800,7 +800,7 @@ const struct address_space_operations ex
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_page = ext2_get_xip_page,
+ .get_xip_address = ext2_get_xip_address,
};
const struct address_space_operations ext2_nobh_aops = {
Index: linux-2.6/fs/ext2/xip.c
===================================================================
--- linux-2.6.orig/fs/ext2/xip.c
+++ linux-2.6/fs/ext2/xip.c
@@ -15,24 +15,25 @@
#include "xip.h"
static inline int
-__inode_direct_access(struct inode *inode, sector_t sector,
- unsigned long *data)
+__inode_direct_access(struct inode *inode, sector_t block, unsigned long *data)
{
+ sector_t sector;
BUG_ON(!inode->i_sb->s_bdev->bd_disk->fops->direct_access);
+
+ sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
return inode->i_sb->s_bdev->bd_disk->fops
- ->direct_access(inode->i_sb->s_bdev,sector,data);
+ ->direct_access(inode->i_sb->s_bdev, sector, data);
}
static inline int
-__ext2_get_sector(struct inode *inode, sector_t offset, int create,
+__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
sector_t *result)
{
struct buffer_head tmp;
int rc;
memset(&tmp, 0, sizeof(struct buffer_head));
- rc = ext2_get_block(inode, offset/ (PAGE_SIZE/512), &tmp,
- create);
+ rc = ext2_get_block(inode, pgoff, &tmp, create);
*result = tmp.b_blocknr;
/* did we get a sparse block (hole in the file)? */
@@ -45,13 +46,12 @@ __ext2_get_sector(struct inode *inode, s
}
int
-ext2_clear_xip_target(struct inode *inode, int block)
+ext2_clear_xip_target(struct inode *inode, sector_t block)
{
- sector_t sector = block * (PAGE_SIZE/512);
unsigned long data;
int rc;
- rc = __inode_direct_access(inode, sector, &data);
+ rc = __inode_direct_access(inode, block, &data);
if (!rc)
clear_page((void*)data);
return rc;
@@ -69,24 +69,24 @@ void ext2_xip_verify_sb(struct super_blo
}
}
-struct page *
-ext2_get_xip_page(struct address_space *mapping, sector_t offset,
- int create)
+void *
+ext2_get_xip_address(struct address_space *mapping, pgoff_t pgoff, int create)
{
int rc;
unsigned long data;
- sector_t sector;
+ sector_t block;
/* first, retrieve the sector number */
- rc = __ext2_get_sector(mapping->host, offset, create, &sector);
+ rc = __ext2_get_block(mapping->host, pgoff, create, &block);
if (rc)
goto error;
/* retrieve address of the target data */
- rc = __inode_direct_access
- (mapping->host, sector * (PAGE_SIZE/512), &data);
- if (!rc)
- return virt_to_page(data);
+ rc = __inode_direct_access(mapping->host, block, &data);
+ if (rc)
+ goto error;
+
+ return (void *)data;
error:
return ERR_PTR(rc);
Index: linux-2.6/fs/ext2/xip.h
===================================================================
--- linux-2.6.orig/fs/ext2/xip.h
+++ linux-2.6/fs/ext2/xip.h
@@ -7,19 +7,19 @@
#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, int);
+extern int ext2_clear_xip_target (struct inode *, sector_t);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-struct page* ext2_get_xip_page (struct address_space *, sector_t, int);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_page)
+void *ext2_get_xip_address(struct address_space *, pgoff_t, int);
+#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_address)
#else
#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
-#define ext2_get_xip_page NULL
+#define ext2_get_xip_address NULL
#endif
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c
+++ linux-2.6/fs/open.c
@@ -778,7 +778,7 @@ static struct file *__dentry_open(struct
if (f->f_flags & O_DIRECT) {
if (!f->f_mapping->a_ops ||
((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_page))) {
+ (!f->f_mapping->a_ops->get_xip_address))) {
fput(f);
f = ERR_PTR(-EINVAL);
}
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -473,8 +473,7 @@ struct address_space_operations {
int (*releasepage) (struct page *, gfp_t);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- struct page* (*get_xip_page)(struct address_space *, sector_t,
- int);
+ void * (*get_xip_address)(struct address_space *, pgoff_t, int);
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct address_space *,
struct page *, struct page *);
Index: linux-2.6/mm/fadvise.c
===================================================================
--- linux-2.6.orig/mm/fadvise.c
+++ linux-2.6/mm/fadvise.c
@@ -49,7 +49,7 @@ asmlinkage long sys_fadvise64_64(int fd,
goto out;
}
- if (mapping->a_ops->get_xip_page)
+ if (mapping->a_ops->get_xip_address)
/* no bad return value, but ignore advice */
goto out;
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/sched.h>
#include <asm/tlbflush.h>
+#include <asm/io.h>
/*
* We do use our own empty page to avoid interference with other users
@@ -42,36 +43,39 @@ static struct page *xip_sparse_page(void
/*
* This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_page() function for the actual low-level
+ * the mapping->a_ops->get_xip_address() function for the actual low-level
* stuff.
*
* Note the struct file* is not used at all. It may be NULL.
*/
-static void
+static ssize_t
do_xip_mapping_read(struct address_space *mapping,
struct file_ra_state *_ra,
struct file *filp,
- loff_t *ppos,
- read_descriptor_t *desc,
- read_actor_t actor)
+ char __user *buf,
+ size_t len,
+ loff_t *ppos)
{
struct inode *inode = mapping->host;
unsigned long index, end_index, offset;
- loff_t isize;
+ loff_t isize, pos;
+ size_t copied = 0, error = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
- index = *ppos >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;
+ pos = *ppos;
+ index = pos >> PAGE_CACHE_SHIFT;
+ offset = pos & ~PAGE_CACHE_MASK;
isize = i_size_read(inode);
if (!isize)
goto out;
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- for (;;) {
- struct page *page;
- unsigned long nr, ret;
+ do {
+ unsigned long nr, left;
+ void *xip_mem;
+ int zero = 0;
/* nr is the maximum number of bytes to copy from this page */
nr = PAGE_CACHE_SIZE;
@@ -84,17 +88,20 @@ do_xip_mapping_read(struct address_space
}
}
nr = nr - offset;
+ if (nr > len)
+ nr = len;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
- goto no_xip_page;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA) {
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem) {
+ error = -EIO;
+ goto out;
+ }
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA) {
/* sparse */
- page = ZERO_PAGE(0);
+ zero = 1;
} else {
- desc->error = PTR_ERR(page);
+ error = PTR_ERR(xip_mem);
goto out;
}
}
@@ -104,10 +111,10 @@ do_xip_mapping_read(struct address_space
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ /* address based flush */ ;
/*
- * Ok, we have the page, so now we can copy it to user space...
+ * Ok, we have the mem, so now we can copy it to user space...
*
* The actor routine returns how many bytes were actually used..
* NOTE! This may not be the same as how much of a user buffer
@@ -115,47 +122,38 @@ do_xip_mapping_read(struct address_space
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
- offset += ret;
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
+ if (!zero)
+ left = __copy_to_user(buf+copied, xip_mem+offset, nr);
+ else
+ left = __clear_user(buf + copied, nr);
- if (ret == nr && desc->count)
- continue;
- goto out;
+ if (left) {
+ error = -EFAULT;
+ goto out;
+ }
-no_xip_page:
- /* Did not get the page. Report it */
- desc->error = -EIO;
- goto out;
- }
+ copied += (nr - left);
+ offset += (nr - left);
+ index += offset >> PAGE_CACHE_SHIFT;
+ offset &= ~PAGE_CACHE_MASK;
+ } while (copied < len);
out:
- *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+ *ppos = pos + copied;
if (filp)
file_accessed(filp);
+
+ return (copied ? copied : error);
}
ssize_t
xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
- read_descriptor_t desc;
-
if (!access_ok(VERIFY_WRITE, buf, len))
return -EFAULT;
- desc.written = 0;
- desc.arg.buf = buf;
- desc.count = len;
- desc.error = 0;
-
- do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- ppos, &desc, file_read_actor);
-
- if (desc.written)
- return desc.written;
- else
- return desc.error;
+ return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
+ buf, len, ppos);
}
EXPORT_SYMBOL_GPL(xip_file_read);
@@ -210,13 +208,14 @@ __xip_unmap (struct address_space * mapp
*
* This function is derived from filemap_fault, but used for execute in place
*/
-static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf)
+static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- struct file *file = area->vm_file;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
- struct page *page;
pgoff_t size;
+ void *xip_mem;
+ struct page *page;
/* XXX: are VM_FAULT_ codes OK? */
@@ -224,35 +223,43 @@ static int xip_file_fault(struct vm_area
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 0);
- if (!IS_ERR(page))
- goto out;
- if (PTR_ERR(page) != -ENODATA)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, vmf->pgoff, 0);
+ if (!IS_ERR(xip_mem))
+ goto found;
+ if (PTR_ERR(xip_mem) != -ENODATA)
return VM_FAULT_OOM;
/* sparse block */
- if ((area->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (area->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
+ if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
+ (vma->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
(!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
+ unsigned long pfn;
+ int err;
+
/* maybe shared writable, allocate new block */
- page = mapping->a_ops->get_xip_page(mapping,
- vmf->pgoff*(PAGE_SIZE/512), 1);
- if (IS_ERR(page))
+ xip_mem = mapping->a_ops->get_xip_address(mapping,vmf->pgoff,1);
+ if (IS_ERR(xip_mem))
return VM_FAULT_SIGBUS;
- /* unmap page at pgoff from all other vmas */
+ /* unmap sparse mappings at pgoff from all other vmas */
__xip_unmap(mapping, vmf->pgoff);
+
+found:
+ pfn = virt_to_phys(xip_mem) >> PAGE_SHIFT;
+ err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ BUG_ON(err);
+ return VM_FAULT_NOPAGE;
} else {
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
return VM_FAULT_OOM;
- }
-out:
- page_cache_get(page);
- vmf->page = page;
- return 0;
+ page_cache_get(page);
+ vmf->page = page;
+ return 0;
+ }
}
static struct vm_operations_struct xip_file_vm_ops = {
@@ -261,11 +268,11 @@ static struct vm_operations_struct xip_f
int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
{
- BUG_ON(!file->f_mapping->a_ops->get_xip_page);
+ BUG_ON(!file->f_mapping->a_ops->get_xip_address);
file_accessed(file);
vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_CAN_NONLINEAR;
+ vma->vm_flags |= VM_CAN_NONLINEAR | VM_MIXEDMAP;
return 0;
}
EXPORT_SYMBOL_GPL(xip_file_mmap);
@@ -278,17 +285,16 @@ __xip_file_write(struct file *filp, cons
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- struct page *page;
size_t bytes;
ssize_t written = 0;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
do {
unsigned long index;
unsigned long offset;
size_t copied;
- char *kaddr;
+ void *xip_mem;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
@@ -296,28 +302,22 @@ __xip_file_write(struct file *filp, cons
if (bytes > count)
bytes = count;
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (IS_ERR(page) && (PTR_ERR(page) == -ENODATA)) {
+ xip_mem = a_ops->get_xip_address(mapping, index, 0);
+ if (IS_ERR(xip_mem) && (PTR_ERR(xip_mem) == -ENODATA)) {
/* we allocate a new page unmap it */
- page = a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 1);
- if (!IS_ERR(page))
+ xip_mem = a_ops->get_xip_address(mapping, index, 1);
+ if (!IS_ERR(xip_mem))
/* unmap page at pgoff from all other vmas */
__xip_unmap(mapping, index);
}
- if (IS_ERR(page)) {
- status = PTR_ERR(page);
+ if (IS_ERR(xip_mem)) {
+ status = PTR_ERR(xip_mem);
break;
}
- fault_in_pages_readable(buf, bytes);
- kaddr = kmap_atomic(page, KM_USER0);
copied = bytes -
- __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- flush_dcache_page(page);
+ __copy_from_user_nocache(xip_mem + offset, buf, bytes);
if (likely(copied > 0)) {
status = copied;
@@ -397,7 +397,7 @@ EXPORT_SYMBOL_GPL(xip_file_write);
/*
* truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_page
+ * functionality is analogous to block_truncate_page but uses get_xip_address
* to get the page instead of page cache
*/
int
@@ -407,9 +407,9 @@ xip_truncate_page(struct address_space *
unsigned offset = from & (PAGE_CACHE_SIZE-1);
unsigned blocksize;
unsigned length;
- struct page *page;
+ void *xip_mem;
- BUG_ON(!mapping->a_ops->get_xip_page);
+ BUG_ON(!mapping->a_ops->get_xip_address);
blocksize = 1 << mapping->host->i_blkbits;
length = offset & (blocksize - 1);
@@ -420,18 +420,17 @@ xip_truncate_page(struct address_space *
length = blocksize - length;
- page = mapping->a_ops->get_xip_page(mapping,
- index*(PAGE_SIZE/512), 0);
- if (!page)
+ xip_mem = mapping->a_ops->get_xip_address(mapping, index, 0);
+ if (!xip_mem)
return -ENOMEM;
- if (unlikely(IS_ERR(page))) {
- if (PTR_ERR(page) == -ENODATA)
+ if (unlikely(IS_ERR(xip_mem))) {
+ if (PTR_ERR(xip_mem) == -ENODATA)
/* Hole? No need to truncate */
return 0;
else
- return PTR_ERR(page);
+ return PTR_ERR(xip_mem);
}
- zero_user_page(page, offset, length, KM_USER0);
+ memset(xip_mem + offset, 0, length);
return 0;
}
EXPORT_SYMBOL_GPL(xip_truncate_page);
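The interface change above is easy to miss in the churn: ->get_xip_page()
took a 512-byte sector index and returned a struct page, whereas
->get_xip_address() takes a plain page index and returns a kernel virtual
address, which is what lets backing stores without struct pages
participate. A minimal sketch of a provider, assuming a hypothetical
driver that keeps its whole medium permanently mapped (the ex_* names are
illustrative, not part of the patch):

#include <linux/fs.h>
#include <linux/err.h>

struct ex_device {		/* hypothetical backing store */
	void *ex_base;		/* permanent kernel mapping of the medium */
	pgoff_t ex_nr_pages;	/* medium size in pages */
};

/*
 * Illustrative ->get_xip_address() provider.  pgoff is in PAGE_SIZE
 * units (the old ->get_xip_page() took 512-byte sectors, hence the
 * pgoff*(PAGE_SIZE/512) scaling deleted above).  -ENODATA means
 * "hole"; callers that intend to write retry with create=1, at
 * which point a real implementation would allocate a block.
 */
static void *ex_get_xip_address(struct address_space *mapping,
				pgoff_t pgoff, int create)
{
	struct ex_device *ex = mapping->host->i_sb->s_fs_info;

	if (pgoff >= ex->ex_nr_pages)
		return ERR_PTR(-ENODATA);
	return ex->ex_base + (pgoff << PAGE_SHIFT);
}

Since xip_file_fault() derives the pfn with virt_to_phys(), the returned
address must lie in a linear kernel mapping of the medium.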
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -112,7 +112,7 @@ static long madvise_willneed(struct vm_a
if (!file)
return -EBADF;
- if (file->f_mapping->a_ops->get_xip_page) {
+ if (file->f_mapping->a_ops->get_xip_address) {
/* no bad return value, but ignore advice */
return 0;
}
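The madvise hunk is a mechanical rename, but the behaviour it preserves is
worth stating: readahead hints on an XIP mapping are accepted and ignored,
since there is no page cache to populate. An illustrative userspace view
(a sketch, not part of the patch):

#include <sys/mman.h>

/*
 * On a file from an xip-mounted filesystem this hint returns 0 but
 * the kernel ignores it: the data is already resident in the
 * backing memory, so there is nothing to read ahead.
 */
static int prefetch_hint(void *addr, size_t len)
{
	return madvise(addr, len, MADV_WILLNEED);
}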
Index: linux-2.6/arch/s390/mm/vmem.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/vmem.c
+++ linux-2.6/arch/s390/mm/vmem.c
@@ -310,8 +310,6 @@ out:
int add_shared_memory(unsigned long start, unsigned long size)
{
struct memory_segment *seg;
- struct page *page;
- unsigned long pfn, num_pfn, end_pfn;
int ret;
mutex_lock(&vmem_mutex);
@@ -326,24 +324,10 @@ int add_shared_memory(unsigned long star
if (ret)
goto out_free;
- ret = vmem_add_mem(start, size);
+ ret = vmem_add_range(start, size);
if (ret)
goto out_remove;
- pfn = PFN_DOWN(start);
- num_pfn = PFN_DOWN(size);
- end_pfn = pfn + num_pfn;
-
- page = pfn_to_page(pfn);
- memset(page, 0, num_pfn * sizeof(struct page));
-
- for (; pfn < end_pfn; pfn++) {
- page = pfn_to_page(pfn);
- init_page_count(page);
- reset_page_mapcount(page);
- SetPageReserved(page);
- INIT_LIST_HEAD(&page->lru);
- }
goto out;
out_remove:
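This removal is the s390 payoff of the series: vm_insert_mixed() only
takes page references for pfns that have memmap entries, so
add_shared_memory() no longer needs to fabricate and initialize struct
pages for a DCSS segment just to keep refcounting from crashing. A
simplified model of that rule (an illustration, not the kernel's actual
insert path):

#include <linux/types.h>
#include <linux/mmzone.h>

/*
 * Simplified model, not the real insert path: under VM_MIXEDMAP a
 * pte installed by vm_insert_mixed() is refcounted only when the
 * pfn has a memmap entry.  After this hunk a DCSS segment has no
 * memmap, so its ptes go in raw and the initialization loop removed
 * above would have been dead work.
 */
static bool mixedmap_pfn_is_refcounted(unsigned long pfn)
{
	return pfn_valid(pfn);
}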
* Re: [rft] updated xip patch rollup
2008-01-16 14:29 ` [rft] updated xip patch rollup Nick Piggin
@ 2008-01-17 10:24 ` Carsten Otte
0 siblings, 0 replies; 79+ messages in thread
From: Carsten Otte @ 2008-01-17 10:24 UTC (permalink / raw)
To: Nick Piggin
Cc: carsteno, Jared Hulbert, Linux Memory Management List, mschwid2,
heicars2
Nick Piggin wrote:
> I've tested it with XIP on brd on x86, both with and without pte_special.
> This covers many (but not all) cases of refcounting.
>
> Anyway, here it is... assuming no problems, I'll work on making the
> patchset. I'm still hoping we can convince Linus to like it ;)
Works for me. I have tested with dcssblk and ext2 -o xip on s390x. I
have booted a distro and built a kernel with gcc/glibc on an xip
file system. Thumbs up :-).
Thread overview: 79+ messages
2007-12-14 13:38 [rfc][patch 1/2] mm: introduce VM_MIXEDMAP mappings Nick Piggin
2007-12-14 13:41 ` [rfc][patch 2/2] xip: support non-struct page memory Nick Piggin
2007-12-14 13:46 ` Carsten Otte
2007-12-15 1:07 ` Jared Hulbert
2007-12-15 1:17 ` Nick Piggin
2007-12-15 6:47 ` Jared Hulbert
2007-12-19 14:04 ` Carsten Otte
2007-12-20 9:23 ` Jared Hulbert
2007-12-21 0:40 ` Nick Piggin
2007-12-20 13:53 ` Carsten Otte
2007-12-20 14:33 ` Carsten Otte
2007-12-20 14:50 ` Carsten Otte
2007-12-20 17:24 ` Jared Hulbert
2007-12-21 0:12 ` Jared Hulbert
2007-12-21 0:56 ` Nick Piggin
2007-12-21 9:56 ` Carsten Otte
2007-12-21 9:49 ` Carsten Otte
2007-12-21 0:50 ` Nick Piggin
2007-12-21 10:02 ` Carsten Otte
2007-12-21 10:14 ` Nick Piggin
2007-12-21 10:17 ` Carsten Otte
2007-12-21 10:23 ` Nick Piggin
2007-12-21 10:31 ` Carsten Otte
2007-12-21 0:45 ` Nick Piggin
2007-12-21 10:05 ` Carsten Otte
2007-12-21 10:20 ` Nick Piggin
2007-12-21 10:35 ` Carsten Otte
2007-12-21 10:47 ` Nick Piggin
2007-12-21 19:29 ` Martin Schwidefsky
2008-01-07 4:43 ` [rfc][patch] mm: use a pte bit to flag normal pages Nick Piggin
2008-01-07 10:30 ` Russell King
2008-01-07 11:14 ` Nick Piggin
2008-01-07 18:49 ` Jared Hulbert
2008-01-07 19:45 ` Russell King
2008-01-07 22:52 ` Jared Hulbert
2008-01-08 2:37 ` Andi Kleen
2008-01-08 2:49 ` Nick Piggin
2008-01-08 3:31 ` Andi Kleen
2008-01-08 3:52 ` Nick Piggin
2008-01-08 10:11 ` Catalin Marinas
2008-01-08 10:52 ` Russell King
2008-01-08 13:54 ` Catalin Marinas
2008-01-08 14:08 ` Russell King
2008-01-10 13:33 ` Carsten Otte
2008-01-10 23:18 ` Nick Piggin
2008-01-08 9:35 ` [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend Carsten Otte
2008-01-08 10:08 ` Nick Piggin
2008-01-08 11:34 ` Carsten Otte
2008-01-08 11:55 ` Nick Piggin
2008-01-08 12:03 ` Carsten Otte
2008-01-08 13:56 ` Jörn Engel
2008-01-08 14:51 ` Carsten Otte
2008-01-08 18:09 ` Jared Hulbert
2008-01-08 22:12 ` Nick Piggin
2008-01-09 15:14 ` [rfc][patch 0/4] VM_MIXEDMAP patchset with s390 backend v2 Carsten Otte
[not found] ` <1199891032.28689.9.camel@cotte.boeblingen.de.ibm.com>
2008-01-09 15:14 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Carsten Otte, Carsten Otte
2008-01-09 17:31 ` Martin Schwidefsky
2008-01-09 18:17 ` Jared Hulbert
2008-01-10 7:59 ` Carsten Otte
2008-01-10 20:01 ` Jared Hulbert
2008-01-11 8:45 ` Carsten Otte
2008-01-13 2:44 ` Nick Piggin
2008-01-14 11:36 ` Carsten Otte
2008-01-16 4:04 ` Nick Piggin
2008-01-15 13:05 ` Carsten Otte
2008-01-16 4:22 ` Nick Piggin
2008-01-16 14:29 ` [rft] updated xip patch rollup Nick Piggin
2008-01-17 10:24 ` Carsten Otte
2008-01-10 20:23 ` [rfc][patch 1/4] include: add callbacks to toggle reference counting for VM_MIXEDMAP pages Jared Hulbert
2008-01-11 8:32 ` Carsten Otte
2008-01-10 0:20 ` Nick Piggin
2008-01-10 8:06 ` Carsten Otte
2008-01-09 15:14 ` [rfc][patch 2/4] mm: introduce VM_MIXEDMAP Carsten Otte, Jared Hulbert, Carsten Otte
2008-01-09 15:14 ` [rfc][patch 3/4] Convert XIP to support non-struct page backed memory Carsten Otte, Nick Piggin
2008-01-09 15:14 ` [rfc][patch 4/4] s390: remove struct page entries for DCSS memory segments Carsten Otte, Carsten Otte
[not found] ` <1199784196.25114.11.camel@cotte.boeblingen.de.ibm.com>
2008-01-08 9:35 ` [rfc][patch 1/4] mm: introduce VM_MIXEDMAP Carsten Otte, Jared Hulbert, Carsten Otte
2008-01-08 9:35 ` [rfc][patch 2/4] xip: support non-struct page memory Carsten Otte, Nick Piggin, Carsten Otte
2008-01-08 9:36 ` [rfc][patch 3/4] s390: remove sturct page entries for z/VM DCSS memory segments Carsten Otte
2008-01-08 9:36 ` [rfc][patch 4/4] s390: mixedmap_refcount_pfn implementation using list walk Carsten Otte