Embedded Linux development

* [PATCH v3 5/5] cramfs: rehabilitate it
From: Nicolas Pitre @ 2017-08-31  3:09 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-fsdevel, linux-embedded, linux-kernel, Chris Brandt
In-Reply-To: <20170831030932.26979-1-nicolas.pitre@linaro.org>

Update documentation, pointer to latest tools, appoint myself as
maintainer. Given it's been unloved for so long, I don't expect anyone
will protest.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
---
 Documentation/filesystems/cramfs.txt | 42 ++++++++++++++++++++++++++++++++++++
 MAINTAINERS                          |  4 ++--
 fs/cramfs/Kconfig                    |  9 +++++---
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt
index 4006298f67..8875d306bc 100644
--- a/Documentation/filesystems/cramfs.txt
+++ b/Documentation/filesystems/cramfs.txt
@@ -45,6 +45,48 @@ you can just change the #define in mkcramfs.c, so long as you don't
 mind the filesystem becoming unreadable to future kernels.
 
 
+Memory Mapped cramfs image
+--------------------------
+
+The CRAMFS_PHYSMEM Kconfig option adds support for loading data directly
+from a physical linear memory range (usually non volatile memory like Flash)
+to cramfs instead of going through the block device layer. This saves some
+memory since no intermediate buffering is necessary to hold the data before
+decompressing.
+
+And when data blocks are kept uncompressed and properly aligned, they will
+automatically be mapped directly into user space whenever possible providing
+eXecute-In-Place (XIP) from ROM of read-only segments. Data segments mapped
+read-write (hence they have to be copied to RAM) may still be compressed in
+the cramfs image in the same file along with non compressed read-only
+segments. Both MMU and no-MMU systems are supported. This is particularly
+handy for tiny embedded systems with very tight memory constraints.
+
+The filesystem type for this feature is "cramfs_physmem" to distinguish it
+from the block device (or MTD) based access. The location of the cramfs
+image in memory is system dependent. You must know the proper physical
+address where the cramfs image is located and specify it using the
+physaddr=0x******** mount option (for example, if the physical address
+of the cramfs image is 0x80100000, the following command would mount it
+on /mnt):
+
+$ mount -t cramfs_physmem -o physaddr=0x80100000 none /mnt
+
+To boot such an image as the root filesystem, the following kernel
+commandline parameters must be provided:
+
+	"rootfstype=cramfs_physmem rootflags=physaddr=0x80100000"
+
+
+Tools
+-----
+
+A version of mkcramfs that can take advantage of the latest capabilities
+described above can be found here:
+
+https://github.com/npitre/cramfs-tools
+
+
 For /usr/share/magic
 --------------------
 
diff --git a/MAINTAINERS b/MAINTAINERS
index 44cb004c76..12f8155cfe 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3612,8 +3612,8 @@ F:	drivers/cpuidle/*
 F:	include/linux/cpuidle.h
 
 CRAMFS FILESYSTEM
-W:	http://sourceforge.net/projects/cramfs/
-S:	Orphan / Obsolete
+M:	Nicolas Pitre <nico@linaro.org>
+S:	Maintained
 F:	Documentation/filesystems/cramfs.txt
 F:	fs/cramfs/
 
diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
index 5b4e0b7e13..ae1fe6c795 100644
--- a/fs/cramfs/Kconfig
+++ b/fs/cramfs/Kconfig
@@ -1,5 +1,5 @@
 config CRAMFS
-	tristate "Compressed ROM file system support (cramfs) (OBSOLETE)"
+	tristate "Compressed ROM file system support (cramfs)"
 	select ZLIB_INFLATE
 	help
 	  Saying Y here includes support for CramFs (Compressed ROM File
@@ -15,8 +15,11 @@ config CRAMFS
 	  cramfs.  Note that the root file system (the one containing the
 	  directory /) cannot be compiled as a module.
 
-	  This filesystem is obsoleted by SquashFS, which is much better
-	  in terms of performance and features.
+	  This filesystem is limited in capabilities and performance on
+	  purpose to remain small and low on RAM usage. It is most suitable
+	  for small embedded systems. For a more capable compressed filesystem
+	  you should look at SquashFS which is much better in terms of
+	  performance and features.
 
 	  If unsure, say N.
 
-- 
2.9.5
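
For reference, the mount command documented in the patch above could
be issued from C through mount(2). This is only a sketch under stated
assumptions: the helper names cramfs_physaddr_opt() and
mount_cramfs_physmem() are made up for illustration, the call needs
CAP_SYS_ADMIN, and the kernel has to be built with CRAMFS_PHYSMEM.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

/* Format the "physaddr=0x..." option string described in cramfs.txt.
 * Returns 0 on success, -1 if the buffer is too small.
 * (Hypothetical helper, not part of the patch.) */
int cramfs_physaddr_opt(char *buf, size_t len, unsigned long physaddr)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "physaddr=0x%lx", physaddr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Rough equivalent of:
 *   mount -t cramfs_physmem -o physaddr=<addr> none <target>
 * Returns what mount(2) returns: 0 on success, -1 with errno set. */
int mount_cramfs_physmem(const char *target, unsigned long physaddr)
{
	char opts[32];

	if (cramfs_physaddr_opt(opts, sizeof(opts), physaddr) != 0)
		return -1;
	return mount("none", target, "cramfs_physmem", MS_RDONLY, opts);
}
```

Calling mount_cramfs_physmem("/mnt", 0x80100000UL) would correspond to
the example in the documentation hunk.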


^ permalink raw reply related

* Re: [PATCH v3 4/5] cramfs: add mmap support
From: Christoph Hellwig @ 2017-08-31  9:23 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Alexander Viro, linux-fsdevel, linux-embedded, linux-kernel,
	Chris Brandt, linux-mm
In-Reply-To: <20170831030932.26979-5-nicolas.pitre@linaro.org>

The whole VMA games here look entirely bogus; you can't just drop
and reacquire mmap_sem, for example.  And splitting vmas looks just
as problematic.

At a minimum you really must Cc the linux-mm list to get some
feedback there.

On Wed, Aug 30, 2017 at 11:09:31PM -0400, Nicolas Pitre wrote:
> When cramfs_physmem is used then we have the opportunity to map files
> directly from ROM, directly into user space, saving on RAM usage.
> This gives us Execute-In-Place (XIP) support.
> 
> For a file to be mmap()-able, the map area has to correspond to a range
> of uncompressed and contiguous blocks, and in the MMU case it also has
> to be page aligned. A version of mkcramfs with appropriate support is
> necessary to create such a filesystem image.
> 
> In the MMU case it may happen for a vma structure to extend beyond the
> actual file size. This is notably the case in binfmt_elf.c:elf_map().
> Or the file's last block is shared with other files and cannot be mapped
> as is. Rather than refusing to mmap it, we do a partial map and set up
> a special vm_ops fault handler that splits the vma in two: the direct
> mapping vma and the memory-backed vma populated by the readpage method.
> In practice the unmapped area is seldom accessed so the split might never
> occur before this area is discarded.
> 
> In the non-MMU case it is the get_unmapped_area method that is responsible
> for providing the address where the actual data can be found. No mapping
> is necessary of course.
> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> Tested-by: Chris Brandt <chris.brandt@renesas.com>
> ---
>  fs/cramfs/inode.c | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 295 insertions(+)
> 
> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> index 2fc886092b..1d7d61354b 100644
> --- a/fs/cramfs/inode.c
> +++ b/fs/cramfs/inode.c
> @@ -15,7 +15,9 @@
>  
>  #include <linux/module.h>
>  #include <linux/fs.h>
> +#include <linux/file.h>
>  #include <linux/pagemap.h>
> +#include <linux/ramfs.h>
>  #include <linux/init.h>
>  #include <linux/string.h>
>  #include <linux/blkdev.h>
> @@ -49,6 +51,7 @@ static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
>  static const struct super_operations cramfs_ops;
>  static const struct inode_operations cramfs_dir_inode_operations;
>  static const struct file_operations cramfs_directory_operations;
> +static const struct file_operations cramfs_physmem_fops;
>  static const struct address_space_operations cramfs_aops;
>  
>  static DEFINE_MUTEX(read_mutex);
> @@ -96,6 +99,10 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
>  	case S_IFREG:
>  		inode->i_fop = &generic_ro_fops;
>  		inode->i_data.a_ops = &cramfs_aops;
> +		if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) &&
> +		    CRAMFS_SB(sb)->flags & CRAMFS_FLAG_EXT_BLOCK_POINTERS &&
> +		    CRAMFS_SB(sb)->linear_phys_addr)
> +			inode->i_fop = &cramfs_physmem_fops;
>  		break;
>  	case S_IFDIR:
>  		inode->i_op = &cramfs_dir_inode_operations;
> @@ -277,6 +284,294 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset,
>  		return NULL;
>  }
>  
> +/*
> + * For a mapping to be possible, we need a range of uncompressed and
> + * contiguous blocks. Return the offset for the first block and number of
> + * valid blocks for which that is true, or zero otherwise.
> + */
> +static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 *pages)
> +{
> +	struct super_block *sb = inode->i_sb;
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> +	int i;
> +	u32 *blockptrs, blockaddr;
> +
> +	/*
> +	 * We can dereference memory directly here as this code may be
> +	 * reached only when there is a direct filesystem image mapping
> +	 * available in memory.
> +	 */
> +	blockptrs = (u32 *)(sbi->linear_virt_addr + OFFSET(inode) + pgoff*4);
> +	blockaddr = blockptrs[0] & ~CRAMFS_BLK_FLAGS;
> +	i = 0;
> +	do {
> +		u32 expect = blockaddr + i * (PAGE_SIZE >> 2);
> +		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR|CRAMFS_BLK_FLAG_UNCOMPRESSED;
> +		if (blockptrs[i] != expect) {
> +			pr_debug("range: block %d/%d got %#x expects %#x\n",
> +				 pgoff+i, pgoff+*pages-1, blockptrs[i], expect);
> +			if (i == 0)
> +				return 0;
> +			break;
> +		}
> +	} while (++i < *pages);
> +
> +	*pages = i;
> +
> +	/* stored "direct" block ptrs are shifted down by 2 bits */
> +	return blockaddr << 2;
> +}
> +
> +/*
> + * It is possible for cramfs_physmem_mmap() to partially populate the mapping
> + * causing page faults in the unmapped area. When that happens, we need to
> + * split the vma so that the unmapped area gets its own vma that can be backed
> + * with actual memory pages and loaded normally. This is necessary because
> + * remap_pfn_range() overwrites vma->vm_pgoff with the pfn and filemap_fault()
> + * no longer works with it. Furthermore this makes /proc/x/maps right.
> + * Q: is there a way to do split vma at mmap() time?
> + */
> +static const struct vm_operations_struct cramfs_vmasplit_ops;
> +static int cramfs_vmasplit_fault(struct vm_fault *vmf)
> +{
> +	struct mm_struct *mm = vmf->vma->vm_mm;
> +	struct vm_area_struct *vma, *new_vma;
> +	struct file *vma_file = get_file(vmf->vma->vm_file);
> +	unsigned long split_val, split_addr;
> +	unsigned int split_pgoff;
> +	int ret;
> +
> +	/* We have some vma surgery to do and need the write lock. */
> +	up_read(&mm->mmap_sem);
> +	if (down_write_killable(&mm->mmap_sem)) {
> +		fput(vma_file);
> +		return VM_FAULT_RETRY;
> +	}
> +
> +	/* Make sure the vma didn't change between the locks */
> +	ret = VM_FAULT_SIGSEGV;
> +	vma = find_vma(mm, vmf->address);
> +	if (!vma)
> +		goto out_fput;
> +
> +	/*
> +	 * Someone else might have raced with us and handled the fault,
> +	 * changed the vma, etc. If so let it go back to user space and
> +	 * fault again if necessary.
> +	 */
> +	ret = VM_FAULT_NOPAGE;
> +	if (vma->vm_ops != &cramfs_vmasplit_ops || vma->vm_file != vma_file)
> +		goto out_fput;
> +	fput(vma_file);
> +
> +	/* Retrieve the vma split address and validate it */
> +	split_val = (unsigned long)vma->vm_private_data;
> +	split_pgoff = split_val & 0xfff;
> +	split_addr = (split_val >> 12) << PAGE_SHIFT;
> +	if (split_addr < vma->vm_start) {
> +		/* bottom of vma was unmapped */
> +		split_pgoff += (vma->vm_start - split_addr) >> PAGE_SHIFT;
> +		split_addr = vma->vm_start;
> +	}
> +	pr_debug("fault: addr=%#lx vma=%#lx-%#lx split=%#lx\n",
> +		 vmf->address, vma->vm_start, vma->vm_end, split_addr);
> +	ret = VM_FAULT_SIGSEGV;
> +	if (!split_val || split_addr > vmf->address || vma->vm_end <= vmf->address)
> +		goto out;
> +
> +	if (unlikely(vma->vm_start == split_addr)) {
> +		/* nothing to split */
> +		new_vma = vma;
> +	} else {
> +		/* Split away the directly mapped area */
> +		ret = VM_FAULT_OOM;
> +		if (split_vma(mm, vma, split_addr, 0) != 0)
> +			goto out;
> +
> +		/* The direct vma should no longer ever fault */
> +		vma->vm_ops = NULL;
> +
> +		/* Retrieve the new vma covering the unmapped area */
> +		new_vma = find_vma(mm, split_addr);
> +		BUG_ON(new_vma == vma);
> +		ret = VM_FAULT_SIGSEGV;
> +		if (!new_vma)
> +			goto out;
> +	}
> +
> +	/*
> +	 * Readjust the new vma with the actual file based pgoff and
> +	 * process the fault normally on it.
> +	 */
> +	new_vma->vm_pgoff = split_pgoff;
> +	new_vma->vm_ops = &generic_file_vm_ops;
> +	new_vma->vm_flags &= ~(VM_IO | VM_PFNMAP | VM_DONTEXPAND);
> +	vmf->vma = new_vma;
> +	vmf->pgoff = split_pgoff;
> +	vmf->pgoff += (vmf->address - new_vma->vm_start) >> PAGE_SHIFT;
> +	downgrade_write(&mm->mmap_sem);
> +	return filemap_fault(vmf);
> +
> +out_fput:
> +	fput(vma_file);
> +out:
> +	downgrade_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct cramfs_vmasplit_ops = {
> +	.fault	= cramfs_vmasplit_fault,
> +};
> +
> +static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct super_block *sb = inode->i_sb;
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> +	unsigned int pages, vma_pages, max_pages, offset;
> +	unsigned long address;
> +	char *fail_reason;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_MMU))
> +		return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -ENOSYS;
> +
> +	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
> +		return -EINVAL;
> +
> +	/* Could COW work here? */
> +	fail_reason = "vma is writable";
> +	if (vma->vm_flags & VM_WRITE)
> +		goto fail;
> +
> +	vma_pages = (vma->vm_end - vma->vm_start + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	fail_reason = "beyond file limit";
> +	if (vma->vm_pgoff >= max_pages)
> +		goto fail;
> +	pages = vma_pages;
> +	if (pages > max_pages - vma->vm_pgoff)
> +		pages = max_pages - vma->vm_pgoff;
> +
> +	offset = cramfs_get_block_range(inode, vma->vm_pgoff, &pages);
> +	fail_reason = "unsuitable block layout";
> +	if (!offset)
> +		goto fail;
> +	address = sbi->linear_phys_addr + offset;
> +	fail_reason = "data is not page aligned";
> +	if (!PAGE_ALIGNED(address))
> +		goto fail;
> +
> +	/* Don't map the last page if it contains some other data */
> +	if (unlikely(vma->vm_pgoff + pages == max_pages)) {
> +		unsigned int partial = offset_in_page(inode->i_size);
> +		if (partial) {
> +			char *data = sbi->linear_virt_addr + offset;
> +			data += (max_pages - 1) * PAGE_SIZE + partial;
> +			while ((unsigned long)data & 7)
> +				if (*data++ != 0)
> +					goto nonzero;
> +			while (offset_in_page(data)) {
> +				if (*(u64 *)data != 0) {
> +					nonzero:
> +					pr_debug("mmap: %s: last page is shared\n",
> +						 file_dentry(file)->d_name.name);
> +					pages--;
> +					break;
> +				}
> +				data += 8;
> +			}
> +		}
> +	}
> +
> +	if (pages) {
> +		/*
> +		 * If we can't map it all, page faults will occur if the
> +		 * unmapped area is accessed. Let's handle them to split the
> +		 * vma and let the normal paging machinery take care of the
> +		 * rest through cramfs_readpage(). Because remap_pfn_range()
> +		 * repurposes vma->vm_pgoff, we have to save it somewhere.
> +		 * Let's use vma->vm_private_data to hold both the pgoff and
> +		 * the actual address split point. Maximum file size is 16MB
> +		 * (12 bits pgoff) and max 20 bits pfn where a long is 32 bits
> +		 * so we can pack both together.
> +		 */
> +		if (pages != vma_pages) {
> +			unsigned int split_pgoff = vma->vm_pgoff + pages;
> +			unsigned long split_pfn = (vma->vm_start >> PAGE_SHIFT) + pages;
> +			unsigned long split_val = split_pgoff | (split_pfn << 12);
> +			vma->vm_private_data = (void *)split_val;
> +			vma->vm_ops = &cramfs_vmasplit_ops;
> +			/* to keep remap_pfn_range() happy */
> +			vma->vm_end = vma->vm_start + pages * PAGE_SIZE;
> +		}
> +
> +		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
> +				      pages * PAGE_SIZE, vma->vm_page_prot);
> +		/* restore vm_end in case we cheated it above */
> +		vma->vm_end = vma->vm_start + vma_pages * PAGE_SIZE;
> +		if (ret)
> +			return ret;
> +
> +		pr_debug("mapped %s at 0x%08lx (%u/%u pages) to vma 0x%08lx, "
> +			 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
> +			 address, pages, vma_pages, vma->vm_start,
> +			 (unsigned long long)pgprot_val(vma->vm_page_prot));
> +		return 0;
> +	}
> +	fail_reason = "no suitable block remaining";
> +
> +fail:
> +	pr_debug("%s: direct mmap failed: %s\n",
> +		 file_dentry(file)->d_name.name, fail_reason);
> +
> +	/* We failed to do a direct map, but normal paging will do it */
> +	vma->vm_ops = &generic_file_vm_ops;
> +	return 0;
> +}
> +
> +#ifndef CONFIG_MMU
> +
> +static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
> +			unsigned long addr, unsigned long len,
> +			unsigned long pgoff, unsigned long flags)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct super_block *sb = inode->i_sb;
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> +	unsigned int pages, block_pages, max_pages, offset;
> +
> +	pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (pgoff >= max_pages || pages > max_pages - pgoff)
> +		return -EINVAL;
> +	block_pages = pages;
> +	offset = cramfs_get_block_range(inode, pgoff, &block_pages);
> +	if (!offset || block_pages != pages)
> +		return -ENOSYS;
> +	addr = sbi->linear_phys_addr + offset;
> +	pr_debug("get_unmapped for %s ofs %#lx siz %lu at 0x%08lx\n",
> +		 file_dentry(file)->d_name.name, pgoff*PAGE_SIZE, len, addr);
> +	return addr;
> +}
> +
> +static unsigned cramfs_physmem_mmap_capabilities(struct file *file)
> +{
> +	return NOMMU_MAP_COPY | NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_EXEC;
> +}
> +#endif
> +
> +static const struct file_operations cramfs_physmem_fops = {
> +	.llseek			= generic_file_llseek,
> +	.read_iter		= generic_file_read_iter,
> +	.splice_read		= generic_file_splice_read,
> +	.mmap			= cramfs_physmem_mmap,
> +#ifndef CONFIG_MMU
> +	.get_unmapped_area	= cramfs_physmem_get_unmapped_area,
> +	.mmap_capabilities	= cramfs_physmem_mmap_capabilities,
> +#endif
> +};
> +
>  static void cramfs_blkdev_kill_sb(struct super_block *sb)
>  {
>  	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> -- 
> 2.9.5
> 
---end quoted text---
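
The quoted patch packs two values into vma->vm_private_data: the file
page offset of the split point in the low 12 bits and the page number
of the split address in the bits above. The arithmetic can be checked
in user space with the sketch below. It assumes 4 KiB pages and a
32-bit word, as the comment in the patch does; the SKETCH_* macros and
function names are illustrative and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_PGOFF_BITS	12	/* 16MB max file / 4 KiB pages */
#define SKETCH_PGOFF_MASK	((1u << SKETCH_PGOFF_BITS) - 1)

/* Mirror of the packing in cramfs_physmem_mmap(): "pages" is the
 * number of pages that were mapped directly. The patch calls the
 * second value split_pfn although it derives from vma->vm_start. */
uint32_t split_pack(uint32_t vm_start, uint32_t vm_pgoff, uint32_t pages)
{
	uint32_t split_pgoff = vm_pgoff + pages;
	uint32_t split_pfn = (vm_start >> SKETCH_PAGE_SHIFT) + pages;

	assert(split_pgoff <= SKETCH_PGOFF_MASK);	/* must fit in 12 bits */
	return split_pgoff | (split_pfn << SKETCH_PGOFF_BITS);
}

/* Mirror of the unpacking in cramfs_vmasplit_fault(). */
uint32_t split_unpack_pgoff(uint32_t val)
{
	return val & SKETCH_PGOFF_MASK;
}

uint32_t split_unpack_addr(uint32_t val)
{
	return (val >> SKETCH_PGOFF_BITS) << SKETCH_PAGE_SHIFT;
}
```

For a vma starting at 0x40000000 with vm_pgoff 2 and 3 directly mapped
pages, the unpacked split address is vm_start plus three pages and the
unpacked page offset is 5.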

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3 4/5] cramfs: add mmap support
From: Nicolas Pitre @ 2017-08-31 17:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, linux-fsdevel, linux-embedded, linux-kernel,
	Chris Brandt, linux-mm
In-Reply-To: <20170831092338.GA8196@infradead.org>

On Thu, 31 Aug 2017, Christoph Hellwig wrote:

> The whole VMA games here look entirely bogus; you can't just drop
> and reacquire mmap_sem, for example.  And splitting vmas looks just
> as problematic.

I didn't just decide to do that on a whim. I spent quite some time
looking at page fault code paths to make sure it is fine to reacquire
the lock. There are existing code paths that drop the lock entirely and
return with no locks held, so this is already expected by the core code.

> At a minimum you really must Cc the linux-mm list to get some
> feedback there.

Good point. You added linux-mm to CC so I'll wait for their comments.

Nicolas

^ permalink raw reply

* execve(NULL, argv, envp) for nommu?
From: Rob Landley @ 2017-09-05  7:34 UTC (permalink / raw)
  To: linux-embedded

For years I've wanted an execve() system call modification that let me
pass a NULL as the first argument to say "re-exec this program please".
Because on nommu you've got to exec something to unblock vfork(), and
daemons (or things like busybox and toybox) want to re-exec themselves.
(I just hit this again trying to implement a nommu-friendly strace(): the
one on github doesn't SIGSTOP the child before the execve() of the
process to trace because vfork(), and just races and misses the first
few system calls on nommu instead...)

The problem with exec /proc/self/exe is A) I haven't necessarily got
/proc mounted, B) in a chroot the original binary might not be in scope
anymore. But I'm already _running_ this program. If I could fork() I
could already get a second copy of the sucker and call main() again
myself if necessary, but I can't, so...

I'm aware there's a possible "but what if it was suid and it's already
dropped privileges" argument, and I'm fine with execve(NULL) not
honoring the suid bit if people feel that way. I just wanna unblock
vfork() while still running this code. (A way to detect I did this would
be great too, but the normal tweaking of argv[] or envp[] to let main
know we're a child still works.)

Is there a _reason_ the kernel doesn't do this, or has nobody bothered
to code it up yet?

Rob
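
For readers unfamiliar with the workaround discussed above, here is a
minimal sketch of re-executing the running binary after vfork() via
/proc/self/exe, with an argv[] marker so that main() can tell it is
the child. It has exactly the limitations the message lists (it needs
/proc mounted and the binary reachable). The marker string and the
function names are hypothetical.

```c
#define _DEFAULT_SOURCE		/* for vfork() under strict -std= modes */
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define REEXEC_MARKER "--reexec-child"	/* hypothetical flag name */

/* Returns 1 if main() was entered through spawn_self(), else 0. */
int is_reexec_child(int argc, char *const argv[])
{
	return argc > 1 && strcmp(argv[1], REEXEC_MARKER) == 0;
}

/* vfork() and re-exec the running binary with the marker in argv[1].
 * Returns the child's pid in the parent, or -1 if vfork() failed. */
pid_t spawn_self(const char *argv0, char *const envp[])
{
	char *child_argv[] = { (char *)argv0, (char *)REEXEC_MARKER, NULL };
	pid_t pid;

	assert(argv0 != NULL);
	pid = vfork();
	if (pid == 0) {
		/* Child: only exec or _exit is safe after vfork(). */
		execve("/proc/self/exe", child_argv, envp);
		_exit(127);	/* exec failed, e.g. /proc is not mounted */
	}
	return pid;
}
```

A main() using this would test is_reexec_child(argc, argv) first and
branch into the child's work when it returns 1.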

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Geert Uytterhoeven @ 2017-09-05  9:00 UTC (permalink / raw)
  To: Rob Landley; +Cc: Linux Embedded, Oleg Nesterov, linux-kernel@vger.kernel.org
In-Reply-To: <324c00d9-06a6-1fc5-83fe-5bd36d874501@landley.net>

CC Oleg, lkml

On Tue, Sep 5, 2017 at 9:34 AM, Rob Landley <rob@landley.net> wrote:
> For years I've wanted an execve() system call modification that let me
> pass a NULL as the first argument to say "re-exec this program please".
> Because on nommu you've got to exec something to unblock vfork(), and
> daemons (or things like busybox and toybox) want to re-exec themselves.
> (I just hit this again trying to implement a nommu-friendly strace(): the
> one on github doesn't SIGSTOP the child before the execve() of the
> process to trace because vfork(), and just races and misses the first
> few system calls on nommu instead...)
>
> The problem with exec /proc/self/exe is A) I haven't necessarily got
> /proc mounted, B) in a chroot the original binary might not be in scope
> anymore. But I'm already _running_ this program. If I could fork() I
> could already get a second copy of the sucker and call main() again
> myself if necessary, but I can't, so...
>
> I'm aware there's a possible "but what if it was suid and it's already
> dropped privileges" argument, and I'm fine with execve(NULL) not
> honoring the suid bit if people feel that way. I just wanna unblock
> vfork() while still running this code. (A way to detect I did this would
> be great too, but the normal tweaking of argv[] or envp[] to let main
> know we're a child still works.)
>
> Is there a _reason_ the kernel doesn't do this, or has nobody bothered
> to code it up yet?
>
> Rob

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Alan Cox @ 2017-09-05 13:24 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Rob Landley, Linux Embedded, Oleg Nesterov,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAMuHMdUPUaLfbbFF1kZoEUy7or-9sVOt=ykAHT+S6NBvFy5V=g@mail.gmail.com>

> > anymore. But I'm already _running_ this program. If I could fork() I
> > could already get a second copy of the sucker and call main() again
> > myself if necessary, but I can't, so...

You can - ptrace 8)

> > honoring the suid bit if people feel that way. I just wanna unblock
> > vfork() while still running this code. 

Would it make more sense to have a way to promote your vfork into a
fork when you hit these cases (I appreciate that fork on NOMMU has a much
higher performance cost as you start having to softmmu copy or swap
pages).

Alan

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Rob Landley @ 2017-09-06  1:12 UTC (permalink / raw)
  To: Alan Cox, Geert Uytterhoeven
  Cc: Linux Embedded, Oleg Nesterov, dalias,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170905142436.262ed118@alans-desktop>

On 09/05/2017 08:24 AM, Alan Cox wrote:
>>> anymore. But I'm already _running_ this program. If I could fork() I
>>> could already get a second copy of the sucker and call main() again
>>> myself if necessary, but I can't, so...
> 
> You can - ptrace 8)

Oh I can call clone() with various flags and try to fake it myself, it
just won't do what I want. :)

>>> honoring the suid bit if people feel that way. I just wanna unblock
>>> vfork() while still running this code. 
> 
> Would it make more sense to have a way to promote your vfork into a
> fork when you hit these cases (I appreciate that fork on NOMMU has a much
> higher performance cost as you start having to softmmu copy or swap
> pages).

It's not the performance cost, it's rewriting all the pointers.

Without address translation, copying the existing mappings to a new
range requires finding and adjusting every pointer to the old data,
which you can do for the executable mappings in PIE* binaries, but
tracking down all the pointers on the stack, heap, and in your global
variables? Flaming pain.

Making fork() work on nommu is basically the same problem as making
garbage collection work in C on mmu. Thus those of us who defend vfork()
from the people who don't understand why it exists periodically
suggesting we remove it.

> Alan

Rob

* or FDPIC, which is basically just PIE with 4 individually relocatable
text/data/rodata/bss segments instead of one big mapping you relocate as
a contiguous block; both work on nommu but fdpic can fit into more
fragmented memory, and because the segments are independent it lets
nommu share some segments between processes (code+rodata**) without
sharing others (data and bss). That's why nommu can't run normal elf but
can run PIE or FDPIC binaries. Or binflt which is the old a.out version.

** Don't ask me what happens when rodata contains a constant pointer to
a bss or data object. I'm guessing the compiler Does A Thing. Ask Rich
Felker?

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Rob Landley @ 2017-09-08 21:18 UTC (permalink / raw)
  To: Alan Cox, Geert Uytterhoeven
  Cc: Linux Embedded, Oleg Nesterov, dalias,
	linux-kernel@vger.kernel.org
In-Reply-To: <ab6e6e8b-7040-a07d-5502-405701182568@landley.net>

On 09/05/2017 08:12 PM, Rob Landley wrote:
> On 09/05/2017 08:24 AM, Alan Cox wrote:
>>>> honoring the suid bit if people feel that way. I just wanna unblock
>>>> vfork() while still running this code. 
>>
>> Would it make more sense to have a way to promote your vfork into a
>> fork when you hit these cases (I appreciate that fork on NOMMU has a much
>> higher performance cost as you start having to softmmu copy or swap
>> pages).
> 
> It's not the performance cost, it's rewriting all the pointers.
> 
> Without address translation, copying the existing mappings to a new
> range requires finding and adjusting every pointer to the old data,
> which you can do for the executable mappings in PIE* binaries, but
> tracking down all the pointers on the stack, heap, and in your global
> variables? Flaming pain.
> 
> Making fork() work on nommu is basically the same problem as making
> garbage collection work in C on mmu. Thus those of us who defend vfork()
> from the people who don't understand why it exists periodically
> suggesting we remove it.

So is exec(NULL, argv, envp) a reasonable thing to want?

Rob

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Oleg Nesterov @ 2017-09-11 15:15 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias,
	linux-kernel@vger.kernel.org
In-Reply-To: <d2d1acae-b2a1-9f41-d3bf-9d3b35a62664@landley.net>

On 09/08, Rob Landley wrote:
>
> So is exec(NULL, argv, envp) a reasonable thing to want?

I think that something like prctl(PR_OPEN_EXE_FILE) which does

	dentry_open(current->mm->exe_file->path, O_PATH)

and returns an fd makes more sense.

Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

But to be honest, I can't understand the problem, because I know nothing
about nommu.

You need to unblock parent sleeping in vfork(), and you can't do another
fork (I don't understand why).

Perhaps the child can create another thread? The main thread can exit
after that and unblock the parent. Or perhaps even something like
clone(CLONE_VM | CLONE_PARENT), I dunno...

Oleg.
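
The second half of this suggestion can already be tried from user
space. The sketch below is an approximation under stated assumptions:
prctl(PR_OPEN_EXE_FILE) is only a proposal in this message and does
not exist, so open_self_exe() stands in for it by opening
/proc/self/exe (which needs /proc, unlike the proposal). The function
names are illustrative, and execveat(2) is invoked through syscall()
so that no libc wrapper is required.

```c
#define _GNU_SOURCE		/* for syscall() and AT_EMPTY_PATH */
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* value from the Linux UAPI headers */
#endif

/* Stand-in for the proposed prctl(): get an fd for the running binary.
 * The proposal would use O_PATH and would not depend on /proc. */
int open_self_exe(void)
{
	return open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
}

/* Replace the current process image with the file behind fd.
 * Only returns on failure, with -1 and errno set. */
int exec_fd(int fd, char *const argv[], char *const envp[])
{
	assert(fd >= 0);
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}
```

After vfork(), the child would call exec_fd() with a descriptor that
the parent obtained earlier, so the path lookup happens only once.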

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Alan Cox @ 2017-09-11 18:14 UTC (permalink / raw)
  To: Rob Landley
  Cc: Geert Uytterhoeven, Linux Embedded, Oleg Nesterov, dalias,
	linux-kernel@vger.kernel.org
In-Reply-To: <ab6e6e8b-7040-a07d-5502-405701182568@landley.net>

> It's not the performance cost, it's rewriting all the pointers.

Which you don't need to do

> Without address translation, copying the existing mappings to a new
> range requires finding and adjusting every pointer to the old data,

No it doesn't. See Minix.

When you fork() rather than vfork you stick a copy of any non-relocatable
elements (typically DATA copy + BSS + stack with a sane CPU and compiler)
into a buffer and you swap them over with the real copy when you task
switch to the one in the wrong place. If you start the child first you
usually only take one copy.

I've always been amused that Linux NOMMU hasn't managed to grow a feature
that people successfully implemented on 68000 long long ago, and I
believe some other processors back to v6/v7 days.

Alan
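
The swapping scheme described above can be illustrated with a toy
user-space model. This is not Minix or kernel code, only a sketch of
the idea: one fixed region stands for the non-relocatable memory, each
task keeps a shadow copy while it is switched out, and a context
switch exchanges the two. All names are invented for the example.

```c
#include <assert.h>
#include <string.h>

#define REGION_SIZE 64

/* The one fixed-address region that both tasks believe is theirs. */
static unsigned char live[REGION_SIZE];

struct task {
	unsigned char shadow[REGION_SIZE];	/* contents while switched out */
};

static struct task *current_task;

/* Start running the first task in the live region. */
void task_start(struct task *t)
{
	current_task = t;
}

/* fork(): the child starts as a byte-for-byte copy of the live region. */
void task_fork(struct task *child)
{
	memcpy(child->shadow, live, REGION_SIZE);
}

/* Context switch: park the outgoing image, bring the incoming one in. */
void task_switch(struct task *next)
{
	assert(current_task != NULL);
	if (next == current_task)
		return;
	memcpy(current_task->shadow, live, REGION_SIZE);
	memcpy(live, next->shadow, REGION_SIZE);
	current_task = next;
}
```

Because both tasks always run with their data at the same address, no
pointer needs adjusting; the cost is the two copies on every switch
between parent and child.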

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Rob Landley @ 2017-09-12 10:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170911151526.GA4126@redhat.com>

On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> On 09/08, Rob Landley wrote:
>>
>> So is exec(NULL, argv, envp) a reasonable thing to want?
> 
> I think that something like prctl(PR_OPEN_EXE_FILE) which does
> 
> 	dentry_open(current->mm->exe_file->path, O_PATH)
> 
> and returns an fd makes more sense.
> 
> Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

I'm all for it? That sounds like a cosmetic difference, a more verbose
way of achieving the same outcome.

(Of course now you've got a filehandle you can read xattrs and such
through from otherwise jailed contexts letting you do things you
couldn't necessarily do before, but I assume you know the security
implications of that more than I do. I tried to suggest something that
_didn't_ create new capabilities, just let nommu do a thing that mmu
could already do.)

> But to be honest, I can't understand the problem, because I know nothing
> about nommu.
> 
> You need to unblock parent sleeping in vfork(), and you can't do another
> fork (I don't understand why).

A nommu system doesn't have a memory management unit, so all addresses
are physical addresses. This means two processes can't see different
things at the same address: either they see the same thing or one of
them can't see that address (due to a range register masking it).

Conventional fork() creates copy on write mappings of all the existing
writable memory of the parent process. So when the new PID dirties a
page, the old page gets copied by the fault handler. The problem isn't
the copies (that's just slow), the problem is two processes seeing
different things at the same address. That requires an MMU with a TLB
loaded from page tables.

If you create _new_ mappings and copy the data over, they'll have
different addresses. But any pointers you copied will point to the _old_
addresses. Finding and adjusting all those pointers to point to the new
addresses instead is basically the same problem as doing garbage
collection in C.

Your stack has pointers. Your heap has pointers. Your data and bss (once
initialized) can have pointers. These pointers can be in the middle of
malloc()'ed structures so no ELF table anywhere knows anything about
them. A long variable containing a value that _could_ point into one of
these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
it is breakage. Tracking them all down and fixing up just the right ones
without missing any or changing data you shouldn't is REALLY HARD.
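
A tiny example of the stale-pointer problem, as a sketch: copying a
structure byte for byte (which is all a naive nommu fork() could do)
leaves the embedded pointer aimed at the original, so a write through
the copy's pointer lands in the "parent". The struct and function
names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

struct record {
	int value;
	int *ptr;	/* points at this record's own value */
};

/* Copy a record the way a naive memory copy would, then report
 * whether a write through the copy's pointer hit the original. */
int copy_aliases_original(void)
{
	struct record parent = { 1, NULL };
	struct record child;

	parent.ptr = &parent.value;
	memcpy(&child, &parent, sizeof(child));
	assert(child.ptr == &parent.value);	/* still the old address */

	*child.ptr = 2;		/* the "child" writes via its pointer */
	return parent.value == 2 && child.value == 1;
}
```

Nothing in the bytes of child.ptr says it is a pointer rather than an
integer that happens to fall in the parent's range, which is the part
that makes a general fix-up pass so hard.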

The vfork() system call is what you use on nommu instead: it creates a
child process that uses its parent's memory mappings. The parent process
is stopped until the child calls _exit() or exec(), either of which
means it stops using those mappings and the parent can go back to using
them without the two stomping on each other. (Usually they even share
the same stack, so the child shouldn't return from the function that
called vfork() or it'll corrupt the stack for the parent process. And be
careful about changing local variables, the parent might see the changes
when it resumes. Some vfork() implementations provide a small new stack,
ala signal handlers or kernel interrupts, so you can't guarantee your
parent will see your local variable changes, but you still can't return
from the function that called vfork() in either case.)

So after calling vfork(), the child _must_ call exec() in order for
there to be two independent processes running at the same time. Until
then, the parent is stopped.
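The usual pattern looks roughly like the sketch below. run_program() is a made-up helper name; the system calls are the standard ones. The child does nothing between vfork() and execve() except fall through to _exit() on failure, precisely because it is running on borrowed memory.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Run a program via vfork()+execve() and return its exit status,
 * or -1 if the child could not be created or did not exit normally.
 */
static int run_program(const char *path, char *const argv[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: still using the parent's mappings, so touch nothing. */
		execve(path, argv, environ);
		_exit(127);	/* exec failed; _exit() also releases the parent */
	}
	/* Parent: only resumes once the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Note that the child never returns from run_program(), which is the rule stated above: returning from the function that called vfork() would corrupt the shared stack.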

The real problem with implementing full fork() isn't the expense of
copying the data (although if you fork and exec from a mozilla style pig
process, you could copy hundreds of megabytes of data and then
immediately discard it again; that's why fork() doesn't usually do that;
oh and on nommu systems you need _contiguous_ memory blocks for the data
because it can't collect disparate pages together into a longer mapping,
so this is actually a largeish real-world issue on those systems, not
merely slow and expensive.) The hard problem is translating the pointers
so the new mapping doesn't read/write objects in the old mapping.

> Perhaps the child can create another thread? The main thread can exit
> after that and unblock the parent. Or perhaps even something like
> clone(CLONE_VM | CLONE_PARENT), I dunno...

Launching a new thread doesn't unblock the parent. A second vfork() from
the child wouldn't unblock the parent. Your mappings are still
overcommitted, only _exit() or execve() releases the child process's use
of those mappings.

You can create threads on nommu because they're designed to share the
same mappings. In that case you're guaranteed a new stack, and not
stomping the parent's data is your problem.

But if you exec() from a thread, posix says it kills all the other threads:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

And even without that, we're still in the "vfork but add concurrency"
territory. Your threads don't have their own independent mappings,
they're sharing and stomping each other's data unless you add locking
and write your program to know about the other threads. To get two
independent process contexts running the same executable but with
different mappings (I.E. the goal we started with), you still need the
child to exec. And the start of this thread was "exec what"?

> Oleg.

Rob

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Geert Uytterhoeven @ 2017-09-12 11:30 UTC (permalink / raw)
  To: Rob Landley
  Cc: Oleg Nesterov, Alan Cox, Linux Embedded, Rich Felker,
	linux-kernel@vger.kernel.org
In-Reply-To: <d3ae79b1-810d-8abc-3692-69cef4bd1a7a@landley.net>

Hi Rob,

On Tue, Sep 12, 2017 at 12:48 PM, Rob Landley <rob@landley.net> wrote:
> A nommu system doesn't have a memory management unit, so all addresses
> are physical addresses. This means two processes can't see different
> things at the same address: either they see the same thing or one of
> them can't see that address (due to a range register masking it).
>
> Conventional fork() creates copy on write mappings of all the existing
> writable memory of the parent process. So when the new PID dirties a
> page, the old page gets copied by the fault handler. The problem isn't
> the copies (that's just slow), the problem is two processes seeing
> different things at the same address. That requires an MMU with a TLB
> loaded from page tables.
>
> If you create _new_ mappings and copy the data over, they'll have
> different addresses. But any pointers you copied will point to the _old_
> addresses. Finding and adjusting all those pointers to point to the new
> addresses instead is basically the same problem as doing garbage
> collection in C.
>
> Your stack has pointers. Your heap has pointers. Your data and bss (once
> initialized) can have pointers. These pointers can be in the middle of
> malloc()'ed structures so no ELF table anywhere knows anything about
> them. A long variable containing a value that _could_ point into one of
> these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
> it is breakage. Tracking them all down and fixing up just the right ones
> without missing any or changing data you shouldn't is REALLY HARD.

Hence (make the compiler) never store pointers, only offsets relative to a
base register. So after making copies of stack, data/bss, and heap, all you
need to do is adjust these base registers for the child process.
Nothing in main memory needs to be modified.

Text accesses can be PC-relative => nothing to adjust.
Local variable accesses are stack-relative => nothing to adjust.
Data/bss accesses can be relative to a reserved register that stores the
data base address => only adjust the base register, nothing in RAM to adjust.
Heap accesses can be relative to a reserved register that stores the heap
base address => only adjust the base register, nothing in RAM to adjust.
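As a rough illustration of the idea at the data-structure level, here is a hand-written sketch; the proposal is for the compiler to do this, and nothing below is an existing compiler feature. All names (struct node, node_at, sum_list, relocate_arena) are hypothetical. Links are stored as offsets from an arena base, so a byte-for-byte copy of the arena stays valid at its new address.

```c
#include <stddef.h>
#include <string.h>

/* A list node whose link is an offset from the arena base, not an address. */
struct node {
	size_t next_off;	/* 0 means end of list (offset 0 is reserved) */
	int value;
};

/* Turn an offset into a usable pointer, given the current arena base. */
static struct node *node_at(void *base, size_t off)
{
	return off ? (struct node *)((char *)base + off) : NULL;
}

/* Sum every value in the list that starts at offset head. */
static int sum_list(void *base, size_t head)
{
	int sum = 0;
	struct node *n;

	for (n = node_at(base, head); n; n = node_at(base, n->next_off))
		sum += n->value;
	return sum;
}

/* "Fork" the arena: a plain copy, with nothing inside it to fix up. */
static void relocate_arena(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}
```

The only thing that changes after relocate_arena() is the base passed to sum_list(), which plays the role of the reserved base register.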

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Rob Landley @ 2017-09-12 13:45 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Oleg Nesterov, Alan Cox, Linux Embedded, Rich Felker,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAMuHMdXQn47LK6RLtdgW3JJtL1VwqK7ZvUdmLgaZat8TA30KtQ@mail.gmail.com>

On 09/12/2017 06:30 AM, Geert Uytterhoeven wrote:
> Hi Rob,
> 
> On Tue, Sep 12, 2017 at 12:48 PM, Rob Landley <rob@landley.net> wrote:
>> Your stack has pointers. Your heap has pointers. Your data and bss (once
>> initialized) can have pointers. These pointers can be in the middle of
>> malloc()'ed structures so no ELF table anywhere knows anything about
>> them. A long variable containing a value that _could_ point into one of
>> these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
>> it is breakage. Tracking them all down and fixing up just the right ones
>> without missing any or changing data you shouldn't is REALLY HARD.
> 
> Hence (make the compiler) never store pointers, only offsets relative to a
> base register. So after making copies of stack, data/bss, and heap, all you
> need to do is adjust these base registers for the child process.
> Nothing in main memory needs to be modified.

Ok, I'll bite. How do you set a signal handler under this regime, since
that needs to pass a function pointer to the syscall? Have a different
function pointer type for when you want a real pointer instead of an
offset pointer? Perhaps label them "near" and "far" pointers, since
there's precedent for that back under DOS?

When you call printf(), how does it accept both a "string constant"
living in rodata and a char array on the stack? Two printf functions
with different argument types? If it _does_ take an actual memory
address rather than an offset that isn't always vs the same segment then
you've written pointers to the stack...

You're also requiring static linking: shared libraries work just fine
with fdpic, but under your segment:offset addressing system all text has
to be relative to the same code segment.

Plus there's still the "fork() off of mozilla" problem that you may copy
lots of data just to immediately discard it as the common case (unless
you'd still use vfork() for most things), and you still need contiguous
blocks of memory for each segment (nommu is vulnerable to fragmentation,
increasingly so as the system stays up longer) so your fork() will fail
where vfork() succeeds. But that just makes it really slow and
unreliable, rather than requiring a large rewrite of the C language.

> Text accesses can be PC-relative => nothing to adjust.
> Local variable accesses are stack-relative => nothing to adjust.
> Data/bss accesses can be relative to a reserved register that stores the
> data base address => only adjust the base register, nothing in RAM to adjust.

Does this compiler setup you're describing actually exist?

Instead of making a minor adjustment to one system call, it's better to
extensively rewrite compilers and calling conventions, ignoring the way
C traditionally treats strings and arrays as pointers where pointers
into data, bss, heap, and stack are all used interchangeably...

> Heap accesses can be relative to a reserved register that stores the heap
> base address => only adjust the base register, nothing in RAM to adjust.

Query: if you implement a linked list ala:

struct blah {
  struct blah *next;
  char *key, *value;
};

If next points to a malloc(), key is a constant string in rodata, and
value was strchr(getenv(key), '=')+1 (with appropriate error checking of
course), how does your compiler know which segment each pointer in that
structure is offset from? (What segment IS your environment space
relative to, anyway? It's not the _current_ value of your stack pointer,
that moves.)

How does your proposed compiler rewrite handle mmap()? You can do
MAP_SHARED just fine on nommu today, it's only MAP_PRIVATE that requires
copy on write. (Yes MAP_SHARED can be read only.)

You're aware that most heap implementations can have more than one
underlying mmap(), right?

  http://git.musl-libc.org/cgit/musl/tree/src/malloc/malloc.c#n320

https://github.com/kraj/uClibc/blob/master/libc/stdlib/malloc/malloc.c#L121

So when you say _the_ heap base address above, which chunk are you
referring to?

Rob

^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Oleg Nesterov @ 2017-09-12 15:45 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias,
	linux-kernel@vger.kernel.org
In-Reply-To: <d3ae79b1-810d-8abc-3692-69cef4bd1a7a@landley.net>

On 09/12, Rob Landley wrote:
>
> On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> > On 09/08, Rob Landley wrote:
> >>
> >> So is exec(NULL, argv, envp) a reasonable thing to want?
> >
> > I think that something like prctl(PR_OPEN_EXE_FILE) which does
> >
> > 	dentry_open(current->mm->exe_file->path, O_PATH)
> >
> > and returns fd make more sense.
> >
> > Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).
> I'm all for it? That sounds like a cosmetic difference, a more verbose
> way of achieving the same outcome.

Simpler to implement. Something like the (untested) patch below. Not sure
it is correct, not sure it is good idea, etc.

> (Of course now you've got a filehandle you can read xattrs and such
> through from otherwise jailed contexts letting you do things you
> couldn't necessarily do before,

I can be easily wrong, this is not my area, but afaics no. Note that
you get the FMODE_PATH file (see O_PATH), you can do almost nothing
with it.

So. IIUC with this patch you can do

	fd = prctl(PR_OPEN_EXE_FILE);

	execveat(fd, "", NULL, NULL, AT_EMPTY_PATH);

and execveat should succeed even if the binary was unlinked/renamed in
between.

otoh it should fail if, say, you do "chmod a-x exename" in between.

However. This won't work after chroot() so I am not sure this solves your
problems.

> but I assume you know the security
> implications of that more than I do.

Unlikely ;)


> > But to be honest, I can't understand the problem, because I know nothing
> > about nommu.
> >
> > You need to unblock parent sleeping in vfork(), and you can't do another
> > fork (I don't understand why).
>
> A nommu system doesn't have a memory management unit, so all addresses
> are physical addresses. This means two processes can't see different
> things at the same address: either they see the same thing or one of
> them can't see that address (due to a range register masking it).

Yes, yes, I understand, and thanks for your detailed explanation...

> > Perhaps the child can create another thread? The main thread can exit
> > after that and unblock the parent. Or perhaps even something like
> > clone(CLONE_VM | CLONE_PARENT), I dunno...
>
> Launching a new thread doesn't unblock the parent.

Well, this doesn't really matter, but see above, the main thread can exit
after that. This should unblock the parent.

> And even without that, we're still in the "vfork but add concurrency"
> territory. Your threads don't have their own independent mappings,

Of course!

Just I misinterpreted your initial email as if this is fine for your
use-case, and all you need is unblock the parent and nothing else.

Oleg.
---


--- x/kernel/sys.c
+++ x/kernel/sys.c
@@ -2183,6 +2183,40 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data)
 	return 1;
 }
 
+static int open_mm_exe_file(void)
+{
+	struct file *exe_file, *file;
+	struct path *path;
+	int fd = -ENOENT;
+
+	exe_file = get_mm_exe_file(current->mm);
+	if (!exe_file)
+		goto out;
+
+	path = &exe_file->f_path;
+	if (!path->dentry)
+		goto put_exe_file;
+
+	fd = get_unused_fd_flags(O_CLOEXEC); // flags?
+	if (fd < 0)
+		goto put_exe_file;
+
+	file = dentry_open(path, O_PATH, current_cred());
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		fd = PTR_ERR(file);
+		goto put_exe_file;
+	}
+
+	path_get(path);
+	fd_install(fd, file);
+
+put_exe_file:
+	fput(exe_file);
+out:
+	return fd;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2196,6 +2230,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case PR_OPEN_EXE_FILE:
+		error = open_mm_exe_file();
+		break;
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;


^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Oleg Nesterov @ 2017-09-13 14:20 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170912154549.GA31411@redhat.com>

On 09/12, Oleg Nesterov wrote:
>
> On 09/12, Rob Landley wrote:
> >
> > On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> > > On 09/08, Rob Landley wrote:
> > >>
> > >> So is exec(NULL, argv, envp) a reasonable thing to want?
> > >
> > > I think that something like prctl(PR_OPEN_EXE_FILE) which does
> > >
> > > 	dentry_open(current->mm->exe_file->path, O_PATH)
> > >
> > > and returns fd make more sense.
> > >
> > > Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).
> > I'm all for it? That sounds like a cosmetic difference, a more verbose
> > way of achieving the same outcome.
>
> Simpler to implement. Something like the (untested) patch below. Not sure
> it is correct, not sure it is good idea, etc.

OTOH... with the trivial patch below

	execveat(AT_FDCWD, "", NULL, NULL, AT_EMPTY_PATH);

should always work, even if the binary is not in scope after chroot, or if
it is no longer executable, or unlinked. But I am not sure what else should
we do to avoid the security problems.

Oleg.


--- x/fs/exec.c
+++ x/fs/exec.c
@@ -832,23 +832,32 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 {
 	struct file *file;
 	int err;
-	struct open_flags open_exec_flags = {
-		.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
-		.acc_mode = MAY_EXEC,
-		.intent = LOOKUP_OPEN,
-		.lookup_flags = LOOKUP_FOLLOW,
-	};
-
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
-		return ERR_PTR(-EINVAL);
-	if (flags & AT_SYMLINK_NOFOLLOW)
-		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
-	if (flags & AT_EMPTY_PATH)
-		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
 
-	file = do_filp_open(fd, name, &open_exec_flags);
-	if (IS_ERR(file))
-		goto out;
+	if (fd == AT_FDCWD && name->name[0] == '\0' && flags == AT_EMPTY_PATH) {
+		file = get_mm_exe_file(current->mm);
+		if (!file) {
+			file = ERR_PTR(-ENOENT);
+			goto out;
+		}
+	} else {
+		struct open_flags open_exec_flags = {
+			.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+			.acc_mode = MAY_EXEC,
+			.intent = LOOKUP_OPEN,
+			.lookup_flags = LOOKUP_FOLLOW,
+		};
+
+		if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+			return ERR_PTR(-EINVAL);
+		if (flags & AT_SYMLINK_NOFOLLOW)
+			open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
+		if (flags & AT_EMPTY_PATH)
+			open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+
+		file = do_filp_open(fd, name, &open_exec_flags);
+		if (IS_ERR(file))
+			goto out;
+	}
 
 	err = -EACCES;
 	if (!S_ISREG(file_inode(file)->i_mode))


^ permalink raw reply

* Re: execve(NULL, argv, envp) for nommu?
From: Alan Cox @ 2017-09-13 19:33 UTC (permalink / raw)
  To: Rob Landley
  Cc: Geert Uytterhoeven, Oleg Nesterov, Linux Embedded, Rich Felker,
	linux-kernel@vger.kernel.org
In-Reply-To: <bb73820f-745b-e55b-57f6-10b3de0f801b@landley.net>

> Ok, I'll bite. How do you set a signal handler under this regime, since
> that needs to pass a function pointer to the syscall? Have a different
> function pointer type for when you want a real pointer instead of an
> offset pointer? Perhaps label them "near" and "far" pointers, since
> there's precedent for that back under DOS?

A function pointer is an offset relative to the base of the code (but the
other comments are mostly valid)

For most hardware it's cheaper to just do it the way Minix did,
especially as all the hard work in being able to share code and
copy/migrate data happens to have been done in order to make XIP work. A
modern CPU can copy memory a lot faster than an 8MHz 68K which couldn't
even manage to move 16bits/clock.

> You're also requiring static linking: shared libraries work just fine
> with fdpic, but under your segment:offset addressing system all text has
> to be relative to the same code segment.

No - see the Windows 16bit approach to this. Bring a bucket though 8)

> Plus there's still the "fork() off of mozilla" problem that you may copy
> lots of data just to immediately discard it as the common case (unless
> you'd still use vfork() for most things), and you still need contiguous
> blocks of memory for each segment (nommu is vulnerable to fragmentation,
> increasingly so as the system stays up longer) so your fork() will fail
> where vfork() succeeds. But that just makes it really slow and

If you just do copies and scheduling time swaps of memory blocks then
fragmentation isn't a problem because you can fragment the copy not
currently running. In fact you can (as MAPUX did) extend this to
completely kill the fragmentation problem at the cost of turning
sustained high memory usage with few process deaths into very poor
performance. MAPUX algorithm works very hard to keep stuff unfragmented
but is prepared to move chunks of other processes temporarily around in
order to keep the running process where it should be. In effect it
implements a software paged MMU with an allocator that tries to achieve a
1:1 mapping of the virt/phys of the process.

POSIX tries to side step all of this by providing a combined fork/mess
with file handles of child etc/execve function (posix_spawn) that an
MMUless system can implement to provide the usual functionalities of
fork() / execve() like handle redirection. There are also other ways to
implement that with threads not sharing file handles if you have enough
thread capability (something posix spawn can't assume).
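A minimal sketch of the posix_spawn() route, showing handle redirection without fork(). spawn_to_file() is a made-up helper name; the posix_spawn calls are the standard POSIX ones. The file action is recorded in the parent and carried out in the child before the new program starts.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Spawn a program with its stdout redirected to outfile.
 * Returns the child's exit status, or -1 on any failure.
 */
static int spawn_to_file(const char *path, char *const argv[],
			 const char *outfile)
{
	posix_spawn_file_actions_t fa;
	pid_t pid;
	int status, err;

	if (posix_spawn_file_actions_init(&fa))
		return -1;
	/* In the child: open outfile as fd 1 before the program runs. */
	err = posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, outfile,
					       O_WRONLY | O_CREAT | O_TRUNC,
					       0644);
	if (!err)
		err = posix_spawn(&pid, path, &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	if (err)
		return -1;
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Anything that cannot be expressed as a file action or spawn attribute still needs a helper program or the vfork() dance, which is where this thread started.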

Alan

^ permalink raw reply

* A Quick Survey on a Patch Propagation Tool
From: Aravind Machiry @ 2017-09-19 20:07 UTC (permalink / raw)
  To: linux-embedded

Hi developers and maintainers,

We (researchers from UC Santa Barbara) are developing a tool that will
help in propagating patches.

It would be great if you could fill out a two-question anonymous
survey: https://goo.gl/forms/5cBSx4axKmc8BEtA3

Would you be interested in a tool that identifies patches that could
be imported with minimal or even no testing?
E.g. security patches: you can import them as they usually
do not affect the functionality.

This tool would use only old source file and the new source file!! No
commit messages, no build setup, nothing!!
Something like: git saferebase?

We actually used the tool on the Linux Main Line repository and it did
identify several (60%) patches which are safe to port or do not affect
the functionality.

This tool could be used to import patches from the main source branch
to your branch without worrying about testing them.

You can also use this tool as a patch monitor, which monitors all
commits to a repository and informs you about patches that do not
affect the functionality or are otherwise safe.

Note that this tool is supposed to *help* an expert, not to *replace* one.

Thank You,
Aravind

^ permalink raw reply

* [PATCH v4 0/5] cramfs refresh for embedded usage
From: Nicolas Pitre @ 2017-09-27 23:32 UTC (permalink / raw)
  To: Alexander Viro, linux-mm
  Cc: linux-fsdevel, linux-embedded, linux-kernel, Chris Brandt

To memory management people: please review patch #4 of this series.

This series brings a nice refresh to the cramfs filesystem, adding the
following capabilities:

- Direct memory access, bypassing the block and/or MTD layers entirely.

- Ability to store individual data blocks uncompressed.

- Ability to locate individual data blocks anywhere in the filesystem.

The end result is a very tight filesystem that can be accessed directly
from ROM without any other subsystem underneath. This also allows for
user space XIP which is a very important feature for tiny embedded
systems.

This series is also available based on v4.13 via git here:

  http://git.linaro.org/people/nicolas.pitre/linux xipcramfs

Why cramfs?

  Because cramfs is very simple and small. With CONFIG_CRAMFS_BLOCK=n and
  CONFIG_CRAMFS_PHYSMEM=y the cramfs driver may use as little as 3704 bytes
  of code. That's many times smaller than squashfs. And the runtime memory
  usage is also much less with cramfs than squashfs. It packs very tightly
  already compared to romfs which has no compression support. And the cramfs
  format was simple to extend, allowing for both compressed and uncompressed
  blocks within the same file.

Why not access ROM via MTD?

  The MTD layer is nice and flexible. It also represents a huge overhead
  considering its core with no other enabled options weighs 19KB.
  That's many times the size of the cramfs code for something that
  essentially boils down to a glorified argument parser and a call to
  memremap() in this case.  And if someone still wants to use cramfs via
  MTD then it is already possible with mtdblock.

Why not use DAX?

  DAX stands for "Direct Access" and is a generic kernel layer helping
  with the necessary tasks involved with XIP. It is tailored for large
  writable filesystems and relies on the presence of an MMU. It also has
  the following shortcoming: "The DAX code does not work correctly on
  architectures which have virtually mapped caches such as ARM, MIPS and
  SPARC." That makes it unsuitable for a large portion of the intended
  targets for this series. And due to the read-only nature of cramfs, it is
  possible to achieve the intended result with a much simpler approach making
  DAX somewhat overkill in this context.

The maximum size of a cramfs image can't exceed 272MB. In practice it is
likely to be much much less. Given this series is concerned with small
memory systems, even in the MMU case there is always plenty of vmalloc
space left to map it all and even a 272MB memremap() wouldn't be a
problem. If it is then maybe your system is big enough with large
resources to manage already and you're pretty unlikely to be using cramfs
in the first place.
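For reference, the 272MB figure can be reproduced from the on-disk field widths. The sketch below reflects my reading of include/uapi/linux/cramfs_fs.h (a 26-bit data offset counted in 4-byte units and a 24-bit file size) and is not spelled out elsewhere in this cover letter; cramfs_max_image_bytes() is a made-up name.

```c
#include <stdint.h>

/* Field widths as read from include/uapi/linux/cramfs_fs.h (assumption). */
#define CRAMFS_SIZE_WIDTH	24	/* file size, in bytes */
#define CRAMFS_OFFSET_WIDTH	26	/* data offset, in 4-byte units */

/* Upper bound on the image size: highest offset plus the largest file. */
static uint64_t cramfs_max_image_bytes(void)
{
	uint64_t max_offset = (uint64_t)1 << (CRAMFS_OFFSET_WIDTH + 2);
	uint64_t max_file = (uint64_t)1 << CRAMFS_SIZE_WIDTH;

	return max_offset + max_file;
}
```

That is 256MB of addressable offsets plus one last file of up to 16MB extending past that point.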

Of course, while this cramfs remains backward compatible with existing
filesystem images, a newer mkcramfs version is necessary to take advantage
of the extended data layout. I created a version of mkcramfs that
detects ELF files and marks text+rodata segments for XIP and compresses the
rest of those ELF files automatically.

So here it is. I'm also willing to step up as cramfs maintainer given
that no sign of any maintenance activities appeared for years.

Changes from v3:

- Rebased on v4.13.
- Made direct access depend on cramfs not being modular due to unexported
  vma handling functions.
- Solicit comments from mm people explicitly.

Changes from v2:

- Plugged a few races in cramfs_vmasplit_fault(). Thanks to Al Viro for
  highlighting them.
- Fixed some checkpatch warnings

Changes from v1:

- Improved mmap() support by adding the ability to partially populate a
  mapping and lazily split the non directly mappable pages to a separate
  vma at fault time (thanks to Chris Brandt for testing).
- Clarified the documentation some more.

diffstat:

 Documentation/filesystems/cramfs.txt |  42 ++
 MAINTAINERS                          |   4 +-
 fs/cramfs/Kconfig                    |  38 +-
 fs/cramfs/README                     |  31 +-
 fs/cramfs/inode.c                    | 646 ++++++++++++++++++++++++++---
 include/uapi/linux/cramfs_fs.h       |  20 +-
 init/do_mounts.c                     |   8 +
 7 files changed, 712 insertions(+), 77 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* [PATCH v4 1/5] cramfs: direct memory access support
From: Nicolas Pitre @ 2017-09-27 23:32 UTC (permalink / raw)
  To: Alexander Viro, linux-mm
  Cc: linux-fsdevel, linux-embedded, linux-kernel, Chris Brandt
In-Reply-To: <20170927233224.31676-1-nicolas.pitre@linaro.org>

Small embedded systems typically execute the kernel code in place (XIP)
directly from flash to save on precious RAM usage. This adds the ability
to consume filesystem data directly from flash to the cramfs filesystem
as well. Cramfs is particularly well suited to this feature as it is
very simple and its RAM usage is already very low, and with this feature
it is possible to use it with no block device support and even lower RAM
usage.

This patch was inspired by a similar patch from Shane Nay dated 17 years
ago that used to be very popular in embedded circles but never made it
into mainline. This is a cleaned-up implementation that uses far fewer
memory address at run time when both methods are configured in. In the
context of small IoT deployments, this functionality has become relevant
and useful again.

To distinguish between both access types, the cramfs_physmem filesystem
type must be specified when using a memory accessible cramfs image, and
the physaddr argument must provide the actual filesystem image's physical
memory location.
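From user space, the mount described above might be issued as in the sketch below. The filesystem type and the physaddr option come from this patch; the helper names and the 32-byte option buffer are arbitrary choices for illustration, and the address passed in is entirely board dependent.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Build the "physaddr=0x..." option string; returns 0 on success. */
int cramfs_physaddr_opt(char *buf, size_t len, unsigned long long physaddr)
{
	int n = snprintf(buf, len, "physaddr=0x%llx", physaddr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount a memory-mapped cramfs image located at physaddr on target. */
int mount_cramfs_physmem(const char *target, unsigned long long physaddr)
{
	char opts[32];

	if (cramfs_physaddr_opt(opts, sizeof(opts), physaddr))
		return -1;
	/* No block device is involved, so the source is a dummy "none". */
	return mount("none", target, "cramfs_physmem", MS_RDONLY, opts);
}
```

Calling mount_cramfs_physmem("/mnt", 0x100000) corresponds to the command line given in the Kconfig help text below.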

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
---
 fs/cramfs/Kconfig |  29 +++++-
 fs/cramfs/inode.c | 264 +++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 241 insertions(+), 52 deletions(-)

diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
index 11b29d491b..5b4e0b7e13 100644
--- a/fs/cramfs/Kconfig
+++ b/fs/cramfs/Kconfig
@@ -1,6 +1,5 @@
 config CRAMFS
 	tristate "Compressed ROM file system support (cramfs) (OBSOLETE)"
-	depends on BLOCK
 	select ZLIB_INFLATE
 	help
 	  Saying Y here includes support for CramFs (Compressed ROM File
@@ -20,3 +19,31 @@ config CRAMFS
 	  in terms of performance and features.
 
 	  If unsure, say N.
+
+config CRAMFS_BLOCKDEV
+	bool "Support CramFs image over a regular block device" if EXPERT
+	depends on CRAMFS && BLOCK
+	default y
+	help
+	  This option allows the CramFs driver to load data from a regular
+	  block device such as a disk partition or a ramdisk.
+
+config CRAMFS_PHYSMEM
+	bool "Support CramFs image directly mapped in physical memory"
+	depends on CRAMFS
+	default y if !CRAMFS_BLOCKDEV
+	help
+	  This option allows the CramFs driver to load data directly from
+	  a linearly addressed memory range (usually non volatile memory
+	  like flash) instead of going through the block device layer.
+	  This saves some memory since no intermediate buffering is
+	  necessary.
+
+	  The filesystem type for this feature is "cramfs_physmem".
+	  The location of the CramFs image in memory is board
+	  dependent. Therefore, if you say Y, you must know the proper
+	  physical address where to store the CramFs image and specify
+	  it using the physaddr=0x******** mount option (for example:
+	  "mount -t cramfs_physmem -o physaddr=0x100000 none /mnt").
+
+	  If unsure, say N.
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 7919967488..19f464a214 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -24,6 +24,7 @@
 #include <linux/mutex.h>
 #include <uapi/linux/cramfs_fs.h>
 #include <linux/uaccess.h>
+#include <linux/io.h>
 
 #include "internal.h"
 
@@ -36,6 +37,8 @@ struct cramfs_sb_info {
 	unsigned long blocks;
 	unsigned long files;
 	unsigned long flags;
+	void *linear_virt_addr;
+	phys_addr_t linear_phys_addr;
 };
 
 static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
@@ -140,6 +143,9 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
  * BLKS_PER_BUF*PAGE_SIZE, so that the caller doesn't need to
  * worry about end-of-buffer issues even when decompressing a full
  * page cache.
+ *
+ * Note: This is all optimized away at compile time when
+ *       CONFIG_CRAMFS_BLOCKDEV=n.
  */
 #define READ_BUFFERS (2)
 /* NEXT_BUFFER(): Loop over [0..(READ_BUFFERS-1)]. */
@@ -160,10 +166,10 @@ static struct super_block *buffer_dev[READ_BUFFERS];
 static int next_buffer;
 
 /*
- * Returns a pointer to a buffer containing at least LEN bytes of
- * filesystem starting at byte offset OFFSET into the filesystem.
+ * Populate our block cache and return a pointer from it.
  */
-static void *cramfs_read(struct super_block *sb, unsigned int offset, unsigned int len)
+static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
+				unsigned int len)
 {
 	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 	struct page *pages[BLKS_PER_BUF];
@@ -239,7 +245,39 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset, unsigned i
 	return read_buffers[buffer] + offset;
 }
 
-static void cramfs_kill_sb(struct super_block *sb)
+/*
+ * Return a pointer to the linearly addressed cramfs image in memory.
+ */
+static void *cramfs_direct_read(struct super_block *sb, unsigned int offset,
+				unsigned int len)
+{
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+	if (!len)
+		return NULL;
+	if (len > sbi->size || offset > sbi->size - len)
+		return page_address(ZERO_PAGE(0));
+	return sbi->linear_virt_addr + offset;
+}
+
+/*
+ * Returns a pointer to a buffer containing at least LEN bytes of
+ * filesystem starting at byte offset OFFSET into the filesystem.
+ */
+static void *cramfs_read(struct super_block *sb, unsigned int offset,
+			 unsigned int len)
+{
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+	if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) && sbi->linear_virt_addr)
+		return cramfs_direct_read(sb, offset, len);
+	else if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV))
+		return cramfs_blkdev_read(sb, offset, len);
+	else
+		return NULL;
+}
+
+static void cramfs_blkdev_kill_sb(struct super_block *sb)
 {
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
 
@@ -247,6 +285,16 @@ static void cramfs_kill_sb(struct super_block *sb)
 	kfree(sbi);
 }
 
+static void cramfs_physmem_kill_sb(struct super_block *sb)
+{
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+	if (sbi->linear_virt_addr)
+		memunmap(sbi->linear_virt_addr);
+	kill_anon_super(sb);
+	kfree(sbi);
+}
+
 static int cramfs_remount(struct super_block *sb, int *flags, char *data)
 {
 	sync_filesystem(sb);
@@ -254,34 +302,24 @@ static int cramfs_remount(struct super_block *sb, int *flags, char *data)
 	return 0;
 }
 
-static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
+static int cramfs_read_super(struct super_block *sb,
+			     struct cramfs_super *super, int silent)
 {
-	int i;
-	struct cramfs_super super;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
 	unsigned long root_offset;
-	struct cramfs_sb_info *sbi;
-	struct inode *root;
-
-	sb->s_flags |= MS_RDONLY;
-
-	sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
-	if (!sbi)
-		return -ENOMEM;
-	sb->s_fs_info = sbi;
 
-	/* Invalidate the read buffers on mount: think disk change.. */
-	mutex_lock(&read_mutex);
-	for (i = 0; i < READ_BUFFERS; i++)
-		buffer_blocknr[i] = -1;
+	/* We don't know the real size yet */
+	sbi->size = PAGE_SIZE;
 
 	/* Read the first block and get the superblock from it */
-	memcpy(&super, cramfs_read(sb, 0, sizeof(super)), sizeof(super));
+	mutex_lock(&read_mutex);
+	memcpy(super, cramfs_read(sb, 0, sizeof(*super)), sizeof(*super));
 	mutex_unlock(&read_mutex);
 
 	/* Do sanity checks on the superblock */
-	if (super.magic != CRAMFS_MAGIC) {
+	if (super->magic != CRAMFS_MAGIC) {
 		/* check for wrong endianness */
-		if (super.magic == CRAMFS_MAGIC_WEND) {
+		if (super->magic == CRAMFS_MAGIC_WEND) {
 			if (!silent)
 				pr_err("wrong endianness\n");
 			return -EINVAL;
@@ -289,10 +327,10 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
 
 		/* check at 512 byte offset */
 		mutex_lock(&read_mutex);
-		memcpy(&super, cramfs_read(sb, 512, sizeof(super)), sizeof(super));
+		memcpy(super, cramfs_read(sb, 512, sizeof(*super)), sizeof(*super));
 		mutex_unlock(&read_mutex);
-		if (super.magic != CRAMFS_MAGIC) {
-			if (super.magic == CRAMFS_MAGIC_WEND && !silent)
+		if (super->magic != CRAMFS_MAGIC) {
+			if (super->magic == CRAMFS_MAGIC_WEND && !silent)
 				pr_err("wrong endianness\n");
 			else if (!silent)
 				pr_err("wrong magic\n");
@@ -301,34 +339,34 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	/* get feature flags first */
-	if (super.flags & ~CRAMFS_SUPPORTED_FLAGS) {
+	if (super->flags & ~CRAMFS_SUPPORTED_FLAGS) {
 		pr_err("unsupported filesystem features\n");
 		return -EINVAL;
 	}
 
 	/* Check that the root inode is in a sane state */
-	if (!S_ISDIR(super.root.mode)) {
+	if (!S_ISDIR(super->root.mode)) {
 		pr_err("root is not a directory\n");
 		return -EINVAL;
 	}
 	/* correct strange, hard-coded permissions of mkcramfs */
-	super.root.mode |= (S_IRUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
+	super->root.mode |= (S_IRUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
 
-	root_offset = super.root.offset << 2;
-	if (super.flags & CRAMFS_FLAG_FSID_VERSION_2) {
-		sbi->size = super.size;
-		sbi->blocks = super.fsid.blocks;
-		sbi->files = super.fsid.files;
+	root_offset = super->root.offset << 2;
+	if (super->flags & CRAMFS_FLAG_FSID_VERSION_2) {
+		sbi->size = super->size;
+		sbi->blocks = super->fsid.blocks;
+		sbi->files = super->fsid.files;
 	} else {
 		sbi->size = 1<<28;
 		sbi->blocks = 0;
 		sbi->files = 0;
 	}
-	sbi->magic = super.magic;
-	sbi->flags = super.flags;
+	sbi->magic = super->magic;
+	sbi->flags = super->flags;
 	if (root_offset == 0)
 		pr_info("empty filesystem");
-	else if (!(super.flags & CRAMFS_FLAG_SHIFTED_ROOT_OFFSET) &&
+	else if (!(super->flags & CRAMFS_FLAG_SHIFTED_ROOT_OFFSET) &&
 		 ((root_offset != sizeof(struct cramfs_super)) &&
 		  (root_offset != 512 + sizeof(struct cramfs_super))))
 	{
@@ -336,9 +374,18 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
 		return -EINVAL;
 	}
 
+	return 0;
+}
+
+static int cramfs_finalize_super(struct super_block *sb,
+				 struct cramfs_inode *cramfs_root)
+{
+	struct inode *root;
+
 	/* Set it all up.. */
+	sb->s_flags |= MS_RDONLY;
 	sb->s_op = &cramfs_ops;
-	root = get_cramfs_inode(sb, &super.root, 0);
+	root = get_cramfs_inode(sb, cramfs_root, 0);
 	if (IS_ERR(root))
 		return PTR_ERR(root);
 	sb->s_root = d_make_root(root);
@@ -347,6 +394,92 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
 	return 0;
 }
 
+static int cramfs_blkdev_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct cramfs_sb_info *sbi;
+	struct cramfs_super super;
+	int i, err;
+
+	sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+	sb->s_fs_info = sbi;
+
+	/* Invalidate the read buffers on mount: think disk change.. */
+	for (i = 0; i < READ_BUFFERS; i++)
+		buffer_blocknr[i] = -1;
+
+	err = cramfs_read_super(sb, &super, silent);
+	if (err)
+		return err;
+	return cramfs_finalize_super(sb, &super.root);
+}
+
+static int cramfs_physmem_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct cramfs_sb_info *sbi;
+	struct cramfs_super super;
+	char *p;
+	int err;
+
+	sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+	sb->s_fs_info = sbi;
+
+	/*
+	 * The physical location of the cramfs image is specified as
+	 * a mount parameter.  This parameter is mandatory for obvious
+	 * reasons.  Some validation is made on the phys address but this
+	 * is not exhaustive and we count on the fact that someone using
+	 * this feature is supposed to know what he/she's doing.
+	 */
+	if (!data || !(p = strstr((char *)data, "physaddr="))) {
+		pr_err("unknown physical address for linear cramfs image\n");
+		return -EINVAL;
+	}
+	sbi->linear_phys_addr = memparse(p + 9, NULL);
+	if (!sbi->linear_phys_addr) {
+		pr_err("bad value for cramfs image physical address\n");
+		return -EINVAL;
+	}
+	if (sbi->linear_phys_addr & (PAGE_SIZE-1)) {
+		pr_err("physical address %pap for linear cramfs isn't aligned to a page boundary\n",
+			&sbi->linear_phys_addr);
+		return -EINVAL;
+	}
+
+	/*
+	 * Map only one page for now.  Will remap it when fs size is known.
+	 * Although we'll only read from it, we want the CPU cache to
+	 * kick in for the higher throughput it provides, hence MEMREMAP_WB.
+	 */
+	pr_info("checking physical address %pap for linear cramfs image\n", &sbi->linear_phys_addr);
+	sbi->linear_virt_addr = memremap(sbi->linear_phys_addr, PAGE_SIZE,
+					 MEMREMAP_WB);
+	if (!sbi->linear_virt_addr) {
+		pr_err("ioremap of the linear cramfs image failed\n");
+		return -ENOMEM;
+	}
+
+	err = cramfs_read_super(sb, &super, silent);
+	if (err)
+		return err;
+
+	/* Remap the whole filesystem now */
+	pr_info("linear cramfs image appears to be %lu KB in size\n",
+		sbi->size/1024);
+	memunmap(sbi->linear_virt_addr);
+	sbi->linear_virt_addr = memremap(sbi->linear_phys_addr, sbi->size,
+					 MEMREMAP_WB);
+	if (!sbi->linear_virt_addr) {
+		pr_err("ioremap of the linear cramfs image failed\n");
+		return -ENOMEM;
+	}
+
+	return cramfs_finalize_super(sb, &super.root);
+}
+
 static int cramfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct super_block *sb = dentry->d_sb;
@@ -573,38 +706,67 @@ static const struct super_operations cramfs_ops = {
 	.statfs		= cramfs_statfs,
 };
 
-static struct dentry *cramfs_mount(struct file_system_type *fs_type,
-	int flags, const char *dev_name, void *data)
+static struct dentry *cramfs_blkdev_mount(struct file_system_type *fs_type,
+				int flags, const char *dev_name, void *data)
+{
+	return mount_bdev(fs_type, flags, dev_name, data, cramfs_blkdev_fill_super);
+}
+
+static struct dentry *cramfs_physmem_mount(struct file_system_type *fs_type,
+				int flags, const char *dev_name, void *data)
 {
-	return mount_bdev(fs_type, flags, dev_name, data, cramfs_fill_super);
+	return mount_nodev(fs_type, flags, data, cramfs_physmem_fill_super);
 }
 
 static struct file_system_type cramfs_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "cramfs",
-	.mount		= cramfs_mount,
-	.kill_sb	= cramfs_kill_sb,
+	.mount		= cramfs_blkdev_mount,
+	.kill_sb	= cramfs_blkdev_kill_sb,
 	.fs_flags	= FS_REQUIRES_DEV,
 };
+
+static struct file_system_type cramfs_physmem_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "cramfs_physmem",
+	.mount		= cramfs_physmem_mount,
+	.kill_sb	= cramfs_physmem_kill_sb,
+};
+
+#ifdef CONFIG_CRAMFS_BLOCKDEV
 MODULE_ALIAS_FS("cramfs");
+#endif
+#ifdef CONFIG_CRAMFS_PHYSMEM
+MODULE_ALIAS_FS("cramfs_physmem");
+#endif
 
 static int __init init_cramfs_fs(void)
 {
 	int rv;
 
-	rv = cramfs_uncompress_init();
-	if (rv < 0)
-		return rv;
-	rv = register_filesystem(&cramfs_fs_type);
-	if (rv < 0)
-		cramfs_uncompress_exit();
-	return rv;
+	if ((rv = cramfs_uncompress_init()) < 0)
+		goto err0;
+	if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV) &&
+	    (rv = register_filesystem(&cramfs_fs_type)) < 0)
+		goto err1;
+	if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) &&
+	    (rv = register_filesystem(&cramfs_physmem_fs_type)) < 0)
+		goto err2;
+	return 0;
+
+err2:	if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV))
+		unregister_filesystem(&cramfs_fs_type);
+err1:	cramfs_uncompress_exit();
+err0:	return rv;
 }
 
 static void __exit exit_cramfs_fs(void)
 {
 	cramfs_uncompress_exit();
-	unregister_filesystem(&cramfs_fs_type);
+	if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV))
+		unregister_filesystem(&cramfs_fs_type);
+	if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM))
+		unregister_filesystem(&cramfs_physmem_fs_type);
 }
 
 module_init(init_cramfs_fs)
-- 
2.9.5
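
A note on the bounds check in cramfs_direct_read() above: it is written as
"len > size || offset > size - len" rather than "offset + len > size",
presumably so that the unsigned addition cannot wrap around. Below is a
minimal user-space sketch of the same test. The helper name range_in_image()
is made up for illustration and is not part of the patch.

```c
#include <assert.h>	/* assert() for the usage checks in the note below */
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true when [offset, offset + len) lies
 * entirely inside an image of `size` bytes. The subtraction is only
 * evaluated once len <= size is known, so it cannot underflow, and
 * offset + len is never computed, so it cannot wrap.
 */
static bool range_in_image(uint32_t size, uint32_t offset, uint32_t len)
{
	if (len > size)
		return false;
	return offset <= size - len;
}
```

With a 4096-byte image, range_in_image(4096, 0xFFFFFFF0u, 0x20) is rejected,
whereas the naive "offset + len <= size" form would wrap to 0x10 and accept it.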

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v4 2/5] cramfs: make cramfs_physmem usable as root fs
From: Nicolas Pitre @ 2017-09-27 23:32 UTC (permalink / raw)
  To: Alexander Viro, linux-mm
  Cc: linux-fsdevel, linux-embedded, linux-kernel, Chris Brandt
In-Reply-To: <20170927233224.31676-1-nicolas.pitre@linaro.org>

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
---
 init/do_mounts.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/init/do_mounts.c b/init/do_mounts.c
index c2de5104aa..43b5817f60 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -556,6 +556,14 @@ void __init prepare_namespace(void)
 		ssleep(root_delay);
 	}
 
+	if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) && root_fs_names &&
+	    !strcmp(root_fs_names, "cramfs_physmem")) {
+		int err = do_mount_root("cramfs", "cramfs_physmem",
+					root_mountflags, root_mount_data);
+		if (!err)
+			goto out;
+	}
+
 	/*
 	 * wait for the known devices to complete their probing
 	 *
-- 
2.9.5


^ permalink raw reply related

* [PATCH v4 3/5] cramfs: implement uncompressed and arbitrary data block positioning
From: Nicolas Pitre @ 2017-09-27 23:32 UTC (permalink / raw)
  To: Alexander Viro, linux-mm
  Cc: linux-fsdevel, linux-embedded, linux-kernel, Chris Brandt
In-Reply-To: <20170927233224.31676-1-nicolas.pitre@linaro.org>

Two new capabilities are introduced here:

- The ability to store some blocks uncompressed.

- The ability to locate blocks anywhere.

Those capabilities can be used independently, but the combination
opens the possibility for execute-in-place (XIP) of program text segments
that must remain uncompressed, and in the MMU case, must have a specific
alignment.  It is even possible to still have the writable data segments
from the same file compressed as they have to be copied into RAM anyway.

This is achieved by giving special meanings to some unused block pointer
bits while remaining compatible with legacy cramfs images.
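
As an illustration of the encoding described above, here is a minimal
user-space sketch that splits a raw 32-bit block pointer into its flags and
byte offset. The flag values are the ones this patch adds to cramfs_fs.h,
written with an unsigned 1u to keep the shift well defined outside the
kernel. The struct blkptr_info type and the decode_blkptr() helper are
hypothetical names used only for this example.

```c
#include <assert.h>	/* assert() for the usage checks in the note below */
#include <stdbool.h>
#include <stdint.h>

/* Flag values as introduced by this patch (unsigned here by assumption). */
#define CRAMFS_BLK_FLAG_UNCOMPRESSED	(1u << 31)
#define CRAMFS_BLK_FLAG_DIRECT_PTR	(1u << 30)
#define CRAMFS_BLK_FLAGS		(CRAMFS_BLK_FLAG_UNCOMPRESSED | \
					 CRAMFS_BLK_FLAG_DIRECT_PTR)

/* Hypothetical helper type, not part of the kernel code. */
struct blkptr_info {
	bool uncompressed;	/* block data is stored verbatim */
	bool direct;		/* value is an absolute start, not an end */
	uint32_t offset;	/* bytes: block start if direct, else block end */
};

static struct blkptr_info decode_blkptr(uint32_t raw)
{
	struct blkptr_info info;
	uint32_t value = raw & ~CRAMFS_BLK_FLAGS;

	info.uncompressed = (raw & CRAMFS_BLK_FLAG_UNCOMPRESSED) != 0;
	info.direct = (raw & CRAMFS_BLK_FLAG_DIRECT_PTR) != 0;
	/* Direct pointers are stored shifted right by 2 bits. */
	info.offset = info.direct ? value << 2 : value;
	return info;
}
```

For example, decode_blkptr(0xC0000400u) describes an uncompressed direct
block starting at byte 0x1000, while a legacy pointer such as 0x00001234
decodes with both flags clear and is an end offset, as before.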

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
---
 fs/cramfs/README               | 31 ++++++++++++++-
 fs/cramfs/inode.c              | 87 +++++++++++++++++++++++++++++++++---------
 include/uapi/linux/cramfs_fs.h | 20 +++++++++-
 3 files changed, 118 insertions(+), 20 deletions(-)

diff --git a/fs/cramfs/README b/fs/cramfs/README
index 9d4e7ea311..d71b27e0ff 100644
--- a/fs/cramfs/README
+++ b/fs/cramfs/README
@@ -49,17 +49,46 @@ same as the start of the (i+1)'th <block> if there is one).  The first
 <block> immediately follows the last <block_pointer> for the file.
 <block_pointer>s are each 32 bits long.
 
+When the CRAMFS_FLAG_EXT_BLOCK_POINTERS capability bit is set, each
+<block_pointer>'s top bits may contain special flags as follows:
+
+CRAMFS_BLK_FLAG_UNCOMPRESSED (bit 31):
+	The block data is not compressed and should be copied verbatim.
+
+CRAMFS_BLK_FLAG_DIRECT_PTR (bit 30):
+	The <block_pointer> stores the actual block start offset and not
+	its end, shifted right by 2 bits. The block must therefore be
+	aligned to a 4-byte boundary. The block size is blksize if
+	CRAMFS_BLK_FLAG_UNCOMPRESSED is also specified; otherwise the
+	compressed data length is stored in the first 2 bytes of the
+	block data. This is used to allow discontiguous data layout
+	and specific data block alignments e.g. for XIP applications.
+
+
 The order of <file_data>'s is a depth-first descent of the directory
 tree, i.e. the same order as `find -size +0 \( -type f -o -type l \)
 -print'.
 
 
 <block>: The i'th <block> is the output of zlib's compress function
-applied to the i'th blksize-sized chunk of the input data.
+applied to the i'th blksize-sized chunk of the input data if the
+corresponding CRAMFS_BLK_FLAG_UNCOMPRESSED <block_pointer> bit is not set,
+otherwise it is the input data directly.
 (For the last <block> of the file, the input may of course be smaller.)
 Each <block> may be a different size.  (See <block_pointer> above.)
+
 <block>s are merely byte-aligned, not generally u32-aligned.
 
+When CRAMFS_BLK_FLAG_DIRECT_PTR is specified then the corresponding
+<block> may be located anywhere and not necessarily contiguous with
+the previous/next blocks. In that case it is minimally u32-aligned.
+If CRAMFS_BLK_FLAG_UNCOMPRESSED is also specified then the size is always
+blksize except for the last block which is limited by the file length.
+If CRAMFS_BLK_FLAG_DIRECT_PTR is set and CRAMFS_BLK_FLAG_UNCOMPRESSED
+is not set then the first 2 bytes of the block contain the size of the
+remaining block data as this cannot be determined from the placement of
+logically adjacent blocks.
+
 
 Holes
 -----
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 19f464a214..2fc886092b 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -636,33 +636,84 @@ static int cramfs_readpage(struct file *file, struct page *page)
 	if (page->index < maxblock) {
 		struct super_block *sb = inode->i_sb;
 		u32 blkptr_offset = OFFSET(inode) + page->index*4;
-		u32 start_offset, compr_len;
+		u32 block_ptr, block_start, block_len;
+		bool uncompressed, direct;
 
-		start_offset = OFFSET(inode) + maxblock*4;
 		mutex_lock(&read_mutex);
-		if (page->index)
-			start_offset = *(u32 *) cramfs_read(sb, blkptr_offset-4,
-				4);
-		compr_len = (*(u32 *) cramfs_read(sb, blkptr_offset, 4) -
-			start_offset);
-		mutex_unlock(&read_mutex);
+		block_ptr = *(u32 *) cramfs_read(sb, blkptr_offset, 4);
+		uncompressed = (block_ptr & CRAMFS_BLK_FLAG_UNCOMPRESSED);
+		direct = (block_ptr & CRAMFS_BLK_FLAG_DIRECT_PTR);
+		block_ptr &= ~CRAMFS_BLK_FLAGS;
+
+		if (direct) {
+			/*
+			 * The block pointer is an absolute start pointer,
+			 * shifted by 2 bits. The size is included in the
+			 * first 2 bytes of the data block when compressed,
+			 * or PAGE_SIZE otherwise.
+			 */
+			block_start = block_ptr << 2;
+			if (uncompressed) {
+				block_len = PAGE_SIZE;
+				/* if last block: cap to file length */
+				if (page->index == maxblock - 1)
+					block_len = offset_in_page(inode->i_size);
+			} else {
+				block_len = *(u16 *)
+					cramfs_read(sb, block_start, 2);
+				block_start += 2;
+			}
+		} else {
+			/*
+			 * The block pointer indicates one past the end of
+			 * the current block (start of next block). If this
+			 * is the first block then it starts where the block
+			 * pointer table ends, otherwise its start comes
+			 * from the previous block's pointer.
+			 */
+			block_start = OFFSET(inode) + maxblock*4;
+			if (page->index)
+				block_start = *(u32 *)
+					cramfs_read(sb, blkptr_offset-4, 4);
+			/* Beware... previous ptr might be a direct ptr */
+			if (unlikely(block_start & CRAMFS_BLK_FLAG_DIRECT_PTR)) {
+				/* See comments on earlier code. */
+				u32 prev_start = block_start;
+				block_start = prev_start & ~CRAMFS_BLK_FLAGS;
+				block_start <<= 2;
+				if (prev_start & CRAMFS_BLK_FLAG_UNCOMPRESSED) {
+					block_start += PAGE_SIZE;
+				} else {
+					block_len = *(u16 *)
+						cramfs_read(sb, block_start, 2);
+					block_start += 2 + block_len;
+				}
+			}
+			block_start &= ~CRAMFS_BLK_FLAGS;
+			block_len = block_ptr - block_start;
+		}
 
-		if (compr_len == 0)
+		if (block_len == 0)
 			; /* hole */
-		else if (unlikely(compr_len > (PAGE_SIZE << 1))) {
-			pr_err("bad compressed blocksize %u\n",
-				compr_len);
+		else if (unlikely(block_len > 2*PAGE_SIZE ||
+				  (uncompressed && block_len > PAGE_SIZE))) {
+			mutex_unlock(&read_mutex);
+			pr_err("bad data blocksize %u\n", block_len);
 			goto err;
+		} else if (uncompressed) {
+			memcpy(pgdata,
+			       cramfs_read(sb, block_start, block_len),
+			       block_len);
+			bytes_filled = block_len;
 		} else {
-			mutex_lock(&read_mutex);
 			bytes_filled = cramfs_uncompress_block(pgdata,
 				 PAGE_SIZE,
-				 cramfs_read(sb, start_offset, compr_len),
-				 compr_len);
-			mutex_unlock(&read_mutex);
-			if (unlikely(bytes_filled < 0))
-				goto err;
+				 cramfs_read(sb, block_start, block_len),
+				 block_len);
 		}
+		mutex_unlock(&read_mutex);
+		if (unlikely(bytes_filled < 0))
+			goto err;
 	}
 
 	memset(pgdata + bytes_filled, 0, PAGE_SIZE - bytes_filled);
diff --git a/include/uapi/linux/cramfs_fs.h b/include/uapi/linux/cramfs_fs.h
index e4611a9b92..c7a7883fab 100644
--- a/include/uapi/linux/cramfs_fs.h
+++ b/include/uapi/linux/cramfs_fs.h
@@ -73,6 +73,7 @@ struct cramfs_super {
 #define CRAMFS_FLAG_HOLES		0x00000100	/* support for holes */
 #define CRAMFS_FLAG_WRONG_SIGNATURE	0x00000200	/* reserved */
 #define CRAMFS_FLAG_SHIFTED_ROOT_OFFSET	0x00000400	/* shifted root fs */
+#define CRAMFS_FLAG_EXT_BLOCK_POINTERS	0x00000800	/* block pointer extensions */
 
 /*
  * Valid values in super.flags.  Currently we refuse to mount
@@ -82,7 +83,24 @@ struct cramfs_super {
 #define CRAMFS_SUPPORTED_FLAGS	( 0x000000ff \
 				| CRAMFS_FLAG_HOLES \
 				| CRAMFS_FLAG_WRONG_SIGNATURE \
-				| CRAMFS_FLAG_SHIFTED_ROOT_OFFSET )
+				| CRAMFS_FLAG_SHIFTED_ROOT_OFFSET \
+				| CRAMFS_FLAG_EXT_BLOCK_POINTERS )
 
+/*
+ * Block pointer flags
+ *
+ * The maximum block offset that needs to be represented is roughly:
+ *
+ *   (1 << CRAMFS_OFFSET_WIDTH) * 4 +
+ *   (1 << CRAMFS_SIZE_WIDTH) / PAGE_SIZE * (4 + PAGE_SIZE)
+ *   = 0x11004000
+ *
+ * That leaves room for 3 flag bits in the block pointer table.
+ */
+#define CRAMFS_BLK_FLAG_UNCOMPRESSED	(1 << 31)
+#define CRAMFS_BLK_FLAG_DIRECT_PTR	(1 << 30)
+
+#define CRAMFS_BLK_FLAGS	( CRAMFS_BLK_FLAG_UNCOMPRESSED \
+				| CRAMFS_BLK_FLAG_DIRECT_PTR )
 
 #endif /* _UAPI__CRAMFS_H */
-- 
2.9.5

^ permalink raw reply related

* [PATCH v4 4/5] cramfs: add mmap support
From: Nicolas Pitre @ 2017-09-27 23:32 UTC (permalink / raw)
  To: Alexander Viro, linux-mm
  Cc: linux-fsdevel, linux-embedded, linux-kernel, Chris Brandt
In-Reply-To: <20170927233224.31676-1-nicolas.pitre@linaro.org>

When cramfs_physmem is used, we have the opportunity to map files
directly from ROM into user space, saving on RAM usage.
This gives us Execute-In-Place (XIP) support.

For a file to be mmap()-able, the map area has to correspond to a range
of uncompressed and contiguous blocks, and in the MMU case it also has
to be page aligned. A version of mkcramfs with appropriate support is
necessary to create such a filesystem image.
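
To make the "uncompressed and contiguous" requirement concrete, here is a
simplified user-space restatement of the check that cramfs_get_block_range()
performs in this patch. It assumes 4096-byte pages and takes the block
pointer table as a plain array. The name contiguous_range() and the
PAGE_SIZE_ASSUMED constant are made up for this sketch. Unlike the kernel
function, it also sets *pages to 0 when the first pointer does not qualify.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED		4096u	/* assumption for this sketch */
#define CRAMFS_BLK_FLAG_UNCOMPRESSED	(1u << 31)
#define CRAMFS_BLK_FLAG_DIRECT_PTR	(1u << 30)
#define CRAMFS_BLK_FLAGS		(CRAMFS_BLK_FLAG_UNCOMPRESSED | \
					 CRAMFS_BLK_FLAG_DIRECT_PTR)

/*
 * Count how many of the first *pages block pointers are uncompressed
 * direct pointers laid out back to back, one page apart. On return,
 * *pages holds that count. The return value is the byte offset of the
 * first block, or 0 when not even the first pointer qualifies.
 */
static uint32_t contiguous_range(const uint32_t *blockptrs, uint32_t *pages)
{
	uint32_t first, i;

	assert(blockptrs && pages);
	first = blockptrs[0] & ~CRAMFS_BLK_FLAGS;
	for (i = 0; i < *pages; i++) {
		/* Direct pointers are stored >> 2, so one page is PAGE >> 2. */
		uint32_t expect = first + i * (PAGE_SIZE_ASSUMED >> 2);

		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR |
			  CRAMFS_BLK_FLAG_UNCOMPRESSED;
		if (blockptrs[i] != expect)
			break;
	}
	*pages = i;
	return i ? first << 2 : 0;
}
```

Given the table { 0xC0000400, 0xC0000800, 0x00009000 } and a request for
3 pages, the sketch reports 2 mappable pages starting at byte offset 0x1000.
The third entry is a legacy compressed block and ends the run.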

In the MMU case a vma structure may extend beyond the actual file size;
this is notably the case in binfmt_elf.c:elf_map(). The file's last block
may also be shared with other files and therefore cannot be mapped as is.
Rather than refusing to mmap it, we do a partial map and set up
a special vm_ops fault handler that splits the vma in two: the direct
mapping vma and the memory-backed vma populated by the readpage method.
In practice the unmapped area is seldom accessed, so the split might never
occur before this area is discarded.
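
For the partial map, the patch below saves the split point in
vma->vm_private_data as a single value: the file page offset in the low
12 bits and the split page frame number of the user address above it. Here
is a small sketch of that packing, assuming a 32-bit long as in the patch
comment. The helper names pack_split() and unpack_split() are hypothetical.
The asserts encode the limits stated in that comment: 12 bits of pgoff and
20 bits of pfn.

```c
#include <assert.h>
#include <stdint.h>

#define SPLIT_PGOFF_BITS	12	/* pgoff width, per the patch comment */
#define SPLIT_PFN_BITS		20	/* pfn width when long is 32 bits */

/* Pack the file page offset and the split pfn into one 32-bit value. */
static uint32_t pack_split(uint32_t split_pgoff, uint32_t split_pfn)
{
	assert(split_pgoff < (1u << SPLIT_PGOFF_BITS));
	assert(split_pfn < (1u << SPLIT_PFN_BITS));
	return split_pgoff | (split_pfn << SPLIT_PGOFF_BITS);
}

/* Reverse of pack_split(), mirroring what the fault handler does. */
static void unpack_split(uint32_t split_val, uint32_t *split_pgoff,
			 uint32_t *split_pfn)
{
	*split_pgoff = split_val & ((1u << SPLIT_PGOFF_BITS) - 1);
	*split_pfn = split_val >> SPLIT_PGOFF_BITS;
}
```

For instance, pack_split(0x123, 0x45678) yields 0x45678123, and unpacking
that value returns the two fields unchanged.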

In the non-MMU case it is the get_unmapped_area method that is responsible
for providing the address where the actual data can be found. No mapping
is necessary of course.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
---
 fs/cramfs/Kconfig |   2 +-
 fs/cramfs/inode.c | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 296 insertions(+), 1 deletion(-)

diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
index 5b4e0b7e13..306549be25 100644
--- a/fs/cramfs/Kconfig
+++ b/fs/cramfs/Kconfig
@@ -30,7 +30,7 @@ config CRAMFS_BLOCKDEV
 
 config CRAMFS_PHYSMEM
 	bool "Support CramFs image directly mapped in physical memory"
-	depends on CRAMFS
+	depends on CRAMFS = y
 	default y if !CRAMFS_BLOCKDEV
 	help
 	  This option allows the CramFs driver to load data directly from
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 2fc886092b..1d7d61354b 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -15,7 +15,9 @@
 
 #include <linux/module.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/ramfs.h>
 #include <linux/init.h>
 #include <linux/string.h>
 #include <linux/blkdev.h>
@@ -49,6 +51,7 @@ static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
 static const struct super_operations cramfs_ops;
 static const struct inode_operations cramfs_dir_inode_operations;
 static const struct file_operations cramfs_directory_operations;
+static const struct file_operations cramfs_physmem_fops;
 static const struct address_space_operations cramfs_aops;
 
 static DEFINE_MUTEX(read_mutex);
@@ -96,6 +99,10 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
 	case S_IFREG:
 		inode->i_fop = &generic_ro_fops;
 		inode->i_data.a_ops = &cramfs_aops;
+		if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) &&
+		    CRAMFS_SB(sb)->flags & CRAMFS_FLAG_EXT_BLOCK_POINTERS &&
+		    CRAMFS_SB(sb)->linear_phys_addr)
+			inode->i_fop = &cramfs_physmem_fops;
 		break;
 	case S_IFDIR:
 		inode->i_op = &cramfs_dir_inode_operations;
@@ -277,6 +284,294 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset,
 		return NULL;
 }
 
+/*
+ * For a mapping to be possible, we need a range of uncompressed and
+ * contiguous blocks. Return the offset for the first block and number of
+ * valid blocks for which that is true, or zero otherwise.
+ */
+static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 *pages)
+{
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	int i;
+	u32 *blockptrs, blockaddr;
+
+	/*
+	 * We can dereference memory directly here as this code may be
+	 * reached only when there is a direct filesystem image mapping
+	 * available in memory.
+	 */
+	blockptrs = (u32 *)(sbi->linear_virt_addr + OFFSET(inode) + pgoff*4);
+	blockaddr = blockptrs[0] & ~CRAMFS_BLK_FLAGS;
+	i = 0;
+	do {
+		u32 expect = blockaddr + i * (PAGE_SIZE >> 2);
+		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR|CRAMFS_BLK_FLAG_UNCOMPRESSED;
+		if (blockptrs[i] != expect) {
+			pr_debug("range: block %d/%d got %#x expects %#x\n",
+				 pgoff+i, pgoff+*pages-1, blockptrs[i], expect);
+			if (i == 0)
+				return 0;
+			break;
+		}
+	} while (++i < *pages);
+
+	*pages = i;
+
+	/* stored "direct" block ptrs are shifted down by 2 bits */
+	return blockaddr << 2;
+}
+
+/*
+ * It is possible for cramfs_physmem_mmap() to partially populate the mapping
+ * causing page faults in the unmapped area. When that happens, we need to
+ * split the vma so that the unmapped area gets its own vma that can be backed
+ * with actual memory pages and loaded normally. This is necessary because
+ * remap_pfn_range() overwrites vma->vm_pgoff with the pfn and filemap_fault()
+ * no longer works with it. Furthermore this makes /proc/x/maps right.
+ * Q: is there a way to do split vma at mmap() time?
+ */
+static const struct vm_operations_struct cramfs_vmasplit_ops;
+static int cramfs_vmasplit_fault(struct vm_fault *vmf)
+{
+	struct mm_struct *mm = vmf->vma->vm_mm;
+	struct vm_area_struct *vma, *new_vma;
+	struct file *vma_file = get_file(vmf->vma->vm_file);
+	unsigned long split_val, split_addr;
+	unsigned int split_pgoff;
+	int ret;
+
+	/* We have some vma surgery to do and need the write lock. */
+	up_read(&mm->mmap_sem);
+	if (down_write_killable(&mm->mmap_sem)) {
+		fput(vma_file);
+		return VM_FAULT_RETRY;
+	}
+
+	/* Make sure the vma didn't change between the locks */
+	ret = VM_FAULT_SIGSEGV;
+	vma = find_vma(mm, vmf->address);
+	if (!vma)
+		goto out_fput;
+
+	/*
+	 * Someone else might have raced with us and handled the fault,
+	 * changed the vma, etc. If so let it go back to user space and
+	 * fault again if necessary.
+	 */
+	ret = VM_FAULT_NOPAGE;
+	if (vma->vm_ops != &cramfs_vmasplit_ops || vma->vm_file != vma_file)
+		goto out_fput;
+	fput(vma_file);
+
+	/* Retrieve the vma split address and validate it */
+	split_val = (unsigned long)vma->vm_private_data;
+	split_pgoff = split_val & 0xfff;
+	split_addr = (split_val >> 12) << PAGE_SHIFT;
+	if (split_addr < vma->vm_start) {
+		/* bottom of vma was unmapped */
+		split_pgoff += (vma->vm_start - split_addr) >> PAGE_SHIFT;
+		split_addr = vma->vm_start;
+	}
+	pr_debug("fault: addr=%#lx vma=%#lx-%#lx split=%#lx\n",
+		 vmf->address, vma->vm_start, vma->vm_end, split_addr);
+	ret = VM_FAULT_SIGSEGV;
+	if (!split_val || split_addr > vmf->address || vma->vm_end <= vmf->address)
+		goto out;
+
+	if (unlikely(vma->vm_start == split_addr)) {
+		/* nothing to split */
+		new_vma = vma;
+	} else {
+		/* Split away the directly mapped area */
+		ret = VM_FAULT_OOM;
+		if (split_vma(mm, vma, split_addr, 0) != 0)
+			goto out;
+
+		/* The direct vma should no longer ever fault */
+		vma->vm_ops = NULL;
+
+		/* Retrieve the new vma covering the unmapped area */
+		new_vma = find_vma(mm, split_addr);
+		BUG_ON(new_vma == vma);
+		ret = VM_FAULT_SIGSEGV;
+		if (!new_vma)
+			goto out;
+	}
+
+	/*
+	 * Readjust the new vma with the actual file based pgoff and
+	 * process the fault normally on it.
+	 */
+	new_vma->vm_pgoff = split_pgoff;
+	new_vma->vm_ops = &generic_file_vm_ops;
+	new_vma->vm_flags &= ~(VM_IO | VM_PFNMAP | VM_DONTEXPAND);
+	vmf->vma = new_vma;
+	vmf->pgoff = split_pgoff;
+	vmf->pgoff += (vmf->address - new_vma->vm_start) >> PAGE_SHIFT;
+	downgrade_write(&mm->mmap_sem);
+	return filemap_fault(vmf);
+
+out_fput:
+	fput(vma_file);
+out:
+	downgrade_write(&mm->mmap_sem);
+	return ret;
+}
+
+static const struct vm_operations_struct cramfs_vmasplit_ops = {
+	.fault	= cramfs_vmasplit_fault,
+};
+
+static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, vma_pages, max_pages, offset;
+	unsigned long address;
+	char *fail_reason;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_MMU))
+		return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -ENOSYS;
+
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		return -EINVAL;
+
+	/* Could COW work here? */
+	fail_reason = "vma is writable";
+	if (vma->vm_flags & VM_WRITE)
+		goto fail;
+
+	vma_pages = (vma->vm_end - vma->vm_start + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	fail_reason = "beyond file limit";
+	if (vma->vm_pgoff >= max_pages)
+		goto fail;
+	pages = vma_pages;
+	if (pages > max_pages - vma->vm_pgoff)
+		pages = max_pages - vma->vm_pgoff;
+
+	offset = cramfs_get_block_range(inode, vma->vm_pgoff, &pages);
+	fail_reason = "unsuitable block layout";
+	if (!offset)
+		goto fail;
+	address = sbi->linear_phys_addr + offset;
+	fail_reason = "data is not page aligned";
+	if (!PAGE_ALIGNED(address))
+		goto fail;
+
+	/* Don't map the last page if it contains some other data */
+	if (unlikely(vma->vm_pgoff + pages == max_pages)) {
+		unsigned int partial = offset_in_page(inode->i_size);
+		if (partial) {
+			char *data = sbi->linear_virt_addr + offset;
+			data += (max_pages - 1) * PAGE_SIZE + partial;
+			while ((unsigned long)data & 7)
+				if (*data++ != 0)
+					goto nonzero;
+			while (offset_in_page(data)) {
+				if (*(u64 *)data != 0) {
+					nonzero:
+					pr_debug("mmap: %s: last page is shared\n",
+						 file_dentry(file)->d_name.name);
+					pages--;
+					break;
+				}
+				data += 8;
+			}
+		}
+	}
+
+	if (pages) {
+		/*
+		 * If we can't map it all, page faults will occur if the
+		 * unmapped area is accessed. Let's handle them to split the
+		 * vma and let the normal paging machinery take care of the
+		 * rest through cramfs_readpage(). Because remap_pfn_range()
+		 * repurposes vma->vm_pgoff, we have to save it somewhere.
+		 * Let's use vma->vm_private_data to hold both the pgoff and
+		 * the actual address split point. Maximum file size is 16MB
+		 * (12 bits pgoff) and max 20 bits pfn where a long is 32 bits
+		 * so we can pack both together.
+		 */
+		if (pages != vma_pages) {
+			unsigned int split_pgoff = vma->vm_pgoff + pages;
+			unsigned long split_pfn = (vma->vm_start >> PAGE_SHIFT) + pages;
+			unsigned long split_val = split_pgoff | (split_pfn << 12);
+			vma->vm_private_data = (void *)split_val;
+			vma->vm_ops = &cramfs_vmasplit_ops;
+			/* to keep remap_pfn_range() happy */
+			vma->vm_end = vma->vm_start + pages * PAGE_SIZE;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
+				      pages * PAGE_SIZE, vma->vm_page_prot);
+		/* restore vm_end in case we cheated it above */
+		vma->vm_end = vma->vm_start + vma_pages * PAGE_SIZE;
+		if (ret)
+			return ret;
+
+		pr_debug("mapped %s at 0x%08lx (%u/%u pages) to vma 0x%08lx, "
+			 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
+			 address, pages, vma_pages, vma->vm_start,
+			 (unsigned long long)pgprot_val(vma->vm_page_prot));
+		return 0;
+	}
+	fail_reason = "no suitable block remaining";
+
+fail:
+	pr_debug("%s: direct mmap failed: %s\n",
+		 file_dentry(file)->d_name.name, fail_reason);
+
+	/* We failed to do a direct map, but normal paging will do it */
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+#ifndef CONFIG_MMU
+
+static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
+			unsigned long addr, unsigned long len,
+			unsigned long pgoff, unsigned long flags)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, block_pages, max_pages, offset;
+
+	pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= max_pages || pages > max_pages - pgoff)
+		return -EINVAL;
+	block_pages = pages;
+	offset = cramfs_get_block_range(inode, pgoff, &block_pages);
+	if (!offset || block_pages != pages)
+		return -ENOSYS;
+	addr = sbi->linear_phys_addr + offset;
+	pr_debug("get_unmapped for %s ofs %#lx siz %lu at 0x%08lx\n",
+		 file_dentry(file)->d_name.name, pgoff*PAGE_SIZE, len, addr);
+	return addr;
+}
+
+static unsigned cramfs_physmem_mmap_capabilities(struct file *file)
+{
+	return NOMMU_MAP_COPY | NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_EXEC;
+}
+#endif
+
+static const struct file_operations cramfs_physmem_fops = {
+	.llseek			= generic_file_llseek,
+	.read_iter		= generic_file_read_iter,
+	.splice_read		= generic_file_splice_read,
+	.mmap			= cramfs_physmem_mmap,
+#ifndef CONFIG_MMU
+	.get_unmapped_area	= cramfs_physmem_get_unmapped_area,
+	.mmap_capabilities	= cramfs_physmem_mmap_capabilities,
+#endif
+};
+
 static void cramfs_blkdev_kill_sb(struct super_block *sb)
 {
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
-- 
2.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related
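
The range test at the top of cramfs_physmem_get_unmapped_area() in the
patch above can be tried out in user space. What follows is only a minimal
sketch, not the kernel code: SKETCH_PAGE_SHIFT, bytes_to_pages() and
range_fits_file() are hypothetical names, a 4 KiB page size is assumed, and
unsigned long stands in for the kernel's unsigned int counters.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12u                        /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ul << SKETCH_PAGE_SHIFT)

/* Round a byte count up to a whole number of pages. */
static unsigned long bytes_to_pages(unsigned long len)
{
	return (len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Return 1 when a request of `len` bytes starting at page offset `pgoff`
 * lies entirely inside a file of `file_size` bytes, 0 otherwise.
 * Mirrors: if (pgoff >= max_pages || pages > max_pages - pgoff) -> -EINVAL
 */
static int range_fits_file(unsigned long file_size, unsigned long pgoff,
			   unsigned long len)
{
	unsigned long pages = bytes_to_pages(len);
	unsigned long max_pages = bytes_to_pages(file_size);

	/* pgoff < max_pages is checked first, so the subtraction cannot wrap */
	if (pgoff >= max_pages || pages > max_pages - pgoff)
		return 0;
	return 1;
}
```

Comparing pages against max_pages - pgoff, rather than testing
pgoff + pages > max_pages, keeps the check correct even when pgoff is
close to the top of the integer range.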

* [PATCH v4 5/5] cramfs: rehabilitate it
From: Nicolas Pitre @ 2017-09-27 23:32 UTC (permalink / raw)
  To: Alexander Viro, linux-mm
  Cc: linux-fsdevel, linux-embedded, linux-kernel, Chris Brandt
In-Reply-To: <20170927233224.31676-1-nicolas.pitre@linaro.org>

Update documentation, pointer to latest tools, appoint myself as
maintainer. Given it's been unloved for so long, I don't expect anyone
will protest.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>
---
 Documentation/filesystems/cramfs.txt | 42 ++++++++++++++++++++++++++++++++++++
 MAINTAINERS                          |  4 ++--
 fs/cramfs/Kconfig                    |  9 +++++---
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt
index 4006298f67..8875d306bc 100644
--- a/Documentation/filesystems/cramfs.txt
+++ b/Documentation/filesystems/cramfs.txt
@@ -45,6 +45,48 @@ you can just change the #define in mkcramfs.c, so long as you don't
 mind the filesystem becoming unreadable to future kernels.
 
 
+Memory Mapped cramfs image
+--------------------------
+
+The CRAMFS_PHYSMEM Kconfig option adds support for loading data directly
+from a physical linear memory range (usually non volatile memory like Flash)
+to cramfs instead of going through the block device layer. This saves some
+memory since no intermediate buffering is necessary to hold the data before
+decompressing.
+
+And when data blocks are kept uncompressed and properly aligned, they will
+automatically be mapped directly into user space whenever possible providing
+eXecute-In-Place (XIP) from ROM of read-only segments. Data segments mapped
+read-write (hence they have to be copied to RAM) may still be compressed in
+the cramfs image in the same file along with non compressed read-only
+segments. Both MMU and no-MMU systems are supported. This is particularly
+handy for tiny embedded systems with very tight memory constraints.
+
+The filesystem type for this feature is "cramfs_physmem" to distinguish it
+from the block device (or MTD) based access. The location of the cramfs
+image in memory is system dependent. You must know the proper physical
+address where the cramfs image is located and specify it using the
+physaddr=0x******** mount option. For example, if the physical address
+of the cramfs image is 0x80100000, the following command would mount it
+on /mnt:
+
+$ mount -t cramfs_physmem -o physaddr=0x80100000 none /mnt
+
+To boot such an image as the root filesystem, the following kernel
+commandline parameters must be provided:
+
+	"rootfstype=cramfs_physmem rootflags=physaddr=0x80100000"
+
+
+Tools
+-----
+
+A version of mkcramfs that can take advantage of the latest capabilities
+described above can be found here:
+
+https://github.com/npitre/cramfs-tools
+
+
 For /usr/share/magic
 --------------------
 
diff --git a/MAINTAINERS b/MAINTAINERS
index 1c3feffb1c..f00aec6a66 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3612,8 +3612,8 @@ F:	drivers/cpuidle/*
 F:	include/linux/cpuidle.h
 
 CRAMFS FILESYSTEM
-W:	http://sourceforge.net/projects/cramfs/
-S:	Orphan / Obsolete
+M:	Nicolas Pitre <nico@linaro.org>
+S:	Maintained
 F:	Documentation/filesystems/cramfs.txt
 F:	fs/cramfs/
 
diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
index 306549be25..374d52e029 100644
--- a/fs/cramfs/Kconfig
+++ b/fs/cramfs/Kconfig
@@ -1,5 +1,5 @@
 config CRAMFS
-	tristate "Compressed ROM file system support (cramfs) (OBSOLETE)"
+	tristate "Compressed ROM file system support (cramfs)"
 	select ZLIB_INFLATE
 	help
 	  Saying Y here includes support for CramFs (Compressed ROM File
@@ -15,8 +15,11 @@ config CRAMFS
 	  cramfs.  Note that the root file system (the one containing the
 	  directory /) cannot be compiled as a module.
 
-	  This filesystem is obsoleted by SquashFS, which is much better
-	  in terms of performance and features.
+	  This filesystem is limited in capabilities and performance on
+	  purpose to remain small and low on RAM usage. It is most suitable
+	  for small embedded systems. For a more capable compressed filesystem
+	  you should look at SquashFS which is much better in terms of
+	  performance and features.
 
 	  If unsure, say N.
 
-- 
2.9.5


^ permalink raw reply related
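
The documentation above makes the physaddr= mount option mandatory, and
cramfs_physmem_fill_super() in patch 1/5 rejects a value that is missing,
zero, or not page aligned. A rough user-space sketch of that validation
follows. It is an illustration under stated assumptions, not the kernel
implementation: parse_physaddr() and SKETCH_PAGE_SIZE are hypothetical
names, a 4 KiB page is assumed, and strtoull() is swapped in for the
kernel's memparse(), which additionally accepts size suffixes such as
K, M and G.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096ull	/* assumption: 4 KiB pages */

/*
 * Look for "physaddr=" in a mount option string and validate its value.
 * Returns 0 and stores the address in *addr on success, -EINVAL otherwise.
 */
static int parse_physaddr(const char *opts, uint64_t *addr)
{
	const char *p;
	char *end;
	unsigned long long val;

	if (!opts || !(p = strstr(opts, "physaddr=")))
		return -EINVAL;		/* option is mandatory */
	p += strlen("physaddr=");

	errno = 0;
	val = strtoull(p, &end, 0);	/* base 0 accepts the 0x prefix */
	if (end == p || errno == ERANGE || val == 0)
		return -EINVAL;		/* no digits, overflow, or zero */
	if (val & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL;		/* not aligned to a page boundary */

	*addr = val;
	return 0;
}
```

With the address from the documentation, parse_physaddr("physaddr=0x80100000",
&addr) succeeds, while "physaddr=0x80100010" is refused because it does not
sit on a page boundary.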

* Re: [PATCH v4 1/5] cramfs: direct memory access support
From: Christoph Hellwig @ 2017-10-01  8:29 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Alexander Viro, linux-mm, linux-fsdevel, linux-embedded,
	linux-kernel, Chris Brandt, linux-mtd, devicetree
In-Reply-To: <20170927233224.31676-2-nicolas.pitre@linaro.org>

On Wed, Sep 27, 2017 at 07:32:20PM -0400, Nicolas Pitre wrote:
> To distinguish between both access types, the cramfs_physmem filesystem
> type must be specified when using a memory accessible cramfs image, and
> the physaddr argument must provide the actual filesystem image's physical
> memory location.

Sorry, but this still is a complete no-go.  A physical address is not a
proper interface.  You still need to have some interface for your NOR, NAND
or DRAM.  Usually that would be an mtd driver, but if you have a good
reason why that's not suitable for you (and please explain it well)
we'll need a little OF or similar layer to bind a thin driver.

> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> Tested-by: Chris Brandt <chris.brandt@renesas.com>
> ---
>  fs/cramfs/Kconfig |  29 +++++-
>  fs/cramfs/inode.c | 264 +++++++++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 241 insertions(+), 52 deletions(-)
> 
> diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
> index 11b29d491b..5b4e0b7e13 100644
> --- a/fs/cramfs/Kconfig
> +++ b/fs/cramfs/Kconfig
> @@ -1,6 +1,5 @@
>  config CRAMFS
>  	tristate "Compressed ROM file system support (cramfs) (OBSOLETE)"
> -	depends on BLOCK
>  	select ZLIB_INFLATE
>  	help
>  	  Saying Y here includes support for CramFs (Compressed ROM File
> @@ -20,3 +19,31 @@ config CRAMFS
>  	  in terms of performance and features.
>  
>  	  If unsure, say N.
> +
> +config CRAMFS_BLOCKDEV
> +	bool "Support CramFs image over a regular block device" if EXPERT
> +	depends on CRAMFS && BLOCK
> +	default y
> +	help
> +	  This option allows the CramFs driver to load data from a regular
> +	  block device such as a disk partition or a ramdisk.
> +
> +config CRAMFS_PHYSMEM
> +	bool "Support CramFs image directly mapped in physical memory"
> +	depends on CRAMFS
> +	default y if !CRAMFS_BLOCKDEV
> +	help
> +	  This option allows the CramFs driver to load data directly from
> +	  a linear addressed memory range (usually non volatile memory
> +	  like flash) instead of going through the block device layer.
> +	  This saves some memory since no intermediate buffering is
> +	  necessary.
> +
> +	  The filesystem type for this feature is "cramfs_physmem".
> +	  The location of the CramFs image in memory is board
> +	  dependent. Therefore, if you say Y, you must know the proper
> +	  physical address where to store the CramFs image and specify
> +	  it using the physaddr=0x******** mount option (for example:
> +	  "mount -t cramfs_physmem -o physaddr=0x100000 none /mnt").
> +
> +	  If unsure, say N.
> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> index 7919967488..19f464a214 100644
> --- a/fs/cramfs/inode.c
> +++ b/fs/cramfs/inode.c
> @@ -24,6 +24,7 @@
>  #include <linux/mutex.h>
>  #include <uapi/linux/cramfs_fs.h>
>  #include <linux/uaccess.h>
> +#include <linux/io.h>
>  
>  #include "internal.h"
>  
> @@ -36,6 +37,8 @@ struct cramfs_sb_info {
>  	unsigned long blocks;
>  	unsigned long files;
>  	unsigned long flags;
> +	void *linear_virt_addr;
> +	phys_addr_t linear_phys_addr;
>  };
>  
>  static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
> @@ -140,6 +143,9 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
>   * BLKS_PER_BUF*PAGE_SIZE, so that the caller doesn't need to
>   * worry about end-of-buffer issues even when decompressing a full
>   * page cache.
> + *
> + * Note: This is all optimized away at compile time when
> + *       CONFIG_CRAMFS_BLOCKDEV=n.
>   */
>  #define READ_BUFFERS (2)
>  /* NEXT_BUFFER(): Loop over [0..(READ_BUFFERS-1)]. */
> @@ -160,10 +166,10 @@ static struct super_block *buffer_dev[READ_BUFFERS];
>  static int next_buffer;
>  
>  /*
> - * Returns a pointer to a buffer containing at least LEN bytes of
> - * filesystem starting at byte offset OFFSET into the filesystem.
> + * Populate our block cache and return a pointer from it.
>   */
> -static void *cramfs_read(struct super_block *sb, unsigned int offset, unsigned int len)
> +static void *cramfs_blkdev_read(struct super_block *sb, unsigned int offset,
> +				unsigned int len)
>  {
>  	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
>  	struct page *pages[BLKS_PER_BUF];
> @@ -239,7 +245,39 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset, unsigned i
>  	return read_buffers[buffer] + offset;
>  }
>  
> -static void cramfs_kill_sb(struct super_block *sb)
> +/*
> + * Return a pointer to the linearly addressed cramfs image in memory.
> + */
> +static void *cramfs_direct_read(struct super_block *sb, unsigned int offset,
> +				unsigned int len)
> +{
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> +
> +	if (!len)
> +		return NULL;
> +	if (len > sbi->size || offset > sbi->size - len)
> +	       return page_address(ZERO_PAGE(0));
> +	return sbi->linear_virt_addr + offset;
> +}
> +
> +/*
> + * Returns a pointer to a buffer containing at least LEN bytes of
> + * filesystem starting at byte offset OFFSET into the filesystem.
> + */
> +static void *cramfs_read(struct super_block *sb, unsigned int offset,
> +			 unsigned int len)
> +{
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> +
> +	if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) && sbi->linear_virt_addr)
> +		return cramfs_direct_read(sb, offset, len);
> +	else if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV))
> +		return cramfs_blkdev_read(sb, offset, len);
> +	else
> +		return NULL;
> +}
> +
> +static void cramfs_blkdev_kill_sb(struct super_block *sb)
>  {
>  	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
>  
> @@ -247,6 +285,16 @@ static void cramfs_kill_sb(struct super_block *sb)
>  	kfree(sbi);
>  }
>  
> +static void cramfs_physmem_kill_sb(struct super_block *sb)
> +{
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> +
> +	if (sbi->linear_virt_addr)
> +		memunmap(sbi->linear_virt_addr);
> +	kill_anon_super(sb);
> +	kfree(sbi);
> +}
> +
>  static int cramfs_remount(struct super_block *sb, int *flags, char *data)
>  {
>  	sync_filesystem(sb);
> @@ -254,34 +302,24 @@ static int cramfs_remount(struct super_block *sb, int *flags, char *data)
>  	return 0;
>  }
>  
> -static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
> +static int cramfs_read_super(struct super_block *sb,
> +			     struct cramfs_super *super, int silent)
>  {
> -	int i;
> -	struct cramfs_super super;
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
>  	unsigned long root_offset;
> -	struct cramfs_sb_info *sbi;
> -	struct inode *root;
> -
> -	sb->s_flags |= MS_RDONLY;
> -
> -	sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
> -	if (!sbi)
> -		return -ENOMEM;
> -	sb->s_fs_info = sbi;
>  
> -	/* Invalidate the read buffers on mount: think disk change.. */
> -	mutex_lock(&read_mutex);
> -	for (i = 0; i < READ_BUFFERS; i++)
> -		buffer_blocknr[i] = -1;
> +	/* We don't know the real size yet */
> +	sbi->size = PAGE_SIZE;
>  
>  	/* Read the first block and get the superblock from it */
> -	memcpy(&super, cramfs_read(sb, 0, sizeof(super)), sizeof(super));
> +	mutex_lock(&read_mutex);
> +	memcpy(super, cramfs_read(sb, 0, sizeof(*super)), sizeof(*super));
>  	mutex_unlock(&read_mutex);
>  
>  	/* Do sanity checks on the superblock */
> -	if (super.magic != CRAMFS_MAGIC) {
> +	if (super->magic != CRAMFS_MAGIC) {
>  		/* check for wrong endianness */
> -		if (super.magic == CRAMFS_MAGIC_WEND) {
> +		if (super->magic == CRAMFS_MAGIC_WEND) {
>  			if (!silent)
>  				pr_err("wrong endianness\n");
>  			return -EINVAL;
> @@ -289,10 +327,10 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
>  
>  		/* check at 512 byte offset */
>  		mutex_lock(&read_mutex);
> -		memcpy(&super, cramfs_read(sb, 512, sizeof(super)), sizeof(super));
> +		memcpy(super, cramfs_read(sb, 512, sizeof(*super)), sizeof(*super));
>  		mutex_unlock(&read_mutex);
> -		if (super.magic != CRAMFS_MAGIC) {
> -			if (super.magic == CRAMFS_MAGIC_WEND && !silent)
> +		if (super->magic != CRAMFS_MAGIC) {
> +			if (super->magic == CRAMFS_MAGIC_WEND && !silent)
>  				pr_err("wrong endianness\n");
>  			else if (!silent)
>  				pr_err("wrong magic\n");
> @@ -301,34 +339,34 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
>  	}
>  
>  	/* get feature flags first */
> -	if (super.flags & ~CRAMFS_SUPPORTED_FLAGS) {
> +	if (super->flags & ~CRAMFS_SUPPORTED_FLAGS) {
>  		pr_err("unsupported filesystem features\n");
>  		return -EINVAL;
>  	}
>  
>  	/* Check that the root inode is in a sane state */
> -	if (!S_ISDIR(super.root.mode)) {
> +	if (!S_ISDIR(super->root.mode)) {
>  		pr_err("root is not a directory\n");
>  		return -EINVAL;
>  	}
>  	/* correct strange, hard-coded permissions of mkcramfs */
> -	super.root.mode |= (S_IRUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> +	super->root.mode |= (S_IRUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
>  
> -	root_offset = super.root.offset << 2;
> -	if (super.flags & CRAMFS_FLAG_FSID_VERSION_2) {
> -		sbi->size = super.size;
> -		sbi->blocks = super.fsid.blocks;
> -		sbi->files = super.fsid.files;
> +	root_offset = super->root.offset << 2;
> +	if (super->flags & CRAMFS_FLAG_FSID_VERSION_2) {
> +		sbi->size = super->size;
> +		sbi->blocks = super->fsid.blocks;
> +		sbi->files = super->fsid.files;
>  	} else {
>  		sbi->size = 1<<28;
>  		sbi->blocks = 0;
>  		sbi->files = 0;
>  	}
> -	sbi->magic = super.magic;
> -	sbi->flags = super.flags;
> +	sbi->magic = super->magic;
> +	sbi->flags = super->flags;
>  	if (root_offset == 0)
>  		pr_info("empty filesystem");
> -	else if (!(super.flags & CRAMFS_FLAG_SHIFTED_ROOT_OFFSET) &&
> +	else if (!(super->flags & CRAMFS_FLAG_SHIFTED_ROOT_OFFSET) &&
>  		 ((root_offset != sizeof(struct cramfs_super)) &&
>  		  (root_offset != 512 + sizeof(struct cramfs_super))))
>  	{
> @@ -336,9 +374,18 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
>  		return -EINVAL;
>  	}
>  
> +	return 0;
> +}
> +
> +static int cramfs_finalize_super(struct super_block *sb,
> +				 struct cramfs_inode *cramfs_root)
> +{
> +	struct inode *root;
> +
>  	/* Set it all up.. */
> +	sb->s_flags |= MS_RDONLY;
>  	sb->s_op = &cramfs_ops;
> -	root = get_cramfs_inode(sb, &super.root, 0);
> +	root = get_cramfs_inode(sb, cramfs_root, 0);
>  	if (IS_ERR(root))
>  		return PTR_ERR(root);
>  	sb->s_root = d_make_root(root);
> @@ -347,6 +394,92 @@ static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
>  	return 0;
>  }
>  
> +static int cramfs_blkdev_fill_super(struct super_block *sb, void *data, int silent)
> +{
> +	struct cramfs_sb_info *sbi;
> +	struct cramfs_super super;
> +	int i, err;
> +
> +	sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
> +	if (!sbi)
> +		return -ENOMEM;
> +	sb->s_fs_info = sbi;
> +
> +	/* Invalidate the read buffers on mount: think disk change.. */
> +	for (i = 0; i < READ_BUFFERS; i++)
> +		buffer_blocknr[i] = -1;
> +
> +	err = cramfs_read_super(sb, &super, silent);
> +	if (err)
> +		return err;
> +	return cramfs_finalize_super(sb, &super.root);
> +}
> +
> +static int cramfs_physmem_fill_super(struct super_block *sb, void *data, int silent)
> +{
> +	struct cramfs_sb_info *sbi;
> +	struct cramfs_super super;
> +	char *p;
> +	int err;
> +
> +	sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
> +	if (!sbi)
> +		return -ENOMEM;
> +	sb->s_fs_info = sbi;
> +
> +	/*
> +	 * The physical location of the cramfs image is specified as
> +	 * a mount parameter.  This parameter is mandatory for obvious
> +	 * reasons.  Some validation is made on the phys address but this
> +	 * is not exhaustive and we count on the fact that someone using
> +	 * this feature is supposed to know what he/she's doing.
> +	 */
> +	if (!data || !(p = strstr((char *)data, "physaddr="))) {
> +		pr_err("unknown physical address for linear cramfs image\n");
> +		return -EINVAL;
> +	}
> +	sbi->linear_phys_addr = memparse(p + 9, NULL);
> +	if (!sbi->linear_phys_addr) {
> +		pr_err("bad value for cramfs image physical address\n");
> +		return -EINVAL;
> +	}
> +	if (sbi->linear_phys_addr & (PAGE_SIZE-1)) {
> +		pr_err("physical address %pap for linear cramfs isn't aligned to a page boundary\n",
> +			&sbi->linear_phys_addr);
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * Map only one page for now.  Will remap it when fs size is known.
> +	 * Although we'll only read from it, we want the CPU cache to
> +	 * kick in for the higher throughput it provides, hence MEMREMAP_WB.
> +	 */
> +	pr_info("checking physical address %pap for linear cramfs image\n", &sbi->linear_phys_addr);
> +	sbi->linear_virt_addr = memremap(sbi->linear_phys_addr, PAGE_SIZE,
> +					 MEMREMAP_WB);
> +	if (!sbi->linear_virt_addr) {
> +		pr_err("ioremap of the linear cramfs image failed\n");
> +		return -ENOMEM;
> +	}
> +
> +	err = cramfs_read_super(sb, &super, silent);
> +	if (err)
> +		return err;
> +
> +	/* Remap the whole filesystem now */
> +	pr_info("linear cramfs image appears to be %lu KB in size\n",
> +		sbi->size/1024);
> +	memunmap(sbi->linear_virt_addr);
> +	sbi->linear_virt_addr = memremap(sbi->linear_phys_addr, sbi->size,
> +					 MEMREMAP_WB);
> +	if (!sbi->linear_virt_addr) {
> +		pr_err("ioremap of the linear cramfs image failed\n");
> +		return -ENOMEM;
> +	}
> +
> +	return cramfs_finalize_super(sb, &super.root);
> +}
> +
>  static int cramfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
>  	struct super_block *sb = dentry->d_sb;
> @@ -573,38 +706,67 @@ static const struct super_operations cramfs_ops = {
>  	.statfs		= cramfs_statfs,
>  };
>  
> -static struct dentry *cramfs_mount(struct file_system_type *fs_type,
> -	int flags, const char *dev_name, void *data)
> +static struct dentry *cramfs_blkdev_mount(struct file_system_type *fs_type,
> +				int flags, const char *dev_name, void *data)
> +{
> +	return mount_bdev(fs_type, flags, dev_name, data, cramfs_blkdev_fill_super);
> +}
> +
> +static struct dentry *cramfs_physmem_mount(struct file_system_type *fs_type,
> +				int flags, const char *dev_name, void *data)
>  {
> -	return mount_bdev(fs_type, flags, dev_name, data, cramfs_fill_super);
> +	return mount_nodev(fs_type, flags, data, cramfs_physmem_fill_super);
>  }
>  
>  static struct file_system_type cramfs_fs_type = {
>  	.owner		= THIS_MODULE,
>  	.name		= "cramfs",
> -	.mount		= cramfs_mount,
> -	.kill_sb	= cramfs_kill_sb,
> +	.mount		= cramfs_blkdev_mount,
> +	.kill_sb	= cramfs_blkdev_kill_sb,
>  	.fs_flags	= FS_REQUIRES_DEV,
>  };
> +
> +static struct file_system_type cramfs_physmem_fs_type = {
> +	.owner		= THIS_MODULE,
> +	.name		= "cramfs_physmem",
> +	.mount		= cramfs_physmem_mount,
> +	.kill_sb	= cramfs_physmem_kill_sb,
> +};
> +
> +#ifdef CONFIG_CRAMFS_BLOCKDEV
>  MODULE_ALIAS_FS("cramfs");
> +#endif
> +#ifdef CONFIG_CRAMFS_PHYSMEM
> +MODULE_ALIAS_FS("cramfs_physmem");
> +#endif
>  
>  static int __init init_cramfs_fs(void)
>  {
>  	int rv;
>  
> -	rv = cramfs_uncompress_init();
> -	if (rv < 0)
> -		return rv;
> -	rv = register_filesystem(&cramfs_fs_type);
> -	if (rv < 0)
> -		cramfs_uncompress_exit();
> -	return rv;
> +	if ((rv = cramfs_uncompress_init()) < 0)
> +		goto err0;
> +	if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV) &&
> +	    (rv = register_filesystem(&cramfs_fs_type)) < 0)
> +		goto err1;
> +	if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) &&
> +	    (rv = register_filesystem(&cramfs_physmem_fs_type)) < 0)
> +		goto err2;
> +	return 0;
> +
> +err2:	if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV))
> +		unregister_filesystem(&cramfs_fs_type);
> +err1:	cramfs_uncompress_exit();
> +err0:	return rv;
>  }
>  
>  static void __exit exit_cramfs_fs(void)
>  {
>  	cramfs_uncompress_exit();
> -	unregister_filesystem(&cramfs_fs_type);
> +	if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV))
> +		unregister_filesystem(&cramfs_fs_type);
> +	if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM))
> +		unregister_filesystem(&cramfs_physmem_fs_type);
>  }
>  
>  module_init(init_cramfs_fs)
> -- 
> 2.9.5
> 
---end quoted text---


^ permalink raw reply
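
The bounds check in cramfs_direct_read() quoted above can be illustrated
outside the kernel. The sketch below is an assumption-laden stand-in, not
the patch itself: struct image_view, image_read() and the static zero_page
array are hypothetical, with zero_page playing the role of
page_address(ZERO_PAGE(0)) and a plain byte array playing the role of the
memremap()ed image.

```c
#include <stddef.h>

/* Hypothetical stand-in for the linear mapping kept in cramfs_sb_info. */
struct image_view {
	const unsigned char *base;	/* start of the mapped image */
	unsigned int size;		/* image size in bytes */
};

/* Stand-in for the kernel's zero page: all zeroes, one 4 KiB page. */
static const unsigned char zero_page[4096];

/*
 * Return a pointer to `len` bytes at `offset` inside the image.
 * A zero length yields NULL; an out-of-range request yields the zero
 * page, so the caller sees zeroes instead of reading past the image.
 */
static const void *image_read(const struct image_view *img,
			      unsigned int offset, unsigned int len)
{
	if (!len)
		return NULL;
	/* len <= size is tested first, so size - len cannot wrap */
	if (len > img->size || offset > img->size - len)
		return zero_page;
	return img->base + offset;
}
```

Writing the test as offset > size - len instead of offset + len > size
avoids unsigned wraparound when offset is very large.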

* Re: [PATCH v4 4/5] cramfs: add mmap support
From: Christoph Hellwig @ 2017-10-01  8:30 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Alexander Viro, linux-mm, linux-fsdevel, linux-embedded,
	linux-kernel, Chris Brandt
In-Reply-To: <20170927233224.31676-5-nicolas.pitre@linaro.org>

up_read(&mm->mmap_sem) in the fault path is still a complete
no-go.

NAK


^ permalink raw reply
