* [PATCH 04/16] zuf: zuf-rootfs
  2019-08-12 16:42 [PATCHSET 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
@ 2019-08-12 16:42 ` Boaz Harrosh
  0 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-08-12 16:42 UTC
  To: Boaz Harrosh, Boaz Harrosh, linux-fsdevel; +Cc: Boaz Harrosh
zuf-root is a pseudo FS through which the zusd Server communicates
with the Kernel: it registers new file-systems and receives new
mount requests. This patch brings up that special FS.
The principal communication with zuf-rootfs is done through
temp-files + ioctls.
The caller does an open(O_TMPFILE) and invokes some IOCTL_XXX on
the file. The specific ioctl creates an object of one of the
zuf_special_file types and attaches it to the file-ptr, thereby
defining the special behavior of that file.
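For illustration only, the user-mode side of this handshake could look
roughly like the sketch below. The helper name zuf_open_special() is
hypothetical, and cmd/arg are placeholders: at this point in the series
zufc_ioctl() rejects every command with -ENOTTY, so the real command
numbers and argument structs come from zus_api.h in later patches.

```c
#define _GNU_SOURCE		/* for O_TMPFILE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Open an anonymous file on the zuf-root mounted at @root and issue one
 * ioctl on it. On success the open fd is returned; the caller keeps it
 * open for as long as the special-file object is needed. On failure a
 * negative errno is returned and no fd is leaked.
 */
int zuf_open_special(const char *root, unsigned long cmd, void *arg)
{
	int fd;

	assert(root != NULL);

	/* O_TMPFILE needs a directory path and write access */
	fd = open(root, O_TMPFILE | O_RDWR | O_EXCL, 0600);
	if (fd < 0)
		return -errno;

	/* @cmd and @arg are placeholders for a real ZU_IOC_xxx call */
	if (ioctl(fd, cmd, arg) < 0) {
		int err = -errno;

		close(fd);
		return err;
	}

	return fd;
}
```

A caller would pass the zuf-root mount point, e.g.
zuf_open_special("/sys/fs/zuf", cmd, &arg), and close() the returned fd
when done, which ends up in zufc_release() on the Kernel side.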
Otherwise zuf-rootfs is not an FS at all. It has a few viewable
variable files, exposing state and info about the system. In this
patch we can see the "state" variable-file, which tells user-mode
when the Kernel is ready for new mounts, and the "registered_fs"
variable-file, which lists the zufFS(s) registered with the Kernel.
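As a rough sketch of how user-mode might consume the "state" file: the
strings below are the ones _state_read() emits in this patch, while the
enum and function names are made up for the example. Note that the file
is created with mode 0400, so only its owner can read it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical user-mode mirror of the ZUF_ROOT_* values in zuf.h.
 * ZUF_STATE_UNKNOWN (-1) is this sketch's own addition.
 */
enum zuf_root_state {
	ZUF_STATE_UNKNOWN	 = -1,
	ZUF_STATE_INITIALIZING	 = 0,
	ZUF_STATE_REGISTERING_FS = 1,
	ZUF_STATE_MOUNT_READY	 = 2,
};

/* Map the single line read from the "state" file to a state value */
enum zuf_root_state zuf_parse_state(const char *line)
{
	assert(line != NULL);

	if (!strcmp(line, "initializing\n"))
		return ZUF_STATE_INITIALIZING;
	if (!strcmp(line, "registering_fs\n"))
		return ZUF_STATE_REGISTERING_FS;
	if (!strcmp(line, "mount_ready\n"))
		return ZUF_STATE_MOUNT_READY;
	return ZUF_STATE_UNKNOWN;
}

/* Read <root>/state; @root is the zuf-root mount point */
enum zuf_root_state zuf_read_state(const char *root)
{
	char path[4096], line[32];
	FILE *f;

	assert(root != NULL);

	snprintf(path, sizeof(path), "%s/state", root);
	f = fopen(path, "r");
	if (!f)
		return ZUF_STATE_UNKNOWN;

	/* An empty read leaves an empty string, which parses as unknown */
	if (!fgets(line, sizeof(line), f))
		line[0] = '\0';
	fclose(f);

	return zuf_parse_state(line);
}
```

A mount helper could, for example, poll zuf_read_state("/sys/fs/zuf")
until it returns ZUF_STATE_MOUNT_READY before attempting a mount.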
There is a one-to-one relationship between a zuf-root SB and
a zusd Server. Each zusd Server can support multiple zusFS
plugins and register multiple filesystem-types.
The zuf-rootfs (mount -t zuf) is usually mounted on
/sys/fs/zuf. The /sys/fs/zuf directory is automatically created
when zuf.ko is loaded. An admin who wants to run additional zusd
Servers can mount a second instance of -t zuf on some other
dir and point the new zusd Server to it (zusd has an optional path
argument). Otherwise a second zusd instance attempting to communicate
with a busy zuf-root will fail.
TODO: Find a way to trigger a first mount on module_load. Currently
the admin needs to manually "mount -t zuf none /sys/fs/zuf"
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   4 +
 fs/zuf/_extern.h  |  41 +++++
 fs/zuf/_pr.h      |  63 +++++++
 fs/zuf/super.c    |  53 ++++++
 fs/zuf/zuf-core.c |  69 ++++++++
 fs/zuf/zuf-root.c | 435 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf.h      | 115 ++++++++++++
 fs/zuf/zus_api.h  |  36 ++++
 8 files changed, 816 insertions(+)
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 452cec55f34d..b08c08e73faa 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,5 +10,9 @@
 
 obj-$(CONFIG_ZUFS_FS) += zuf.o
 
+# ZUF core
+zuf-y += zuf-core.o zuf-root.o
+
 # Main FS
+zuf-y += super.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
new file mode 100644
index 000000000000..0e8aa52f1259
--- /dev/null
+++ b/fs/zuf/_extern.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_EXTERN_H__
+#define __ZUF_EXTERN_H__
+/*
+ * DO NOT INCLUDE this file directly, it is included by zuf.h
+ * It is here because zuf.h got too big
+ */
+
+/*
+ * extern functions declarations
+ */
+
+/* zuf-core.c */
+int zufc_zts_init(struct zuf_root_info *zri); /* Some private types in core */
+void zufc_zts_fini(struct zuf_root_info *zri);
+
+long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg);
+int zufc_release(struct inode *inode, struct file *file);
+int zufc_mmap(struct file *file, struct vm_area_struct *vma);
+
+/* zuf-root.c */
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
+
+/* super.c */
+int zuf_init_inodecache(void);
+void zuf_destroy_inodecache(void);
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data);
+
+#endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
new file mode 100644
index 000000000000..51924b6bd2a5
--- /dev/null
+++ b/fs/zuf/_pr.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_PR_H__
+#define __ZUF_PR_H__
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+/*
+ * Debug code
+ */
+#define zuf_err(s, args ...)		pr_err("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_err_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_warn(s, args ...)		pr_warn("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_warn_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_warn("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_info(s, args ...)          pr_info("~info~ " s, ## args)
+
+#define zuf_chan_debug(c, s, args...)	pr_debug(c " [%s:%d] " s, __func__, \
+							__LINE__, ## args)
+
+/* ~~~ channel prints ~~~ */
+#define zuf_dbg_perf(s, args ...)	zuf_chan_debug("perfo", s, ##args)
+#define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
+#define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
+#define zuf_dbg_rw(s, args ...)		zuf_chan_debug("rw   ", s, ##args)
+#define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
+#define zuf_dbg_xattr(s, args ...)	zuf_chan_debug("xattr", s, ##args)
+#define zuf_dbg_acl(s, args ...)	zuf_chan_debug("acl  ", s, ##args)
+#define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
+#define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
+#define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
+#define zuf_dbg_mmap(s, args ...)	zuf_chan_debug("mmap ", s, ##args)
+#define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
+#define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
+
+#define md_err		zuf_err
+#define md_warn		zuf_warn
+#define md_err_cnd	zuf_err_cnd
+#define md_warn_cnd	zuf_warn_cnd
+#define md_dbg_err	zuf_dbg_err
+#define md_dbg_verbose	zuf_dbg_verbose
+
+
+#endif /* define __ZUF_PR_H__ */
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
new file mode 100644
index 000000000000..f7f7798425a9
--- /dev/null
+++ b/fs/zuf/super.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Super block operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/parser.h>
+#include <linux/statfs.h>
+#include <linux/backing-dev.h>
+
+#include "zuf.h"
+
+static struct kmem_cache *zuf_inode_cachep;
+
+static void _init_once(void *foo)
+{
+	struct zuf_inode_info *zii = foo;
+
+	inode_init_once(&zii->vfs_inode);
+}
+
+int __init zuf_init_inodecache(void)
+{
+	zuf_inode_cachep = kmem_cache_create("zuf_inode_cache",
+					       sizeof(struct zuf_inode_info),
+					       0,
+					       (SLAB_RECLAIM_ACCOUNT |
+						SLAB_MEM_SPREAD |
+						SLAB_TYPESAFE_BY_RCU),
+					       _init_once);
+	if (zuf_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+void zuf_destroy_inodecache(void)
+{
+	kmem_cache_destroy(zuf_inode_cachep);
+}
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
new file mode 100644
index 000000000000..c9bb31f75bed
--- /dev/null
+++ b/fs/zuf/zuf-core.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/delay.h>
+#include <linux/pfn_t.h>
+#include <linux/sched/signal.h>
+
+#include "zuf.h"
+
+int zufc_zts_init(struct zuf_root_info *zri)
+{
+	return 0;
+}
+
+void zufc_zts_fini(struct zuf_root_info *zri)
+{
+}
+
+long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
+{
+	switch (cmd) {
+	default:
+		zuf_err("%d\n", cmd);
+		return -ENOTTY;
+	}
+}
+
+int zufc_release(struct inode *inode, struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf)
+		return 0;
+
+	switch (zsf->type) {
+	default:
+		return 0;
+	}
+}
+
+int zufc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (unlikely(!zsf)) {
+		zuf_err("Which mmap is that !!!!\n");
+		return -ENOTTY;
+	}
+
+	switch (zsf->type) {
+	default:
+		zuf_err("type=%d\n", zsf->type);
+		return -ENOTTY;
+	}
+}
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
new file mode 100644
index 000000000000..1f5f886997f7
--- /dev/null
+++ b/fs/zuf/zuf-root.c
@@ -0,0 +1,435 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ZUF Root filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUS-ZUF interaction is done via a small specialized FS that
+ * provides the communication with the mount-thread, ZTs, pmem devices,
+ * and so on ...
+ * Subsequently all FS super_blocks are children of this root, and point
+ * to it. All sharing the same zuf communication channels.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <asm-generic/mman.h>
+
+#include "zuf.h"
+
+/* ~~~~ Register/Unregister FS-types ~~~~ */
+#ifdef CONFIG_LOCKDEP
+
+/*
+ * NOTE: When CONFIG_LOCKDEP is on, register_filesystem() complains when
+ * the fstype object comes from kmalloc, because lockdep requires the
+ * lock_class_key(s) embedded in it to live in static memory.
+ *
+ * So in this case we have a maximum of 16 fstypes system wide
+ * (total for all mounted zuf_root(s)). This way we can keep them
+ * in static memory, in g_fs_array below.
+ */
+
+enum { MAX_LOCKDEP_FSs = 16 };
+static uint g_fs_next;
+static struct zuf_fs_type g_fs_array[MAX_LOCKDEP_FSs];
+
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	struct zuf_fs_type *ret;
+
+	if (MAX_LOCKDEP_FSs <= g_fs_next)
+		return NULL;
+
+	ret = &g_fs_array[g_fs_next++];
+	memset(ret, 0, sizeof(*ret));
+	return ret;
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	if (zft == &g_fs_array[0])
+		g_fs_next = 0;
+}
+
+#else /* !CONFIG_LOCKDEP*/
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	return kcalloc(1, sizeof(struct zuf_fs_type), GFP_KERNEL);
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	kfree(zft);
+}
+#endif /*CONFIG_LOCKDEP*/
+
+
+static ssize_t _state_read(struct file *file, char __user *buf, size_t len,
+			   loff_t *ppos)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	const char *msg;
+
+	if (*ppos > 0)
+		return 0;
+
+	switch (zri->state) {
+	case ZUF_ROOT_INITIALIZING:
+		msg = "initializing\n";
+		break;
+	case ZUF_ROOT_REGISTERING_FS:
+		msg = "registering_fs\n";
+		break;
+	case ZUF_ROOT_MOUNT_READY:
+		msg = "mount_ready\n";
+		break;
+	default:
+		msg = "UNKNOWN\n";
+		break;
+	}
+
+	return simple_read_from_buffer(buf, len, ppos, msg, strlen(msg));
+}
+
+static const struct file_operations _state_ops = {
+	.open = nonseekable_open,
+	.read = _state_read,
+	.llseek = no_llseek,
+};
+
+static ssize_t _registered_fs_read(struct file *file, char __user *buf,
+				   size_t len, loff_t *ppos)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	size_t buff_len = 0;
+	struct zuf_fs_type *zft;
+	char *fs_buff, *p;
+	ssize_t ret;
+	size_t name_len;
+
+	list_for_each_entry(zft, &zri->fst_list, list)
+		buff_len += strlen(zft->rfi.fsname) + 1;
+
+	if (unlikely(*ppos > buff_len))
+		return -EINVAL;
+	if (*ppos == buff_len)
+		return 0;
+
+	fs_buff = kzalloc(buff_len + 1, GFP_KERNEL);
+	if (unlikely(!fs_buff))
+		return -ENOMEM;
+
+	p = fs_buff;
+	list_for_each_entry(zft, &zri->fst_list, list) {
+		if (p != fs_buff) {
+			*p = ' ';
+			++p;
+		}
+		name_len = strlen(zft->rfi.fsname);
+		memcpy(p, zft->rfi.fsname, name_len);
+		p += name_len;
+	}
+
+	/* simple_read_from_buffer() offsets the source by *ppos itself */
+	ret = simple_read_from_buffer(buf, len, ppos, fs_buff,
+				      buff_len);
+	kfree(fs_buff);
+
+	return ret;
+}
+
+static const struct file_operations _registered_fs_ops = {
+	.open = nonseekable_open,
+	.read = _registered_fs_read,
+	.llseek = no_llseek,
+};
+
+
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs)
+{
+	struct zuf_fs_type *zft = _fs_type_alloc();
+	struct zuf_root_info *zri = ZRI(sb);
+
+	if (unlikely(!zft))
+		return -ENOMEM;
+
+	if (zri->state == ZUF_ROOT_INITIALIZING)
+		zri->state = ZUF_ROOT_REGISTERING_FS;
+
+	/* Original vfs file type */
+	zft->vfs_fst.owner	= THIS_MODULE;
+	zft->vfs_fst.name	= kstrdup(rfs->rfi.fsname, GFP_KERNEL);
+	zft->vfs_fst.mount	= zuf_mount;
+	zft->vfs_fst.kill_sb	= kill_block_super;
+
+	/* ZUS info about this FS */
+	zft->rfi		= rfs->rfi;
+	zft->zus_zfi		= rfs->zus_zfi;
+	INIT_LIST_HEAD(&zft->list);
+	/* Back pointer to our communication channels */
+	zft->zri		= ZRI(sb);
+
+	zuf_add_fs_type(zft->zri, zft);
+	zuf_info("register_filesystem [%s]\n", zft->vfs_fst.name);
+	return register_filesystem(&zft->vfs_fst);
+}
+
+static void _unregister_all_fses(struct zuf_root_info *zri)
+{
+	struct zuf_fs_type *zft, *n;
+
+	list_for_each_entry_safe_reverse(zft, n, &zri->fst_list, list) {
+		unregister_filesystem(&zft->vfs_fst);
+		list_del_init(&zft->list);
+		_fs_type_free(zft);
+	}
+}
+
+static int zufr_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+
+	drop_nlink(inode);
+	return 0;
+}
+
+/* Force alignment of 2M for all vma(s)
+ *
+ * This belongs to t1.c and what it does for mmap. But we do not mind
+ * that both our mmaps (grab_pmem or ZTs) will be 2M aligned so keep
+ * it here. And zus mappings just all match perfectly with no need for
+ * holes.
+ * FIXME: This is copy/paste from dax-device. It can be very much simplified
+ * for what we need.
+ */
+static unsigned long zufr_get_unmapped_area(struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
+{
+	unsigned long off, off_end, off_align, len_align, addr_align;
+	unsigned long align = PMD_SIZE;
+
+	if (addr)
+		goto out;
+
+	off = pgoff << PAGE_SHIFT;
+	off_end = off + len;
+	off_align = round_up(off, align);
+
+	if ((off_end <= off_align) || ((off_end - off_align) < align))
+		goto out;
+
+	len_align = len + align;
+	if ((off + len_align) < off)
+		goto out;
+
+	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
+			pgoff, flags);
+	if (!IS_ERR_VALUE(addr_align)) {
+		addr_align += (off - addr_align) & (align - 1);
+		return addr_align;
+	}
+ out:
+	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+
+static const struct inode_operations zufr_inode_operations;
+static const struct file_operations zufr_file_dir_operations = {
+	.open		= dcache_dir_open,
+	.release	= dcache_dir_close,
+	.llseek		= dcache_dir_lseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= dcache_readdir,
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = zufc_ioctl,
+};
+static const struct file_operations zufr_file_reg_operations = {
+	.fsync			= noop_fsync,
+	.unlocked_ioctl		= zufc_ioctl,
+	.get_unmapped_area	= zufr_get_unmapped_area,
+	.mmap			= zufc_mmap,
+	.release		= zufc_release,
+};
+
+static int zufr_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct zuf_root_info *zri = ZRI(dir->i_sb);
+	struct inode *inode;
+	int err;
+
+	inode = new_inode(dir->i_sb);
+	if (!inode)
+		return -ENOMEM;
+
+	/* We need to impersonate device-dax (S_DAX + S_IFCHR) in order to get
+	 * the PMD (huge) page faults and allow RDMA memory access via GUP
+	 * (get_user_pages_longterm).
+	 */
+	inode->i_flags = S_DAX;
+	mode = (mode & ~S_IFREG) | S_IFCHR; /* change file type to char */
+
+	inode->i_ino = ++zri->next_ino; /* non-atomic; only one mount thread */
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	inode->i_atime = inode->i_ctime;
+	inode_init_owner(inode, dir, mode);
+
+	inode->i_op = &zufr_inode_operations;
+	inode->i_fop = &zufr_file_reg_operations;
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_err("[%ld] insert_inode_locked => %d\n", inode->i_ino, err);
+		goto fail;
+	}
+	d_tmpfile(dentry, inode);
+	unlock_new_inode(inode);
+	return 0;
+
+fail:
+	clear_nlink(inode);
+	make_bad_inode(inode);
+	iput(inode);
+	return err;
+}
+
+static void zufr_put_super(struct super_block *sb)
+{
+	struct zuf_root_info *zri = ZRI(sb);
+
+	zufc_zts_fini(zri);
+	_unregister_all_fses(zri);
+
+	zuf_info("zuf_root umount\n");
+}
+
+static void zufr_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
+static const struct inode_operations zufr_inode_operations = {
+	.lookup		= simple_lookup,
+
+	.tmpfile	= zufr_tmpfile,
+	.unlink		= zufr_unlink,
+};
+static const struct super_operations zufr_super_operations = {
+	.statfs		= simple_statfs,
+
+	.evict_inode	= zufr_evict_inode,
+	.put_super	= zufr_put_super,
+};
+
+#define ZUFR_SUPER_MAGIC 0x1717
+
+static int zufr_fill_super(struct super_block *sb, void *data, int silent)
+{
+	static struct tree_descr zufr_files[] = {
+		[2] = {"state", &_state_ops, S_IFREG | 0400},
+		[3] = {"registered_fs", &_registered_fs_ops, S_IFREG | 0400},
+		{""},
+	};
+	struct zuf_root_info *zri;
+	struct inode *root_i;
+	int err;
+
+	zri = kzalloc(sizeof(*zri), GFP_KERNEL);
+	if (!zri) {
+		zuf_err_cnd(silent,
+			    "Not enough memory to allocate zuf_root_info\n");
+		return -ENOMEM;
+	}
+
+	err = simple_fill_super(sb, ZUFR_SUPER_MAGIC, zufr_files);
+	if (unlikely(err)) {
+		kfree(zri);
+		return err;
+	}
+
+	sb->s_op = &zufr_super_operations;
+	sb->s_fs_info = zri;
+	zri->sb = sb;
+
+	root_i = sb->s_root->d_inode;
+	root_i->i_fop = &zufr_file_dir_operations;
+	root_i->i_op = &zufr_inode_operations;
+
+	mutex_init(&zri->sbl_lock);
+	INIT_LIST_HEAD(&zri->fst_list);
+
+	err = zufc_zts_init(zri);
+	if (unlikely(err))
+		return err; /* put_super will be called; we have a root */
+
+	return 0;
+}
+
+static struct dentry *zufr_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name,
+				  void *data)
+{
+	struct dentry *ret = mount_nodev(fs_type, flags, data, zufr_fill_super);
+
+	if (IS_ERR_OR_NULL(ret)) {
+		zuf_dbg_err("mount_nodev(%s, %s) => %ld\n", dev_name,
+			    (char *)data, PTR_ERR(ret));
+		return ret;
+	}
+
+	zuf_info("zuf_root mount [%s]\n", dev_name);
+	return ret;
+}
+
+static struct file_system_type zufr_type = {
+	.owner =	THIS_MODULE,
+	.name =		"zuf",
+	.mount =	zufr_mount,
+	.kill_sb	= kill_litter_super,
+};
+
+/* Create a /sys/fs/zuf/ directory to mount on */
+static struct kset *zufr_kset;
+
+int __init zuf_root_init(void)
+{
+	int err = zuf_init_inodecache();
+
+	if (unlikely(err))
+		return err;
+
+	zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj);
+	if (!zufr_kset) {
+		err = -ENOMEM;
+		goto un_inodecache;
+	}
+
+	err = register_filesystem(&zufr_type);
+	if (unlikely(err))
+		goto un_kset;
+
+	return 0;
+
+un_kset:
+	kset_unregister(zufr_kset);
+un_inodecache:
+	zuf_destroy_inodecache();
+	return err;
+}
+
+static void __exit zuf_root_exit(void)
+{
+	unregister_filesystem(&zufr_type);
+	kset_unregister(zufr_kset);
+	zuf_destroy_inodecache();
+}
+
+module_init(zuf_root_init)
+module_exit(zuf_root_exit)
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
new file mode 100644
index 000000000000..3062f78c72d4
--- /dev/null
+++ b/fs/zuf/zuf.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the ZUF filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_H
+#define __ZUF_H
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/xattr.h>
+#include <linux/exportfs.h>
+#include <linux/page_ref.h>
+#include <linux/mm.h>
+
+#include "zus_api.h"
+
+#include "_pr.h"
+
+enum zlfs_e_special_file {
+	zlfs_e_zt = 1,
+	zlfs_e_mout_thread,
+	zlfs_e_pmem,
+	zlfs_e_dpp_buff,
+	zlfs_e_private_mount,
+};
+
+struct zuf_special_file {
+	enum zlfs_e_special_file type;
+	struct file *file;
+};
+
+struct zuf_private_mount_info {
+	struct zuf_special_file zsf;
+	struct super_block *sb;
+};
+
+enum {
+	ZUF_ROOT_INITIALIZING = 0,
+	ZUF_ROOT_REGISTERING_FS = 1,
+	ZUF_ROOT_MOUNT_READY = 2,
+};
+
+/* This is the zuf-root.c mini filesystem */
+struct zuf_root_info {
+	#define SBL_INC 64
+	struct sb_is_list {
+		uint num;
+		uint max;
+		struct super_block **array;
+	} sbl;
+	struct mutex sbl_lock;
+
+	ulong next_ino;
+
+	/* The definition of _ztp is private to zuf-core.c */
+	struct zuf_threads_pool *_ztp;
+
+	struct super_block *sb;
+	struct list_head fst_list;
+	int state;
+};
+
+static inline struct zuf_root_info *ZRI(struct super_block *sb)
+{
+	struct zuf_root_info *zri = sb->s_fs_info;
+
+	WARN_ON(zri->sb != sb);
+	return zri;
+}
+
+struct zuf_fs_type {
+	struct file_system_type vfs_fst;
+	struct zus_fs_info	*zus_zfi;
+	struct register_fs_info rfi;
+	struct zuf_root_info *zri;
+
+	struct list_head list;
+};
+
+static inline void zuf_add_fs_type(struct zuf_root_info *zri,
+				   struct zuf_fs_type *zft)
+{
+	/* Unlocked for now; there is only one mount-thread with zus */
+	list_add(&zft->list, &zri->fst_list);
+}
+
+/*
+ * ZUF per-inode data in memory
+ */
+struct zuf_inode_info {
+	struct inode		vfs_inode;
+};
+
+static inline struct zuf_inode_info *ZUII(struct inode *inode)
+{
+	return container_of(inode, struct zuf_inode_info, vfs_inode);
+}
+
+/* Keep this include last thing in file */
+#include "_extern.h"
+
+#endif /* __ZUF_H */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 4b1816e5dfd8..181805052ec0 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -97,4 +97,40 @@
 
 #endif /*  ndef __KERNEL__ */
 
+struct zufs_ioc_hdr {
+	__s32 err;	/* IN/OUT must be first */
+	__u16 in_len;	/* How much to be copied *to* zus */
+	__u16 out_max;	/* Max receive buffer at dispatch caller */
+	__u16 out_start;/* Start of output parameters (to caller) */
+	__u16 out_len;	/* How much to be copied *from* zus to caller */
+			/* can be modified by zus */
+	__u16 operation;/* One of e_zufs_operation */
+	__u16 flags;	/* e_zufs_hdr_flags bit flags */
+	__u32 offset;	/* Start of user buffer in ZT mmap */
+	__u32 len;	/* Len of user buffer in ZT mmap */
+};
+
+struct register_fs_info {
+	char fsname[16];	/* Only 4 chars and a NUL please      */
+	__u32 FS_magic;         /* This is the FS's version && magic  */
+	__u32 FS_ver_major;	/* on disk, not the zuf-to-zus version*/
+	__u32 FS_ver_minor;	/* (See also struct md_dev_table)   */
+	__u32 notused;
+
+	__u64 dt_offset;
+	__u64 s_maxbytes;
+	__u32 s_time_gran;
+	__u32 def_mode;
+};
+
+/* Register FS */
+/* A cookie from user-mode given in register_fs_info */
+struct zus_fs_info;
+struct zufs_ioc_register_fs {
+	struct zufs_ioc_hdr hdr;
+	struct zus_fs_info *zus_zfi;
+	struct register_fs_info rfi;
+};
+#define ZU_IOC_REGISTER_FS	_IOWR('Z', 10, struct zufs_ioc_register_fs)
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.20.1
* [PATCH 04/16] zuf: zuf-rootfs
  2019-08-12 16:47 [PATCHSET " Boaz Harrosh
@ 2019-08-12 16:47 ` Boaz Harrosh
  0 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-08-12 16:47 UTC
  To: linux-fsdevel, Anna Schumaker, Al Viro, Linus Torvalds
  Cc: Miklos Szeredi, Amir Goldstein, Amit Golander, Sagi Manole,
	Matthew Wilcox, Dan Williams
zuf-root is a pseudo FS through which the zusd Server communicates
with the Kernel: it registers new file-systems and receives new
mount requests. This patch brings up that special FS.
The principal communication with zuf-rootfs is done through
temp-files + ioctls.
The caller does an open(O_TMPFILE) and invokes some IOCTL_XXX on
the file. The specific ioctl creates an object of one of the
zuf_special_file types and attaches it to the file-ptr, thereby
defining the special behavior of that file.
Otherwise zuf-rootfs is not an FS at all. It has a few viewable
variable files, exposing state and info about the system. In this
patch we can see the "state" variable-file, which tells user-mode
when the Kernel is ready for new mounts, and the "registered_fs"
variable-file, which lists the zufFS(s) registered with the Kernel.
There is a one-to-one relationship between a zuf-root SB and
a zusd Server. Each zusd Server can support multiple zusFS
plugins and register multiple filesystem-types.
The zuf-rootfs (mount -t zuf) is usually mounted on
/sys/fs/zuf. The /sys/fs/zuf directory is automatically created
when zuf.ko is loaded. An admin who wants to run additional zusd
Servers can mount a second instance of -t zuf on some other
dir and point the new zusd Server to it (zusd has an optional path
argument). Otherwise a second zusd instance attempting to communicate
with a busy zuf-root will fail.
TODO: Find a way to trigger a first mount on module_load. Currently
the admin needs to manually "mount -t zuf none /sys/fs/zuf"
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   4 +
 fs/zuf/_extern.h  |  41 +++++
 fs/zuf/_pr.h      |  63 +++++++
 fs/zuf/super.c    |  53 ++++++
 fs/zuf/zuf-core.c |  69 ++++++++
 fs/zuf/zuf-root.c | 435 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf.h      | 115 ++++++++++++
 fs/zuf/zus_api.h  |  36 ++++
 8 files changed, 816 insertions(+)
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 452cec55f34d..b08c08e73faa 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,5 +10,9 @@
 
 obj-$(CONFIG_ZUFS_FS) += zuf.o
 
+# ZUF core
+zuf-y += zuf-core.o zuf-root.o
+
 # Main FS
+zuf-y += super.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
new file mode 100644
index 000000000000..0e8aa52f1259
--- /dev/null
+++ b/fs/zuf/_extern.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_EXTERN_H__
+#define __ZUF_EXTERN_H__
+/*
+ * DO NOT INCLUDE this file directly, it is included by zuf.h
+ * It is here because zuf.h got too big
+ */
+
+/*
+ * extern functions declarations
+ */
+
+/* zuf-core.c */
+int zufc_zts_init(struct zuf_root_info *zri); /* Some private types in core */
+void zufc_zts_fini(struct zuf_root_info *zri);
+
+long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg);
+int zufc_release(struct inode *inode, struct file *file);
+int zufc_mmap(struct file *file, struct vm_area_struct *vma);
+
+/* zuf-root.c */
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
+
+/* super.c */
+int zuf_init_inodecache(void);
+void zuf_destroy_inodecache(void);
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data);
+
+#endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
new file mode 100644
index 000000000000..51924b6bd2a5
--- /dev/null
+++ b/fs/zuf/_pr.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_PR_H__
+#define __ZUF_PR_H__
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+/*
+ * Debug code
+ */
+#define zuf_err(s, args ...)		pr_err("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_err_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_warn(s, args ...)		pr_warn("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_warn_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_warn("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_info(s, args ...)          pr_info("~info~ " s, ## args)
+
+#define zuf_chan_debug(c, s, args...)	pr_debug(c " [%s:%d] " s, __func__, \
+							__LINE__, ## args)
+
+/* ~~~ channel prints ~~~ */
+#define zuf_dbg_perf(s, args ...)	zuf_chan_debug("perfo", s, ##args)
+#define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
+#define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
+#define zuf_dbg_rw(s, args ...)		zuf_chan_debug("rw   ", s, ##args)
+#define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
+#define zuf_dbg_xattr(s, args ...)	zuf_chan_debug("xattr", s, ##args)
+#define zuf_dbg_acl(s, args ...)	zuf_chan_debug("acl  ", s, ##args)
+#define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
+#define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
+#define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
+#define zuf_dbg_mmap(s, args ...)	zuf_chan_debug("mmap ", s, ##args)
+#define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
+#define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
+
+#define md_err		zuf_err
+#define md_warn		zuf_warn
+#define md_err_cnd	zuf_err_cnd
+#define md_warn_cnd	zuf_warn_cnd
+#define md_dbg_err	zuf_dbg_err
+#define md_dbg_verbose	zuf_dbg_verbose
+
+
+#endif /* define __ZUF_PR_H__ */
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
new file mode 100644
index 000000000000..f7f7798425a9
--- /dev/null
+++ b/fs/zuf/super.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Super block operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/parser.h>
+#include <linux/statfs.h>
+#include <linux/backing-dev.h>
+
+#include "zuf.h"
+
+static struct kmem_cache *zuf_inode_cachep;
+
+static void _init_once(void *foo)
+{
+	struct zuf_inode_info *zii = foo;
+
+	inode_init_once(&zii->vfs_inode);
+}
+
+int __init zuf_init_inodecache(void)
+{
+	zuf_inode_cachep = kmem_cache_create("zuf_inode_cache",
+					       sizeof(struct zuf_inode_info),
+					       0,
+					       (SLAB_RECLAIM_ACCOUNT |
+						SLAB_MEM_SPREAD |
+						SLAB_TYPESAFE_BY_RCU),
+					       _init_once);
+	if (zuf_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+void zuf_destroy_inodecache(void)
+{
+	kmem_cache_destroy(zuf_inode_cachep);
+}
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
new file mode 100644
index 000000000000..c9bb31f75bed
--- /dev/null
+++ b/fs/zuf/zuf-core.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/delay.h>
+#include <linux/pfn_t.h>
+#include <linux/sched/signal.h>
+
+#include "zuf.h"
+
+int zufc_zts_init(struct zuf_root_info *zri)
+{
+	return 0;
+}
+
+void zufc_zts_fini(struct zuf_root_info *zri)
+{
+}
+
+long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
+{
+	switch (cmd) {
+	default:
+		zuf_err("%d\n", cmd);
+		return -ENOTTY;
+	}
+}
+
+int zufc_release(struct inode *inode, struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf)
+		return 0;
+
+	switch (zsf->type) {
+	default:
+		return 0;
+	}
+}
+
+int zufc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (unlikely(!zsf)) {
+		zuf_err("Which mmap is that !!!!\n");
+		return -ENOTTY;
+	}
+
+	switch (zsf->type) {
+	default:
+		zuf_err("type=%d\n", zsf->type);
+		return -ENOTTY;
+	}
+}
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
new file mode 100644
index 000000000000..1f5f886997f7
--- /dev/null
+++ b/fs/zuf/zuf-root.c
@@ -0,0 +1,435 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ZUF Root filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUS-ZUF interaction is done via a small specialized FS that
+ * provides the communication with the mount-thread, ZTs, pmem devices,
+ * and so on ...
+ * Subsequently all FS super_blocks are children of this root, and point
+ * to it. All sharing the same zuf communication channels.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <asm-generic/mman.h>
+
+#include "zuf.h"
+
+/* ~~~~ Register/Unregister FS-types ~~~~ */
+#ifdef CONFIG_LOCKDEP
+
+/*
+ * NOTE: When CONFIG_LOCKDEP is on, register_filesystem() complains when
+ * the fstype object comes from kmalloc, because some lockdep_keys are
+ * then not in static (const_obj) memory.
+ *
+ * So in this case we have a maximum of 16 fstypes system wide
+ * (total for all mounted zuf_root(s)). This way we can have them
+ * in const_obj memory below at g_fs_array
+ */
+
+enum { MAX_LOCKDEP_FSs = 16 };
+static uint g_fs_next;
+static struct zuf_fs_type g_fs_array[MAX_LOCKDEP_FSs];
+
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	struct zuf_fs_type *ret;
+
+	if (MAX_LOCKDEP_FSs <= g_fs_next)
+		return NULL;
+
+	ret = &g_fs_array[g_fs_next++];
+	memset(ret, 0, sizeof(*ret));
+	return ret;
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	if (zft == &g_fs_array[0])
+		g_fs_next = 0;
+}
+
+#else /* !CONFIG_LOCKDEP*/
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	return kcalloc(1, sizeof(struct zuf_fs_type), GFP_KERNEL);
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	kfree(zft);
+}
+#endif /*CONFIG_LOCKDEP*/
+
+
+static ssize_t _state_read(struct file *file, char __user *buf, size_t len,
+			   loff_t *ppos)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	const char *msg;
+
+	if (*ppos > 0)
+		return 0;
+
+	switch (zri->state) {
+	case ZUF_ROOT_INITIALIZING:
+		msg = "initializing\n";
+		break;
+	case ZUF_ROOT_REGISTERING_FS:
+		msg = "registering_fs\n";
+		break;
+	case ZUF_ROOT_MOUNT_READY:
+		msg = "mount_ready\n";
+		break;
+	default:
+		msg = "UNKNOWN\n";
+		break;
+	}
+
+	return simple_read_from_buffer(buf, len, ppos, msg, strlen(msg));
+}
+
+static const struct file_operations _state_ops = {
+	.open = nonseekable_open,
+	.read = _state_read,
+	.llseek = no_llseek,
+};
+
+static ssize_t _registered_fs_read(struct file *file, char __user *buf,
+				   size_t len, loff_t *ppos)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	size_t buff_len = 0;
+	struct zuf_fs_type *zft;
+	char *fs_buff, *p;
+	ssize_t ret;
+	size_t name_len;
+
+	list_for_each_entry(zft, &zri->fst_list, list)
+		buff_len += strlen(zft->rfi.fsname) + 1;
+
+	if (unlikely(*ppos > buff_len))
+		return -EINVAL;
+	if (*ppos == buff_len)
+		return 0;
+
+	fs_buff = kzalloc(buff_len + 1, GFP_KERNEL);
+	if (unlikely(!fs_buff))
+		return -ENOMEM;
+
+	p = fs_buff;
+	list_for_each_entry(zft, &zri->fst_list, list) {
+		if (p != fs_buff) {
+			*p = ' ';
+			++p;
+		}
+		name_len = strlen(zft->rfi.fsname);
+		memcpy(p, zft->rfi.fsname, name_len);
+		p += name_len;
+	}
+
+	p = fs_buff + *ppos;
+	buff_len = buff_len - *ppos;
+	ret = simple_read_from_buffer(buf, len, ppos, p, buff_len);
+	kfree(fs_buff);
+
+	return ret;
+}
+
+static const struct file_operations _registered_fs_ops = {
+	.open = nonseekable_open,
+	.read = _registered_fs_read,
+	.llseek = no_llseek,
+};
+
+
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs)
+{
+	struct zuf_fs_type *zft = _fs_type_alloc();
+	struct zuf_root_info *zri = ZRI(sb);
+
+	if (unlikely(!zft))
+		return -ENOMEM;
+
+	if (zri->state == ZUF_ROOT_INITIALIZING)
+		zri->state = ZUF_ROOT_REGISTERING_FS;
+
+	/* Original vfs file type */
+	zft->vfs_fst.owner	= THIS_MODULE;
+	zft->vfs_fst.name	= kstrdup(rfs->rfi.fsname, GFP_KERNEL);
+	zft->vfs_fst.mount	= zuf_mount;
+	zft->vfs_fst.kill_sb	= kill_block_super;
+
+	/* ZUS info about this FS */
+	zft->rfi		= rfs->rfi;
+	zft->zus_zfi		= rfs->zus_zfi;
+	INIT_LIST_HEAD(&zft->list);
+	/* Back pointer to our communication channels */
+	zft->zri		= ZRI(sb);
+
+	zuf_add_fs_type(zft->zri, zft);
+	zuf_info("register_filesystem [%s]\n", zft->vfs_fst.name);
+	return register_filesystem(&zft->vfs_fst);
+}
+
+static void _unregister_all_fses(struct zuf_root_info *zri)
+{
+	struct zuf_fs_type *zft, *n;
+
+	list_for_each_entry_safe_reverse(zft, n, &zri->fst_list, list) {
+		unregister_filesystem(&zft->vfs_fst);
+		list_del_init(&zft->list);
+		_fs_type_free(zft);
+	}
+}
+
+static int zufr_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+
+	drop_nlink(inode);
+	return 0;
+}
+
+/* Force alignment of 2M for all vma(s)
+ *
+ * This belongs to t1.c and what it does for mmap. But we do not mind
+ * that both our mmaps (grab_pmem or ZTs) will be 2M aligned so keep
+ * it here. And zus mappings just all match perfectly with no need for
+ * holes.
+ * FIXME: This is copy/paste from dax-device. It can be very much simplified
+ * for what we need.
+ */
+static unsigned long zufr_get_unmapped_area(struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
+{
+	unsigned long off, off_end, off_align, len_align, addr_align;
+	unsigned long align = PMD_SIZE;
+
+	if (addr)
+		goto out;
+
+	off = pgoff << PAGE_SHIFT;
+	off_end = off + len;
+	off_align = round_up(off, align);
+
+	if ((off_end <= off_align) || ((off_end - off_align) < align))
+		goto out;
+
+	len_align = len + align;
+	if ((off + len_align) < off)
+		goto out;
+
+	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
+			pgoff, flags);
+	if (!IS_ERR_VALUE(addr_align)) {
+		addr_align += (off - addr_align) & (align - 1);
+		return addr_align;
+	}
+ out:
+	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+
+static const struct inode_operations zufr_inode_operations;
+static const struct file_operations zufr_file_dir_operations = {
+	.open		= dcache_dir_open,
+	.release	= dcache_dir_close,
+	.llseek		= dcache_dir_lseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= dcache_readdir,
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = zufc_ioctl,
+};
+static const struct file_operations zufr_file_reg_operations = {
+	.fsync			= noop_fsync,
+	.unlocked_ioctl		= zufc_ioctl,
+	.get_unmapped_area	= zufr_get_unmapped_area,
+	.mmap			= zufc_mmap,
+	.release		= zufc_release,
+};
+
+static int zufr_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct zuf_root_info *zri = ZRI(dir->i_sb);
+	struct inode *inode;
+	int err;
+
+	inode = new_inode(dir->i_sb);
+	if (!inode)
+		return -ENOMEM;
+
+	/* We need to impersonate device-dax (S_DAX + S_IFCHR) in order to get
+	 * the PMD (huge) page faults and allow RDMA memory access via GUP
+	 * (get_user_pages_longterm).
+	 */
+	inode->i_flags = S_DAX;
+	mode = (mode & ~S_IFREG) | S_IFCHR; /* change file type to char */
+
+	inode->i_ino = ++zri->next_ino; /* non-atomic: only one mount thread */
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	inode->i_atime = inode->i_ctime;
+	inode_init_owner(inode, dir, mode);
+
+	inode->i_op = &zufr_inode_operations;
+	inode->i_fop = &zufr_file_reg_operations;
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_err("[%ld] insert_inode_locked => %d\n", inode->i_ino, err);
+		goto fail;
+	}
+	d_tmpfile(dentry, inode);
+	unlock_new_inode(inode);
+	return 0;
+
+fail:
+	clear_nlink(inode);
+	make_bad_inode(inode);
+	iput(inode);
+	return err;
+}
+
+static void zufr_put_super(struct super_block *sb)
+{
+	struct zuf_root_info *zri = ZRI(sb);
+
+	zufc_zts_fini(zri);
+	_unregister_all_fses(zri);
+
+	zuf_info("zuf_root umount\n");
+}
+
+static void zufr_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
+static const struct inode_operations zufr_inode_operations = {
+	.lookup		= simple_lookup,
+
+	.tmpfile	= zufr_tmpfile,
+	.unlink		= zufr_unlink,
+};
+static const struct super_operations zufr_super_operations = {
+	.statfs		= simple_statfs,
+
+	.evict_inode	= zufr_evict_inode,
+	.put_super	= zufr_put_super,
+};
+
+#define ZUFR_SUPER_MAGIC 0x1717
+
+static int zufr_fill_super(struct super_block *sb, void *data, int silent)
+{
+	static struct tree_descr zufr_files[] = {
+		[2] = {"state", &_state_ops, S_IFREG | 0400},
+		[3] = {"registered_fs", &_registered_fs_ops, S_IFREG | 0400},
+		{""},
+	};
+	struct zuf_root_info *zri;
+	struct inode *root_i;
+	int err;
+
+	zri = kzalloc(sizeof(*zri), GFP_KERNEL);
+	if (!zri) {
+		zuf_err_cnd(silent,
+			    "Not enough memory to allocate zuf_root_info\n");
+		return -ENOMEM;
+	}
+
+	err = simple_fill_super(sb, ZUFR_SUPER_MAGIC, zufr_files);
+	if (unlikely(err)) {
+		kfree(zri);
+		return err;
+	}
+
+	sb->s_op = &zufr_super_operations;
+	sb->s_fs_info = zri;
+	zri->sb = sb;
+
+	root_i = sb->s_root->d_inode;
+	root_i->i_fop = &zufr_file_dir_operations;
+	root_i->i_op = &zufr_inode_operations;
+
+	mutex_init(&zri->sbl_lock);
+	INIT_LIST_HEAD(&zri->fst_list);
+
+	err = zufc_zts_init(zri);
+	if (unlikely(err))
+		return err; /* put will be called we have a root */
+
+	return 0;
+}
+
+static struct dentry *zufr_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name,
+				  void *data)
+{
+	struct dentry *ret = mount_nodev(fs_type, flags, data, zufr_fill_super);
+
+	if (IS_ERR_OR_NULL(ret)) {
+		zuf_dbg_err("mount_nodev(%s, %s) => %ld\n", dev_name,
+			    (char *)data, PTR_ERR(ret));
+		return ret;
+	}
+
+	zuf_info("zuf_root mount [%s]\n", dev_name);
+	return ret;
+}
+
+static struct file_system_type zufr_type = {
+	.owner =	THIS_MODULE,
+	.name =		"zuf",
+	.mount =	zufr_mount,
+	.kill_sb	= kill_litter_super,
+};
+
+/* Create a /sys/fs/zuf/ directory to mount on */
+static struct kset *zufr_kset;
+
+int __init zuf_root_init(void)
+{
+	int err = zuf_init_inodecache();
+
+	if (unlikely(err))
+		return err;
+
+	zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj);
+	if (!zufr_kset) {
+		err = -ENOMEM;
+		goto un_inodecache;
+	}
+
+	err = register_filesystem(&zufr_type);
+	if (unlikely(err))
+		goto un_kset;
+
+	return 0;
+
+un_kset:
+	kset_unregister(zufr_kset);
+un_inodecache:
+	zuf_destroy_inodecache();
+	return err;
+}
+
+static void __exit zuf_root_exit(void)
+{
+	unregister_filesystem(&zufr_type);
+	kset_unregister(zufr_kset);
+	zuf_destroy_inodecache();
+}
+
+module_init(zuf_root_init)
+module_exit(zuf_root_exit)
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
new file mode 100644
index 000000000000..3062f78c72d4
--- /dev/null
+++ b/fs/zuf/zuf.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the ZUF filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_H
+#define __ZUF_H
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/xattr.h>
+#include <linux/exportfs.h>
+#include <linux/page_ref.h>
+#include <linux/mm.h>
+
+#include "zus_api.h"
+
+#include "_pr.h"
+
+enum zlfs_e_special_file {
+	zlfs_e_zt = 1,
+	zlfs_e_mout_thread,
+	zlfs_e_pmem,
+	zlfs_e_dpp_buff,
+	zlfs_e_private_mount,
+};
+
+struct zuf_special_file {
+	enum zlfs_e_special_file type;
+	struct file *file;
+};
+
+struct zuf_private_mount_info {
+	struct zuf_special_file zsf;
+	struct super_block *sb;
+};
+
+enum {
+	ZUF_ROOT_INITIALIZING = 0,
+	ZUF_ROOT_REGISTERING_FS = 1,
+	ZUF_ROOT_MOUNT_READY = 2,
+};
+
+/* This is the zuf-root.c mini filesystem */
+struct zuf_root_info {
+	#define SBL_INC 64
+	struct sb_is_list {
+		uint num;
+		uint max;
+		struct super_block **array;
+	} sbl;
+	struct mutex sbl_lock;
+
+	ulong next_ino;
+
+	/* The definition of _ztp is private to zuf-core.c */
+	struct zuf_threads_pool *_ztp;
+
+	struct super_block *sb;
+	struct list_head fst_list;
+	int state;
+};
+
+static inline struct zuf_root_info *ZRI(struct super_block *sb)
+{
+	struct zuf_root_info *zri = sb->s_fs_info;
+
+	WARN_ON(zri->sb != sb);
+	return zri;
+}
+
+struct zuf_fs_type {
+	struct file_system_type vfs_fst;
+	struct zus_fs_info	*zus_zfi;
+	struct register_fs_info rfi;
+	struct zuf_root_info *zri;
+
+	struct list_head list;
+};
+
+static inline void zuf_add_fs_type(struct zuf_root_info *zri,
+				   struct zuf_fs_type *zft)
+{
+	/* Unlocked for now only one mount-thread with zus */
+	list_add(&zft->list, &zri->fst_list);
+}
+
+/*
+ * ZUF per-inode data in memory
+ */
+struct zuf_inode_info {
+	struct inode		vfs_inode;
+};
+
+static inline struct zuf_inode_info *ZUII(struct inode *inode)
+{
+	return container_of(inode, struct zuf_inode_info, vfs_inode);
+}
+
+/* Keep this include last thing in file */
+#include "_extern.h"
+
+#endif /* __ZUF_H */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 4b1816e5dfd8..181805052ec0 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -97,4 +97,40 @@
 
 #endif /*  ndef __KERNEL__ */
 
+struct zufs_ioc_hdr {
+	__s32 err;	/* IN/OUT must be first */
+	__u16 in_len;	/* How much to be copied *to* zus */
+	__u16 out_max;	/* Max receive buffer at dispatch caller */
+	__u16 out_start;/* Start of output parameters (to caller) */
+	__u16 out_len;	/* How much to be copied *from* zus to caller */
+			/* can be modified by zus */
+	__u16 operation;/* One of e_zufs_operation */
+	__u16 flags;	/* e_zufs_hdr_flags bit flags */
+	__u32 offset;	/* Start of user buffer in ZT mmap */
+	__u32 len;	/* Len of user buffer in ZT mmap */
+};
+
+struct register_fs_info {
+	char fsname[16];	/* Only 4 chars and a NUL please      */
+	__u32 FS_magic;         /* This is the FS's version && magic  */
+	__u32 FS_ver_major;	/* on disk, not the zuf-to-zus version*/
+	__u32 FS_ver_minor;	/* (See also struct md_dev_table)   */
+	__u32 notused;
+
+	__u64 dt_offset;
+	__u64 s_maxbytes;
+	__u32 s_time_gran;
+	__u32 def_mode;
+};
+
+/* Register FS */
+/* A cookie from user-mode given in register_fs_info */
+struct zus_fs_info;
+struct zufs_ioc_register_fs {
+	struct zufs_ioc_hdr hdr;
+	struct zus_fs_info *zus_zfi;
+	struct register_fs_info rfi;
+};
+#define ZU_IOC_REGISTER_FS	_IOWR('Z', 10, struct zufs_ioc_register_fs)
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.20.1
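
The address fix-up at the end of zufr_get_unmapped_area() in the patch
above can be tried out in user space. What follows is only a minimal
sketch and not part of the patch: sketch_align_to_offset() and
SKETCH_PMD_SIZE are names made up for illustration, and 2M is assumed
to be PMD_SIZE (as on x86-64 with 4k pages). It shows that the
expression "(off - addr) & (align - 1)" moves the start of a free
region forward to the first address that has the same position inside
a 2M page as the file offset does.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: PMD_SIZE is 2 MiB (x86-64, 4 KiB base pages). */
#define SKETCH_PMD_SIZE (UINT64_C(2) << 20)

/*
 * 'base' is the start of a free region that is at least len + align
 * bytes long. Return the first address >= base that is congruent to
 * the file offset 'off' modulo 'align'.
 *
 * Mirrors the kernel line:
 *	addr_align += (off - addr_align) & (align - 1);
 */
static uint64_t sketch_align_to_offset(uint64_t base, uint64_t off,
				       uint64_t align)
{
	/* align must be a non-zero power of two for the mask to work */
	assert(align != 0 && (align & (align - 1)) == 0);

	return base + ((off - base) & (align - 1));
}
```

Because the caller asked for len + align bytes, the adjusted mapping
still fits inside the region that was reserved; the adjustment is
always smaller than align.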
* [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
@ 2019-09-26  2:07 Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 01/16] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
                   ` (17 more replies)
  0 siblings, 18 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
I would like to submit the Kernel code part of the ZUFS file system
for review. This is V02.
[v02]
   The git tree with the changes over v01 can be found at (note: based on v5.2):
	git https://github.com/NetApp/zufs-zuf upstream-5.2-v02-fixes
   The patches submitted are at:
	git https://github.com/NetApp/zufs-zuf upstream-v02
   List of changes since v01:
   * Based on Linux v5.3. The previous version was based on v5.2 and
     experienced build breakage with v5.3-rcX.
   * Address *all* comments by the Intel Robot. The code should be
     completely free of any warnings.
   * Fixes for some bugs found since the time of the first submission.
   * More i's dotted and t's crossed. Please see the
     upstream-5.2-v02-fixes branch above for all the patches on top of
     v01 before they were re-squashed into this set.
[v01]
   Find the v01 submission here:
   https://lore.kernel.org/linux-fsdevel/20190812164806.15852-1-boazh@netapp.com/
   On github:
	git https://github.com/NetApp/zufs-zuf upstream-v01
---
ZUFS is a full implementation of a VFS filesystem. But mainly it is a new
way to communicate with user-mode servers, with performance and
scalability not seen before (<4us latency).
Why? The core communication with user-mode is completely lockless,
with per-cpu locality, and NUMA aware.
The Kernel code presented here can be found at:
	https://github.com/NetApp/zufs-zuf upstream
And the User-mode Server + example FSs here:
	https://github.com/NetApp/zufs-zus upstream
ZUFS - stands for Zero-copy User-mode FS
The intention of this project is performance and low latency.
* True zero copy end to end of both data and meta data.
* Very *low latency*, very high CPU locality, lock-less parallelism.
* Synchronous operations (for low latency)
* NUMA awareness
Short description:
  ZUFS is a from-scratch implementation of a filesystem-in-user-space, which
  tries to address the above goals. From the get-go it is aimed at pmem
  based FSs, but it supports any other type of FS.
  The novelty of this project is that the interface is designed with a modern
  multi-core NUMA machine in mind, down to the ABI.
  It also utilizes the normal mount API of the Kernel.
  Multiple block devices are supported per superblock; the Kernel owns those
  devices. FileSystem types are registered/exposed in the regular way.
The Kernel code is released under a pure GPLv2 License. The user-mode core is
BSD-3 so as to be friendly with other OSs.
Current status: There are a couple of trivial open-source filesystem
implementations and a full-blown proprietary implementation from NetApp.
Three more ports to more serious open-source filesystems are on the way:
a user-mode CEPH client, a ZFS implementation, and a port of the infamous PMFS
to demonstrate the amazing pmem performance under zufs.
(They will be released as open source when they are ready.)
The Kernel module submitted here, together with the User-mode Server and the
zusFS User-mode plugins, passes NetApp QA including xfstests + internal QA tests,
and is released to customers as Maxdata.
So it is very stable and performant.
In the git repository above there is also a backport for RHEL 7.6, 7.7 and 8.0,
including rpm packages for the Kernel and Server components.
(Evaluation licenses of Maxdata 1.5 are also available for developers.
 Please contact Amit Golander <Amit.Golander@netapp.com> if you need one.)
Performance:
A simple fio direct 4k random write test with incrementing number
of threads.
[fuse]
threads	wr_iops	wr_bw	wr_lat
1	33606	134424	26.53226
2	57056	228224	30.38476
4	88667	354668	40.12783
7	116561	466245	53.98572
8	129134	516539	55.6134
[fuse-splice]
threads	wr_iops	wr_bw	wr_lat
1	39670	158682	21.8399
2	51100	204400	34.63294
4	75220	300882	47.42344
7	97706	390825	63.04435
8	98034	392137	73.24263
[xfs-dax]
threads	wr_iops	wr_bw		wr_lat   
[Maxdata-1.5-zufs]
threads	wr_iops	wr_bw		wr_lat
1	260450	1041802	3.623
2	495999	1983997	3.808
4	957364	3829456	3.959
7	1125288	4501154	5.895330
8	1100174	4400698	6.922174
I have used an 8-way KVM-qemu guest with 2 NUMA nodes
(on an Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz),
running fio with 4k random writes, O_DIRECT | O_SYNC, to a DRAM
simulated pmem (memmap=! at grub).
The fuse FS was a null-FS that does a memcpy of the same 4k.
fio was run with more and more threads (see the threads column)
to test for scalability.
We see a bit of a slowdown when pushing to 8 threads. This is
mainly a scheduler and KVM issue. Big bare-metal machines do better
(flatter scalability) but also degrade a bit at full load.
I will try to post bare-metal scores later.
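
As a quick way to cross-check the columns in the tables above, here is
a minimal sketch. It is not taken from fio, and it assumes that wr_bw
is reported in KiB/s and wr_lat in microseconds (the tables do not
state units). The helper names are made up for illustration. For 4k
writes the bandwidth should be 4 times the IOPS, and a single
synchronous thread cannot exceed 1/latency operations per second.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth in KiB/s implied by an IOPS figure at a given block size. */
static uint64_t sketch_bw_kib(uint64_t iops, uint64_t block_bytes)
{
	/* keep the arithmetic exact: whole KiB blocks only */
	assert(block_bytes % 1024 == 0);

	return iops * (block_bytes / 1024);
}

/* Upper bound on IOPS for one synchronous thread, latency in usec. */
static double sketch_iops_per_thread(double lat_usec)
{
	return 1e6 / lat_usec;
}
```

For the first [fuse] row, sketch_bw_kib(33606, 4096) gives 134424,
which is the wr_bw value listed there. With the listed latency of
26.53 usec, one thread tops out at about 37.7k IOPS, which is in line
with the 33606 measured.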
The in-Kernel xfs-dax is slower than zufs-pmem because:
1. It was not built specifically for pmem, so there are latency
   issues (async operations) and extra copies in places.
2. In writes, because of the Journal, there are actually 3 IOPs
   for every write, whereas with pmem other means can keep things
   crash-proof.
3. Because in random write + DAX each block is written twice:
   it is first ZEROed and then copied to.
4. But mainly because we use a single pmem on one of the NUMA nodes.
   With zufs we put a pmem device on each NUMA node, and each core
   writes locally, so the memory bandwidth is doubled. (Perhaps there
   is a way to use a dm configuration that makes this better, but at
   the base xfs is not NUMA aware.)
This is why I chose writes. With reads xfs-dax is much faster. In
zufs, reads are actually 10% slower because in reads we do a regular
memcpy-from-pmem, which is exactly 10% slower than mov_nt operations.
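
For readers not familiar with the term, "mov_nt" is taken here to mean
copies done with non-temporal (cache-bypassing) store instructions.
The following is a user-space sketch of that idea only, not the Kernel
code path used by zuf. sketch_copy_nt() is a made-up name, and the
SSE2 intrinsics are an assumption about the target CPU; on other
architectures the sketch simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy 'len' bytes from 'src' to 'dst'.
 * When 'dst' is 16-byte aligned and 'len' is a multiple of 16, use
 * non-temporal stores so the destination lines do not pollute the
 * CPU cache. Any other case falls back to a regular memcpy().
 */
static void sketch_copy_nt(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	if (((uintptr_t)dst & 15) == 0 && (len & 15) == 0) {
		__m128i *d = dst;
		const __m128i *s = src;
		size_t i;

		for (i = 0; i < len / 16; i++)
			_mm_stream_si128(d + i, _mm_loadu_si128(s + i));
		_mm_sfence();	/* make the streaming stores globally visible */
		return;
	}
#endif
	memcpy(dst, src, len);
}
```

The streaming path only changes how the stores are issued; the bytes
that end up in 'dst' are the same as with memcpy().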
[Changes since last RFC submission]
Lots and lots of changes since then. More hardening, stability
and more fixtures.
But the main change is the NEW-IO way.
The old way of IO, where we mmap application pages into the Server, is
still there because there are modes where it is still faster,
for example direct IO from network types of FSs. We are all about choice.
(The zusFS is the one that decides which mode to use.)
But the results above are with the NEW-IO way. In the new way
we ask the Server which blocks to read/write (either pmem or bdev)
and the IO or pmem_memcpy is done in the Kernel.
(We do not yet cache these results in the Kernel but might in the future,
 when caching will actually make things faster; currently xarray does
 not scale for us.)
[TODOs]
1. EZUFS_ASYNC is not submitted here. It is implemented but there are
   no current users so it was never fully tested (waiting for a user)
2. Support Page-cache. This one is very easy to do, but again no users
   yet
3. more stuff ....
Please help with *reviews*, comments, questions. We believe this is a very
important project that opens new ways for implementing Server-applications,
including but not restricted to FS Server applications.
Thank you
Boaz
----------------------------------------------------------------
Boaz Harrosh (16):
      fs: Add the ZUF filesystem to the build + License
      MAINTAINERS: Add the ZUFS maintainership
      zuf: Preliminary Documentation
      zuf: zuf-rootfs
      zuf: zuf-core The ZTs
      zuf: Multy Devices
      zuf: mounting
      zuf: Namei and directory operations
      zuf: readdir operation
      zuf: symlink
      zuf: Write/Read implementation
      zuf: mmap & sync
      zuf: More file operation
      zuf: ioctl implementation
      zuf: xattr && acl implementation
      zuf: Support for dynamic-debug of zusFSs
 Documentation/filesystems/zufs.txt |  386 +++++++++++++++++++++++++++++++
 MAINTAINERS                        |    6 +
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/zuf/Kconfig                     |   24 ++
 fs/zuf/Makefile                    |   23 ++
 fs/zuf/_extern.h                   |  179 +++++++++++++++
 fs/zuf/_pr.h                       |   68 ++++++
 fs/zuf/acl.c                       |  270 ++++++++++++++++++++++
 fs/zuf/directory.c                 |  171 ++++++++++++++
 fs/zuf/file.c                      |  825 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/inode.c                     |  630 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/ioctl.c                     |  309 +++++++++++++++++++++++++
 fs/zuf/md.c                        |  742 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/md.h                        |  332 +++++++++++++++++++++++++++
 fs/zuf/md_def.h                    |  141 ++++++++++++
 fs/zuf/mmap.c                      |  300 ++++++++++++++++++++++++
 fs/zuf/module.c                    |   28 +++
 fs/zuf/namei.c                     |  435 +++++++++++++++++++++++++++++++++++
 fs/zuf/relay.h                     |  104 +++++++++
 fs/zuf/rw.c                        | 1051 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/super.c                     |  954 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/symlink.c                   |   74 ++++++
 fs/zuf/t1.c                        |  145 ++++++++++++
 fs/zuf/t2.c                        |  356 ++++++++++++++++++++++++++++
 fs/zuf/t2.h                        |   68 ++++++
 fs/zuf/xattr.c                     |  314 +++++++++++++++++++++++++
 fs/zuf/zuf-core.c                  | 1735 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-root.c                  |  519 +++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf.h                       |  452 ++++++++++++++++++++++++++++++++++++
 fs/zuf/zus_api.h                   | 1075 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 31 files changed, 11718 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/acl.c
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/inode.c
 create mode 100644 fs/zuf/ioctl.c
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/mmap.c
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/relay.h
 create mode 100644 fs/zuf/rw.c
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/symlink.c
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h
 create mode 100644 fs/zuf/xattr.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
 create mode 100644 fs/zuf/zus_api.h
* [PATCH 01/16] fs: Add the ZUF filesystem to the build + License
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 02/16] MAINTAINERS: Add the ZUFS maintainership Boaz Harrosh
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
This adds the ZUF filesystem-in-user_mode module to the
fs/ build system.
Also added:
	* fs/zuf/Kconfig
	* fs/zuf/module.c - This file contains the LICENCE
			    of zuf code base
	* fs/zuf/Makefile - Rather empty Makefile with only
			    module.c above
I add the fs/zuf/Makefile to demonstrate that at every
patch-set stage the code still compiles and there are no external
references outside of the code already submitted.
Of course only at the very last patch do we have a working ZUF feeder.
[LICENCE]
  zuf.ko is a GPLv2 licensed project.
  However the ZUS user mode Server is a BSD-3-Clause licensed
  project.
  Therefore you will see that:
	zus_api.h
	md_def.h
	md.h
	t2.h
  are common files with the ZUS project, and are separately dual
  licensed as:
	GPL-2.0 WITH Linux-syscall-note or BSD-3-Clause.
  Any code contributor to these headers should note that her/his code in
  these files only is dual licensed.
  This is for the obvious reason that these headers define the API between
  the Kernel and the user-mode Server.
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/Kconfig       |  1 +
 fs/Makefile      |  1 +
 fs/zuf/Kconfig   | 24 ++++++++++++
 fs/zuf/Makefile  | 14 +++++++
 fs/zuf/module.c  | 28 ++++++++++++++
 fs/zuf/zus_api.h | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 164 insertions(+)
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/zus_api.h
diff --git a/fs/Kconfig b/fs/Kconfig
index bfb1c6095c7a..452244733bb5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -261,6 +261,7 @@ source "fs/romfs/Kconfig"
 source "fs/pstore/Kconfig"
 source "fs/sysv/Kconfig"
 source "fs/ufs/Kconfig"
+source "fs/zuf/Kconfig"
 
 endif # MISC_FILESYSTEMS
 
diff --git a/fs/Makefile b/fs/Makefile
index d60089fd689b..178e27ddd605 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -130,3 +130,4 @@ obj-$(CONFIG_F2FS_FS)		+= f2fs/
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_ZUFS_FS)		+= zuf/
diff --git a/fs/zuf/Kconfig b/fs/zuf/Kconfig
new file mode 100644
index 000000000000..58288f4245c2
--- /dev/null
+++ b/fs/zuf/Kconfig
@@ -0,0 +1,24 @@
+config ZUFS_FS
+	tristate "ZUF - Zero-copy User-mode Feeder"
+	depends on BLOCK
+	depends on ZONE_DEVICE
+	select CRC16
+	select MEMCG
+	help
+	   ZUFS Kernel part.
+	   To enable say Y here.
+
+	   To compile this as a module,  choose M here: the module will be
+	   called zuf.ko
+
+config ZUF_DEBUG
+	bool "ZUF: enable debug subsystems use"
+	depends on ZUFS_FS
+	default n
+	help
+	  INTERNAL QA USE ONLY!!! DO NOT USE!!!
+	  Please leave as N here
+
+	  This option adds some extra code that helps
+	  in QA testing of the code. It may slow the
+	  operation and produce bigger code
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
new file mode 100644
index 000000000000..452cec55f34d
--- /dev/null
+++ b/fs/zuf/Makefile
@@ -0,0 +1,14 @@
+#
+# ZUF: Zero-copy User-mode Feeder
+#
+# Copyright (c) 2018 NetApp Inc. All rights reserved.
+#
+# ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+#
+# Makefile for the Linux zufs Kernel Feeder.
+#
+
+obj-$(CONFIG_ZUFS_FS) += zuf.o
+
+# Main FS
+zuf-y += module.o
diff --git a/fs/zuf/module.c b/fs/zuf/module.c
new file mode 100644
index 000000000000..523633c1bf9d
--- /dev/null
+++ b/fs/zuf/module.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zuf - Zero-copy User-mode Feeder
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <https://www.gnu.org/licenses/>.
+ */
+#include <linux/module.h>
+
+#include "zus_api.h"
+
+MODULE_AUTHOR("Boaz Harrosh <boazh@netapp.com>");
+MODULE_AUTHOR("Sagi Manole <sagim@netapp.com>");
+MODULE_DESCRIPTION("Zero-copy User-mode Feeder");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(__stringify(ZUFS_MAJOR_VERSION) "."
+		__stringify(ZUFS_MINOR_VERSION));
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
new file mode 100644
index 000000000000..069153fc0b96
--- /dev/null
+++ b/fs/zuf/zus_api.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note or BSD-3-Clause */
+/*
+ * zus_api.h:
+ *	ZUFS (Zero-copy User-mode File System) is:
+ *		zuf (Zero-copy User-mode Feeder (Kernel)) +
+ *		zus (Zero-copy User-mode Server (daemon))
+ *
+ *	This file defines the API between the open source FS
+ *	Server and the Kernel module.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+#ifndef _LINUX_ZUFS_API_H
+#define _LINUX_ZUFS_API_H
+
+#include <linux/types.h>
+#include <linux/uuid.h>
+#include <linux/fiemap.h>
+#include <stddef.h>
+
+#ifdef __cplusplus
+#define NAMELESS(X) X
+#else
+#define NAMELESS(X)
+#endif
+
+/*
+ * Version rules:
+ *   This is the zus-to-zuf API version, and not the version of the
+ * Filesystem on-disk structures. Those are left to the FS-plugin
+ * to supply and check.
+ * The API version covers all of the API structures and constants found
+ * in this file.
+ * If the changes are made in a way backward compatible with old
+ * user-space, MINOR is incremented. Else MAJOR is incremented.
+ *
+ * It is up to the Server to decide if it wants to run with this
+ * Kernel or not. Version is only passively reported.
+ */
+#define ZUFS_MINORS_PER_MAJOR	1024
+#define ZUFS_MAJOR_VERSION 1
+#define ZUFS_MINOR_VERSION 0
+
+/* Kernel versus User space compatibility definitions */
+#ifdef __KERNEL__
+
+#include <linux/statfs.h>
+
+#else /* ! __KERNEL__ */
+
+/* verify statfs64 definition is included */
+#if !defined(__USE_LARGEFILE64) && defined(_SYS_STATFS_H)
+#error "include of 'sys/statfs.h' must appear after 'zus_api.h'"
+#else
+#define __USE_LARGEFILE64 1
+#endif
+
+#include <sys/statfs.h>
+
+#include <string.h>
+
+#define u8 uint8_t
+#define umode_t uint16_t
+
+#define PAGE_SHIFT     12
+#define PAGE_SIZE      (1 << PAGE_SHIFT)
+
+#ifndef ALIGN
+#define ALIGN(x, a)		ALIGN_MASK(x, (typeof(x))(a) - 1)
+#define ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
+#endif
+
+#ifndef likely
+#define likely(x_)	__builtin_expect(!!(x_), 1)
+#define unlikely(x_)	__builtin_expect(!!(x_), 0)
+#endif
+
+#ifndef BIT
+#define BIT(b)  (1UL << (b))
+#endif
+
+/* RHEL/CentOS7 are missing these */
+#ifndef FALLOC_FL_UNSHARE_RANGE
+#define FALLOC_FL_UNSHARE_RANGE         0x40
+#endif
+#ifndef FALLOC_FL_INSERT_RANGE
+#define FALLOC_FL_INSERT_RANGE		0x20
+#endif
+
+#endif /*  ndef __KERNEL__ */
+
+#endif /* _LINUX_ZUFS_API_H */
-- 
2.21.0
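
The version rules in zus_api.h above leave the accept-or-refuse decision to the Server. A minimal sketch of one way a Server might apply them, assuming it knows both the version it was built against and the version the Kernel reports. The helper name zus_version_compatible() is hypothetical and not part of this patch, and treating a lower Kernel MINOR as incompatible is one reading of the rules, not something the patch states:

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * built_*: ZUFS_MAJOR_VERSION / ZUFS_MINOR_VERSION the Server was compiled with.
 * kern_*:  the version the running Kernel reports.
 *
 * A MAJOR bump marks an incompatible change, so MAJOR must match exactly.
 * A MINOR bump is backward compatible with old user-space, so a Kernel
 * with the same or a higher MINOR is assumed to be acceptable.
 */
static bool zus_version_compatible(unsigned int built_major,
				   unsigned int built_minor,
				   unsigned int kern_major,
				   unsigned int kern_minor)
{
	if (kern_major != built_major)
		return false;
	return kern_minor >= built_minor;
}
```

Because the Kernel only reports its version passively, a check like this would run entirely on the Server side, before any FS-type is registered.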
* [PATCH 02/16] MAINTAINERS: Add the ZUFS maintainership
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 01/16] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 03/16] zuf: Preliminary Documentation Boaz Harrosh
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
ZUFS, which lives in the fs/zuf/ directory, is maintained
by NetApp. (I added my email.)
I keep this as a separate patch because this file might be
a source of conflicts.
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index a50e97a63bc8..8703871c1505 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17867,6 +17867,12 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/zswap.c
 
+ZUFS ZERO COPY USER-MODE FILESYSTEM
+M:	Boaz Harrosh <boazh@netapp.com>
+L:	linux-fsdevel@vger.kernel.org
+S:	Maintained
+F:	fs/zuf/
+
 THE REST
 M:	Linus Torvalds <torvalds@linux-foundation.org>
 L:	linux-kernel@vger.kernel.org
-- 
2.21.0
* [PATCH 03/16] zuf: Preliminary Documentation
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 01/16] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 02/16] MAINTAINERS: Add the ZUFS maintainership Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 04/16] zuf: zuf-rootfs Boaz Harrosh
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
Adding Documentation/filesystems/zufs.txt.
Adding some documentation first, to give the reviewer
of the coming patch-set some background and an overview of
the whole system.
[v2]
  Incorporated Randy's few comments.
Randy, please give it a harder review?
CC: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 Documentation/filesystems/zufs.txt | 386 +++++++++++++++++++++++++++++
 1 file changed, 386 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt
diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
new file mode 100644
index 000000000000..2a347a446aa7
--- /dev/null
+++ b/Documentation/filesystems/zufs.txt
@@ -0,0 +1,386 @@
+ZUFS - Zero-copy User-mode FileSystem
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Trees:
+	git clone https://github.com/NetApp/zufs-zuf -b upstream
+	git clone https://github.com/NetApp/zufs-zus -b upstream
+
+patches, comments, questions, requests to:
+	boazh@netapp.com
+
+Introduction:
+~~~~~~~~~~~~~
+
+ZUFS - stands for Zero-copy User-mode FS
+▪ It is geared towards true zero copy end to end of both data and meta data.
+▪ It is geared towards very *low latency*, very high CPU locality, lock-less
+  parallelism.
+▪ Synchronous operations
+▪ NUMA awareness
+
+  ZUFS is a from-scratch implementation of a filesystem-in-user-space, which
+tries to address the above goals. It is aimed at pmem-based FSs, but supports
+any other type of FS as well.
+
+Glossary and names:
+~~~~~~~~~~~~~~~~~~~
+
+ZUF - Zero-copy User-mode Feeder
+  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
+  VFS and dispatch commands to a User-mode application Server.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zuf -b upstream
+
+ZUS - Zero-copy User-mode Server
+  zufs utilizes a User-mode server application that takes care of the detailed
+  communication protocol and correctness with the Kernel.
+  In turn it utilizes many zusFS Filesystem plugins to implement the actual
+  on-disk Filesystem.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zus -b upstream
+
+zusFS - FS plugins
+  These are .so loadable modules that implement one or more Filesystem-types
+  (mount -t xyz).
+  The zus server communicates with the plugin via a set of function vectors
+  for the different operations, and establishes communication via defined
+  structures.
+
+Filesystem-type:
+  At startup zus registers with the Kernel one or more Filesystem-type(s).
+  Associated with the type is a unique type-name (mount -t foofs) +
+  different info about the fs, like a magic number and so on.
+  One Server can support many FS-types, in turn each FS-type can mount
+  multiple super-blocks, each supporting multiple devices.
+
+Device-Table (MDT) - A zufs FS can support multiple devices
+  ZUF in Kernel may receive, like any mount command, a block-device or none.
+  For the former, if the specified FS-type states so in a special field,
+  the mount will look for a Device table: a list of devices in a specific
+  order sitting at some offset on each block-device. The system will then
+  proceed to open and own all these devices and associate them to the mounting
+  super-block.
+  If FS-type specifies a -1 at DT_offset then there is no device table
+  and a DT of a single device is created. (If we have no devices, i.e. none
+  is specified, then we operate without any block devices. Mount options give
+  some indication of the storage information.)
+  The device table has special consideration for pmem devices and will
+  present the whole linear array of devices to zus, as one flat mmap space.
+  Alternatively all non-pmem devices are also provided an interface
+  with facility of data movement from pmem to slower devices.
+  Detailed NUMA info is exported to the Server for maximum utilization.
+  Each device has an associated NUMA node, so the Server can optimize IO to
+  these devices.
+
+pmem: (Also called t1)
+  Multiple pmem devices are presented to the server as a single
+  linear file mmap. Something like /dev/dax. But it is strictly
+  available only to the specific super-block that owns it.
+
+Shadow: (For debugging)
+  "Shadow" is used for debugging the correct persistence of pmem based
+  filesystems. With pmem, after modifying data a user must call cl_flush/sfence
+  for the data to be guaranteed persistent. This is very hard to test
+  and time consuming. So for that we invented the shadow.
+  There is a special mode bit in the MDT header that denotes a shadow
+  system. In a shadow setup each pmem device is divided in half. The first
+  half is available for FS storage. The second half is a Shadow, i.e.
+  each time the FS calls cl_flush or mov_nt the data is then memcopied
+  to the shadow.
+  At mount time the Shadow is copied onto the main part, thus
+  presenting only those bits that were persisted by the FS. So a simple
+  remount can simulate a full machine reboot.
+  The Shadow is presented as the upper part of the mmaped region, i.e.
+  the whole t1 range is repeated again. The zus core code facilitates
+  zusFS implementors in accessing this facility.
+
+zufs_dpp_t - Dual port pointer type
+  At some points in the protocol there are objects that return from zus
+  (The Server) to the Kernel via a dpp_t. This is a special kind of pointer.
+  It is actually an 8-byte-aligned offset, with the 3 low bits specifying
+  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
+  pool == 0 means the offset is in pmem, whose management is by zuf and
+  for which full, easy access is provided to zus.
+
+  pool != 0 is a pre-established file (up to 6 such files per sb) where
+  the zus has an mmap on the file and the Kernel can access that data
+  via an offset into the file.
+  pool == 7 denotes an offset into the application buffers associated
+  with the current IO.
+  The life-time rules of all dpp_t objects are strictly defined.
+  Mainly the primary use of dpp_t is the on-pmem inode structure. Both
+  zus and zuf can access and change this structure. On any modification
+  the zus is called so to be notified of any changes, persistence.
+  More such objects are: Symlinks, xattrs, data-blocks etc...
+
+Relay-wait-object:
+  Communication between Kernel and server is done via zus-threads that
+  sleep in Kernel (inside an IOCTL) and wait for commands. Once received,
+  the IOCTL returns, the operation is executed, and the return info is sent via
+  a new IOCTL call, which then waits for the next operation.
+  To wake up the sleeping thread we use a Relay-wait-object. Currently
+  it is two waitqueue_head(s) back to back.
+  In future we should investigate the use of a new special scheduler object
+  that switches from thread A to a predefined thread ZT context without passing
+  through the scheduler at all.
+  (The switching is already very fast, faster than anything currently
+   in the Kernel. But I believe I can shave another 1 microsecond off a roundtrip)
+
+ZT-threads-array:
+  The novelty of the zufs is the ZT-threads system. 3 threads or more are
+  pre-created for each active core in the system.
+  ▪ The thread is AFFINITY set for that single core only.
+  ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
+    At initialization the ZT thread communicates through a ZT_INIT ioctl
+    and registers as the handler of that core (Channel)
+  ▪ Also for each ZT, Kernel allocates an IOCTL-buffer that is directly
+    accessed by Kernel. In turn that IOCTL-buffer is mmaped by zus
+    for the Server access of that communication buffer. (This is for zero
+    copy operations as well as avoiding the smem memory barrier)
+  ▪ IOCTL_ZU_WAIT_OPT – thread sleeps in Kernel waiting for an operation
+    via the IOCTL_ZU_WAIT_OPT call.
+
+  ▪ On operation dispatch current CPU's ZT free channel is selected.
+    Operation info is set into the IOCTL-buffer, the ZT is woken and the
+    application thread is put to sleep.
+  ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released,
+    Server waits for a new operation on that CPU.
+  ▪ Each ZT has a cyclic logic. Each call to IOCTL_ZU_WAIT_OPT from Server
+    returns the results of the previous operation, before going to sleep
+    waiting to receive a new operation.
+	zus			zuf-zt				application
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     ---> IOCTL_ZU_WAIT_OPT    if (app-waiting)
+     |					wake-up-application	 -> return to app
+     |				FS-WAIT
+     |				|				<- POSIX call
+     |				V		<- fs-wake-up(dispatch)
+     |			<- return with new command
+     |--<- do_new_operation
+
+ZUS-mount-thread:
+  The system utilizes a single mount thread. (This thread has no affinity to any
+  core).
+  ▪ It will first register all FS-types supported by this Server (by calling
+    all zusFS plugins to register their supported types).
+  ▪ Once done, as above, the thread sleeps in Kernel via the IOCTL_ZU_MOUNT call.
+  ▪ When the Kernel receives a mount request (vfs calls the fs_type->mount opt)
+    a mount is dispatched back to zus.
+  ▪ NOTE: Only on the very first mount is the above ZT-threads-array created;
+    the same ZT-array is then used for all super-blocks in the system.
+  ▪ As part of the mount command, in the context of this same mount-thread,
+    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
+    associated with this super_block.
+  ▪ On return, like above, a new call to IOCTL_ZU_MOUNT will return info of the
+    mount before sleeping in kernel waiting for a new dispatch. All SB info
+    is provided to zuf, including the root inode info. Kernel then proceeds
+    to complete the mount call.
+  ▪ NOTE that since there is a single mount thread, lots of FS-registration,
+    super_block and pmem management are lockless.
+
+Philosophy of operations:
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. [zuf-root]
+
+On module load (zuf.ko) a special pseudo FS is mounted on /sys/fs/zuf. This is
+called zuf-root.
+The zuf-root has no visible files. All communication is done via special-files.
+special-files are open(O_TMPFILE) and establish a special role via an
+IOCTL. (The ZT-thread above is one example of such a special file.)
+All communications with the server are done via the zuf-root. Each root owns
+many FS-types and each FS-type owns many super-blocks of this type, all sharing
+the same communication channels.
+All FS-types of a Server live in the same zus application address space. At
+times, if the administrator wants to separate between different servers, he/she
+can mount a new zuf-root and point a new server instance on that new mount,
+registering other FS-types on that other instance. The whole communication array
+will then be duplicated as well.
+(Otherwise pointing a new server instance on a busy root will return an error)
+
+2. [zus server start]
+  ▪ On load all configured zusFS plugins are loaded.
+  ▪ The Server starts by starting a single mount thread.
+  ▪ It then proceeds to register with Kernel all FS-types it will support.
+    (This is done on the single mount thread, so FS-registration and
+     mount/umount operate in a single thread and therefore need no locks)
+  ▪ It then sleeps in the Kernel on a special-file of that zuf-root, waiting for
+    a mount command.
+
+3. [mount -t xyz]
+  [In Kernel]
+  ▪ If xyz was registered above as part of the Server startup, the regular
+    mount command will come to the zuf module with a zuf_mount() call, with
+    the xyz-FS-info. In turn this points to a zuf-root.
+  ▪ Code then proceeds to load a device-table of devices as specified above.
+    It then establishes a multi_devices object with a specific sb_id.
+  ▪ It proceeds to call mount_bdev, always with the same main-device,
+    thus fully supporting automatic bind mounts, even if different
+    devices are given to the mount command.
+  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread
+    specifying two parameters: first the FS-type to mount, and then
+    the sb_id associated with this super_block.
+
+  [In zus]
+  ▪ A zus_super_block_info is allocated.
+  ▪ zus calls PMEM_GRAB(sb_id) to establish a direct mapping to its
+    pmem devices. On return we have full access to our PMEM
+
+  ▪ ZT-threads-array
+    If this is the first mount the ZT-threads-array is created and
+    established. The mount thread will wait until all zt-threads have finished
+    initialization and are ready to rock.
+  ▪ Root-zus_inode is loaded and is returned to kernel.
+  ▪ More info about the mount, like block sizes and so on, is returned to kernel.
+
+  [In Kernel]
+   The zuf_fill_super is finalized, vectors are established, and we have a new
+   super_block ready for operations.
+
+4. An FS operation like create or WRITE/READ and so on arrives from application
+   via VFS. Eventually an Operation is dispatched to zus:
+   ▪ A special per-operation descriptor is filled up with all parameters.
+   ▪ A current CPU channel is grabbed. The operation descriptor is put on
+     that channel (ZT), including get_user_pages or Kernel-pages associated
+     with this OPT.
+   ▪ The ZT is awakened, app thread put to sleep.
+   ▪ Optionally in ZT context pages are mapped to that ZT-vma. This is so we
+     are sure the map is only on a single core. And no other core's TLB is
+     affected.
+   ▪ ZT thread is returned to user-space.
+   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
+     vector. Output params filled.
+   ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor
+     to return the requested info.
+   ▪ At Kernel (zuf) the app thread is awakened with the results, and the
+     ZT thread goes back to sleep waiting for a new operation.
+
+   ZT rules:
+       A ZT thread should try to minimize its sleeps. It might take locks,
+   in which case we will see that the same CPU channel is reentered via another
+   application/thread. But now that CPU channel is taken.  What we do is we
+   utilize a few channels (ZTs) per core and those threads may grab another
+   channel. But this only postpones the problem. On a busy contended system,
+   all such channels will be consumed. If all channels are taken the
+   application thread is put on a busy scheduling wait until a channel can
+   be grabbed.
+   If the server needs to sleep for a long time it should utilize the
+   ZUFS_ASYNC return option. The app is then kept sleeping on an
+   operation-context object and the ZT freed for foreground operation.
+   At some point in time when the server completes the delayed operation
+   it will notify the Kernel with a special async IO-context cookie.
+   And the app will be awakened.
+
+5. On umount the operation is reversed and all resources are released.
+6. In case of an application or Server crash, since all resources are associated
+   with files, on file_release these resources are caught and freed.
+
+Objects and life-time
+~~~~~~~~~~~~~~~~~~~~~
+
+Each Kernel object type has an associated zus Server object type whose life
+time is governed by the life-time of the Kernel object. Therefore the Server's
+job is easy because it need not establish any object caches / hashes and so on.
+
+Inside zus all objects are allocated by the zusFS plugin. So in turn it can
+allocate a bigger space for its own private data and access it via the
+container_of() coding pattern. So when I say below a zus-object I mean both
+zus public part + zusFS private part of the same object.
+
+All operations return a User-mode pointer that is opaque to the Kernel
+code; it is just a cookie which is returned back to zus when needed.
+At times when we want the Kernel to have direct access to a zus object like
+zufs_inode, along with the cookie we also return a dpp_t, with a defined
+structure.
+
+Kernel object 			| zus object 		| Kernel access (via dpp_t)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+zuf_fs_type
+	file_system_type	| zus_fs_info		| no
+
+zuf_sb_info
+	super_block		| zus_sb_info		| no
+
+zuf_inode_info			|			|
+	vfs_inode		| zus_inode_info	| no
+	zufs_inode *		| 	zufs_inode *	| yes
+	symlink *		|	char-array	| yes
+	xattr**			|	zus_xattr	| yes
+
+When the time comes for a Kernel object to die, a final call to zus is
+dispatched so the associated object can also be freed. Which means
+that on memory pressure when object caches are evicted also the zus
+memory resources are freed.
+
+
+How to use zufs:
+~~~~~~~~~~~~~~~~
+
+The most up-to-date documentation of how to use the latest code bases
+is the script (set of scripts) at fs/do-zu/zudo in the zus git tree.
+
+We, the developers at NetApp, use this script to mount and test our
+latest code. So any new secret will be found in these scripts. Please
+read them as the ultimate source of how to operate things.
+
+We assume you cloned these git trees:
+[]$ mkdir zufs; cd zufs
+[]$ git clone https://github.com/NetApp/zufs-zuf -b upstream
+[]$ git clone https://github.com/NetApp/zufs-zus -b upstream
+
+This will create the following trees
+zufs/zus - Source code for Server
+zufs/zuf - Linux Kernel source tree to compile and install on your machine
+
+Also specifically:
+zufs/zus/fs/do-zu/zudo - script Documenting how to run things
+
+[]$ cd zufs
+
+First time
+[]$ zus/fs/do-zu/zudo
+This will create a file:
+	zus/fs/do-zu/zu.conf
+
+Edit this file for your environment. Devices, mount-point and so on.
+On first run an example file will be created for you. Fill in the
+blanks. Most params can stay as is in most cases
+
+Now let's start running:
+
+[1]$ zus/fs/do-zu/zudo mkfs
+This will run the proper mkfs command selected at zu.conf file
+with the proper devices.
+
+[2]$ zus/fs/do-zu/zudo zuf-insmod
+This loads the zuf.ko module
+
+[3]$ zus/fs/do-zu/zudo zuf-root
+This mounts the zuf-root FS above on /sys/fs/zuf (automatically created in [2])
+
+[4]$ zus/fs/do-zu/zudo zus-up
+This runs the zus daemon in the background
+
+[5]$ zus/fs/do-zu/zudo mount
+This mounts the FS made by mkfs above on the dir specified in zu.conf
+
+To run all the 5 commands above at once do:
+[]$ zus/fs/do-zu/zudo up
+
+To undo all the above in reverse order do:
+[]$ zus/fs/do-zu/zudo down
+
+And the most magic command is:
+[]$ zus/fs/do-zu/zudo again
+Will do a "down", then update-mods, then "up"
+(update-mods is a special script to copy the latest compiled binaries)
+
+Now you are ready for some:
+[]$ zus/fs/do-zu/zudo xfstest
+xfstests is assumed to be installed in the regular /opt/xfstests dir
+
+Again please see inside the scripts what each command does
+these scripts are the ultimate Documentation, do not believe
+anything I'm saying here. (Because it is outdated by now)
+
+Have a nice day
-- 
2.21.0
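
The zufs_dpp_t layout described in the documentation above (8-byte-aligned offset, pool code in the 3 low bits) can be sketched in a few lines of C. This is only an illustration of the bit arithmetic given in the text. The 64-bit width, the helper names dpp_pool(), dpp_offset() and dpp_make(), and the enum names are assumptions made for the example and are not taken from the patch set:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to be 64 bits wide; the documentation does not state the width */
typedef uint64_t zufs_dpp_t;

#define DPP_POOL_MASK	((zufs_dpp_t)0x7)	/* the 3 low bits */

/* Hypothetical names for the pool codes described in the text */
enum {
	DPP_POOL_PMEM = 0,	/* offset into the pmem managed by zuf */
	DPP_POOL_APP_BUF = 7,	/* offset into the current IO's app buffers */
	/* other non-zero codes: one of the pre-established per-sb files */
};

/* pool = dpp_t & 0x7 */
static inline unsigned int dpp_pool(zufs_dpp_t dpp)
{
	return (unsigned int)(dpp & DPP_POOL_MASK);
}

/* offset = dpp_t & ~0x7 */
static inline uint64_t dpp_offset(zufs_dpp_t dpp)
{
	return dpp & ~DPP_POOL_MASK;
}

/* Pack an 8-byte-aligned offset and a pool code into one dpp_t */
static inline zufs_dpp_t dpp_make(uint64_t offset, unsigned int pool)
{
	assert((offset & DPP_POOL_MASK) == 0);	/* must be 8-byte aligned */
	assert(pool <= DPP_POOL_MASK);		/* must fit in 3 bits */
	return offset | pool;
}
```

For example, dpp_make(0x1000, 3) yields a value whose dpp_pool() is 3 and whose dpp_offset() is 0x1000. The alignment requirement is what frees the low bits for the pool code.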
* [PATCH 04/16] zuf: zuf-rootfs
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (2 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 03/16] zuf: Preliminary Documentation Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 05/16] zuf: zuf-core The ZTs Boaz Harrosh
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
zuf-root is a pseudo FS that the zusd Server communicates through:
it registers new file-systems and receives new mount requests.
In this patch we have the bring up of that special FS.
The principal communication with zuf-rootfs is done through
temp-files + ioctls.
Caller does an open(O_TMPFILE) and invokes some IOCTL_XXX on
the file. The specific ioctl establishes an object of one of the
zuf_special_file types and attaches the object to the file-ptr, thereby
defining special behavior for that file.
Otherwise zuf-rootfs is not an FS at all. It has a few viewable
variable files, exposing state and info about the system. In this
patch we can see the "state" variable-file, which denotes to user-mode
when the Kernel is ready for new mounts, and the registered_fs file, which
exposes what zufFS(s) were registered with the Kernel.
There is a one-to-one relationship between a zuf-root SB and
a zusd Server. Each zusd Server can support multiple zusFS
plugins and register multiple filesystem-types.
The zuf-rootfs (mount -t zuf) is usually mounted on
/sys/fs/zuf. The /sys/fs/zuf directory is automatically created
when zuf.ko is loaded. If an admin wants to run more zusd server
applications she/he can mount a second instance of -t zuf on some
dir and point the new zusd Server to it. (zusd has an optional path
argument). Otherwise a second instance attempting to communicate
with a busy zuf-root will fail.
TODO: How to trigger a first mount on module_load. Currently
admin needs to manually "mount -t zuf none /sys/fs/zuf"
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   4 +
 fs/zuf/_extern.h  |  41 +++++
 fs/zuf/_pr.h      |  63 +++++++
 fs/zuf/super.c    |  53 ++++++
 fs/zuf/zuf-core.c |  69 ++++++++
 fs/zuf/zuf-root.c | 438 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf.h      | 116 ++++++++++++
 fs/zuf/zus_api.h  |  36 ++++
 8 files changed, 820 insertions(+)
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 452cec55f34d..b08c08e73faa 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,5 +10,9 @@
 
 obj-$(CONFIG_ZUFS_FS) += zuf.o
 
+# ZUF core
+zuf-y += zuf-core.o zuf-root.o
+
 # Main FS
+zuf-y += super.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
new file mode 100644
index 000000000000..0e8aa52f1259
--- /dev/null
+++ b/fs/zuf/_extern.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_EXTERN_H__
+#define __ZUF_EXTERN_H__
+/*
+ * DO NOT INCLUDE this file directly, it is included by zuf.h
+ * It is here because zuf.h got too big
+ */
+
+/*
+ * extern functions declarations
+ */
+
+/* zuf-core.c */
+int zufc_zts_init(struct zuf_root_info *zri); /* Some private types in core */
+void zufc_zts_fini(struct zuf_root_info *zri);
+
+long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg);
+int zufc_release(struct inode *inode, struct file *file);
+int zufc_mmap(struct file *file, struct vm_area_struct *vma);
+
+/* zuf-root.c */
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
+
+/* super.c */
+int zuf_init_inodecache(void);
+void zuf_destroy_inodecache(void);
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data);
+
+#endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
new file mode 100644
index 000000000000..51924b6bd2a5
--- /dev/null
+++ b/fs/zuf/_pr.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_PR_H__
+#define __ZUF_PR_H__
+
+#ifdef pr_fmt
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#endif
+
+/*
+ * Debug code
+ */
+#define zuf_err(s, args ...)		pr_err("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_err_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_warn(s, args ...)		pr_warn("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_warn_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_warn("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_info(s, args ...)          pr_info("~info~ " s, ## args)
+
+#define zuf_chan_debug(c, s, args...)	pr_debug(c " [%s:%d] " s, __func__, \
+							__LINE__, ## args)
+
+/* ~~~ channel prints ~~~ */
+#define zuf_dbg_perf(s, args ...)	zuf_chan_debug("perfo", s, ##args)
+#define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
+#define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
+#define zuf_dbg_rw(s, args ...)		zuf_chan_debug("rw   ", s, ##args)
+#define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
+#define zuf_dbg_xattr(s, args ...)	zuf_chan_debug("xattr", s, ##args)
+#define zuf_dbg_acl(s, args ...)	zuf_chan_debug("acl  ", s, ##args)
+#define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
+#define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
+#define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
+#define zuf_dbg_mmap(s, args ...)	zuf_chan_debug("mmap ", s, ##args)
+#define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
+#define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
+
+#define md_err		zuf_err
+#define md_warn		zuf_warn
+#define md_err_cnd	zuf_err_cnd
+#define md_warn_cnd	zuf_warn_cnd
+#define md_dbg_err	zuf_dbg_err
+#define md_dbg_verbose	zuf_dbg_verbose
+
+
+#endif /* define __ZUF_PR_H__ */
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
new file mode 100644
index 000000000000..f7f7798425a9
--- /dev/null
+++ b/fs/zuf/super.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Super block operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/parser.h>
+#include <linux/statfs.h>
+#include <linux/backing-dev.h>
+
+#include "zuf.h"
+
+static struct kmem_cache *zuf_inode_cachep;
+
+static void _init_once(void *foo)
+{
+	struct zuf_inode_info *zii = foo;
+
+	inode_init_once(&zii->vfs_inode);
+}
+
+int __init zuf_init_inodecache(void)
+{
+	zuf_inode_cachep = kmem_cache_create("zuf_inode_cache",
+					       sizeof(struct zuf_inode_info),
+					       0,
+					       (SLAB_RECLAIM_ACCOUNT |
+						SLAB_MEM_SPREAD |
+						SLAB_TYPESAFE_BY_RCU),
+					       _init_once);
+	if (zuf_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+void zuf_destroy_inodecache(void)
+{
+	kmem_cache_destroy(zuf_inode_cachep);
+}
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
new file mode 100644
index 000000000000..c9bb31f75bed
--- /dev/null
+++ b/fs/zuf/zuf-core.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/delay.h>
+#include <linux/pfn_t.h>
+#include <linux/sched/signal.h>
+
+#include "zuf.h"
+
+int zufc_zts_init(struct zuf_root_info *zri)
+{
+	return 0;
+}
+
+void zufc_zts_fini(struct zuf_root_info *zri)
+{
+}
+
+long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
+{
+	switch (cmd) {
+	default:
+		zuf_err("%d\n", cmd);
+		return -ENOTTY;
+	}
+}
+
+int zufc_release(struct inode *inode, struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf)
+		return 0;
+
+	switch (zsf->type) {
+	default:
+		return 0;
+	}
+}
+
+int zufc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (unlikely(!zsf)) {
+		zuf_err("Which mmap is that !!!!\n");
+		return -ENOTTY;
+	}
+
+	switch (zsf->type) {
+	default:
+		zuf_err("type=%d\n", zsf->type);
+		return -ENOTTY;
+	}
+}
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
new file mode 100644
index 000000000000..ea7eb810ea9d
--- /dev/null
+++ b/fs/zuf/zuf-root.c
@@ -0,0 +1,438 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ZUF Root filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUS-ZUF interaction is done via a small specialized FS that
+ * provides the communication with the mount-thread, ZTs, pmem devices,
+ * and so on.
+ * All FS super_blocks are then children of this root and point
+ * to it, all sharing the same zuf communication channels.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <asm-generic/mman.h>
+
+#include "zuf.h"
+
+/* ~~~~ Register/Unregister FS-types ~~~~ */
+#ifdef CONFIG_LOCKDEP
+
+/*
+ * NOTE: When CONFIG_LOCKDEP is on, register_filesystem() complains when
+ * the fstype object comes from kmalloc, because lockdep expects the
+ * lock keys embedded in it to live in static memory.
+ *
+ * So in this case we have a maximum of 16 fstypes system wide
+ * (total for all mounted zuf_root(s)). This way we can keep them
+ * in static memory, below at g_fs_array.
+ */
+
+enum { MAX_LOCKDEP_FSs = 16 };
+static uint g_fs_next;
+static struct zuf_fs_type g_fs_array[MAX_LOCKDEP_FSs];
+
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	struct zuf_fs_type *ret;
+
+	if (MAX_LOCKDEP_FSs <= g_fs_next)
+		return NULL;
+
+	ret = &g_fs_array[g_fs_next++];
+	memset(ret, 0, sizeof(*ret));
+	return ret;
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	if (zft == &g_fs_array[0])
+		g_fs_next = 0;
+}
+
+#else /* !CONFIG_LOCKDEP*/
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	return kcalloc(1, sizeof(struct zuf_fs_type), GFP_KERNEL);
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	kfree(zft);
+}
+#endif /*CONFIG_LOCKDEP*/
+
+
+static ssize_t _state_read(struct file *file, char __user *buf, size_t len,
+			   loff_t *ppos)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	const char *msg;
+
+	if (*ppos > 0)
+		return 0;
+
+	switch (zri->state) {
+	case ZUF_ROOT_INITIALIZING:
+		msg = "initializing\n";
+		break;
+	case ZUF_ROOT_REGISTERING_FS:
+		msg = "registering_fs\n";
+		break;
+	case ZUF_ROOT_MOUNT_READY:
+		msg = "mount_ready\n";
+		break;
+	case ZUF_ROOT_SERVER_FAILED:
+		msg = "server_failed\n";
+		break;
+	default:
+		msg = "UNKNOWN\n";
+		break;
+	}
+
+	return simple_read_from_buffer(buf, len, ppos, msg, strlen(msg));
+}
+
+static const struct file_operations _state_ops = {
+	.open = nonseekable_open,
+	.read = _state_read,
+	.llseek = no_llseek,
+};
+
+static ssize_t _registered_fs_read(struct file *file, char __user *buf,
+				   size_t len, loff_t *ppos)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	size_t buff_len = 0;
+	struct zuf_fs_type *zft;
+	char *fs_buff, *p;
+	ssize_t ret;
+	size_t name_len;
+
+	list_for_each_entry(zft, &zri->fst_list, list)
+		buff_len += strlen(zft->rfi.fsname) + 1;
+
+	if (unlikely(*ppos > buff_len))
+		return -EINVAL;
+	if (*ppos == buff_len)
+		return 0;
+
+	fs_buff = kzalloc(buff_len + 1, GFP_KERNEL);
+	if (unlikely(!fs_buff))
+		return -ENOMEM;
+
+	p = fs_buff;
+	list_for_each_entry(zft, &zri->fst_list, list) {
+		if (p != fs_buff) {
+			*p = ' ';
+			++p;
+		}
+		name_len = strlen(zft->rfi.fsname);
+		memcpy(p, zft->rfi.fsname, name_len);
+		p += name_len;
+	}
+
+	p = fs_buff + *ppos;
+	buff_len = buff_len - *ppos;
+	ret = simple_read_from_buffer(buf, len, ppos, p, buff_len);
+	kfree(fs_buff);
+
+	return ret;
+}
+
+static const struct file_operations _registered_fs_ops = {
+	.open = nonseekable_open,
+	.read = _registered_fs_read,
+	.llseek = no_llseek,
+};
+
+
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs)
+{
+	struct zuf_fs_type *zft = _fs_type_alloc();
+	struct zuf_root_info *zri = ZRI(sb);
+
+	if (unlikely(!zft))
+		return -ENOMEM;
+
+	if (zri->state == ZUF_ROOT_INITIALIZING)
+		zri->state = ZUF_ROOT_REGISTERING_FS;
+
+	/* Original vfs file type */
+	zft->vfs_fst.owner	= THIS_MODULE;
+	zft->vfs_fst.name	= kstrdup(rfs->rfi.fsname, GFP_KERNEL);
+	zft->vfs_fst.mount	= zuf_mount;
+	zft->vfs_fst.kill_sb	= kill_block_super;
+
+	/* ZUS info about this FS */
+	zft->rfi		= rfs->rfi;
+	zft->zus_zfi		= rfs->zus_zfi;
+	INIT_LIST_HEAD(&zft->list);
+	/* Back pointer to our communication channels */
+	zft->zri		= ZRI(sb);
+
+	zuf_add_fs_type(zft->zri, zft);
+	zuf_info("register_filesystem [%s]\n", zft->vfs_fst.name);
+	return register_filesystem(&zft->vfs_fst);
+}
+
+static void _unregister_all_fses(struct zuf_root_info *zri)
+{
+	struct zuf_fs_type *zft, *n;
+
+	list_for_each_entry_safe_reverse(zft, n, &zri->fst_list, list) {
+		unregister_filesystem(&zft->vfs_fst);
+		list_del_init(&zft->list);
+		_fs_type_free(zft);
+	}
+}
+
+static int zufr_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+
+	drop_nlink(inode);
+	return 0;
+}
+
+/* Force alignment of 2M for all vma(s)
+ *
+ * This belongs to t1.c and what it does for mmap. But we do not mind
+ * that both our mmaps (grab_pmem or ZTs) will be 2M aligned, so keep
+ * it here. The zus mappings then all match perfectly, with no need for
+ * holes.
+ * FIXME: This is copy/paste from dax-device. It can be very much simplified
+ * for what we need.
+ */
+static unsigned long zufr_get_unmapped_area(struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
+{
+	unsigned long off, off_end, off_align, len_align, addr_align;
+	unsigned long align = PMD_SIZE;
+
+	if (addr)
+		goto out;
+
+	off = pgoff << PAGE_SHIFT;
+	off_end = off + len;
+	off_align = round_up(off, align);
+
+	if ((off_end <= off_align) || ((off_end - off_align) < align))
+		goto out;
+
+	len_align = len + align;
+	if ((off + len_align) < off)
+		goto out;
+
+	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
+			pgoff, flags);
+	if (!IS_ERR_VALUE(addr_align)) {
+		addr_align += (off - addr_align) & (align - 1);
+		return addr_align;
+	}
+ out:
+	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+
+static const struct inode_operations zufr_inode_operations;
+static const struct file_operations zufr_file_dir_operations = {
+	.open		= dcache_dir_open,
+	.release	= dcache_dir_close,
+	.llseek		= dcache_dir_lseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= dcache_readdir,
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = zufc_ioctl,
+};
+static const struct file_operations zufr_file_reg_operations = {
+	.fsync			= noop_fsync,
+	.unlocked_ioctl		= zufc_ioctl,
+	.get_unmapped_area	= zufr_get_unmapped_area,
+	.mmap			= zufc_mmap,
+	.release		= zufc_release,
+};
+
+static int zufr_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct zuf_root_info *zri = ZRI(dir->i_sb);
+	struct inode *inode;
+	int err;
+
+	inode = new_inode(dir->i_sb);
+	if (!inode)
+		return -ENOMEM;
+
+	/* We need to impersonate device-dax (S_DAX + S_IFCHR) in order to get
+	 * the PMD (huge) page faults and allow RDMA memory access via GUP
+	 * (get_user_pages_longterm).
+	 */
+	inode->i_flags = S_DAX;
+	mode = (mode & ~S_IFREG) | S_IFCHR; /* change file type to char */
+
+	inode->i_ino = ++zri->next_ino; /* non-atomic: only one mount thread */
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	inode->i_atime = inode->i_ctime;
+	inode_init_owner(inode, dir, mode);
+
+	inode->i_op = &zufr_inode_operations;
+	inode->i_fop = &zufr_file_reg_operations;
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_err("[%ld] insert_inode_locked => %d\n", inode->i_ino, err);
+		goto fail;
+	}
+	d_tmpfile(dentry, inode);
+	unlock_new_inode(inode);
+	return 0;
+
+fail:
+	clear_nlink(inode);
+	make_bad_inode(inode);
+	iput(inode);
+	return err;
+}
+
+static void zufr_put_super(struct super_block *sb)
+{
+	struct zuf_root_info *zri = ZRI(sb);
+
+	zufc_zts_fini(zri);
+	_unregister_all_fses(zri);
+
+	zuf_info("zuf_root umount\n");
+}
+
+static void zufr_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
+static const struct inode_operations zufr_inode_operations = {
+	.lookup		= simple_lookup,
+
+	.tmpfile	= zufr_tmpfile,
+	.unlink		= zufr_unlink,
+};
+static const struct super_operations zufr_super_operations = {
+	.statfs		= simple_statfs,
+
+	.evict_inode	= zufr_evict_inode,
+	.put_super	= zufr_put_super,
+};
+
+#define ZUFR_SUPER_MAGIC 0x1717
+
+static int zufr_fill_super(struct super_block *sb, void *data, int silent)
+{
+	static struct tree_descr zufr_files[] = {
+		[2] = {"state", &_state_ops, S_IFREG | 0400},
+		[3] = {"registered_fs", &_registered_fs_ops, S_IFREG | 0400},
+		{""},
+	};
+	struct zuf_root_info *zri;
+	struct inode *root_i;
+	int err;
+
+	zri = kzalloc(sizeof(*zri), GFP_KERNEL);
+	if (!zri) {
+		zuf_err_cnd(silent,
+			    "Not enough memory to allocate zuf_root_info\n");
+		return -ENOMEM;
+	}
+
+	err = simple_fill_super(sb, ZUFR_SUPER_MAGIC, zufr_files);
+	if (unlikely(err)) {
+		kfree(zri);
+		return err;
+	}
+
+	sb->s_op = &zufr_super_operations;
+	sb->s_fs_info = zri;
+	zri->sb = sb;
+
+	root_i = sb->s_root->d_inode;
+	root_i->i_fop = &zufr_file_dir_operations;
+	root_i->i_op = &zufr_inode_operations;
+
+	mutex_init(&zri->sbl_lock);
+	INIT_LIST_HEAD(&zri->fst_list);
+
+	err = zufc_zts_init(zri);
+	if (unlikely(err))
+		return err; /* put_super will be called, we have a root */
+
+	return 0;
+}
+
+static struct dentry *zufr_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name,
+				  void *data)
+{
+	struct dentry *ret = mount_nodev(fs_type, flags, data, zufr_fill_super);
+
+	if (IS_ERR_OR_NULL(ret)) {
+		zuf_dbg_err("mount_nodev(%s, %s) => %ld\n", dev_name,
+			    (char *)data, PTR_ERR(ret));
+		return ret;
+	}
+
+	zuf_info("zuf_root mount [%s]\n", dev_name);
+	return ret;
+}
+
+static struct file_system_type zufr_type = {
+	.owner =	THIS_MODULE,
+	.name =		"zuf",
+	.mount =	zufr_mount,
+	.kill_sb	= kill_litter_super,
+};
+
+/* Create a /sys/fs/zuf/ directory to mount on */
+static struct kset *zufr_kset;
+
+int __init zuf_root_init(void)
+{
+	int err = zuf_init_inodecache();
+
+	if (unlikely(err))
+		return err;
+
+	zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj);
+	if (!zufr_kset) {
+		err = -ENOMEM;
+		goto un_inodecache;
+	}
+
+	err = register_filesystem(&zufr_type);
+	if (unlikely(err))
+		goto un_kset;
+
+	return 0;
+
+un_kset:
+	kset_unregister(zufr_kset);
+un_inodecache:
+	zuf_destroy_inodecache();
+	return err;
+}
+
+static void __exit zuf_root_exit(void)
+{
+	unregister_filesystem(&zufr_type);
+	kset_unregister(zufr_kset);
+	zuf_destroy_inodecache();
+}
+
+module_init(zuf_root_init)
+module_exit(zuf_root_exit)
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
new file mode 100644
index 000000000000..919b84f7478f
--- /dev/null
+++ b/fs/zuf/zuf.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the ZUF filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_H
+#define __ZUF_H
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/xattr.h>
+#include <linux/exportfs.h>
+#include <linux/page_ref.h>
+#include <linux/mm.h>
+
+#include "zus_api.h"
+
+#include "_pr.h"
+
+enum zlfs_e_special_file {
+	zlfs_e_zt = 1,
+	zlfs_e_mout_thread,
+	zlfs_e_pmem,
+	zlfs_e_dpp_buff,
+	zlfs_e_private_mount,
+};
+
+struct zuf_special_file {
+	enum zlfs_e_special_file type;
+	struct file *file;
+};
+
+struct zuf_private_mount_info {
+	struct zuf_special_file zsf;
+	struct super_block *sb;
+};
+
+enum {
+	ZUF_ROOT_INITIALIZING = 0,
+	ZUF_ROOT_REGISTERING_FS = 1,
+	ZUF_ROOT_MOUNT_READY = 2,
+	ZUF_ROOT_SERVER_FAILED	= 3,	/* server crashed unexpectedly */
+};
+
+/* This is the zuf-root.c mini filesystem */
+struct zuf_root_info {
+	#define SBL_INC 64
+	struct sb_is_list {
+		uint num;
+		uint max;
+		struct super_block **array;
+	} sbl;
+	struct mutex sbl_lock;
+
+	ulong next_ino;
+
+	/* The definition of _ztp is private to zuf-core.c */
+	struct zuf_threads_pool *_ztp;
+
+	struct super_block *sb;
+	struct list_head fst_list;
+	int state;
+};
+
+static inline struct zuf_root_info *ZRI(struct super_block *sb)
+{
+	struct zuf_root_info *zri = sb->s_fs_info;
+
+	WARN_ON(zri->sb != sb);
+	return zri;
+}
+
+struct zuf_fs_type {
+	struct file_system_type vfs_fst;
+	struct zus_fs_info	*zus_zfi;
+	struct register_fs_info rfi;
+	struct zuf_root_info *zri;
+
+	struct list_head list;
+};
+
+static inline void zuf_add_fs_type(struct zuf_root_info *zri,
+				   struct zuf_fs_type *zft)
+{
+	/* Unlocked for now; there is only one mount-thread with zus */
+	list_add(&zft->list, &zri->fst_list);
+}
+
+/*
+ * ZUF per-inode data in memory
+ */
+struct zuf_inode_info {
+	struct inode		vfs_inode;
+};
+
+static inline struct zuf_inode_info *ZUII(struct inode *inode)
+{
+	return container_of(inode, struct zuf_inode_info, vfs_inode);
+}
+
+/* Keep this include last thing in file */
+#include "_extern.h"
+
+#endif /* __ZUF_H */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 069153fc0b96..f293e03460be 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -93,4 +93,40 @@
 
 #endif /*  ndef __KERNEL__ */
 
+struct zufs_ioc_hdr {
+	__s32 err;	/* IN/OUT must be first */
+	__u16 in_len;	/* How much to be copied *to* zus */
+	__u16 out_max;	/* Max receive buffer at dispatch caller */
+	__u16 out_start;/* Start of output parameters (to caller) */
+	__u16 out_len;	/* How much to be copied *from* zus to caller */
+			/* can be modified by zus */
+	__u16 operation;/* One of e_zufs_operation */
+	__u16 flags;	/* e_zufs_hdr_flags bit flags */
+	__u32 offset;	/* Start of user buffer in ZT mmap */
+	__u32 len;	/* Len of user buffer in ZT mmap */
+};
+
+struct register_fs_info {
+	char fsname[16];	/* Only 4 chars and a NUL please      */
+	__u32 FS_magic;         /* This is the FS's version && magic  */
+	__u32 FS_ver_major;	/* on disk, not the zuf-to-zus version*/
+	__u32 FS_ver_minor;	/* (See also struct md_dev_table)   */
+	__u32 notused;
+
+	__u64 dt_offset;
+	__u64 s_maxbytes;
+	__u32 s_time_gran;
+	__u32 def_mode;
+};
+
+/* Register FS */
+/* A cookie from user-mode given in register_fs_info */
+struct zus_fs_info;
+struct zufs_ioc_register_fs {
+	struct zufs_ioc_hdr hdr;
+	struct zus_fs_info *zus_zfi;
+	struct register_fs_info rfi;
+};
+#define ZU_IOC_REGISTER_FS	_IOWR('Z', 10, struct zufs_ioc_register_fs)
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.21.0
* [PATCH 05/16] zuf: zuf-core The ZTs
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (3 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 04/16] zuf: zuf-rootfs Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 06/16] zuf: Multy Devices Boaz Harrosh
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
zuf-core establishes the communication channels with the ZUS
User Mode Server.
This patch adds the core communication mechanics, which are the
novelty of this project.
(See the previously submitted documentation for more info.)
Users of these mechanics come later in the patchset.
NOTE: The file relay.h defines an object called "relay".
 "Relay" here is meant in the sense of a relay race, where runners
 pass the baton from one runner to the next.
 Likewise, here an Application thread passes execution
 to the Server thread and back.
 TODO: In the future we might define a new scheduler object that
       does the same without passing through the scheduler
       at all, instead relinquishing the remainder of its time slice
       to the next thread. Maybe we can cut another half a microsecond
       off the latency of an IOP (by avoiding locks and atomics).
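
The baton passing described in the NOTE above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions:
struct relay_model and the model_* helpers are hypothetical names that are
not part of this patch. They mirror the flag updates of struct relay in
relay.h but leave out the wait queues, so nothing blocks and no real
scheduling takes place.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the flags in struct relay (relay.h).
 * The wait queues are left out, so nothing here ever blocks.
 */
struct relay_model {
	bool fss_wakeup;	/* baton handed to the server thread */
	bool fss_waiting;	/* server is parked, ready for work */
	bool app_wakeup;	/* baton handed back to the application */
	bool app_waiting;	/* application is parked, awaiting a reply */
};

/* Server parks itself; mirrors the flag writes of relay_fss_wait() */
static void model_fss_wait(struct relay_model *r)
{
	r->fss_waiting = true;
	r->fss_wakeup = false;
}

/* Application claims a parked server; mirrors relay_is_fss_waiting_grab() */
static bool model_fss_grab(struct relay_model *r)
{
	if (!r->fss_waiting)
		return false;
	r->fss_waiting = false;
	return true;
}

/* Application hands over the baton; mirrors relay_fss_wakeup_app_wait() */
static void model_fss_wakeup_app_wait(struct relay_model *r)
{
	/* Assumption of this model: the caller grabbed the server first */
	assert(!r->fss_waiting);
	r->app_waiting = true;
	r->fss_wakeup = true;
	r->app_wakeup = false;
}

/* Server hands the baton back; mirrors relay_app_wakeup() */
static void model_app_wakeup(struct relay_model *r)
{
	r->app_waiting = false;
	r->app_wakeup = true;
}

/* One full lap: true when the baton went app -> server -> app */
static bool model_round_trip(void)
{
	struct relay_model r = { 0 };

	if (model_fss_grab(&r))		/* no server parked yet */
		return false;
	model_fss_wait(&r);		/* server parks */
	if (!model_fss_grab(&r))	/* application claims it */
		return false;
	if (model_fss_grab(&r))		/* a second claim must fail */
		return false;
	model_fss_wakeup_app_wait(&r);	/* baton to the server */
	if (!(r.fss_wakeup && r.app_waiting))
		return false;
	model_app_wakeup(&r);		/* baton back to the application */
	return r.app_wakeup && !r.app_waiting;
}
```

In the patch itself the same flag updates are paired with wait_event() and
wake_up() on the two wait queues. The model only shows that a parked
server can be claimed by exactly one application thread per lap.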
[v2 for Linux v5.3]
  lin-jump5.3: task_struct cpus_allowed => cpus_mask
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h  |   16 +
 fs/zuf/_pr.h      |    5 +
 fs/zuf/relay.h    |  104 +++++
 fs/zuf/zuf-core.c | 1077 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/zuf.h      |   41 ++
 fs/zuf/zus_api.h  |  291 ++++++++++++
 6 files changed, 1533 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/relay.h
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 0e8aa52f1259..1f786fc24b85 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -27,6 +27,22 @@ void zufc_zts_fini(struct zuf_root_info *zri);
 long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg);
 int zufc_release(struct inode *inode, struct file *file);
 int zufc_mmap(struct file *file, struct vm_area_struct *vma);
+const char *zuf_op_name(enum e_zufs_operation op);
+
+int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
+			enum e_mount_operation operation,
+			struct zufs_ioc_mount *zim);
+
+int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo);
+static inline
+int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
+		  struct page **pages, uint nump)
+{
+	struct zuf_dispatch_op zdo;
+
+	zuf_dispatch_init(&zdo, hdr, pages, nump);
+	return __zufc_dispatch(zri, &zdo);
+}
 
 /* zuf-root.c */
 int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index 51924b6bd2a5..2cdb0806687b 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -34,6 +34,11 @@
 	} while (0)
 #define zuf_info(s, args ...)          pr_info("~info~ " s, ## args)
 
+#define zuf_err_dispatch(sb, s, args ...) \
+	do { if (zuf_fst(sb)->zri->state != ZUF_ROOT_SERVER_FAILED) \
+		pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+
 #define zuf_chan_debug(c, s, args...)	pr_debug(c " [%s:%d] " s, __func__, \
 							__LINE__, ## args)
 
diff --git a/fs/zuf/relay.h b/fs/zuf/relay.h
new file mode 100644
index 000000000000..4cf642e177cd
--- /dev/null
+++ b/fs/zuf/relay.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Relay scheduler-object Header file.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#ifndef __RELAY_H__
+#define __RELAY_H__
+
+/* ~~~~ Relay ~~~~ */
+struct relay {
+	wait_queue_head_t fss_wq;
+	bool fss_wakeup;
+	bool fss_waiting;
+
+	wait_queue_head_t app_wq;
+	bool app_wakeup;
+	bool app_waiting;
+
+	cpumask_t cpus_allowed;
+};
+
+static inline void relay_init(struct relay *relay)
+{
+	init_waitqueue_head(&relay->fss_wq);
+	init_waitqueue_head(&relay->app_wq);
+}
+
+static inline bool relay_is_app_waiting(struct relay *relay)
+{
+	return relay->app_waiting;
+}
+
+static inline void relay_app_wakeup(struct relay *relay)
+{
+	relay->app_waiting = false;
+
+	relay->app_wakeup = true;
+	wake_up(&relay->app_wq);
+}
+
+static inline int __relay_fss_wait(struct relay *relay, bool keep_locked)
+{
+	relay->fss_waiting = !keep_locked;
+	relay->fss_wakeup = false;
+	return  wait_event_interruptible(relay->fss_wq, relay->fss_wakeup);
+}
+
+static inline int relay_fss_wait(struct relay *relay)
+{
+	return __relay_fss_wait(relay, false);
+}
+
+static inline bool relay_is_fss_waiting_grab(struct relay *relay)
+{
+	if (relay->fss_waiting) {
+		relay->fss_waiting = false;
+		return true;
+	}
+	return false;
+}
+
+static inline void relay_fss_wakeup(struct relay *relay)
+{
+	relay->fss_wakeup = true;
+	wake_up(&relay->fss_wq);
+}
+
+static inline int relay_fss_wakeup_app_wait(struct relay *relay)
+{
+	relay->app_waiting = true;
+
+	relay_fss_wakeup(relay);
+
+	relay->app_wakeup = false;
+
+	return wait_event_interruptible(relay->app_wq, relay->app_wakeup);
+}
+
+static inline
+void relay_fss_wakeup_app_wait_spin(struct relay *relay, spinlock_t *spinlock)
+{
+	relay->app_waiting = true;
+
+	relay_fss_wakeup(relay);
+
+	relay->app_wakeup = false;
+	spin_unlock(spinlock);
+
+	wait_event(relay->app_wq, relay->app_wakeup);
+}
+
+static inline void relay_fss_wakeup_app_wait_cont(struct relay *relay)
+{
+	wait_event(relay->app_wq, relay->app_wakeup);
+}
+
+#endif /* ifndef __RELAY_H__ */
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index c9bb31f75bed..60f0d3ffe562 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -18,23 +18,884 @@
 #include <linux/delay.h>
 #include <linux/pfn_t.h>
 #include <linux/sched/signal.h>
+#include <linux/uaccess.h>
+#include <linux/kref.h>
 
 #include "zuf.h"
+#include "relay.h"
+
+enum { INITIAL_ZT_CHANNELS = 3 };
+
+struct zufc_thread {
+	struct zuf_special_file hdr;
+	struct relay relay;
+	struct vm_area_struct *vma;
+	int no;
+	int chan;
+
+	/* Kernel side allocated IOCTL buffer */
+	struct vm_area_struct *opt_buff_vma;
+	void *opt_buff;
+	ulong max_zt_command;
+
+	/* Next operation*/
+	struct zuf_dispatch_op *zdo;
+};
+
+struct zuf_threads_pool {
+	struct __mount_thread_info {
+		struct zuf_special_file zsf;
+		spinlock_t lock;
+		struct relay relay;
+		struct zufs_ioc_mount *zim;
+	} mount;
+
+	uint _max_zts;
+	uint _max_channels;
+	/* one per-cpu array of ZTs per channel */
+	struct zufc_thread *_all_zt[ZUFS_MAX_ZT_CHANNELS];
+};
+
+/* ~~~~ some helpers ~~~~ */
+const char *zuf_op_name(enum e_zufs_operation op)
+{
+#define CASE_ENUM_NAME(e) case e: return #e
+	switch  (op) {
+		CASE_ENUM_NAME(ZUFS_OP_NULL);
+		CASE_ENUM_NAME(ZUFS_OP_BREAK);
+	case ZUFS_OP_MAX_OPT:
+	default:
+		return "UNKNOWN";
+	}
+}
+
+static inline ulong _zt_pr_no(struct zufc_thread *zt)
+{
+	/* In hex: channel in the first two nibbles, cpu from the 3rd and on */
+	return ((ulong)zt->no << 8) | zt->chan;
+}
+
+static struct zufc_thread *_zt_from_cpu(struct zuf_root_info *zri,
+					int cpu, uint chan)
+{
+	return per_cpu_ptr(zri->_ztp->_all_zt[chan], cpu);
+}
+
+static struct zufc_thread *_zt_from_f_private(struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	WARN_ON(zsf->type != zlfs_e_zt);
+	return container_of(zsf, struct zufc_thread, hdr);
+}
+
+/* ~~~~ init/ fini ~~~~ */
+static int _alloc_zts_channel(struct zuf_root_info *zri, int channel)
+{
+	zri->_ztp->_all_zt[channel] = alloc_percpu_gfp(struct zufc_thread,
+						       GFP_KERNEL | __GFP_ZERO);
+	if (unlikely(!zri->_ztp->_all_zt[channel])) {
+		zuf_err("!!! alloc_percpu channel=%d failed\n", channel);
+		return -ENOMEM;
+	}
+	return 0;
+}
 
 int zufc_zts_init(struct zuf_root_info *zri)
 {
+	int c;
+
+	zri->_ztp = kcalloc(1, sizeof(struct zuf_threads_pool), GFP_KERNEL);
+	if (unlikely(!zri->_ztp))
+		return -ENOMEM;
+
+	spin_lock_init(&zri->_ztp->mount.lock);
+	relay_init(&zri->_ztp->mount.relay);
+
+	zri->_ztp->_max_zts = num_possible_cpus();
+	zri->_ztp->_max_channels = INITIAL_ZT_CHANNELS;
+
+	for (c = 0; c < INITIAL_ZT_CHANNELS; ++c) {
+		int err = _alloc_zts_channel(zri, c);
+
+		if (unlikely(err))
+			return err;
+	}
+
 	return 0;
 }
 
 void zufc_zts_fini(struct zuf_root_info *zri)
 {
+	int c;
+
+	/* Always safe/must call zufc_zts_fini */
+	if (!zri->_ztp)
+		return;
+
+	for (c = 0; c < zri->_ztp->_max_channels; ++c) {
+		if (zri->_ztp->_all_zt[c])
+			free_percpu(zri->_ztp->_all_zt[c]);
+	}
+	kfree(zri->_ztp);
+	zri->_ztp = NULL;
+}
+
+/* ~~~~ mounting ~~~~*/
+int __zufc_dispatch_mount(struct zuf_root_info *zri,
+			  enum e_mount_operation operation,
+			  struct zufs_ioc_mount *zim)
+{
+	struct __mount_thread_info *zmt = &zri->_ztp->mount;
+
+	zim->hdr.operation = operation;
+	for (;;) {
+		bool fss_waiting;
+
+		spin_lock(&zmt->lock);
+
+		if (unlikely(!zmt->zsf.file)) {
+			spin_unlock(&zmt->lock);
+			zuf_err("Server not up\n");
+			zim->hdr.err = -EIO;
+			return zim->hdr.err;
+		}
+
+		fss_waiting = relay_is_fss_waiting_grab(&zmt->relay);
+		if (fss_waiting)
+			break;
+		/* In case of the break above, spin_unlock is done inside
+		 * relay_fss_wakeup_app_wait_spin
+		 */
+
+		spin_unlock(&zmt->lock);
+
+		/* It is OK to wait if user storms mounts */
+		zuf_dbg_verbose("waiting\n");
+		msleep(100);
+	}
+
+	zmt->zim = zim;
+	relay_fss_wakeup_app_wait_spin(&zmt->relay, &zmt->lock);
+
+	if (zim->hdr.err > 0) {
+		zuf_err("[%s] Bad Server RC not negative => %d\n",
+			zuf_op_name(zim->hdr.operation), zim->hdr.err);
+		zim->hdr.err = -EBADRQC;
+	}
+	return zim->hdr.err;
+}
+
+int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
+			enum e_mount_operation operation,
+			struct zufs_ioc_mount *zim)
+{
+	zim->hdr.out_len = sizeof(*zim);
+	zim->hdr.in_len = sizeof(*zim);
+	if (operation == ZUFS_M_MOUNT || operation == ZUFS_M_REMOUNT)
+		zim->hdr.in_len += zim->zmi.po.mount_options_len;
+	zim->zmi.zus_zfi = zus_zfi;
+	zim->zmi.num_cpu = zri->_ztp->_max_zts;
+	zim->zmi.num_channels = zri->_ztp->_max_channels;
+
+	return __zufc_dispatch_mount(zri, operation, zim);
+}
+
+static int _zu_mount(struct file *file, void *parg)
+{
+	struct super_block *sb = file->f_inode->i_sb;
+	struct zuf_root_info *zri = ZRI(sb);
+	struct __mount_thread_info *zmt = &zri->_ztp->mount;
+	bool waiting_for_reply;
+	struct zufs_ioc_mount *zim;
+	ulong cp_ret;
+	int err;
+
+	spin_lock(&zmt->lock);
+
+	if (unlikely(!file->private_data)) {
+		/* First time register this file as the mount-thread owner */
+		zmt->zsf.type = zlfs_e_mout_thread;
+		zmt->zsf.file = file;
+		file->private_data = &zmt->zsf;
+		zri->state = ZUF_ROOT_MOUNT_READY;
+	} else if (unlikely(file->private_data != zmt)) {
+		spin_unlock(&zmt->lock);
+		zuf_err("Say what?? %p != %p\n",
+			file->private_data, zmt);
+		return -EIO;
+	}
+
+	zim = zmt->zim;
+	zmt->zim = NULL;
+	waiting_for_reply = zim && relay_is_app_waiting(&zmt->relay);
+
+	spin_unlock(&zmt->lock);
+
+	if (waiting_for_reply) {
+		cp_ret = copy_from_user(zim, parg, zim->hdr.out_len);
+		if (unlikely(cp_ret)) {
+			zuf_err("copy_from_user => %ld\n", cp_ret);
+			zim->hdr.err = -EFAULT;
+		}
+
+		relay_app_wakeup(&zmt->relay);
+	}
+
+	/* This gets to sleep until a mount comes */
+	err = relay_fss_wait(&zmt->relay);
+	if (unlikely(err || !zmt->zim)) {
+		struct zufs_ioc_hdr *hdr = parg;
+
+		/* Released by _zu_break, an interrupt (signal) or a crash */
+		zuf_dbg_zus("_zu_break? %p => %d\n", zmt->zim, err);
+		put_user(ZUFS_OP_BREAK, &hdr->operation);
+		put_user(EIO, &hdr->err);
+		return err;
+	}
+
+	zim = zmt->zim;
+	cp_ret = copy_to_user(parg, zim, zim->hdr.in_len);
+	if (unlikely(cp_ret)) {
+		err = -EFAULT;
+		zuf_err("copy_to_user =>%ld\n", cp_ret);
+	}
+	return err;
+}
+
+static void zufc_mounter_release(struct file *file)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct __mount_thread_info *zmt = &zri->_ztp->mount;
+
+	zuf_dbg_zus("closed fu=%d au=%d fw=%d aw=%d\n",
+		  zmt->relay.fss_wakeup, zmt->relay.app_wakeup,
+		  zmt->relay.fss_waiting, zmt->relay.app_waiting);
+
+	spin_lock(&zmt->lock);
+	zmt->zsf.file = NULL;
+	if (relay_is_app_waiting(&zmt->relay)) {
+		zri->state = ZUF_ROOT_SERVER_FAILED;
+		zuf_err("server emergency exit while IO\n");
+		if (zmt->zim)
+			zmt->zim->hdr.err = -EIO;
+		spin_unlock(&zmt->lock);
+
+		relay_app_wakeup(&zmt->relay);
+		msleep(1000); /* crap */
+	} else {
+		if (zmt->zim)
+			zmt->zim->hdr.err = 0;
+		spin_unlock(&zmt->lock);
+	}
+}
+
+/* ~~~~ ZU_IOC_NUMA_MAP ~~~~ */
+static int _zu_numa_map(struct file *file, void *parg)
+{
+	struct zufs_ioc_numa_map *numa_map;
+	int n_nodes = num_possible_nodes();
+	uint *nodes_cpu_count;
+	uint max_cpu_per_node = 0;
+	uint alloc_size;
+	int cpu, i, err;
+
+	alloc_size = sizeof(*numa_map) +
+			(n_nodes * sizeof(numa_map->cpu_set_per_node[0]));
+
+	if ((n_nodes > 255) || (alloc_size > PAGE_SIZE)) {
+		zuf_warn("!!!unexpected big machine with %d nodes alloc_size=0x%x\n",
+			  n_nodes, alloc_size);
+		return -ENOTSUPP;
+	}
+
+	nodes_cpu_count = kcalloc(n_nodes, sizeof(uint), GFP_KERNEL);
+	if (unlikely(!nodes_cpu_count))
+		return -ENOMEM;
+
+	numa_map = kzalloc(alloc_size, GFP_KERNEL);
+	if (unlikely(!numa_map)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	numa_map->possible_nodes	= num_possible_nodes();
+	numa_map->possible_cpus		= num_possible_cpus();
+
+	numa_map->online_nodes		= num_online_nodes();
+	numa_map->online_cpus		= num_online_cpus();
+
+	for_each_online_cpu(cpu)
+		set_bit(cpu, numa_map->cpu_set_per_node[cpu_to_node(cpu)].bits);
+
+	for_each_cpu(cpu, cpu_online_mask) {
+		uint ctn  = cpu_to_node(cpu);
+		uint ncc = ++nodes_cpu_count[ctn];
+
+		max_cpu_per_node = max(max_cpu_per_node, ncc);
+	}
+
+	for (i = 1; i < n_nodes; ++i) {
+		if (nodes_cpu_count[i] &&
+		    (nodes_cpu_count[i] != nodes_cpu_count[0])) {
+			zuf_info("@[%d]=%d Unbalanced CPU sockets @[0]=%d\n",
+				  i, nodes_cpu_count[i], nodes_cpu_count[0]);
+			numa_map->nodes_not_symmetrical = true;
+			break;
+		}
+	}
+
+	numa_map->max_cpu_per_node = max_cpu_per_node;
+
+	zuf_dbg_verbose(
+		"possible_nodes=%d possible_cpus=%d online_nodes=%d online_cpus=%d\n",
+		numa_map->possible_nodes, numa_map->possible_cpus,
+		numa_map->online_nodes, numa_map->online_cpus);
+
+	err = copy_to_user(parg, numa_map, alloc_size) ? -EFAULT : 0;
+	kfree(numa_map);
+out:
+	kfree(nodes_cpu_count);
+	return err;
+}
+
+static void _prep_header_size_op(struct zufs_ioc_hdr *hdr,
+				 enum e_zufs_operation op, int err)
+{
+	memset(hdr, 0, sizeof(*hdr));
+	hdr->operation = op;
+	hdr->in_len = sizeof(*hdr);
+	hdr->err = err;
+}
+
+/* ~~~~~ ZT thread operations ~~~~~ */
+
+static int _zu_init(struct file *file, void *parg)
+{
+	struct zufc_thread *zt;
+	int cpu = smp_processor_id();
+	struct zufs_ioc_init zi_init;
+	int err;
+
+	err = copy_from_user(&zi_init, parg, sizeof(zi_init)) ? -EFAULT : 0;
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return err;
+	}
+	if (unlikely(zi_init.channel_no >= ZUFS_MAX_ZT_CHANNELS)) {
+		zuf_err("[%d] channel_no=%d\n", cpu, zi_init.channel_no);
+		return -EINVAL;
+	}
+
+	zuf_dbg_zus("[%d] channel=%d\n", cpu, zi_init.channel_no);
+
+	zt = _zt_from_cpu(ZRI(file->f_inode->i_sb), cpu, zi_init.channel_no);
+	if (unlikely(!zt)) {
+		zi_init.hdr.err = -ERANGE;
+		zuf_err("_zt_from_cpu(%d, %d) => %d\n",
+			cpu, zi_init.channel_no, zi_init.hdr.err);
+		goto out;
+	}
+
+	if (unlikely(zt->hdr.file)) {
+		zi_init.hdr.err = -EINVAL;
+		zuf_err("[%d] !!! thread already set\n", cpu);
+		goto out;
+	}
+
+	relay_init(&zt->relay);
+	zt->hdr.type = zlfs_e_zt;
+	zt->hdr.file = file;
+	zt->no = cpu;
+	zt->chan = zi_init.channel_no;
+
+	zt->max_zt_command = zi_init.max_command;
+	zt->opt_buff = vmalloc(zi_init.max_command);
+	if (unlikely(!zt->opt_buff)) {
+		zi_init.hdr.err = -ENOMEM;
+		goto out;
+	}
+
+	file->private_data = &zt->hdr;
+out:
+	err = copy_to_user(parg, &zi_init, sizeof(zi_init)) ? -EFAULT : 0;
+	if (err)
+		zuf_err("=>%d\n", err);
+	return err;
+}
+
+/* Caller checks that file->private_data != NULL */
+static void zufc_zt_release(struct file *file)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zufc_thread *zt = _zt_from_f_private(file);
+
+	if (unlikely(zt->hdr.file != file))
+		zuf_err("What happened zt->file(%p) != file(%p)\n",
+			zt->hdr.file, file);
+
+	zuf_dbg_zus("[%d] closed fu=%d au=%d fw=%d aw=%d\n",
+		  zt->no, zt->relay.fss_wakeup, zt->relay.app_wakeup,
+		  zt->relay.fss_waiting, zt->relay.app_waiting);
+
+	if (relay_is_app_waiting(&zt->relay)) {
+		zuf_err("server emergency exit while IO\n");
+
+		/* NOTE: Do not call _unmap_pages the vma is gone */
+		zt->hdr.file = NULL;
+
+		zri->state = ZUF_ROOT_SERVER_FAILED;
+
+		relay_app_wakeup(&zt->relay);
+		msleep(1000); /* crap */
+	}
+
+	vfree(zt->opt_buff);
+	memset(zt, 0, sizeof(*zt));
+}
+
+static int _map_pages(struct zufc_thread *zt, struct page **pages, uint nump,
+		      bool map_readonly)
+{
+	int p, err;
+
+	if (!(zt->vma && pages && nump))
+		return 0;
+
+	for (p = 0; p < nump; ++p) {
+		ulong zt_addr = zt->vma->vm_start + p * PAGE_SIZE;
+		ulong pfn = page_to_pfn(pages[p]);
+		pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+		vm_fault_t flt;
+
+		if (map_readonly)
+			flt = vmf_insert_mixed(zt->vma, zt_addr, pfnt);
+		else
+			flt = vmf_insert_mixed_mkwrite(zt->vma, zt_addr, pfnt);
+		err = zuf_flt_to_err(flt);
+		if (unlikely(err)) {
+			zuf_err("zuf: vmf_insert_mixed => %d p=0x%x start=0x%lx\n",
+				 err, p, zt->vma->vm_start);
+			return err;
+		}
+	}
+	return 0;
+}
+
+static void _unmap_pages(struct zufc_thread *zt, struct page **pages, uint nump)
+{
+	if (!(zt->vma && zt->zdo && pages && nump))
+		return;
+
+	zt->zdo->pages = NULL;
+	zt->zdo->nump = 0;
+
+	zap_vma_ptes(zt->vma, zt->vma->vm_start, nump * PAGE_SIZE);
+}
+
+static int _copy_outputs(struct zufc_thread *zt, void *arg)
+{
+	struct zufs_ioc_hdr *hdr = zt->zdo->hdr;
+	struct zufs_ioc_hdr *user_hdr = zt->opt_buff;
+
+	if (zt->opt_buff_vma->vm_start != (ulong)arg) {
+		zuf_err("malicious Server\n");
+		return -EINVAL;
+	}
+
+	/* Update on the user out_len and return-code */
+	hdr->err = user_hdr->err;
+	hdr->out_len = user_hdr->out_len;
+
+	if (!hdr->out_len)
+		return 0;
+
+	if ((hdr->err == -EZUFS_RETRY && zt->zdo->oh) ||
+	    (hdr->out_max < hdr->out_len)) {
+		if (WARN_ON(!zt->zdo->oh)) {
+			zuf_err("Trouble op(%s) out_max=%d out_len=%d\n",
+				zuf_op_name(hdr->operation),
+				hdr->out_max, hdr->out_len);
+			return -EFAULT;
+		}
+		zuf_dbg_zus("[%s] %d %d => %d\n",
+			    zuf_op_name(hdr->operation),
+			    hdr->out_max, hdr->out_len, hdr->err);
+		return zt->zdo->oh(zt->zdo, zt->opt_buff, zt->max_zt_command);
+	} else {
+		void *rply = (void *)hdr + hdr->out_start;
+		void *from = zt->opt_buff + hdr->out_start;
+
+		memcpy(rply, from, hdr->out_len);
+		return 0;
+	}
+}
+
+static int _zu_wait(struct file *file, void *parg)
+{
+	struct zufc_thread *zt;
+	bool __chan_is_locked = false;
+	int err;
+
+	zt = _zt_from_f_private(file);
+	if (unlikely(!zt)) {
+		zuf_err("Unexpected ZT state\n");
+		err = -ERANGE;
+		goto err;
+	}
+
+	if (!zt->hdr.file || file != zt->hdr.file) {
+		zuf_err("fatal\n");
+		err = -E2BIG;
+		goto err;
+	}
+	if (unlikely((ulong)parg != zt->opt_buff_vma->vm_start)) {
+		zuf_err("fatal 2\n");
+		err = -EINVAL;
+		goto err;
+	}
+
+	if (relay_is_app_waiting(&zt->relay)) {
+		if (unlikely(!zt->zdo)) {
+			zuf_err("User has gone...\n");
+			err = -E2BIG;
+			goto err;
+		}
+
+		/* overflow_handler might decide to execute the parg here at
+		 * zus context and return to server.
+		 * If it also has an error to report to zus it will set
+		 * zdo->hdr->err. EZUF_RETRY_DONE is when that happens.
+		 * In this case pages stay mapped in zt->vma.
+		 */
+		err = _copy_outputs(zt, parg);
+		if (err == EZUF_RETRY_DONE) {
+			put_user(zt->zdo->hdr->err, (int *)parg);
+			return 0;
+		}
+
+		_unmap_pages(zt, zt->zdo->pages, zt->zdo->nump);
+
+		zt->zdo = NULL;
+		if (unlikely(err)) /* _copy_outputs returned an err */
+			goto err;
+
+		relay_app_wakeup(&zt->relay);
+	}
+
+	err = __relay_fss_wait(&zt->relay, __chan_is_locked);
+	if (err)
+		zuf_dbg_err("[%d] relay error: %d\n", zt->no, err);
+
+	if (zt->zdo &&  zt->zdo->hdr &&
+	    zt->zdo->hdr->operation != ZUFS_OP_BREAK &&
+	    zt->zdo->hdr->operation < ZUFS_OP_MAX_OPT) {
+		/* call map here at the zuf thread so we need no locks
+		 * TODO: Currently only ZUFS_OP_WRITE protects user-buffers
+		 * we should have a bit set in zt->zdo->hdr set per operation.
+		 * TODO: Why this does not work?
+		 */
+		_map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0);
+	} else {
+		/* This Means we were released by _zu_break */
+		zuf_dbg_zus("_zu_break? => %d\n", err);
+		_prep_header_size_op(zt->opt_buff, ZUFS_OP_BREAK, err);
+	}
+
+	return err;
+
+err:
+	put_user(err, (int *)parg);
+	return err;
+}
+
+static bool _try_grab_zt_channel(struct zuf_root_info *zri, int cpu,
+				 struct zufc_thread **ztp)
+{
+	struct zufc_thread *zt;
+	int c;
+
+	for (c = 0; ; ++c) {
+		zt = _zt_from_cpu(zri, cpu, c);
+		if (unlikely(!zt || !zt->hdr.file))
+			break;
+
+		if (relay_is_fss_waiting_grab(&zt->relay)) {
+			*ztp = zt;
+			return true;
+		}
+	}
+
+	*ztp = _zt_from_cpu(zri, cpu, 0);
+	return false;
+}
+
+#ifdef CONFIG_ZUF_DEBUG
+#define DEBUG_CPU_SWITCH(cpu)		\
+	do {					\
+		int cpu2 = smp_processor_id();	\
+		if (cpu2 != cpu)		\
+			zuf_warn("App switched cpu1=%u cpu2=%u\n", \
+				 cpu, cpu2);	\
+	} while (0)
+
+static
+int _r_zufs_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+
+#else /* !CONFIG_ZUF_DEBUG */
+#define DEBUG_CPU_SWITCH(cpu)
+
+int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+#endif /* CONFIG_ZUF_DEBUG */
+{
+	struct task_struct *app = get_current();
+	struct zufs_ioc_hdr *hdr = zdo->hdr;
+	int cpu;
+	struct zufc_thread *zt;
+
+	if (unlikely(zri->state == ZUF_ROOT_SERVER_FAILED))
+		return -EIO;
+
+	if (unlikely(hdr->out_len && !hdr->out_max)) {
+		/* TODO: Complain here and let caller code do this proper */
+		hdr->out_max = hdr->out_len;
+	}
+
+	if (unlikely(zdo->__locked_zt)) {
+		zt = zdo->__locked_zt;
+		zdo->__locked_zt = NULL;
+
+		cpu = get_cpu();
+		/* FIXME: Very pedantic, does it need to stay? */
+		if (unlikely((zt->zdo != zdo) || cpu != zt->no)) {
+			zuf_warn("[%ld] __locked_zt but zdo(%p != %p) || cpu(%d != %d)\n",
+				 _zt_pr_no(zt), zt->zdo, zdo, cpu, zt->no);
+			put_cpu();
+			goto channel_busy;
+		}
+		goto has_channel;
+	}
+channel_busy:
+	cpu = get_cpu();
+
+	if (!_try_grab_zt_channel(zri, cpu, &zt)) {
+		put_cpu();
+
+		/* If channel was grabbed then maybe a break_all is in progress
+		 * on a different CPU. Make sure zt->file on this core is
+		 * updated.
+		 */
+		mb();
+		if (unlikely(!zt->hdr.file)) {
+			zuf_err("[%d] !zt->file\n", cpu);
+			return -EIO;
+		}
+		zuf_dbg_err("[%d] can this be\n", cpu);
+		/* FIXME: Do something much smarter */
+		msleep(10);
+		if (signal_pending(get_current())) {
+			zuf_dbg_err("[%d] => EINTR\n", cpu);
+			return -EINTR;
+		}
+		goto channel_busy;
+	}
+
+	/* lock app to this cpu while waiting */
+	cpumask_copy(&zt->relay.cpus_allowed, &app->cpus_mask);
+	cpumask_copy(&app->cpus_mask,  cpumask_of(smp_processor_id()));
+
+	zt->zdo = zdo;
+
+has_channel:
+	if (zdo->dh)
+		zdo->dh(zdo, zt, zt->opt_buff);
+	else
+		memcpy(zt->opt_buff, zt->zdo->hdr, zt->zdo->hdr->in_len);
+
+	put_cpu();
+
+	if (relay_fss_wakeup_app_wait(&zt->relay) == -ERESTARTSYS) {
+		struct zufs_ioc_hdr *opt_hdr = zt->opt_buff;
+
+		opt_hdr->flags |= ZUFS_H_INTR;
+
+		relay_fss_wakeup_app_wait_cont(&zt->relay);
+	}
+
+	/* __locked_zt must be kept on same cpu */
+	if (!zdo->__locked_zt)
+		/* restore cpu affinity after wakeup */
+		cpumask_copy(&app->cpus_mask, &zt->relay.cpus_allowed);
+
+	DEBUG_CPU_SWITCH(cpu);
+
+	return zt->hdr.file ? hdr->err : -EIO;
+}
+
+#ifdef CONFIG_ZUF_DEBUG
+#define MAX_ZT_SEC 7
+int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+{
+	u64 t1, t2;
+	int err;
+
+	t1 = ktime_get_ns();
+	err = _r_zufs_dispatch(zri, zdo);
+	t2 = ktime_get_ns();
+
+	if ((t2 - t1) > MAX_ZT_SEC * NSEC_PER_SEC)
+		zuf_err("zufc_dispatch(%s, [0x%x-0x%x]) took %lld sec\n",
+			zuf_op_name(zdo->hdr->operation), zdo->hdr->offset,
+			zdo->hdr->len,
+			(t2 - t1) / NSEC_PER_SEC);
+
+	return err;
+}
+#endif /* def CONFIG_ZUF_DEBUG */
+
+/* ~~~ iomap_exec && exec_buffer allocation ~~~ */
+
+struct zu_exec_buff {
+	struct zuf_special_file hdr;
+	struct vm_area_struct *vma;
+	void *opt_buff;
+	ulong alloc_size;
+};
+
+/* Do some common checks and conversions */
+static inline struct zu_exec_buff *_ebuff_from_file(struct file *file)
+{
+	struct zu_exec_buff *ebuff = file->private_data;
+
+	if (WARN_ON_ONCE(ebuff->hdr.type != zlfs_e_dpp_buff)) {
+		zuf_err("Must call ZU_IOC_ALLOC_BUFFER first\n");
+		return NULL;
+	}
+
+	if (WARN_ON_ONCE(ebuff->hdr.file != file))
+		return NULL;
+
+	return ebuff;
+}
+
+static int _zu_ebuff_alloc(struct file *file, void *arg)
+{
+	struct zufs_ioc_alloc_buffer ioc_alloc;
+	struct zu_exec_buff *ebuff;
+	int err;
+
+	err = copy_from_user(&ioc_alloc, arg, sizeof(ioc_alloc));
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return -EFAULT;
+	}
+
+	if (ioc_alloc.init_size > ioc_alloc.max_size)
+		return -EINVAL;
+
+	/* TODO: Easily Support growing */
+	/* TODO: Support global pools, also easy */
+	if (ioc_alloc.pool_no || ioc_alloc.init_size != ioc_alloc.max_size)
+		return -ENOTSUPP;
+
+	ebuff = kzalloc(sizeof(*ebuff), GFP_KERNEL);
+	if (unlikely(!ebuff))
+		return -ENOMEM;
+
+	ebuff->hdr.type = zlfs_e_dpp_buff;
+	ebuff->hdr.file = file;
+	i_size_write(file->f_inode, ioc_alloc.max_size);
+	ebuff->alloc_size =  ioc_alloc.init_size;
+	ebuff->opt_buff = vmalloc(ioc_alloc.init_size);
+	if (unlikely(!ebuff->opt_buff)) {
+		kfree(ebuff);
+		return -ENOMEM;
+	}
+
+	file->private_data = &ebuff->hdr;
+	return 0;
+}
+
+static void zufc_ebuff_release(struct file *file)
+{
+	struct zu_exec_buff *ebuff = _ebuff_from_file(file);
+
+	if (unlikely(!ebuff))
+		return;
+
+	vfree(ebuff->opt_buff);
+	ebuff->hdr.type = 0;
+	ebuff->hdr.file = NULL; /* for non-dbg Kernels && use-after-free */
+	kfree(ebuff);
+}
+
+/* ~~~~ ioctl & release handlers ~~~~ */
+static int _zu_register_fs(struct file *file, void *parg)
+{
+	struct zufs_ioc_register_fs rfs;
+	int err;
+
+	err = copy_from_user(&rfs, parg, sizeof(rfs));
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return -EFAULT;
+	}
+
+	err = zufr_register_fs(file->f_inode->i_sb, &rfs);
+	if (err)
+		zuf_err("=>%d\n", err);
+	err = put_user(err, (int *)parg);
+	return err;
+}
+
+static int _zu_break(struct file *filp, void *parg)
+{
+	struct zuf_root_info *zri = ZRI(filp->f_inode->i_sb);
+	int i, c;
+
+	zuf_dbg_core("enter\n");
+	mb(); /* TODO how to schedule on all CPU's */
+
+	for (i = 0; i < zri->_ztp->_max_zts; ++i) {
+		if (unlikely(!cpu_active(i)))
+			continue;
+		for (c = 0; c < zri->_ztp->_max_channels; ++c) {
+			struct zufc_thread *zt = _zt_from_cpu(zri, i, c);
+
+			if (unlikely(!(zt && zt->hdr.file)))
+				continue;
+			relay_fss_wakeup(&zt->relay);
+		}
+	}
+
+	if (zri->_ztp->mount.zsf.file)
+		relay_fss_wakeup(&zri->_ztp->mount.relay);
+
+	zuf_dbg_core("exit\n");
+	return 0;
 }
 
 long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
 {
+	void __user *parg = (void __user *)arg;
+
 	switch (cmd) {
+	case ZU_IOC_REGISTER_FS:
+		return _zu_register_fs(file, parg);
+	case ZU_IOC_MOUNT:
+		return _zu_mount(file, parg);
+	case ZU_IOC_NUMA_MAP:
+		return _zu_numa_map(file, parg);
+	case ZU_IOC_INIT_THREAD:
+		return _zu_init(file, parg);
+	case ZU_IOC_WAIT_OPT:
+		return _zu_wait(file, parg);
+	case ZU_IOC_ALLOC_BUFFER:
+		return _zu_ebuff_alloc(file, parg);
+	case ZU_IOC_BREAK_ALL:
+		return _zu_break(file, parg);
 	default:
-		zuf_err("%d\n", cmd);
+		zuf_err("unknown cmd=0x%x\n", cmd);
 		return -ENOTTY;
 	}
 }
@@ -47,11 +908,221 @@ int zufc_release(struct inode *inode, struct file *file)
 		return 0;
 
 	switch (zsf->type) {
+	case zlfs_e_zt:
+		zufc_zt_release(file);
+		return 0;
+	case zlfs_e_mout_thread:
+		zufc_mounter_release(file);
+		return 0;
+	case zlfs_e_pmem:
+		/* NOTHING to clean for pmem file yet */
+		/* zuf_pmem_release(file);*/
+		return 0;
+	case zlfs_e_dpp_buff:
+		zufc_ebuff_release(file);
+		return 0;
 	default:
 		return 0;
 	}
 }
 
+/* ~~~~  mmap area of app buffers into server ~~~~ */
+
+static vm_fault_t zuf_zt_fault(struct vm_fault *vmf)
+{
+	zuf_err("should not fault pgoff=0x%lx\n", vmf->pgoff);
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct zuf_vm_ops = {
+	.fault		= zuf_zt_fault,
+};
+
+static int _zufc_zt_mmap(struct file *file, struct vm_area_struct *vma,
+			 struct zufc_thread *zt)
+{
+	/* VM_PFNMAP for zap_vma_ptes() Careful! */
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &zuf_vm_ops;
+
+	zt->vma = vma;
+
+	zuf_dbg_core(
+		"[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
+		_zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags,
+		vma->vm_pgoff);
+
+	return 0;
+}
+
+/* ~~~~  mmap the Kernel allocated IOCTL buffer per ZT ~~~~ */
+static int _opt_buff_mmap(struct vm_area_struct *vma, void *opt_buff,
+			  ulong opt_size)
+{
+	ulong offset;
+
+	if (!opt_buff)
+		return -ENOMEM;
+
+	for (offset = 0; offset < opt_size; offset += PAGE_SIZE) {
+		ulong addr = vma->vm_start + offset;
+		ulong pfn = vmalloc_to_pfn(opt_buff +  offset);
+		pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+		int err;
+
+		zuf_dbg_verbose("[0x%lx] pfn-0x%lx addr=0x%lx buff=0x%lx\n",
+				offset, pfn, addr, (ulong)opt_buff + offset);
+
+		err = zuf_flt_to_err(vmf_insert_mixed_mkwrite(vma, addr, pfnt));
+		if (unlikely(err)) {
+			zuf_err("zuf: vmf_insert_mixed_mkwrite => %d offset=0x%lx addr=0x%lx\n",
+				 err, offset, addr);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static vm_fault_t zuf_obuff_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct zufc_thread *zt = _zt_from_f_private(vma->vm_file);
+	long offset = (vmf->pgoff << PAGE_SHIFT) - ZUS_API_MAP_MAX_SIZE;
+	int err;
+
+	zuf_dbg_core(
+		"[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n",
+		_zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_pgoff,
+		offset);
+
+	/* if Server overruns its buffer crash it dead */
+	if (unlikely((offset < 0) || (zt->max_zt_command <= offset))) {
+		zuf_err("[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n",
+			_zt_pr_no(zt), vma->vm_start,
+			vma->vm_end, vma->vm_pgoff, offset);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* We never released a zus-core.c that does not fault the
+	 * first page first. I want to see if this happens
+	 */
+	if (unlikely(offset))
+		zuf_warn("Suspicious server activity\n");
+
+	/* This faults only once at very first access */
+	err = _opt_buff_mmap(vma, zt->opt_buff, zt->max_zt_command);
+	if (unlikely(err))
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct zuf_obuff_ops = {
+	.fault		= zuf_obuff_fault,
+};
+
+static int _zufc_obuff_mmap(struct file *file, struct vm_area_struct *vma,
+			    struct zufc_thread *zt)
+{
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &zuf_obuff_ops;
+
+	zt->opt_buff_vma = vma;
+
+	zuf_dbg_core(
+		"[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
+		_zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags,
+		vma->vm_pgoff);
+
+	return 0;
+}
+
+/* ~~~ */
+
+static int zufc_zt_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zufc_thread *zt = _zt_from_f_private(file);
+
+	/* We have two areas of mmap in this special file.
+	 * 0 to ZUS_API_MAP_MAX_SIZE:
+	 *	The first part where app pages are mapped
+	 *	into server per operation.
+	 * At ZUS_API_MAP_MAX_SIZE, of size zt->max_zt_command:
+	 *	Is where we map the per ZT ioctl-buffer, later passed
+	 *	to the ZU_IOC_WAIT_OPT IOCTL call
+	 */
+	if (vma->vm_pgoff == ZUS_API_MAP_MAX_SIZE / PAGE_SIZE)
+		return _zufc_obuff_mmap(file, vma, zt);
+
+	/* zuf ZT API is very particular about where in its
+	 * special file we communicate
+	 */
+	if (unlikely(vma->vm_pgoff))
+		return -EINVAL;
+
+	return _zufc_zt_mmap(file, vma, zt);
+}
+
+/* ~~~~ Implementation of the ZU_IOC_ALLOC_BUFFER mmap facility ~~~~ */
+
+static vm_fault_t zuf_ebuff_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file);
+	long offset = (vmf->pgoff << PAGE_SHIFT);
+	int err;
+
+	zuf_dbg_core("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n",
+		     vma->vm_start, vma->vm_end, vma->vm_pgoff, offset);
+
+	if (unlikely(!ebuff))
+		return VM_FAULT_SIGBUS;
+
+	/* if Server overruns its buffer crash it dead */
+	if (unlikely((offset < 0) || (ebuff->alloc_size <= offset))) {
+		zuf_err("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n",
+			vma->vm_start, vma->vm_end, vma->vm_pgoff,
+			offset);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* We never released a zus-core.c that does not fault the
+	 * first page first. I want to see if this happens
+	 */
+	if (unlikely(offset))
+		zuf_warn("Suspicious server activity\n");
+
+	/* This faults only once at very first access */
+	err = _opt_buff_mmap(vma, ebuff->opt_buff, ebuff->alloc_size);
+	if (unlikely(err))
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct zuf_ebuff_ops = {
+	.fault		= zuf_ebuff_fault,
+};
+
+static int zufc_ebuff_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file);
+
+	if (unlikely(!ebuff))
+		return -EINVAL;
+
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &zuf_ebuff_ops;
+
+	ebuff->vma = vma;
+
+	zuf_dbg_core("start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
+		      vma->vm_start, vma->vm_end, vma->vm_flags, vma->vm_pgoff);
+
+	return 0;
+}
+
 int zufc_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct zuf_special_file *zsf = file->private_data;
@@ -62,6 +1133,10 @@ int zufc_mmap(struct file *file, struct vm_area_struct *vma)
 	}
 
 	switch (zsf->type) {
+	case zlfs_e_zt:
+		return zufc_zt_mmap(file, vma);
+	case zlfs_e_dpp_buff:
+		return zufc_ebuff_mmap(file, vma);
 	default:
 		zuf_err("type=%d\n", zsf->type);
 		return -ENOTTY;
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 919b84f7478f..05ec08d17d69 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -110,6 +110,47 @@ static inline struct zuf_inode_info *ZUII(struct inode *inode)
 	return container_of(inode, struct zuf_inode_info, vfs_inode);
 }
 
+struct zuf_dispatch_op;
+typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg,
+				ulong zt_max_bytes);
+typedef void (*dispatch_handler)(struct zuf_dispatch_op *zdo, void *pzt,
+				void *parg);
+struct zuf_dispatch_op {
+	struct zufs_ioc_hdr *hdr;
+	union {
+		struct page **pages;
+		ulong *bns;
+	};
+	uint nump;
+	overflow_handler oh;
+	dispatch_handler dh;
+	struct super_block *sb;
+	struct inode *inode;
+
+	/* Don't touch zuf-core only!!! */
+	struct zufc_thread *__locked_zt;
+};
+
+static inline void
+zuf_dispatch_init(struct zuf_dispatch_op *zdo, struct zufs_ioc_hdr *hdr,
+		 struct page **pages, uint nump)
+{
+	memset(zdo, 0, sizeof(*zdo));
+	zdo->hdr = hdr;
+	zdo->pages = pages; zdo->nump = nump;
+}
+
+static inline int zuf_flt_to_err(vm_fault_t flt)
+{
+	if (likely(flt == VM_FAULT_NOPAGE))
+		return 0;
+
+	if (flt == VM_FAULT_OOM)
+		return -ENOMEM;
+
+	return -EACCES;
+}
+
 /* Keep this include last thing in file */
 #include "_extern.h"
 
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index f293e03460be..6b1fbaf24222 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -93,6 +93,123 @@
 
 #endif /*  ndef __KERNEL__ */
 
+/* first available error code after include/linux/errno.h */
+#define EZUFS_RETRY	531
+
+/* The below is private to zuf Kernel only. Is not exposed to VFS nor zus
+ * (defined here to allocate the constant)
+ */
+#define EZUF_RETRY_DONE 540
+
+/* TODO: Someone forgot i_flags & i_version for STATX_ attrs should send a patch
+ * to add them
+ */
+#define ZUFS_STATX_FLAGS	0x20000000U
+#define ZUFS_STATX_VERSION	0x40000000U
+
+/*
+ * Maximal count of links to a file
+ */
+#define ZUFS_LINK_MAX          32000
+#define ZUFS_MAX_SYMLINK	PAGE_SIZE
+#define ZUFS_NAME_LEN		255
+#define ZUFS_READAHEAD_PAGES	8
+
+/* All device sizes offsets must align on 2M */
+#define ZUFS_ALLOC_MASK		(1024 * 1024 * 2 - 1)
+
+/**
+ * zufs dual port memory
+ * This is a special type of offset to either memory or persistent-memory,
+ * that is designed to be used in the interface mechanism between userspace
+ * and kernel, and can be accessed by both.
+ * The 3 lowest bits denote a mem-pool:
+ * 0   - pmem pool
+ * 1-6 - established shared pool by a call to zufs_ioc_create_mempool (below)
+ * 7   - offset into app memory
+ */
+typedef __u64 __bitwise zu_dpp_t;
+
+static inline uint zu_dpp_t_pool(zu_dpp_t t)
+{
+	return t & 0x7;
+}
+
+static inline ulong zu_dpp_t_val(zu_dpp_t t)
+{
+	return t & ~0x7;
+}
+
+static inline zu_dpp_t zu_enc_dpp_t(ulong v, uint pool)
+{
+	return v | pool;
+}
+
+static inline ulong zu_dpp_t_bn(zu_dpp_t t)
+{
+	return t >> 3;
+}
+
+static inline zu_dpp_t zu_enc_dpp_t_bn(ulong v, uint pool)
+{
+	return zu_enc_dpp_t(v << 3, pool);
+}
+
+/*
+ * Structure of a ZUS inode.
+ * This is all the inode fields
+ */
+
+/* See VFS inode flags at fs.h. As ZUFS supports flags up to the 7th bit, we
+ * use higher bits for ZUFS specific flags
+ */
+#define ZUFS_S_IMMUTABLE 04000
+
+/* zus_inode size */
+#define ZUFS_INODE_SIZE 128    /* must be power of two */
+
+struct zus_inode {
+	__le16	i_flags;	/* Inode flags */
+	__le16	i_mode;		/* File mode */
+	__le32	i_nlink;	/* Links count */
+	__le64	i_size;		/* Size of data in bytes */
+/* 16*/	struct __zi_on_disk_desc {
+		__le64	a[2];
+	}	i_on_disk;	/* FS-specific on disc placement */
+/* 32*/	__le64	i_blocks;
+	__le64	i_mtime;	/* Inode/data Modification time */
+	__le64	i_ctime;	/* Inode/data Changed time */
+	__le64	i_atime;	/* Data Access time */
+/* 64 - cache-line boundary */
+	__le64	i_ino;		/* Inode number */
+	__le32	i_uid;		/* Owner Uid */
+	__le32	i_gid;		/* Group Id */
+	__le64	i_xattr;	/* FS-specific Extended attribute block */
+	__le64	i_generation;	/* File version (for NFS) */
+/* 96*/	union NAMELESS(_I_U) {
+		__le32	i_rdev;		/* special-inode major/minor etc ...*/
+		u8	i_symlink[32];	/* if i_size < sizeof(i_symlink) */
+		__le64	i_sym_dpp;	/* Link location if long symlink */
+		struct  _zu_dir {
+			__le64	dir_root;
+			__le64  parent;
+		}	i_dir;
+	};
+	/* Total ZUFS_INODE_SIZE bytes always */
+};
+
+/* ~~~~~ ZUFS API ioctl commands ~~~~~ */
+enum {
+	ZUS_API_MAP_MAX_PAGES	= 1024,
+	ZUS_API_MAP_MAX_SIZE	= ZUS_API_MAP_MAX_PAGES * PAGE_SIZE,
+};
+
+/* These go on zufs_ioc_hdr->flags */
+enum e_zufs_hdr_flags {
+	ZUFS_H_INTR		= (1 << 0),
+	ZUFS_H_HAS_PIGY_PUT	= (1 << 1),
+};
+
 struct zufs_ioc_hdr {
 	__s32 err;	/* IN/OUT must be first */
 	__u16 in_len;	/* How much to be copied *to* zus */
@@ -129,4 +246,178 @@ struct zufs_ioc_register_fs {
 };
 #define ZU_IOC_REGISTER_FS	_IOWR('Z', 10, struct zufs_ioc_register_fs)
 
+/* A cookie from user-mode returned by mount */
+struct zus_sb_info;
+
+/* zus cookie per inode */
+struct zus_inode_info;
+
+enum ZUFS_M_FLAGS {
+	ZUFS_M_PEDANTIC		= 0x00000001,
+	ZUFS_M_EPHEMERAL	= 0x00000002,
+	ZUFS_M_SILENT		= 0x00000004,
+	ZUFS_M_PRIVATE		= 0x00000008,
+};
+
+struct zufs_parse_options {
+	__u64 mount_flags;
+	__u32 pedantic;
+	__u32 mount_options_len;
+	char mount_options[0];
+};
+
+/* These go on  zufs_ioc_mount->hdr->operation */
+enum e_mount_operation {
+	ZUFS_M_MOUNT	= 1,
+	ZUFS_M_UMOUNT,
+	ZUFS_M_REMOUNT,
+	ZUFS_M_DDBG_RD,
+	ZUFS_M_DDBG_WR,
+};
+
+/* For zufs_mount_info->remount_flags */
+enum e_remount_flags {
+	ZUFS_REM_WAS_RO		= 0x00000001,
+	ZUFS_REM_WILL_RO	= 0x00000002,
+};
+
+/* FS specific capabilities @zufs_mount_info->fs_caps */
+enum {
+	ZUFS_FSC_ACL_ON		= 0x0001,
+	ZUFS_FSC_NIO_READS	= 0x0002,
+	ZUFS_FSC_NIO_WRITES	= 0x0004,
+};
+
+struct zufs_mount_info {
+	/* IN */
+	struct zus_fs_info *zus_zfi;
+	__u64	remount_flags;
+	__u64	sb_id;
+	__u16	num_cpu;
+	__u16	num_channels;
+	__u32	__pad;
+
+	/* OUT */
+	struct zus_sb_info *zus_sbi;
+	/* mount is also iget of root */
+	struct zus_inode_info *zus_ii;
+	zu_dpp_t _zi;
+
+	/* FS specific info */
+	__u32 fs_caps;
+	__u32 s_blocksize_bits;
+
+	/* IN - mount options, var len must be last */
+	struct zufs_parse_options po;
+};
+
+struct zufs_ddbg_info {
+	__u64 id; /* IN where to start from, OUT last ID */
+	/* IN size of buffer, OUT size of dynamic debug message */
+	__u64 len;
+	char msg[0];
+};
+
+/* mount / umount */
+struct  zufs_ioc_mount {
+	struct zufs_ioc_hdr hdr;
+	union {
+		struct zufs_mount_info zmi;
+		struct zufs_ddbg_info zdi;
+	};
+};
+#define ZU_IOC_MOUNT		_IOWR('Z', 11, struct zufs_ioc_mount)
+
+/* pmem  */
+struct zufs_cpu_set {
+	ulong bits[16];
+};
+
+struct zufs_ioc_numa_map {
+	/* Set by zus */
+	struct zufs_ioc_hdr hdr;
+
+	__u32	possible_nodes;
+	__u32	possible_cpus;
+	__u32	online_nodes;
+	__u32	online_cpus;
+
+	__u32	max_cpu_per_node;
+
+	/* This indicates that NOT all nodes have @max_cpu_per_node cpus */
+	bool	nodes_not_symmetrical;
+	__u8	__pad[19]; /* align cpu_set_per_node to next cache-line */
+
+	/* Variable size must keep last
+	 * size @possible_nodes
+	 */
+	struct zufs_cpu_set cpu_set_per_node[];
+};
+#define ZU_IOC_NUMA_MAP	_IOWR('Z', 13, struct zufs_ioc_numa_map)
+
+/* ZT init */
+enum { ZUFS_MAX_ZT_CHANNELS = 4 };
+
+struct zufs_ioc_init {
+	struct zufs_ioc_hdr hdr;
+	__u32 channel_no;
+	__u32 max_command;
+};
+#define ZU_IOC_INIT_THREAD	_IOWR('Z', 15, struct zufs_ioc_init)
+
+/* break_all (Server telling kernel to clean) */
+struct zufs_ioc_break_all {
+	struct zufs_ioc_hdr hdr;
+};
+#define ZU_IOC_BREAK_ALL	_IOWR('Z', 16, struct zufs_ioc_break_all)
+
+/* Allocate a special_file that will be a dual-port communication buffer with
+ * user mode.
+ * Server will access the buffer via the mmap of this file.
+ * Kernel will access the buffer via the vmalloc() pointer.
+ *
+ * Some IOCTLs below demand use of this kind of buffer for communication.
+ * TODO:
+ * pool_no is for associating this buffer with one of the 6 possible
+ * mem-pools per zuf_sbi, so that anywhere we have a zu_dpp_t it will mean
+ * access from this pool.
+ * If pool_no is zero then the buffer is private to this file. In this case
+ * sb_id && zus_sbi are ignored / not needed.
+ */
+struct zufs_ioc_alloc_buffer {
+	struct zufs_ioc_hdr hdr;
+	/* The ID of the super block received in mount */
+	__u64	sb_id;
+	/* We verify the sb_id validity against zus_sbi */
+	struct zus_sb_info *zus_sbi;
+	/* max size of buffer allowed (size of mmap) */
+	__u32 max_size;
+	/* allocate this much on initial call and set into vma */
+	__u32 init_size;
+
+	/* TODO: These below are now set to ZERO. Need implementation */
+	__u16 pool_no;
+	__u16 flags;
+	__u32 reserved;
+};
+#define ZU_IOC_ALLOC_BUFFER	_IOWR('Z', 17, struct zufs_ioc_alloc_buffer)
+
+/* ~~~  zufs_ioc_wait_operation ~~~ */
+struct zufs_ioc_wait_operation {
+	struct zufs_ioc_hdr hdr;
+	/* maximum size is governed by zufs_ioc_init->max_command */
+	char opt_buff[];
+};
+#define ZU_IOC_WAIT_OPT		_IOWR('Z', 18, struct zufs_ioc_wait_operation)
+
+/* These are the possible operations sent from Kernel to the Server in the
+ * return of the ZU_IOC_WAIT_OPT.
+ */
+enum e_zufs_operation {
+	ZUFS_OP_NULL		= 0,
+	ZUFS_OP_BREAK		= 1,	/* Kernel telling Server to exit */
+
+	ZUFS_OP_MAX_OPT,
+};
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.21.0
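A note on the zu_dpp_t helpers added to zus_api.h in this patch: the pool
number sits in the 3 lowest bits and the value in the remaining bits. The
following is only a userspace sketch of that arithmetic, with uint64_t
standing in for the kernel's __u64 __bitwise type. The assert() in
zu_enc_dpp_t is an assumption of this sketch and is not in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's __u64 __bitwise type. */
typedef uint64_t zu_dpp_t;

/* The pool id lives in the 3 lowest bits. */
static inline unsigned int zu_dpp_t_pool(zu_dpp_t t)
{
	return (unsigned int)(t & 0x7);
}

/* Everything above the pool bits is the value (an 8-byte aligned offset). */
static inline uint64_t zu_dpp_t_val(zu_dpp_t t)
{
	return t & ~(uint64_t)0x7;
}

static inline zu_dpp_t zu_enc_dpp_t(uint64_t v, unsigned int pool)
{
	/* Sketch-only check: v must be 8-byte aligned, pool must fit 3 bits */
	assert((v & 0x7) == 0 && pool <= 7);
	return v | pool;
}

/* "bn" form: the value shifted down past the pool bits. */
static inline uint64_t zu_dpp_t_bn(zu_dpp_t t)
{
	return t >> 3;
}

static inline zu_dpp_t zu_enc_dpp_t_bn(uint64_t bn, unsigned int pool)
{
	return zu_enc_dpp_t(bn << 3, pool);
}
```

For instance, bn 5 in pool 2 encodes as (5 << 3) | 2 = 42, and decoding
returns pool 2 and bn 5 again.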
* [PATCH 06/16] zuf: Multy Devices
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (4 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 05/16] zuf: zuf-core The ZTs Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 07/16] zuf: mounting Boaz Harrosh
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
ZUFS supports multiple block devices per super_block.
This patch adds the device-handling code. Its output is
a single multi_devices (md.h) object associated with the
mounting super_block.
There are three modes of operation:
* Mount without a device (mount -t FOO none /somepath)
* A single device - The FS stated register_fs_info->dt_offset==-1
  No checks are made by the Kernel; the single bdev is registered with
  the Kernel's mount_bdev. It is up to the zusFS to check validity.
* Multi devices - The FS stated register_fs_info->dt_offset==X
  This mode is the main subject of this patch.
  A single device is given on the mount command line. At
  register_fs_info->dt_offset of this device we look for a
  zufs_dev_table structure. After all the checks we read the
  device list there and open all devices. Any one of the devices may
  be given on the command line, but they will always be opened in
  DT (Device Table) order. The Device Table has the notion of two types
  of bdevs:
  T1 devices - are pmem devices capable of direct_access
  T2 devices - are non-direct_access devices
  All T1 devices are presented as one linear array, in DT order.
  In t1.c we mmap this space for the server to directly access
  pmem. (In the proper persistent way)
  [We do not support just any direct_access device; we only support
   pmem(s) where the whole device can be addressed by a single
   physical/virtual address. This is checked before mount]
  The T2 devices are also grabbed and owned by the super_block.
  A later API will enable the Server to write or transfer buffers
  from T1 to T2 in a very efficient manner. They are also presented
  as a single linear array in DT order.
  Both kinds of devices are NUMA aware and the NUMA info is presented
  to the zusFS for optimal allocation and access.
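To illustrate the "one linear array in DT order" idea, here is a minimal
standalone sketch, not the md.c implementation. The struct t1_dev and the
t1_lin_to_dev() helper are hypothetical names made up for this example.
They only show how a linear offset resolves to a device index plus an
offset inside that device when devices are concatenated in table order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one entry per device, stored in Device Table order. */
struct t1_dev {
	uint64_t size;	/* device size in bytes */
};

/*
 * Resolve a linear offset into (device index, offset inside device).
 * Returns the device index, or -1 when off is past the last device.
 */
static int t1_lin_to_dev(const struct t1_dev *devs, size_t ndevs,
			 uint64_t off, uint64_t *dev_off)
{
	size_t i;

	assert(devs && dev_off);

	for (i = 0; i < ndevs; ++i) {
		if (off < devs[i].size) {
			*dev_off = off;
			return (int)i;
		}
		/* skip this device and keep walking in DT order */
		off -= devs[i].size;
	}
	return -1;
}
```

With devices of 4096, 8192 and 4096 bytes, linear offset 4096 lands at
offset 0 of device 1, and offset 16384 is out of range.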
[v2]
  Newer gcc does not accept a case /* fall through */ comment that
  shares its line with other text. Split such comments into two lines
  to silence the compiler.
[v3]
  Do not use __packed on interface structures
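As a rough illustration of the [v3] note, an interface structure can keep
a fixed layout through explicit padding plus compile-time checks instead
of __packed. The struct demo_ioc below is hypothetical and not part of
this patch. It is a userspace sketch using C11 _Static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface structure, for illustration only. */
struct demo_ioc {
	uint64_t sb_id;
	uint16_t num_cpu;
	uint16_t num_channels;
	uint32_t pad;		/* explicit padding, no compiler-inserted hole */
	uint64_t cookie;
};

/* Fail the build if the layout ever drifts from what the other side expects. */
_Static_assert(offsetof(struct demo_ioc, cookie) == 16,
	       "cookie must start at byte 16");
_Static_assert(sizeof(struct demo_ioc) == 24,
	       "demo_ioc must be exactly 24 bytes");
```

Because every member is naturally aligned and the padding is spelled out,
both sides of the interface see the same layout without packed accesses.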
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   3 +
 fs/zuf/_extern.h  |   6 +
 fs/zuf/md.c       | 742 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/md.h       | 332 +++++++++++++++++++++
 fs/zuf/md_def.h   | 141 +++++++++
 fs/zuf/super.c    |   6 +
 fs/zuf/t1.c       | 136 +++++++++
 fs/zuf/t2.c       | 356 ++++++++++++++++++++++
 fs/zuf/t2.h       |  68 +++++
 fs/zuf/zuf-core.c |  76 +++++
 fs/zuf/zuf.h      |  54 ++++
 fs/zuf/zus_api.h  |  15 +
 12 files changed, 1935 insertions(+)
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index b08c08e73faa..a247bd85d9aa 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,6 +10,9 @@
 
 obj-$(CONFIG_ZUFS_FS) += zuf.o
 
+# Infrastructure
+zuf-y += md.o t1.o t2.o
+
 # ZUF core
 zuf-y += zuf-core.o zuf-root.o
 
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 1f786fc24b85..a5929d3d165c 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -54,4 +54,10 @@ void zuf_destroy_inodecache(void);
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data);
 
+struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
+				   struct zus_sb_info *zus_sbi);
+
+/* t1.c */
+int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/md.c b/fs/zuf/md.c
new file mode 100644
index 000000000000..c4778b4fdff8
--- /dev/null
+++ b/fs/zuf/md.c
@@ -0,0 +1,742 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/blkdev.h>
+#include <linux/pfn_t.h>
+#include <linux/crc16.h>
+#include <linux/uuid.h>
+
+#include <linux/gcd.h>
+
+#include "_pr.h"
+#include "md.h"
+#include "t2.h"
+
+const fmode_t _g_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+static int _bdev_get_by_path(const char *path, struct block_device **bdev,
+			     void *holder)
+{
+	*bdev = blkdev_get_by_path(path, _g_mode, holder);
+	if (IS_ERR(*bdev)) {
+		int err = PTR_ERR(*bdev);
+		*bdev = NULL;
+		return err;
+	}
+	return 0;
+}
+
+static void _bdev_put(struct block_device **bdev)
+{
+	if (*bdev) {
+		blkdev_put(*bdev, _g_mode);
+		*bdev = NULL;
+	}
+}
+
+/* convert uuid to a /dev/ path */
+static char *_uuid_path(uuid_le *uuid, char path[PATH_UUID])
+{
+	sprintf(path, "/dev/disk/by-uuid/%pUb", uuid);
+	return path;
+}
+
+static int _bdev_get_by_uuid(struct block_device **bdev, uuid_le *uuid,
+			       void *holder, bool silent)
+{
+	char path[PATH_UUID];
+	int err;
+
+	_uuid_path(uuid, path);
+	err = _bdev_get_by_path(path, bdev, holder);
+	if (unlikely(err))
+		md_err_cnd(silent, "failed to get device path=%s =>%d\n",
+			   path, err);
+
+	return err;
+}
+
+short md_calc_csum(struct md_dev_table *mdt)
+{
+	uint n = MDT_STATIC_SIZE(mdt) - sizeof(mdt->s_sum);
+
+	return crc16(~0, (__u8 *)&mdt->s_version, n);
+}
+
+/* ~~~~~~~ mdt related functions ~~~~~~~ */
+
+int md_t2_mdt_read(struct multi_devices *md, int index,
+		   struct md_dev_table *mdt)
+{
+	int err = t2_readpage(md, index, virt_to_page(mdt));
+
+	if (err)
+		md_dbg_verbose("!!! t2_readpage err=%d\n", err);
+
+	return err;
+}
+
+static int _t2_mdt_read(struct block_device *bdev, struct md_dev_table *mdt)
+{
+	int err;
+	/* t2 interface works for all block devices */
+	struct multi_devices *md;
+	struct md_dev_info *mdi;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (unlikely(!md))
+		return -ENOMEM;
+
+	md->t2_count = 1;
+	md->devs[0].bdev = bdev;
+	mdi = &md->devs[0];
+	md->t2a.map = &mdi;
+	md->t2a.bn_gcd = 1; /* Value does not matter, only must not be zero */
+
+	err = md_t2_mdt_read(md, 0, mdt);
+
+	kfree(md);
+	return err;
+}
+
+int md_t2_mdt_write(struct multi_devices *md, struct md_dev_table *mdt)
+{
+	int i, err = 0;
+
+	for (i = 0; i < md->t2_count; ++i) {
+		ulong bn = md_o2p(md_t2_dev(md, i)->offset);
+
+		mdt->s_dev_list.id_index = mdt->s_dev_list.t1_count + i;
+		mdt->s_sum = cpu_to_le16(md_calc_csum(mdt));
+
+		err = t2_writepage(md, bn, virt_to_page(mdt));
+		if (err)
+			md_dbg_verbose("!!! t2_writepage err=%d\n", err);
+	}
+
+	return err;
+}
+
+static bool _csum_mismatch(struct md_dev_table *mdt, int silent)
+{
+	ushort crc = md_calc_csum(mdt);
+
+	if (mdt->s_sum == cpu_to_le16(crc))
+		return false;
+
+	md_warn_cnd(silent, "expected(0x%x) != s_sum(0x%x)\n",
+		      cpu_to_le16(crc), mdt->s_sum);
+	return true;
+}
+
+static bool _uuid_le_equal(uuid_le *uuid1, uuid_le *uuid2)
+{
+	return (memcmp(uuid1, uuid2, sizeof(uuid_le)) == 0);
+}
+
+static bool _mdt_compare_uuids(struct md_dev_table *mdt,
+			       struct md_dev_table *main_mdt, int silent)
+{
+	int i, dev_count;
+
+	if (!_uuid_le_equal(&mdt->s_uuid, &main_mdt->s_uuid)) {
+		md_warn_cnd(silent, "mdt uuid (%pUb != %pUb) mismatch\n",
+			      &mdt->s_uuid, &main_mdt->s_uuid);
+		return false;
+	}
+
+	dev_count = mdt->s_dev_list.t1_count + mdt->s_dev_list.t2_count +
+		    mdt->s_dev_list.rmem_count;
+	for (i = 0; i < dev_count; ++i) {
+		struct md_dev_id *dev_id1 = &mdt->s_dev_list.dev_ids[i];
+		struct md_dev_id *dev_id2 = &main_mdt->s_dev_list.dev_ids[i];
+
+		if (!_uuid_le_equal(&dev_id1->uuid, &dev_id2->uuid)) {
+			md_warn_cnd(silent,
+				    "mdt dev %d uuid (%pUb != %pUb) mismatch\n",
+				    i, &dev_id1->uuid, &dev_id2->uuid);
+			return false;
+		}
+
+		if (dev_id1->blocks != dev_id2->blocks) {
+			md_warn_cnd(silent,
+				    "mdt dev %d blocks (0x%llx != 0x%llx) mismatch\n",
+				    i, le64_to_cpu(dev_id1->blocks),
+				    le64_to_cpu(dev_id2->blocks));
+			return false;
+		}
+	}
+
+	return true;
+}
+
+bool md_mdt_check(struct md_dev_table *mdt,
+		  struct md_dev_table *main_mdt, struct block_device *bdev,
+		  struct mdt_check *mc)
+{
+	struct md_dev_id *dev_id;
+	ulong bdev_size, super_size;
+
+	BUILD_BUG_ON(MDT_STATIC_SIZE(mdt) & (SMP_CACHE_BYTES - 1));
+
+	/* Do sanity checks on the superblock */
+	if (le32_to_cpu(mdt->s_magic) != mc->magic) {
+		md_warn_cnd(mc->silent,
+			     "Magic error in super block: please run fsck\n");
+		return false;
+	}
+
+	if ((mc->major_ver != mdt_major_version(mdt)) ||
+	    (mc->minor_ver < mdt_minor_version(mdt))) {
+		md_warn_cnd(mc->silent,
+			     "mkfs-mount versions mismatch! %d.%d != %d.%d\n",
+			     mdt_major_version(mdt), mdt_minor_version(mdt),
+			     mc->major_ver, mc->minor_ver);
+		return false;
+	}
+
+	if (_csum_mismatch(mdt, mc->silent)) {
+		md_warn_cnd(mc->silent,
+			    "crc16 error in super block: please run fsck\n");
+		return false;
+	}
+
+	if (main_mdt) {
+		if (mdt->s_dev_list.t1_count != main_mdt->s_dev_list.t1_count) {
+			md_warn_cnd(mc->silent, "mdt t1 count mismatch\n");
+			return false;
+		}
+
+		if (mdt->s_dev_list.t2_count != main_mdt->s_dev_list.t2_count) {
+			md_warn_cnd(mc->silent, "mdt t2 count mismatch\n");
+			return false;
+		}
+
+		if (mdt->s_dev_list.rmem_count !=
+		    main_mdt->s_dev_list.rmem_count) {
+			md_warn_cnd(mc->silent,
+				    "mdt rmem dev count mismatch\n");
+			return false;
+		}
+
+		if (!_mdt_compare_uuids(mdt, main_mdt, mc->silent))
+			return false;
+	}
+
+	/* check alignment */
+	dev_id = &mdt->s_dev_list.dev_ids[mdt->s_dev_list.id_index];
+	super_size = md_p2o(__dev_id_blocks(dev_id));
+	if (unlikely(!super_size || super_size & mc->alloc_mask)) {
+		md_warn_cnd(mc->silent, "super_size(0x%lx) ! 2_M aligned\n",
+			      super_size);
+		return false;
+	}
+
+	if (!bdev)
+		return true;
+
+	/* check t1 device size */
+	bdev_size = i_size_read(bdev->bd_inode);
+	if (unlikely(super_size > bdev_size)) {
+		md_warn_cnd(mc->silent,
+			    "bdev_size(0x%lx) too small expected 0x%lx\n",
+			    bdev_size, super_size);
+		return false;
+	} else if (unlikely(super_size < bdev_size)) {
+		md_dbg_err("Note mdt->size=(0x%lx) < bdev_size(0x%lx)\n",
+			      super_size, bdev_size);
+	}
+
+	return true;
+}
+
+int md_set_sb(struct multi_devices *md, struct block_device *s_bdev,
+	      void *sb, int silent)
+{
+	struct md_dev_info *main_mdi = md_dev_info(md, md->dev_index);
+	int i;
+
+	main_mdi->bdev = s_bdev;
+
+	for (i = 0; i < md->t1_count + md->t2_count; ++i) {
+		struct md_dev_info *mdi;
+
+		if (i == md->dev_index)
+			continue;
+
+		mdi = md_dev_info(md, i);
+		if (mdi->bdev->bd_super && (mdi->bdev->bd_super != sb)) {
+			md_warn_cnd(silent,
+				"!!! %s already mounted on a different FS => -EBUSY\n",
+				_bdev_name(mdi->bdev));
+			return -EBUSY;
+		}
+
+		mdi->bdev->bd_super = sb;
+	}
+
+	return 0;
+}
+
+void md_fini(struct multi_devices *md, bool put_all)
+{
+	struct md_dev_info *main_mdi;
+	int i;
+
+	if (unlikely(!md))
+		return;
+
+	main_mdi = md_dev_info(md, md->dev_index);
+	kfree(md->t2a.map);
+	kfree(md->t1a.map);
+
+	for (i = 0; i < md->t1_count + md->t2_count; ++i) {
+		struct md_dev_info *mdi = md_dev_info(md, i);
+
+		if (i < md->t1_count)
+			md_t1_info_fini(mdi);
+		if (!mdi->bdev || i == md->dev_index)
+			continue;
+		mdi->bdev->bd_super = NULL;
+		_bdev_put(&mdi->bdev);
+	}
+
+	if (put_all)
+		_bdev_put(&main_mdi->bdev);
+	else
+		/* Main dev is GET && PUT by VFS. Only stop pointing to it */
+		main_mdi->bdev = NULL;
+
+	kfree(md);
+}
+
+
+/* ~~~~~~~ Pre-mount operations ~~~~~~~ */
+
+static int _get_device(struct block_device **bdev, const char *dev_name,
+		       uuid_le *uuid, void *holder, int silent,
+		       bool *bind_mount)
+{
+	int err;
+
+	if (dev_name)
+		err = _bdev_get_by_path(dev_name, bdev, holder);
+	else
+		err = _bdev_get_by_uuid(bdev, uuid, holder, silent);
+
+	if (unlikely(err)) {
+		md_err_cnd(silent,
+			"failed to get device dev_name=%s uuid=%pUb err=%d\n",
+			dev_name, uuid, err);
+		return err;
+	}
+
+	if (bind_mount && (*bdev)->bd_super &&
+	    (*bdev)->bd_super->s_bdev == *bdev)
+		*bind_mount = true;
+
+	return 0;
+}
+
+static int _init_dev_info(struct md_dev_info *mdi, struct md_dev_id *id,
+			  int index, u64 offset,
+			  struct md_dev_table *main_mdt,
+			  struct mdt_check *mc, bool t1_dev,
+			  int silent)
+{
+	struct md_dev_table *mdt = NULL;
+	bool mdt_alloc = false;
+	int err = 0;
+
+	if (mdi->bdev == NULL) {
+		err = _get_device(&mdi->bdev, NULL, &id->uuid, mc->holder,
+				  silent, NULL);
+		if (unlikely(err))
+			return err;
+	}
+
+	mdi->offset = offset;
+	mdi->size = md_p2o(__dev_id_blocks(id));
+	mdi->index = index;
+
+	if (t1_dev) {
+		struct page *dev_page;
+		int end_of_dev_nid;
+
+		err = md_t1_info_init(mdi, silent);
+		if (unlikely(err))
+			return err;
+
+		if ((ulong)mdi->t1i.virt_addr & mc->alloc_mask) {
+			md_warn_cnd(silent, "!!! unaligned device %s\n",
+				      _bdev_name(mdi->bdev));
+			return -EINVAL;
+		}
+
+		if (!__pfn_to_section(mdi->t1i.phys_pfn)) {
+			md_err_cnd(silent, "Intel does not like pages...\n");
+			return -EINVAL;
+		}
+
+		mdt = mdi->t1i.virt_addr;
+
+		mdi->t1i.pgmap = virt_to_page(mdt)->pgmap;
+		dev_page = pfn_to_page(mdi->t1i.phys_pfn);
+		mdi->nid = page_to_nid(dev_page);
+		end_of_dev_nid = page_to_nid(dev_page + md_o2p(mdi->size - 1));
+
+		if (mdi->nid != end_of_dev_nid)
+			md_warn("pmem crosses NUMA boundaries");
+	} else {
+		mdt = (void *)__get_free_page(GFP_KERNEL);
+		if (unlikely(!mdt)) {
+			md_dbg_err("!!! failed to alloc page\n");
+			return -ENOMEM;
+		}
+
+		mdt_alloc = true;
+		err = _t2_mdt_read(mdi->bdev, mdt);
+		if (unlikely(err)) {
+			md_err_cnd(silent, "failed to read mdt from t2 => %d\n",
+				   err);
+			goto out;
+		}
+		mdi->nid = __dev_id_nid(id);
+	}
+
+	if (!md_mdt_check(mdt, main_mdt, mdi->bdev, mc)) {
+		md_err_cnd(silent, "device %s failed integrity check\n",
+			     _bdev_name(mdi->bdev));
+		err = -EINVAL;
+		goto out;
+	}
+
+	/* Success: err == 0. Fall into out: to free the t2 page too */
+
+out:
+	if (mdt_alloc)
+		free_page((ulong)mdt);
+	return err;
+}
+
+static int _map_setup(struct multi_devices *md, ulong blocks, int dev_start,
+		      struct md_dev_larray *larray)
+{
+	ulong map_size, bn_end;
+	int i, dev_index = dev_start;
+
+	map_size = blocks / larray->bn_gcd;
+	larray->map = kcalloc(map_size, sizeof(*larray->map), GFP_KERNEL);
+	if (!larray->map) {
+		md_dbg_err("failed to allocate dev map\n");
+		return -ENOMEM;
+	}
+
+	bn_end = md_o2p(md->devs[dev_index].size);
+	for (i = 0; i < map_size; ++i) {
+		if ((i * larray->bn_gcd) >= bn_end)
+			bn_end += md_o2p(md->devs[++dev_index].size);
+		larray->map[i] = &md->devs[dev_index];
+	}
+
+	return 0;
+}
+
+static int _md_init(struct multi_devices *md, struct mdt_check *mc,
+		    struct md_dev_list *dev_list, int silent)
+{
+	struct md_dev_table *main_mdt = NULL;
+	u64 total_size = 0;
+	int i, err;
+
+	for (i = 0; i < md->t1_count; ++i) {
+		struct md_dev_info *mdi = md_t1_dev(md, i);
+		struct md_dev_table *dev_mdt;
+
+		err = _init_dev_info(mdi, &dev_list->dev_ids[i], i, total_size,
+				     main_mdt, mc, true, silent);
+		if (unlikely(err))
+			return err;
+
+		/* apparently gcd(0,X)=X which is nice */
+		md->t1a.bn_gcd = gcd(md->t1a.bn_gcd, md_o2p(mdi->size));
+		total_size += mdi->size;
+
+		dev_mdt = md_t1_addr(md, i);
+		if (!main_mdt)
+			main_mdt = dev_mdt;
+
+		if (mdt_test_option(dev_mdt, MDT_F_SHADOW))
+			memcpy(mdi->t1i.virt_addr,
+			       mdi->t1i.virt_addr + mdi->size, mdi->size);
+
+		md_dbg_verbose("dev=%d %pUb %s v=%p pfn=%lu off=%lu size=%lu\n",
+				 i, &dev_list->dev_ids[i].uuid,
+				 _bdev_name(mdi->bdev), dev_mdt,
+				 mdi->t1i.phys_pfn, mdi->offset, mdi->size);
+	}
+
+	md->t1_blocks = le64_to_cpu(main_mdt->s_t1_blocks);
+	if (unlikely(md->t1_blocks != md_o2p(total_size))) {
+		md_err_cnd(silent,
+			"FS corrupted md->t1_blocks(0x%lx) != total_size(0x%llx)\n",
+			md->t1_blocks, total_size);
+		return -EIO;
+	}
+
+	err = _map_setup(md, le64_to_cpu(main_mdt->s_t1_blocks), 0, &md->t1a);
+	if (unlikely(err))
+		return err;
+
+	md_dbg_verbose("t1 devices=%d total_size=0x%llx segment_map=0x%lx\n",
+			 md->t1_count, total_size,
+			 md_o2p(total_size) / md->t1a.bn_gcd);
+
+	if (md->t2_count == 0)
+		return 0;
+
+	/* Done with t1. Counting t2s */
+	total_size = 0;
+	for (i = 0; i < md->t2_count; ++i) {
+		struct md_dev_info *mdi = md_t2_dev(md, i);
+
+		err = _init_dev_info(mdi, &dev_list->dev_ids[md->t1_count + i],
+				     md->t1_count + i, total_size, main_mdt,
+				     mc, false, silent);
+		if (unlikely(err))
+			return err;
+
+		/* apparently gcd(0,X)=X which is nice */
+		md->t2a.bn_gcd = gcd(md->t2a.bn_gcd, md_o2p(mdi->size));
+		total_size += mdi->size;
+
+		md_dbg_verbose("dev=%d %s off=%lu size=%lu\n", i,
+				 _bdev_name(mdi->bdev), mdi->offset, mdi->size);
+	}
+
+	md->t2_blocks = le64_to_cpu(main_mdt->s_t2_blocks);
+	if (unlikely(md->t2_blocks != md_o2p(total_size))) {
+		md_err_cnd(silent,
+			"FS corrupted md->t2_blocks(0x%lx) != total_size(0x%llx)\n",
+			md->t2_blocks, total_size);
+		return -EIO;
+	}
+
+	err = _map_setup(md, le64_to_cpu(main_mdt->s_t2_blocks), md->t1_count,
+			 &md->t2a);
+	if (unlikely(err))
+		return err;
+
+	md_dbg_verbose("t2 devices=%d total_size=%llu segment_map=%lu\n",
+			 md->t2_count, total_size,
+			 md_o2p(total_size) / md->t2a.bn_gcd);
+
+	return 0;
+}
+
+static int _load_dev_list(struct md_dev_list *dev_list, struct mdt_check *mc,
+			  struct block_device *bdev, const char *dev_name,
+			  int silent)
+{
+	struct md_dev_table *mdt;
+	int err;
+
+	mdt = (void *)__get_free_page(GFP_KERNEL);
+	if (unlikely(!mdt)) {
+		md_dbg_err("!!! failed to alloc page\n");
+		return -ENOMEM;
+	}
+
+	err = _t2_mdt_read(bdev, mdt);
+	if (unlikely(err)) {
+		md_err_cnd(silent, "failed to read super block from %s => %d\n",
+			     dev_name, err);
+		goto out;
+	}
+
+	if (!md_mdt_check(mdt, NULL, bdev, mc)) {
+		md_err_cnd(silent, "bad mdt in %s\n", dev_name);
+		err = -EINVAL;
+		goto out;
+	}
+
+	*dev_list = mdt->s_dev_list;
+
+out:
+	free_page((ulong)mdt);
+	return err;
+}
+
+/* md_init - allocates and initializes a ready-to-go multi_devices object
+ *
+ * Rule: if md_init returns an error the caller must always call md_fini
+ */
+int md_init(struct multi_devices **ret_md, const char *dev_name,
+	    struct mdt_check *mc, char path[PATH_UUID],	const char **dev_path)
+{
+	struct md_dev_list *dev_list;
+	struct block_device *bdev;
+	struct multi_devices *md;
+	short id_index;
+	bool bind_mount = false;
+	int err;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	*ret_md = md;
+	if (unlikely(!md))
+		return -ENOMEM;
+
+	dev_list = kmalloc(sizeof(*dev_list), GFP_KERNEL);
+	if (unlikely(!dev_list))
+		return -ENOMEM;
+
+	err = _get_device(&bdev, dev_name, NULL, mc->holder, mc->silent,
+			  &bind_mount);
+	if (unlikely(err))
+		goto out2;
+
+	err = _load_dev_list(dev_list, mc, bdev, dev_name, mc->silent);
+	if (unlikely(err)) {
+		_bdev_put(&bdev);
+		goto out2;
+	}
+
+	id_index = le16_to_cpu(dev_list->id_index);
+	if (bind_mount) {
+		_bdev_put(&bdev);
+		md->dev_index = id_index;
+		goto out;
+	}
+
+	md->t1_count = le16_to_cpu(dev_list->t1_count);
+	md->t2_count = le16_to_cpu(dev_list->t2_count);
+	md->devs[id_index].bdev = bdev;
+
+	if (id_index != 0) {
+		err = _get_device(&md_t1_dev(md, 0)->bdev, NULL,
+				  &dev_list->dev_ids[0].uuid, mc->holder,
+				  mc->silent, &bind_mount);
+		if (unlikely(err))
+			goto out2;
+
+		if (bind_mount)
+			goto out;
+	}
+
+	if (md->t2_count) {
+		int t2_index = md->t1_count;
+
+		/* t2 is the primary device if given in mount, or the first
+		 * mount specified it as primary device
+		 */
+		if (id_index != md->t1_count) {
+			err = _get_device(&md_t2_dev(md, 0)->bdev, NULL,
+					  &dev_list->dev_ids[t2_index].uuid,
+					  mc->holder, mc->silent, &bind_mount);
+			if (unlikely(err))
+				goto out2;
+
+			if (bind_mount)
+				md->dev_index = t2_index;
+		}
+
+		if (t2_index <= id_index)
+			md->dev_index = t2_index;
+	}
+
+out:
+	if (md->dev_index != id_index)
+		*dev_path = _uuid_path(&dev_list->dev_ids[md->dev_index].uuid,
+				       path);
+	else
+		*dev_path = dev_name;
+
+	if (!bind_mount) {
+		err = _md_init(md, mc, dev_list, mc->silent);
+		if (unlikely(err))
+			goto out2;
+		if (!(mc->private_mnt))
+			_bdev_put(&md_dev_info(md, md->dev_index)->bdev);
+	} else {
+		md_fini(md, true);
+	}
+
+out2:
+	kfree(dev_list);
+
+	return err;
+}
+
+/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * PORTING SECTION:
+ * Below are functions that are implemented differently in different Linux
+ * versions. So they are kept separate from the rest of the code.
+ */
+static int _check_da_ret(struct md_dev_info *mdi, long avail, bool silent)
+{
+	if (unlikely(avail < (long)mdi->size)) {
+		if (0 < avail) {
+			md_warn_cnd(silent,
+				"Unsupported DAX device %s (range mismatch) => 0x%lx < 0x%lx\n",
+				_bdev_name(mdi->bdev), avail, mdi->size);
+			return -ERANGE;
+		}
+		md_warn_cnd(silent, "!!! %s direct_access return => %ld\n",
+			    _bdev_name(mdi->bdev), avail);
+		return avail;
+	}
+	return 0;
+}
+
+#include <linux/dax.h>
+
+int md_t1_info_init(struct md_dev_info *mdi, bool silent)
+{
+	pfn_t a_pfn_t;
+	void *addr;
+	long nrpages, avail, pgoff;
+	int id;
+
+	mdi->t1i.dax_dev = fs_dax_get_by_bdev(mdi->bdev);
+	if (unlikely(!mdi->t1i.dax_dev))
+		return -EOPNOTSUPP;
+
+	id = dax_read_lock();
+
+	bdev_dax_pgoff(mdi->bdev, 0, PAGE_SIZE, &pgoff);
+	nrpages = dax_direct_access(mdi->t1i.dax_dev, pgoff, md_o2p(mdi->size),
+				    &addr, &a_pfn_t);
+	dax_read_unlock(id);
+	if (unlikely(nrpages <= 0)) {
+		if (!nrpages)
+			nrpages = -ERANGE;
+		avail = nrpages;
+	} else {
+		avail = md_p2o(nrpages);
+	}
+
+	mdi->t1i.virt_addr = addr;
+	mdi->t1i.phys_pfn = pfn_t_to_pfn(a_pfn_t);
+
+	md_dbg_verbose("0x%lx 0x%lx pgoff=0x%lx\n",
+			 (ulong)mdi->t1i.virt_addr, mdi->t1i.phys_pfn, pgoff);
+
+	return _check_da_ret(mdi, avail, silent);
+}
+
+void md_t1_info_fini(struct md_dev_info *mdi)
+{
+	fs_put_dax(mdi->t1i.dax_dev);
+	mdi->t1i.dax_dev = NULL;
+	mdi->t1i.virt_addr = NULL;
+}
diff --git a/fs/zuf/md.h b/fs/zuf/md.h
new file mode 100644
index 000000000000..15ba7d646544
--- /dev/null
+++ b/fs/zuf/md.h
@@ -0,0 +1,332 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __MD_H__
+#define __MD_H__
+
+#include <linux/types.h>
+
+#include "md_def.h"
+
+#ifndef __KERNEL__
+struct page;
+struct block_device;
+#else
+#	include <linux/blkdev.h>
+#endif /* ndef __KERNEL__ */
+
+struct md_t1_info {
+	void *virt_addr;
+#ifdef __KERNEL__
+	ulong phys_pfn;
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+#endif /*def __KERNEL__*/
+};
+
+struct md_t2_info {
+#ifndef __KERNEL__
+	bool err_read_reported;
+	bool err_write_reported;
+#endif
+};
+
+struct md_dev_info {
+	struct block_device *bdev;
+	ulong size;
+	ulong offset;
+	union {
+		struct md_t1_info	t1i;
+		struct md_t2_info	t2i;
+	};
+	int index;
+	int nid;
+};
+
+struct md_dev_larray {
+	ulong bn_gcd;
+	struct md_dev_info **map;
+};
+
+#ifndef __KERNEL__
+struct fba {
+	int fd; void *ptr;
+	size_t size;
+	void *orig_ptr;
+};
+#endif /*! __KERNEL__*/
+
+struct zus_sb_info;
+struct multi_devices {
+	int dev_index;
+	int t1_count;
+	int t2_count;
+	struct md_dev_info devs[MD_DEV_MAX];
+	struct md_dev_larray t1a;
+	struct md_dev_larray t2a;
+#ifndef __KERNEL__
+	struct zufs_ioc_pmem pmem_info; /* As received from Kernel */
+
+	void *p_pmem_addr;
+	int fd;
+	uint user_page_size;
+	struct fba pages;
+	struct zus_sb_info *sbi;
+#else
+	ulong t1_blocks;
+	ulong t2_blocks;
+#endif /*! __KERNEL__*/
+};
+
+enum md_init_flags {
+	MD_I_F_PRIVATE		= (1UL << 0),
+};
+
+static inline __u64 md_p2o(ulong bn)
+{
+	return (__u64)bn << PAGE_SHIFT;
+}
+
+static inline ulong md_o2p(__u64 offset)
+{
+	return offset >> PAGE_SHIFT;
+}
+
+static inline ulong md_o2p_up(__u64 offset)
+{
+	return md_o2p(offset + PAGE_SIZE - 1);
+}
+
+static inline struct md_dev_info *md_t1_dev(struct multi_devices *md, int i)
+{
+	return &md->devs[i];
+}
+
+static inline struct md_dev_info *md_t2_dev(struct multi_devices *md, int i)
+{
+	return &md->devs[md->t1_count + i];
+}
+
+static inline struct md_dev_info *md_dev_info(struct multi_devices *md, int i)
+{
+	return &md->devs[i];
+}
+
+static inline void *md_t1_addr(struct multi_devices *md, int i)
+{
+	struct md_dev_info *mdi = md_t1_dev(md, i);
+
+	return mdi->t1i.virt_addr;
+}
+
+static inline ulong md_t1_blocks(struct multi_devices *md)
+{
+#ifdef __KERNEL__
+	return md->t1_blocks;
+#else
+	return md->pmem_info.mdt.s_t1_blocks;
+#endif
+}
+
+static inline ulong md_t2_blocks(struct multi_devices *md)
+{
+#ifdef __KERNEL__
+	return md->t2_blocks;
+#else
+	return md->pmem_info.mdt.s_t2_blocks;
+#endif
+}
+
+static inline struct md_dev_table *md_zdt(struct multi_devices *md)
+{
+	return md_t1_addr(md, 0);
+}
+
+static inline struct md_dev_info *md_bn_t1_dev(struct multi_devices *md,
+						 ulong bn)
+{
+	return md->t1a.map[bn / md->t1a.bn_gcd];
+}
+
+static inline uuid_le *md_main_uuid(struct multi_devices *md)
+{
+	return &md_zdt(md)->s_dev_list.dev_ids[md->dev_index].uuid;
+}
+
+#ifdef __KERNEL__
+static inline ulong md_pfn(struct multi_devices *md, ulong block)
+{
+	struct md_dev_info *mdi;
+	bool add_pfn = false;
+	ulong base_pfn;
+
+	if (unlikely(md_t1_blocks(md) <= block)) {
+		if (WARN_ON(!mdt_test_option(md_zdt(md), MDT_F_SHADOW)))
+			return 0;
+		block -= md_t1_blocks(md);
+		add_pfn = true;
+	}
+
+	mdi = md_bn_t1_dev(md, block);
+	if (add_pfn)
+		base_pfn = mdi->t1i.phys_pfn + md_o2p(mdi->size);
+	else
+		base_pfn = mdi->t1i.phys_pfn;
+	return base_pfn + (block - md_o2p(mdi->offset));
+}
+#endif /* def __KERNEL__ */
+
+static inline void *md_addr(struct multi_devices *md, ulong offset)
+{
+#ifdef __KERNEL__
+	struct md_dev_info *mdi = md_bn_t1_dev(md, md_o2p(offset));
+
+	return offset ? mdi->t1i.virt_addr + (offset - mdi->offset) : NULL;
+#else
+	return offset ? md->p_pmem_addr + offset : NULL;
+#endif
+}
+
+static inline void *md_baddr(struct multi_devices *md, ulong bn)
+{
+	return md_addr(md, md_p2o(bn));
+}
+
+static inline struct md_dev_info *md_bn_t2_dev(struct multi_devices *md,
+					       ulong bn)
+{
+	return md->t2a.map[bn / md->t2a.bn_gcd];
+}
+
+static inline int md_t2_bn_nid(struct multi_devices *md, ulong bn)
+{
+	struct md_dev_info *mdi = md_bn_t2_dev(md, bn);
+
+	return mdi->nid;
+}
+
+static inline ulong md_t2_local_bn(struct multi_devices *md, ulong bn)
+{
+#ifdef __KERNEL__
+	struct md_dev_info *mdi = md_bn_t2_dev(md, bn);
+
+	return bn - md_o2p(mdi->offset);
+#else
+	return bn; /* In zus we just let Kernel worry about it */
+#endif
+}
+
+static inline ulong md_t2_gcd(struct multi_devices *md)
+{
+	return md->t2a.bn_gcd;
+}
+
+static inline void *md_addr_verify(struct multi_devices *md, ulong offset)
+{
+	if (unlikely(offset > md_p2o(md_t1_blocks(md)))) {
+		md_dbg_err("offset=0x%lx > max=0x%llx\n",
+			    offset, md_p2o(md_t1_blocks(md)));
+		return NULL;
+	}
+
+	return md_addr(md, offset);
+}
+
+static inline struct page *md_bn_to_page(struct multi_devices *md, ulong bn)
+{
+#ifdef __KERNEL__
+	return pfn_to_page(md_pfn(md, bn));
+#else
+	return md->pages.ptr + bn * md->user_page_size;
+#endif
+}
+
+static inline ulong md_addr_to_offset(struct multi_devices *md, void *addr)
+{
+#ifdef __KERNEL__
+	/* TODO: Keep the device index in page-flags (we need to fix the
+	 * page-ref, right?). For now, with pages untouched, we need this loop
+	 */
+	int dev_index;
+
+	for (dev_index = 0; dev_index < md->t1_count; ++dev_index) {
+		struct md_dev_info *mdi = md_t1_dev(md, dev_index);
+
+		if ((mdi->t1i.virt_addr <= addr) &&
+		    (addr < (mdi->t1i.virt_addr + mdi->size)))
+			return mdi->offset + (addr - mdi->t1i.virt_addr);
+	}
+
+	return 0;
+#else /* !__KERNEL__ */
+	return addr - md->p_pmem_addr;
+#endif
+}
+
+static inline ulong md_addr_to_bn(struct multi_devices *md, void *addr)
+{
+	return md_o2p(md_addr_to_offset(md, addr));
+}
+
+static inline ulong md_page_to_bn(struct multi_devices *md, struct page *page)
+{
+#ifdef __KERNEL__
+	return md_addr_to_bn(md, page_address(page));
+#else
+	ulong bytes = (void *)page - md->pages.ptr;
+
+	return bytes / md->user_page_size;
+#endif
+}
+
+#ifdef __KERNEL__
+/* TODO: Change API to take mdi and also support in um */
+static inline const char *_bdev_name(struct block_device *bdev)
+{
+	return dev_name(&bdev->bd_part->__dev);
+}
+#endif /*def __KERNEL__*/
+
+struct mdt_check {
+	ulong alloc_mask;
+	uint major_ver;
+	uint minor_ver;
+	__u32  magic;
+
+	void *holder;
+	bool silent;
+	bool private_mnt;
+};
+
+/* md.c */
+bool md_mdt_check(struct md_dev_table *mdt, struct md_dev_table *main_mdt,
+		  struct block_device *bdev, struct mdt_check *mc);
+int md_t2_mdt_read(struct multi_devices *md, int dev_index,
+		   struct md_dev_table *mdt);
+int md_t2_mdt_write(struct multi_devices *md, struct md_dev_table *mdt);
+short md_calc_csum(struct md_dev_table *mdt);
+void md_fini(struct multi_devices *md, bool put_all);
+
+#ifdef __KERNEL__
+/* length of uuid dev path /dev/disk/by-uuid/<uuid> */
+#define PATH_UUID	64
+int md_init(struct multi_devices **md, const char *dev_name,
+	    struct mdt_check *mc, char path[PATH_UUID], const char **dp);
+int md_set_sb(struct multi_devices *md, struct block_device *s_bdev, void *sb,
+	      int silent);
+int md_t1_info_init(struct md_dev_info *mdi, bool silent);
+void md_t1_info_fini(struct md_dev_info *mdi);
+
+#else /* libzus */
+int md_init_from_pmem_info(struct multi_devices *md);
+#endif
+
+#endif
diff --git a/fs/zuf/md_def.h b/fs/zuf/md_def.h
new file mode 100644
index 000000000000..72eda8516754
--- /dev/null
+++ b/fs/zuf/md_def.h
@@ -0,0 +1,141 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note OR BSD-3-Clause */
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+#ifndef _LINUX_MD_DEF_H
+#define _LINUX_MD_DEF_H
+
+#include <linux/types.h>
+#include <linux/uuid.h>
+
+#ifndef __KERNEL__
+
+#include <stdint.h>
+#include <endian.h>
+#include <stdbool.h>
+#include <stdlib.h>
+
+#ifndef le16_to_cpu
+
+#define le16_to_cpu(x)	((__u16)le16toh(x))
+#define le32_to_cpu(x)	((__u32)le32toh(x))
+#define le64_to_cpu(x)	((__u64)le64toh(x))
+#define cpu_to_le16(x)	((__le16)htole16(x))
+#define cpu_to_le32(x)	((__le32)htole32(x))
+#define cpu_to_le64(x)	((__le64)htole64(x))
+
+#endif
+
+#ifndef __aligned
+#define	__aligned(x)			__attribute__((aligned(x)))
+#endif
+
+#endif /*  ndef __KERNEL__ */
+
+#define MDT_SIZE 4096
+
+#define MD_DEV_NUMA_SHIFT		60
+#define MD_DEV_BLOCKS_MASK		0x0FFFFFFFFFFFFFFF
+
+struct md_dev_id {
+	uuid_le	uuid;
+	__le64	blocks;
+} __aligned(8);
+
+static inline __u64 __dev_id_blocks(struct md_dev_id *dev)
+{
+	return le64_to_cpu(dev->blocks) & MD_DEV_BLOCKS_MASK;
+}
+
+static inline void __dev_id_blocks_set(struct md_dev_id *dev, __u64 blocks)
+{
+	dev->blocks &= cpu_to_le64(~MD_DEV_BLOCKS_MASK);
+	dev->blocks |= cpu_to_le64(blocks & MD_DEV_BLOCKS_MASK);
+}
+
+static inline int __dev_id_nid(struct md_dev_id *dev)
+{
+	return (int)(le64_to_cpu(dev->blocks) >> MD_DEV_NUMA_SHIFT);
+}
+
+static inline void __dev_id_nid_set(struct md_dev_id *dev, int nid)
+{
+	dev->blocks &= cpu_to_le64(MD_DEV_BLOCKS_MASK);
+	dev->blocks |= cpu_to_le64((__u64)nid << MD_DEV_NUMA_SHIFT);
+}
+
+/* 64 is the nicest number to still fit when the ZDT is 2048 and 6 bits can
+ * fit in page struct for address to block translation.
+ */
+#define MD_DEV_MAX   64
+
+struct md_dev_list {
+	__le16		   id_index;	/* index of current dev in list */
+	__le16		   t1_count;	/* # of t1 devs */
+	__le16		   t2_count;	/* # of t2 devs (after t1_count) */
+	__le16		   rmem_count;	/* align to 64 bit */
+	struct md_dev_id dev_ids[MD_DEV_MAX];
+} __aligned(64);
+
+/*
+ * Structure of the on-disk multi-device table
+ * NOTE: md_dev_table is always of size MDT_SIZE. These below are the
+ *   currently defined/used members in this version.
+ *   TODO: remove the s_ from all the fields
+ */
+struct md_dev_table {
+	/* static fields. they never change after file system creation.
+	 * checksum only validates up to s_start_dynamic field below
+	 */
+	__le16		s_sum;              /* checksum of this sb */
+	__le16		s_version;          /* zdt-version */
+	__le32		s_magic;            /* magic signature */
+	uuid_le		s_uuid;		    /* 128-bit uuid */
+	__le64		s_flags;
+	__le64		s_t1_blocks;
+	__le64		s_t2_blocks;
+
+	struct md_dev_list s_dev_list;
+
+	char		s_start_dynamic[0];
+
+	/* all the dynamic fields should go here */
+	__le64		s_mtime;		/* mount time */
+	__le64		s_wtime;		/* write time */
+};
+
+/* device table s_flags */
+enum enum_mdt_flags {
+	MDT_F_SHADOW		= (1UL << 0),	/* simulate cpu cache */
+	MDT_F_POSIXACL		= (1UL << 1),	/* enable acls */
+
+	MDT_F_USER_START	= 8,	/* first 8 bits reserved for mdt */
+};
+
+static inline bool mdt_test_option(struct md_dev_table *mdt,
+				   enum enum_mdt_flags flag)
+{
+	return (le64_to_cpu(mdt->s_flags) & flag) != 0;
+}
+
+#define MD_MINORS_PER_MAJOR	1024
+
+static inline int mdt_major_version(struct md_dev_table *mdt)
+{
+	return le16_to_cpu(mdt->s_version) / MD_MINORS_PER_MAJOR;
+}
+
+static inline int mdt_minor_version(struct md_dev_table *mdt)
+{
+	return le16_to_cpu(mdt->s_version) % MD_MINORS_PER_MAJOR;
+}
+
+#define MDT_STATIC_SIZE(mdt) ((__u64)&(mdt)->s_start_dynamic - (__u64)(mdt))
+
+#endif /* _LINUX_MD_DEF_H */
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index f7f7798425a9..2248ee74e4c2 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -20,6 +20,12 @@
 
 static struct kmem_cache *zuf_inode_cachep;
 
+struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
+				   struct zus_sb_info *zus_sbi)
+{
+	return NULL;
+}
+
 static void _init_once(void *foo)
 {
 	struct zuf_inode_info *zii = foo;
diff --git a/fs/zuf/t1.c b/fs/zuf/t1.c
new file mode 100644
index 000000000000..46ea7f6181fc
--- /dev/null
+++ b/fs/zuf/t1.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Just the special mmap of the whole t1 array to the ZUS Server
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/pfn_t.h>
+#include <asm/pgtable.h>
+
+#include "_pr.h"
+#include "zuf.h"
+
+/* ~~~ Functions for mmap a t1-array and page faults ~~~ */
+static struct zuf_pmem_file *_pmem_from_f_private(struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	WARN_ON(zsf->type != zlfs_e_pmem);
+	return container_of(zsf, struct zuf_pmem_file, hdr);
+}
+
+static vm_fault_t t1_fault(struct vm_fault *vmf, enum page_entry_size pe_size)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	ulong addr = vmf->address;
+	struct zuf_pmem_file *z_pmem;
+	pgoff_t size;
+	ulong bn;
+	pfn_t pfnt;
+	ulong pfn = 0;
+	vm_fault_t flt;
+
+	zuf_dbg_t1("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p pe_size=%d\n",
+		    inode->i_ino, vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		    vmf->flags, vmf->cow_page, vmf->page, pe_size);
+
+	if (unlikely(vmf->page)) {
+		zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+			"pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n",
+			inode->i_ino, vma->vm_start, vma->vm_end, addr,
+			vmf->pgoff, vmf->flags, vmf->page, vmf->cow_page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			 inode->i_ino, vmf->pgoff, pgoff, size);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	if (vmf->cow_page)
+		/* HOWTO: prevent private mmaps */
+		return VM_FAULT_SIGBUS;
+
+	z_pmem = _pmem_from_f_private(vma->vm_file);
+
+	switch (pe_size) {
+	case PE_SIZE_PTE:
+		zuf_err("[%ld] PTE fault not expected pgoff=0x%lx addr=0x%lx\n",
+			inode->i_ino, vmf->pgoff, addr);
+		/* Always PMD insert 2M chunks */
+		/* fall through */
+	case PE_SIZE_PMD:
+		bn = linear_page_index(vma, addr & PMD_MASK);
+		pfn = md_pfn(z_pmem->md, bn);
+		pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+		flt = vmf_insert_pfn_pmd(vmf, pfnt, true);
+		zuf_dbg_t1("[%ld] PMD pfn-0x%lx addr=0x%lx bn=0x%lx pgoff=0x%lx => %d\n",
+			inode->i_ino, pfn, addr, bn, vmf->pgoff, flt);
+		break;
+	default:
+		/* FIXME: PE_SIZE_PUD could easily be supported; it just needs
+		 * alignment to PUD_MASK at zufr_get_unmapped_area(). But this
+		 * is hard today because of the 2M the nvdimm lib takes for
+		 * its page flag information with NFIT. (That need not be
+		 * there in any case.)
+		 * Which means zufr_get_unmapped_area() needs to return
+		 * an align1G+2M start address, and the first 1G is mapped at
+		 * PMD size. Very ugly, sigh.
+		 * One thing I do not understand: when vma->vm_start is
+		 * not PUD aligned and the fault requests index zero, the
+		 * system asks for PE_SIZE_PUD anyway. Say my 0 index is 1G
+		 * aligned; vmf_insert_pfn_pud() will always fail because the
+		 * aligned vm_addr is outside the vma.
+		 */
+		flt = VM_FAULT_FALLBACK;
+		zuf_dbg_t1("[%ld] default? pgoff=0x%lx addr=0x%lx pe_size=0x%x => %d\n",
+			   inode->i_ino, vmf->pgoff, addr, pe_size, flt);
+	}
+
+	return flt;
+}
+
+static vm_fault_t t1_fault_pte(struct vm_fault *vmf)
+{
+	return t1_fault(vmf, PE_SIZE_PTE);
+}
+
+static const struct vm_operations_struct t1_vm_ops = {
+	.huge_fault	= t1_fault,
+	.fault		= t1_fault_pte,
+};
+
+int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf || zsf->type != zlfs_e_pmem)
+		return -EPERM;
+
+	vma->vm_flags |= VM_HUGEPAGE;
+	vma->vm_ops = &t1_vm_ops;
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
+
diff --git a/fs/zuf/t2.c b/fs/zuf/t2.c
new file mode 100644
index 000000000000..d293ce0ac249
--- /dev/null
+++ b/fs/zuf/t2.c
@@ -0,0 +1,356 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tier-2 operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+
+#include <linux/bitops.h>
+#include <linux/bio.h>
+
+#include "zuf.h"
+
+#define t2_warn zuf_warn
+
+static const char *_pr_rw(int rw)
+{
+	return (rw & WRITE) ? "WRITE" : "READ";
+}
+#define t2_tis_dbg(tis, fmt, args ...) \
+	zuf_dbg_t2("%s: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags),	       \
+		    atomic_read(&tis->refcount), tis->rw_flags, ##args)
+
+#define t2_tis_dbg_rw(tis, fmt, args ...) \
+	zuf_dbg_t2_rw("%s<%p>: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags),     \
+		    tis->priv, atomic_read(&tis->refcount), tis->rw_flags,\
+		    ##args)
+
+/* ~~~~~~~~~~~~ Async read/write ~~~~~~~~~~ */
+void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done,
+		 void *priv, uint n_vects, struct t2_io_state *tis)
+{
+	atomic_set(&tis->refcount, 1);
+	tis->md = md;
+	tis->done = done;
+	tis->priv = priv;
+	tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES);
+	tis->rw_flags = rw;
+	tis->last_t2 = -1;
+	tis->cur_bio = NULL;
+	tis->index = ~0;
+	bio_list_init(&tis->delayed_bios);
+	tis->err = 0;
+	blk_start_plug(&tis->plug);
+	t2_tis_dbg_rw(tis, "done=%pS n_vects=%d\n", done, n_vects);
+}
+
+static void _tis_put(struct t2_io_state *tis)
+{
+	t2_tis_dbg_rw(tis, "done=%pS\n", tis->done);
+
+	if (test_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags))
+		wake_up_var(&tis->refcount);
+	else if (tis->done)
+		/* last - done may free the tis */
+		tis->done(tis, NULL, true);
+}
+
+static inline void tis_get(struct t2_io_state *tis)
+{
+	atomic_inc(&tis->refcount);
+}
+
+static inline int tis_put(struct t2_io_state *tis)
+{
+	if (atomic_dec_and_test(&tis->refcount)) {
+		_tis_put(tis);
+		return 1;
+	}
+	return 0;
+}
+
+static int _status_to_errno(blk_status_t status)
+{
+	return blk_status_to_errno(status);
+}
+
+void t2_io_done(struct t2_io_state *tis, struct bio *bio, bool last)
+{
+	struct bio_vec *bv;
+	struct bvec_iter_all i;
+
+	if (!bio)
+		return;
+
+	bio_for_each_segment_all(bv, bio, i)
+		put_page(bv->bv_page);
+}
+
+static void _tis_bio_done(struct bio *bio)
+{
+	struct t2_io_state *tis = bio->bi_private;
+
+	t2_tis_dbg(tis, "done=%pS err=%d\n", tis->done, bio->bi_status);
+
+	if (unlikely(bio->bi_status)) {
+		zuf_dbg_err("%s: err=%d last-err=%d\n",
+			     _pr_rw(tis->rw_flags), bio->bi_status, tis->err);
+		/* Store the last one */
+		tis->err = _status_to_errno(bio->bi_status);
+	}
+
+	if (tis->done)
+		tis->done(tis, bio, false);
+	else
+		t2_io_done(tis, bio, false);
+
+	bio_put(bio);
+	tis_put(tis);
+}
+
+static bool _tis_delay(struct t2_io_state *tis)
+{
+	return 0 != (tis->rw_flags & TIS_DELAY_SUBMIT);
+}
+
+#define bio_list_for_each_safe(bio, btmp, bl)				\
+	for (bio = (bl)->head,	btmp = bio ? bio->bi_next : NULL;	\
+	     bio; bio = btmp,	btmp = bio ? bio->bi_next : NULL)
+
+static void _tis_submit_bio(struct t2_io_state *tis, bool flush, bool done)
+{
+	if (flush || done) {
+		if (_tis_delay(tis)) {
+			struct bio *btmp, *bio;
+
+			bio_list_for_each_safe(bio, btmp, &tis->delayed_bios) {
+				bio->bi_next = NULL;
+				if (bio->bi_iter.bi_sector == -1) {
+					t2_warn("!!!!!!!!!!!!!\n");
+					bio_put(bio);
+					continue;
+				}
+				t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+					    bio->bi_vcnt, tis->n_vects);
+				submit_bio(bio);
+			}
+			bio_list_init(&tis->delayed_bios);
+		}
+
+		if (!tis->cur_bio)
+			return;
+
+		if (tis->cur_bio->bi_iter.bi_sector != -1) {
+			t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+				    tis->cur_bio->bi_vcnt, tis->n_vects);
+			submit_bio(tis->cur_bio);
+			tis->cur_bio = NULL;
+			tis->index = ~0;
+		} else if (done) {
+			t2_tis_dbg(tis, "put cur_bio=%p\n", tis->cur_bio);
+			bio_put(tis->cur_bio);
+			WARN_ON(tis_put(tis));
+		}
+	} else if (tis->cur_bio && (tis->cur_bio->bi_iter.bi_sector != -1)) {
+		/* Not flushing regular progress */
+		if (_tis_delay(tis)) {
+			t2_tis_dbg(tis, "list_add cur_bio=%p\n", tis->cur_bio);
+			bio_list_add(&tis->delayed_bios, tis->cur_bio);
+		} else {
+			t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+				    tis->cur_bio->bi_vcnt, tis->n_vects);
+			submit_bio(tis->cur_bio);
+		}
+		tis->cur_bio = NULL;
+		tis->index = ~0;
+	}
+}
+
+/* tis->cur_bio MUST be NULL, checked by caller */
+static void _tis_alloc(struct t2_io_state *tis, struct md_dev_info *mdi,
+		       gfp_t gfp)
+{
+	struct bio *bio = bio_alloc(gfp, tis->n_vects);
+	int bio_op;
+
+	if (unlikely(!bio)) {
+		if (!_tis_delay(tis))
+			t2_warn("!!! failed to alloc bio\n");
+		tis->err = -ENOMEM;
+		return;
+	}
+
+	if (WARN_ON(!tis || !tis->md)) {
+		tis->err = -ENOMEM;
+		return;
+	}
+
+	/* FIXME: bio_set_op_attrs macro has a BUG which does not allow this
+	 * question inline.
+	 */
+	bio_op = (tis->rw_flags & WRITE) ? REQ_OP_WRITE : REQ_OP_READ;
+	bio_set_op_attrs(bio, bio_op, 0);
+
+	bio->bi_iter.bi_sector = -1;
+	bio->bi_end_io = _tis_bio_done;
+	bio->bi_private = tis;
+
+	if (mdi) {
+		bio_set_dev(bio, mdi->bdev);
+		tis->index = mdi->index;
+	} else {
+		tis->index = ~0;
+	}
+	tis->last_t2 = -1;
+	tis->cur_bio = bio;
+	tis_get(tis);
+	t2_tis_dbg(tis, "New bio n_vects=%d\n", tis->n_vects);
+}
+
+int t2_io_prealloc(struct t2_io_state *tis, uint n_vects)
+{
+	tis->err = 0; /* reset any -ENOMEM from a previous t2_io_add */
+
+	_tis_submit_bio(tis, true, false);
+	tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES);
+
+	t2_tis_dbg(tis, "n_vects=%d cur_bio=%p\n", tis->n_vects, tis->cur_bio);
+
+	if (!tis->cur_bio)
+		_tis_alloc(tis, NULL, GFP_NOFS);
+	return tis->err;
+}
+
+int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page)
+{
+	struct md_dev_info *mdi;
+	ulong local_t2;
+	int ret;
+
+	if (t2 > md_t2_blocks(tis->md)) {
+		zuf_err("bad t2 (0x%lx) offset\n", t2);
+		return -EFAULT;
+	}
+	mdi = md_bn_t2_dev(tis->md, t2);
+	WARN_ON(!mdi);
+
+	if (unlikely(!mdi->bdev)) {
+		zuf_err("mdi->bdev == NULL!! t2=0x%lx\n", t2);
+		return -EFAULT;
+	}
+
+	get_page(page);
+
+	local_t2 = md_t2_local_bn(tis->md, t2);
+	if (((local_t2 != (tis->last_t2 + 1)) && (tis->last_t2 != -1)) ||
+	   ((0 < tis->index) && (tis->index != mdi->index)))
+		_tis_submit_bio(tis, false, false);
+
+start:
+	if (!tis->cur_bio) {
+		_tis_alloc(tis, mdi, _tis_delay(tis) ? GFP_ATOMIC : GFP_NOFS);
+		if (unlikely(tis->err)) {
+			put_page(page);
+			return tis->err;
+		}
+	} else if (tis->index == ~0) {
+		/* the bio was allocated during t2_io_prealloc */
+		tis->index = mdi->index;
+		bio_set_dev(tis->cur_bio, mdi->bdev);
+	}
+
+	if (tis->last_t2 == -1)
+		tis->cur_bio->bi_iter.bi_sector =
+						local_t2 * T2_SECTORS_PER_PAGE;
+
+	ret = bio_add_page(tis->cur_bio, page, PAGE_SIZE, 0);
+	if (unlikely(ret != PAGE_SIZE)) {
+		t2_tis_dbg(tis, "bio_add_page=>%d bi_vcnt=%d n_vects=%d\n",
+			   ret, tis->cur_bio->bi_vcnt, tis->n_vects);
+		_tis_submit_bio(tis, false, false);
+		goto start; /* device does not support tis->n_vects */
+	}
+
+	if ((tis->cur_bio->bi_vcnt == tis->n_vects) && (tis->n_vects != 1))
+		_tis_submit_bio(tis, false, false);
+
+	t2_tis_dbg(tis, "t2=0x%lx last_t2=0x%lx local_t2=0x%lx t1=0x%lx\n",
+		   t2, tis->last_t2, local_t2, md_page_to_bn(tis->md, page));
+
+	tis->last_t2 = local_t2;
+	return 0;
+}
+
+int t2_io_end(struct t2_io_state *tis, bool wait)
+{
+	if (unlikely(!tis || !tis->md))
+		return 0; /* never initialized nothing to do */
+
+	t2_tis_dbg_rw(tis, "wait=%d\n", wait);
+
+	_tis_submit_bio(tis, true, true);
+	blk_finish_plug(&tis->plug);
+
+	if (wait)
+		set_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags);
+	tis_put(tis);
+
+	if (wait) {
+		wait_var_event(&tis->refcount, !atomic_read(&tis->refcount));
+		if (tis->done)
+			tis->done(tis, NULL, true);
+	}
+
+	return tis->err;
+}
+
+/* ~~~~~~~ Sync read/write ~~~~~~~ TODO: Remove soon */
+static int _sync_io_page(struct multi_devices *md, int rw, ulong bn,
+			 struct page *page)
+{
+	struct t2_io_state tis;
+	int err;
+
+	t2_io_begin(md, rw, NULL, NULL, 1, &tis);
+
+	t2_tis_dbg((&tis), "bn=0x%lx p-i=0x%lx\n", bn, page->index);
+
+	err = t2_io_add(&tis, bn, page);
+	if (unlikely(err))
+		return err;
+
+	err = submit_bio_wait(tis.cur_bio);
+	if (unlikely(err)) {
+		SetPageError(page);
+		/*
+		 * We failed to write the page out to tier-2.
+		 * Print a dire warning that things will go BAD (tm)
+		 * very quickly.
+		 */
+		zuf_err("io-error bn=0x%lx => %d\n", bn, err);
+	}
+
+	/* Same as t2_io_end+_tis_bio_done but without the kref stuff */
+	blk_finish_plug(&tis.plug);
+	put_page(page);
+	if (likely(tis.cur_bio))
+		bio_put(tis.cur_bio);
+
+	return err;
+}
+
+int t2_writepage(struct multi_devices *md, ulong bn, struct page *page)
+{
+	return _sync_io_page(md, WRITE, bn, page);
+}
+
+int t2_readpage(struct multi_devices *md, ulong bn, struct page *page)
+{
+	return _sync_io_page(md, READ, bn, page);
+}
diff --git a/fs/zuf/t2.h b/fs/zuf/t2.h
new file mode 100644
index 000000000000..cbd23dd409eb
--- /dev/null
+++ b/fs/zuf/t2.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
+/*
+ * Tier-2 Header file.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#ifndef __T2_H__
+#define __T2_H__
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/bio.h>
+#include <linux/kref.h>
+#include "md.h"
+
+#define T2_SECTORS_PER_PAGE	(PAGE_SIZE / 512)
+
+/* t2.c */
+
+/* Sync read/write */
+int t2_writepage(struct multi_devices *md, ulong bn, struct page *page);
+int t2_readpage(struct multi_devices *md, ulong bn, struct page *page);
+
+/* Async read/write */
+struct t2_io_state;
+typedef void (*t2_io_done_fn)(struct t2_io_state *tis, struct bio *bio,
+			      bool last);
+
+struct t2_io_state {
+	atomic_t refcount; /* counts in-flight bios */
+	struct blk_plug plug;
+
+	struct multi_devices	*md;
+	int		index;
+	t2_io_done_fn	done;
+	void		*priv;
+
+	uint		n_vects;
+	ulong		rw_flags;
+	ulong		last_t2;
+	struct bio	*cur_bio;
+	struct bio_list	delayed_bios;
+	int		err;
+};
+
+/* For rw_flags above */
+/* From Kernel: WRITE		(1U << 0) */
+#define TIS_DELAY_SUBMIT	(1U << 2)
+enum {B_TIS_FREE_AFTER_WAIT = 3};
+#define TIS_FREE_AFTER_WAIT	(1U << B_TIS_FREE_AFTER_WAIT)
+#define TIS_USER_DEF_FIRST	(1U << 8)
+
+void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done,
+		 void *priv, uint n_vects, struct t2_io_state *tis);
+int t2_io_prealloc(struct t2_io_state *tis, uint n_vects);
+int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page);
+int t2_io_end(struct t2_io_state *tis, bool wait);
+
+/* This is done by default if t2_io_done_fn above is NULL
+ * Can also be chain-called by users.
+ */
+void t2_io_done(struct t2_io_state *tis, struct bio *bio, bool last);
+
+#endif /*def __T2_H__*/
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 60f0d3ffe562..cc49cfa95244 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -359,6 +359,78 @@ static int _zu_numa_map(struct file *file, void *parg)
 	return err;
 }
 
+/* ~~~~ PMEM GRAB ~~~~ */
+/* FIXME: In pmem the struct md_dev_list for the t1(s) is not properly set.
+ * For now we do not fix it there and re-write the mdt; we only fix the
+ * copy we are about to send to the Server.
+ */
+static void _fix_numa_ids(struct multi_devices *md, struct md_dev_list *mdl)
+{
+	int i;
+
+	for (i = 0; i < md->t1_count; ++i)
+		if (md->devs[i].nid != __dev_id_nid(&mdl->dev_ids[i]))
+			__dev_id_nid_set(&mdl->dev_ids[i], md->devs[i].nid);
+}
+
+static int _zu_grab_pmem(struct file *file, void *parg)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zufs_ioc_pmem __user *arg_pmem = parg;
+	struct zufs_ioc_pmem *zi_pmem = kzalloc(sizeof(*zi_pmem), GFP_KERNEL);
+	struct super_block *sb;
+	struct zuf_sb_info *sbi;
+	size_t pmem_size;
+	int err;
+
+	if (unlikely(!zi_pmem))
+		return -ENOMEM;
+
+	err = get_user(zi_pmem->sb_id, &arg_pmem->sb_id);
+	if (err) {
+		zuf_err("\n");
+		goto out;
+	}
+
+	sb = zuf_sb_from_id(zri, zi_pmem->sb_id, NULL);
+	if (unlikely(!sb)) {
+		err = -ENODEV;
+		zuf_err("!!! pmem_kern_id=%llu not found\n", zi_pmem->sb_id);
+		goto out;
+	}
+	sbi = SBI(sb);
+
+	if (sbi->pmem.hdr.file) {
+		zuf_err("[%llu] pmem already taken\n", zi_pmem->sb_id);
+		err = -EIO;
+		goto out;
+	}
+
+	memcpy(&zi_pmem->mdt, md_zdt(sbi->md), sizeof(zi_pmem->mdt));
+	zi_pmem->dev_index = sbi->md->dev_index;
+	_fix_numa_ids(sbi->md, &zi_pmem->mdt.s_dev_list);
+
+	pmem_size = md_p2o(md_t1_blocks(sbi->md));
+	if (mdt_test_option(md_zdt(sbi->md), MDT_F_SHADOW))
+		pmem_size += pmem_size;
+	i_size_write(file->f_inode, pmem_size);
+	sbi->pmem.hdr.type = zlfs_e_pmem;
+	sbi->pmem.hdr.file = file;
+	sbi->pmem.md = sbi->md; /* FIXME: Use container_of in t1.c */
+	file->private_data = &sbi->pmem.hdr;
+	zuf_dbg_core("pmem %llu i_size=0x%llx GRABBED %s\n",
+		     zi_pmem->sb_id, i_size_read(file->f_inode),
+		     _bdev_name(md_t1_dev(sbi->md, 0)->bdev));
+
+out:
+	zi_pmem->hdr.err = err;
+	err = copy_to_user(parg, zi_pmem, sizeof(*zi_pmem)) ? -EFAULT : 0;
+	if (err)
+		zuf_err("copy_to_user => %d\n", err);
+	kfree(zi_pmem);
+	return err;
+}
+
 static void _prep_header_size_op(struct zufs_ioc_hdr *hdr,
 				 enum e_zufs_operation op, int err)
 {
@@ -886,6 +958,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
 		return _zu_mount(file, parg);
 	case ZU_IOC_NUMA_MAP:
 		return _zu_numa_map(file, parg);
+	case ZU_IOC_GRAB_PMEM:
+		return _zu_grab_pmem(file, parg);
 	case ZU_IOC_INIT_THREAD:
 		return _zu_init(file, parg);
 	case ZU_IOC_WAIT_OPT:
@@ -1135,6 +1209,8 @@ int zufc_mmap(struct file *file, struct vm_area_struct *vma)
 	switch (zsf->type) {
 	case zlfs_e_zt:
 		return zufc_zt_mmap(file, vma);
+	case zlfs_e_pmem:
+		return zuf_pmem_mmap(file, vma);
 	case zlfs_e_dpp_buff:
 		return zufc_ebuff_mmap(file, vma);
 	default:
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 05ec08d17d69..d0cb762f50ec 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -28,6 +28,8 @@
 #include "zus_api.h"
 
 #include "_pr.h"
+#include "md.h"
+#include "t2.h"
 
 enum zlfs_e_special_file {
 	zlfs_e_zt = 1,
@@ -98,6 +100,13 @@ static inline void zuf_add_fs_type(struct zuf_root_info *zri,
 	list_add(&zft->list, &zri->fst_list);
 }
 
+/* t1.c special file to mmap our pmem */
+struct zuf_pmem_file {
+	struct zuf_special_file hdr;
+	struct multi_devices *md;
+};
+
+
 /*
  * ZUF per-inode data in memory
  */
@@ -110,6 +119,51 @@ static inline struct zuf_inode_info *ZUII(struct inode *inode)
 	return container_of(inode, struct zuf_inode_info, vfs_inode);
 }
 
+/*
+ * ZUF super-block data in memory
+ */
+struct zuf_sb_info {
+	struct super_block *sb;
+	struct multi_devices *md;
+	struct zuf_pmem_file pmem;
+
+	/* zus cookie*/
+	struct zus_sb_info *zus_sbi;
+
+	/* Mount options */
+	unsigned long	s_mount_opt;
+	ulong		fs_caps;
+	char		*pmount_dev; /* for private mount */
+
+	spinlock_t		s_mmap_dirty_lock;
+	struct list_head	s_mmap_dirty;
+};
+
+static inline struct zuf_sb_info *SBI(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+static inline struct zuf_fs_type *ZUF_FST(struct file_system_type *fs_type)
+{
+	return container_of(fs_type, struct zuf_fs_type, vfs_fst);
+}
+
+static inline struct zuf_fs_type *zuf_fst(struct super_block *sb)
+{
+	return ZUF_FST(sb->s_type);
+}
+
+static inline struct zuf_root_info *ZUF_ROOT(struct zuf_sb_info *sbi)
+{
+	return zuf_fst(sbi->sb)->zri;
+}
+
+static inline bool zuf_rdonly(struct super_block *sb)
+{
+	return sb_rdonly(sb);
+}
+
 struct zuf_dispatch_op;
 typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg,
 				ulong zt_max_bytes);
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 6b1fbaf24222..4292a4fa5f1a 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -22,6 +22,8 @@
 #include <linux/fiemap.h>
 #include <stddef.h>
 
+#include "md_def.h"
+
 #ifdef __cplusplus
 #define NAMELESS(X) X
 #else
@@ -355,6 +357,19 @@ struct zufs_ioc_numa_map {
 };
 #define ZU_IOC_NUMA_MAP	_IOWR('Z', 13, struct zufs_ioc_numa_map)
 
+struct zufs_ioc_pmem {
+	/* Set by zus */
+	struct zufs_ioc_hdr hdr;
+	__u64 sb_id;
+
+	/* Returned to zus */
+	struct md_dev_table mdt;
+	__u32 dev_index;
+	__u32 ___pad;
+};
+/* A GRAB is never un-grabbed; umount or file close cleans it all up */
+#define ZU_IOC_GRAB_PMEM	_IOWR('Z', 14, struct zufs_ioc_pmem)
+
 /* ZT init */
 enum { ZUFS_MAX_ZT_CHANNELS = 4 };
 
-- 
2.21.0
* [PATCH 07/16] zuf: mounting
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (5 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 06/16] zuf: Multy Devices Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 08/16] zuf: Namei and directory operations Boaz Harrosh
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
This patch establishes a mounted filesystem.
These are the steps for mounting a zufs filesystem:
* All devices (single or multiple) are opened and established in
  an md object.
* mount_bdev is called with the main (first) device, which in turn
  calls fill_super.
* fill_super dispatches a mount operation (register_fs_info) to the
  Server with the sb_id of the newly created super_block.
* The Server, in its zus mount routine, first issues a
  GRAB_PMEM(sb_id) ioctl to establish a special file handle
  through which it has full access to all of its pmem space.
  With that it calls the zusFS to inspect the content
  of the devices and mount the FS.
* On return from mount the zusFS returns the root inode info.
* fill_super continues to create a root vfs-inode and returns
  successfully.
* We now have a mounted super_block, with corresponding super_block
  objects in the Server.
* Also in this patch are the global sb operations statfs, show-options
  and remount, and the umount/destruction of a super_block.
* There is special support for "private-mounting" of devices.
  Private-mounting is usually used by zusFS fsck/mkfs type
  applications that want full access to their multi-devices and an
  exclusive lock-down of them. The private
  mount exposes all the same services to the Server application, but
  there is no registered/mounted super_block in VFS.
  This is a very powerful tool for zusFS development because the
  exact same code that is used in a running FS is also used for the
  FS-utils. The code feels exactly the same as a live mount.
  (See the zus project for more info)
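The sb_id passed to the Server in the steps above is a 1-based slot
number in a per-root table (see _sb_add() and zuf_sb_from_id() in
super.c below). What follows is only a minimal user-space sketch of
that numbering scheme, not the kernel code: the names sbl_add,
sbl_from_id and sbl_remove and the growth step of 4 are made up for
illustration, and the sbl_lock mutex is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel-side super_block list. */
enum { SBL_INC = 4 };	/* growth step, value assumed for illustration */

struct sb_list {
	void **array;	/* slot i holds the object registered as id i + 1 */
	unsigned num;	/* number of occupied slots */
	unsigned max;	/* number of allocated slots */
};

/* Put obj in the first free slot; on success returns 0 and a 1-based id. */
static int sbl_add(struct sb_list *l, void *obj, uint64_t *id)
{
	unsigned i;

	if (l->num == l->max) {
		void **na = realloc(l->array,
				    (l->max + SBL_INC) * sizeof(*na));

		if (!na)
			return -1;
		/* zero only the newly added tail, like __GFP_ZERO would */
		memset(na + l->max, 0, SBL_INC * sizeof(*na));
		l->array = na;
		l->max += SBL_INC;
	}

	for (i = 0; i < l->max; ++i)
		if (!l->array[i])
			break;
	assert(i < l->max);	/* a free slot must exist: num < max */

	l->array[i] = obj;
	++l->num;
	*id = (uint64_t)i + 1;	/* id 0 is never handed out */
	return 0;
}

/* Map an id back to its object; NULL for invalid or stale ids. */
static void *sbl_from_id(const struct sb_list *l, uint64_t id)
{
	--id;			/* id 0 wraps around and fails the check */
	if (id >= l->max)
		return NULL;
	return l->array[id];
}

/* Free the slot that holds obj, if any, so its id can be reused. */
static void sbl_remove(struct sb_list *l, void *obj)
{
	unsigned i;

	for (i = 0; i < l->max; ++i) {
		if (l->array[i] == obj) {
			l->array[i] = NULL;
			--l->num;
			return;
		}
	}
}
```

Keeping 0 out of the id space means a zeroed ioctl argument can never
name a live super_block, and a removed slot shows up as a stale id
until it is reused by a later mount.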
[v2]
  big_alloc now uses a dedicated 8k kmem_cache pool instead of kmalloc
  because it is used in the IO fast path, and we want guaranteed forward
  progress.
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   2 +-
 fs/zuf/_extern.h  |  10 +-
 fs/zuf/inode.c    |  23 ++
 fs/zuf/super.c    | 806 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/zuf-core.c |  96 ++++++
 fs/zuf/zuf-root.c |   5 +
 fs/zuf/zuf.h      | 134 ++++++++
 fs/zuf/zus_api.h  |  35 ++
 8 files changed, 1107 insertions(+), 4 deletions(-)
 create mode 100644 fs/zuf/inode.c
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index a247bd85d9aa..a5800cad73fd 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,5 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += super.o
+zuf-y += super.o inode.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index a5929d3d165c..b1514e5821a2 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -51,12 +51,20 @@ int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
 int zuf_init_inodecache(void);
 void zuf_destroy_inodecache(void);
 
+int zuf_8k_cache_init(void);
+void zuf_8k_cache_fini(void);
+
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data);
-
+int zuf_private_mount(struct zuf_root_info *zri, struct register_fs_info *rfi,
+		      struct zufs_mount_info *zmi, struct super_block **sb_out);
+int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb);
 struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
 				   struct zus_sb_info *zus_sbi);
 
+/* inode.c */
+struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       zu_dpp_t _zi, bool *exist);
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
new file mode 100644
index 000000000000..a6115289dcda
--- /dev/null
+++ b/fs/zuf/inode.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode methods (allocate/free/read/write).
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       zu_dpp_t _zi, bool *exist)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 2248ee74e4c2..01927deb5013 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -18,12 +18,740 @@
 
 #include "zuf.h"
 
+static struct super_operations zuf_sops;
 static struct kmem_cache *zuf_inode_cachep;
 
+enum {
+	Opt_uid,
+	Opt_gid,
+	Opt_pedantic,
+	Opt_ephemeral,
+	Opt_dax,
+	Opt_zpmdev,
+	Opt_err
+};
+
+static const match_table_t tokens = {
+	{ Opt_pedantic,		"pedantic"		},
+	{ Opt_pedantic,		"pedantic=%d"		},
+	{ Opt_ephemeral,	"ephemeral"		},
+	{ Opt_dax,		"dax"			},
+	{ Opt_zpmdev,		ZUFS_PMDEV_OPT"=%s"	},
+	{ Opt_err,		NULL			},
+};
+
+static int _parse_options(struct zuf_sb_info *sbi, const char *data,
+			  bool remount, struct zufs_parse_options *po)
+{
+	char *orig_options, *options;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int err = 0;
+	bool ephemeral = false;
+	bool silent = test_opt(sbi, SILENT);
+	size_t mount_options_len = 0;
+
+	/* no options given */
+	if (!data)
+		return 0;
+
+	options = orig_options = kstrdup(data, GFP_KERNEL);
+	if (!options) {
+		zuf_err_cnd(silent, "kstrdup => -ENOMEM\n");
+		return -ENOMEM;
+	}
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+
+		/* Initialize args struct so we know whether arg was found */
+		args[0].to = args[0].from = NULL;
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_pedantic:
+			if (!args[0].from) {
+				po->mount_flags |= ZUFS_M_PEDANTIC;
+				set_opt(sbi, PEDANTIC);
+				continue;
+			}
+			if (match_int(&args[0], &po->pedantic))
+				goto bad_opt;
+			break;
+		case Opt_ephemeral:
+			po->mount_flags |= ZUFS_M_EPHEMERAL;
+			set_opt(sbi, EPHEMERAL);
+			ephemeral = true;
+			break;
+		case Opt_dax:
+			set_opt(sbi, DAX);
+			break;
+		case Opt_zpmdev:
+			if (unlikely(!test_opt(sbi, PRIVATE)))
+				goto bad_opt;
+			sbi->pmount_dev = match_strdup(&args[0]);
+			if (sbi->pmount_dev == NULL)
+				goto no_mem;
+			break;
+		default: {
+			if (mount_options_len != 0) {
+				po->mount_options[mount_options_len] = ',';
+				mount_options_len++;
+			}
+			strcat(po->mount_options, p);
+			mount_options_len += strlen(p);
+		}
+		}
+	}
+
+	if (remount && test_opt(sbi, EPHEMERAL) && (ephemeral == false))
+		clear_opt(sbi, EPHEMERAL);
+out:
+	kfree(orig_options);
+	return err;
+
+bad_opt:
+	zuf_warn_cnd(silent, "Bad mount option: \"%s\"\n", p);
+	err = -EINVAL;
+	goto out;
+no_mem:
+	zuf_warn_cnd(silent, "Not enough memory to parse options\n");
+	err = -ENOMEM;
+	goto out;
+}
+
+static int _print_tier_info(struct multi_devices *md, char **buff, int start,
+			    int count, int *_space, char *str)
+{
+	int space = *_space;
+	char *b = *buff;
+	int printed;
+	int i;
+
+	printed = snprintf(b, space, "%s", str);
+	if (unlikely(printed >= space))
+		return -ENOSPC;
+
+	b += printed;
+	space -= printed;
+
+	for (i = start; i < start + count; ++i) {
+		printed = snprintf(b, space, "%s%s", i == start ? "" : ",",
+				   _bdev_name(md_dev_info(md, i)->bdev));
+
+		if (unlikely(printed >= space))
+			return -ENOSPC;
+
+		b += printed;
+		space -= printed;
+	}
+	*_space = space;
+	*buff = b;
+
+	return 0;
+}
+
+static void _print_mount_info(struct zuf_sb_info *sbi, char *mount_options)
+{
+	struct multi_devices *md = sbi->md;
+	char buff[992];
+	int space = sizeof(buff);
+	char *b = buff;
+	int err;
+
+	err = _print_tier_info(md, &b, 0, md->t1_count, &space, "t1=");
+	if (unlikely(err))
+		goto no_space;
+
+	if (md->t2_count == 0)
+		goto print_options;
+
+	err = _print_tier_info(md, &b, md->t1_count, md->t2_count, &space,
+			       " t2=");
+	if (unlikely(err))
+		goto no_space;
+
+print_options:
+	if (mount_options) {
+		int printed = snprintf(b, space, " -o %s", mount_options);
+
+		if (unlikely(printed >= space))
+			goto no_space;
+	}
+
+print:
+	zuf_info("mounted %s (0x%lx/0x%lx)\n", buff,
+		 md_t1_blocks(sbi->md), md_t2_blocks(sbi->md));
+	return;
+
+no_space:
+	snprintf(buff + sizeof(buff) - 4, 4, "...");
+	goto print;
+}
+
+static void _sb_mwtime_now(struct super_block *sb, struct md_dev_table *zdt)
+{
+	struct timespec64 now = current_time(sb->s_root->d_inode);
+
+	timespec_to_mt(&zdt->s_mtime, &now);
+	zdt->s_wtime = zdt->s_mtime;
+	/* TODO _persist_md(sb, &zdt->s_mtime, 2*sizeof(zdt->s_mtime)); */
+}
+
+static void _clean_bdi(struct super_block *sb)
+{
+	if (sb->s_bdi != &noop_backing_dev_info) {
+		bdi_put(sb->s_bdi);
+		sb->s_bdi = &noop_backing_dev_info;
+	}
+}
+
+static int _setup_bdi(struct super_block *sb, const char *device_name)
+{
+	const char *n = sb->s_type->name;
+	int err;
+
+	if (sb->s_bdi)
+		_clean_bdi(sb);
+
+	err = super_setup_bdi_name(sb, "%s-%s", n, device_name);
+	if (unlikely(err)) {
+		zuf_err("Failed to super_setup_bdi\n");
+		return err;
+	}
+
+	sb->s_bdi->ra_pages = ZUFS_READAHEAD_PAGES;
+	sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK;
+	return 0;
+}
+
+static int _sb_add(struct zuf_root_info *zri, struct super_block *sb,
+		   __u64 *sb_id)
+{
+	uint i;
+	int err;
+
+	mutex_lock(&zri->sbl_lock);
+
+	if (zri->sbl.num == zri->sbl.max) {
+		struct super_block **new_array;
+
+		new_array = krealloc(zri->sbl.array,
+				  (zri->sbl.max + SBL_INC) * sizeof(*new_array),
+				  GFP_KERNEL | __GFP_ZERO);
+		if (unlikely(!new_array)) {
+			err = -ENOMEM;
+			goto out;
+		}
+		zri->sbl.max += SBL_INC;
+		zri->sbl.array = new_array;
+	}
+
+	for (i = 0; i < zri->sbl.max; ++i)
+		if (!zri->sbl.array[i])
+			break;
+
+	if (unlikely(i == zri->sbl.max)) {
+		zuf_err("!!!!! can't be! i=%d g_sbl.num=%d g_sbl.max=%d\n",
+			i, zri->sbl.num, zri->sbl.max);
+		err = -EFAULT;
+		goto out;
+	}
+
+	++zri->sbl.num;
+	zri->sbl.array[i] = sb;
+	*sb_id = i + 1;
+	err = 0;
+
+	zuf_dbg_vfs("sb_id=%lld\n", *sb_id);
+out:
+	mutex_unlock(&zri->sbl_lock);
+	return err;
+}
+
+static void _sb_remove(struct zuf_root_info *zri, struct super_block *sb)
+{
+	uint i;
+
+	mutex_lock(&zri->sbl_lock);
+
+	for (i = 0; i < zri->sbl.max; ++i)
+		if (zri->sbl.array[i] == sb)
+			break;
+	if (unlikely(i == zri->sbl.max)) {
+		zuf_err("!!!!! can't be! i=%d g_sbl.num=%d g_sbl.max=%d\n",
+			i, zri->sbl.num, zri->sbl.max);
+		goto out;
+	}
+
+	zri->sbl.array[i] = NULL;
+	--zri->sbl.num;
+out:
+	mutex_unlock(&zri->sbl_lock);
+}
+
 struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
 				   struct zus_sb_info *zus_sbi)
 {
-	return NULL;
+	struct super_block *sb;
+
+	--sb_id;
+
+	if (zri->sbl.max <= sb_id) {
+		zuf_err("Invalid SB_ID 0x%llx\n", sb_id);
+		return NULL;
+	}
+
+	sb = zri->sbl.array[sb_id];
+	if (!sb) {
+		zuf_err("Stale SB_ID 0x%llx\n", sb_id);
+		return NULL;
+	}
+
+	return sb;
+}
+
+static void zuf_put_super(struct super_block *sb)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+
+	/* FIXME: This is because of a Kernel BUG (in v4.20) which
+	 * sometimes complains in _setup_bdi() on a recycle_mount that sysfs
+	 * bdi already exists. Cleaning here solves it.
+	 * Calling synchronize_rcu in zuf_kill_sb() after the call to
+	 * kill_block_super() does NOT solve it.
+	 */
+	_clean_bdi(sb);
+
+	if (sbi->zus_sbi) {
+		struct zufs_ioc_mount zim = {
+			.zmi.zus_sbi = sbi->zus_sbi,
+		};
+
+		zufc_dispatch_mount(ZUF_ROOT(sbi), NULL, ZUFS_M_UMOUNT, &zim);
+		sbi->zus_sbi = NULL;
+	}
+
+	/* NOTE!!! this is a HACK! We should not touch the s_umount
+	 * lock, but to make lockdep happy we do so, since our devices
+	 * are held exclusively. Needs revisiting on every kernel version
+	 * change.
+	 */
+	if (sbi->md) {
+		up_write(&sb->s_umount);
+		md_fini(sbi->md, false);
+		down_write(&sb->s_umount);
+	}
+
+	_sb_remove(ZUF_ROOT(sbi), sb);
+	sb->s_fs_info = NULL;
+	if (!test_opt(sbi, FAILED))
+		zuf_info("unmounted /dev/%s\n", _bdev_name(sb->s_bdev));
+	kfree(sbi);
+}
+
+struct __fill_super_params {
+	struct multi_devices *md;
+	char *mount_options;
+};
+
+int zuf_private_mount(struct zuf_root_info *zri, struct register_fs_info *rfi,
+		      struct zufs_mount_info *zmi, struct super_block **sb_out)
+{
+	bool silent = zmi->po.mount_flags & ZUFS_M_SILENT;
+	char path[PATH_UUID];
+	const char *dev_path = NULL;
+	struct zuf_sb_info *sbi;
+	struct super_block *sb;
+	char *mount_options;
+	struct mdt_check mc = {
+		.alloc_mask	= ZUFS_ALLOC_MASK,
+		.major_ver	= rfi->FS_ver_major,
+		.minor_ver	= rfi->FS_ver_minor,
+		.magic		= rfi->FS_magic,
+
+		.silent = silent,
+		.private_mnt = true,
+	};
+	int err;
+
+	sb = kzalloc(sizeof(struct super_block), GFP_KERNEL);
+	if (unlikely(!sb)) {
+		zuf_err_cnd(silent, "Not enough memory to allocate sb\n");
+		return -ENOMEM;
+	}
+
+	sbi = kzalloc(sizeof(struct zuf_sb_info), GFP_KERNEL);
+	if (unlikely(!sbi)) {
+		zuf_err_cnd(silent, "Not enough memory to allocate sbi\n");
+		kfree(sb);
+		return -ENOMEM;
+	}
+
+	sb->s_fs_info = sbi;
+	sbi->sb = sb;
+
+	zmi->po.mount_flags |= ZUFS_M_PRIVATE;
+	set_opt(sbi, PRIVATE);
+
+	mount_options = kstrndup(zmi->po.mount_options,
+				 zmi->po.mount_options_len, GFP_KERNEL);
+	if (unlikely(!mount_options)) {
+		zuf_err_cnd(silent, "Not enough memory\n");
+		err = -ENOMEM;
+		goto fail;
+	}
+
+	memset(zmi->po.mount_options, 0, zmi->po.mount_options_len);
+
+	err = _parse_options(sbi, mount_options, 0, &zmi->po);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "option parsing failed => %d\n", err);
+		goto fail;
+	}
+
+	if (unlikely(!sbi->pmount_dev)) {
+		zuf_err_cnd(silent, "private mount missing mountdev option\n");
+		err = -EINVAL;
+		goto fail;
+	}
+
+	zmi->po.mount_options_len = strlen(zmi->po.mount_options);
+
+	mc.holder = sbi;
+	err = md_init(&sbi->md, sbi->pmount_dev, &mc, path, &dev_path);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "md_init failed! => %d\n", err);
+		goto fail;
+	}
+
+	zuf_dbg_verbose("private mount of %s\n", dev_path);
+
+	err = _sb_add(zri, sb, &zmi->sb_id);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "_sb_add failed => %d\n", err);
+		goto fail;
+	}
+
+	*sb_out = sb;
+	return 0;
+
+fail:
+	if (sbi->md)
+		md_fini(sbi->md, true);
+	kfree(mount_options);
+	kfree(sbi->pmount_dev);
+	kfree(sbi);
+	kfree(sb);
+
+	return err;
+}
+
+int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+
+	_sb_remove(zri, sb);
+	md_fini(sbi->md, true);
+	kfree(sbi->pmount_dev);
+	kfree(sbi);
+	kfree(sb);
+
+	return 0;
+}
+
+static int zuf_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct zuf_sb_info *sbi = NULL;
+	struct __fill_super_params *fsp = data;
+	struct zufs_ioc_mount zim = {};
+	struct zufs_ioc_mount *ioc_mount;
+	enum big_alloc_type bat;
+	struct register_fs_info *rfi;
+	struct inode *root_i;
+	size_t zim_size, mount_options_len;
+	bool exist;
+	int err;
+
+	BUILD_BUG_ON(sizeof(struct md_dev_table) > MDT_SIZE);
+	BUILD_BUG_ON(sizeof(struct zus_inode) != ZUFS_INODE_SIZE);
+
+	mount_options_len = (fsp->mount_options ?
+					strlen(fsp->mount_options) : 0) + 1;
+	zim_size = sizeof(zim) + mount_options_len;
+	ioc_mount = big_alloc(zim_size, sizeof(zim), &zim,
+			      GFP_KERNEL | __GFP_ZERO, &bat);
+	if (unlikely(!ioc_mount)) {
+		zuf_err_cnd(silent, "big_alloc(%zu) => -ENOMEM\n", zim_size);
+		return -ENOMEM;
+	}
+
+	ioc_mount->zmi.po.mount_options_len = mount_options_len;
+
+	err = _sb_add(zuf_fst(sb)->zri, sb, &ioc_mount->zmi.sb_id);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "_sb_add failed => %d\n", err);
+		goto error;
+	}
+
+	sbi = kzalloc(sizeof(struct zuf_sb_info), GFP_KERNEL);
+	if (!sbi) {
+		zuf_err_cnd(silent, "Not enough memory to allocate sbi\n");
+		err = -ENOMEM;
+		goto error;
+	}
+	sb->s_fs_info = sbi;
+	sbi->sb = sb;
+
+	/* Initialize embedded objects */
+	spin_lock_init(&sbi->s_mmap_dirty_lock);
+	INIT_LIST_HEAD(&sbi->s_mmap_dirty);
+	if (silent) {
+		ioc_mount->zmi.po.mount_flags |= ZUFS_M_SILENT;
+		set_opt(sbi, SILENT);
+	}
+
+	sbi->md = fsp->md;
+	err = md_set_sb(sbi->md, sb->s_bdev, sb, silent);
+	if (unlikely(err))
+		goto error;
+
+	err = _parse_options(sbi, fsp->mount_options, 0, &ioc_mount->zmi.po);
+	if (err)
+		goto error;
+
+	err = _setup_bdi(sb, _bdev_name(sb->s_bdev));
+	if (err) {
+		zuf_err_cnd(silent, "Failed to setup bdi => %d\n", err);
+		goto error;
+	}
+
+	/* Tell ZUS to mount an FS for us */
+	err = zufc_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi,
+				  ZUFS_M_MOUNT, ioc_mount);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "zufc_dispatch_mount failed => %d\n", err);
+		goto error;
+	}
+	sbi->zus_sbi = ioc_mount->zmi.zus_sbi;
+
+	/* Init with default values */
+	sb->s_blocksize_bits = ioc_mount->zmi.s_blocksize_bits;
+	sb->s_blocksize = 1 << ioc_mount->zmi.s_blocksize_bits;
+
+	rfi = &zuf_fst(sb)->rfi;
+
+	sb->s_magic = rfi->FS_magic;
+	sb->s_time_gran = rfi->s_time_gran;
+	sb->s_maxbytes = rfi->s_maxbytes;
+	sb->s_flags |= SB_NOSEC;
+
+	sbi->fs_caps = ioc_mount->zmi.fs_caps;
+	if (sbi->fs_caps & ZUFS_FSC_ACL_ON)
+		sb->s_flags |= SB_POSIXACL;
+
+	sb->s_op = &zuf_sops;
+
+	root_i = zuf_iget(sb, ioc_mount->zmi.zus_ii, ioc_mount->zmi._zi,
+			  &exist);
+	if (IS_ERR(root_i)) {
+		err = PTR_ERR(root_i);
+		zuf_err_cnd(silent, "zuf_iget failed => %d\n", err);
+		goto error;
+	}
+	WARN_ON(exist);
+
+	sb->s_root = d_make_root(root_i);
+	if (!sb->s_root) {
+		zuf_err_cnd(silent, "d_make_root root inode failed\n");
+		iput(root_i); /* undo zuf_iget */
+		err = -ENOMEM;
+		goto error;
+	}
+
+	if (!zuf_rdonly(sb))
+		_sb_mwtime_now(sb, md_zdt(sbi->md));
+
+	mt_to_timespec(&root_i->i_ctime, &zus_zi(root_i)->i_ctime);
+	mt_to_timespec(&root_i->i_mtime, &zus_zi(root_i)->i_mtime);
+
+	_print_mount_info(sbi, fsp->mount_options);
+	clear_opt(sbi, SILENT);
+	big_free(ioc_mount, bat);
+	return 0;
+
+error:
+	zuf_warn("NOT mounting => %d\n", err);
+	if (sbi) {
+		set_opt(sbi, FAILED);
+		zuf_put_super(sb);
+	}
+	big_free(ioc_mount, bat);
+	return err;
+}
+
+static void _zst_to_kst(const struct statfs64 *zst, struct kstatfs *kst)
+{
+	kst->f_type	= zst->f_type;
+	kst->f_bsize	= zst->f_bsize;
+	kst->f_blocks	= zst->f_blocks;
+	kst->f_bfree	= zst->f_bfree;
+	kst->f_bavail	= zst->f_bavail;
+	kst->f_files	= zst->f_files;
+	kst->f_ffree	= zst->f_ffree;
+	kst->f_fsid	= zst->f_fsid;
+	kst->f_namelen	= zst->f_namelen;
+	kst->f_frsize	= zst->f_frsize;
+	kst->f_flags	= zst->f_flags;
+}
+
+static int zuf_statfs(struct dentry *d, struct kstatfs *buf)
+{
+	struct zuf_sb_info *sbi = SBI(d->d_sb);
+	struct zufs_ioc_statfs ioc_statfs = {
+		.hdr.in_len = offsetof(struct zufs_ioc_statfs, statfs_out),
+		.hdr.out_len = sizeof(ioc_statfs),
+		.hdr.operation = ZUFS_OP_STATFS,
+		.zus_sbi = sbi->zus_sbi,
+	};
+	int err;
+
+	err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_statfs.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err_dispatch(d->d_sb,
+			"zufc_dispatch failed op=ZUFS_OP_STATFS => %d\n",
+			err);
+		return err;
+	}
+
+	_zst_to_kst(&ioc_statfs.statfs_out, buf);
+	return 0;
+}
+
+struct __mount_options {
+	struct zufs_ioc_mount_options imo;
+	char buf[ZUFS_MO_MAX];
+};
+
+static int zuf_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct zuf_sb_info *sbi = SBI(root->d_sb);
+	struct __mount_options mo = {
+		.imo.hdr.in_len = sizeof(mo.imo),
+		.imo.hdr.out_start = offsetof(typeof(mo.imo), buf),
+		.imo.hdr.out_len = 0,
+		.imo.hdr.out_max = sizeof(mo.buf),
+		.imo.hdr.operation = ZUFS_OP_SHOW_OPTIONS,
+		.imo.zus_sbi = sbi->zus_sbi,
+	};
+	int err;
+
+	if (test_opt(sbi, EPHEMERAL))
+		seq_puts(seq, ",ephemeral");
+	if (test_opt(sbi, DAX))
+		seq_puts(seq, ",dax");
+
+	err = zufc_dispatch(ZUF_ROOT(sbi), &mo.imo.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_err_dispatch(root->d_sb,
+			"zufc_dispatch failed op=ZUFS_OP_SHOW_OPTIONS => %d\n",
+			err);
+		/* NOTE: if zusd crashed and we try to run 'umount', it will
+		 * SEGFAULT because zufc_dispatch will return -EFAULT.
+		 * Just return 0 as if the FS has no specific mount options.
+		 */
+		return 0;
+	}
+	seq_puts(seq, mo.buf);
+
+	return 0;
+}
+
+static int zuf_show_devname(struct seq_file *seq, struct dentry *root)
+{
+	seq_printf(seq, "/dev/%s", _bdev_name(root->d_sb->s_bdev));
+
+	return 0;
+}
+
+static int zuf_remount(struct super_block *sb, int *mntflags, char *data)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zufs_ioc_mount zim = {};
+	struct zufs_ioc_mount *ioc_mount;
+	size_t remount_options_len, zim_size;
+	enum big_alloc_type bat;
+	ulong old_mount_opt = sbi->s_mount_opt;
+	int err;
+
+	zuf_info("remount... -o %s\n", data);
+
+	remount_options_len = data ? (strlen(data) + 1) : 0;
+	zim_size = sizeof(zim) + remount_options_len;
+	ioc_mount = big_alloc(zim_size, sizeof(zim), &zim,
+			      GFP_KERNEL | __GFP_ZERO, &bat);
+	if (unlikely(!ioc_mount))
+		return -ENOMEM;
+
+	ioc_mount->zmi.zus_sbi = sbi->zus_sbi;
+	ioc_mount->zmi.remount_flags = zuf_rdonly(sb) ? ZUFS_REM_WAS_RO : 0;
+	ioc_mount->zmi.po.mount_options_len = remount_options_len;
+
+	err = _parse_options(sbi, data, 1, &ioc_mount->zmi.po);
+	if (unlikely(err))
+		goto fail;
+
+	if (*mntflags & SB_RDONLY) {
+		ioc_mount->zmi.remount_flags |= ZUFS_REM_WILL_RO;
+
+		if (!zuf_rdonly(sb))
+			_sb_mwtime_now(sb, md_zdt(sbi->md));
+	} else if (zuf_rdonly(sb)) {
+		_sb_mwtime_now(sb, md_zdt(sbi->md));
+	}
+
+	err = zufc_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi,
+				  ZUFS_M_REMOUNT, ioc_mount);
+	if (unlikely(err))
+		goto fail;
+
+	big_free(ioc_mount, bat);
+	return 0;
+
+fail:
+	sbi->s_mount_opt = old_mount_opt;
+	big_free(ioc_mount, bat);
+	zuf_dbg_err("remount failed restore option\n");
+	return err;
+}
+
+static int zuf_update_s_wtime(struct super_block *sb)
+{
+	if (!(zuf_rdonly(sb))) {
+		struct timespec64 now = current_time(sb->s_root->d_inode);
+
+		timespec_to_mt(&md_zdt(SBI(sb)->md)->s_wtime, &now);
+	}
+	return 0;
+}
+
+static struct inode *zuf_alloc_inode(struct super_block *sb)
+{
+	struct zuf_inode_info *zii;
+
+	zii = kmem_cache_alloc(zuf_inode_cachep, GFP_NOFS);
+	if (!zii)
+		return NULL;
+
+	zii->vfs_inode.i_version.counter = 1;
+	return &zii->vfs_inode;
+}
+
+static void zuf_destroy_inode(struct inode *inode)
+{
+	kmem_cache_free(zuf_inode_cachep, ZUII(inode));
 }
 
 static void _init_once(void *foo)
@@ -31,6 +759,7 @@ static void _init_once(void *foo)
 	struct zuf_inode_info *zii = foo;
 
 	inode_init_once(&zii->vfs_inode);
+	zii->zi = NULL;
 }
 
 int __init zuf_init_inodecache(void)
@@ -52,8 +781,81 @@ void zuf_destroy_inodecache(void)
 	kmem_cache_destroy(zuf_inode_cachep);
 }
 
+static struct super_operations zuf_sops = {
+	.alloc_inode	= zuf_alloc_inode,
+	.destroy_inode	= zuf_destroy_inode,
+	.put_super	= zuf_put_super,
+	.freeze_fs	= zuf_update_s_wtime,
+	.unfreeze_fs	= zuf_update_s_wtime,
+	.statfs		= zuf_statfs,
+	.remount_fs	= zuf_remount,
+	.show_options	= zuf_show_options,
+	.show_devname	= zuf_show_devname,
+};
+
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data)
 {
-	return ERR_PTR(-ENOTSUPP);
+	int silent = flags & SB_SILENT ? 1 : 0;
+	struct __fill_super_params fsp = {
+		.mount_options = data,
+	};
+	struct zuf_fs_type *fst = ZUF_FST(fs_type);
+	struct register_fs_info *rfi = &fst->rfi;
+	struct mdt_check mc = {
+		.alloc_mask	= ZUFS_ALLOC_MASK,
+		.major_ver	= rfi->FS_ver_major,
+		.minor_ver	= rfi->FS_ver_minor,
+		.magic		= rfi->FS_magic,
+
+		.holder = fs_type,
+		.silent = silent,
+	};
+	struct dentry *ret = NULL;
+	char path[PATH_UUID];
+	const char *dev_path = NULL;
+	int err;
+
+	zuf_dbg_vfs("dev_name=%s, data=%s\n", dev_name, (const char *)data);
+
+	err = md_init(&fsp.md, dev_name, &mc, path, &dev_path);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "md_init failed! => %d\n", err);
+		goto out;
+	}
+
+	zuf_dbg_vfs("mounting with dev_path=%s\n", dev_path);
+	ret = mount_bdev(fs_type, flags, dev_path, &fsp, zuf_fill_super);
+
+out:
+	if (unlikely(err) && fsp.md)
+		md_fini(fsp.md, true);
+
+	return err ? ERR_PTR(err) : ret;
+}
+
+// ==== 8k fast_alloc ====
+static struct kmem_cache *zuf_8k_cachep;
+
+void *zuf_8k_alloc(gfp_t gfp)
+{
+	return kmem_cache_alloc(zuf_8k_cachep, gfp);
+}
+
+void zuf_8k_free(void *ptr)
+{
+	kmem_cache_free(zuf_8k_cachep, ptr);
+}
+
+int __init zuf_8k_cache_init(void)
+{
+	zuf_8k_cachep = kmem_cache_create("zuf_8k_cache", S_8K, 0, 0, NULL);
+	if (unlikely(!zuf_8k_cachep))
+		return -ENOMEM;
+	return 0;
+}
+
+void zuf_8k_cache_fini(void)
+{
+	kmem_cache_destroy(zuf_8k_cachep);
 }
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index cc49cfa95244..a417f9463682 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -63,6 +63,8 @@ const char *zuf_op_name(enum e_zufs_operation op)
 	switch  (op) {
 		CASE_ENUM_NAME(ZUFS_OP_NULL);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK);
+		CASE_ENUM_NAME(ZUFS_OP_STATFS);
+		CASE_ENUM_NAME(ZUFS_OP_SHOW_OPTIONS);
 	case ZUFS_OP_MAX_OPT:
 	default:
 		return "UNKNOWN";
@@ -290,6 +292,95 @@ static void zufc_mounter_release(struct file *file)
 	}
 }
 
+static int _zu_private_mounter_release(struct file *file)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zuf_special_file *zsf = file->private_data;
+	struct zuf_private_mount_info *zpmi;
+	int err;
+
+	zpmi = container_of(zsf, struct zuf_private_mount_info, zsf);
+
+	err = zuf_private_umount(zri, zpmi->sb);
+
+	kfree(zpmi);
+
+	return err;
+}
+
+static int _zu_private_mounter(struct file *file, void *parg)
+{
+	struct super_block *sb = file->f_inode->i_sb;
+	struct zufs_ioc_mount_private *zip = NULL;
+	struct zuf_private_mount_info *zpmi;
+	struct zuf_root_info *zri = ZRI(sb);
+	struct zufs_ioc_hdr hdr;
+	__u32 is_umount;
+	ulong cp_ret;
+	int err = 0;
+
+	get_user(is_umount,
+		 &((struct zufs_ioc_mount_private *)parg)->is_umount);
+	if (is_umount)
+		return _zu_private_mounter_release(file);
+
+	if (unlikely(file->private_data)) {
+		zuf_err("One mount per runner please..\n");
+		return -EINVAL;
+	}
+
+	zpmi = kzalloc(sizeof(*zpmi), GFP_KERNEL);
+	if (unlikely(!zpmi)) {
+		zuf_err("alloc failed\n");
+		return -ENOMEM;
+	}
+
+	zpmi->zsf.type = zlfs_e_private_mount;
+	zpmi->zsf.file = file;
+
+	cp_ret = copy_from_user(&hdr, parg, sizeof(hdr));
+	if (unlikely(cp_ret)) {
+		zuf_err("copy_from_user(hdr) => %ld\n", cp_ret);
+		err = -EFAULT;
+		goto fail;
+	}
+
+	zip = kmalloc(hdr.in_len, GFP_KERNEL);
+	if (unlikely(!zip)) {
+		zuf_err("alloc failed\n");
+		err = -ENOMEM;
+		goto fail;
+	}
+
+	cp_ret = copy_from_user(zip, parg, hdr.in_len);
+	if (unlikely(cp_ret)) {
+		zuf_err("copy_from_user => %ld\n", cp_ret);
+		err = -EFAULT;
+		goto fail;
+	}
+
+	err = zuf_private_mount(zri, &zip->rfi, &zip->zmi, &zpmi->sb);
+	if (unlikely(err))
+		goto fail;
+
+	cp_ret = copy_to_user(parg, zip, hdr.in_len);
+	if (unlikely(cp_ret)) {
+		zuf_err("copy_to_user => %ld\n", cp_ret);
+		err = -EFAULT;
+		goto fail;
+	}
+
+	file->private_data = &zpmi->zsf;
+
+out:
+	kfree(zip);
+	return err;
+
+fail:
+	kfree(zpmi);
+	goto out;
+}
+
 /* ~~~~ ZU_IOC_NUMA_MAP ~~~~ */
 static int _zu_numa_map(struct file *file, void *parg)
 {
@@ -966,6 +1057,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
 		return _zu_wait(file, parg);
 	case ZU_IOC_ALLOC_BUFFER:
 		return _zu_ebuff_alloc(file, parg);
+	case ZU_IOC_PRIVATE_MOUNT:
+		return _zu_private_mounter(file, parg);
 	case ZU_IOC_BREAK_ALL:
 		return _zu_break(file, parg);
 	default:
@@ -988,6 +1081,9 @@ int zufc_release(struct inode *inode, struct file *file)
 	case zlfs_e_mout_thread:
 		zufc_mounter_release(file);
 		return 0;
+	case zlfs_e_private_mount:
+		_zu_private_mounter_release(file);
+		return 0;
 	case zlfs_e_pmem:
 		/* NOTHING to clean for pmem file yet */
 		/* zuf_pmem_release(file);*/
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
index ea7eb810ea9d..ecf240bd3e3f 100644
--- a/fs/zuf/zuf-root.c
+++ b/fs/zuf/zuf-root.c
@@ -405,6 +405,10 @@ int __init zuf_root_init(void)
 {
 	int err = zuf_init_inodecache();
 
+	if (unlikely(err))
+		return err;
+
+	err = zuf_8k_cache_init();
 	if (unlikely(err))
 		return err;
 
@@ -431,6 +435,7 @@ static void __exit zuf_root_exit(void)
 {
 	unregister_filesystem(&zufr_type);
 	kset_unregister(zufr_kset);
+	zuf_8k_cache_fini();
 	zuf_destroy_inodecache();
 }
 
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index d0cb762f50ec..18cbc376cfa6 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -106,12 +106,33 @@ struct zuf_pmem_file {
 	struct multi_devices *md;
 };
 
+/*
+ * Private Super-block flags
+ */
+enum {
+	ZUF_MOUNT_PEDANTIC	= 0x000001,	/* Check for memory leaks */
+	ZUF_MOUNT_PEDANTIC_SHADOW = 0x00002,	/* */
+	ZUF_MOUNT_SILENT	= 0x000004,	/* verbosity is silent */
+	ZUF_MOUNT_EPHEMERAL	= 0x000008,	/* Don't persist the data */
+	ZUF_MOUNT_FAILED	= 0x000010,	/* mark a failed-mount */
+	ZUF_MOUNT_DAX		= 0x000020,	/* mounted with dax option */
+	ZUF_MOUNT_POSIXACL	= 0x000040,	/* mounted with posix acls */
+	ZUF_MOUNT_PRIVATE	= 0x000080,	/* private mount from runner */
+};
+
+#define clear_opt(sbi, opt)       (sbi->s_mount_opt &= ~ZUF_MOUNT_ ## opt)
+#define set_opt(sbi, opt)         (sbi->s_mount_opt |= ZUF_MOUNT_ ## opt)
+#define test_opt(sbi, opt)      (sbi->s_mount_opt & ZUF_MOUNT_ ## opt)
 
 /*
  * ZUF per-inode data in memory
  */
 struct zuf_inode_info {
 	struct inode		vfs_inode;
+
+	/* cookies from Server */
+	struct zus_inode	*zi;
+	struct zus_inode_info	*zus_ii;
 };
 
 static inline struct zuf_inode_info *ZUII(struct inode *inode)
@@ -164,6 +185,119 @@ static inline bool zuf_rdonly(struct super_block *sb)
 	return sb_rdonly(sb);
 }
 
+static inline bool zuf_is_nio_reads(struct inode *inode)
+{
+	return SBI(inode->i_sb)->fs_caps & ZUFS_FSC_NIO_READS;
+}
+
+static inline bool zuf_is_nio_writes(struct inode *inode)
+{
+	return SBI(inode->i_sb)->fs_caps & ZUFS_FSC_NIO_WRITES;
+}
+
+static inline struct zus_inode *zus_zi(struct inode *inode)
+{
+	return ZUII(inode)->zi;
+}
+
+/* An accessor because of the frequent use in prints */
+static inline ulong _zi_ino(struct zus_inode *zi)
+{
+	return le64_to_cpu(zi->i_ino);
+}
+
+static inline bool _zi_active(struct zus_inode *zi)
+{
+	return (zi->i_nlink || zi->i_mode);
+}
+
+static inline void mt_to_timespec(struct timespec64 *t, __le64 *mt)
+{
+	u32 nsec;
+
+	t->tv_sec = div_s64_rem(le64_to_cpu(*mt), NSEC_PER_SEC, &nsec);
+	t->tv_nsec = nsec;
+}
+
+static inline void timespec_to_mt(__le64 *mt, struct timespec64 *t)
+{
+	*mt = cpu_to_le64(t->tv_sec * NSEC_PER_SEC + t->tv_nsec);
+}
+
+static inline
+void zus_inode_cmtime_now(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_mtime = zi->i_ctime;
+}
+
+static inline
+void zus_inode_ctime_now(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+}
+
+static inline void *zuf_dpp_t_addr(struct super_block *sb, zu_dpp_t v)
+{
+	/* TODO: Implement zufs_ioc_create_mempool already */
+	if (WARN_ON(zu_dpp_t_pool(v)))
+		return NULL;
+
+	return md_addr_verify(SBI(sb)->md, zu_dpp_t_val(v));
+}
+
+enum big_alloc_type { ba_stack, ba_8k, ba_vmalloc };
+#define S_8K (1024UL * 8)
+
+void *zuf_8k_alloc(gfp_t gfp);
+void  zuf_8k_free(void *ptr);
+
+static inline
+void *big_alloc(uint bytes, uint local_size, void *local, gfp_t gfp,
+		enum big_alloc_type *bat)
+{
+	void *ptr;
+
+	if (bytes <= local_size) {
+		*bat = ba_stack;
+		ptr = local;
+	} else if (bytes <= S_8K) {
+		*bat = ba_8k;
+		ptr = zuf_8k_alloc(gfp);
+	} else {
+		*bat = ba_vmalloc;
+		ptr = vmalloc(bytes);
+	}
+
+	return ptr;
+}
+
+static inline void big_free(void *ptr, enum big_alloc_type bat)
+{
+	if (unlikely(!ptr))
+		return;
+
+	switch (bat) {
+	case ba_stack:
+		break;
+	case ba_8k:
+		zuf_8k_free(ptr);
+		break;
+	case ba_vmalloc:
+		vfree(ptr);
+	}
+}
+
+#if (CONFIG_FRAME_WARN == 0)
+#	define ZUF_MAX_STACK(minus) (THREAD_SIZE / 2 - minus)
+#elif (CONFIG_FRAME_WARN < (S_8K + 8))
+#	define ZUF_MAX_STACK(minus) (CONFIG_FRAME_WARN - minus)
+#else
+#	define ZUF_MAX_STACK(minus) ((S_8K + 8) - minus)
+#endif
+
 struct zuf_dispatch_op;
 typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg,
 				ulong zt_max_bytes);
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 4292a4fa5f1a..1af3bd016453 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -330,6 +330,17 @@ struct  zufs_ioc_mount {
 };
 #define ZU_IOC_MOUNT		_IOWR('Z', 11, struct zufs_ioc_mount)
 
+/* Mount locally with a zus-runner process */
+#define ZUFS_PMDEV_OPT "zpmdev"
+struct zufs_ioc_mount_private {
+	struct zufs_ioc_hdr	hdr;
+	__u32			mount_fd; /* kernel cookie */
+	__u32			is_umount; /* true or false */
+	struct register_fs_info	rfi;
+	struct zufs_mount_info	zmi; /* must be last */
+};
+#define ZU_IOC_PRIVATE_MOUNT	_IOWR('Z', 12, struct zufs_ioc_mount_private)
+
 /* pmem  */
 struct zufs_cpu_set {
 	ulong bits[16];
@@ -432,7 +443,31 @@ enum e_zufs_operation {
 	ZUFS_OP_NULL		= 0,
 	ZUFS_OP_BREAK		= 1,	/* Kernel telling Server to exit */
 
+	ZUFS_OP_STATFS		= 2,
+	ZUFS_OP_SHOW_OPTIONS	= 3,
+
 	ZUFS_OP_MAX_OPT,
 };
 
+#define ZUFS_MO_MAX	512
+
+struct zufs_ioc_mount_options {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_sb_info *zus_sbi;
+
+	/* OUT */
+	char	buf[0];
+};
+
+/* ZUFS_OP_STATFS */
+struct zufs_ioc_statfs {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_sb_info *zus_sbi;
+
+	/* OUT */
+	struct statfs64 statfs_out;
+};
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.21.0
* [PATCH 08/16] zuf: Namei and directory operations
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (6 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 07/16] zuf: mounting Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 09/16] zuf: readdir operation Boaz Harrosh
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
Introduce creation and deletion of files, adding and removing
directory entries, and the other namei operations.
This all follows the standard Kernel way of doing things:
each VFS operation is packed and dispatched to the Server, and
after the dispatch returns the results are pushed into the Kernel
structures.
NOTE: The zus_inode communication structure is returned as a
zu_dpp_t (Dual port pointer). Both Kernel and Server can
read/write this object.
If the Kernel modifies this object it always does so before
the dispatch, so the Server can persist the changes.
The Server also uses it to return new info to be updated
into the vfs_inode.
In a pmem system this object can point directly to storage.
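The ordering described above (modify the shared object, dispatch,
then sync the results back) can be illustrated with a small
user-space sketch. Everything in it is an assumption made for
illustration only: struct shared_inode, struct vfs_inode_sketch,
server_add_dentry() and kernel_add_dentry() are hypothetical
stand-ins, not the real zus_inode layout or the zuf dispatch path,
and endianness conversion is left out. Only the time packing
(nanoseconds since the epoch in one 64-bit value) mirrors what
timespec_to_mt()/mt_to_timespec() do in zuf.h.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the shared (dual-port) inode object. */
struct shared_inode {
	uint64_t i_ctime;	/* nanoseconds since the epoch */
	uint64_t i_mtime;	/* nanoseconds since the epoch */
	uint64_t i_size;
};

/* Hypothetical stand-in for the Kernel-side vfs inode. */
struct vfs_inode_sketch {
	struct timespec i_ctime;
	struct timespec i_mtime;
	uint64_t i_size;
};

/* Pack a timespec into a single 64-bit nanosecond count. */
static uint64_t timespec_to_mt(const struct timespec *t)
{
	assert(t->tv_nsec >= 0 && (uint64_t)t->tv_nsec < NSEC_PER_SEC);
	return (uint64_t)t->tv_sec * NSEC_PER_SEC + (uint64_t)t->tv_nsec;
}

/* Unpack a 64-bit nanosecond count back into a timespec. */
static struct timespec mt_to_timespec(uint64_t mt)
{
	struct timespec t;

	t.tv_sec = (time_t)(mt / NSEC_PER_SEC);
	t.tv_nsec = (long)(mt % NSEC_PER_SEC);
	return t;
}

/*
 * "Server" side (assumed behaviour): it sees the times the Kernel
 * already wrote and returns new info by growing the directory size.
 * The increment of 32 is an arbitrary illustrative value.
 */
static int server_add_dentry(struct shared_inode *dir_zi)
{
	dir_zi->i_size += 32;
	return 0;
}

/* "Kernel" side: modify before dispatch, sync back after it. */
static int kernel_add_dentry(struct vfs_inode_sketch *dir,
			     struct shared_inode *dir_zi,
			     struct timespec now)
{
	int err;

	/* 1. Kernel changes go into the shared object before dispatch */
	dir->i_ctime = dir->i_mtime = now;
	dir_zi->i_ctime = dir_zi->i_mtime = timespec_to_mt(&now);

	/* 2. Dispatch; here just a direct call into the stand-in Server */
	err = server_add_dentry(dir_zi);
	if (err)
		return err;

	/* 3. Pull whatever the Server updated back into the vfs inode */
	dir->i_size = dir_zi->i_size;
	return 0;
}
```

Because the Kernel writes into the shared object only before the
dispatch, the Server never races with a later Kernel-side update of
the same fields, and the Kernel can treat the object as the Server's
answer once the dispatch has returned.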
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile    |   2 +-
 fs/zuf/_extern.h   |  41 ++++
 fs/zuf/directory.c | 100 ++++++++
 fs/zuf/file.c      |  31 +++
 fs/zuf/inode.c     | 561 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/namei.c     | 402 ++++++++++++++++++++++++++++++++
 fs/zuf/super.c     |   2 +
 fs/zuf/zuf-core.c  |  10 +
 fs/zuf/zuf.h       |  63 +++++
 fs/zuf/zus_api.h   |  94 ++++++++
 10 files changed, 1304 insertions(+), 2 deletions(-)
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/namei.c
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index a5800cad73fd..2bfed45723e3 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,5 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += super.o inode.o
+zuf-y += super.o inode.o directory.o namei.o file.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index b1514e5821a2..50887792bf42 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -62,10 +62,51 @@ int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb);
 struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
 				   struct zus_sb_info *zus_sbi);
 
+/* file.c */
+long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len);
+
+/* namei.c */
+void zuf_zii_sync(struct inode *inode, bool sync_nlink);
+
 /* inode.c */
+int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       int operation, uint flags);
 struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
 		       zu_dpp_t _zi, bool *exist);
+void zuf_evict_inode(struct inode *inode);
+struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
+			    const struct qstr *qstr, const char *symname,
+			    ulong rdev_or_isize, bool tmpfile);
+int zuf_write_inode(struct inode *inode, struct writeback_control *wbc);
+int zuf_update_time(struct inode *inode, struct timespec64 *time, int flags);
+int zuf_setattr(struct dentry *dentry, struct iattr *attr);
+int zuf_getattr(const struct path *path, struct kstat *stat,
+		 u32 request_mask, unsigned int flags);
+void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi);
+
+/* directory.c */
+int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
+int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
+
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
+/*
+ * Inode and files operations
+ */
+
+/* file.c */
+extern const struct inode_operations zuf_file_inode_operations;
+extern const struct file_operations zuf_file_operations;
+
+/* inode.c */
+extern const struct address_space_operations zuf_aops;
+
+/* namei.c */
+extern const struct inode_operations zuf_dir_inode_operations;
+extern const struct inode_operations zuf_special_inode_operations;
+
+/* directory.c */
+extern const struct file_operations zuf_dir_operations;
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
new file mode 100644
index 000000000000..5624e05f96e5
--- /dev/null
+++ b/fs/zuf/directory.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for directories.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/vmalloc.h>
+#include "zuf.h"
+
+static int zuf_readdir(struct file *file, struct dir_context *ctx)
+{
+	return -ENOTSUPP;
+}
+
+/*
+ *FIXME comment to full git diff
+ */
+
+static int _dentry_dispatch(struct inode *dir, struct inode *inode,
+			    struct qstr *str, int operation)
+{
+	struct zufs_ioc_dentry ioc_dentry = {
+		.hdr.operation = operation,
+		.hdr.in_len = sizeof(ioc_dentry),
+		.hdr.out_len = sizeof(ioc_dentry),
+		.zus_ii = inode ? ZUII(inode)->zus_ii : NULL,
+		.zus_dir_ii = ZUII(dir)->zus_ii,
+		.str.len = str->len,
+	};
+	int err;
+
+	memcpy(&ioc_dentry.str.name, str->name, str->len);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(dir->i_sb)), &ioc_dentry.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld] op=%d zufc_dispatch failed => %d\n",
+			    dir->i_ino, operation, err);
+		return err;
+	}
+
+	return 0;
+}
+
+/* Returns 0 on success, a negative error code on failure */
+int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(dir);
+	int err;
+
+	if (!str->len || !zii->zi)
+		return -EINVAL;
+
+	zus_inode_cmtime_now(dir, zii->zi);
+	err = _dentry_dispatch(dir, inode, str, ZUFS_OP_ADD_DENTRY);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld] _dentry_dispatch failed => %d\n",
+			    dir->i_ino, err);
+		return err;
+	}
+	zuf_zii_sync(dir, false);
+
+	return 0;
+}
+
+int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(dir);
+	int err;
+
+	if (!str->len || !zii->zi)
+		return -EINVAL;
+
+	zus_inode_cmtime_now(dir, zii->zi);
+	err = _dentry_dispatch(dir, inode, str, ZUFS_OP_REMOVE_DENTRY);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld] _dentry_dispatch failed => %d\n",
+			    dir->i_ino, err);
+		return err;
+	}
+	zuf_zii_sync(dir, false);
+
+	return 0;
+}
+
+const struct file_operations zuf_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= zuf_readdir,
+	.fsync		= noop_fsync,
+};
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
new file mode 100644
index 000000000000..619dada43666
--- /dev/null
+++ b/fs/zuf/file.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for files.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+	return -ENOTSUPP;
+}
+
+const struct file_operations zuf_file_operations = {
+	.open			= generic_file_open,
+};
+
+const struct inode_operations zuf_file_inode_operations = {
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index a6115289dcda..88cb1937c223 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -13,11 +13,570 @@
  *	Sagi Manole <sagim@netapp.com>"
  */
 
+#include <linux/fs.h>
+#include <linux/aio.h>
+#include <linux/highuid.h>
+#include <linux/module.h>
+#include <linux/mpage.h>
+#include <linux/backing-dev.h>
+#include <linux/types.h>
+#include <linux/ratelimit.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/security.h>
+#include <linux/delay.h>
+
 #include "zuf.h"
 
+/* Flags that should be inherited by new inodes from their parent. */
+#define ZUFS_FL_INHERITED (S_SYNC | S_NOATIME | S_DIRSYNC)
+
+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define ZUFS_FL_REG_MASK (~S_DIRSYNC)
+
+/* Flags that are appropriate for non-dir/non-regular files. */
+#define ZUFS_FL_OTHER_MASK (S_NOATIME)
+
+static bool _zi_valid(struct zus_inode *zi)
+{
+	if (!_zi_active(zi))
+		return false;
+
+	switch (le16_to_cpu(zi->i_mode) & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+	case S_IFSOCK:
+		return true;
+	default:
+		zuf_err("unknown file type ino=%lu mode=0x%x\n", _zi_ino(zi),
+			  le16_to_cpu(zi->i_mode));
+		return false;
+	}
+}
+
+static void _set_inode_from_zi(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_mode = le16_to_cpu(zi->i_mode);
+	inode->i_uid = KUIDT_INIT(le32_to_cpu(zi->i_uid));
+	inode->i_gid = KGIDT_INIT(le32_to_cpu(zi->i_gid));
+	set_nlink(inode, le16_to_cpu(zi->i_nlink));
+	inode->i_size = le64_to_cpu(zi->i_size);
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	mt_to_timespec(&inode->i_atime, &zi->i_atime);
+	mt_to_timespec(&inode->i_ctime, &zi->i_ctime);
+	mt_to_timespec(&inode->i_mtime, &zi->i_mtime);
+	inode->i_generation = le64_to_cpu(zi->i_generation);
+	zuf_set_inode_flags(inode, zi);
+
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		inode->i_op = &zuf_file_inode_operations;
+		inode->i_fop = &zuf_file_operations;
+		break;
+	case S_IFDIR:
+		inode->i_op = &zuf_dir_inode_operations;
+		inode->i_fop = &zuf_dir_operations;
+		break;
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+	case S_IFSOCK:
+		inode->i_size = 0;
+		inode->i_op = &zuf_special_inode_operations;
+		init_special_inode(inode, inode->i_mode,
+				   le32_to_cpu(zi->i_rdev));
+		break;
+	default:
+		zuf_err("unknown file type ino=%lu mode=0x%x\n", _zi_ino(zi),
+			  le16_to_cpu(zi->i_mode));
+		break;
+	}
+
+	inode->i_ino = le64_to_cpu(zi->i_ino);
+}
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static uint _calc_flags(umode_t mode, uint dir_flags, uint flags)
+{
+	uint zufs_flags = dir_flags & ZUFS_FL_INHERITED;
+
+	if (S_ISREG(mode))
+		zufs_flags &= ZUFS_FL_REG_MASK;
+	else if (!S_ISDIR(mode))
+		zufs_flags &= ZUFS_FL_OTHER_MASK;
+
+	return zufs_flags;
+}
+
+static int _set_zi_from_inode(struct inode *dir, struct zus_inode *zi,
+			      struct inode *inode)
+{
+	struct zus_inode *zidir = zus_zi(dir);
+
+	if (unlikely(!zidir))
+		return -EACCES;
+
+	zi->i_mode = cpu_to_le16(inode->i_mode);
+	zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid));
+	zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid));
+	/* NOTE: zus is boss of i_nlink (but let it know what we think) */
+	zi->i_nlink = cpu_to_le16(inode->i_nlink);
+	zi->i_size = cpu_to_le64(inode->i_size);
+	zi->i_blocks = cpu_to_le64(inode->i_blocks);
+	timespec_to_mt(&zi->i_atime, &inode->i_atime);
+	timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_generation = cpu_to_le32(inode->i_generation);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
+		zi->i_rdev = cpu_to_le32(inode->i_rdev);
+
+	zi->i_flags = cpu_to_le16(_calc_flags(inode->i_mode,
+					      le16_to_cpu(zidir->i_flags),
+					      inode->i_flags));
+	return 0;
+}
+
+static bool _times_equal(struct timespec64 *t, __le64 *mt)
+{
+	__le64 time;
+
+	timespec_to_mt(&time, t);
+	return time == *mt;
+}
+
+/* This function checks if VFS's inode and zus_inode are in sync */
+static void _warn_inode_dirty(struct inode *inode, struct zus_inode *zi)
+{
+#define __MISMATCH_INT(inode, X, Y)	\
+	if (X != Y)			\
+		zuf_warn("[%ld] " #X"=0x%lx " #Y"=0x%lx""\n",	\
+			  inode->i_ino, (ulong)(X), (ulong)(Y))
+#define __MISMATCH_TIME(inode, X, Y)	\
+	if (!_times_equal(X, Y)) {	\
+		struct timespec64 t;	\
+		mt_to_timespec(&t, (Y));\
+		zuf_warn("[%ld] " #X"=%lld:%ld " #Y"=%lld:%ld""\n",	\
+			  inode->i_ino, (X)->tv_sec, (X)->tv_nsec,	\
+			  t.tv_sec, t.tv_nsec);		\
+	}
+
+	if (!_times_equal(&inode->i_ctime, &zi->i_ctime) ||
+	    !_times_equal(&inode->i_mtime, &zi->i_mtime) ||
+	    !_times_equal(&inode->i_atime, &zi->i_atime) ||
+	    inode->i_size != le64_to_cpu(zi->i_size) ||
+	    inode->i_mode != le16_to_cpu(zi->i_mode) ||
+	    __kuid_val(inode->i_uid) != le32_to_cpu(zi->i_uid) ||
+	    __kgid_val(inode->i_gid) != le32_to_cpu(zi->i_gid) ||
+	    inode->i_nlink != le16_to_cpu(zi->i_nlink) ||
+	    inode->i_ino != _zi_ino(zi) ||
+	    inode->i_blocks != le64_to_cpu(zi->i_blocks)) {
+		__MISMATCH_TIME(inode, &inode->i_ctime, &zi->i_ctime);
+		__MISMATCH_TIME(inode, &inode->i_mtime, &zi->i_mtime);
+		__MISMATCH_TIME(inode, &inode->i_atime, &zi->i_atime);
+		__MISMATCH_INT(inode, inode->i_size, le64_to_cpu(zi->i_size));
+		__MISMATCH_INT(inode, inode->i_mode, le16_to_cpu(zi->i_mode));
+		__MISMATCH_INT(inode, __kuid_val(inode->i_uid),
+			      le32_to_cpu(zi->i_uid));
+		__MISMATCH_INT(inode, __kgid_val(inode->i_gid),
+			      le32_to_cpu(zi->i_gid));
+		__MISMATCH_INT(inode, inode->i_nlink, le16_to_cpu(zi->i_nlink));
+		__MISMATCH_INT(inode, inode->i_ino, _zi_ino(zi));
+		__MISMATCH_INT(inode, inode->i_blocks,
+			      le64_to_cpu(zi->i_blocks));
+	}
+}
+
+static void _zii_connect(struct inode *inode, struct zus_inode *zi,
+			 struct zus_inode_info *zus_ii)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zii->zi = zi;
+	zii->zus_ii = zus_ii;
+}
+
 struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
 		       zu_dpp_t _zi, bool *exist)
 {
-	return ERR_PTR(-ENOTSUPP);
+	struct zus_inode *zi = zuf_dpp_t_addr(sb, _zi);
+	struct inode *inode;
+
+	*exist = false;
+	if (unlikely(!zi)) {
+		/* Don't trust ZUS pointers */
+		zuf_err("Bad zus_inode 0x%llx\n", _zi);
+		return ERR_PTR(-EIO);
+	}
+	if (unlikely(!zus_ii)) {
+		zuf_err("zus_ii NULL\n");
+		return ERR_PTR(-EIO);
+	}
+
+	if (!_zi_valid(zi)) {
+		zuf_err("inactive node ino=%lld links=%d mode=%d\n", zi->i_ino,
+			  zi->i_nlink, zi->i_mode);
+		return ERR_PTR(-ESTALE);
+	}
+
+	zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n",
+		    zi->i_ino, zi->i_size, zi->i_blocks, zi->i_ctime,
+		    zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr);
+
+	inode = iget_locked(sb, _zi_ino(zi));
+	if (unlikely(!inode))
+		return ERR_PTR(-ENOMEM);
+
+	if (!(inode->i_state & I_NEW)) {
+		*exist = true;
+		return inode;
+	}
+
+	_set_inode_from_zi(inode, zi);
+	_zii_connect(inode, zi, zus_ii);
+
+	unlock_new_inode(inode);
+	return inode;
+}
+
+int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       int operation, uint flags)
+{
+	struct zufs_ioc_evict_inode ioc_evict_inode = {
+		.hdr.in_len = sizeof(ioc_evict_inode),
+		.hdr.out_len = sizeof(ioc_evict_inode),
+		.hdr.operation = operation,
+		.zus_ii = zus_ii,
+		.flags = flags,
+	};
+	int err;
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_evict_inode.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR))
+		zuf_err_dispatch(sb, "zufc_dispatch failed op=%s => %d\n",
+				 zuf_op_name(operation), err);
+	return err;
+}
+
+void zuf_evict_inode(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (!inode->i_nlink) {
+		if (unlikely(!zii->zi)) {
+			zuf_dbg_err("[%ld] inode without zi mode=0x%x size=0x%llx\n",
+				    inode->i_ino, inode->i_mode, inode->i_size);
+			goto out;
+		}
+
+		if (unlikely(is_bad_inode(inode)))
+			zuf_dbg_err("[%ld] inode is bad mode=0x%x zi=%p\n",
+				    inode->i_ino, inode->i_mode, zii->zi);
+		else
+			_warn_inode_dirty(inode, zii->zi);
+
+		zuf_w_lock(zii);
+
+		zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_FREE_INODE, 0);
+
+		inode->i_mtime = inode->i_ctime = current_time(inode);
+		inode->i_size = 0;
+
+		zuf_w_unlock(zii);
+	} else {
+		zuf_dbg_vfs("[%ld] inode is going down?\n", inode->i_ino);
+
+		zuf_smw_lock(zii);
+
+		zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_EVICT_INODE, 0);
+
+		zuf_smw_unlock(zii);
+	}
+
+out:
+	zii->zus_ii = NULL;
+	zii->zi = NULL;
+
+	clear_inode(inode);
+}
+
+/* @rdev_or_isize is i_size in the case of a symlink
+ * and rdev in the case of special-files
+ */
+struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
+			    const struct qstr *qstr, const char *symname,
+			    ulong rdev_or_isize, bool tmpfile)
+{
+	struct super_block *sb = dir->i_sb;
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zufs_ioc_new_inode ioc_new_inode = {
+		.hdr.in_len = sizeof(ioc_new_inode),
+		.hdr.out_len = sizeof(ioc_new_inode),
+		.hdr.operation = ZUFS_OP_NEW_INODE,
+		.dir_ii = ZUII(dir)->zus_ii,
+		.flags = tmpfile ? ZI_TMPFILE : 0,
+		.str.len = qstr->len,
+	};
+	struct inode *inode;
+	struct zus_inode *zi = NULL;
+	struct page *pages[2];
+	uint nump = 0;
+	int err;
+
+	memcpy(&ioc_new_inode.str.name, qstr->name, qstr->len);
+
+	inode = new_inode(sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode_init_owner(inode, dir, mode);
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_time(dir);
+	inode->i_atime = inode->i_ctime;
+
+	zuf_dbg_verbose("inode=%p name=%s\n", inode, qstr->name);
+
+	zuf_set_inode_flags(inode, &ioc_new_inode.zi);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
+	    S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
+		init_special_inode(inode, mode, rdev_or_isize);
+	}
+
+	err = _set_zi_from_inode(dir, &ioc_new_inode.zi, inode);
+	if (unlikely(err))
+		goto fail;
+
+	zus_inode_cmtime_now(dir, zus_zi(dir));
+
+	err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_new_inode.hdr, pages, nump);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		goto fail;
+	}
+	zi = zuf_dpp_t_addr(sb, ioc_new_inode._zi);
+
+	_zii_connect(inode, zi, ioc_new_inode.zus_ii);
+
+	/* update inode fields from filesystem inode */
+	inode->i_ino = le64_to_cpu(zi->i_ino);
+	inode->i_size = le64_to_cpu(zi->i_size);
+	inode->i_generation = le64_to_cpu(zi->i_generation);
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	set_nlink(inode, le16_to_cpu(zi->i_nlink));
+	zuf_zii_sync(dir, false);
+
+	zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n",
+		    zi->i_ino, zi->i_size, zi->i_blocks, zi->i_ctime,
+		    zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr);
+
+	zuf_dbg_verbose("allocating inode %ld (zi=%p)\n", _zi_ino(zi), zi);
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld:%s] generation=%lld insert_inode_locked => %d\n",
+			    inode->i_ino, qstr->name, zi->i_generation, err);
+		goto fail;
+	}
+
+	return inode;
+
+fail:
+	clear_nlink(inode);
+	if (zi)
+		zi->i_nlink = 0;
+	make_bad_inode(inode);
+	iput(inode);
+	return ERR_PTR(err);
+}
+
+int zuf_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	/* write_inode should never be called because we always keep our inodes
+	 * clean. So let us know if write_inode ever gets called.
+	 */
+
+	/* d_tmpfile() does a mark_inode_dirty() so only complain on regular
+	 * files. TODO: How? Everything is off for now.
+	 * WARN_ON(inode->i_nlink);
+	 */
+
+	return 0;
+}
+
+/*
+ * Mostly supports file_accessed() for now, which is the only one we use.
+ *
+ * file_update_time() is also used by the fifo code.
+ */
+int zuf_update_time(struct inode *inode, struct timespec64 *time, int flags)
+{
+	struct zus_inode *zi = zus_zi(inode);
+
+	if (flags & S_ATIME) {
+		inode->i_atime = *time;
+		timespec_to_mt(&zi->i_atime, &inode->i_atime);
+		/* FIXME: Set a flag that zi needs flushing
+		 * for now every read needs zi-flushing.
+		 */
+	}
+
+	/* file_update_time() is not used by zuf.
+	 * FIXME: One exception is O_TMPFILE, where the VFS calls
+	 * file_update_time() internally, bypassing the FS. So just do it
+	 * silently. The zus O_TMPFILE create protocol knows it needs flushing.
+	 */
+	if ((flags & S_CTIME) || (flags & S_MTIME)) {
+		if (flags & S_CTIME) {
+			inode->i_ctime = *time;
+			timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+		}
+		if (flags & S_MTIME) {
+			inode->i_mtime = *time;
+			timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+		}
+		zuf_dbg_vfs("called for S_CTIME | S_MTIME 0x%x\n", flags);
+	}
+
+	if (flags & ~(S_CTIME | S_MTIME | S_ATIME))
+		zuf_err("called for 0x%x\n", flags);
+
+	return 0;
+}
+
+int zuf_getattr(const struct path *path, struct kstat *stat, u32 request_mask,
+		unsigned int flags)
+{
+	struct dentry *dentry = path->dentry;
+	struct inode *inode = d_inode(dentry);
+
+	if (inode->i_flags & S_APPEND)
+		stat->attributes |= STATX_ATTR_APPEND;
+	if (inode->i_flags & S_IMMUTABLE)
+		stat->attributes |= STATX_ATTR_IMMUTABLE;
+
+	stat->attributes_mask |= (STATX_ATTR_APPEND |
+				  STATX_ATTR_IMMUTABLE);
+	generic_fillattr(inode, stat);
+	/* stat->blocks should be the number of 512B blocks */
+	stat->blocks = inode->i_blocks << (inode->i_sb->s_blocksize_bits - 9);
+
+	return 0;
+}
+
+int zuf_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_attr ioc_attr = {
+		.hdr.in_len = sizeof(ioc_attr),
+		.hdr.out_len = sizeof(ioc_attr),
+		.hdr.operation = ZUFS_OP_SETATTR,
+		.zus_ii = zii->zus_ii,
+	};
+	int err;
+
+	if (!zi)
+		return -EACCES;
+
+	/* Truncate is implemented via fallocate(punch_hole), which means we
+	 * are not atomic with the other ATTRs. I think someone said that
+	 * some Kernel FSs don't even support truncate coming together with
+	 * other ATTRs.
+	 */
+	if ((attr->ia_valid & ATTR_SIZE)) {
+		ZUF_CHECK_I_W_LOCK(inode);
+		zuf_smw_lock(zii);
+		err = __zuf_fallocate(inode, ZUFS_FL_TRUNCATE, attr->ia_size,
+				      ~0ULL);
+		zuf_smw_unlock(zii);
+		if (unlikely(err))
+			return err;
+		attr->ia_valid &= ~ATTR_SIZE;
+	}
+
+	err = setattr_prepare(dentry, attr);
+	if (unlikely(err))
+		return err;
+
+	if (attr->ia_valid & ATTR_MODE) {
+		zuf_dbg_vfs("[%ld] ATTR_MODE=0x%x\n",
+			     inode->i_ino, attr->ia_mode);
+		ioc_attr.zuf_attr |= STATX_MODE;
+		inode->i_mode = attr->ia_mode;
+		zi->i_mode = cpu_to_le16(inode->i_mode);
+		if (test_opt(SBI(inode->i_sb), POSIXACL)) {
+			err = posix_acl_chmod(inode, inode->i_mode);
+			if (unlikely(err))
+				return err;
+		}
+	}
+
+	if (attr->ia_valid & ATTR_UID) {
+		zuf_dbg_vfs("[%ld] ATTR_UID=0x%x\n",
+			     inode->i_ino, __kuid_val(attr->ia_uid));
+		ioc_attr.zuf_attr |= STATX_UID;
+		inode->i_uid = attr->ia_uid;
+		zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid));
+	}
+	if (attr->ia_valid & ATTR_GID) {
+		zuf_dbg_vfs("[%ld] ATTR_GID=0x%x\n",
+			     inode->i_ino, __kgid_val(attr->ia_gid));
+		ioc_attr.zuf_attr |= STATX_GID;
+		inode->i_gid = attr->ia_gid;
+		zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid));
+	}
+
+	if (attr->ia_valid & ATTR_ATIME) {
+		ioc_attr.zuf_attr |= STATX_ATIME;
+		inode->i_atime = attr->ia_atime;
+		timespec_to_mt(&zi->i_atime, &inode->i_atime);
+		zuf_dbg_vfs("[%ld] ATTR_ATIME=0x%llx\n",
+			     inode->i_ino, zi->i_atime);
+	}
+	if (attr->ia_valid & ATTR_CTIME) {
+		ioc_attr.zuf_attr |= STATX_CTIME;
+		inode->i_ctime = attr->ia_ctime;
+		timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+		zuf_dbg_vfs("[%ld] ATTR_CTIME=0x%llx\n",
+			     inode->i_ino, zi->i_ctime);
+	}
+	if (attr->ia_valid & ATTR_MTIME) {
+		ioc_attr.zuf_attr |= STATX_MTIME;
+		inode->i_mtime = attr->ia_mtime;
+		timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+		zuf_dbg_vfs("[%ld] ATTR_MTIME=0x%llx\n",
+			     inode->i_ino, zi->i_mtime);
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0);
+	if (unlikely(err))
+		zuf_dbg_err("[%ld] set_attr=0x%x failed => %d\n",
+			    inode->i_ino, ioc_attr.zuf_attr, err);
+
+	return err;
+}
+
+void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi)
+{
+	unsigned int flags = le16_to_cpu(zi->i_flags) & ~ZUFS_S_IMMUTABLE;
+
+	inode->i_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+	inode->i_flags |= flags;
+	if (zi->i_flags & ZUFS_S_IMMUTABLE)
+		inode->i_flags |= S_IMMUTABLE | S_NOATIME;
+	if (!zi->i_xattr)
+		inode_has_no_xattr(inode);
 }
 
+const struct address_space_operations zuf_aops = {
+};
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
new file mode 100644
index 000000000000..299134ca7c07
--- /dev/null
+++ b/fs/zuf/namei.c
@@ -0,0 +1,402 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode operations for directories.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+#include <linux/fs.h>
+#include "zuf.h"
+
+
+static struct inode *d_parent(struct dentry *dentry)
+{
+	return dentry->d_parent->d_inode;
+}
+
+static void _set_nlink(struct inode *inode, struct zus_inode *zi)
+{
+	set_nlink(inode, le32_to_cpu(zi->i_nlink));
+}
+
+void zuf_zii_sync(struct inode *inode, bool sync_nlink)
+{
+	struct zus_inode *zi = zus_zi(inode);
+
+	if (inode->i_size != le64_to_cpu(zi->i_size) ||
+	    inode->i_blocks != le64_to_cpu(zi->i_blocks)) {
+		i_size_write(inode, le64_to_cpu(zi->i_size));
+		inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	}
+
+	if (sync_nlink)
+		_set_nlink(inode, zi);
+}
+
+static void _instantiate_unlock(struct dentry *dentry, struct inode *inode)
+{
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+}
+
+static struct dentry *zuf_lookup(struct inode *dir, struct dentry *dentry,
+				 uint flags)
+{
+	struct super_block *sb = dir->i_sb;
+	struct qstr *str = &dentry->d_name;
+	uint in_len = offsetof(struct zufs_ioc_lookup, _zi);
+	struct zufs_ioc_lookup ioc_lu = {
+		.hdr.in_len = in_len,
+		.hdr.out_start = in_len,
+		.hdr.out_len = sizeof(ioc_lu) - in_len,
+		.hdr.operation = ZUFS_OP_LOOKUP,
+		.dir_ii = ZUII(dir)->zus_ii,
+		.str.len = str->len,
+	};
+	struct inode *inode = NULL;
+	bool exist;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s\n", dir->i_ino, dentry->d_name.name);
+
+	if (dentry->d_name.len > ZUFS_NAME_LEN)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	memcpy(&ioc_lu.str.name, str->name, str->len);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_lu.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	inode = zuf_iget(dir->i_sb, ioc_lu.zus_ii, ioc_lu._zi, &exist);
+	if (exist) {
+		zuf_dbg_err("race in lookup\n");
+		zuf_evict_dispatch(sb, ioc_lu.zus_ii, ZUFS_OP_EVICT_INODE,
+				   ZI_LOOKUP_RACE);
+	}
+
+out:
+	return d_splice_alias(inode, dentry);
+}
+
+/*
+ * By the time this is called, we already have created
+ * the directory cache entry for the new file, but it
+ * is so far negative - it has no inode.
+ *
+ * If the create succeeds, we fill in the inode information
+ * with d_instantiate().
+ */
+static int zuf_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+		      bool excl)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s mode=0x%x\n",
+		     dir->i_ino, dentry->d_name.name, mode);
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_file_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+	inode->i_fop = &zuf_file_operations;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		     dev_t rdev)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] mode=0x%x rdev=0x%x\n", dir->i_ino, mode, rdev);
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, rdev, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_special_inode_operations;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, true);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	/* TODO: See about more ephemeral operations on this file, around
+	 * mmap and such.
+	 * Must see about the tmpfile mode that is later linked with linkat()
+	 * (probably the !O_EXCL flag)
+	 */
+	inode->i_op = &zuf_file_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+	inode->i_fop = &zuf_file_operations;
+
+	set_nlink(inode, 1); /* user_mode knows nothing */
+	d_tmpfile(dentry, inode);
+	/* tmpfile operates on nlink=0. Since this is a tmp file we do not care
+	 * about cl_flushing. If this file is later linked to a dir, the
+	 * add_dentry will flush the zi.
+	 */
+	zus_zi(inode)->i_nlink = inode->i_nlink;
+
+	unlock_new_inode(inode);
+	return 0;
+}
+
+static int zuf_link(struct dentry *dest_dentry, struct inode *dir,
+		    struct dentry *dentry)
+{
+	struct inode *inode = dest_dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld dest_d-ino=%ld dest_d-name=%s\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino,
+		     dest_dentry->d_inode->i_ino, dest_dentry->d_name.name);
+
+	if (inode->i_nlink >= ZUFS_LINK_MAX)
+		return -EMLINK;
+
+	ihold(inode);
+
+	zus_inode_cmtime_now(dir, zus_zi(dir));
+	zus_inode_ctime_now(inode, zus_zi(inode));
+
+	err = zuf_add_dentry(dir, &dentry->d_name, inode);
+	if (unlikely(err)) {
+		iput(inode);
+		return err;
+	}
+
+	_set_nlink(inode, zus_zi(inode));
+
+	d_instantiate(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino);
+
+	inode->i_ctime = dir->i_ctime;
+	timespec_to_mt(&zus_zi(inode)->i_ctime, &inode->i_ctime);
+
+	err = zuf_remove_dentry(dir, &dentry->d_name, inode);
+	if (unlikely(err))
+		return err;
+
+	zuf_zii_sync(inode, true);
+	zuf_zii_sync(dir, true);
+
+	return 0;
+}
+
+static int zuf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s dentry-parent=%ld mode=0x%x\n",
+		     dir->i_ino, dentry->d_name.name, d_parent(dentry)->i_ino,
+		     mode);
+
+	if (dir->i_nlink >= ZUFS_LINK_MAX)
+		return -EMLINK;
+
+	inode = zuf_new_inode(dir, S_IFDIR | mode, &dentry->d_name, NULL, 0,
+			      false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_dir_inode_operations;
+	inode->i_fop = &zuf_dir_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	zuf_zii_sync(dir, true);
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static bool _empty_dir(struct inode *dir)
+{
+	if (dir->i_nlink != 2) {
+		zuf_dbg_verbose("[%ld] directory has nlink(%d) != 2\n",
+				dir->i_ino, dir->i_nlink);
+		return false;
+	}
+	/* NOTE: Above is not the only -ENOTEMPTY check. The zus-fs must still
+	 * check the "only-files", no-subdirs case and return -ENOTEMPTY.
+	 */
+	return true;
+}
+
+static int zuf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino);
+
+	if (!inode)
+		return -ENOENT;
+
+	if (!_empty_dir(inode))
+		return -ENOTEMPTY;
+
+	zus_inode_cmtime_now(dir, zus_zi(dir));
+	inode->i_ctime = dir->i_ctime;
+	timespec_to_mt(&zus_zi(inode)->i_ctime, &inode->i_ctime);
+
+	err = zuf_remove_dentry(dir, &dentry->d_name, inode);
+	if (unlikely(err))
+		return err;
+
+	zuf_zii_sync(inode, true);
+	zuf_zii_sync(dir, true);
+
+	return 0;
+}
+
+/* Structure of a directory element */
+struct zuf_dir_element {
+	__le64  ino;
+	char name[254];
+};
+
+static int zuf_rename(struct inode *old_dir, struct dentry *old_dentry,
+		      struct inode *new_dir, struct dentry *new_dentry,
+		      uint flags)
+{
+	struct inode *old_inode = d_inode(old_dentry);
+	struct inode *new_inode = d_inode(new_dentry);
+	struct zuf_sb_info *sbi = SBI(old_inode->i_sb);
+	struct zufs_ioc_rename ioc_rename = {
+		.hdr.in_len = sizeof(ioc_rename),
+		.hdr.out_len = sizeof(ioc_rename),
+		.hdr.operation = ZUFS_OP_RENAME,
+		.old_dir_ii = ZUII(old_dir)->zus_ii,
+		.new_dir_ii = ZUII(new_dir)->zus_ii,
+		.old_zus_ii = ZUII(old_inode)->zus_ii,
+		.new_zus_ii = new_inode ? ZUII(new_inode)->zus_ii : NULL,
+		.old_d_str.len = old_dentry->d_name.len,
+		.new_d_str.len = new_dentry->d_name.len,
+		.flags = flags,
+	};
+	struct timespec64 time = current_time(old_dir);
+	int err;
+
+	zuf_dbg_vfs(
+		"old_inode=%ld new_inode=%ld old_name=%s new_name=%s f=0x%x\n",
+		old_inode->i_ino, new_inode ? new_inode->i_ino : 0,
+		old_dentry->d_name.name, new_dentry->d_name.name, flags);
+
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE /*| RENAME_WHITEOUT*/))
+		return -EINVAL;
+
+	if (flags & RENAME_EXCHANGE) {
+		/* A subdir holds a ref on parent, see if we need to
+		 * exchange refs
+		 */
+		if (unlikely(!new_inode))
+			return -EINVAL;
+
+		if ((S_ISDIR(old_inode->i_mode) != S_ISDIR(new_inode->i_mode))
+		    && (old_dir != new_dir)) {
+			if (S_ISDIR(old_inode->i_mode)) {
+				if (ZUFS_LINK_MAX <= new_dir->i_nlink)
+					return -EMLINK;
+			} else {
+				if (ZUFS_LINK_MAX <= old_dir->i_nlink)
+					return -EMLINK;
+			}
+		}
+	} else if (S_ISDIR(old_inode->i_mode)) {
+		if (new_inode) {
+			if (!_empty_dir(new_inode))
+				return -ENOTEMPTY;
+		} else if (ZUFS_LINK_MAX <= new_dir->i_nlink) {
+			return -EMLINK;
+		}
+	}
+
+	memcpy(&ioc_rename.old_d_str.name, old_dentry->d_name.name,
+		old_dentry->d_name.len);
+	memcpy(&ioc_rename.new_d_str.name, new_dentry->d_name.name,
+		new_dentry->d_name.len);
+	timespec_to_mt(&ioc_rename.time, &time);
+
+	zus_inode_cmtime_now(old_dir, zus_zi(old_dir));
+	if (old_dir != new_dir)
+		zus_inode_cmtime_now(new_dir, zus_zi(new_dir));
+
+	if (new_inode)
+		zus_inode_ctime_now(new_inode, zus_zi(new_inode));
+	else
+		zus_inode_ctime_now(old_inode, zus_zi(old_inode));
+
+	err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_rename.hdr, NULL, 0);
+
+	zuf_zii_sync(old_dir, true);
+	zuf_zii_sync(new_dir, true);
+
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		return err;
+	}
+
+	if (new_inode)
+		_set_nlink(new_inode, zus_zi(new_inode));
+
+	return 0;
+}
+
+const struct inode_operations zuf_dir_inode_operations = {
+	.create		= zuf_create,
+	.lookup		= zuf_lookup,
+	.link		= zuf_link,
+	.unlink		= zuf_unlink,
+	.mkdir		= zuf_mkdir,
+	.rmdir		= zuf_rmdir,
+	.mknod		= zuf_mknod,
+	.tmpfile	= zuf_tmpfile,
+	.rename		= zuf_rename,
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
+
+const struct inode_operations zuf_special_inode_operations = {
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 01927deb5013..abd7e6cb2a4a 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -784,6 +784,8 @@ void zuf_destroy_inodecache(void)
 static struct super_operations zuf_sops = {
 	.alloc_inode	= zuf_alloc_inode,
 	.destroy_inode	= zuf_destroy_inode,
+	.write_inode	= zuf_write_inode,
+	.evict_inode	= zuf_evict_inode,
 	.put_super	= zuf_put_super,
 	.freeze_fs	= zuf_update_s_wtime,
 	.unfreeze_fs	= zuf_update_s_wtime,
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index a417f9463682..48dd7b665064 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -65,6 +65,16 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_BREAK);
 		CASE_ENUM_NAME(ZUFS_OP_STATFS);
 		CASE_ENUM_NAME(ZUFS_OP_SHOW_OPTIONS);
+
+		CASE_ENUM_NAME(ZUFS_OP_NEW_INODE);
+		CASE_ENUM_NAME(ZUFS_OP_FREE_INODE);
+		CASE_ENUM_NAME(ZUFS_OP_EVICT_INODE);
+
+		CASE_ENUM_NAME(ZUFS_OP_LOOKUP);
+		CASE_ENUM_NAME(ZUFS_OP_ADD_DENTRY);
+		CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY);
+		CASE_ENUM_NAME(ZUFS_OP_RENAME);
+		CASE_ENUM_NAME(ZUFS_OP_SETATTR);
 	case ZUFS_OP_MAX_OPT:
 	default:
 		return "UNKNOWN";
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 18cbc376cfa6..2d5327e1d2b1 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -130,6 +130,9 @@ enum {
 struct zuf_inode_info {
 	struct inode		vfs_inode;
 
+	/* Stuff for mmap write */
+	struct rw_semaphore	in_sync;
+
 	/* cookies from Server */
 	struct zus_inode	*zi;
 	struct zus_inode_info	*zus_ii;
@@ -248,6 +251,66 @@ static inline void *zuf_dpp_t_addr(struct super_block *sb, zu_dpp_t v)
 	return md_addr_verify(SBI(sb)->md, zu_dpp_t_val(v));
 }
 
+/* ~~~~ inode locking ~~~~ */
+static inline void zuf_r_lock(struct zuf_inode_info *zii)
+{
+	inode_lock_shared(&zii->vfs_inode);
+}
+static inline void zuf_r_unlock(struct zuf_inode_info *zii)
+{
+	inode_unlock_shared(&zii->vfs_inode);
+}
+
+static inline void zuf_smr_lock(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smr_lock_pagefault(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 2);
+}
+static inline void zuf_smr_unlock(struct zuf_inode_info *zii)
+{
+	up_read(&zii->in_sync);
+}
+
+static inline void zuf_smw_lock(struct zuf_inode_info *zii)
+{
+	down_write(&zii->in_sync);
+}
+static inline void zuf_smw_lock_nested(struct zuf_inode_info *zii)
+{
+	down_write_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smw_unlock(struct zuf_inode_info *zii)
+{
+	up_write(&zii->in_sync);
+}
+
+static inline void zuf_w_lock(struct zuf_inode_info *zii)
+{
+	inode_lock(&zii->vfs_inode);
+	zuf_smw_lock(zii);
+}
+static inline void zuf_w_lock_nested(struct zuf_inode_info *zii)
+{
+	inode_lock_nested(&zii->vfs_inode, 2);
+	zuf_smw_lock_nested(zii);
+}
+static inline void zuf_w_unlock(struct zuf_inode_info *zii)
+{
+	zuf_smw_unlock(zii);
+	inode_unlock(&zii->vfs_inode);
+}
+
+static inline void ZUF_CHECK_I_W_LOCK(struct inode *inode)
+{
+#ifdef CONFIG_ZUF_DEBUG
+	if (WARN_ON(down_write_trylock(&inode->i_rwsem)))
+		up_write(&inode->i_rwsem);
+#endif
+}
+
 enum big_alloc_type { ba_stack, ba_8k, ba_vmalloc };
 #define S_8K (1024UL * 8)
 
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 1af3bd016453..9b9e97fe844e 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -446,6 +446,17 @@ enum e_zufs_operation {
 	ZUFS_OP_STATFS		= 2,
 	ZUFS_OP_SHOW_OPTIONS	= 3,
 
+	ZUFS_OP_NEW_INODE	= 4,
+	ZUFS_OP_FREE_INODE	= 5,
+	ZUFS_OP_EVICT_INODE	= 6,
+
+	ZUFS_OP_LOOKUP		= 7,
+	ZUFS_OP_ADD_DENTRY	= 8,
+	ZUFS_OP_REMOVE_DENTRY	= 9,
+	ZUFS_OP_RENAME		= 10,
+
+	ZUFS_OP_SETATTR		= 19,
+
 	ZUFS_OP_MAX_OPT,
 };
 
@@ -470,4 +481,87 @@ struct zufs_ioc_statfs {
 	struct statfs64 statfs_out;
 };
 
+/* Flags for zufs_ioc_new_inode and zufs_ioc_evict_inode: */
+enum zi_flags {
+	ZI_TMPFILE = 1,		/* for new_inode */
+	ZI_LOOKUP_RACE = 1,	/* for evict */
+};
+
+struct zufs_str {
+	__u8 len;
+	char name[ZUFS_NAME_LEN];
+};
+
+/* ZUFS_OP_NEW_INODE */
+struct zufs_ioc_new_inode {
+	struct zufs_ioc_hdr hdr;
+	 /* IN */
+	struct zus_inode zi;
+	struct zus_inode_info *dir_ii; /* If mktmp this is the root */
+	struct zufs_str str;
+	__u64 flags;
+
+	 /* OUT */
+	zu_dpp_t _zi;
+	struct zus_inode_info *zus_ii;
+};
+
+/* ZUFS_OP_FREE_INODE, ZUFS_OP_EVICT_INODE */
+struct zufs_ioc_evict_inode {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 flags;
+};
+
+/* ZUFS_OP_LOOKUP */
+struct zufs_ioc_lookup {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *dir_ii;
+	struct zufs_str str;
+
+	 /* OUT */
+	zu_dpp_t _zi;
+	struct zus_inode_info *zus_ii;
+};
+
+/* ZUFS_OP_ADD_DENTRY, ZUFS_OP_REMOVE_DENTRY */
+struct zufs_ioc_dentry {
+	struct zufs_ioc_hdr hdr;
+	struct zus_inode_info *zus_ii; /* IN */
+	struct zus_inode_info *zus_dir_ii; /* IN */
+	struct zufs_str str; /* IN */
+	__u64 ino; /* OUT - only for lookup */
+};
+
+/* ZUFS_OP_RENAME */
+struct zufs_ioc_rename {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *old_dir_ii;
+	struct zus_inode_info *new_dir_ii;
+	struct zus_inode_info *old_zus_ii;
+	struct zus_inode_info *new_zus_ii;
+	struct zufs_str old_d_str;
+	struct zufs_str new_d_str;
+	__u64 time;
+	__u64 flags;
+};
+
+/* ZUFS_OP_SETATTR */
+struct zufs_ioc_attr {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u32 zuf_attr;
+	__u32 pad;
+};
+
+/* Special flag for ZUFS_OP_FALLOCATE to specify a setattr(SIZE),
+ * i.e. same as punch hole but set i_size to be @filepos. In this
+ * case @last_pos == ~0ULL
+ */
+#define ZUFS_FL_TRUNCATE 0x80000000
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.21.0
* [PATCH 09/16] zuf: readdir operation
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (7 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 08/16] zuf: Namei and directory operations Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 10/16] zuf: symlink Boaz Harrosh
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
Implement file_operations->iterate_shared using information
returned from the Server.
Establish the protocol with the Server for readdir.
The Server fills a zuf-allocated buffer (up to 4M at a time)
with zufs-encoded dir entries. zuf then calls the proper emit
vector for each entry to fill the caller's buffer.
The buffer is not passed to the Server as part of the
zufs_ioc_readdir struct; instead it is mapped directly into
Server space via the zt_map_pages facility.
[v2]
  Fix the gcc warning:
    directory.c:86:1: warning: the frame size of 8576 bytes is
		  larger than 8192 bytes
  Fix it by allocating the pages array, which used to be on the
  stack, as part of the allocation we already do for the readdir buffer.
  Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
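A note on buffer sizing, as a rough user-space sketch rather than the
kernel code itself: the snippet below mirrors how zuf_readdir() in this
patch picks the per-dispatch buffer length and sizes the single
allocation that holds both the readdir buffer and the pages array.
SKETCH_PAGE_SIZE, SKETCH_MAP_MAX and the sketch_*() helpers are
hypothetical names. The 4 KiB page size and the 4M cap (standing in for
ZUS_API_MAP_MAX_SIZE) are assumptions based on the commit message, and
the rounding assumes md_o2p_up() rounds a byte count up to whole pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_MAP_MAX   (4ULL << 20)	/* the "up to 4M" cap */

/* Bytes of readdir buffer to request for one dispatch; 0 means "done". */
static uint64_t sketch_readdir_len(uint64_t i_size, uint64_t pos)
{
	uint64_t left;

	if (pos && i_size <= pos)
		return 0;			/* nothing left to read */
	if (!i_size)
		i_size = SKETCH_PAGE_SIZE;	/* just for "." and ".." */

	left = i_size - pos;
	if (left < SKETCH_PAGE_SIZE)
		return SKETCH_PAGE_SIZE;	/* always ask for a full page */
	return left < SKETCH_MAP_MAX ? left : SKETCH_MAP_MAX;
}

/* Number of pages covering len bytes, rounded up. */
static uint64_t sketch_nump(uint64_t len)
{
	return (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* One allocation: page-aligned buffer first, pages array at the end. */
static uint64_t sketch_alloc_bytes(uint64_t nump)
{
	assert(nump > 0);
	return nump * SKETCH_PAGE_SIZE + nump * sizeof(void *);
}
```

Under these assumptions a large directory hits the cap and needs 1024
pages, so the pages array alone is 1024 pointers (8192 bytes on a
64-bit build). That is presumably why keeping it on the stack tripped
the frame-size warning quoted in [v2].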
 fs/zuf/directory.c | 69 +++++++++++++++++++++++++++++++++++-
 fs/zuf/zuf-core.c  |  2 ++
 fs/zuf/zus_api.h   | 88 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 158 insertions(+), 1 deletion(-)
diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
index 5624e05f96e5..7417aeb77773 100644
--- a/fs/zuf/directory.c
+++ b/fs/zuf/directory.c
@@ -19,7 +19,74 @@
 
 static int zuf_readdir(struct file *file, struct dir_context *ctx)
 {
-	return -ENOTSUPP;
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	loff_t i_size = i_size_read(inode);
+	struct zufs_ioc_readdir ioc_readdir = {
+		.hdr.in_len = sizeof(ioc_readdir),
+		.hdr.out_len = sizeof(ioc_readdir),
+		.hdr.operation = ZUFS_OP_READDIR,
+		.dir_ii = ZUII(inode)->zus_ii,
+	};
+	struct zufs_readdir_iter rdi;
+	struct page **pages;
+	struct zufs_dir_entry *zde;
+	void *addr, *__a;
+	uint nump, i;
+	int err;
+
+	if (ctx->pos && i_size <= ctx->pos)
+		return 0;
+	if (!i_size)
+		i_size = PAGE_SIZE; /* Just for the . && .. */
+	if (i_size - ctx->pos < PAGE_SIZE)
+		ioc_readdir.hdr.len = PAGE_SIZE;
+	else
+		ioc_readdir.hdr.len = min_t(loff_t, i_size - ctx->pos,
+					    ZUS_API_MAP_MAX_SIZE);
+	nump = md_o2p_up(ioc_readdir.hdr.len);
+	/* Allocating both readdir buffer and the pages-array.
+	 * Pages array is at end
+	 */
+	addr = vzalloc(md_p2o(nump) + nump * sizeof(*pages));
+	if (unlikely(!addr))
+		return -ENOMEM;
+
+	WARN_ON((ulong)addr & (PAGE_SIZE - 1));
+
+	pages = addr + md_p2o(nump);
+	__a = addr;
+	for (i = 0; i < nump; ++i) {
+		pages[i] = vmalloc_to_page(__a);
+		__a += PAGE_SIZE;
+	}
+
+more:
+	ioc_readdir.pos = ctx->pos;
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_readdir.hdr, pages, nump);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err_dispatch(sb, "zufc_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	zufs_readdir_iter_init(&rdi, &ioc_readdir, addr);
+	while ((zde = zufs_next_zde(&rdi)) != NULL) {
+		zuf_dbg_verbose("%s pos=0x%lx\n",
+				zde->zstr.name, (ulong)zde->pos);
+		ctx->pos = zde->pos;
+		if (!dir_emit(ctx, zde->zstr.name, zde->zstr.len, zde->ino,
+			      zde->type))
+			goto out;
+	}
+	ctx->pos = ioc_readdir.pos;
+	if (ioc_readdir.more) {
+		zuf_dbg_err("more\n");
+		goto more;
+	}
+out:
+	vfree(addr);
+	return err;
 }
 
 /*
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 48dd7b665064..c0049c1d5ba3 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -74,6 +74,8 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_ADD_DENTRY);
 		CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY);
 		CASE_ENUM_NAME(ZUFS_OP_RENAME);
+		CASE_ENUM_NAME(ZUFS_OP_READDIR);
+
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR);
 	case ZUFS_OP_MAX_OPT:
 	default:
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 9b9e97fe844e..2bdf047282e8 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -454,6 +454,7 @@ enum e_zufs_operation {
 	ZUFS_OP_ADD_DENTRY	= 8,
 	ZUFS_OP_REMOVE_DENTRY	= 9,
 	ZUFS_OP_RENAME		= 10,
+	ZUFS_OP_READDIR		= 11,
 
 	ZUFS_OP_SETATTR		= 19,
 
@@ -549,6 +550,93 @@ struct zufs_ioc_rename {
 	__u64 flags;
 };
 
+/* ZUFS_OP_READDIR */
+struct zufs_ioc_readdir {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *dir_ii;
+	__u64 pos;
+
+	/* OUT */
+	__u8	more;
+};
+
+struct zufs_dir_entry {
+	__le64 ino;
+	struct {
+		unsigned	type	: 8;
+		ulong		pos	: 56;
+	};
+	struct zufs_str zstr;
+};
+
+struct zufs_readdir_iter {
+	void *__zde, *last;
+	struct zufs_ioc_readdir *ioc_readdir;
+};
+
+enum {E_ZDE_HDR_SIZE =
+	offsetof(struct zufs_dir_entry, zstr) + offsetof(struct zufs_str, name),
+};
+
+#ifndef __cplusplus
+static inline void zufs_readdir_iter_init(struct zufs_readdir_iter *rdi,
+					  struct zufs_ioc_readdir *ioc_readdir,
+					  void *app_ptr)
+{
+	rdi->__zde = app_ptr;
+	rdi->last = app_ptr + ioc_readdir->hdr.len;
+	rdi->ioc_readdir = ioc_readdir;
+	ioc_readdir->more = false;
+}
+
+static inline uint zufs_dir_entry_len(__u8 name_len)
+{
+	return ALIGN(E_ZDE_HDR_SIZE + name_len, sizeof(__u64));
+}
+
+static inline
+struct zufs_dir_entry *zufs_next_zde(struct zufs_readdir_iter *rdi)
+{
+	struct zufs_dir_entry *zde = rdi->__zde;
+	uint len;
+
+	if (rdi->last <= rdi->__zde + E_ZDE_HDR_SIZE)
+		return NULL;
+	if (zde->zstr.len == 0)
+		return NULL;
+	len = zufs_dir_entry_len(zde->zstr.len);
+	if (rdi->last <= rdi->__zde + len)
+		return NULL;
+
+	rdi->__zde += len;
+	return zde;
+}
+
+static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino,
+				 __u8 type, __u64 pos, const char *name,
+				 __u8 len)
+{
+	struct zufs_dir_entry *zde = rdi->__zde;
+
+	if (rdi->last <= rdi->__zde + zufs_dir_entry_len(len)) {
+		rdi->ioc_readdir->more = true;
+		return false;
+	}
+
+	rdi->ioc_readdir->more = 0;
+	zde->ino = ino;
+	zde->type = type;
+	/*ASSERT(0 == (pos && (1 << 56 - 1)));*/
+	zde->pos = pos;
+	strncpy(zde->zstr.name, name, len);
+	zde->zstr.len = len;
+	zufs_next_zde(rdi);
+
+	return true;
+}
+#endif /* ndef __cplusplus */
+
 /* ZUFS_OP_SETATTR */
 struct zufs_ioc_attr {
 	struct zufs_ioc_hdr hdr;
-- 
2.21.0
^ permalink raw reply related	[flat|nested] 32+ messages in thread
* [PATCH 10/16] zuf: symlink
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (8 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 09/16] zuf: readdir operation Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 11/16] zuf: Write/Read implementation Boaz Harrosh
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
Symlink support is hidden entirely within the creation/open
of the inode.
As part of ZUFS_OP_NEW_INODE we also send the requested
content of the symlink for storage.
On open of an existing symlink, the link information
is returned within the zus_inode structure via a zufs_dpp_t
pointer. (See the Documentation about zufs_dpp_t pointers.)
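
As a rough illustration of serving a link either from inside the inode or through a pointer, here is a small user-space sketch. The struct, the 32-byte inline area and the demo_* helpers are hypothetical; a plain C pointer stands in for the zufs_dpp_t reference, whose real encoding is not modeled here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_INLINE_MAX = 32 };	/* hypothetical size of the inline area */

struct demo_inode {
	uint64_t size;			/* length of the link target */
	union {
		char inline_link[DEMO_INLINE_MAX];	/* short targets */
		const char *far_link;	/* stand-in for a dpp-style pointer */
	} u;
};

/* Store a link target: short ones inline, long ones by reference. */
static void demo_set_link(struct demo_inode *di, const char *target)
{
	size_t len = strlen(target);

	assert(len > 0);
	di->size = len;
	if (len < sizeof(di->u.inline_link))
		memcpy(di->u.inline_link, target, len + 1);	/* with NUL */
	else
		di->u.far_link = target;	/* caller keeps the storage alive */
}

/* Return the link target without copying it. */
static const char *demo_get_link(const struct demo_inode *di)
{
	if (di->size < sizeof(di->u.inline_link))
		return di->u.inline_link;
	return di->u.far_link;
}
```

The size field alone decides which union member is valid, so no extra flag is needed in this sketch.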
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile  |  2 +-
 fs/zuf/_extern.h |  7 +++++
 fs/zuf/inode.c   |  7 +++++
 fs/zuf/namei.c   | 27 ++++++++++++++++++
 fs/zuf/symlink.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 115 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/symlink.c
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 2bfed45723e3..04c31b7bb9ff 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,5 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += super.o inode.o directory.o namei.o file.o
+zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 50887792bf42..95413f65c47f 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -88,6 +88,10 @@ void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi);
 int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
 int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
 
+/* symlink.c */
+uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
+			const char *symname, ulong len, struct page *pages[2]);
+
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
@@ -109,4 +113,7 @@ extern const struct inode_operations zuf_special_inode_operations;
 /* dir.c */
 extern const struct file_operations zuf_dir_operations;
 
+/* symlink.c */
+extern const struct inode_operations zuf_symlink_inode_operations;
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 88cb1937c223..bf3f8b27f918 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -83,6 +83,9 @@ static void _set_inode_from_zi(struct inode *inode, struct zus_inode *zi)
 		inode->i_op = &zuf_dir_inode_operations;
 		inode->i_fop = &zuf_dir_operations;
 		break;
+	case S_IFLNK:
+		inode->i_op = &zuf_symlink_inode_operations;
+		break;
 	case S_IFBLK:
 	case S_IFCHR:
 	case S_IFIFO:
@@ -348,6 +351,10 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
 	    S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
 		init_special_inode(inode, mode, rdev_or_isize);
+	} else if (symname) {
+		inode->i_size = rdev_or_isize;
+		nump = zuf_prepare_symname(&ioc_new_inode, symname,
+					   rdev_or_isize, pages);
 	}
 
 	err = _set_zi_from_inode(dir, &ioc_new_inode.zi, inode);
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
index 299134ca7c07..e78aa04f10d5 100644
--- a/fs/zuf/namei.c
+++ b/fs/zuf/namei.c
@@ -164,6 +164,32 @@ static int zuf_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	return 0;
 }
 
+static int zuf_symlink(struct inode *dir, struct dentry *dentry,
+		       const char *symname)
+{
+	struct inode *inode;
+	ulong len;
+
+	zuf_dbg_vfs("[%ld] de->name=%s symname=%s\n",
+			dir->i_ino, dentry->d_name.name, symname);
+
+	len = strlen(symname);
+	if (len + 1 > ZUFS_MAX_SYMLINK)
+		return -ENAMETOOLONG;
+
+	inode = zuf_new_inode(dir, S_IFLNK|S_IRWXUGO, &dentry->d_name,
+			       symname, len, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_symlink_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
 static int zuf_link(struct dentry *dest_dentry, struct inode *dir,
 		    struct dentry *dentry)
 {
@@ -385,6 +411,7 @@ const struct inode_operations zuf_dir_inode_operations = {
 	.lookup		= zuf_lookup,
 	.link		= zuf_link,
 	.unlink		= zuf_unlink,
+	.symlink	= zuf_symlink,
 	.mkdir		= zuf_mkdir,
 	.rmdir		= zuf_rmdir,
 	.mknod		= zuf_mknod,
diff --git a/fs/zuf/symlink.c b/fs/zuf/symlink.c
new file mode 100644
index 000000000000..1446bdf60cb9
--- /dev/null
+++ b/fs/zuf/symlink.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Symlink operations
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+/* Can never fail; all checks were already made by the caller.
+ * Returns: The number of pages stored in @pages
+ */
+uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
+			 const char *symname, ulong len,
+			 struct page *pages[2])
+{
+	uint nump;
+
+	ioc_new_inode->zi.i_size = cpu_to_le64(len);
+	if (len < sizeof(ioc_new_inode->zi.i_symlink)) {
+		memcpy(&ioc_new_inode->zi.i_symlink, symname, len);
+		return 0;
+	}
+
+	pages[0] = virt_to_page(symname);
+	nump = 1;
+
+	ioc_new_inode->hdr.len = len;
+	ioc_new_inode->hdr.offset = (ulong)symname & (PAGE_SIZE - 1);
+
+	if (PAGE_SIZE < ioc_new_inode->hdr.offset + len) {
+		pages[1] = virt_to_page(symname + PAGE_SIZE);
+		++nump;
+	}
+
+	return nump;
+}
+
+/*
+ * In case of short symlink, we serve it directly from zi; otherwise, read
+ * symlink value directly from pmem using dpp mapping.
+ */
+static const char *zuf_get_link(struct dentry *dentry, struct inode *inode,
+				struct delayed_call *notused)
+{
+	const char *link;
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (inode->i_size < sizeof(zii->zi->i_symlink))
+		return zii->zi->i_symlink;
+
+	link = zuf_dpp_t_addr(inode->i_sb, le64_to_cpu(zii->zi->i_sym_dpp));
+	if (!link) {
+		zuf_err("bad symlink: i_sym_dpp=0x%llx\n", zii->zi->i_sym_dpp);
+		return ERR_PTR(-EIO);
+	}
+	return link;
+}
+
+const struct inode_operations zuf_symlink_inode_operations = {
+	.get_link	= zuf_get_link,
+	.update_time	= zuf_update_time,
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+};
-- 
2.21.0
^ permalink raw reply related	[flat|nested] 32+ messages in thread
* [PATCH 11/16] zuf: Write/Read implementation
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (9 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 10/16] zuf: symlink Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
       [not found]   ` <db90d73233484d251755c5a0cb7ee570b3fc9d19.camel@netapp.com>
  2019-09-26  2:07 ` [PATCH 12/16] zuf: mmap & sync Boaz Harrosh
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
zufs has two ways to do IO.
1. The elegant way:
   By mapping application buffers into the Server VM. This is much
   simpler for a zusFS to implement, but it is slow and does not
   scale well.
2. The fast way: (called NIO)
   The Server returns physical block information, and the pmem_memcpy
   is done in the Kernel.
   This way is more complicated. Each block needs a GET_MULTY
   (ZUFS_OP_GET_MULTY) but also a PUT_MULTY to indicate that the
   Kernel has finished the copy and the pmem block may be recycled.
   But going to the Server and back twice for each IOP would
   kill our performance. So what we do is the pigy_put mechanism
   (See zuf-core.c). pigy_put delays the put operation so that
   when a new operation goes to the Server it carries along
   all accumulated put operations. So in one go I might fetch
   new block info as well as PUT the previous IO. Don't worry, all
   this is done zuf-core style without any locks or atomics.
   There are times when the Server may request an immediate PUT and/or
   keep the ZT-channel locked to guarantee forward progress.
It is up to the zusFS to decide which mode it wants to operate in,
[1] or [2] above. More flags govern other aspects of the requested IO.
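
The piggybacked put described above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: the demo_channel struct, its fixed-size queue and the round-trip counter are invented for illustration and say nothing about how zuf-core.c actually stores or transmits the delayed puts.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_MAX_PENDING = 8 };	/* hypothetical queue depth */

/* Hypothetical per-channel state: block numbers whose PUT is still owed. */
struct demo_channel {
	unsigned long pending[DEMO_MAX_PENDING];
	size_t n_pending;
	unsigned trips;			/* round trips to the pretend server */
	unsigned long puts_seen;	/* PUTs the server has received */
};

/* One round trip: carries the new request plus all queued PUTs. */
static void demo_dispatch(struct demo_channel *ch)
{
	ch->trips++;
	ch->puts_seen += ch->n_pending;
	ch->n_pending = 0;
}

/* Queue a PUT; go to the server right away only if asked or if full. */
static void demo_put(struct demo_channel *ch, unsigned long bn, int put_now)
{
	assert(ch->n_pending < DEMO_MAX_PENDING);
	ch->pending[ch->n_pending++] = bn;
	if (put_now || ch->n_pending == DEMO_MAX_PENDING)
		demo_dispatch(ch);
}

/* GET for the next IO; the queued PUTs ride along on the same trip. */
static void demo_get(struct demo_channel *ch)
{
	demo_dispatch(ch);
}

/* Run n_ios GET/copy/PUT cycles and report how many trips were needed. */
static unsigned demo_trips_for_ios(unsigned n_ios)
{
	struct demo_channel ch = { 0 };
	unsigned i;

	for (i = 0; i < n_ios; i++) {
		demo_get(&ch);
		/* ... the data copy would happen here ... */
		demo_put(&ch, 100 + i, 0);
	}
	return ch.trips;
}
```

In this model each IO costs one round trip instead of two, and the last PUT stays queued until some later dispatch or an explicit put_now request flushes it.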
The dispatch to the Server can operate on buffers up to
ZUS_API_MAP_MAX_SIZE (4M). Bigger operations are split
up and dispatched in chunks of this size.
Also, if a multi-segment aio is used, each segment is dispatched
on its own.
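
A minimal sketch of such splitting follows. It assumes 4K pages and a 1024-page limit (which together give the 4M above), and it additionally shortens a chunk that starts mid-page so the chunk never spans more than the page limit. The DEMO_* constants and helper names are assumptions for this example, not identifiers from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE	4096UL
#define DEMO_MAX_PAGES	1024UL	/* 4M / 4K, matching the 4M limit above */

/* Largest byte count that starts at @offset and still touches at most
 * DEMO_MAX_PAGES pages.
 */
static unsigned long demo_max_chunk(unsigned long offset)
{
	return DEMO_MAX_PAGES * DEMO_PAGE_SIZE -
	       (offset & (DEMO_PAGE_SIZE - 1));
}

/* Number of dispatches needed for @len bytes starting at @offset. */
static unsigned demo_count_chunks(unsigned long offset, unsigned long len)
{
	unsigned n = 0;

	while (len) {
		unsigned long chunk = demo_max_chunk(offset);

		if (chunk > len)
			chunk = len;
		offset += chunk;
		len -= chunk;
		n++;
	}
	return n;
}
```

For example, an aligned 4M request fits in one dispatch, while the same length starting one byte into a page needs two.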
rw.c here also includes some operations for mmap. They will be used
in the next patch.
The fallocate operation with its various mode flags is also dispatched
through the rw.c IO API because it might need to do some t1/t2 IO as
part of the operation, be it for COW of cloned inodes or read/write
of the unaligned edges. zufs also implements truncate via a private
fallocate flag.
There is also code for comparing two buffers for the implementation
of the dedup operation.
Also in this patch is the facility to SWAP on a zufs system.
There is also an IOCTL facility to execute IO (ZU_IOC_IOMAP_EXEC)
from Server background threads. We use this at NetApp for
tiering down cold blocks to slower storage.
Both ZU_IOC_IOMAP_EXEC and the IO dispatch operate on a facility
we call zufs_iomap, which is a varlen buffer that may request and
encode many types of operations and block/memory targets for IO.
It is an IO executor of sorts. A zusFS encodes such an iomap
to tell the Kernel what needs to be done.
[v2]
  zuf: Range of _IO_gm_inner must fit API (PXS-5151)
  Zuf must never request pages which may fall out of range of
  ZUS_API_MAP_MAX_PAGES. When an IO request is not page-aligned, limit
  its size based on the start offset.
[v3]
  zufc: bad bugs in zufc_goose_all_zts
 * The BAD bug was that we called the internal smp_call_function
   instead of the proper on_each_cpu.
   This was bad because smp_call_function calls all other CPUs
   but not ours. Anyway the proper public API for this is on_each_cpu.
 * Another bug is that zufc_goose_all_zts must always be called
   with an inode. This is because we assume that we are holding
   the inode_w_lock and no more puts can come in parallel to the goose_all.
 * In clone the goose target is the destination file, which is going
   to be truncated. (See above, we must have a locked inode at hand)
 * Call zufc_goose_all_zts under the inode_w_lock in evict.
 * One more change is to *not* rely on the Server to turn off the
   ZUFS_H_HAS_PIGY_PUT flag. We will use this later to fix another
   theoretical race window with pigy_put.
   (In fact there is a zus patch to stop resetting that bit)
[v4]
  Remove the swap activate code. It will come in later Kernels.
  This is because to do it properly we should send a small patch
  to the Kernel so as not to force the FS to use the page_cache. The code
  had a hack to bypass this bug, but I would rather remove the code
  instead.
[v5]
  Fix the warning of type:
    warning: the frame size of 8712 bytes is larger than 8192 bytes
  We allocate the maximum stack space that the Kernel
  configuration allows without a warning. If the needed space fits in the
  stack it is used. If not, we allocate an 8K buffer from a new dedicated
  kmem_cache to store our block-numbers. 8K is the maximum allowed
  by the zufs API, which is 1024 data blocks.
  The above logic is hidden under the big_alloc facility that was already
  used in other places.
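
The stack-first, heap-fallback idea can be sketched in user space as below. This is an assumption-laden model: demo_big_alloc and demo_big_free are hypothetical stand-ins that use malloc where the kernel code would use its dedicated cache, and they are not the signatures of the real big_alloc facility.

```c
#include <assert.h>
#include <stdlib.h>

enum demo_alloc_type { DEMO_ON_STACK, DEMO_ON_HEAP };

/* Use the caller's stack buffer when the request fits, else the heap.
 * The allocation type is recorded so the matching free knows what to do.
 */
static void *demo_big_alloc(size_t bytes, size_t stack_bytes, void *on_stack,
			    enum demo_alloc_type *type)
{
	assert(type != NULL);
	if (bytes <= stack_bytes) {
		*type = DEMO_ON_STACK;
		return on_stack;
	}
	*type = DEMO_ON_HEAP;
	return malloc(bytes);	/* may return NULL; callers must check */
}

/* Release only what came from the heap; stack buffers need no action. */
static void demo_big_free(void *ptr, enum demo_alloc_type type)
{
	if (type == DEMO_ON_HEAP)
		free(ptr);
}
```

A caller declares a modest on-stack array, passes it together with its size, and always pairs the allocation with demo_big_free using the recorded type.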
Signed-off-by: Sagi Manole <sagim@netapp.com>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   1 +
 fs/zuf/_extern.h  |  22 ++
 fs/zuf/file.c     |  73 ++++
 fs/zuf/inode.c    |  13 +
 fs/zuf/rw.c       | 959 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c | 400 ++++++++++++++++++-
 fs/zuf/zuf.h      |   7 +
 fs/zuf/zus_api.h  | 251 ++++++++++++
 8 files changed, 1724 insertions(+), 2 deletions(-)
 create mode 100644 fs/zuf/rw.c
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 04c31b7bb9ff..23bc3791a001 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,6 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
+zuf-y += rw.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 95413f65c47f..745d0cc9e719 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -43,6 +43,9 @@ int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
 	zuf_dispatch_init(&zdo, hdr, pages, nump);
 	return __zufc_dispatch(zri, &zdo);
 }
+int zufc_pigy_put(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo,
+		  struct zufs_ioc_IO *io, uint iom_n, ulong *bns, bool do_now);
+void zufc_goose_all_zts(struct zuf_root_info *zri, struct inode *inode);
 
 /* zuf-root.c */
 int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
@@ -92,6 +95,25 @@ int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
 uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
 			const char *symname, ulong len, struct page *pages[2]);
 
+/* rw.c */
+int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode,
+		     struct page *page, u64 filepos);
+ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
+			 struct kiocb *kiocb, struct iov_iter *ii);
+ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
+			  struct kiocb *kiocb, struct iov_iter *ii);
+int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode,
+		       loff_t pos, ulong len, struct _io_gb_multy *io_gb);
+void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode,
+			struct _io_gb_multy *io_gb);
+int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t offset, loff_t len);
+int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
+			 __u64 *iom_e, uint iom_n);
+int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
+			 __u64 *iom_e_user, uint iom_n);
+int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in,
+			      struct inode *i_out, loff_t pos_out, loff_t len);
+
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 619dada43666..8711b44371e0 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -13,6 +13,9 @@
  *	Sagi Manole <sagim@netapp.com>"
  */
 
+#include <linux/fs.h>
+#include <linux/uio.h>
+
 #include "zuf.h"
 
 long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
@@ -20,8 +23,78 @@ long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
 	return -ENOTSUPP;
 }
 
+static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+	struct zuf_inode_info *zii = ZUII(inode);
+	ssize_t ret;
+
+	zuf_dbg_rw("[%ld] ppos=0x%llx len=0x%zx\n",
+		     inode->i_ino, kiocb->ki_pos, iov_iter_count(ii));
+
+	file_accessed(kiocb->ki_filp);
+
+	zuf_r_lock(zii);
+
+	ret = zuf_rw_read_iter(inode->i_sb, inode, kiocb, ii);
+
+	zuf_r_unlock(zii);
+
+	zuf_dbg_rw("[%ld] => 0x%lx\n", inode->i_ino, ret);
+	return ret;
+}
+
+static ssize_t zuf_write_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+	struct zuf_inode_info *zii = ZUII(inode);
+	ssize_t ret;
+	loff_t end_offset;
+
+	ret = generic_write_checks(kiocb, ii);
+	if (unlikely(ret < 0)) {
+		zuf_dbg_vfs("[%ld] generic_write_checks => 0x%lx\n",
+			    inode->i_ino, ret);
+		return ret;
+	}
+
+	zuf_r_lock(zii);
+
+	ret = file_remove_privs(kiocb->ki_filp);
+	if (unlikely(ret < 0))
+		goto out;
+
+	end_offset = kiocb->ki_pos + iov_iter_count(ii);
+	if (inode->i_size < end_offset) {
+		spin_lock(&inode->i_lock);
+		if (inode->i_size < end_offset) {
+			zii->zi->i_size = cpu_to_le64(end_offset);
+			i_size_write(inode, end_offset);
+		}
+		spin_unlock(&inode->i_lock);
+	}
+
+	zus_inode_cmtime_now(inode, zii->zi);
+
+	ret = zuf_rw_write_iter(inode->i_sb, inode, kiocb, ii);
+	if (unlikely(ret < 0)) {
+		/* TODO(sagi): do we want to truncate i_size? */
+		goto out;
+	}
+
+	inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+
+out:
+	zuf_r_unlock(zii);
+
+	zuf_dbg_rw("[%ld] => 0x%lx\n", inode->i_ino, ret);
+	return ret;
+}
+
 const struct file_operations zuf_file_operations = {
 	.open			= generic_file_open,
+	.read_iter		= zuf_read_iter,
+	.write_iter		= zuf_write_iter,
 };
 
 const struct inode_operations zuf_file_inode_operations = {
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index bf3f8b27f918..27660979ed6f 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -287,6 +287,8 @@ void zuf_evict_inode(struct inode *inode)
 
 		zuf_w_lock(zii);
 
+		zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode);
+
 		zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_FREE_INODE, 0);
 
 		inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -298,6 +300,8 @@ void zuf_evict_inode(struct inode *inode)
 
 		zuf_smw_lock(zii);
 
+		zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode);
+
 		zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_EVICT_INODE, 0);
 
 		zuf_smw_unlock(zii);
@@ -585,5 +589,14 @@ void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi)
 		inode_has_no_xattr(inode);
 }
 
+/* direct_IO is not called. We set an empty one so open(O_DIRECT) will be happy
+ */
+static ssize_t zuf_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+	WARN_ON(1);
+	return 0;
+}
+
 const struct address_space_operations zuf_aops = {
+	.direct_IO		= zuf_direct_IO,
 };
diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c
new file mode 100644
index 000000000000..48f584e71a03
--- /dev/null
+++ b/fs/zuf/rw.c
@@ -0,0 +1,959 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Read/Write operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+#include <linux/fadvise.h>
+#include <linux/uio.h>
+#include <linux/delay.h>
+#include <asm/cacheflush.h>
+
+#include "zuf.h"
+#include "t2.h"
+
+#define	rand_tag(kiocb)	\
+	((kiocb->ki_filp->f_mode & FMODE_RANDOM) ? ZUFS_RW_RAND : 0)
+#define	kiocb_ra(kiocb)	(&kiocb->ki_filp->f_ra)
+
+static const char *_pr_rw(uint rw)
+{
+	return (rw & WRITE) ? "WRITE" : "READ";
+}
+
+static int _ioc_bounds_check(struct zufs_iomap *ziom,
+			     struct zufs_iomap *user_ziom, void *ziom_end)
+{
+	size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e;
+
+	if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max))) {
+		zuf_err("kernel-buff-size(0x%zx) < ziom->iom_max(0x%x)\n",
+			(iom_max_bytes / sizeof(__u64)), ziom->iom_max);
+		return -EINVAL;
+	}
+
+	if (unlikely(ziom->iom_max < ziom->iom_n)) {
+		zuf_err("ziom->iom_max(0x%x) < ziom->iom_n(0x%x)\n",
+			ziom->iom_max, ziom->iom_n);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void _extract_gb_multy_bns(struct _io_gb_multy *io_gb,
+				  struct zufs_ioc_IO *io_user)
+{
+	uint i;
+
+	/* Return of some T1 pages from GET_MULTY */
+	io_gb->iom_n = 0;
+	for (i = 0; i < io_gb->IO.ziom.iom_n; ++i) {
+		ulong bn = _zufs_iom_t1_bn(io_user->iom_e[i]);
+
+		if (unlikely(bn == -1)) {
+			zuf_err("!!!!");
+			break;
+		}
+		io_gb->bns[io_gb->iom_n++] = bn;
+	}
+}
+
+static int rw_overflow_handler(struct zuf_dispatch_op *zdo, void *arg,
+			       ulong max_bytes)
+{
+	struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr);
+	struct zufs_ioc_IO *io_user = arg;
+	int err;
+
+	*io = *io_user;
+
+	err = _ioc_bounds_check(&io->ziom, &io_user->ziom, arg + max_bytes);
+	if (unlikely(err))
+		return err;
+
+	if ((io->hdr.err == -EZUFS_RETRY) &&
+	    io->ziom.iom_n && _zufs_iom_pop(io->iom_e)) {
+
+		zuf_dbg_rw(
+			"[%s]zuf_iom_execute_sync(%d) max=0x%lx iom_e[%d] => %d\n",
+			zuf_op_name(io->hdr.operation), io->ziom.iom_n,
+			max_bytes, _zufs_iom_opt_type(io_user->iom_e),
+			io->hdr.err);
+
+		io->hdr.err = zuf_iom_execute_sync(zdo->sb, zdo->inode,
+						   io_user->iom_e,
+						   io->ziom.iom_n);
+		return EZUF_RETRY_DONE;
+	}
+
+	/* No tier ups needed */
+
+	if (io->hdr.err == -EZUFS_RETRY) {
+		zuf_warn("ZUSfs violating API EZUFS_RETRY with no payload\n");
+		/* continue anyway because we want to PUT all these GETs
+		 * we did. But the Server is buggy
+		 */
+		io->hdr.err = 0;
+	}
+
+	if (io->hdr.operation != ZUFS_OP_GET_MULTY)
+		return 0; /* We are finished */
+
+	/* ZUFS_OP_GET_MULTY Decoding at ZT context  */
+
+	if (io->ziom.iom_n) {
+		struct _io_gb_multy *io_gb =
+					container_of(io, typeof(*io_gb), IO);
+
+		zuf_dbg_rw("[%s] _extract_bns(%d) iom_e[0x%llx]\n",
+			   zuf_op_name(io->hdr.operation), io->ziom.iom_n,
+			   io_user->iom_e[0]);
+
+		if (unlikely(ZUS_API_MAP_MAX_PAGES < io->ziom.iom_n)) {
+			zuf_err("[%s] leaking T1 (%d) iom_e[0x%llx]\n",
+				zuf_op_name(io->hdr.operation), io->ziom.iom_n,
+				io_user->iom_e[0]);
+
+			io->ziom.iom_n = ZUS_API_MAP_MAX_PAGES;
+		}
+
+		_extract_gb_multy_bns(io_gb, io_user);
+	}
+
+	return 0;
+}
+
+static int _IO_dispatch(struct zuf_sb_info *sbi, struct zufs_ioc_IO *IO,
+			struct zuf_inode_info *zii, int operation,
+			uint pgoffset, struct page **pages, uint nump,
+			u64 filepos, uint len)
+{
+	struct zuf_dispatch_op zdo;
+	int err;
+
+	IO->hdr.operation = operation;
+	IO->hdr.in_len = sizeof(*IO);
+	IO->hdr.out_len = sizeof(*IO);
+	IO->hdr.offset = pgoffset;
+	IO->hdr.len = len;
+	IO->zus_ii = zii->zus_ii;
+	IO->filepos = filepos;
+
+	zuf_dispatch_init(&zdo, &IO->hdr, pages, nump);
+	zdo.oh = rw_overflow_handler;
+	zdo.sb = sbi->sb;
+	zdo.inode = &zii->vfs_inode;
+
+	zuf_dbg_verbose("[%ld][%s] fp=0x%llx nump=0x%x len=0x%x\n",
+			zdo.inode ? zdo.inode->i_ino : -1,
+			zuf_op_name(operation), filepos, nump, len);
+
+	err = __zufc_dispatch(ZUF_ROOT(sbi), &zdo);
+	if (unlikely(err == -EZUFS_RETRY)) {
+		zuf_err("Unexpected ZUS return => %d\n", err);
+		err = -EIO;
+	}
+	return err;
+}
+
+int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode,
+		     struct page *page, u64 filepos)
+{
+	struct zufs_ioc_IO io = {};
+	struct page *pages[1];
+	uint nump;
+	int err;
+
+	pages[0] = page;
+	nump = 1;
+
+	err = _IO_dispatch(sbi, &io, ZUII(inode), ZUFS_OP_READ, 0, pages, nump,
+			   filepos, PAGE_SIZE);
+	return err;
+}
+
+
+/* return < 0 - is err. 0 - the ranges compare equal */
+int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in,
+			      struct inode *i_out, loff_t pos_out, loff_t len)
+{
+	struct super_block *sb = i_in->i_sb;
+	ulong bs = sb->s_blocksize;
+	struct page *p_in, *p_out;
+	void *a_in, *a_out;
+	int err = 0;
+
+	if (unlikely((pos_in & (bs - 1)) || (pos_out & (bs - 1)) ||
+		     (bs != PAGE_SIZE))) {
+		zuf_err("[%ld]@0x%llx & [%ld]@0x%llx len=0x%llx bs=0x%lx\n",
+			   i_in->i_ino, pos_in, i_out->i_ino, pos_out, len, bs);
+		return -EINVAL;
+	}
+
+	zuf_dbg_rw("[%ld]@0x%llx & [%ld]@0x%llx len=0x%llx\n",
+		   i_in->i_ino, pos_in, i_out->i_ino, pos_out, len);
+
+	p_in = alloc_page(GFP_KERNEL);
+	p_out = alloc_page(GFP_KERNEL);
+	if (unlikely(!p_in || !p_out)) {
+		err = -ENOMEM;
+		goto out;
+	}
+	a_in = page_address(p_in);
+	a_out = page_address(p_out);
+
+	while (len) {
+		ulong l;
+
+		err = zuf_rw_read_page(SBI(sb), i_in, p_in, pos_in);
+		if (unlikely(err))
+			goto out;
+
+		err = zuf_rw_read_page(SBI(sb), i_out, p_out, pos_out);
+		if (unlikely(err))
+			goto out;
+
+		l = min_t(ulong, PAGE_SIZE, len);
+		if (memcmp(a_in, a_out, l)) {
+			err = -EBADE;
+			goto out;
+		}
+
+		pos_in += l;
+		pos_out += l;
+		len -= l;
+	}
+
+out:
+	__free_page(p_in);
+	__free_page(p_out);
+
+	return err;
+}
+
+/* ZERO a part of a single block. len does not cross a block boundary */
+int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t pos, loff_t len)
+{
+	struct zufs_ioc_IO io = {};
+	int err;
+
+	io.last_pos = (len == ~0ULL) ? ~0ULL : pos + len;
+	io.rw = mode;
+
+	err = _IO_dispatch(SBI(inode->i_sb), &io, ZUII(inode),
+			   ZUFS_OP_FALLOCATE, 0, NULL, 0, pos, 0);
+	return err;
+
+}
+
+static struct page *_addr_to_page(unsigned long addr)
+{
+	const void *p = (const void *)addr;
+
+	return is_vmalloc_addr(p) ? vmalloc_to_page(p) : virt_to_page(p);
+}
+
+static ssize_t _iov_iter_get_pages_kvec(struct iov_iter *ii,
+		   struct page **pages, size_t maxsize, uint maxpages,
+		   size_t *start)
+{
+	ssize_t bytes;
+	size_t i, nump;
+	unsigned long addr = (unsigned long)ii->kvec->iov_base;
+
+	*start = addr & (PAGE_SIZE - 1);
+	bytes = min_t(ssize_t, iov_iter_single_seg_count(ii), maxsize);
+	nump = min_t(size_t, DIV_ROUND_UP(bytes + *start, PAGE_SIZE), maxpages);
+
+	/* TODO: FUSE assumes single page for ITER_KVEC. Boaz: Remove? */
+	WARN_ON(nump > 1);
+
+	for (i = 0; i < nump; ++i) {
+		pages[i] = _addr_to_page(addr + (i * PAGE_SIZE));
+
+		get_page(pages[i]);
+	}
+	return bytes;
+}
+
+static ssize_t _iov_iter_get_pages_any(struct iov_iter *ii,
+		   struct page **pages, size_t maxsize, uint maxpages,
+		   size_t *start)
+{
+	ssize_t bytes;
+
+	bytes = unlikely(ii->type & ITER_KVEC) ?
+		_iov_iter_get_pages_kvec(ii, pages, maxsize, maxpages, start) :
+		iov_iter_get_pages(ii, pages, maxsize, maxpages, start);
+
+	if (unlikely(bytes < 0))
+		zuf_dbg_err("[%d] bytes=%ld type=%d count=%lu",
+			smp_processor_id(), bytes, ii->type, ii->count);
+
+	return bytes;
+}
+
+static ssize_t _zufs_IO(struct zuf_sb_info *sbi, struct inode *inode,
+			void *on_stack, uint max_on_stack,
+			struct iov_iter *ii, struct kiocb *kiocb,
+			struct file_ra_state *ra, int operation, uint rw)
+{
+	int err = 0;
+	loff_t start_pos = kiocb->ki_pos;
+	loff_t pos = start_pos;
+	enum big_alloc_type bat;
+	struct page **pages;
+	uint max_pages = min_t(uint,
+			md_o2p_up(iov_iter_count(ii) + (pos & ~PAGE_MASK)),
+			ZUS_API_MAP_MAX_PAGES);
+
+	pages = big_alloc(max_pages * sizeof(*pages), max_on_stack, on_stack,
+			  GFP_NOFS, &bat);
+	if (unlikely(!pages)) {
+		zuf_err("Sigh on stack is best max_pages=%d\n", max_pages);
+		return -ENOMEM;
+	};
+
+	while (iov_iter_count(ii)) {
+		struct zufs_ioc_IO io = {};
+		uint nump;
+		ssize_t bytes;
+		size_t pgoffset;
+		uint i;
+
+		if (ra) {
+			io.ra.start	= ra->start;
+			io.ra.ra_pages	= ra->ra_pages;
+			io.ra.prev_pos	= ra->prev_pos;
+		}
+		io.rw = rw;
+
+		bytes = _iov_iter_get_pages_any(ii, pages,
+					ZUS_API_MAP_MAX_SIZE,
+					ZUS_API_MAP_MAX_PAGES, &pgoffset);
+		if (unlikely(bytes < 0)) {
+			err = bytes;
+			break;
+		}
+
+		nump = DIV_ROUND_UP(bytes + pgoffset, PAGE_SIZE);
+
+		io.last_pos = pos;
+		err = _IO_dispatch(sbi, &io, ZUII(inode), operation,
+				   pgoffset, pages, nump, pos, bytes);
+
+		bytes = io.last_pos - pos;
+
+		zuf_dbg_rw("[%ld]	%s [0x%llx-0x%zx]\n",
+			    inode->i_ino, _pr_rw(rw), pos, bytes);
+
+		iov_iter_advance(ii, bytes);
+		pos += bytes;
+
+		if (ra) {
+			ra->start	= io.ra.start;
+			ra->ra_pages	= io.ra.ra_pages;
+			ra->prev_pos	= io.ra.prev_pos;
+		}
+		if (io.wr_unmap.len)
+			unmap_mapping_range(inode->i_mapping,
+					    io.wr_unmap.offset,
+					    io.wr_unmap.len, 0);
+
+		for (i = 0; i < nump; ++i)
+			put_page(pages[i]);
+
+		if (unlikely(err))
+			break;
+	}
+
+	big_free(pages, bat);
+
+	if (unlikely(pos == start_pos))
+		return err;
+
+	kiocb->ki_pos = pos;
+	return pos - start_pos;
+}
+
+int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode,
+		       loff_t pos, ulong len, struct _io_gb_multy *io_gb)
+{
+	struct zufs_ioc_IO *IO = &io_gb->IO;
+	int err;
+
+	IO->hdr.operation = ZUFS_OP_GET_MULTY;
+	IO->hdr.in_len = sizeof(*IO);
+	IO->hdr.out_len = sizeof(*IO);
+	IO->hdr.len = len;
+	IO->zus_ii = ZUII(inode)->zus_ii;
+	IO->filepos = pos;
+	IO->last_pos = pos;
+
+	zuf_dispatch_init(&io_gb->zdo, &IO->hdr, NULL, 0);
+	io_gb->zdo.oh = rw_overflow_handler;
+	io_gb->zdo.sb = sbi->sb;
+	io_gb->zdo.inode = inode;
+	io_gb->zdo.bns = io_gb->bns;
+
+
+	err = __zufc_dispatch(ZUF_ROOT(sbi), &io_gb->zdo);
+	if (unlikely(err == -EZUFS_RETRY)) {
+		zuf_err("Unexpected ZUS return => %d\n", err);
+		err = -EIO;
+	}
+
+	if (unlikely(err)) {
+		/* err from Server means no contract and NO bns locked
+		 * so no puts
+		 */
+		if ((err != -ENOSPC) && (err != -EIO) && (err != -EINTR))
+			zuf_warn("At this early stage show me %d\n", err);
+		if (io_gb->IO.ziom.iom_n)
+			zuf_err("Server Smoking iom_n=%u err=%d\n",
+				io_gb->IO.ziom.iom_n, err);
+		zuf_dbg_err("_IO_dispatch => %d\n", err);
+		return err;
+	}
+	if (unlikely(!io_gb->iom_n)) {
+		if (!io_gb->IO.ziom.iom_n) {
+			zuf_err("WANT TO SEE => %d\n", err);
+			return err;
+		}
+
+		_extract_gb_multy_bns(io_gb, &io_gb->IO);
+		if (unlikely(!io_gb->iom_n)) {
+			zuf_err("WHAT ????\n");
+			return err;
+		}
+	}
+	/* Even if _IO_dispatch returned a theoretical error but also some
+	 * pages, we do the few pages and do an OP_PUT_MULTY (error ignored)
+	 */
+	return 0;
+}
+
+void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode,
+			struct _io_gb_multy *io_gb)
+{
+	bool put_now;
+	int err;
+
+	put_now = io_gb->IO.ret_flags &
+		  (ZUFS_RET_PUT_NOW | ZUFS_RET_NEW | ZUFS_RET_LOCKED_PUT);
+
+	err = zufc_pigy_put(ZUF_ROOT(sbi), &io_gb->zdo, &io_gb->IO,
+			     io_gb->iom_n, io_gb->bns, put_now);
+	if (unlikely(err))
+		zuf_warn("zufc_pigy_put => %d\n", err);
+}
+
+static inline int _read_one(struct zuf_sb_info *sbi, struct iov_iter *ii,
+			     ulong bn, uint offset, uint len, int i)
+{
+	uint retl;
+
+	if (!bn) {
+		retl = iov_iter_zero(len, ii);
+	} else {
+		void *addr = md_addr_verify(sbi->md, md_p2o(bn));
+
+		if (unlikely(!addr)) {
+			zuf_err("Server bad bn[%d]=0x%lx bytes_more=0x%lx\n",
+				i, bn, iov_iter_count(ii));
+			return -EIO;
+		}
+		retl = copy_to_iter(addr + offset, len, ii);
+	}
+	if (unlikely(retl != len)) {
+		/* This can happen if we get a read-only ptr from the App */
+		zuf_dbg_err("copy_to_iter bn=0x%lx off=0x%x len=0x%x retl=0x%x\n",
+			bn, offset, len, retl);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static inline int _write_one(struct zuf_sb_info *sbi, struct iov_iter *ii,
+			     ulong bn, uint offset, uint len, int i)
+{
+	void *addr = md_addr_verify(sbi->md, md_p2o(bn));
+	uint retl;
+
+	if (unlikely(!addr)) {
+		zuf_err("Server bad page[%d] bn=0x%lx bytes_more=0x%lx\n",
+			i, bn, iov_iter_count(ii));
+		return -EIO;
+	}
+
+	retl = _copy_from_iter_flushcache(addr + offset, len, ii);
+	if (unlikely(retl != len)) {
+		/* FIXME: This can happen if we get a read-only ptr from the App */
+		zuf_err("copy_from_iter bn=0x%lx off=0x%x len=0x%x retl=0x%x\n",
+			bn, offset, len, retl);
+		return -EFAULT;
+	}
+	return 0;
+}
+
+static ssize_t _IO_gm_inner(struct zuf_sb_info *sbi, struct inode *inode,
+			    ulong *bns, uint max_bns,
+			    struct iov_iter *ii, struct file_ra_state *ra,
+			    loff_t start, uint rw)
+{
+	loff_t pos = start;
+	uint offset = pos & (PAGE_SIZE - 1);
+	struct _io_gb_multy io_gb = { .bns = bns, };
+	ssize_t size;
+	int err;
+	uint i;
+
+	if (ra) {
+		io_gb.IO.ra.start	= ra->start;
+		io_gb.IO.ra.ra_pages	= ra->ra_pages;
+		io_gb.IO.ra.prev_pos	= ra->prev_pos;
+	}
+	io_gb.IO.rw = rw;
+
+	size = min_t(ssize_t, ZUS_API_MAP_MAX_SIZE - offset,
+		     iov_iter_count(ii));
+	err = _zufs_IO_get_multy(sbi, inode, pos, size, &io_gb);
+	if (unlikely(err))
+		return err;
+
+	if (ra) {
+		ra->start	= io_gb.IO.ra.start;
+		ra->ra_pages	= io_gb.IO.ra.ra_pages;
+		ra->prev_pos	= io_gb.IO.ra.prev_pos;
+	}
+
+	if (unlikely(io_gb.IO.last_pos != (pos + size))) {
+		if (unlikely(io_gb.IO.last_pos < pos)) {
+			zuf_err("Server bad last_pos(0x%llx) < pos(0x%llx) len=0x%lx\n",
+				 io_gb.IO.last_pos, pos, iov_iter_count(ii));
+			err = -EIO;
+			goto out;
+		}
+
+		zuf_dbg_err("Short %s start(0x%llx) len=0x%lx last_pos(0x%llx)\n",
+			    _pr_rw(rw), pos, iov_iter_count(ii),
+			    io_gb.IO.last_pos);
+		size = io_gb.IO.last_pos - pos;
+	}
+
+	i = 0;
+	while (size) {
+		uint len;
+		ulong bn;
+
+		len = min_t(uint, PAGE_SIZE - offset, size);
+
+		bn = io_gb.bns[i];
+		if (rw & WRITE)
+			err = _write_one(sbi, ii, bn, offset, len, i);
+		else
+			err = _read_one(sbi, ii, bn, offset, len, i);
+		if (unlikely(err))
+			break;
+
+		zuf_dbg_rw("[%ld]	%s [0x%llx-0x%x] bn=0x%lx [%d]\n",
+			    inode->i_ino, _pr_rw(rw), pos, len, bn, i);
+
+		pos += len;
+		size -= len;
+		offset = 0;
+		if (io_gb.iom_n <= ++i)
+			break;
+	}
+out:
+	_zufs_IO_put_multy(sbi, inode, &io_gb);
+	if (io_gb.IO.wr_unmap.len)
+		unmap_mapping_range(inode->i_mapping, io_gb.IO.wr_unmap.offset,
+				    io_gb.IO.wr_unmap.len, 0);
+
+	return unlikely(pos == start) ? err : pos - start;
+}
+
+static ssize_t _IO_gm(struct zuf_sb_info *sbi, struct inode *inode,
+		      ulong *on_stack, uint max_on_stack,
+		      struct iov_iter *ii, struct kiocb *kiocb,
+		      struct file_ra_state *ra, uint rw)
+{
+	ssize_t size = 0;
+	ssize_t ret = 0;
+	enum big_alloc_type bat;
+	ulong *bns;
+	uint max_bns = min_t(uint,
+		md_o2p_up(iov_iter_count(ii) + (kiocb->ki_pos & ~PAGE_MASK)),
+		ZUS_API_MAP_MAX_PAGES);
+
+	bns = big_alloc(max_bns * sizeof(ulong), max_on_stack, on_stack,
+			GFP_NOFS, &bat);
+	if (unlikely(!bns)) {
+		zuf_err("life was more simple on the stack max_bns=%d\n",
+			max_bns);
+		return -ENOMEM;
+	}
+
+	while (iov_iter_count(ii)) {
+		ret = _IO_gm_inner(sbi, inode, bns, max_bns, ii, ra,
+				   kiocb->ki_pos, rw);
+		if (unlikely(ret < 0))
+			break;
+
+		kiocb->ki_pos += ret;
+		size += ret;
+	}
+
+	big_free(bns, bat);
+
+	return size ?: ret;
+}
+
+ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
+			 struct kiocb *kiocb, struct iov_iter *ii)
+{
+	long on_stack[ZUF_MAX_STACK(8) / sizeof(long)];
+	ulong rw = READ | rand_tag(kiocb);
+
+	/* EOF protection */
+	if (unlikely(kiocb->ki_pos > i_size_read(inode)))
+		return 0;
+
+	iov_iter_truncate(ii, i_size_read(inode) - kiocb->ki_pos);
+	if (unlikely(!iov_iter_count(ii))) {
+		/* Don't let zero len reads have any effect */
+		zuf_dbg_rw("called with zero len\n");
+		return 0;
+	}
+
+	if (zuf_is_nio_reads(inode))
+		return _IO_gm(SBI(sb), inode, on_stack, sizeof(on_stack),
+			      ii, kiocb, kiocb_ra(kiocb), rw);
+
+	return _zufs_IO(SBI(sb), inode, on_stack, sizeof(on_stack), ii,
+			kiocb, kiocb_ra(kiocb), ZUFS_OP_READ, rw);
+}
+
+ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
+			  struct kiocb *kiocb, struct iov_iter *ii)
+{
+	long on_stack[ZUF_MAX_STACK(8) / sizeof(long)];
+	ulong rw = WRITE;
+
+	if (kiocb->ki_filp->f_flags & O_DSYNC ||
+	    IS_SYNC(kiocb->ki_filp->f_mapping->host))
+		rw |= ZUFS_RW_DSYNC;
+	if (kiocb->ki_filp->f_flags & O_DIRECT)
+		rw |= ZUFS_RW_DIRECT;
+
+	if (zuf_is_nio_writes(inode))
+		return _IO_gm(SBI(sb), inode, on_stack, sizeof(on_stack),
+			      ii, kiocb, kiocb_ra(kiocb), rw);
+
+	return _zufs_IO(SBI(sb), inode, on_stack, sizeof(on_stack),
+			ii, kiocb, kiocb_ra(kiocb), ZUFS_OP_WRITE, rw);
+}
+
+/* ~~~~ iom_dec.c ~~~ */
+/* for now here (at rw.c) looks logical */
+
+static int __iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis,
+			       zu_dpp_t t1, ulong t2_bn, __u64 num_pages)
+{
+	void *ptr;
+	struct page *page;
+	int i, err;
+
+	ptr = zuf_dpp_t_addr(sb, t1);
+	if (unlikely(!ptr)) {
+		zuf_err("Bad t1 zu_dpp_t t1=0x%llx t2=0x%lx num_pages=0x%llx\n",
+			t1, t2_bn, num_pages);
+		return -EFAULT; /* zuf_dpp_t_addr already yelled */
+	}
+
+	page = virt_to_page(ptr);
+	if (unlikely(!page)) {
+		zuf_err("bad t1(0x%llx)\n", t1);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < num_pages; ++i) {
+		err = t2_io_add(tis, t2_bn++, page++);
+		if (unlikely(err))
+			return err;
+	}
+	return 0;
+}
+
+static int iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis,
+			     __u64 **cur_e)
+{
+	struct zufs_iom_t2_io_len *t2iol = (void *)*cur_e;
+	int err = __iom_add_t2_io_len(sb, tis, t2iol->iom.t1_val,
+				      _zufs_iom_first_val(&t2iol->iom.t2_val),
+				      t2iol->num_pages);
+
+	*cur_e = (void *)(t2iol + 1);
+	return err;
+}
+
+static int iom_add_t2_io(struct super_block *sb, struct t2_io_state *tis,
+			 __u64 **cur_e)
+{
+	struct zufs_iom_t2_io *t2io = (void *)*cur_e;
+
+	int err = __iom_add_t2_io_len(sb, tis, t2io->t1_val,
+				      _zufs_iom_first_val(&t2io->t2_val), 1);
+
+	*cur_e = (void *)(t2io + 1);
+	return err;
+}
+
+static int iom_t2_zusmem_io(struct super_block *sb, struct t2_io_state *tis,
+			    __u64 **cur_e)
+{
+	struct zufs_iom_t2_zusmem_io *mem_io = (void *)*cur_e;
+	ulong t2_bn = _zufs_iom_first_val(&mem_io->t2_val);
+	ulong user_ptr = (ulong)mem_io->zus_mem_ptr;
+	int rw = _zufs_iom_opt_type(*cur_e) == IOM_T2_ZUSMEM_WRITE ?
+						WRITE : READ;
+	int num_p = md_o2p_up(mem_io->len);
+	int num_p_r;
+	struct page *pages[16];
+	int i, err = 0;
+
+	if (16 < num_p) {
+		zuf_err("num_p(%d) > 16\n", num_p);
+		return -EINVAL;
+	}
+
+	num_p_r = get_user_pages_fast(user_ptr, num_p, rw,
+				      pages);
+	if (num_p_r != num_p) {
+		zuf_err("!!!! get_user_pages_fast num_p_r(%d) != num_p(%d)\n",
+			num_p_r, num_p);
+		err = -EFAULT;
+		goto out;
+	}
+
+	for (i = 0; i < num_p_r && !err; ++i)
+		err = t2_io_add(tis, t2_bn++, pages[i]);
+
+out:
+	for (i = 0; i < num_p_r; ++i)
+		put_page(pages[i]);
+
+	*cur_e = (void *)(mem_io + 1);
+	return err;
+}
+
+static int iom_unmap(struct super_block *sb, struct inode *inode, __u64 **cur_e)
+{
+	struct zufs_iom_unmap *iom_unmap = (void *)*cur_e;
+	struct inode *inode_look = NULL;
+	ulong	unmap_index = _zufs_iom_first_val(&iom_unmap->unmap_index);
+	ulong	unmap_n = iom_unmap->unmap_n;
+	ulong	ino = iom_unmap->ino;
+
+	if (!inode || ino) {
+		if (WARN_ON(!ino)) {
+			zuf_err("[%ld] 0x%lx-0x%lx\n",
+				inode ? inode->i_ino : -1, unmap_index,
+				unmap_n);
+			goto out;
+		}
+		inode_look = ilookup(sb, ino);
+		if (!inode_look) {
+			/* Between the time we requested the unmap and now
+			 * the inode was evicted from cache, so surely it no
+			 * longer has any mappings. The job was already done
+			 * for us. Even if a racing thread reloads the inode it
+			 * will not have the mapping we wanted to clear, only
+			 * new ones.
+			 * TODO: For now warn when this happens, because in
+			 *    current usage it cannot happen. But before
+			 *    upstream we should convert to zuf_dbg_err
+			 */
+			zuf_warn("[%ld] 0x%lx-0x%lx\n",
+				 ino, unmap_index, unmap_n);
+			goto out;
+		}
+
+		inode = inode_look;
+	}
+
+	zuf_dbg_rw("[%ld] 0x%lx-0x%lx\n", inode->i_ino, unmap_index, unmap_n);
+
+	unmap_mapping_range(inode->i_mapping, md_p2o(unmap_index),
+			    md_p2o(unmap_n), 0);
+
+	if (inode_look)
+		iput(inode_look);
+
+out:
+	*cur_e = (void *)(iom_unmap + 1);
+	return 0;
+}
+
+static int iom_wbinv(__u64 **cur_e)
+{
+	wbinvd();
+
+	++*cur_e;
+
+	return 0;
+}
+
+struct _iom_exec_info {
+	struct super_block *sb;
+	struct inode *inode;
+	struct t2_io_state *rd_tis;
+	struct t2_io_state *wr_tis;
+	__u64 *iom_e;
+	uint iom_n;
+	bool print;
+};
+
+static int _iom_execute_inline(struct _iom_exec_info *iei)
+{
+	__u64 *cur_e, *end_e;
+	int err = 0;
+#ifdef CONFIG_ZUF_DEBUG
+	uint wrs = 0;
+	uint rds = 0;
+	uint uns = 0;
+	uint wrmem = 0;
+	uint rdmem = 0;
+	uint wbinv = 0;
+#	define	WRS()	(++wrs)
+#	define	RDS()	(++rds)
+#	define	UNS()	(++uns)
+#	define	WRMEM()	(++wrmem)
+#	define	RDMEM()	(++rdmem)
+#	define	WBINV()	(++wbinv)
+#else
+#	define	WRS()
+#	define	RDS()
+#	define	UNS()
+#	define	WRMEM()
+#	define	RDMEM()
+#	define	WBINV()
+#endif /* !def CONFIG_ZUF_DEBUG */
+
+	cur_e = iei->iom_e;
+	end_e = cur_e + iei->iom_n;
+	while (cur_e && (cur_e < end_e)) {
+		uint op;
+
+		op = _zufs_iom_opt_type(cur_e);
+
+		switch (op) {
+		case IOM_NONE:
+			return 0;
+
+		case IOM_T2_WRITE:
+			err = iom_add_t2_io(iei->sb, iei->wr_tis, &cur_e);
+			WRS();
+			break;
+		case IOM_T2_READ:
+			err = iom_add_t2_io(iei->sb, iei->rd_tis, &cur_e);
+			RDS();
+			break;
+
+		case IOM_T2_WRITE_LEN:
+			err = iom_add_t2_io_len(iei->sb, iei->wr_tis, &cur_e);
+			WRS();
+			break;
+		case IOM_T2_READ_LEN:
+			err = iom_add_t2_io_len(iei->sb, iei->rd_tis, &cur_e);
+			RDS();
+			break;
+
+		case IOM_T2_ZUSMEM_WRITE:
+			err = iom_t2_zusmem_io(iei->sb, iei->wr_tis, &cur_e);
+			WRMEM();
+			break;
+		case IOM_T2_ZUSMEM_READ:
+			err = iom_t2_zusmem_io(iei->sb, iei->rd_tis, &cur_e);
+			RDMEM();
+			break;
+
+		case IOM_UNMAP:
+			err = iom_unmap(iei->sb, iei->inode, &cur_e);
+			UNS();
+			break;
+
+		case IOM_WBINV:
+			err = iom_wbinv(&cur_e);
+			WBINV();
+			break;
+
+		default:
+			zuf_err("!!!!! Bad opt %d\n",
+				_zufs_iom_opt_type(cur_e));
+			err = -EIO;
+			break;
+		}
+
+		if (unlikely(err))
+			break;
+	}
+
+#ifdef CONFIG_ZUF_DEBUG
+	zuf_dbg_rw("exec wrs=%d rds=%d uns=%d rdmem=%d wrmem=%d wbinv=%d => %d\n",
+		   wrs, rds, uns, rdmem, wrmem, wbinv, err);
+#endif
+
+	return err;
+}
+
+/* inode here is the default inode, used if iom_unmap->ino is zero.
+ * This is an optimization for the unmap done at the write_iter hot path.
+ */
+int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
+			 __u64 *iom_e_user, uint iom_n)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct t2_io_state rd_tis = {};
+	struct t2_io_state wr_tis = {};
+	struct _iom_exec_info iei = {};
+	int err, err_r, err_w;
+
+	t2_io_begin(sbi->md, READ, NULL, 0, -1, &rd_tis);
+	t2_io_begin(sbi->md, WRITE, NULL, 0, -1, &wr_tis);
+
+	iei.sb = sb;
+	iei.inode = inode;
+	iei.rd_tis = &rd_tis;
+	iei.wr_tis = &wr_tis;
+	iei.iom_e = iom_e_user;
+	iei.iom_n = iom_n;
+	iei.print = 0;
+
+	err = _iom_execute_inline(&iei);
+
+	err_r = t2_io_end(&rd_tis, true);
+	err_w = t2_io_end(&wr_tis, true);
+
+	/* TODO: not sure if OK when _iom_execute_inline returns -ENOMEM.
+	 * In such a case, we might be better off skipping the t2_io_ends.
+	 */
+	return err ?: (err_r ?: err_w);
+}
+
+int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
+			 __u64 *iom_e_user, uint iom_n)
+{
+	zuf_err("Async IOM NOT supported Yet!!!\n");
+	return -EFAULT;
+}
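
Reviewer's note on the interpreter loop in _iom_execute_inline() above: each
element starts with a __u64 whose high 8 bits select the operation, and the
cursor then advances by the size of that operation's struct. What follows is a
minimal user-space sketch of that walk, not the patch's code. It assumes
zu_dpp_t is 64 bits wide (so a T2 element is two words), covers only a subset
of the enum, and the names iom_enc(), iom_opt_type(), iom_val(), iom_words()
and iom_walk() are hypothetical. It also adds a check that a multi-word
element does not run past the end of the array, which is an addition made
here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOM_VAL_BITS	56
#define IOM_VAL_MASK	((UINT64_C(1) << IOM_VAL_BITS) - 1)

/* Subset of enum ZUFS_IOM_TYPE, same numeric values as in the patch */
enum iom_type {
	IOM_NONE		= 0,
	IOM_T2_WRITE		= 3,
	IOM_T2_READ		= 4,
	IOM_T2_WRITE_LEN	= 5,
	IOM_T2_READ_LEN		= 6,
	IOM_UNMAP		= 9,
	IOM_WBINV		= 10,
	IOM_NUM_LEGAL_OPT	= 12,
};

/* Pack an operation type (high 8 bits) and a value (low 56 bits) */
static uint64_t iom_enc(unsigned int type, uint64_t val)
{
	assert(type < IOM_NUM_LEGAL_OPT);
	return (val & IOM_VAL_MASK) | ((uint64_t)type << IOM_VAL_BITS);
}

/* Out-of-range types decode as IOM_NONE, mirroring _zufs_iom_opt_type() */
static unsigned int iom_opt_type(uint64_t e)
{
	unsigned int t = (unsigned int)(e >> IOM_VAL_BITS);

	return t >= IOM_NUM_LEGAL_OPT ? IOM_NONE : t;
}

static uint64_t iom_val(uint64_t e)
{
	return e & IOM_VAL_MASK;
}

/* Words consumed by one element; 0 means "not handled by this sketch" */
static unsigned int iom_words(unsigned int type)
{
	switch (type) {
	case IOM_T2_WRITE:
	case IOM_T2_READ:
		return 2;	/* t2_val, t1_val (assumed 64-bit) */
	case IOM_T2_WRITE_LEN:
	case IOM_T2_READ_LEN:
		return 3;	/* t2_val, t1_val, num_pages */
	case IOM_UNMAP:
		return 3;	/* unmap_index, unmap_n, ino */
	case IOM_WBINV:
		return 1;
	default:
		return 0;
	}
}

/* Returns the number of operations walked, or -1 on a bad element */
static int iom_walk(const uint64_t *iom_e, unsigned int iom_n)
{
	const uint64_t *cur = iom_e, *end = iom_e + iom_n;
	int ops = 0;

	while (cur < end) {
		unsigned int type = iom_opt_type(*cur);
		unsigned int w;

		if (type == IOM_NONE)
			break;			/* terminator */
		w = iom_words(type);
		if (!w || (size_t)(end - cur) < w)
			return -1;		/* unknown op or overrun */
		cur += w;
		++ops;
	}
	return ops;
}
```

A caller would fill an array with iom_enc() headers followed by each
operation's payload words, end it with an IOM_NONE element, and pass it to
iom_walk(). Since the kernel reads these elements from a buffer the Server can
write to, a per-element length check of this kind may be worth considering in
the real loop as well.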
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index c0049c1d5ba3..11300fd79929 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -25,6 +25,20 @@
 #include "relay.h"
 
 enum { INITIAL_ZT_CHANNELS = 3 };
+#define _ZT_MAX_PIGY_PUT \
+	((ZUS_API_MAP_MAX_PAGES * sizeof(__u64) + \
+	  sizeof(struct zufs_ioc_IO)) * INITIAL_ZT_CHANNELS)
+
+enum { PG0 = 0, PG1 = 1, PG2 = 2, PG3 = 3, PG4 = 4, PG5 = 5 };
+struct __pigi_put_it {
+	void *buff;
+	void *waiter;
+	uint s; /* total encoded bytes */
+	uint last; /* So we can update last zufs_ioc_hdr->flags */
+	bool needs_goosing;
+	ulong inodes[PG5 + 1];
+	uint ic;
+};
 
 struct zufc_thread {
 	struct zuf_special_file hdr;
@@ -40,6 +54,12 @@ struct zufc_thread {
 
 	/* Next operation*/
 	struct zuf_dispatch_op *zdo;
+
+	/* Secondary chans point to the 0-channel's
+	 * pigi_put_chan0
+	 */
+	struct __pigi_put_it pigi_put_chan0;
+	struct __pigi_put_it *pigi_put;
 };
 
 struct zuf_threads_pool {
@@ -76,7 +96,14 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_RENAME);
 		CASE_ENUM_NAME(ZUFS_OP_READDIR);
 
+		CASE_ENUM_NAME(ZUFS_OP_READ);
+		CASE_ENUM_NAME(ZUFS_OP_PRE_READ);
+		CASE_ENUM_NAME(ZUFS_OP_WRITE);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR);
+
+		CASE_ENUM_NAME(ZUFS_OP_GET_MULTY);
+		CASE_ENUM_NAME(ZUFS_OP_PUT_MULTY);
+		CASE_ENUM_NAME(ZUFS_OP_NOOP);
 	case ZUFS_OP_MAX_OPT:
 	default:
 		return "UNKNOWN";
@@ -543,6 +570,238 @@ static void _prep_header_size_op(struct zufs_ioc_hdr *hdr,
 	hdr->err = err;
 }
 
+/* ~~~~~ pigi_put logic ~~~~~ */
+struct _goose_waiter {
+	struct kref kref;
+	struct zuf_root_info *zri;
+	ulong inode; /* We use the inode address as a unique tag */
+};
+
+static void _last_goose(struct kref *kref)
+{
+	struct _goose_waiter *gw = container_of(kref, typeof(*gw), kref);
+
+	wake_up_var(&gw->kref);
+}
+
+static void _goose_put(struct _goose_waiter *gw)
+{
+	kref_put(&gw->kref, _last_goose);
+}
+
+static void _goose_get(struct _goose_waiter *gw)
+{
+	kref_get(&gw->kref);
+}
+
+static void _goose_wait(struct _goose_waiter *gw)
+{
+	wait_var_event(&gw->kref, !kref_read(&gw->kref));
+}
+
+static void _pigy_put_encode(struct zufs_ioc_IO *io,
+			     struct zufs_ioc_IO *io_user, ulong *bns)
+{
+	uint i;
+
+	*io_user = *io;
+	for (i = 0; i < io->ziom.iom_n; ++i)
+		_zufs_iom_enc_bn(&io_user->ziom.iom_e[i], bns[i], 0);
+
+	io_user->hdr.in_len = _ioc_IO_size(io->ziom.iom_n);
+}
+
+static void pigy_put_dh(struct zuf_dispatch_op *zdo, void *pzt, void *parg)
+{
+	struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr);
+	struct zufs_ioc_IO *io_user = parg;
+
+	_pigy_put_encode(io, io_user, zdo->bns);
+}
+
+static int _pigy_put_now(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+{
+	int err;
+
+	zdo->dh = pigy_put_dh;
+
+	err = __zufc_dispatch(zri, zdo);
+	if (unlikely(err == -EZUFS_RETRY)) {
+		zuf_err("Unexpected ZUS return => %d\n", err);
+		err = -EIO;
+	}
+	return err;
+}
+
+int zufc_pigy_put(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo,
+		  struct zufs_ioc_IO *io, uint iom_n, ulong *bns, bool do_now)
+{
+	struct zufc_thread *zt;
+	struct zufs_ioc_IO *io_user;
+	uint pigi_put_s;
+	int cpu;
+
+	io->hdr.operation = ZUFS_OP_PUT_MULTY;
+	io->hdr.out_len = 0;		/* No returns from put */
+	io->ret_flags = 0;
+	io->ziom.iom_n = iom_n;
+	zdo->bns = bns;
+
+	pigi_put_s = _ioc_IO_size(iom_n);
+
+	/* FIXME: Pedantic check remove please */
+	if (WARN_ON(zdo->__locked_zt && !do_now))
+		do_now = true;
+
+	cpu = get_cpu();
+
+	zt = _zt_from_cpu(zri, cpu, 0);
+	if (do_now || (zt->pigi_put->s + pigi_put_s > _ZT_MAX_PIGY_PUT) ||
+	    (zt->pigi_put->ic > PG5)) {
+		put_cpu();
+
+		/* NOTE: pigy_put buffer is full, We dispatch a put NOW
+		 * which will also take with it the full pigy_put buffer.
+		 * At the server the pigy_put will be done first then this
+		 * one, so order of puts is preserved, not that it matters
+		 */
+		if (!do_now)
+			zuf_dbg_perf(
+				"[%ld] iom_n=0x%x zt->pigi_put->s=0x%x + 0x%x > 0x%lx ic=%d\n",
+				zdo->inode->i_ino, iom_n, zt->pigi_put->s,
+				pigi_put_s, _ZT_MAX_PIGY_PUT,
+				zt->pigi_put->ic);
+
+		return _pigy_put_now(zri, zdo);
+	}
+
+	/* Mark the last one as having more */
+	if (zt->pigi_put->s) {
+		io_user = zt->pigi_put->buff + zt->pigi_put->last;
+		io_user->hdr.flags |= ZUFS_H_HAS_PIGY_PUT;
+	}
+
+	io_user = zt->pigi_put->buff + zt->pigi_put->s;
+	_pigy_put_encode(io, io_user, bns);
+	zt->pigi_put->last = zt->pigi_put->s;
+	zt->pigi_put->s += pigi_put_s;
+	zt->pigi_put->inodes[zt->pigi_put->ic++] = (ulong)zdo->inode;
+
+	put_cpu();
+	return 0;
+}
+
+/* Add the pigy_put accumulated buff to current command
+ * Always runs in the context of a ZT
+ */
+static void _pigy_put_add_to_ioc(struct zuf_root_info *zri,
+				 struct zufc_thread *zt)
+{
+	struct zufs_ioc_hdr *hdr = zt->opt_buff;
+	struct __pigi_put_it *pigi = zt->pigi_put;
+
+	if (unlikely(!pigi->s))
+		return;
+
+	if (unlikely(pigi->s + hdr->in_len > zt->max_zt_command)) {
+		zuf_err("!!! Should not pigi_put->s(%d) + in_len(%d) > max_zt_command(%ld)\n",
+			pigi->s, hdr->in_len, zt->max_zt_command);
+		/* TODO: we must check at init time that max_zt_command is
+		 * not too small
+		 */
+		return;
+	}
+
+	memcpy((void *)hdr + hdr->in_len, pigi->buff, pigi->s);
+	hdr->flags |= ZUFS_H_HAS_PIGY_PUT;
+	pigi->s = pigi->last = 0;
+	pigi->ic = 0;
+	/* for every 3 channels */
+	pigi->inodes[PG0] = pigi->inodes[PG1] = pigi->inodes[PG2] = 0;
+	pigi->inodes[PG3] = pigi->inodes[PG4] = pigi->inodes[PG5] = 0;
+}
+
+static void _goose_prep(struct zuf_root_info *zri,
+			struct zufc_thread *zt)
+{
+	_prep_header_size_op(zt->opt_buff, ZUFS_OP_NOOP, 0);
+	_pigy_put_add_to_ioc(zri, zt);
+
+	zt->pigi_put->needs_goosing = false;
+}
+
+static inline bool _zt_pigi_has_inode(struct __pigi_put_it *pigi,
+				      ulong inode)
+{
+	return	pigi->ic &&
+		((pigi->inodes[PG0] == inode) ||
+		 (pigi->inodes[PG1] == inode) ||
+		 (pigi->inodes[PG2] == inode) ||
+		 (pigi->inodes[PG3] == inode) ||
+		 (pigi->inodes[PG4] == inode) ||
+		 (pigi->inodes[PG5] == inode));
+}
+
+static void _goose_one(void *info)
+{
+	struct _goose_waiter *gw = info;
+	struct zuf_root_info *zri = gw->zri;
+	struct zufc_thread *zt;
+	int cpu = smp_processor_id();
+	uint c;
+
+	/* Find the least busy channel. If all are busy we end up with zt0 */
+	for (c = INITIAL_ZT_CHANNELS; c; --c) {
+		zt = _zt_from_cpu(zri, cpu, c - 1);
+		if (unlikely(!(zt && zt->hdr.file)))
+			return; /* We are crashing */
+
+		if (!zt->pigi_put->s || zt->pigi_put->needs_goosing)
+			return; /* this cpu is goose empty */
+
+		if (!_zt_pigi_has_inode(zt->pigi_put, gw->inode))
+			return;
+		if (!zt->zdo)
+			break;
+	}
+
+	/* Tell them to ... */
+	zt->pigi_put->needs_goosing = true;
+	_goose_get(gw);
+	zt->pigi_put->waiter = gw;
+	if (!zt->zdo)
+		relay_fss_wakeup(&zt->relay);
+}
+
+/* NOTE: @inode must not be NULL */
+void zufc_goose_all_zts(struct zuf_root_info *zri, struct inode *inode)
+{
+	struct _goose_waiter gw;
+
+	if (!S_ISREG(inode->i_mode) || !(inode->i_size || inode->i_blocks))
+		return;
+
+	/* No point in two goosers fighting; we are goosing for everyone.
+	 * This also ensures only one zt->pigi_put->waiter at a time.
+	 */
+	mutex_lock(&zri->sbl_lock);
+
+	gw.zri = zri;
+	kref_init(&gw.kref);
+	gw.inode = (ulong)inode;
+
+	on_each_cpu(_goose_one, &gw, true);
+
+	if (kref_read(&gw.kref) == 1)
+		goto out;
+
+	_goose_put(&gw); /* put kref_init's 1 */
+	_goose_wait(&gw);
+
+out:
+	mutex_unlock(&zri->sbl_lock);
+}
+
 /* ~~~~~ ZT thread operations ~~~~~ */
 
 static int _zu_init(struct file *file, void *parg)
@@ -591,6 +850,24 @@ static int _zu_init(struct file *file, void *parg)
 		goto out;
 	}
 
+	if (zt->chan == 0) {
+		zt->pigi_put = &zt->pigi_put_chan0;
+
+		zt->pigi_put->buff = vmalloc(_ZT_MAX_PIGY_PUT);
+		if (unlikely(!zt->pigi_put->buff)) {
+			vfree(zt->opt_buff);
+			zi_init.hdr.err = -ENOMEM;
+			goto out;
+		}
+		zt->pigi_put->needs_goosing = false;
+		zt->pigi_put->last = zt->pigi_put->s = 0;
+	} else {
+		struct zufc_thread *zt0;
+
+		zt0 = _zt_from_cpu(ZRI(file->f_inode->i_sb), cpu, 0);
+		zt->pigi_put = &zt0->pigi_put_chan0;
+	}
+
 	file->private_data = &zt->hdr;
 out:
 	err = copy_to_user(parg, &zi_init, sizeof(zi_init));
@@ -625,6 +902,9 @@ static void zufc_zt_release(struct file *file)
 		msleep(1000); /* crap */
 	}
 
+	if (zt->chan == 0)
+		vfree(zt->pigi_put->buff);
+
 	vfree(zt->opt_buff);
 	memset(zt, 0, sizeof(*zt));
 }
@@ -706,9 +986,25 @@ static int _copy_outputs(struct zufc_thread *zt, void *arg)
 	}
 }
 
+static bool _need_channel_lock(struct zufc_thread *zt)
+{
+	struct zufs_ioc_IO *ret_io = zt->opt_buff;
+
+	/* Only ZUFS_OP_GET_MULTY is allowed channel locking
+	 * because it absolutely must and I trust the code.
+	 * If you need a new channel locking command come talk
+	 * to me first.
+	 */
+	return	(ret_io->hdr.err == 0) &&
+		(ret_io->hdr.operation == ZUFS_OP_GET_MULTY) &&
+		(ret_io->ret_flags & ZUFS_RET_LOCKED_PUT) &&
+		(ret_io->ziom.iom_n != 0);
+}
+
 static int _zu_wait(struct file *file, void *parg)
 {
 	struct zufc_thread *zt;
+	struct zufs_ioc_hdr *user_hdr;
 	bool __chan_is_locked = false;
 	int err;
 
@@ -730,6 +1026,10 @@ static int _zu_wait(struct file *file, void *parg)
 		goto err;
 	}
 
+	user_hdr = zt->opt_buff;
+	if (user_hdr->flags & ZUFS_H_HAS_PIGY_PUT)
+		user_hdr->flags &= ~ZUFS_H_HAS_PIGY_PUT;
+
 	if (relay_is_app_waiting(&zt->relay)) {
 		if (unlikely(!zt->zdo)) {
 			zuf_err("User has gone...\n");
@@ -751,13 +1051,29 @@ static int _zu_wait(struct file *file, void *parg)
 
 		_unmap_pages(zt, zt->zdo->pages, zt->zdo->nump);
 
-		zt->zdo = NULL;
+		if (unlikely(!err && _need_channel_lock(zt))) {
+			zt->zdo->__locked_zt = zt;
+			__chan_is_locked = true;
+		} else {
+			zt->zdo = NULL;
+		}
 		if (unlikely(err)) /* _copy_outputs returned an err */
 			goto err;
 
 		relay_app_wakeup(&zt->relay);
 	}
 
+	if (zt->pigi_put->needs_goosing && !__chan_is_locked) {
+		/* go do a cycle and come back */
+		_goose_prep(ZRI(file->f_inode->i_sb), zt);
+		return 0;
+	}
+
+	if (zt->pigi_put->waiter) {
+		_goose_put(zt->pigi_put->waiter);
+		zt->pigi_put->waiter = NULL;
+	}
+
 	err = __relay_fss_wait(&zt->relay, __chan_is_locked);
 	if (err)
 		zuf_dbg_err("[%d] relay error: %d\n", zt->no, err);
@@ -770,8 +1086,16 @@ static int _zu_wait(struct file *file, void *parg)
 		 * we should have a bit set in zt->zdo->hdr set per operation.
 		 * TODO: Why this does not work?
 		 */
-		_map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0);
+		_map_pages(zt, zt->zdo->pages, zt->zdo->nump,
+			   zt->zdo->hdr->operation == ZUFS_OP_WRITE);
+		if (zt->pigi_put->s)
+			_pigy_put_add_to_ioc(ZRI(file->f_inode->i_sb), zt);
 	} else {
+		if (zt->pigi_put->needs_goosing) {
+			_goose_prep(ZRI(file->f_inode->i_sb), zt);
+			return 0;
+		}
+
 		/* This Means we were released by _zu_break */
 		zuf_dbg_zus("_zu_break? => %d\n", err);
 		_prep_header_size_op(zt->opt_buff, ZUFS_OP_BREAK, err);
@@ -953,6 +1277,30 @@ static inline struct zu_exec_buff *_ebuff_from_file(struct file *file)
 	return ebuff;
 }
 
+static int _ebuff_bounds_check(struct zu_exec_buff *ebuff, ulong buff,
+			       struct zufs_iomap *ziom,
+			       struct zufs_iomap *user_ziom, void *ziom_end)
+{
+	size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e;
+
+	if (buff != ebuff->vma->vm_start ||
+	    ebuff->vma->vm_end < buff + iom_max_bytes) {
+		WARN_ON_ONCE(1);
+		zuf_err("Executing out of bounds vm_start=0x%lx vm_end=0x%lx buff=0x%lx buff_end=0x%lx\n",
+			ebuff->vma->vm_start, ebuff->vma->vm_end, buff,
+			buff + iom_max_bytes);
+		return -EINVAL;
+	}
+
+	if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max)))
+		return -EINVAL;
+
+	if (unlikely(ziom->iom_max < ziom->iom_n))
+		return -EINVAL;
+
+	return 0;
+}
+
 static int _zu_ebuff_alloc(struct file *file, void *arg)
 {
 	struct zufs_ioc_alloc_buffer ioc_alloc;
@@ -1004,6 +1352,52 @@ static void zufc_ebuff_release(struct file *file)
 	kfree(ebuff);
 }
 
+static int _zu_iomap_exec(struct file *file, void *arg)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zu_exec_buff *ebuff = _ebuff_from_file(file);
+	struct zufs_ioc_iomap_exec ioc_iomap;
+	struct zufs_ioc_iomap_exec *user_iomap;
+
+	struct super_block *sb;
+	int err;
+
+	if (unlikely(!ebuff))
+		return -EINVAL;
+
+	user_iomap = ebuff->opt_buff;
+	/* do all checks on a kernel copy so malicious Server cannot
+	 * crash the Kernel
+	 */
+	ioc_iomap = *user_iomap;
+
+	err = _ebuff_bounds_check(ebuff, (ulong)arg, &ioc_iomap.ziom,
+				  &user_iomap->ziom,
+				  ebuff->opt_buff + ebuff->alloc_size);
+	if (unlikely(err)) {
+		zuf_err("illegal iomap: iom_max=%u iom_n=%u\n",
+			ioc_iomap.ziom.iom_max, ioc_iomap.ziom.iom_n);
+		return err;
+	}
+
+	/* The ID of the super block received in mount */
+	sb = zuf_sb_from_id(zri, ioc_iomap.sb_id, ioc_iomap.zus_sbi);
+	if (unlikely(!sb))
+		return -EINVAL;
+
+	if (ioc_iomap.wait_for_done)
+		err = zuf_iom_execute_sync(sb, NULL, user_iomap->ziom.iom_e,
+					   ioc_iomap.ziom.iom_n);
+	else
+		err = zuf_iom_execute_async(sb, ioc_iomap.ziom.iomb,
+					     user_iomap->ziom.iom_e,
+					     ioc_iomap.ziom.iom_n);
+
+	user_iomap->hdr.err = err;
+	zuf_dbg_core("OUT => %d\n", err);
+	return 0; /* report err at hdr, but the command was executed */
+}
+
 /* ~~~~ ioctl & release handlers ~~~~ */
 static int _zu_register_fs(struct file *file, void *parg)
 {
@@ -1069,6 +1463,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
 		return _zu_wait(file, parg);
 	case ZU_IOC_ALLOC_BUFFER:
 		return _zu_ebuff_alloc(file, parg);
+	case ZU_IOC_IOMAP_EXEC:
+		return _zu_iomap_exec(file, parg);
 	case ZU_IOC_PRIVATE_MOUNT:
 		return _zu_private_mounter(file, parg);
 	case ZU_IOC_BREAK_ALL:
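
Reviewer's note on the pigy_put accumulation in zufc_pigy_put() above: puts
are appended to a per-CPU buffer, the previously last record is flagged as
"has more", and once the buffer would overflow the caller dispatches
immediately. A minimal user-space sketch of that append-and-chain idea
follows. It is an illustration under assumptions, not the patch's layout:
struct rec_hdr, HAS_MORE, BATCH_MAX, batch_add() and batch_count() are
hypothetical stand-ins for zufs_ioc_hdr, ZUFS_H_HAS_PIGY_PUT,
_ZT_MAX_PIGY_PUT and the kernel helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HAS_MORE	0x1u	/* stand-in for ZUFS_H_HAS_PIGY_PUT */
#define BATCH_MAX	64u	/* stand-in for _ZT_MAX_PIGY_PUT */

/* Hypothetical record header; len is header plus payload, in bytes */
struct rec_hdr {
	uint32_t flags;
	uint32_t len;
};

struct batch {
	unsigned char buff[BATCH_MAX];
	unsigned int s;		/* total encoded bytes */
	unsigned int last;	/* offset of the most recent record */
};

/* Returns 0 if the record was queued, 1 if the caller must send it now */
static int batch_add(struct batch *b, const void *payload, uint32_t plen)
{
	struct rec_hdr hdr = { 0, (uint32_t)sizeof(hdr) + plen };

	assert(payload || !plen);

	if (b->s + hdr.len > BATCH_MAX)
		return 1;	/* full: dispatch instead of queueing */

	if (b->s) {
		/* Flag the previously last record as having a successor */
		struct rec_hdr prev;

		memcpy(&prev, b->buff + b->last, sizeof(prev));
		prev.flags |= HAS_MORE;
		memcpy(b->buff + b->last, &prev, sizeof(prev));
	}

	memcpy(b->buff + b->s, &hdr, sizeof(hdr));
	if (plen)
		memcpy(b->buff + b->s + sizeof(hdr), payload, plen);
	b->last = b->s;
	b->s += hdr.len;
	return 0;
}

/* Consumer side: follow the HAS_MORE chain and count the records */
static unsigned int batch_count(const struct batch *b)
{
	unsigned int off = 0, n = 0;

	while (off < b->s) {
		struct rec_hdr h;

		memcpy(&h, b->buff + off, sizeof(h));
		++n;
		off += h.len;
		if (!(h.flags & HAS_MORE))
			break;
	}
	return n;
}
```

Keeping the offset of the last record is what lets the producer set the flag
on the previous entry without rescanning the buffer, which appears to be the
purpose of the "last" field in struct __pigi_put_it.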
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 2d5327e1d2b1..2c57c51a2099 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -402,6 +402,13 @@ static inline int zuf_flt_to_err(vm_fault_t flt)
 	return -EACCES;
 }
 
+struct _io_gb_multy {
+	struct zuf_dispatch_op zdo;
+	struct zufs_ioc_IO IO;
+	ulong iom_n;
+	ulong *bns;
+};
+
 /* Keep this include last thing in file */
 #include "_extern.h"
 
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 2bdf047282e8..e3a783748ce6 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -456,7 +456,15 @@ enum e_zufs_operation {
 	ZUFS_OP_RENAME		= 10,
 	ZUFS_OP_READDIR		= 11,
 
+	ZUFS_OP_READ		= 14,
+	ZUFS_OP_PRE_READ	= 15,
+	ZUFS_OP_WRITE		= 16,
 	ZUFS_OP_SETATTR		= 19,
+	ZUFS_OP_FALLOCATE	= 21,
+
+	ZUFS_OP_GET_MULTY	= 29,
+	ZUFS_OP_PUT_MULTY	= 30,
+	ZUFS_OP_NOOP		= 31,
 
 	ZUFS_OP_MAX_OPT,
 };
@@ -646,10 +654,253 @@ struct zufs_ioc_attr {
 	__u32 pad;
 };
 
+/* ~~~~ io_map structures && IOCTL(s) ~~~~ */
+/*
+ * This set of structures and helpers is used in the return of zufs_ioc_IO
+ * and also at ZU_IOC_IOMAP_EXEC, as a NULL-terminated list (array).
+ *
+ * Each iom_element starts with a __u64 whose 8 high bits carry an
+ * operation_type, and whose 56-bit value denotes a page offset (md_o2p())
+ * or a length. operation_type is one of the ZUFS_IOM_TYPE enum.
+ * The interpreter then jumps to the next operation depending on the size
+ * of the defined operation.
+ */
+
+enum ZUFS_IOM_TYPE {
+	IOM_NONE	= 0,
+	IOM_T1_WRITE	= 1,
+	IOM_T1_READ	= 2,
+
+	IOM_T2_WRITE	= 3,
+	IOM_T2_READ	= 4,
+	IOM_T2_WRITE_LEN = 5,
+	IOM_T2_READ_LEN	= 6,
+
+	IOM_T2_ZUSMEM_WRITE = 7,
+	IOM_T2_ZUSMEM_READ = 8,
+
+	IOM_UNMAP	= 9,
+	IOM_WBINV	= 10,
+	IOM_REPEAT	= 11,
+
+	IOM_NUM_LEGAL_OPT,
+};
+
+#define ZUFS_IOM_VAL_BITS	56
+#define ZUFS_IOM_FIRST_VAL_MASK ((1UL << ZUFS_IOM_VAL_BITS) - 1)
+
+static inline enum ZUFS_IOM_TYPE _zufs_iom_opt_type(__u64 *iom_e)
+{
+	uint ret = (*iom_e) >> ZUFS_IOM_VAL_BITS;
+
+	if (ret >= IOM_NUM_LEGAL_OPT)
+		return IOM_NONE;
+	return (enum ZUFS_IOM_TYPE)ret;
+}
+
+static inline bool _zufs_iom_pop(__u64 *iom_e)
+{
+	return _zufs_iom_opt_type(iom_e) != IOM_NONE;
+}
+
+static inline ulong _zufs_iom_first_val(__u64 *iom_elemets)
+{
+	return *iom_elemets & ZUFS_IOM_FIRST_VAL_MASK;
+}
+
+static inline void _zufs_iom_enc_type_val(__u64 *ptr, enum ZUFS_IOM_TYPE type,
+					 ulong val)
+{
+	*ptr = (__u64)val | ((__u64)type << ZUFS_IOM_VAL_BITS);
+}
+
+static inline ulong _zufs_iom_t1_bn(__u64 val)
+{
+	if (unlikely(_zufs_iom_opt_type(&val) != IOM_T1_READ))
+		return -1;
+
+	return zu_dpp_t_bn(_zufs_iom_first_val(&val));
+}
+
+static inline void _zufs_iom_enc_bn(__u64 *ptr, ulong bn, uint pool)
+{
+	_zufs_iom_enc_type_val(ptr, IOM_T1_READ, zu_enc_dpp_t_bn(bn, pool));
+}
+
+/* IOM_T1_WRITE / IOM_T1_READ
+ * May be followed by an IOM_REPEAT
+ */
+struct zufs_iom_t1_io {
+	/* Special dpp_t that denotes a page, i.e. bn << 3 | zu_dpp_t_pool */
+	__u64	t1_val;
+};
+
+/* IOM_T2_WRITE / IOM_T2_READ */
+struct zufs_iom_t2_io {
+	__u64	t2_val;
+	zu_dpp_t t1_val;
+};
+
+/* IOM_T2_WRITE_LEN / IOM_T2_READ_LEN */
+struct zufs_iom_t2_io_len {
+	struct zufs_iom_t2_io iom;
+	__u64 num_pages;
+};
+
+/* IOM_T2_ZUSMEM_WRITE / IOM_T2_ZUSMEM_READ */
+struct zufs_iom_t2_zusmem_io {
+	__u64	t2_val;
+	__u64	zus_mem_ptr; /* needs a get_user_pages() */
+	__u64	len;
+};
+
+/* IOM_UNMAP:
+ *	Executes unmap_mapping_range & removal of zuf's block-caching
+ *
+ * For now iom_unmap means even_cows=0, because Kernel takes care of all
+ * the cases of the even_cows=1. In future if needed it will be on the high
+ * bit of unmap_n.
+ */
+struct zufs_iom_unmap {
+	__u64	unmap_index;	/* Offset in pages of inode */
+	__u64	unmap_n;	/* Num pages to unmap (0 means: to eof) */
+	__u64	ino;		/* Pages of this inode */
+};
+
+#define ZUFS_WRITE_OP_SPACE						\
+	((sizeof(struct zufs_iom_unmap) +				\
+	  sizeof(struct zufs_iom_t2_io)) / sizeof(__u64) + sizeof(__u64))
+
+struct zus_iomap_build;
+/* For ZUFS_OP_IOM_DONE */
+struct zufs_ioc_iomap_done {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_sb_info *zus_sbi;
+
+	/* The cookie received from zufs_ioc_iomap_exec */
+	struct	zus_iomap_build *iomb;
+};
+
+struct zufs_iomap {
+	/* A cookie from zus to return when execution is done */
+	struct	zus_iomap_build *iomb;
+
+	__u32	iom_max;	/* num of __u64 allocated	 */
+	__u32	iom_n;		/* num of valid __u64 in iom_e	 */
+	__u64	iom_e[0];	/* encoded operations to execute */
+
+	/* This struct must be last */
+};
+
+/*
+ * Execute an iomap on behalf of the Server
+ *
+ * NOTE: this IOCTL must be issued on a file of the ZU_IOC_ALLOC_BUFFER type
+ * (see above), and the passed arg-buffer must be the pointer returned from
+ * an mmap call performed on that file before the call to this IOC.
+ * If this is not done the IOCTL will return EINVAL.
+ */
+struct zufs_ioc_iomap_exec {
+	struct zufs_ioc_hdr hdr;
+	/* The ID of the super block received in mount */
+	__u64	sb_id;
+	/* We verify the sb_id validity against zus_sbi */
+	struct zus_sb_info *zus_sbi;
+	/* If application buffers, they are from this IO */
+	__u64	zt_iocontext;
+	/* Only return from IOCTL when finished. iomap_done NOT called */
+	__u32	wait_for_done;
+	__u32	__pad;
+
+	struct zufs_iomap ziom; /* must be last */
+};
+#define ZU_IOC_IOMAP_EXEC	_IOWR('Z', 19, struct zufs_ioc_iomap_exec)
+
+/*
+ * ZUFS_OP_READ / ZUFS_OP_WRITE / ZUFS_OP_FALLOCATE
+ *       also
+ * ZUFS_OP_GET_MULTY / ZUFS_OP_PUT_MULTY
+ */
+/* flags for zufs_ioc_IO->ret_flags */
+enum {
+	ZUFS_RET_RESERVED	= 0x0001, /* Not used */
+	ZUFS_RET_NEW		= 0x0002, /* In WRITE, allocated a new block */
+	ZUFS_RET_IOM_ALL_PMEM	= 0x0004, /* iom_e[] is encoded with pmem-bn */
+	ZUFS_RET_PUT_NOW	= 0x0008, /* GET_MULTY demands no pigi-puts  */
+	ZUFS_RET_LOCKED_PUT	= 0x0010, /* Same as PUT_NOW but must lock a zt
+					   * channel, Because GET took a lock
+					   */
+};
+
+/* flags for zufs_ioc_IO->rw */
+#define ZUFS_RW_WRITE	BIT(0)	/* SAME as WRITE in Kernel */
+#define ZUFS_RW_MMAP	BIT(1)
+
+#define ZUFS_RW_RAND	BIT(4)	/* fadvise(random) */
+
+/* Same meaning as IOCB_XXXX different bits */
+#define ZUFS_RW_KERN	8
+#define ZUFS_RW_EVENTFD	BIT(ZUFS_RW_KERN + 0)
+#define ZUFS_RW_APPEND	BIT(ZUFS_RW_KERN + 1)
+#define ZUFS_RW_DIRECT	BIT(ZUFS_RW_KERN + 2)
+#define ZUFS_RW_HIPRI	BIT(ZUFS_RW_KERN + 3)
+#define ZUFS_RW_DSYNC	BIT(ZUFS_RW_KERN + 4)
+#define ZUFS_RW_SYNC	BIT(ZUFS_RW_KERN + 5)
+#define ZUFS_RW_NOWAIT	BIT(ZUFS_RW_KERN + 7)
+#define ZUFS_RW_LAST_USED_BIT (ZUFS_RW_KERN + 7)
+/* ^^ PLEASE update (keep last) ^^ */
+
+/* 8 bits left for user */
+#define ZUFS_RW_USER_BITS 0xFF000000
+#define ZUFS_RW_USER	BIT(24)
+
 /* Special flag for ZUFS_OP_FALLOCATE to specify a setattr(SIZE)
  * IE. same as punch hole but set_i_size to be @filepos. In this
  * case @last_pos == ~0ULL
  */
 #define ZUFS_FL_TRUNCATE 0x80000000
 
+struct zufs_ioc_IO {
+	struct zufs_ioc_hdr hdr;
+
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 filepos;
+	__u64 rw;		/* One or more of ZUFS_RW_XXX		*/
+	__u32 ret_flags;	/* OUT - ZUFS_RET_XXX OUT		*/
+	__u32 pool;		/* All dpp_t(s) belong to this pool	*/
+	__u64 cookie;		/* For FS private use			*/
+
+	/* in / OUT */
+	/* For read-ahead (or alloc ahead) */
+	struct __zufs_ra {
+		union {
+			ulong start;
+			__u64 __start;
+		};
+		__u64 prev_pos;
+		__u32 ra_pages;
+		__u32 ra_pad; /* we need this */
+	} ra;
+
+	/* For writes TODO: encode at iom_e? */
+	struct __zufs_write_unmap {
+		__u32  offset;
+		__u32  len;
+	} wr_unmap;
+
+	/* The last offset in this IO. If 0, then error code at .hdr.err */
+	/* for ZUFS_OP_FALLOCATE this is the requested end offset */
+	__u64 last_pos;
+
+	struct zufs_iomap ziom;
+	__u64 iom_e[ZUFS_WRITE_OP_SPACE]; /* One tier_up for WRITE or GB */
+};
+
+static inline uint _ioc_IO_size(uint iom_n)
+{
+	return offsetof(struct zufs_ioc_IO, iom_e) + iom_n * sizeof(__u64);
+}
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.21.0
* [PATCH 12/16] zuf: mmap & sync
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (10 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 11/16] zuf: Write/Read implementation Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 13/16] zuf: More file operation Boaz Harrosh
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
On page-fault call the zusFS for the page information. We always
mmap pmem pages directly (no page cache).
With write-mmap on pmem we need to keep track of dirty inodes
and call the zusFS when one of the sync variants is called.
This is because the Server will need to do a cl_flush on all
dirty pages.
If the inode has no write-mmapped pages, sync does nothing.
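As an illustration only (not part of the patch), the dirty tracking
described above can be modelled in userspace roughly as below. The names
model_inode, model_sync_inc, model_sync_dec and model_needs_sync are
hypothetical. The super-block dirty list is reduced to a flag and the
spinlock the kernel code takes around list updates is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of one inode's mmap-dirty accounting. */
struct model_inode {
	atomic_int write_mapped;	/* pages currently write-mapped */
	bool on_dirty_list;		/* stands in for the sb dirty list */
};

/* A write fault mapped one more page; the first one lists the inode. */
static void model_sync_inc(struct model_inode *mi)
{
	/* atomic_fetch_add returns the old value, so old + 1 is the new one */
	if (atomic_fetch_add(&mi->write_mapped, 1) + 1 == 1)
		mi->on_dirty_list = true;
}

/* A sync reported @unmapped pages flushed; reaching zero delists the inode. */
static void model_sync_dec(struct model_inode *mi, int unmapped)
{
	assert(unmapped >= 0);
	if (atomic_fetch_sub(&mi->write_mapped, unmapped) - unmapped == 0)
		mi->on_dirty_list = false;
}

/* Sync has work to do only while something is write-mapped. */
static bool model_needs_sync(struct model_inode *mi)
{
	return atomic_load(&mi->write_mapped) != 0;
}
```

In this model two write faults followed by one sync that reports two
unmapped pages bring the counter back to zero and take the inode off the
list, which is why the decrement takes a batch count.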
[v2]
  zuf: pmem mmap must be 2M aligned
  We only support huge pages on pmem mmap (2M).
  Prevent mmap on pmem with VM addresses unaligned to 2M.
  [Under valgrind it would try to give us an address that is not
   aligned and bypass the zufr_get_unmapped_area(). By returning
   an error valgrind backs off and everything works again
  ]
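For illustration only, the 2M alignment test amounts to masking off the
low 21 bits of the start address. The sketch below is a userspace model;
the MODEL_* macros and both helpers are hypothetical names, and the 21-bit
shift assumes 2M PMD-sized huge pages as on x86-64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PMD_SHIFT	21	/* assumption: 2M huge pages */
#define MODEL_PMD_SIZE	(UINT64_C(1) << MODEL_PMD_SHIFT)
#define MODEL_PMD_MASK	(~(MODEL_PMD_SIZE - 1))

/* True when @vm_start has no bits set below the 2M boundary. */
static bool model_is_2m_aligned(uint64_t vm_start)
{
	return (vm_start & ~MODEL_PMD_MASK) == 0;
}

/* Round @addr up to the next 2M boundary (no-op if already aligned). */
static uint64_t model_align_up_2m(uint64_t addr)
{
	/* guard against wrap-around near the top of the address space */
	assert(addr <= UINT64_MAX - (MODEL_PMD_SIZE - 1));
	return (addr + MODEL_PMD_SIZE - 1) & MODEL_PMD_MASK;
}
```

A caller that picks its own address hint would round it up first and then
expect the aligned check to pass on the resulting mapping start.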
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   2 +-
 fs/zuf/_extern.h  |   6 +
 fs/zuf/file.c     |  66 ++++++++++
 fs/zuf/inode.c    |  10 ++
 fs/zuf/mmap.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/super.c    |  89 ++++++++++++++
 fs/zuf/t1.c       |   9 ++
 fs/zuf/zuf-core.c |   2 +
 fs/zuf/zuf.h      |   3 +
 fs/zuf/zus_api.h  |  26 ++++
 10 files changed, 512 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/mmap.c
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 23bc3791a001..02df1374a946 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,6 +17,6 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += rw.o
+zuf-y += rw.o mmap.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 745d0cc9e719..cafda97c973c 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -64,8 +64,11 @@ int zuf_private_mount(struct zuf_root_info *zri, struct register_fs_info *rfi,
 int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb);
 struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
 				   struct zus_sb_info *zus_sbi);
+void zuf_sync_inc(struct inode *inode);
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped);
 
 /* file.c */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync);
 long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len);
 
 /* namei.c */
@@ -114,6 +117,9 @@ int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
 int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in,
 			      struct inode *i_out, loff_t pos_out, loff_t len);
 
+/* mmap.c */
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma);
+
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 8711b44371e0..7fcaf085bf8e 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -23,6 +23,70 @@ long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
 	return -ENOTSUPP;
 }
 
+/* This function is called by both msync() and fsync(). */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_sync ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUFS_OP_SYNC,
+		.zus_ii = zii->zus_ii,
+		.offset = start,
+		.flags = datasync ? ZUFS_SF_DATASYNC : 0,
+	};
+	loff_t isize;
+	ulong uend = end + 1;
+	int err = 0;
+
+	zuf_dbg_vfs(
+		"[%ld] start=0x%llx end=0x%llx  datasync=%d write_mapped=%d\n",
+		inode->i_ino, start, end, datasync,
+		atomic_read(&zii->write_mapped));
+
+	/* We want to serialize the syncs so they don't fight with each other,
+	 * which should also be more efficient. But we do not want to lock out
+	 * read/writes and page-faults, so we have a special sync semaphore.
+	 */
+	zuf_smw_lock(zii);
+
+	isize = i_size_read(inode);
+	if (!isize) {
+		zuf_dbg_mmap("[%ld] file is empty\n", inode->i_ino);
+		goto out;
+	}
+	if (isize < uend)
+		uend = isize;
+	if (uend < start) {
+		zuf_dbg_mmap("[%ld] isize=0x%llx start=0x%llx end=0x%lx\n",
+				 inode->i_ino, isize, start, uend);
+		err = -ENODATA;
+		goto out;
+	}
+
+	if (!atomic_read(&zii->write_mapped))
+		goto out; /* Nothing to do on this inode */
+
+	ioc_range.length = uend - start;
+	unmap_mapping_range(inode->i_mapping, start, ioc_range.length, 0);
+	zufc_goose_all_zts(ZUF_ROOT(SBI(inode->i_sb)), inode);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_range.hdr,
+			    NULL, 0);
+	if (unlikely(err))
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+
+	zuf_sync_dec(inode, ioc_range.write_unmapped);
+
+out:
+	zuf_smw_unlock(zii);
+	return err;
+}
+
+static int zuf_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	return zuf_isync(file_inode(file), start, end, datasync);
+}
+
 static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii)
 {
 	struct inode *inode = file_inode(kiocb->ki_filp);
@@ -95,6 +159,8 @@ const struct file_operations zuf_file_operations = {
 	.open			= generic_file_open,
 	.read_iter		= zuf_read_iter,
 	.write_iter		= zuf_write_iter,
+	.mmap			= zuf_file_mmap,
+	.fsync			= zuf_fsync,
 };
 
 const struct inode_operations zuf_file_inode_operations = {
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 27660979ed6f..1e3dba654f34 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -271,6 +271,7 @@ void zuf_evict_inode(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	struct zuf_inode_info *zii = ZUII(inode);
+	int write_mapped;
 
 	if (!inode->i_nlink) {
 		if (unlikely(!zii->zi)) {
@@ -311,6 +312,15 @@ void zuf_evict_inode(struct inode *inode)
 	zii->zus_ii = NULL;
 	zii->zi = NULL;
 
+	/* ZUS on evict has synced all mmap dirty pages, YES? */
+	write_mapped = atomic_read(&zii->write_mapped);
+	if (unlikely(write_mapped || !list_empty(&zii->i_mmap_dirty))) {
+		zuf_dbg_mmap("[%ld] !!!! write_mapped=%d list_empty=%d\n",
+			      inode->i_ino, write_mapped,
+			      list_empty(&zii->i_mmap_dirty));
+		zuf_sync_dec(inode, write_mapped);
+	}
+
 	clear_inode(inode);
 }
 
diff --git a/fs/zuf/mmap.c b/fs/zuf/mmap.c
new file mode 100644
index 000000000000..318c701f7d7d
--- /dev/null
+++ b/fs/zuf/mmap.c
@@ -0,0 +1,300 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * mmap operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/pfn_t.h>
+#include "zuf.h"
+
+/* ~~~ Functions for mmap and page faults ~~~ */
+
+/* MAP_PRIVATE, copy data to user private page (cow_page) */
+static int _cow_private_page(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	int err;
+
+	/* Basically a READ into vmf->cow_page */
+	err = zuf_rw_read_page(sbi, inode, vmf->cow_page,
+			       md_p2o(vmf->pgoff));
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("[%ld] read_page failed bn=0x%lx address=0x%lx => %d\n",
+			inode->i_ino, vmf->pgoff, vmf->address, err);
+		/* FIXME: Probably return VM_FAULT_SIGBUS */
+	}
+
+	/* HACK: This is a hack since Kernel v4.7 where a VM_FAULT_LOCKED with
+	 * vmf->page==NULL is no longer supported. Looks like for now this way
+	 * works well. We let mm mess around with unlocking and putting its own
+	 * cow_page.
+	 */
+	vmf->page = vmf->cow_page;
+	get_page(vmf->page);
+	lock_page(vmf->page);
+
+	return VM_FAULT_LOCKED;
+}
+
+static inline ulong _gb_bn(struct zufs_ioc_IO *get_block)
+{
+	if (unlikely(!get_block->ziom.iom_n))
+		return 0;
+
+	return _zufs_iom_t1_bn(get_block->iom_e[0]);
+}
+
+static vm_fault_t zuf_write_fault(struct vm_area_struct *vma,
+				  struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	ulong bn;
+	struct _io_gb_multy io_gb = {
+		.IO.rw = WRITE | ZUFS_RW_MMAP,
+		.bns = &bn,
+	};
+	vm_fault_t fault = VM_FAULT_SIGBUS;
+	ulong addr = vmf->address;
+	ulong pmem_bn;
+	pgoff_t size;
+	pfn_t pfnt;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		    _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		    vmf->flags, vmf->cow_page, vmf->page);
+
+	sb_start_pagefault(inode->i_sb);
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_dbg_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			    _zi_ino(zi), vmf->pgoff, pgoff, size);
+
+		fault = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	zus_inode_cmtime_now(inode, zi);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _zufs_IO_get_multy(sbi, inode, md_p2o(vmf->pgoff), PAGE_SIZE,
+				 &io_gb);
+	if (unlikely(err)) {
+		zuf_dbg_err("_get_put_block failed => %d\n", err);
+		goto out;
+	}
+	pmem_bn = _gb_bn(&io_gb.IO);
+	if (unlikely(pmem_bn == 0)) {
+		zuf_err("[%ld] pmem_bn=0  rw=0x%llx ret_flags=0x%x but no error?\n",
+			_zi_ino(zi), io_gb.IO.rw, io_gb.IO.ret_flags);
+		fault = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	if (io_gb.IO.ret_flags & ZUFS_RET_NEW) {
+		/* newly created block */
+		inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+	}
+	unmap_mapping_range(inode->i_mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_SIZE, 0);
+
+	pfn = md_pfn(sbi->md, pmem_bn);
+	pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+	fault = vmf_insert_mixed_mkwrite(vma, addr, pfnt);
+	err = zuf_flt_to_err(fault);
+	if (unlikely(err)) {
+		zuf_err("[%ld] vm_insert_mixed_mkwrite failed => fault=0x%x err=%d\n",
+			_zi_ino(zi), (int)fault, err);
+		goto put;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		    _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err);
+
+	zuf_sync_inc(inode);
+put:
+	_zufs_IO_put_multy(sbi, inode, &io_gb);
+out:
+	zuf_smr_unlock(zii);
+	sb_end_pagefault(inode->i_sb);
+	return fault;
+}
+
+static vm_fault_t zuf_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return zuf_write_fault(vmf->vma, vmf);
+}
+
+static vm_fault_t zuf_read_fault(struct vm_area_struct *vma,
+				 struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	ulong bn;
+	struct _io_gb_multy io_gb = {
+		.IO.rw = READ | ZUFS_RW_MMAP,
+		.bns = &bn,
+	};
+	vm_fault_t fault = VM_FAULT_SIGBUS;
+	ulong addr = vmf->address;
+	ulong pmem_bn;
+	pgoff_t size;
+	pfn_t pfnt;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		    _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		    vmf->flags, vmf->cow_page, vmf->page);
+
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_dbg_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			    _zi_ino(zi), vmf->pgoff, pgoff, size);
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		zuf_warn("cow is read\n");
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	file_accessed(vma->vm_file);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _zufs_IO_get_multy(sbi, inode, md_p2o(vmf->pgoff), PAGE_SIZE,
+				 &io_gb);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("_get_put_block failed => %d\n", err);
+		goto out;
+	}
+
+	pmem_bn = _gb_bn(&io_gb.IO);
+	if (pmem_bn == 0) {
+		/* Hole in file */
+		pfnt = pfn_to_pfn_t(my_zero_pfn(vmf->address));
+	} else {
+		/* We have a real page */
+		pfnt = phys_to_pfn_t(PFN_PHYS(md_pfn(sbi->md, pmem_bn)),
+				     PFN_MAP | PFN_DEV);
+	}
+	fault = vmf_insert_mixed(vma, addr, pfnt);
+	err = zuf_flt_to_err(fault);
+	if (unlikely(err)) {
+		zuf_err("[%ld] vm_insert_mixed => fault=0x%x err=%d\n",
+			_zi_ino(zi), (int)fault, err);
+		goto put;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed pmem_bn=0x%lx fault=%d\n",
+		     _zi_ino(zi), pmem_bn, fault);
+
+put:
+	if (pmem_bn)
+		_zufs_IO_put_multy(sbi, inode, &io_gb);
+out:
+	zuf_smr_unlock(zii);
+	return fault;
+}
+
+static vm_fault_t zuf_fault(struct vm_fault *vmf)
+{
+	bool write_fault = (0 != (vmf->flags & FAULT_FLAG_WRITE));
+
+	if (write_fault)
+		return zuf_write_fault(vmf->vma, vmf);
+	else
+		return zuf_read_fault(vmf->vma, vmf);
+}
+
+static void zuf_mmap_open(struct vm_area_struct *vma)
+{
+	struct zuf_inode_info *zii = ZUII(file_inode(vma->vm_file));
+
+	atomic_inc(&zii->vma_count);
+}
+
+static void zuf_mmap_close(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	int vma_count = atomic_dec_return(&ZUII(inode)->vma_count);
+
+	if (unlikely(vma_count < 0))
+		zuf_err("[%ld] WHAT??? vma_count=%d\n",
+			 inode->i_ino, vma_count);
+	else if (unlikely(vma_count == 0)) {
+		struct zuf_inode_info *zii = ZUII(inode);
+		struct zufs_ioc_mmap_close mmap_close = {};
+		int err;
+
+		mmap_close.hdr.operation = ZUFS_OP_MMAP_CLOSE;
+		mmap_close.hdr.in_len = sizeof(mmap_close);
+
+		mmap_close.zus_ii = zii->zus_ii;
+		mmap_close.rw = 0; /* TODO: Do we need this */
+
+		zuf_smr_lock(zii);
+
+		err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &mmap_close.hdr,
+				    NULL, 0);
+		if (unlikely(err))
+			zuf_dbg_err("[%ld] err=%d\n", inode->i_ino, err);
+
+		zuf_smr_unlock(zii);
+	}
+}
+
+static const struct vm_operations_struct zuf_vm_ops = {
+	.fault		= zuf_fault,
+	.pfn_mkwrite	= zuf_pfn_mkwrite,
+	.open           = zuf_mmap_open,
+	.close		= zuf_mmap_close,
+};
+
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	file_accessed(file);
+
+	vma->vm_ops = &zuf_vm_ops;
+
+	atomic_inc(&zii->vma_count);
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index abd7e6cb2a4a..2a0db11b51d6 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -737,6 +737,90 @@ static int zuf_update_s_wtime(struct super_block *sb)
 	return 0;
 }
 
+static void _sync_add_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		      inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+
+	/* Because we are lazy about removing the inodes (only in case of an
+	 * fsync or an evict_inode), it is fine if we are called multiple times.
+	 */
+	if (list_empty(&zii->i_mmap_dirty))
+		list_add(&zii->i_mmap_dirty, &sbi->s_mmap_dirty);
+
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+static void _sync_remove_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		      inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	list_del_init(&zii->i_mmap_dirty);
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+void zuf_sync_inc(struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (1 == atomic_inc_return(&zii->write_mapped))
+		_sync_add_inode(inode);
+}
+
+/* zuf_sync_dec may account for unmapped pages in batches */
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (0 == atomic_sub_return(write_unmapped, &zii->write_mapped))
+		_sync_remove_inode(inode);
+}
+
+/*
+ * We must fsync any mmap-active inodes
+ */
+static int zuf_sync_fs(struct super_block *sb, int wait)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zuf_inode_info *zii, *t;
+	enum {to_clean_size = 120};
+	struct zuf_inode_info *zii_to_clean[to_clean_size];
+	uint i, to_clean;
+
+	zuf_dbg_vfs("Syncing wait=%d\n", wait);
+more_inodes:
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	to_clean = 0;
+	list_for_each_entry_safe(zii, t, &sbi->s_mmap_dirty, i_mmap_dirty) {
+		list_del_init(&zii->i_mmap_dirty);
+		zii_to_clean[to_clean++] = zii;
+		if (to_clean >= to_clean_size)
+			break;
+	}
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+
+	if (!to_clean)
+		return 0;
+
+	for (i = 0; i < to_clean; ++i)
+		zuf_isync(&zii_to_clean[i]->vfs_inode, 0, ~0 - 1, 1);
+
+	if (to_clean == to_clean_size)
+		goto more_inodes;
+
+	return 0;
+}
+
 static struct inode *zuf_alloc_inode(struct super_block *sb)
 {
 	struct zuf_inode_info *zii;
@@ -759,7 +843,11 @@ static void _init_once(void *foo)
 	struct zuf_inode_info *zii = foo;
 
 	inode_init_once(&zii->vfs_inode);
+	INIT_LIST_HEAD(&zii->i_mmap_dirty);
 	zii->zi = NULL;
+	init_rwsem(&zii->in_sync);
+	atomic_set(&zii->vma_count, 0);
+	atomic_set(&zii->write_mapped, 0);
 }
 
 int __init zuf_init_inodecache(void)
@@ -789,6 +877,7 @@ static struct super_operations zuf_sops = {
 	.put_super	= zuf_put_super,
 	.freeze_fs	= zuf_update_s_wtime,
 	.unfreeze_fs	= zuf_update_s_wtime,
+	.sync_fs	= zuf_sync_fs,
 	.statfs		= zuf_statfs,
 	.remount_fs	= zuf_remount,
 	.show_options	= zuf_show_options,
diff --git a/fs/zuf/t1.c b/fs/zuf/t1.c
index 46ea7f6181fc..1f2db5a674d5 100644
--- a/fs/zuf/t1.c
+++ b/fs/zuf/t1.c
@@ -124,6 +124,15 @@ int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!zsf || zsf->type != zlfs_e_pmem)
 		return -EPERM;
 
+	/* Valgrind may interfere with our 2M-aligned mmap vma start
+	 * (See zufr_get_unmapped_area). Tell the guys to back off
+	 */
+	if (unlikely(vma->vm_start & ~PMD_MASK)) {
+		zuf_err("mmap is not 2M aligned vm_start=0x%lx\n",
+				vma->vm_start);
+		return -EINVAL;
+	}
+
 	vma->vm_flags |= VM_HUGEPAGE;
 	vma->vm_ops = &t1_vm_ops;
 
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 11300fd79929..cb4a4def646f 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -99,7 +99,9 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_READ);
 		CASE_ENUM_NAME(ZUFS_OP_PRE_READ);
 		CASE_ENUM_NAME(ZUFS_OP_WRITE);
+		CASE_ENUM_NAME(ZUFS_OP_MMAP_CLOSE);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR);
+		CASE_ENUM_NAME(ZUFS_OP_SYNC);
 
 		CASE_ENUM_NAME(ZUFS_OP_GET_MULTY);
 		CASE_ENUM_NAME(ZUFS_OP_PUT_MULTY);
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 2c57c51a2099..fe479cb70f97 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -132,6 +132,9 @@ struct zuf_inode_info {
 
 	/* Stuff for mmap write */
 	struct rw_semaphore	in_sync;
+	struct list_head	i_mmap_dirty;
+	atomic_t		write_mapped;
+	atomic_t		vma_count;
 
 	/* cookies from Server */
 	struct zus_inode	*zi;
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index e3a783748ce6..e70bd8b7ff69 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -459,7 +459,9 @@ enum e_zufs_operation {
 	ZUFS_OP_READ		= 14,
 	ZUFS_OP_PRE_READ	= 15,
 	ZUFS_OP_WRITE		= 16,
+	ZUFS_OP_MMAP_CLOSE	= 17,
 	ZUFS_OP_SETATTR		= 19,
+	ZUFS_OP_SYNC		= 20,
 	ZUFS_OP_FALLOCATE	= 21,
 
 	ZUFS_OP_GET_MULTY	= 29,
@@ -645,6 +647,13 @@ static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino,
 }
 #endif /* ndef __cplusplus */
 
+struct zufs_ioc_mmap_close {
+	struct zufs_ioc_hdr hdr;
+	 /* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 rw; /* Some flags + READ or WRITE */
+};
+
 /* ZUFS_OP_SETATTR */
 struct zufs_ioc_attr {
 	struct zufs_ioc_hdr hdr;
@@ -654,6 +663,23 @@ struct zufs_ioc_attr {
 	__u32 pad;
 };
 
+/* ZUFS_OP_SYNC */
+enum ZUFS_SYNC_FLAGS {
+	ZUFS_SF_DATASYNC		= 0x00000001,
+	ZUFS_SF_DONTNEED		= 0x00000100,
+};
+
+struct zufs_ioc_sync {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 offset, length;
+	__u64 flags;
+
+	/* OUT */
+	__u64 write_unmapped;
+};
+
 /* ~~~~ io_map structures && IOCTL(s) ~~~~ */
 /*
  * These set of structures and helpers are used in return of zufs_ioc_IO and
-- 
2.21.0
* [PATCH 13/16] zuf: More file operation
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (11 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 12/16] zuf: mmap & sync Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 14/16] zuf: ioctl implementation Boaz Harrosh
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
Add more file/inode operations:
vector			function		operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.llseek			zuf_llseek		ZUFS_OP_LLSEEK
.fallocate		zuf_fallocate		ZUFS_OP_FALLOCATE
.copy_file_range	zuf_copy_file_range	ZUFS_OP_COPY
.remap_file_range	zuf_clone_file_range	ZUFS_OP_CLONE
.fadvise		zuf_fadvise		(multiple see rw.c)
.fiemap			zuf_fiemap		ZUFS_OP_FIEMAP
See more comments in source code.
[v2]
  SQUASHME zuf: fadvise fix up missing operations
  Mainly there was a bug found by Vlad: POSIX_FADV_RANDOM was
  missing and therefore was returning an error, and some tests were
  failing.
  But while at it actually implement all the missing advice. Just
  punch into file->ra the proper flags.
  FIXME:  There is a pending patch by Jan to export generic_fadvise
	  for now duplicate what we need inline.
[v3]
  zuf: lock two zii fix
[v4]
  zuf: Reduce stack usage (fiemap)
  Same as for IO use the big_alloc to prevent compilation warning
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h  |   3 +
 fs/zuf/file.c     | 650 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/rw.c       |  92 +++++++
 fs/zuf/zuf-core.c |   5 +
 fs/zuf/zus_api.h  |  83 ++++++
 5 files changed, 832 insertions(+), 1 deletion(-)
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index cafda97c973c..2c7456724ef6 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -110,6 +110,9 @@ int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode,
 void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode,
 			struct _io_gb_multy *io_gb);
 int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t offset, loff_t len);
+int zuf_rw_fadvise(struct super_block *sb, struct file *file,
+		   loff_t offset, loff_t len, int advise, bool rand);
+
 int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
 			 __u64 *iom_e, uint iom_n);
 int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 7fcaf085bf8e..1c51529694e7 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -15,12 +15,158 @@
 
 #include <linux/fs.h>
 #include <linux/uio.h>
+#include <linux/falloc.h>
+#include <linux/fadvise.h>
+#include <linux/sched/signal.h>
 
 #include "zuf.h"
 
 long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
 {
-	return -ENOTSUPP;
+	struct zuf_inode_info *zii = ZUII(inode);
+	bool need_len_check, need_unmap;
+	loff_t unmap_len = 0; /* 0 means all file */
+	loff_t new_size = len + offset;
+	loff_t i_size = i_size_read(inode);
+	int err = 0;
+
+	zuf_dbg_vfs("[%ld] mode=0x%x offset=0x%llx len=0x%llx\n",
+		     inode->i_ino, mode, offset, len);
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+	if (IS_SWAPFILE(inode))
+		return -ETXTBSY;
+
+	/* These are all the FL flags we know how to handle on the kernel side.
+	 * A zusFS that does not support one of these can just return
+	 * EOPNOTSUPP.
+	 */
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+		     FALLOC_FL_NO_HIDE_STALE | FALLOC_FL_COLLAPSE_RANGE |
+		     FALLOC_FL_ZERO_RANGE | FALLOC_FL_INSERT_RANGE |
+		     FALLOC_FL_UNSHARE_RANGE | ZUFS_FL_TRUNCATE)){
+		zuf_dbg_err("Unsupported mode(0x%x)\n", mode);
+		return -EOPNOTSUPP;
+	}
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		need_len_check = false;
+		need_unmap = true;
+		unmap_len = len;
+	} else if (mode & ZUFS_FL_TRUNCATE) {
+		need_len_check = true;
+		new_size = offset;
+		need_unmap = true;
+	} else if (mode & FALLOC_FL_COLLAPSE_RANGE) {
+		need_len_check = false;
+		need_unmap = true;
+	} else if (mode & FALLOC_FL_INSERT_RANGE) {
+		need_len_check = true;
+		new_size = i_size + len;
+		need_unmap = true;
+	} else if (mode & FALLOC_FL_ZERO_RANGE) {
+		need_len_check = !(mode & FALLOC_FL_KEEP_SIZE);
+		need_unmap = true;
+	} else {
+		/* FALLOC_FL_UNSHARE_RANGE same as regular */
+		need_len_check = !(mode & FALLOC_FL_KEEP_SIZE);
+		need_unmap = false;
+	}
+
+	if (need_len_check && (new_size > i_size)) {
+		err = inode_newsize_ok(inode, new_size);
+		if (unlikely(err)) {
+			zuf_dbg_err("inode_newsize_ok(0x%llx) => %d\n",
+				    new_size, err);
+			goto out;
+		}
+	}
+
+	if (need_unmap) {
+		zufc_goose_all_zts(ZUF_ROOT(SBI(inode->i_sb)), inode);
+		unmap_mapping_range(inode->i_mapping, offset, unmap_len, 1);
+	}
+
+	zus_inode_cmtime_now(inode, zii->zi);
+
+	err = zuf_rw_fallocate(inode, mode, offset, len);
+
+	/* Even if we had an error these might have changed */
+	i_size_write(inode, le64_to_cpu(zii->zi->i_size));
+	inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+
+out:
+	return err;
+}
+
+static long zuf_fallocate(struct file *file, int mode, loff_t offset,
+			  loff_t len)
+{
+	struct inode *inode = file->f_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	int err;
+
+	zuf_w_lock(zii);
+
+	err = __zuf_fallocate(inode, mode, offset, len);
+
+	zuf_w_unlock(zii);
+	return err;
+}
+
+static loff_t zuf_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_seek ioc_seek = {
+		.hdr.in_len = sizeof(ioc_seek),
+		.hdr.out_len = sizeof(ioc_seek),
+		.hdr.operation = ZUFS_OP_LLSEEK,
+		.zus_ii = zii->zus_ii,
+		.offset_in = offset,
+		.whence = whence,
+	};
+	int err = 0;
+
+	zuf_dbg_vfs("[%ld] offset=0x%llx whence=%d\n",
+		     inode->i_ino, offset, whence);
+
+	if (whence != SEEK_DATA && whence != SEEK_HOLE)
+		return generic_file_llseek(file, offset, whence);
+
+	zuf_r_lock(zii);
+
+	if ((offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) ||
+	    offset > inode->i_sb->s_maxbytes) {
+		err = -EINVAL;
+		goto out;
+	} else if (inode->i_size <= offset) {
+		err = -ENXIO;
+		goto out;
+	} else if (!inode->i_blocks) {
+		if (whence == SEEK_HOLE)
+			ioc_seek.offset_out = i_size_read(inode);
+		else
+			err = -ENXIO;
+		goto out;
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_seek.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	if (ioc_seek.offset_out != file->f_pos) {
+		file->f_pos = ioc_seek.offset_out;
+		file->f_version = 0;
+	}
+
+out:
+	zuf_r_unlock(zii);
+
+	return err ?: ioc_seek.offset_out;
 }
 
 /* This function is called by both msync() and fsync(). */
@@ -87,6 +233,481 @@ static int zuf_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	return zuf_isync(file_inode(file), start, end, datasync);
 }
 
+/* This callback is called when a file is closed */
+static int zuf_flush(struct file *file, fl_owner_t id)
+{
+	zuf_dbg_vfs("[%ld]\n", file->f_inode->i_ino);
+	return 0;
+}
+
+static int zuf_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		      u64 offset, u64 length)
+{
+	struct super_block *sb = inode->i_sb;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_fiemap ioc_fiemap = {
+		.hdr.operation = ZUFS_OP_FIEMAP,
+		.hdr.in_len = sizeof(ioc_fiemap),
+		.hdr.out_len = sizeof(ioc_fiemap),
+		.zus_ii = zii->zus_ii,
+		.start = offset,
+		.length = length,
+		.flags = fieinfo->fi_flags,
+	};
+	long on_stack[ZUF_MAX_STACK(160) / sizeof(long)];
+	struct page **pages = NULL;
+	enum big_alloc_type bat = 0;
+	uint nump = 0, extents_max = 0;
+	int i, err;
+
+	zuf_dbg_vfs("[%ld] offset=0x%llx len=0x%llx extents_max=%u flags=0x%x\n",
+		    inode->i_ino, offset, length, fieinfo->fi_extents_max,
+		    fieinfo->fi_flags);
+
+	/* TODO: Have support for FIEMAP_FLAG_XATTR */
+	err = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC);
+	if (unlikely(err))
+		return err;
+
+	if (likely(fieinfo->fi_extents_max)) {
+		ulong start = (ulong)fieinfo->fi_extents_start;
+		ulong len = fieinfo->fi_extents_max *
+						sizeof(struct fiemap_extent);
+		ulong offset = start & (PAGE_SIZE - 1);
+		ulong end_offset = (offset + len) & (PAGE_SIZE - 1);
+		ulong __len;
+		uint nump_r;
+
+		nump = md_o2p_up(offset + len);
+		if (ZUS_API_MAP_MAX_PAGES < nump)
+			nump = ZUS_API_MAP_MAX_PAGES;
+
+		__len = nump * PAGE_SIZE - offset;
+		if (end_offset)
+			__len -= (PAGE_SIZE - end_offset);
+
+		extents_max = __len / sizeof(struct fiemap_extent);
+
+		ioc_fiemap.hdr.len = extents_max * sizeof(struct fiemap_extent);
+		ioc_fiemap.hdr.offset = offset;
+
+		pages = big_alloc(nump * sizeof(*pages), sizeof(on_stack),
+				  on_stack, GFP_KERNEL, &bat);
+		if (unlikely(!pages))
+			return -ENOMEM;
+
+		nump_r = get_user_pages_fast(start, nump, WRITE, pages);
+		if (unlikely(nump != nump_r)) {
+			err = -EFAULT;
+			goto free;
+		}
+	}
+	ioc_fiemap.extents_max = extents_max;
+
+	zuf_r_lock(zii);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_fiemap.hdr, pages, nump);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	fieinfo->fi_extents_mapped = ioc_fiemap.extents_mapped;
+	if (unlikely(extents_max &&
+		     (extents_max < ioc_fiemap.extents_mapped))) {
+		zuf_err("extents_max=%d extents_mapped=%d\n", extents_max,
+			ioc_fiemap.extents_mapped);
+		err = -EINVAL;
+	}
+
+out:
+	zuf_r_unlock(zii);
+
+	for (i = 0; i < nump; ++i)
+		put_page(pages[i]);
+free:
+	big_free(pages, bat);
+
+	return err;
+}
+
+/* ~~~~~ clone/copy range ~~~~~ */
+
+/*
+ * Copy/paste from Kernel mm/filemap.c::generic_remap_checks
+ * FIXME: make it EXPORT_GPL
+ */
+static int _access_check_limits(struct file *file, loff_t pos,
+				       loff_t *count)
+{
+	struct inode *inode = file->f_mapping->host;
+	loff_t max_size = inode->i_sb->s_maxbytes;
+
+	if (!(file->f_flags & O_LARGEFILE))
+		max_size = MAX_NON_LFS;
+
+	if (unlikely(pos >= max_size))
+		return -EFBIG;
+	*count = min(*count, max_size - pos);
+	return 0;
+}
+
+static int _write_check_limits(struct file *file, loff_t pos,
+				      loff_t *count)
+{
+
+	loff_t limit = rlimit(RLIMIT_FSIZE);
+
+	if (limit != RLIM_INFINITY) {
+		if (pos >= limit) {
+			send_sig(SIGXFSZ, current, 0);
+			return -EFBIG;
+		}
+		*count = min(*count, limit - pos);
+	}
+
+	return _access_check_limits(file, pos, count);
+}
+
+static int _remap_checks(struct file *file_in, loff_t pos_in,
+			 struct file *file_out, loff_t pos_out,
+			 loff_t *req_count, unsigned int remap_flags)
+{
+	struct inode *inode_in = file_in->f_mapping->host;
+	struct inode *inode_out = file_out->f_mapping->host;
+	uint64_t count = *req_count;
+	uint64_t bcount;
+	loff_t size_in, size_out;
+	loff_t bs = inode_out->i_sb->s_blocksize;
+	int ret;
+
+	/* The start of both ranges must be aligned to an fs block. */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (pos_in + count < pos_in || pos_out + count < pos_out)
+		return -EINVAL;
+
+	size_in = i_size_read(inode_in);
+	size_out = i_size_read(inode_out);
+
+	/* Dedupe requires both ranges to be within EOF. */
+	if ((remap_flags & REMAP_FILE_DEDUP) &&
+	    (pos_in >= size_in || pos_in + count > size_in ||
+	     pos_out >= size_out || pos_out + count > size_out))
+		return -EINVAL;
+
+	/* Ensure the infile range is within the infile. */
+	if (pos_in >= size_in)
+		return -EINVAL;
+	count = min(count, size_in - (uint64_t)pos_in);
+
+	ret = _access_check_limits(file_in, pos_in, &count);
+	if (ret)
+		return ret;
+
+	ret = _write_check_limits(file_out, pos_out, &count);
+	if (ret)
+		return ret;
+
+	/*
+	 * If the user wanted us to link to the infile's EOF, round up to the
+	 * next block boundary for this check.
+	 *
+	 * Otherwise, make sure the count is also block-aligned, having
+	 * already confirmed the starting offsets' block alignment.
+	 */
+	if (pos_in + count == size_in) {
+		bcount = ALIGN(size_in, bs) - pos_in;
+	} else {
+		if (!IS_ALIGNED(count, bs))
+			count = ALIGN_DOWN(count, bs);
+		bcount = count;
+	}
+
+	/* Don't allow overlapped cloning within the same file. */
+	if (inode_in == inode_out &&
+	    pos_out + bcount > pos_in &&
+	    pos_out < pos_in + bcount)
+		return -EINVAL;
+
+	/*
+	 * We shortened the request but the caller can't deal with that, so
+	 * bounce the request back to userspace.
+	 */
+	if (*req_count != count && !(remap_flags & REMAP_FILE_CAN_SHORTEN))
+		return -EINVAL;
+
+	*req_count = count;
+	return 0;
+}
+
+/*
+ * Copy/paste from generic_remap_file_range_prep(). We cannot call
+ * generic_remap_file_range_prep() directly because it calls fsync twice and
+ * we do not want that many round trips to the Server.
+ * So below are just the checks.
+ * FIXME: Send a patch upstream that either splits
+ * generic_remap_file_range_prep() or adds a flag to skip the syncs.
+ *
+ * Check that the two inodes are eligible for cloning, the ranges make
+ * sense.
+ *
+ * If there's an error, then the usual negative error code is returned.
+ * Otherwise returns 0 with *len set to the request length.
+ */
+static int _remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *len, unsigned int remap_flags)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	int ret;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode_out))
+		return -EPERM;
+
+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+		return -ETXTBSY;
+
+	/* Don't reflink dirs, pipes, sockets... */
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EINVAL;
+
+	/* Zero length dedupe exits immediately; reflink goes to EOF. */
+	if (*len == 0) {
+		loff_t isize = i_size_read(inode_in);
+
+		if ((remap_flags & REMAP_FILE_DEDUP) || pos_in == isize)
+			return 0;
+		if (pos_in > isize)
+			return -EINVAL;
+		*len = isize - pos_in;
+		if (*len == 0)
+			return 0;
+	}
+
+	/* Check that we don't violate system file offset limits. */
+	ret = _remap_checks(file_in, pos_in, file_out, pos_out, len,
+			    remap_flags);
+	if (ret)
+		return ret;
+
+	/*
+	 * REMAP_FILE_DEDUP see if extents are the same.
+	 */
+	if (remap_flags & REMAP_FILE_DEDUP)
+		ret = zuf_rw_file_range_compare(inode_in, pos_in,
+						inode_out, pos_out, *len);
+
+	return ret;
+}
+
+static void _lock_two_ziis(struct zuf_inode_info *zii1,
+			   struct zuf_inode_info *zii2)
+{
+	if (zii1 > zii2)
+		swap(zii1, zii2);
+
+	zuf_w_lock(zii1);
+	if (zii1 != zii2)
+		zuf_w_lock_nested(zii2);
+}
+
+static void _unlock_two_ziis(struct zuf_inode_info *zii1,
+		      struct zuf_inode_info *zii2)
+{
+	if (zii1 > zii2)
+		swap(zii1, zii2);
+
+	if (zii1 != zii2)
+		zuf_w_unlock(zii2);
+	zuf_w_unlock(zii1);
+}
+
+static int _clone_file_range(struct inode *src_inode, loff_t pos_in,
+			     struct file *file_out,
+			     struct inode *dst_inode, loff_t pos_out,
+			     u64 len, u64 len_up, int operation)
+{
+	struct zuf_inode_info *src_zii = ZUII(src_inode);
+	struct zuf_inode_info *dst_zii = ZUII(dst_inode);
+	struct zus_inode *dst_zi = dst_zii->zi;
+	struct super_block *sb = src_inode->i_sb;
+	struct zufs_ioc_clone ioc_clone = {
+		.hdr.in_len = sizeof(ioc_clone),
+		.hdr.out_len = sizeof(ioc_clone),
+		.hdr.operation = operation,
+		.src_zus_ii = src_zii->zus_ii,
+		.dst_zus_ii = dst_zii->zus_ii,
+		.pos_in = pos_in,
+		.pos_out = pos_out,
+		.len = len,
+		.len_up = len_up,
+	};
+	int err;
+
+	/* NOTE: len==0 means to-end-of-file which is what we want */
+	unmap_mapping_range(src_inode->i_mapping, pos_in,  len, 0);
+	unmap_mapping_range(dst_inode->i_mapping, pos_out, len, 0);
+
+	zufc_goose_all_zts(ZUF_ROOT(SBI(dst_inode->i_sb)), dst_inode);
+
+	if ((len_up == 0) && (pos_in || pos_out)) {
+		zuf_err("Boaz Smoking 0x%llx 0x%llx 0x%llx\n",
+			pos_in, pos_out, len);
+		/* Bad caller */
+		return -EINVAL;
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_clone.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_dbg_err("failed to clone %ld -> %ld ; err=%d\n",
+			 src_inode->i_ino, dst_inode->i_ino, err);
+		return err;
+	}
+
+	dst_inode->i_blocks = le64_to_cpu(dst_zi->i_blocks);
+	i_size_write(dst_inode, le64_to_cpu(dst_zi->i_size));
+
+	return err;
+}
+
+/* FIXME: Old checks are not needed. I keep them to make sure they
+ * are not complaining. Will remove _zuf_old_checks SOON
+ */
+static int _zuf_old_checks(struct super_block *sb,
+			   struct inode *src_inode, loff_t pos_in,
+			   struct inode *dst_inode, loff_t pos_out, loff_t len)
+{
+	if (src_inode == dst_inode) {
+		if (pos_in == pos_out) {
+			zuf_warn("[%ld] Clone nothing!!\n",
+				    src_inode->i_ino);
+			return 0;
+		}
+		if (pos_in < pos_out) {
+			if (pos_in + len > pos_out) {
+				zuf_warn("[%ld] overlapping pos_in < pos_out?? => EINVAL\n",
+					 src_inode->i_ino);
+				return -EINVAL;
+			}
+		} else {
+			if (pos_out + len > pos_in) {
+				zuf_warn("[%ld] overlapping pos_out < pos_in?? => EINVAL\n",
+					 src_inode->i_ino);
+				return -EINVAL;
+			}
+		}
+	}
+
+	if ((pos_in & (sb->s_blocksize - 1)) ||
+	    (pos_out & (sb->s_blocksize - 1))) {
+		zuf_err("[%ld] Not aligned len=0x%llx pos_in=0x%llx "
+			"pos_out=0x%llx src-size=0x%llx dst-size=0x%llx\n",
+			 src_inode->i_ino, len, pos_in, pos_out,
+			 i_size_read(src_inode), i_size_read(dst_inode));
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static loff_t zuf_clone_file_range(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				loff_t len, uint remap_flags)
+{
+	struct inode *src_inode = file_inode(file_in);
+	struct inode *dst_inode = file_inode(file_out);
+	struct zuf_inode_info *src_zii = ZUII(src_inode);
+	struct zuf_inode_info *dst_zii = ZUII(dst_inode);
+	ulong src_size = i_size_read(src_inode);
+	ulong dst_size = i_size_read(dst_inode);
+	struct super_block *sb = src_inode->i_sb;
+	ulong len_up;
+	int err;
+
+	zuf_dbg_vfs("IN: [%ld]{0x%llx} => [%ld]{0x%llx} length=0x%llx flags=0x%x\n",
+		    src_inode->i_ino, pos_in, dst_inode->i_ino, pos_out, len,
+		    remap_flags);
+
+	if (remap_flags & ~(REMAP_FILE_CAN_SHORTEN | REMAP_FILE_DEDUP)) {
+		/* New flags we do not know */
+		zuf_dbg_err("[%ld] Unknown remap_flags(0x%x)\n",
+			    src_inode->i_ino, remap_flags);
+		return -EINVAL;
+	}
+
+	if ((pos_in + len > sb->s_maxbytes) || (pos_out + len > sb->s_maxbytes))
+		return -EINVAL;
+
+	_lock_two_ziis(src_zii, dst_zii);
+
+	err = _remap_file_range_prep(file_in, pos_in, file_out, pos_out, &len,
+				     remap_flags);
+	if (err < 0 || len == 0)
+		goto out;
+	err = _zuf_old_checks(sb, src_inode, pos_in, dst_inode, pos_out, len);
+	if (unlikely(err))
+		goto out;
+
+	err = file_remove_privs(file_out);
+	if (unlikely(err))
+		goto out;
+
+	if (!(remap_flags & REMAP_FILE_DEDUP))
+		zus_inode_cmtime_now(dst_inode, dst_zii->zi);
+
+	/* See about all-file-clone optimization */
+	len_up = len;
+	if (!pos_in && !pos_out && (src_size <= pos_in + len) &&
+	    (dst_size <= src_size)) {
+		len_up = 0;
+	} else if (len & (sb->s_blocksize - 1)) {
+		/* un-aligned len, see if it is beyond EOF */
+		if ((src_size > pos_in  + len) ||
+		    (dst_size > pos_out + len)) {
+			zuf_err("[%ld][%ld] Not aligned len=0x%llx pos_in=0x%llx "
+				"pos_out=0x%llx src-size=0x%lx dst-size=0x%lx\n",
+				src_inode->i_ino, dst_inode->i_ino, len,
+				pos_in, pos_out, src_size, dst_size);
+			err = -EINVAL;
+			goto out;
+		}
+		len_up = md_p2o(md_o2p_up(len));
+	}
+
+	err = _clone_file_range(src_inode, pos_in, file_out, dst_inode, pos_out,
+				len, len_up, ZUFS_OP_CLONE);
+	if (unlikely(err))
+		zuf_dbg_err("_clone_file_range failed => %d\n", err);
+
+out:
+	_unlock_two_ziis(src_zii, dst_zii);
+	return err ? err : len;
+}
+
+static ssize_t zuf_copy_file_range(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   size_t len, uint flags)
+{
+	struct inode *src_inode = file_inode(file_in);
+	struct inode *dst_inode = file_inode(file_out);
+	ssize_t ret;
+
+	zuf_dbg_vfs("ino-in=%ld ino-out=%ld pos_in=0x%llx pos_out=0x%llx length=0x%lx\n",
+		    src_inode->i_ino, dst_inode->i_ino, pos_in, pos_out, len);
+
+	ret = zuf_clone_file_range(file_in, pos_in, file_out, pos_out, len,
+				   REMAP_FILE_ADVISORY);
+
+	return ret ?: len;
+}
+
 static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii)
 {
 	struct inode *inode = file_inode(kiocb->ki_filp);
@@ -155,16 +776,43 @@ static ssize_t zuf_write_iter(struct kiocb *kiocb, struct iov_iter *ii)
 	return ret;
 }
 
+static int zuf_fadvise(struct file *file, loff_t offset, loff_t len,
+		       int advise)
+{
+	struct inode *inode = file_inode(file);
+	struct zuf_inode_info *zii = ZUII(inode);
+	int err;
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	zuf_r_lock(zii);
+
+	err = zuf_rw_fadvise(inode->i_sb, file, offset, len, advise,
+			     file->f_mode & FMODE_RANDOM);
+
+	zuf_r_unlock(zii);
+
+	return err;
+}
+
 const struct file_operations zuf_file_operations = {
 	.open			= generic_file_open,
 	.read_iter		= zuf_read_iter,
 	.write_iter		= zuf_write_iter,
 	.mmap			= zuf_file_mmap,
 	.fsync			= zuf_fsync,
+	.llseek			= zuf_llseek,
+	.flush			= zuf_flush,
+	.fallocate		= zuf_fallocate,
+	.copy_file_range	= zuf_copy_file_range,
+	.remap_file_range	= zuf_clone_file_range,
+	.fadvise		= zuf_fadvise,
 };
 
 const struct inode_operations zuf_file_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.fiemap		= zuf_fiemap,
 };
diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c
index 48f584e71a03..60b7a3e07e17 100644
--- a/fs/zuf/rw.c
+++ b/fs/zuf/rw.c
@@ -664,6 +664,98 @@ ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
 			ii, kiocb, kiocb_ra(kiocb), ZUFS_OP_WRITE, rw);
 }
 
+static int _fadv_willneed(struct super_block *sb, struct inode *inode,
+			  loff_t offset, loff_t len, bool rand)
+{
+	struct zufs_ioc_IO io = {};
+	struct __zufs_ra ra = {
+		.start = md_o2p(offset),
+		.ra_pages = md_o2p_up(len),
+		.prev_pos = offset - 1,
+	};
+	int err;
+
+	io.ra.start = ra.start;
+	io.ra.ra_pages = ra.ra_pages;
+	io.ra.prev_pos = ra.prev_pos;
+	io.rw = rand ? ZUFS_RW_RAND : 0;
+
+	err = _IO_dispatch(SBI(sb), &io, ZUII(inode), ZUFS_OP_PRE_READ, 0,
+			   NULL, 0, offset, 0);
+	return err;
+}
+
+static int _fadv_dontneed(struct super_block *sb, struct inode *inode,
+			  loff_t offset, loff_t len)
+{
+	struct zufs_ioc_sync ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUFS_OP_SYNC,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.offset = offset,
+		.length = len,
+		.flags = ZUFS_SF_DONTNEED,
+	};
+
+	return zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_range.hdr, NULL, 0);
+}
+
+/* FIXME: There is a pending patch from Jan Kara to export generic_fadvise.
+ * Until then, duplicate here what we need.
+ */
+#include <linux/backing-dev.h>
+
+static int _generic_fadvise(struct file *file, loff_t offset, loff_t len,
+			    int advise)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(file_inode(file));
+
+	switch (advise) {
+	case POSIX_FADV_NORMAL:
+		file->f_ra.ra_pages = bdi->ra_pages;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
+		break;
+	case POSIX_FADV_RANDOM:
+		spin_lock(&file->f_lock);
+		file->f_mode |= FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
+		break;
+	case POSIX_FADV_SEQUENTIAL:
+		file->f_ra.ra_pages = bdi->ra_pages * 2;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
+		break;
+	case POSIX_FADV_NOREUSE:
+		break;
+	}
+
+	return 0;
+}
+
+int zuf_rw_fadvise(struct super_block *sb, struct file *file,
+		   loff_t offset, loff_t len, int advise, bool rand)
+{
+	switch (advise) {
+	case POSIX_FADV_WILLNEED:
+		return _fadv_willneed(sb, file_inode(file), offset, len, rand);
+	case POSIX_FADV_DONTNEED:
+		return _fadv_dontneed(sb, file_inode(file), offset, len);
+
+	case POSIX_FADV_SEQUENTIAL:
+	case POSIX_FADV_NORMAL:
+	case POSIX_FADV_RANDOM:
+	case POSIX_FADV_NOREUSE:
+		return _generic_fadvise(file, offset, len, advise);
+	default:
+		zuf_warn("Unknown advise %d\n", advise);
+		return -EINVAL;
+	}
+	return -EINVAL;
+}
+
 /* ~~~~ iom_dec.c ~~~ */
 /* for now here (at rw.c) looks logical */
 
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index cb4a4def646f..4284d2298906 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -95,6 +95,8 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY);
 		CASE_ENUM_NAME(ZUFS_OP_RENAME);
 		CASE_ENUM_NAME(ZUFS_OP_READDIR);
+		CASE_ENUM_NAME(ZUFS_OP_CLONE);
+		CASE_ENUM_NAME(ZUFS_OP_COPY);
 
 		CASE_ENUM_NAME(ZUFS_OP_READ);
 		CASE_ENUM_NAME(ZUFS_OP_PRE_READ);
@@ -102,6 +104,9 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_MMAP_CLOSE);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR);
 		CASE_ENUM_NAME(ZUFS_OP_SYNC);
+		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE);
+		CASE_ENUM_NAME(ZUFS_OP_LLSEEK);
+		CASE_ENUM_NAME(ZUFS_OP_FIEMAP);
 
 		CASE_ENUM_NAME(ZUFS_OP_GET_MULTY);
 		CASE_ENUM_NAME(ZUFS_OP_PUT_MULTY);
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index e70bd8b7ff69..c8bcb6006fab 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -455,6 +455,8 @@ enum e_zufs_operation {
 	ZUFS_OP_REMOVE_DENTRY	= 9,
 	ZUFS_OP_RENAME		= 10,
 	ZUFS_OP_READDIR		= 11,
+	ZUFS_OP_CLONE		= 12,
+	ZUFS_OP_COPY		= 13,
 
 	ZUFS_OP_READ		= 14,
 	ZUFS_OP_PRE_READ	= 15,
@@ -463,6 +465,8 @@ enum e_zufs_operation {
 	ZUFS_OP_SETATTR		= 19,
 	ZUFS_OP_SYNC		= 20,
 	ZUFS_OP_FALLOCATE	= 21,
+	ZUFS_OP_LLSEEK		= 22,
+	ZUFS_OP_FIEMAP		= 28,
 
 	ZUFS_OP_GET_MULTY	= 29,
 	ZUFS_OP_PUT_MULTY	= 30,
@@ -680,6 +684,85 @@ struct zufs_ioc_sync {
 	__u64 write_unmapped;
 };
 
+/* ZUFS_OP_CLONE */
+struct zufs_ioc_clone {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *src_zus_ii;
+	struct zus_inode_info *dst_zus_ii;
+	__u64 pos_in, pos_out;
+	__u64 len;
+	__u64 len_up;
+};
+
+/* ZUFS_OP_LLSEEK */
+struct zufs_ioc_seek {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 offset_in;
+	__u32 whence;
+	__u32 pad;
+
+	/* OUT */
+	__u64 offset_out;
+};
+
+/* ZUFS_OP_FIEMAP */
+struct zufs_ioc_fiemap {
+	struct zufs_ioc_hdr hdr;
+
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64	start;
+	__u64	length;
+	__u32	flags;
+	__u32	extents_max;
+
+	/* OUT */
+	__u32	extents_mapped;
+	__u32	pad;
+
+} __packed;
+
+struct zufs_fiemap_extent_info {
+	struct fiemap_extent *fi_extents_start;
+	__u32 fi_flags;
+	__u32 fi_extents_mapped;
+	__u32 fi_extents_max;
+	__u32 __pad;
+};
+
+static inline
+int zufs_fiemap_fill_next_extent(struct zufs_fiemap_extent_info *fieinfo,
+				 __u64 logical, __u64 phys,
+				 __u64 len, __u32 flags)
+{
+	struct fiemap_extent *dest = fieinfo->fi_extents_start;
+
+	if (fieinfo->fi_extents_max == 0) {
+		fieinfo->fi_extents_mapped++;
+		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
+	}
+
+	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
+		return 1;
+
+	dest += fieinfo->fi_extents_mapped;
+	dest->fe_logical = logical;
+	dest->fe_physical = phys;
+	dest->fe_length = len;
+	dest->fe_flags = flags;
+
+	fieinfo->fi_extents_mapped++;
+	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
+		return 1;
+
+	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
+}
+
+
+
 /* ~~~~ io_map structures && IOCTL(s) ~~~~ */
 /*
  * These set of structures and helpers are used in return of zufs_ioc_IO and
-- 
2.21.0
* [PATCH 14/16] zuf: ioctl implementation
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (12 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 13/16] zuf: More file operation Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 15/16] zuf: xattr && acl implementation Boaz Harrosh
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
* Support for some generic IOCTLs:
  FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_GETVERSION, FS_IOC_SETVERSION
* Simple support for zusFS-defined IOCTLs.
  We only support flat structures
  (no embedded pointers within the IOCTL structures).
  We try to deduce the size of the IOCTL from _IOC_SIZE(cmd).
  If zusFS needs a bigger copy it will send a retry with the
  new size, so badly defined IOCTLs always do 2 trips to userland.
* zusFS may also retry if it wants an fs_freeze to implement
  its IOCTL (TODO keep a map)
[v2]
  zuf: Reduce stack usage (ioctl)
  Same as for IO, use big_alloc for buffers that are too big for the stack
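The size deduction described above can be sketched in user-space C. This is
only an illustration, not code from the patch: struct myfs_ioc_stats,
MYFS_IOC_STATS and HDR_SIZE are hypothetical names, and HDR_SIZE is a made-up
stand-in for sizeof(struct zufs_ioc_ioctl). It shows that a flat argument
structure has its size encoded in the command number, and how the payload
size would be recomputed from a total size reported on a REALLOC retry.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOWR(), _IOC_SIZE() */

/* Hypothetical flat argument structure: no embedded pointers. */
struct myfs_ioc_stats {
	uint64_t ino;
	uint32_t flags;
	uint32_t pad;
};

/* Hypothetical command; the type 'Z' and number 1 are arbitrary. */
#define MYFS_IOC_STATS _IOWR('Z', 1, struct myfs_ioc_stats)

/* Assumed size of the fixed header part; the real value comes from
 * sizeof(struct zufs_ioc_ioctl) and is not 64 in general.
 */
#define HDR_SIZE ((size_t)64)

/* First trip: header plus the argument size encoded in cmd. */
static size_t first_trip_size(unsigned int cmd)
{
	return HDR_SIZE + _IOC_SIZE(cmd);
}

/* Retry: the Server reports a total size, so the payload to copy from
 * the caller is that total minus the header.
 */
static size_t retry_payload_size(size_t new_size)
{
	return new_size - HDR_SIZE;
}
```

With these assumptions, first_trip_size(MYFS_IOC_STATS) covers the header
plus the 16 bytes of struct myfs_ioc_stats. A command defined with _IO(), or
with a size that understates what the Server needs to read, is the "badly
defined" case that costs the second trip.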
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile    |   1 +
 fs/zuf/_extern.h   |   6 +
 fs/zuf/directory.c |   4 +
 fs/zuf/file.c      |   4 +
 fs/zuf/ioctl.c     | 309 +++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c  |   1 +
 fs/zuf/zus_api.h   |  37 ++++++
 7 files changed, 362 insertions(+)
 create mode 100644 fs/zuf/ioctl.c
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 02df1374a946..d3257bfc69ba 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,6 +17,7 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
+zuf-y += ioctl.o
 zuf-y += rw.o mmap.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 2c7456724ef6..04e0515469e7 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -126,6 +126,12 @@ int zuf_file_mmap(struct file *file, struct vm_area_struct *vma);
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
+/* ioctl.c */
+long zuf_ioctl(struct file *filp, uint cmd, ulong arg);
+#ifdef CONFIG_COMPAT
+long zuf_compat_ioctl(struct file *file, uint cmd, ulong arg);
+#endif
+
 /*
  * Inode and files operations
  */
diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
index 7417aeb77773..612b6e410615 100644
--- a/fs/zuf/directory.c
+++ b/fs/zuf/directory.c
@@ -164,4 +164,8 @@ const struct file_operations zuf_dir_operations = {
 	.read		= generic_read_dir,
 	.iterate_shared	= zuf_readdir,
 	.fsync		= noop_fsync,
+	.unlocked_ioctl = zuf_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= zuf_compat_ioctl,
+#endif
 };
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 1c51529694e7..e0bd60e095e7 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -808,6 +808,10 @@ const struct file_operations zuf_file_operations = {
 	.copy_file_range	= zuf_copy_file_range,
 	.remap_file_range	= zuf_clone_file_range,
 	.fadvise		= zuf_fadvise,
+	.unlocked_ioctl		= zuf_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl		= zuf_compat_ioctl,
+#endif
 };
 
 const struct inode_operations zuf_file_inode_operations = {
diff --git a/fs/zuf/ioctl.c b/fs/zuf/ioctl.c
new file mode 100644
index 000000000000..77b8d7627a74
--- /dev/null
+++ b/fs/zuf/ioctl.c
@@ -0,0 +1,309 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+
+#include "zuf.h"
+
+#define ZUFS_SUPPORTED_FS_FLAGS (FS_SYNC_FL | FS_APPEND_FL | FS_IMMUTABLE_FL | \
+				 FS_NOATIME_FL | FS_DIRSYNC_FL)
+
+noinline
+static int _ioctl_dispatch(struct inode *inode, uint cmd, ulong arg,
+			   void *on_stack, uint max_stack)
+{
+	enum big_alloc_type bat;
+	struct zufs_ioc_ioctl *ioc_ioctl;
+	size_t ioc_size = _IOC_SIZE(cmd);
+	void __user *parg = (void __user *)arg;
+	struct timespec64 time = current_time(inode);
+	size_t size;
+	bool retry = false;
+	int err;
+	bool freeze = false;
+
+realloc:
+	size = sizeof(*ioc_ioctl) + ioc_size;
+
+	zuf_dbg_vfs("[%ld] cmd=0x%x arg=0x%lx size=0x%zx cap_admin=%u IOC(%d, %d, %zd)\n",
+		    inode->i_ino, cmd, arg, size, capable(CAP_SYS_ADMIN),
+		    _IOC_TYPE(cmd), _IOC_NR(cmd), ioc_size);
+
+	ioc_ioctl = big_alloc(size, max_stack, on_stack, GFP_KERNEL, &bat);
+	if (unlikely(!ioc_ioctl))
+		return -ENOMEM;
+
+	memset(ioc_ioctl, 0, sizeof(*ioc_ioctl));
+	ioc_ioctl->hdr.in_len = size;
+	ioc_ioctl->hdr.out_start = offsetof(struct zufs_ioc_ioctl, out_start);
+	ioc_ioctl->hdr.out_max = size;
+	ioc_ioctl->hdr.out_len = 0;
+	ioc_ioctl->hdr.operation = ZUFS_OP_IOCTL;
+	ioc_ioctl->zus_ii = ZUII(inode)->zus_ii;
+	ioc_ioctl->cmd = cmd;
+	ioc_ioctl->kflags = capable(CAP_SYS_ADMIN) ? ZUFS_IOC_CAP_ADMIN : 0;
+	timespec_to_mt(&ioc_ioctl->time, &time);
+
+dispatch:
+	if (arg && ioc_size) {
+		if (copy_from_user(ioc_ioctl->arg, parg, ioc_size)) {
+			err = -EFAULT;
+			goto out;
+		}
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_ioctl->hdr,
+			    NULL, 0);
+
+	if (unlikely(err == -EZUFS_RETRY)) {
+		if (unlikely(retry)) {
+			zuf_err("Server => EZUFS_RETRY again uflags=%d\n",
+				ioc_ioctl->uflags);
+			err = -EBUSY;
+			goto out;
+		}
+		retry = true;
+		switch (ioc_ioctl->uflags) {
+		case ZUFS_IOC_REALLOC:
+			ioc_size = ioc_ioctl->new_size - sizeof(*ioc_ioctl);
+			big_free(ioc_ioctl, bat);
+			goto realloc;
+		case ZUFS_IOC_FREEZE_REQ:
+			err = freeze_super(inode->i_sb);
+			if (unlikely(err)) {
+				zuf_warn("unable to freeze fs err=%d\n", err);
+				goto out;
+			}
+			freeze = true;
+			ioc_ioctl->kflags |= ZUFS_IOC_FSFROZEN;
+			goto dispatch;
+		default:
+			zuf_err("unknown ZUFS retry type uflags=%d\n",
+				ioc_ioctl->uflags);
+			err = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d IOC(%d, %d, %zd)\n",
+			    err, _IOC_TYPE(cmd), _IOC_NR(cmd), ioc_size);
+		goto out;
+	}
+
+	if (ioc_ioctl->hdr.out_len) {
+		if (copy_to_user(parg, ioc_ioctl->arg,
+		    ioc_ioctl->hdr.out_len)) {
+			err = -EFAULT;
+			goto out;
+		}
+	}
+
+out:
+	if (freeze) {
+		int thaw_err = thaw_super(inode->i_sb);
+
+		if (unlikely(thaw_err))
+			zuf_err("post ioctl thaw file system failure err = %d\n",
+				 thaw_err);
+	}
+
+	big_free(ioc_ioctl, bat);
+
+	return err;
+}
+
+static uint _translate_to_ioc_flags(struct zus_inode *zi)
+{
+	uint zi_flags = le16_to_cpu(zi->i_flags);
+	uint ioc_flags = 0;
+
+	if (zi_flags & S_SYNC)
+		ioc_flags |= FS_SYNC_FL;
+	if (zi_flags & S_APPEND)
+		ioc_flags |= FS_APPEND_FL;
+	if (zi_flags & S_IMMUTABLE)
+		ioc_flags |= FS_IMMUTABLE_FL;
+	if (zi_flags & S_NOATIME)
+		ioc_flags |= FS_NOATIME_FL;
+	if (zi_flags & S_DIRSYNC)
+		ioc_flags |= FS_DIRSYNC_FL;
+
+	return ioc_flags;
+}
+
+static int _ioc_getflags(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	uint flags = _translate_to_ioc_flags(zi);
+
+	return put_user(flags, parg);
+}
+
+static void _translate_to_zi_flags(struct zus_inode *zi, unsigned int flags)
+{
+	uint zi_flags = le16_to_cpu(zi->i_flags);
+
+	zi_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+
+	if (flags & FS_SYNC_FL)
+		zi_flags |= S_SYNC;
+	if (flags & FS_APPEND_FL)
+		zi_flags |= S_APPEND;
+	if (flags & FS_IMMUTABLE_FL)
+		zi_flags |= S_IMMUTABLE;
+	if (flags & FS_NOATIME_FL)
+		zi_flags |= S_NOATIME;
+	if (flags & FS_DIRSYNC_FL)
+		zi_flags |= S_DIRSYNC;
+
+	zi->i_flags = cpu_to_le16(zi_flags);
+}
+
+/* use statx ioc to flush zi changes to fs */
+static int __ioc_dispatch_zi_update(struct inode *inode, uint flags)
+{
+	struct zufs_ioc_attr ioc_attr = {
+		.hdr.in_len = sizeof(ioc_attr),
+		.hdr.out_len = sizeof(ioc_attr),
+		.hdr.operation = ZUFS_OP_SETATTR,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.zuf_attr = flags,
+	};
+	int err;
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR))
+		zuf_err("zufc_dispatch failed => %d\n", err);
+
+	return err;
+}
+
+static int _ioc_setflags(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	uint flags, oldflags;
+	int err;
+
+	if (!inode_owner_or_capable(inode))
+		return -EPERM;
+
+	if (get_user(flags, parg))
+		return -EFAULT;
+
+	if (flags & ~ZUFS_SUPPORTED_FS_FLAGS)
+		return -EOPNOTSUPP;
+
+	if (zi->i_flags & ZUFS_S_IMMUTABLE)
+		return -EPERM;
+
+	inode_lock(inode);
+
+	oldflags = le16_to_cpu(zi->i_flags);
+
+	if ((flags ^ oldflags) &
+		(FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+		if (!capable(CAP_LINUX_IMMUTABLE)) {
+			inode_unlock(inode);
+			return -EPERM;
+		}
+	}
+
+	if (!S_ISDIR(inode->i_mode))
+		flags &= ~FS_DIRSYNC_FL;
+
+	flags = flags & FS_FL_USER_MODIFIABLE;
+	flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
+	inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	_translate_to_zi_flags(zi, flags);
+	zuf_set_inode_flags(inode, zi);
+
+	err = __ioc_dispatch_zi_update(inode, ZUFS_STATX_FLAGS | STATX_CTIME);
+
+	inode_unlock(inode);
+	return err;
+}
+
+static int _ioc_setversion(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	__u32 generation;
+	int err;
+
+	if (!inode_owner_or_capable(inode))
+		return -EPERM;
+
+	if (get_user(generation, parg))
+		return -EFAULT;
+
+	inode_lock(inode);
+
+	inode->i_ctime = current_time(inode);
+	inode->i_generation = generation;
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_generation = cpu_to_le32(inode->i_generation);
+
+	err = __ioc_dispatch_zi_update(inode, ZUFS_STATX_VERSION | STATX_CTIME);
+
+	inode_unlock(inode);
+	return err;
+}
+
+long zuf_ioctl(struct file *filp, unsigned int cmd, ulong arg)
+{
+	void __user *parg = (void __user *)arg;
+	char on_stack[ZUF_MAX_STACK(8)];
+
+	switch (cmd) {
+	case FS_IOC_GETFLAGS:
+		return _ioc_getflags(filp->f_inode, parg);
+	case FS_IOC_SETFLAGS:
+		return _ioc_setflags(filp->f_inode, parg);
+	case FS_IOC_GETVERSION:
+		return put_user(filp->f_inode->i_generation, (int __user *)arg);
+	case FS_IOC_SETVERSION:
+		return _ioc_setversion(filp->f_inode, parg);
+	default:
+		return _ioctl_dispatch(filp->f_inode, cmd, arg, on_stack,
+				       sizeof(on_stack));
+	}
+}
+
+#ifdef CONFIG_COMPAT
+long zuf_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case FS_IOC32_GETFLAGS:
+		cmd = FS_IOC_GETFLAGS;
+		break;
+	case FS_IOC32_SETFLAGS:
+		cmd = FS_IOC_SETFLAGS;
+		break;
+	case FS_IOC32_GETVERSION:
+		cmd = FS_IOC_GETVERSION;
+		break;
+	case FS_IOC32_SETVERSION:
+		cmd = FS_IOC_SETVERSION;
+		break;
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return zuf_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 4284d2298906..9b8fe3bff0cd 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -106,6 +106,7 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_SYNC);
 		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE);
 		CASE_ENUM_NAME(ZUFS_OP_LLSEEK);
+		CASE_ENUM_NAME(ZUFS_OP_IOCTL);
 		CASE_ENUM_NAME(ZUFS_OP_FIEMAP);
 
 		CASE_ENUM_NAME(ZUFS_OP_GET_MULTY);
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index c8bcb6006fab..4ebb067c0719 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -466,6 +466,7 @@ enum e_zufs_operation {
 	ZUFS_OP_SYNC		= 20,
 	ZUFS_OP_FALLOCATE	= 21,
 	ZUFS_OP_LLSEEK		= 22,
+	ZUFS_OP_IOCTL		= 23,
 	ZUFS_OP_FIEMAP		= 28,
 
 	ZUFS_OP_GET_MULTY	= 29,
@@ -708,6 +709,42 @@ struct zufs_ioc_seek {
 	__u64 offset_out;
 };
 
+/* ZUFS_OP_IOCTL */
+/* Flags for zufs_ioc_ioctl->kflags */
+enum e_ZUFS_IOCTL_KFLAGS {
+	ZUFS_IOC_FSFROZEN	= 0x1,	/* Tell Server we froze the FS	  */
+	ZUFS_IOC_CAP_ADMIN	= 0x2,	/* The ioctl caller had CAP_ADMIN */
+};
+
+/* received from zus on zufs_ioc_ioctl->uflags */
+enum e_ZUFS_IOCTL_UFLAGS {
+	ZUFS_IOC_REALLOC	= 0x1,	/* _IOC_SIZE(cmd) is too small; Server
+					 * needs a deeper copy
+					 */
+	ZUFS_IOC_FREEZE_REQ	= 0x2,	/* Server needs a freeze and a recall */
+};
+
+struct zufs_ioc_ioctl {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 time;
+	__u32 cmd;
+	__u32 kflags; /* zuf/kernel state and flags*/
+
+	/* OUT */
+	/* This is just a zero-size marker for the start of output */
+	char out_start[0];
+	union {
+		struct { /* If return was -EZUFS_RETRY */
+			__u32 uflags; /* flags returned from zus */
+			__u32 new_size;
+		};
+
+		char arg[0];
+	};
+};
+
 /* ZUFS_OP_FIEMAP */
 struct zufs_ioc_fiemap {
 	struct zufs_ioc_hdr hdr;
-- 
2.21.0
* [PATCH 15/16] zuf: xattr && acl implementation
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (13 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 14/16] zuf: ioctl implementation Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  2:07 ` [PATCH 16/16] zuf: Support for dynamic-debug of zusFSs Boaz Harrosh
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
We establish the usual dispatch API to user-mode
for get/set/list_xattr.
Since the buffers are variable length, we use the
zdo->overflow_handler for the extra copy from the Server.
(see also zuf-core.c)
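To make "variable length" concrete, here is a minimal user-space sketch of
how a set request is packed, following what __zuf_setxattr() in this patch
does: fixed fields first, then the NUL-terminated name, then the value bytes.
The demo_* names are hypothetical and the struct is a cut-down stand-in for
struct zufs_ioc_xattr, not the real wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for struct zufs_ioc_xattr.
 * The real struct also starts with a struct zufs_ioc_hdr and a
 * zus_ii pointer; both are left out to keep the sketch small.
 */
struct demo_ioc_xattr {
	uint32_t flags;
	uint32_t type;
	uint16_t name_len;	/* counts the terminating NUL */
	uint16_t ioc_flags;
	uint32_t user_buf_size;	/* size of the value in bytes */
	char buf[];		/* name, NUL, then the value bytes */
};

/*
 * Pack one set request. The total size is computed the same way as
 * ioc_size in __zuf_setxattr(): sizeof(struct) + name_len + size.
 * Returns a heap buffer the caller must free(), or NULL on failure.
 */
static struct demo_ioc_xattr *demo_pack_setxattr(uint32_t type,
						 const char *name,
						 const void *value,
						 size_t size,
						 size_t *ioc_size)
{
	size_t name_len = strlen(name) + 1;
	struct demo_ioc_xattr *x;

	assert(name_len <= UINT16_MAX && size <= UINT32_MAX);

	*ioc_size = sizeof(*x) + name_len + size;
	x = calloc(1, *ioc_size);
	if (!x)
		return NULL;

	x->type = type;
	x->name_len = (uint16_t)name_len;
	x->user_buf_size = (uint32_t)size;

	memcpy(x->buf, name, name_len);
	if (value && size)
		memcpy(x->buf + name_len, value, size);
	return x;
}

/* The value starts right after the name and its NUL. */
static const char *demo_value(const struct demo_ioc_xattr *x)
{
	return x->buf + x->name_len;
}
```

Because the name length travels in the request, the receiver can find the
value without scanning for the NUL. The get and list paths go the other way:
the Server fills a variable-length reply and the overflow handler copies it
into the caller's buffer.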
The ACL support is all in Kernel. There is no new API
with zusFS.
We define the internal structure of the ACL inside
an opaque xattr and store it via the xattr zus_api.
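As an illustration of that internal structure, the sketch below encodes ACL
entries into the 8-byte little-endian records of struct zuf_acl (le16 tag,
le16 perm, le32 id) and decodes them again. It is a user-space approximation
with hypothetical demo_* names, not the kernel code. Unlike _acl_to_value()
it always writes the id field, and it does not validate tags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ACL_REC_SIZE 8	/* le16 tag + le16 perm + le32 id */

/* Host-order view of one ACL entry (hypothetical helper type). */
struct demo_acl_entry {
	uint16_t tag;
	uint16_t perm;
	uint32_t id;
};

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)(v >> 24);
}

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Encode count entries into out; returns the number of bytes written. */
static size_t demo_acl_encode(const struct demo_acl_entry *e, size_t count,
			      uint8_t *out)
{
	size_t n;

	assert(out != NULL);
	for (n = 0; n < count; n++, out += DEMO_ACL_REC_SIZE) {
		put_le16(out, e[n].tag);
		put_le16(out + 2, e[n].perm);
		put_le32(out + 4, e[n].id);
	}
	return count * DEMO_ACL_REC_SIZE;
}

/*
 * Decode an xattr value of size bytes into at most max entries.
 * Returns the entry count, or -1 when the value is not a whole number
 * of records or does not fit (the patch rejects trailing bytes too).
 */
static long demo_acl_decode(const uint8_t *in, size_t size,
			    struct demo_acl_entry *e, size_t max)
{
	size_t n, count = size / DEMO_ACL_REC_SIZE;

	if (size % DEMO_ACL_REC_SIZE || count > max)
		return -1;

	for (n = 0; n < count; n++, in += DEMO_ACL_REC_SIZE) {
		e[n].tag = get_le16(in);
		e[n].perm = get_le16(in + 2);
		e[n].id = get_le32(in + 4);
	}
	return (long)count;
}
```

Writing the bytes explicitly keeps the stored value the same on big-endian
and little-endian hosts, which is what cpu_to_le16()/cpu_to_le32() provide in
the kernel code.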
TODO:
  Future FSs that have their own ACL on-disk-format, and/or
  Network zusFS that have their own verifiers for the ACL
  will need to establish an alternative API for the acl.
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   2 +-
 fs/zuf/_extern.h  |  20 +++
 fs/zuf/acl.c      | 270 +++++++++++++++++++++++++++++++++++++++
 fs/zuf/file.c     |   3 +
 fs/zuf/inode.c    |  18 +++
 fs/zuf/namei.c    |   6 +
 fs/zuf/super.c    |   2 +
 fs/zuf/symlink.c  |   1 +
 fs/zuf/xattr.c    | 314 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c |   3 +
 fs/zuf/zuf.h      |  34 +++++
 fs/zuf/zus_api.h  |  25 +++-
 12 files changed, 696 insertions(+), 2 deletions(-)
 create mode 100644 fs/zuf/acl.c
 create mode 100644 fs/zuf/xattr.c
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index d3257bfc69ba..abc7dcda0029 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,7 +17,7 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += ioctl.o
+zuf-y += ioctl.o acl.o xattr.o
 zuf-y += rw.o mmap.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 04e0515469e7..d0d83eae75c1 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -132,6 +132,26 @@ long zuf_ioctl(struct file *filp, uint cmd, ulong arg);
 long zuf_compat_ioctl(struct file *file, uint cmd, ulong arg);
 #endif
 
+/* xattr.c */
+int zuf_initxattrs(struct inode *inode, const struct xattr *xattr_array,
+		   void *fs_info);
+ssize_t __zuf_getxattr(struct inode *inode, int type, const char *name,
+		       void *buffer, size_t size);
+int __zuf_setxattr(struct inode *inode, int type, const char *name,
+		   const void *value, size_t size, int flags);
+ssize_t zuf_listxattr(struct dentry *dentry, char *buffer, size_t size);
+extern const struct xattr_handler *zuf_xattr_handlers[];
+
+/* acl.c */
+int zuf_set_acl(struct inode *inode, struct posix_acl *acl, int type);
+struct posix_acl *zuf_get_acl(struct inode *inode, int type);
+int zuf_acls_create_pre(struct inode *dir, umode_t *mode,
+			struct posix_acl **def_acl, struct posix_acl **acl);
+int zuf_acls_create_post(struct inode *dir, struct inode *inode,
+			 struct posix_acl *def_acl, struct posix_acl *acl);
+extern const struct xattr_handler zuf_acl_access_xattr_handler;
+extern const struct xattr_handler zuf_acl_default_xattr_handler;
+
 /*
  * Inode and files operations
  */
diff --git a/fs/zuf/acl.c b/fs/zuf/acl.c
new file mode 100644
index 000000000000..fe2bcd2096bf
--- /dev/null
+++ b/fs/zuf/acl.c
@@ -0,0 +1,270 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Access Control List
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/xattr.h>
+#include "zuf.h"
+
+static void _acl_to_value(const struct posix_acl *acl, void *value)
+{
+	int n;
+	struct zuf_acl *macl = value;
+
+	zuf_dbg_acl("acl->count=%d\n", acl->a_count);
+
+	for (n = 0; n < acl->a_count; n++) {
+		const struct posix_acl_entry *entry = &acl->a_entries[n];
+
+		zuf_dbg_acl("aclno=%d tag=0x%x perm=0x%x\n",
+			     n, entry->e_tag, entry->e_perm);
+
+		macl->tag = cpu_to_le16(entry->e_tag);
+		macl->perm = cpu_to_le16(entry->e_perm);
+
+		switch (entry->e_tag) {
+		case ACL_USER:
+			macl->id = cpu_to_le32(
+				from_kuid(&init_user_ns, entry->e_uid));
+			break;
+		case ACL_GROUP:
+			macl->id = cpu_to_le32(
+				from_kgid(&init_user_ns, entry->e_gid));
+			break;
+		case ACL_USER_OBJ:
+		case ACL_GROUP_OBJ:
+		case ACL_MASK:
+		case ACL_OTHER:
+			break;
+		default:
+			zuf_dbg_err("e_tag=0x%x\n", entry->e_tag);
+			return;
+		}
+		macl++;
+	}
+}
+
+static int __set_acl(struct inode *inode, struct posix_acl *acl, int type,
+		     bool set_mode)
+{
+	char *name = NULL;
+	void *buf;
+	int err;
+	size_t size;
+	umode_t old_mode = inode->i_mode;
+
+	zuf_dbg_acl("[%ld] acl=%p type=0x%x\n", inode->i_ino, acl, type);
+
+	switch (type) {
+	case ACL_TYPE_ACCESS: {
+		struct zus_inode *zi = ZUII(inode)->zi;
+
+		name = XATTR_POSIX_ACL_ACCESS;
+		if (acl && set_mode) {
+			err = posix_acl_update_mode(inode, &inode->i_mode,
+						    &acl);
+			if (err)
+				return err;
+
+			zuf_dbg_acl("old=0x%x new=0x%x acl_count=%d\n",
+				    old_mode, inode->i_mode,
+				    acl ? acl->a_count : -1);
+			inode->i_ctime = current_time(inode);
+			timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+			zi->i_mode = cpu_to_le16(inode->i_mode);
+		}
+		break;
+	}
+	case ACL_TYPE_DEFAULT:
+		name = XATTR_POSIX_ACL_DEFAULT;
+		if (!S_ISDIR(inode->i_mode))
+			return acl ? -EACCES : 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	size = acl ? acl->a_count * sizeof(struct zuf_acl) : 0;
+	buf = kmalloc(size, GFP_KERNEL);
+	if (unlikely(!buf))
+		return -ENOMEM;
+
+	if (acl)
+		_acl_to_value(acl, buf);
+
+	/* NOTE: Server's zus_setxattr implementers should cl_flush the zi.
+	 *  In the case it returned an error it should not cl_flush.
+	 *  We will restore to old i_mode.
+	 */
+	err = __zuf_setxattr(inode, ZUF_XF_SYSTEM, name, buf, size, 0);
+	if (likely(!err)) {
+		set_cached_acl(inode, type, acl);
+	} else {
+		/* On error, restore the changes (xfstest/generic/449) */
+		struct zus_inode *zi = ZUII(inode)->zi;
+
+		inode->i_mode = old_mode;
+		zi->i_mode = cpu_to_le16(inode->i_mode);
+	}
+
+	kfree(buf);
+	return err;
+}
+
+int zuf_set_acl(struct inode *inode, struct posix_acl *acl, int type)
+{
+	return __set_acl(inode, acl, type, true);
+}
+
+static struct posix_acl *_value_to_acl(void *value, size_t size)
+{
+	int n, count;
+	struct posix_acl *acl;
+	struct zuf_acl *macl = value;
+	void *end = value + size;
+
+	if (!value)
+		return NULL;
+
+	count = size / sizeof(struct zuf_acl);
+	if (count < 0)
+		return ERR_PTR(-EINVAL);
+	if (count == 0)
+		return NULL;
+
+	acl = posix_acl_alloc(count, GFP_NOFS);
+	if (unlikely(!acl))
+		return ERR_PTR(-ENOMEM);
+
+	for (n = 0; n < count; n++) {
+		if (end < (void *)macl + sizeof(struct zuf_acl))
+			goto fail;
+
+		zuf_dbg_acl("aclno=%d tag=0x%x perm=0x%x id=0x%x\n",
+			     n, le16_to_cpu(macl->tag), le16_to_cpu(macl->perm),
+			     le32_to_cpu(macl->id));
+
+		acl->a_entries[n].e_tag  = le16_to_cpu(macl->tag);
+		acl->a_entries[n].e_perm = le16_to_cpu(macl->perm);
+
+		switch (acl->a_entries[n].e_tag) {
+		case ACL_USER_OBJ:
+		case ACL_GROUP_OBJ:
+		case ACL_MASK:
+		case ACL_OTHER:
+			macl++;
+			break;
+		case ACL_USER:
+			acl->a_entries[n].e_uid = make_kuid(&init_user_ns,
+							le32_to_cpu(macl->id));
+			macl++;
+			if (end < (void *)macl)
+				goto fail;
+			break;
+		case ACL_GROUP:
+			acl->a_entries[n].e_gid = make_kgid(&init_user_ns,
+							le32_to_cpu(macl->id));
+			macl++;
+			if (end < (void *)macl)
+				goto fail;
+			break;
+
+		default:
+			goto fail;
+		}
+	}
+	if (macl != end)
+		goto fail;
+	return acl;
+
+fail:
+	posix_acl_release(acl);
+	return ERR_PTR(-EINVAL);
+}
+
+struct posix_acl *zuf_get_acl(struct inode *inode, int type)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	char *name = NULL;
+	void *buf;
+	struct posix_acl *acl = NULL;
+	int ret;
+
+	zuf_dbg_acl("[%ld] type=0x%x\n", inode->i_ino, type);
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (unlikely(!buf))
+		return ERR_PTR(-ENOMEM);
+
+	switch (type) {
+	case ACL_TYPE_ACCESS:
+		name = XATTR_POSIX_ACL_ACCESS;
+		break;
+	case ACL_TYPE_DEFAULT:
+		name = XATTR_POSIX_ACL_DEFAULT;
+		break;
+	default:
+		WARN_ON(1);
+		return ERR_PTR(-EINVAL);
+	}
+
+	zuf_smr_lock(zii);
+
+	ret = __zuf_getxattr(inode, ZUF_XF_SYSTEM, name, buf, PAGE_SIZE);
+	if (likely(ret > 0)) {
+		acl = _value_to_acl(buf, ret);
+	} else if (ret != -ENODATA) {
+		if (ret != 0)
+			zuf_dbg_err("failed to getxattr ret=%d\n", ret);
+		acl = ERR_PTR(ret);
+	}
+
+	if (!IS_ERR(acl))
+		set_cached_acl(inode, type, acl);
+
+	zuf_smr_unlock(zii);
+
+	free_page((ulong)buf);
+
+	return acl;
+}
+
+/* Used by creation of new inodes */
+int zuf_acls_create_pre(struct inode *dir, umode_t *mode,
+			struct posix_acl **def_acl, struct posix_acl **acl)
+{
+	int err = posix_acl_create(dir, mode, def_acl, acl);
+
+	return err;
+}
+
+int zuf_acls_create_post(struct inode *dir, struct inode *inode,
+			 struct posix_acl *def_acl, struct posix_acl *acl)
+{
+	int err = 0, err2 = 0;
+
+	zuf_dbg_acl("def_acl_count=%d acl_count=%d\n",
+			def_acl ? def_acl->a_count : -1,
+			acl ? acl->a_count : -1);
+
+	if (def_acl)
+		err = __set_acl(inode, def_acl, ACL_TYPE_DEFAULT, false);
+	else
+		inode->i_default_acl = NULL;
+
+	if (acl)
+		err2 = __set_acl(inode, acl, ACL_TYPE_ACCESS, false);
+	else
+		inode->i_acl = NULL;
+
+	return err ?: err2;
+}
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index e0bd60e095e7..a4a788dcdc87 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -819,4 +819,7 @@ const struct inode_operations zuf_file_inode_operations = {
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
 	.fiemap		= zuf_fiemap,
+	.get_acl	= zuf_get_acl,
+	.set_acl	= zuf_set_acl,
+	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 1e3dba654f34..ed324701a20b 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -287,6 +287,7 @@ void zuf_evict_inode(struct inode *inode)
 			_warn_inode_dirty(inode, zii->zi);
 
 		zuf_w_lock(zii);
+		zuf_xaw_lock(zii); /* Needed? probably not but playing safe */
 
 		zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode);
 
@@ -295,6 +296,7 @@ void zuf_evict_inode(struct inode *inode)
 		inode->i_mtime = inode->i_ctime = current_time(inode);
 		inode->i_size = 0;
 
+		zuf_xaw_unlock(zii);
 		zuf_w_unlock(zii);
 	} else {
 		zuf_dbg_vfs("[%ld] inode is going down?\n", inode->i_ino);
@@ -341,6 +343,7 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 		.flags = tmpfile ? ZI_TMPFILE : 0,
 		.str.len = qstr->len,
 	};
+	struct posix_acl *acl = NULL, *def_acl = NULL;
 	struct inode *inode;
 	struct zus_inode *zi = NULL;
 	struct page *pages[2];
@@ -360,6 +363,15 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 
 	zuf_dbg_verbose("inode=%p name=%s\n", inode, qstr->name);
 
+	err = security_inode_init_security(inode, dir, qstr, zuf_initxattrs,
+					   NULL);
+	if (err && err != -EOPNOTSUPP)
+		goto fail;
+
+	err = zuf_acls_create_pre(dir, &inode->i_mode, &def_acl, &acl);
+	if (unlikely(err))
+		goto fail;
+
 	zuf_set_inode_flags(inode, &ioc_new_inode.zi);
 
 	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
@@ -400,6 +412,12 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 
 	zuf_dbg_verbose("allocating inode %ld (zi=%p)\n", _zi_ino(zi), zi);
 
+	if ((def_acl || acl) && !symname) {
+		err = zuf_acls_create_post(dir, inode, def_acl, acl);
+		if (unlikely(err))
+			goto fail;
+	}
+
 	err = insert_inode_locked(inode);
 	if (unlikely(err)) {
 		zuf_dbg_err("[%ld:%s] generation=%lld insert_inode_locked => %d\n",
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
index e78aa04f10d5..a33745c328b9 100644
--- a/fs/zuf/namei.c
+++ b/fs/zuf/namei.c
@@ -420,10 +420,16 @@ const struct inode_operations zuf_dir_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.get_acl	= zuf_get_acl,
+	.set_acl	= zuf_set_acl,
+	.listxattr	= zuf_listxattr,
 };
 
 const struct inode_operations zuf_special_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.get_acl	= zuf_get_acl,
+	.set_acl	= zuf_set_acl,
+	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 2a0db11b51d6..8f760e8b3fdc 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -553,6 +553,7 @@ static int zuf_fill_super(struct super_block *sb, void *data, int silent)
 		sb->s_flags |= SB_POSIXACL;
 
 	sb->s_op = &zuf_sops;
+	sb->s_xattr = zuf_xattr_handlers;
 
 	root_i = zuf_iget(sb, ioc_mount->zmi.zus_ii, ioc_mount->zmi._zi,
 			  &exist);
@@ -845,6 +846,7 @@ static void _init_once(void *foo)
 	inode_init_once(&zii->vfs_inode);
 	INIT_LIST_HEAD(&zii->i_mmap_dirty);
 	zii->zi = NULL;
+	init_rwsem(&zii->xa_rwsem);
 	init_rwsem(&zii->in_sync);
 	atomic_set(&zii->vma_count, 0);
 	atomic_set(&zii->write_mapped, 0);
diff --git a/fs/zuf/symlink.c b/fs/zuf/symlink.c
index 1446bdf60cb9..5e9115ba4cbd 100644
--- a/fs/zuf/symlink.c
+++ b/fs/zuf/symlink.c
@@ -70,4 +70,5 @@ const struct inode_operations zuf_symlink_inode_operations = {
 	.update_time	= zuf_update_time,
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
+	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/xattr.c b/fs/zuf/xattr.c
new file mode 100644
index 000000000000..3c239bb7ec7e
--- /dev/null
+++ b/fs/zuf/xattr.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Extended Attributes
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/xattr.h>
+
+#include "zuf.h"
+
+/* ~~~~~~~~~~~~~~~ xattr get ~~~~~~~~~~~~~~~ */
+
+struct _xxxattr {
+	void *user_buffer;
+	union {
+		struct zufs_ioc_xattr ioc_xattr;
+		char buf[512];
+	} d;
+};
+
+static inline uint _XXXATTR_SIZE(uint ioc_size)
+{
+	struct _xxxattr *_xxxattr;
+
+	return ioc_size + (sizeof(*_xxxattr) - sizeof(_xxxattr->d));
+}
+
+static int _xattr_oh(struct zuf_dispatch_op *zdo, void *parg, ulong max_bytes)
+{
+	struct zufs_ioc_hdr *hdr = zdo->hdr;
+	struct zufs_ioc_xattr *ioc_xattr =
+			container_of(hdr, typeof(*ioc_xattr), hdr);
+	struct _xxxattr *_xxattr =
+			container_of(ioc_xattr, typeof(*_xxattr), d.ioc_xattr);
+	struct zufs_ioc_xattr *user_ioc_xattr = parg;
+
+	if (hdr->err)
+		return 0;
+
+	ioc_xattr->user_buf_size = user_ioc_xattr->user_buf_size;
+
+	hdr->out_len -= sizeof(ioc_xattr->user_buf_size);
+	memcpy(_xxattr->user_buffer, user_ioc_xattr->buf, hdr->out_len);
+	return 0;
+}
+
+ssize_t __zuf_getxattr(struct inode *inode, int type, const char *name,
+		       void *buffer, size_t size)
+{
+	size_t name_len = strlen(name) + 1; /* plus \NUL */
+	struct _xxxattr *p_xattr;
+	struct _xxxattr s_xattr;
+	enum big_alloc_type bat;
+	struct zufs_ioc_xattr *ioc_xattr;
+	size_t ioc_size = sizeof(*ioc_xattr) + name_len;
+	struct zuf_dispatch_op zdo;
+	int err;
+	ssize_t ret;
+
+	zuf_dbg_vfs("[%ld] type=%d name=%s size=%lu ioc_size=%lu\n",
+			inode->i_ino, type, name, size, ioc_size);
+
+	p_xattr = big_alloc(_XXXATTR_SIZE(ioc_size), sizeof(s_xattr), &s_xattr,
+			    GFP_KERNEL, &bat);
+	if (unlikely(!p_xattr))
+		return -ENOMEM;
+
+	ioc_xattr = &p_xattr->d.ioc_xattr;
+	memset(ioc_xattr, 0, sizeof(*ioc_xattr));
+	p_xattr->user_buffer = buffer;
+
+	ioc_xattr->hdr.in_len = ioc_size;
+	ioc_xattr->hdr.out_start =
+				offsetof(struct zufs_ioc_xattr, user_buf_size);
+	 /* out_len updated by zus */
+	ioc_xattr->hdr.out_len = sizeof(ioc_xattr->user_buf_size);
+	ioc_xattr->hdr.out_max = 0;
+	ioc_xattr->hdr.operation = ZUFS_OP_XATTR_GET;
+	ioc_xattr->zus_ii = ZUII(inode)->zus_ii;
+	ioc_xattr->type = type;
+	ioc_xattr->name_len = name_len;
+	ioc_xattr->user_buf_size = size;
+
+	strcpy(ioc_xattr->buf, name);
+
+	zuf_dispatch_init(&zdo, &ioc_xattr->hdr, NULL, 0);
+	zdo.oh = _xattr_oh;
+	err = __zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &zdo);
+	ret = ioc_xattr->user_buf_size;
+
+	big_free(p_xattr, bat);
+
+	if (unlikely(err))
+		return err;
+
+	return ret;
+}
+
+/* ~~~~~~~~~~~~~~~ xattr set ~~~~~~~~~~~~~~~ */
+
+int __zuf_setxattr(struct inode *inode, int type, const char *name,
+		   const void *value, size_t size, int flags)
+{
+	size_t name_len = strlen(name) + 1;
+	struct _xxxattr *p_xattr;
+	struct _xxxattr s_xattr;
+	enum big_alloc_type bat;
+	struct zufs_ioc_xattr *ioc_xattr;
+	size_t ioc_size = sizeof(*ioc_xattr) + name_len + size;
+	int err;
+
+	zuf_dbg_vfs("[%ld] type=%d name=%s size=%lu ioc_size=%lu\n",
+			inode->i_ino, type, name, size, ioc_size);
+
+	p_xattr = big_alloc(_XXXATTR_SIZE(ioc_size), sizeof(s_xattr), &s_xattr,
+			    GFP_KERNEL, &bat);
+	if (unlikely(!p_xattr))
+		return -ENOMEM;
+
+	ioc_xattr = &p_xattr->d.ioc_xattr;
+	memset(ioc_xattr, 0, sizeof(*ioc_xattr));
+
+	ioc_xattr->hdr.in_len = ioc_size;
+	ioc_xattr->hdr.out_len = 0;
+	ioc_xattr->hdr.operation = ZUFS_OP_XATTR_SET;
+	ioc_xattr->zus_ii = ZUII(inode)->zus_ii;
+	ioc_xattr->type = type;
+	ioc_xattr->name_len = name_len;
+	ioc_xattr->user_buf_size = size;
+	ioc_xattr->flags = flags;
+
+	if (value && !size)
+		ioc_xattr->ioc_flags = ZUFS_XATTR_SET_EMPTY;
+
+	strcpy(ioc_xattr->buf, name);
+	if (value)
+		memcpy(ioc_xattr->buf + name_len, value, size);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_xattr->hdr,
+			    NULL, 0);
+
+	big_free(p_xattr, bat);
+
+	return err;
+}
+
+/* ~~~~~~~~~~~~~~~ xattr list ~~~~~~~~~~~~~~~ */
+
+static ssize_t __zuf_listxattr(struct inode *inode, char *buffer, size_t size)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct _xxxattr s_xattr;
+	struct zufs_ioc_xattr *ioc_xattr;
+	struct zuf_dispatch_op zdo;
+
+	int err;
+
+	zuf_dbg_vfs("[%ld] size=%lu\n", inode->i_ino, size);
+
+	ioc_xattr = &s_xattr.d.ioc_xattr;
+	memset(ioc_xattr, 0, sizeof(*ioc_xattr));
+	s_xattr.user_buffer = buffer;
+
+	ioc_xattr->hdr.in_len = sizeof(*ioc_xattr);
+	ioc_xattr->hdr.out_start =
+				offsetof(struct zufs_ioc_xattr, user_buf_size);
+	 /* out_len updated by zus */
+	ioc_xattr->hdr.out_len = sizeof(ioc_xattr->user_buf_size);
+	ioc_xattr->hdr.out_max = 0;
+	ioc_xattr->hdr.operation = ZUFS_OP_XATTR_LIST;
+	ioc_xattr->zus_ii = zii->zus_ii;
+	ioc_xattr->name_len = 0;
+	ioc_xattr->user_buf_size = size;
+	ioc_xattr->ioc_flags = capable(CAP_SYS_ADMIN) ? ZUFS_XATTR_TRUSTED : 0;
+
+	zuf_dispatch_init(&zdo, &ioc_xattr->hdr, NULL, 0);
+	zdo.oh = _xattr_oh;
+	err = __zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &zdo);
+	if (unlikely(err))
+		return err;
+
+	return ioc_xattr->user_buf_size;
+}
+
+ssize_t zuf_listxattr(struct dentry *dentry, char *buffer, size_t size)
+{
+	struct inode *inode = dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	ssize_t ret;
+
+	zuf_xar_lock(zii);
+
+	ret = __zuf_listxattr(inode, buffer, size);
+
+	zuf_xar_unlock(zii);
+
+	return ret;
+}
+
+/* ~~~~~~~~~~~~~~~ xattr sb handlers ~~~~~~~~~~~~~~~ */
+static bool zuf_xattr_handler_list(struct dentry *dentry)
+{
+	return true;
+}
+
+static
+int zuf_xattr_handler_get(const struct xattr_handler *handler,
+			  struct dentry *dentry, struct inode *inode,
+			  const char *name, void *value, size_t size)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	int ret;
+
+	zuf_dbg_xattr("[%ld] name=%s\n", inode->i_ino, name);
+
+	zuf_xar_lock(zii);
+
+	ret = __zuf_getxattr(inode, handler->flags, name, value, size);
+
+	zuf_xar_unlock(zii);
+
+	return ret;
+}
+
+static
+int zuf_xattr_handler_set(const struct xattr_handler *handler,
+			  struct dentry *d_notused, struct inode *inode,
+			  const char *name, const void *value, size_t size,
+			  int flags)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	int err;
+
+	zuf_dbg_xattr("[%ld] name=%s size=0x%lx flags=0x%x\n",
+			inode->i_ino, name, size, flags);
+
+	zuf_xaw_lock(zii);
+
+	err = __zuf_setxattr(inode, handler->flags, name, value, size, flags);
+
+	zuf_xaw_unlock(zii);
+
+	return err;
+}
+
+const struct xattr_handler zuf_xattr_security_handler = {
+	.prefix	= XATTR_SECURITY_PREFIX,
+	.flags = ZUF_XF_SECURITY,
+	.list	= zuf_xattr_handler_list,
+	.get	= zuf_xattr_handler_get,
+	.set	= zuf_xattr_handler_set,
+};
+
+const struct xattr_handler zuf_xattr_trusted_handler = {
+	.prefix	= XATTR_TRUSTED_PREFIX,
+	.flags = ZUF_XF_TRUSTED,
+	.list	= zuf_xattr_handler_list,
+	.get	= zuf_xattr_handler_get,
+	.set	= zuf_xattr_handler_set,
+};
+
+const struct xattr_handler zuf_xattr_user_handler = {
+	.prefix	= XATTR_USER_PREFIX,
+	.flags = ZUF_XF_USER,
+	.list	= zuf_xattr_handler_list,
+	.get	= zuf_xattr_handler_get,
+	.set	= zuf_xattr_handler_set,
+};
+
+const struct xattr_handler *zuf_xattr_handlers[] = {
+	&zuf_xattr_user_handler,
+	&zuf_xattr_trusted_handler,
+	&zuf_xattr_security_handler,
+	&posix_acl_access_xattr_handler,
+	&posix_acl_default_xattr_handler,
+	NULL
+};
+
+/*
+ * Callback for security_inode_init_security() for acquiring xattrs.
+ */
+int zuf_initxattrs(struct inode *inode, const struct xattr *xattr_array,
+		   void *fs_info)
+{
+	const struct xattr *xattr;
+
+	for (xattr = xattr_array; xattr->name != NULL; xattr++) {
+		int err;
+
+		/* REMOVEME: We had a BUG here for a long time that never
+		 * crashed, I want to see this is called, please.
+		 */
+		zuf_warn("Yes it is name=%s value-size=%zd\n",
+			  xattr->name, xattr->value_len);
+
+		err = zuf_xattr_handler_set(&zuf_xattr_security_handler, NULL,
+					    inode, xattr->name, xattr->value,
+					    xattr->value_len, 0);
+		if (unlikely(err)) {
+			zuf_err("[%ld] failed to init xattrs err=%d\n",
+				 inode->i_ino, err);
+			return err;
+		}
+	}
+	return 0;
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 9b8fe3bff0cd..d3252ca7d2d1 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -107,6 +107,9 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE);
 		CASE_ENUM_NAME(ZUFS_OP_LLSEEK);
 		CASE_ENUM_NAME(ZUFS_OP_IOCTL);
+		CASE_ENUM_NAME(ZUFS_OP_XATTR_GET);
+		CASE_ENUM_NAME(ZUFS_OP_XATTR_SET);
+		CASE_ENUM_NAME(ZUFS_OP_XATTR_LIST);
 		CASE_ENUM_NAME(ZUFS_OP_FIEMAP);
 
 		CASE_ENUM_NAME(ZUFS_OP_GET_MULTY);
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index fe479cb70f97..4a1d474eb80b 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -130,6 +130,8 @@ enum {
 struct zuf_inode_info {
 	struct inode		vfs_inode;
 
+	/* Lock for xattr operations */
+	struct rw_semaphore	xa_rwsem;
 	/* Stuff for mmap write */
 	struct rw_semaphore	in_sync;
 	struct list_head	i_mmap_dirty;
@@ -313,6 +315,38 @@ static inline void ZUF_CHECK_I_W_LOCK(struct inode *inode)
 		up_write(&inode->i_rwsem);
 #endif
 }
+static inline void zuf_xar_lock(struct zuf_inode_info *zii)
+{
+	down_read(&zii->xa_rwsem);
+}
+
+static inline void zuf_xar_unlock(struct zuf_inode_info *zii)
+{
+	up_read(&zii->xa_rwsem);
+}
+
+static inline void zuf_xaw_lock(struct zuf_inode_info *zii)
+{
+	down_write(&zii->xa_rwsem);
+}
+
+static inline void zuf_xaw_unlock(struct zuf_inode_info *zii)
+{
+	up_write(&zii->xa_rwsem);
+}
+
+/* xattr types */
+enum {	ZUF_XF_SECURITY    = 1,
+	ZUF_XF_SYSTEM      = 2,
+	ZUF_XF_TRUSTED     = 3,
+	ZUF_XF_USER        = 4,
+};
+
+struct zuf_acl {
+	__le16	tag;
+	__le16	perm;
+	__le32	id;
+};
 
 enum big_alloc_type { ba_stack, ba_8k, ba_vmalloc };
 #define S_8K (1024UL * 8)
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 4ebb067c0719..1359f0384f82 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -467,6 +467,9 @@ enum e_zufs_operation {
 	ZUFS_OP_FALLOCATE	= 21,
 	ZUFS_OP_LLSEEK		= 22,
 	ZUFS_OP_IOCTL		= 23,
+	ZUFS_OP_XATTR_GET	= 24,
+	ZUFS_OP_XATTR_SET	= 25,
+	ZUFS_OP_XATTR_LIST	= 27,
 	ZUFS_OP_FIEMAP		= 28,
 
 	ZUFS_OP_GET_MULTY	= 29,
@@ -745,6 +748,26 @@ struct zufs_ioc_ioctl {
 	};
 };
 
+/* ZUFS_OP_XATTR */
+/* xattr ioc_flags */
+#define ZUFS_XATTR_SET_EMPTY	(1 << 0)
+#define ZUFS_XATTR_TRUSTED	(1 << 1)
+
+struct zufs_ioc_xattr {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u32	flags;
+	__u32	type;
+	__u16	name_len;
+	__u16	ioc_flags;
+
+	/* OUT */
+	__u32	user_buf_size;
+	char	buf[0];
+};
+
+
 /* ZUFS_OP_FIEMAP */
 struct zufs_ioc_fiemap {
 	struct zufs_ioc_hdr hdr;
@@ -760,7 +783,7 @@ struct zufs_ioc_fiemap {
 	__u32	extents_mapped;
 	__u32	pad;
 
-} __packed;
+};
 
 struct zufs_fiemap_extent_info {
 	struct fiemap_extent *fi_extents_start;
-- 
2.21.0
* [PATCH 16/16] zuf: Support for dynamic-debug of zusFSs
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (14 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 15/16] zuf: xattr && acl implementation Boaz Harrosh
@ 2019-09-26  2:07 ` Boaz Harrosh
  2019-09-26  7:11 ` [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Miklos Szeredi
  2019-09-26 11:41 ` Boaz Harrosh
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26  2:07 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
  Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
 [THIS PATCH will be changed or dropped before final submission]
In zus we support dynamic-debug prints, i.e. the user can
turn the prints on and off at run time by writing
to a special file.
The API is exactly the same as the Kernel's dynamic-debug;
only the special file that we read and write is different:
	/sys/fs/zuf/ddbg
Otherwise it is identical to the Kernel's.
The Kernel code is a thin wrapper that dispatches the
reads and writes of the /sys/fs/zuf/ddbg file to the zus
server.
The heavy lifting is done by the zus project build system
and core code. See the zus project for how this is done.
This facility is dispatched on the mount-thread and not
on the regular ZTs, because it is available globally before
any mounts.
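A minimal user-space sketch of driving this file, assuming the Server accepts
the same query strings as the Kernel's dynamic debug (for example
"file foo.c +p", where foo.c is a hypothetical source file). demo_ddbg_write
is a hypothetical helper; the 512-byte limit mirrors the check in
_zus_ddbg_write() below.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* _zus_ddbg_write() rejects messages longer than this with -EINVAL. */
#define DEMO_DDBG_MAX_MSG 512

/*
 * Write one control message to the ddbg file at path.
 * Returns 0 on success or a negative errno value.
 */
static int demo_ddbg_write(const char *path, const char *query)
{
	size_t len;
	ssize_t n;
	int fd, err;

	assert(path && query);

	len = strlen(query);
	if (len > DEMO_DDBG_MAX_MSG)
		return -EINVAL;	/* fail early, same limit as the kernel side */

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The whole message must go out in a single write() call. */
	n = write(fd, query, len);
	if (n == (ssize_t)len)
		err = 0;
	else if (n < 0)
		err = -errno;
	else
		err = -EIO;	/* short write: treat as an I/O error */

	close(fd);
	return err;
}
```

A caller would use it as demo_ddbg_write("/sys/fs/zuf/ddbg", "file foo.c +p").
The file is created with mode 0600, so this needs to run as the owner of the
zuf-root mount, typically root.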
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h  |  3 ++
 fs/zuf/zuf-root.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index d0d83eae75c1..40cc228e4c99 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -29,6 +29,9 @@ int zufc_release(struct inode *inode, struct file *file);
 int zufc_mmap(struct file *file, struct vm_area_struct *vma);
 const char *zuf_op_name(enum e_zufs_operation op);
 
+int __zufc_dispatch_mount(struct zuf_root_info *zri,
+			  enum e_mount_operation op,
+			  struct zufs_ioc_mount *zim);
 int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
 			enum e_mount_operation operation,
 			struct zufs_ioc_mount *zim);
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
index ecf240bd3e3f..3c3126d676a6 100644
--- a/fs/zuf/zuf-root.c
+++ b/fs/zuf/zuf-root.c
@@ -70,6 +70,81 @@ static void _fs_type_free(struct zuf_fs_type *zft)
 }
 #endif /*CONFIG_LOCKDEP*/
 
+#define DDBG_MAX_BUF_SIZE	(8 * PAGE_SIZE)
+/* We use ppos as a cookie for the dynamic debug ID we want to read from */
+static ssize_t _zus_ddbg_read(struct file *file, char __user *buf, size_t len,
+			      loff_t *ppos)
+{
+	struct zufs_ioc_mount *zim;
+	size_t buf_size = (DDBG_MAX_BUF_SIZE <= len) ? DDBG_MAX_BUF_SIZE : len;
+	size_t zim_size =  sizeof(zim->hdr) + sizeof(zim->zdi);
+	ssize_t err;
+
+	zim = vzalloc(zim_size + buf_size);
+	if (unlikely(!zim))
+		return -ENOMEM;
+
+	/* null terminate the 1st character in the buffer, hence the '+ 1' */
+	zim->hdr.in_len = zim_size + 1;
+	zim->hdr.out_len = zim_size + buf_size;
+	zim->zdi.len = buf_size;
+	zim->zdi.id = *ppos;
+	*ppos = 0;
+
+	err = __zufc_dispatch_mount(ZRI(file->f_inode->i_sb), ZUFS_M_DDBG_RD,
+				    zim);
+	if (unlikely(err)) {
+		zuf_err("error dispatching control message => %ld\n", err);
+		goto out;
+	}
+
+	err = simple_read_from_buffer(buf, zim->zdi.len, ppos, zim->zdi.msg,
+				      buf_size);
+	if (unlikely(err <= 0))
+		goto out;
+
+	*ppos = zim->zdi.id;
+out:
+	vfree(zim);
+	return err;
+}
+
+static ssize_t _zus_ddbg_write(struct file *file, const char __user *buf,
+			       size_t len, loff_t *ofst)
+{
+	struct _ddbg_info {
+		struct zufs_ioc_mount zim;
+		char buf[512];
+	} ddi = {};
+	ssize_t err;
+
+	if (unlikely(512 < len)) {
+		zuf_err("ddbg control message too long\n");
+		return -EINVAL;
+	}
+
+	memset(&ddi, 0, sizeof(ddi));
+	if (copy_from_user(ddi.zim.zdi.msg, buf, len))
+		return -EFAULT;
+
+	ddi.zim.hdr.in_len = sizeof(ddi);
+	ddi.zim.hdr.out_len = sizeof(ddi.zim);
+	err = __zufc_dispatch_mount(ZRI(file->f_inode->i_sb), ZUFS_M_DDBG_WR,
+				    &ddi.zim);
+	if (unlikely(err)) {
+		zuf_err("error dispatching control message => %ld\n", err);
+		return err;
+	}
+
+	return len;
+}
+
+static const struct file_operations _zus_ddbg_ops = {
+	.open = nonseekable_open,
+	.read = _zus_ddbg_read,
+	.write = _zus_ddbg_write,
+	.llseek = no_llseek,
+};
 
 static ssize_t _state_read(struct file *file, char __user *buf, size_t len,
 			   loff_t *ppos)
@@ -338,6 +413,7 @@ static int zufr_fill_super(struct super_block *sb, void *data, int silent)
 	static struct tree_descr zufr_files[] = {
 		[2] = {"state", &_state_ops, S_IFREG | 0400},
 		[3] = {"registered_fs", &_registered_fs_ops, S_IFREG | 0400},
+		[4] = {"ddbg", &_zus_ddbg_ops, S_IFREG | 0600},
 		{""},
 	};
 	struct zuf_root_info *zri;
-- 
2.21.0
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (15 preceding siblings ...)
  2019-09-26  2:07 ` [PATCH 16/16] zuf: Support for dynamic-debug of zusFSs Boaz Harrosh
@ 2019-09-26  7:11 ` Miklos Szeredi
  2019-09-26  9:41   ` Bernd Schubert
                     ` (2 more replies)
  2019-09-26 11:41 ` Boaz Harrosh
  17 siblings, 3 replies; 32+ messages in thread
From: Miklos Szeredi @ 2019-09-26  7:11 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin,
	Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:
> Performance:
> A simple fio direct 4k random write test with incrementing number
> of threads.
>
> [fuse]
> threads wr_iops wr_bw   wr_lat
> 1       33606   134424  26.53226
> 2       57056   228224  30.38476
> 4       88667   354668  40.12783
> 7       116561  466245  53.98572
> 8       129134  516539  55.6134
>
> [fuse-splice]
> threads wr_iops wr_bw   wr_lat
> 1       39670   158682  21.8399
> 2       51100   204400  34.63294
> 4       75220   300882  47.42344
> 7       97706   390825  63.04435
> 8       98034   392137  73.24263
>
> [xfs-dax]
> threads wr_iops wr_bw           wr_lat
Data missing.
> [Maxdata-1.5-zufs]
> threads wr_iops wr_bw           wr_lat
> 1       1041802 260,450         3.623
> 2       1983997 495,999         3.808
> 4       3829456 957,364         3.959
> 7       4501154 1,125,288       5.895330
> 8       4400698 1,100,174       6.922174
Just a heads up, that I have achieved similar results with a prototype
using the unmodified fuse protocol.  This prototype was built with
ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
op).  I found a big scheduler scalability bottleneck that is caused by
update of mm->cpu_bitmap at context switch.   This can be worked
around by using shared memory instead of shared page tables, which is
a bit of a pain, but it does prove the point.  Thought about fixing
the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
Are you interested in comparing zufs with the scalable fuse prototype?
If so, I'll push the code into a public repo with some instructions.
Thanks,
Miklos
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26  7:11 ` [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Miklos Szeredi
@ 2019-09-26  9:41   ` Bernd Schubert
  2019-09-26 11:27   ` Boaz Harrosh
  2019-09-26 12:48   ` Boaz Harrosh
  2 siblings, 0 replies; 32+ messages in thread
From: Bernd Schubert @ 2019-09-26  9:41 UTC (permalink / raw)
  To: Miklos Szeredi, Boaz Harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin,
	Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
Hi Miklos,
> Just a heads up, that I have achieved similar results with a prototype
> using the unmodified fuse protocol.  This prototype was built with
> ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
> op).  I found a big scheduler scalability bottleneck that is caused by
> update of mm->cpu_bitmap at context switch.   This can be worked
> around by using shared memory instead of shared page tables, which is
> a bit of a pain, but it does prove the point.  Thought about fixing
> the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> 
> Are you interested in comparing zufs with the scalable fuse prototype?
>  If so, I'll push the code into a public repo with some instructions,
I would be happy to help here (review, lightly test and debug). I have
wanted to give the ioctl threads method a try for some time, but never
got around to it.
Thanks,
Bernd
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26  7:11 ` [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Miklos Szeredi
  2019-09-26  9:41   ` Bernd Schubert
@ 2019-09-26 11:27   ` Boaz Harrosh
  2019-09-26 12:12     ` Bernd Schubert
  2019-09-26 12:48   ` Boaz Harrosh
  2 siblings, 1 reply; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26 11:27 UTC (permalink / raw)
  To: Miklos Szeredi, Boaz Harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin,
	Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
On 26/09/2019 10:11, Miklos Szeredi wrote:
> On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:
> 
<>
>> [xfs-dax]
>> threads wr_iops wr_bw           wr_lat
> 
> Data missing.
> 
Oops, sorry. Will send today.
>> [Maxdata-1.5-zufs]
>> threads wr_iops wr_bw           wr_lat
>> 1       1041802 260,450         3.623
>> 2       1983997 495,999         3.808
>> 4       3829456 957,364         3.959
>> 7       4501154 1,125,288       5.895330
>> 8       4400698 1,100,174       6.922174
> 
> Just a heads up, that I have achieved similar results with a prototype
> using the unmodified fuse protocol.  This prototype was built with
> ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
> op).  I found a big scheduler scalability bottleneck that is caused by
> update of mm->cpu_bitmap at context switch.   This can be worked
> around by using shared memory instead of shared page tables, which is
> a bit of a pain, but it does prove the point.  Thought about fixing
> the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> 
> Are you interested in comparing zufs with the scalable fuse prototype?
>  If so, I'll push the code into a public repo with some instructions,
> 
Yes, please do send it. I will give it a good run.
Which FUSE filesystem do you use in user mode?
> Thanks,
> Miklos
> 
Thank you Miklos for looking
Boaz
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
                   ` (16 preceding siblings ...)
  2019-09-26  7:11 ` [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Miklos Szeredi
@ 2019-09-26 11:41 ` Boaz Harrosh
  17 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26 11:41 UTC (permalink / raw)
  To: linux-fsdevel, Matt Benjamin
  Cc: Anna Schumaker, Al Viro, Miklos Szeredi, Amir Goldstein,
	Sagi Manole, Matthew Wilcox, Dan Williams
On 26/09/2019 05:40, Matt Benjamin wrote:
> per discussion 2 weeks ago--is there a git repo or something that I can clone?
> 
> Matt
> 
Please look in the cover letter; it gives the git tree addresses to
clone:
[v02]
   The patches submitted are at:
	git https://github.com/NetApp/zufs-zuf upstream-v02
Also the same for the zus Server in user-mode + infra:
	git https://github.com/NetApp/zufs-zus upstream
Please look at the 3rd patch:
	[PATCH 03/16] zuf: Preliminary Documentation
It has instructions on what to clone, how to compile and install,
and how to use the scripts in do-zu to run a system.
I would love a good review of this documentation as well.
I'm sure it has errors and omissions. I have used it for so long that
I'm already blind to them.
Please bug me day and night with any questions.
Thanks
Boaz
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26 11:27   ` Boaz Harrosh
@ 2019-09-26 12:12     ` Bernd Schubert
  2019-09-26 12:24       ` Boaz Harrosh
  0 siblings, 1 reply; 32+ messages in thread
From: Bernd Schubert @ 2019-09-26 12:12 UTC (permalink / raw)
  To: Boaz Harrosh, Miklos Szeredi, Boaz Harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin,
	Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
>> Are you interested in comparing zufs with the scalable fuse prototype?
>>  If so, I'll push the code into a public repo with some instructions,
>>
> 
> Yes please do send it. I will give it a good run.
> What fuseFS do you use in usermode?
For a start, passthrough should do, modified to skip all data. That is
what I am doing to measure fuse bandwidth. It also shouldn't be too
difficult to add an in-mem tree for dentries and inodes, to be able to
measure without tmpfs overhead.
Bernd
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26 12:12     ` Bernd Schubert
@ 2019-09-26 12:24       ` Boaz Harrosh
  2019-09-26 13:45         ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26 12:24 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi, Boaz Harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin,
	Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
On 26/09/2019 15:12, Bernd Schubert wrote:
>>> Are you interested in comparing zufs with the scalable fuse prototype?
>>>  If so, I'll push the code into a public repo with some instructions,
>>>
>>
>> Yes please do send it. I will give it a good run.
>> What fuseFS do you use in usermode?
> 
> For the start passthrough should do, modified to skip all data. 
Skipping all data is not good for me, because it hides the page faults
and the actual memory bandwidth. What I do instead is memcpy
a single preallocated block to all blocks in the IO, and/or fill
the file with a defined pattern where each ulong contains its own
offset as data. This gives me true results.
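The two fill strategies described above can be sketched in plain
user-space C as follows. This is only an illustration, not code from
zus or from any fuse server: the helper names are hypothetical, and it
assumes that "offset" means the byte offset of each word in the file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy one preallocated block over every block of
 * an IO buffer, so the memory traffic of a real write is still paid.
 */
static void fill_from_block(void *dst, size_t len,
			    const void *block, size_t block_size)
{
	char *p = dst;

	assert(block_size > 0);
	while (len) {
		size_t n = len < block_size ? len : block_size;

		memcpy(p, block, n);
		p += n;
		len -= n;
	}
}

/* Hypothetical helper: every unsigned long holds its own byte offset in
 * the file (assumed meaning of "offset"), so a later read can be checked.
 */
static void fill_offset_pattern(unsigned long *buf, size_t nwords,
				unsigned long file_offset)
{
	size_t i;

	for (i = 0; i < nwords; i++)
		buf[i] = file_offset + i * sizeof(unsigned long);
}

/* Return the index of the first word that does not match the pattern,
 * or nwords if the whole buffer is intact.
 */
static size_t verify_offset_pattern(const unsigned long *buf, size_t nwords,
				    unsigned long file_offset)
{
	size_t i;

	for (i = 0; i < nwords; i++)
		if (buf[i] != file_offset + i * sizeof(unsigned long))
			return i;
	return nwords;
}
```

With the offset pattern, a read-back both touches the pages and can
detect data landing at the wrong place, which a "skip all data" server
cannot show.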
> That is
> what I am doing to measure fuse bandwidth. It also shouldn't be too
> difficult to add an in-mem tree for dentries and inodes, to be able to
> measure without tmpfs overhead.
> 
Thanks, that is very helpful. I will use this.
Boaz
> 
> Bernd
> 
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26  7:11 ` [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Miklos Szeredi
  2019-09-26  9:41   ` Bernd Schubert
  2019-09-26 11:27   ` Boaz Harrosh
@ 2019-09-26 12:48   ` Boaz Harrosh
  2019-09-26 13:48     ` Miklos Szeredi
  2 siblings, 1 reply; 32+ messages in thread
From: Boaz Harrosh @ 2019-09-26 12:48 UTC (permalink / raw)
  To: Miklos Szeredi, Boaz Harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin,
	Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox,
	Dan Williams
On 26/09/2019 10:11, Miklos Szeredi wrote:
> On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:
> 
> Just a heads up, that I have achieved similar results with a prototype
> using the unmodified fuse protocol.  This prototype was built with
> ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
> op).
>  I found a big scheduler scalability bottleneck that is caused by
> update of mm->cpu_bitmap at context switch.   This can be worked
> around by using shared memory instead of shared page tables, which is
> a bit of a pain, but it does prove the point.  Thought about fixing
> the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> 
I'm not sure which scalability bottleneck you are seeing above.
With zufs I get very good scalability, almost flat up to the
number of CPUs and/or the limit of the memory bandwidth if I'm accessing
pmem.
I do hit a bad scalability bottleneck if I use mmap of pages, caused
by the call to zap_vma_ptes, which is why I invented the NIO way.
(Inspired by you)
Once you send me the git URL I will have a look at the code and see if
I can find any differences.
That said, I do believe that a new scheduler object that completely
bypasses the scheduler and just relinquishes its time slice to the
switched-to thread will cut another 0.5us off the single-thread
latency. (The 5th patch talks about that.)
> Are you interested in comparing zufs with the scalable fuse prototype?
>  If so, I'll push the code into a public repo with some instructions,
> 
> Thanks,
> Miklos
> 
Miklos, would you please have some bandwidth to review my code? It would
make me very happy and calm. Your input is very valuable to me.
Thanks
Boaz
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26 12:24       ` Boaz Harrosh
@ 2019-09-26 13:45         ` Miklos Szeredi
  0 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2019-09-26 13:45 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Bernd Schubert, Boaz Harrosh, linux-fsdevel, Anna Schumaker,
	Al Viro, Matt Benjamin, Miklos Szeredi, Amir Goldstein,
	Sagi Manole, Matthew Wilcox, Dan Williams
On Thu, Sep 26, 2019 at 2:24 PM Boaz Harrosh <openosd@gmail.com> wrote:
>
> On 26/09/2019 15:12, Bernd Schubert wrote:
> >>> Are you interested in comparing zufs with the scalable fuse prototype?
> >>>  If so, I'll push the code into a public repo with some instructions,
> >>>
> >>
> >> Yes please do send it. I will give it a good run.
  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#fuse2
Enable:
CONFIG_FUSE2_FS=y
CONFIG_SAMPLE_FUSE2=y
> >> What fuseFS do you use in usermode?
It's the example loopback filesystem supplied in the git tree above.
I haven't converted libfuse yet to use the new features, so for now
this is the only way to try it.
Usage:
    linux/samples/fuse2/loraw -2 -p -t ~/mnt/fuse/
    options:
     -d: debug
     -s: single threaded
     -b: FUSE_DEV_IOC_CLONE (v1)
     -p: use ioctl for device I/O (v2)
     -m: use "map read" transferring offset into file instead of actual data
     -1: use regular fuse
     -2: use experimental fuse2
     -t: use shared memory instead of threads
I tested with shmfs, and IIRC got about 4-8us latency, depending on
the hardware, type of operation, etc...
Let me know if something's not working properly (this is experimental code).
Thanks,
Miklos
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-09-26 12:48   ` Boaz Harrosh
@ 2019-09-26 13:48     ` Miklos Szeredi
  0 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2019-09-26 13:48 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, linux-fsdevel, Anna Schumaker, Al Viro,
	Matt Benjamin, Miklos Szeredi, Amir Goldstein, Sagi Manole,
	Matthew Wilcox, Dan Williams
On Thu, Sep 26, 2019 at 2:48 PM Boaz Harrosh <openosd@gmail.com> wrote:
>
> On 26/09/2019 10:11, Miklos Szeredi wrote:
> >  I found a big scheduler scalability bottleneck that is caused by
> > update of mm->cpu_bitmap at context switch.   This can be worked
> > around by using shared memory instead of shared page tables, which is
> > a bit of a pain, but it does prove the point.  Thought about fixing
> > the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> >
>
> I'm not sure what is the scalability bottleneck you are seeing above.
> With zufs I have a very good scalability, almost flat up to the
> number of CPUs, and/or the limit of the memory bandwidth if I'm accessing
> pmem.
This was *really* noticeable with NUMA and many cpus (>64).
> Miklos would you please have some bandwidth to review my code? it would
> make me very happy and calm. Your input is very valuable to me.
Sure, will look at the patches.
Thanks,
Miklos
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH 11/16] zuf: Write/Read implementation
       [not found]   ` <db90d73233484d251755c5a0cb7ee570b3fc9d19.camel@netapp.com>
@ 2019-10-29 20:15     ` Matthew Wilcox
  2019-11-14 14:04       ` Boaz Harrosh
  2019-11-14 15:15     ` Boaz Harrosh
  1 sibling, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2019-10-29 20:15 UTC (permalink / raw)
  To: Schumaker, Anna
  Cc: linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	mbenjami@redhat.com, boaz@plexistor.com, dan.j.williams@intel.com,
	mszeredi@redhat.com, amir73il@gmail.com, Manole, Sagi
On Tue, Oct 29, 2019 at 08:08:16PM +0000, Schumaker, Anna wrote:
> > +       return size ?: ret;
> 
> It looks like you're returning "ret" if the ternary evaluates to false, but it's not clear to
> me what is returned if it evaluates to true. It's possible it's okay, but I just don't know
> enough about how ternaries work in this case.
It's an unloved, unwanted GNU extension.  See
https://gcc.gnu.org/onlinedocs/gcc/Conditionals.html
It's really no better than writing:
	return size ? size : ret;
or even better:
	if (size)
		return size;
	return ret;
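
For readers unfamiliar with the extension, here is a small user-space
sketch of what "size ?: ret" does. It is not taken from the patch; the
function names are hypothetical, and it assumes a compiler that accepts
the GNU extension (gcc and clang do by default). The one real
difference from "size ? size : ret" is that the first operand is
evaluated only once, which the counter below makes visible.

```c
#include <assert.h>

static int calls;	/* counts how often next_value() is evaluated */

static long next_value(long v)
{
	calls++;
	return v;
}

/* GNU extension: "a ?: b" yields a if a is non-zero, otherwise b,
 * and a is evaluated exactly once.
 */
static long pick_gnu(long size, long ret)
{
	return next_value(size) ?: ret;
}

/* The same result in standard C, as suggested above. */
static long pick_std(long size, long ret)
{
	if (size)
		return size;
	return ret;
}

/* Hypothetical self-check: both forms agree for the given inputs. */
static void check_equivalent(long size, long ret)
{
	assert(pick_gnu(size, ret) == pick_std(size, ret));
}
```

For a plain variable such as size there is no side effect, so the two
forms behave the same and the explicit if is simply easier to read.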
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH 11/16] zuf: Write/Read implementation
  2019-10-29 20:15     ` Matthew Wilcox
@ 2019-11-14 14:04       ` Boaz Harrosh
  0 siblings, 0 replies; 32+ messages in thread
From: Boaz Harrosh @ 2019-11-14 14:04 UTC (permalink / raw)
  To: Matthew Wilcox, Schumaker, Anna
  Cc: linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	mbenjami@redhat.com, boaz@plexistor.com, dan.j.williams@intel.com,
	mszeredi@redhat.com, amir73il@gmail.com, Manole, Sagi
On 29/10/2019 22:15, Matthew Wilcox wrote:
> On Tue, Oct 29, 2019 at 08:08:16PM +0000, Schumaker, Anna wrote:
>>> +       return size ?: ret;
>>
>> It looks like you're returning "ret" if the ternary evaluates to false, but it's not clear to
>> me what is returned if it evaluates to true. It's possible it's okay, but I just don't know
>> enough about how ternaries work in this case.
> 
> It's an unloved, unwanted GNU extension.  See
> https://gcc.gnu.org/onlinedocs/gcc/Conditionals.html
> 
> It's really no better than writing:
> 
> 	return size ? size : ret;
> 
> or even better:
> 
> 	if (size)
> 		return size;
> 	return ret;
> 
OK, cool, thanks. Will do.
Boaz
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH 11/16] zuf: Write/Read implementation
       [not found]   ` <db90d73233484d251755c5a0cb7ee570b3fc9d19.camel@netapp.com>
  2019-10-29 20:15     ` Matthew Wilcox
@ 2019-11-14 15:15     ` Boaz Harrosh
  2019-11-14 16:08       ` Schumaker, Anna
  1 sibling, 1 reply; 32+ messages in thread
From: Boaz Harrosh @ 2019-11-14 15:15 UTC (permalink / raw)
  To: Schumaker, Anna, linux-fsdevel@vger.kernel.org,
	viro@zeniv.linux.org.uk, mbenjami@redhat.com
  Cc: dan.j.williams@intel.com, mszeredi@redhat.com,
	willy@infradead.org, amir73il@gmail.com, Manole, Sagi
On 29/10/2019 22:08, Schumaker, Anna wrote:
> Hi Boaz,
> 
> On Thu, 2019-09-26 at 05:07 +0300, Boaz Harrosh wrote:
>> zufs Has two ways to do IO.
<>
>> +static int rw_overflow_handler(struct zuf_dispatch_op *zdo, void *arg,
>> +                              ulong max_bytes)
>> +{
>> +       struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr);
This one sets the typed pointer @io to point at the same structure that @zdo->hdr points into.
>> +       struct zufs_ioc_IO *io_user = arg;
>> +       int err;
>> +
>> +       *io = *io_user;
This one copies the full-size structure pointed to by io_user, by value,
into the space pointed to by io (the same space as zdo->hdr).
Same as memcpy(io, io_user, sizeof(*io))
> 
> It looks like you're setting *io using the container_of() macro a few lines above, and then
> overwriting it here without ever using it. Can you remove one of these to make it clearer which
> one you meant to use?
> 
These are not redundant. It's the confusing C thing where a pointer
declaration with an initializer assigns the pointer and not the content.
This code is correct.
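
The distinction can be shown with a stripped-down user-space sketch.
The struct layouts below are hypothetical stand-ins, not the real
zufs_ioc_IO, and container_of() is a simplified version of the kernel
macro without its type checking.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of() (no type check). */
#ifndef container_of
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#endif

struct hdr {			/* hypothetical stand-in for the ioc header */
	int err;
};

struct io {			/* hypothetical stand-in for zufs_ioc_IO */
	struct hdr hdr;
	unsigned long iom_n;
};

static struct io *io_from_hdr(struct hdr *h)
{
	/* Declaration + initializer sets the POINTER only: nothing in the
	 * structure is read or written here.
	 */
	struct io *io = container_of(h, struct io, hdr);

	return io;
}

static void copy_io(struct hdr *h, const struct io *io_user)
{
	struct io *io = io_from_hdr(h);

	/* Assignment through the pointers copies the CONTENTS: this is the
	 * same as memcpy(io, io_user, sizeof(*io)).
	 */
	*io = *io_user;
}
```

So the first statement picks the destination and the second fills it;
removing either one changes what the function does.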
>> +
>> +       err = _ioc_bounds_check(&io->ziom, &io_user->ziom, arg + max_bytes);
>> +       if (unlikely(err))
>> +               return err;
>> +
>> +       if ((io->hdr.err == -EZUFS_RETRY) &&
>> +           io->ziom.iom_n && _zufs_iom_pop(io->iom_e)) {
>> +
>> +               zuf_dbg_rw(
>> +                       "[%s]zuf_iom_execute_sync(%d) max=0x%lx iom_e[%d] => %d\n",
>> +                       zuf_op_name(io->hdr.operation), io->ziom.iom_n,
>> +                       max_bytes, _zufs_iom_opt_type(io_user->iom_e),
>> +                       io->hdr.err);
>> +
>> +               io->hdr.err = zuf_iom_execute_sync(zdo->sb, zdo->inode,
>> +                                                  io_user->iom_e,
>> +                                                  io->ziom.iom_n);
>> +               return EZUF_RETRY_DONE;
>> +       }
<>
>> +static ssize_t _IO_gm(struct zuf_sb_info *sbi, struct inode *inode,
>> +                     ulong *on_stack, uint max_on_stack,
>> +                     struct iov_iter *ii, struct kiocb *kiocb,
>> +                     struct file_ra_state *ra, uint rw)
>> +{
>> +       ssize_t size = 0;
>> +       ssize_t ret = 0;
>> +       enum big_alloc_type bat;
>> +       ulong *bns;
>> +       uint max_bns = min_t(uint,
>> +               md_o2p_up(iov_iter_count(ii) + (kiocb->ki_pos & ~PAGE_MASK)),
>> +               ZUS_API_MAP_MAX_PAGES);
>> +
>> +       bns = big_alloc(max_bns * sizeof(ulong), max_on_stack, on_stack,
>> +                       GFP_NOFS, &bat);
>> +       if (unlikely(!bns)) {
>> +               zuf_err("life was more simple on the stack max_bns=%d\n",
>> +                       max_bns);
>> +               return -ENOMEM;
>> +       }
>> +
>> +       while (iov_iter_count(ii)) {
>> +               ret = _IO_gm_inner(sbi, inode, bns, max_bns, ii, ra,
>> +                                  kiocb->ki_pos, rw);
>> +               if (unlikely(ret < 0))
>> +                       break;
>> +
>> +               kiocb->ki_pos += ret;
>> +               size += ret;
>> +       }
>> +
>> +       big_free(bns, bat);
>> +
>> +       return size ?: ret;
> 
> It looks like you're returning "ret" if the ternary evaluates to false, but it's not clear to
> me what is returned if it evaluates to true. It's possible it's okay, but I just don't know
> enough about how ternaries work in this case.
> 
Yes, thanks, will fix. Not supposed to use this in the kernel.
>> +}
>> +
<>
>> +int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
>> +                        __u64 *iom_e_user, uint iom_n)
>> +{
>> +       struct zuf_sb_info *sbi = SBI(sb);
>> +       struct t2_io_state rd_tis = {};
>> +       struct t2_io_state wr_tis = {};
>> +       struct _iom_exec_info iei = {};
>> +       int err, err_r, err_w;
>> +
>> +       t2_io_begin(sbi->md, READ, NULL, 0, -1, &rd_tis);
>> +       t2_io_begin(sbi->md, WRITE, NULL, 0, -1, &wr_tis);
>> +
>> +       iei.sb = sb;
>> +       iei.inode = inode;
>> +       iei.rd_tis = &rd_tis;
>> +       iei.wr_tis = &wr_tis;
>> +       iei.iom_e = iom_e_user;
>> +       iei.iom_n = iom_n;
>> +       iei.print = 0;
>> +
>> +       err = _iom_execute_inline(&iei);
>> +
>> +       err_r = t2_io_end(&rd_tis, true);
>> +       err_w = t2_io_end(&wr_tis, true);
>> +
>> +       /* TODO: not sure if OK when _iom_execute return with -ENOMEM
>> +        * In such a case, we might be better of skiping t2_io_ends.
>> +        */
>> +       return err ?: (err_r ?: err_w);
> 
> Same question here. 
> 
> Thanks,
> Anna
> 
Yes, will fix.
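
The chained form "err ?: (err_r ?: err_w)" returns the first non-zero
error in that order. A possible extension-free shape is sketched below;
the helper name first_error() is hypothetical and is not part of the
patch.

```c
/* Return the first non-zero error code in priority order, or 0 when
 * all three steps succeeded.
 */
static int first_error(int err, int err_r, int err_w)
{
	if (err)
		return err;
	if (err_r)
		return err_r;
	return err_w;
}
```

With such a helper the last line of zuf_iom_execute_sync() could read
"return first_error(err, err_r, err_w);".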
Thanks Anna
Can I put Reviewed-by on this patch?
>> +}
Much obliged
Boaz
^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH 11/16] zuf: Write/Read implementation
  2019-11-14 15:15     ` Boaz Harrosh
@ 2019-11-14 16:08       ` Schumaker, Anna
  0 siblings, 0 replies; 32+ messages in thread
From: Schumaker, Anna @ 2019-11-14 16:08 UTC (permalink / raw)
  To: linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	mbenjami@redhat.com, boaz@plexistor.com
  Cc: dan.j.williams@intel.com, mszeredi@redhat.com,
	willy@infradead.org, amir73il@gmail.com, Manole, Sagi
On Thu, 2019-11-14 at 17:15 +0200, Boaz Harrosh wrote:
> On 29/10/2019 22:08, Schumaker, Anna wrote:
> > Hi Boaz,
> > 
> > On Thu, 2019-09-26 at 05:07 +0300, Boaz Harrosh wrote:
> > > zufs Has two ways to do IO.
> <>
> > > +static int rw_overflow_handler(struct zuf_dispatch_op *zdo, void *arg,
> > > +                              ulong max_bytes)
> > > +{
> > > +       struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr);
> 
> This one is setting the typed pointer @io to be the same of what @zdo->hdr is
> 
> > > +       struct zufs_ioc_IO *io_user = arg;
> > > +       int err;
> > > +
> > > +       *io = *io_user;
> 
> This one is deep copying the full size structure pointed to by io_user
> to the space pointed to by io. (same as zdo->hdr)
> 
> Same as memcpy(io, io_user, sizeof(*io))
> 
> > It looks like you're setting *io using the container_of() macro a few lines
> > above, and then
> > overwriting it here without ever using it. Can you remove one of these to
> > make it clearer which
> > one you meant to use?
> > 
> 
> These are not redundant its the confusing C thing where declarations
> of pointers + assignment means the pointer and not the content.
> 
> This code is correct
> 
> > > +
> > > +       err = _ioc_bounds_check(&io->ziom, &io_user->ziom, arg +
> > > max_bytes);
> > > +       if (unlikely(err))
> > > +               return err;
> > > +
> > > +       if ((io->hdr.err == -EZUFS_RETRY) &&
> > > +           io->ziom.iom_n && _zufs_iom_pop(io->iom_e)) {
> > > +
> > > +               zuf_dbg_rw(
> > > +                       "[%s]zuf_iom_execute_sync(%d) max=0x%lx iom_e[%d]
> > > => %d\n",
> > > +                       zuf_op_name(io->hdr.operation), io->ziom.iom_n,
> > > +                       max_bytes, _zufs_iom_opt_type(io_user->iom_e),
> > > +                       io->hdr.err);
> > > +
> > > +               io->hdr.err = zuf_iom_execute_sync(zdo->sb, zdo->inode,
> > > +                                                  io_user->iom_e,
> > > +                                                  io->ziom.iom_n);
> > > +               return EZUF_RETRY_DONE;
> > > +       }
> 
> <>
> 
> > > +static ssize_t _IO_gm(struct zuf_sb_info *sbi, struct inode *inode,
> > > +                     ulong *on_stack, uint max_on_stack,
> > > +                     struct iov_iter *ii, struct kiocb *kiocb,
> > > +                     struct file_ra_state *ra, uint rw)
> > > +{
> > > +       ssize_t size = 0;
> > > +       ssize_t ret = 0;
> > > +       enum big_alloc_type bat;
> > > +       ulong *bns;
> > > +       uint max_bns = min_t(uint,
> > > +               md_o2p_up(iov_iter_count(ii) + (kiocb->ki_pos &
> > > ~PAGE_MASK)),
> > > +               ZUS_API_MAP_MAX_PAGES);
> > > +
> > > +       bns = big_alloc(max_bns * sizeof(ulong), max_on_stack, on_stack,
> > > +                       GFP_NOFS, &bat);
> > > +       if (unlikely(!bns)) {
> > > +               zuf_err("life was more simple on the stack max_bns=%d\n",
> > > +                       max_bns);
> > > +               return -ENOMEM;
> > > +       }
> > > +
> > > +       while (iov_iter_count(ii)) {
> > > +               ret = _IO_gm_inner(sbi, inode, bns, max_bns, ii, ra,
> > > +                                  kiocb->ki_pos, rw);
> > > +               if (unlikely(ret < 0))
> > > +                       break;
> > > +
> > > +               kiocb->ki_pos += ret;
> > > +               size += ret;
> > > +       }
> > > +
> > > +       big_free(bns, bat);
> > > +
> > > +       return size ?: ret;
> > 
> > It looks like you're returning "ret" if the ternary evaluates to false, but
> > it's not clear to
> > me what is returned if it evaluates to true. It's possible it's okay, but I
> > just don't know
> > enough about how ternaries work in this case.
> > 
> 
> Yes Thanks, Will fix. Not suppose to use this in the Kernel.
> 
> > > +}
> > > +
> <>
> > > +int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
> > > +                        __u64 *iom_e_user, uint iom_n)
> > > +{
> > > +       struct zuf_sb_info *sbi = SBI(sb);
> > > +       struct t2_io_state rd_tis = {};
> > > +       struct t2_io_state wr_tis = {};
> > > +       struct _iom_exec_info iei = {};
> > > +       int err, err_r, err_w;
> > > +
> > > +       t2_io_begin(sbi->md, READ, NULL, 0, -1, &rd_tis);
> > > +       t2_io_begin(sbi->md, WRITE, NULL, 0, -1, &wr_tis);
> > > +
> > > +       iei.sb = sb;
> > > +       iei.inode = inode;
> > > +       iei.rd_tis = &rd_tis;
> > > +       iei.wr_tis = &wr_tis;
> > > +       iei.iom_e = iom_e_user;
> > > +       iei.iom_n = iom_n;
> > > +       iei.print = 0;
> > > +
> > > +       err = _iom_execute_inline(&iei);
> > > +
> > > +       err_r = t2_io_end(&rd_tis, true);
> > > +       err_w = t2_io_end(&wr_tis, true);
> > > +
> > > +       /* TODO: not sure if OK when _iom_execute return with -ENOMEM
> > > +        * In such a case, we might be better of skiping t2_io_ends.
> > > +        */
> > > +       return err ?: (err_r ?: err_w);
> > 
> > Same question here.
> > 
> > Thanks,
> > Anna
> > 
> 
> Yes Will fix
> 
> Thanks Anna
> Can I put Reviewed-by on this patch?
Go for it!
> 
> > > +}
> 
> Much obliged
> Boaz
^ permalink raw reply	[flat|nested] 32+ messages in thread
Thread overview: 32+ messages
2019-09-26  2:07 [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
2019-09-26  2:07 ` [PATCH 01/16] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
2019-09-26  2:07 ` [PATCH 02/16] MAINTAINERS: Add the ZUFS maintainership Boaz Harrosh
2019-09-26  2:07 ` [PATCH 03/16] zuf: Preliminary Documentation Boaz Harrosh
2019-09-26  2:07 ` [PATCH 04/16] zuf: zuf-rootfs Boaz Harrosh
2019-09-26  2:07 ` [PATCH 05/16] zuf: zuf-core The ZTs Boaz Harrosh
2019-09-26  2:07 ` [PATCH 06/16] zuf: Multy Devices Boaz Harrosh
2019-09-26  2:07 ` [PATCH 07/16] zuf: mounting Boaz Harrosh
2019-09-26  2:07 ` [PATCH 08/16] zuf: Namei and directory operations Boaz Harrosh
2019-09-26  2:07 ` [PATCH 09/16] zuf: readdir operation Boaz Harrosh
2019-09-26  2:07 ` [PATCH 10/16] zuf: symlink Boaz Harrosh
2019-09-26  2:07 ` [PATCH 11/16] zuf: Write/Read implementation Boaz Harrosh
     [not found]   ` <db90d73233484d251755c5a0cb7ee570b3fc9d19.camel@netapp.com>
2019-10-29 20:15     ` Matthew Wilcox
2019-11-14 14:04       ` Boaz Harrosh
2019-11-14 15:15     ` Boaz Harrosh
2019-11-14 16:08       ` Schumaker, Anna
2019-09-26  2:07 ` [PATCH 12/16] zuf: mmap & sync Boaz Harrosh
2019-09-26  2:07 ` [PATCH 13/16] zuf: More file operation Boaz Harrosh
2019-09-26  2:07 ` [PATCH 14/16] zuf: ioctl implementation Boaz Harrosh
2019-09-26  2:07 ` [PATCH 15/16] zuf: xattr && acl implementation Boaz Harrosh
2019-09-26  2:07 ` [PATCH 16/16] zuf: Support for dynamic-debug of zusFSs Boaz Harrosh
2019-09-26  7:11 ` [PATCHSET v02 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Miklos Szeredi
2019-09-26  9:41   ` Bernd Schubert
2019-09-26 11:27   ` Boaz Harrosh
2019-09-26 12:12     ` Bernd Schubert
2019-09-26 12:24       ` Boaz Harrosh
2019-09-26 13:45         ` Miklos Szeredi
2019-09-26 12:48   ` Boaz Harrosh
2019-09-26 13:48     ` Miklos Szeredi
2019-09-26 11:41 ` Boaz Harrosh
  -- strict thread matches above, loose matches on Subject: below --
2019-08-12 16:47 [PATCHSET " Boaz Harrosh
2019-08-12 16:47 ` [PATCH 04/16] zuf: zuf-rootfs Boaz Harrosh
2019-08-12 16:42 [PATCHSET 00/16] zuf: ZUFS Zero-copy User-mode FileSystem Boaz Harrosh
2019-08-12 16:42 ` [PATCH 04/16] zuf: zuf-rootfs Boaz Harrosh