* [PATCH] vfs: get_next_ino(), never inum=0
@ 2014-04-29 15:45 hooanon05g
2014-04-29 17:42 ` J. R. Okajima
2014-08-18 18:21 ` [PATCH v2] " Carlos Maiolino
0 siblings, 2 replies; 9+ messages in thread
From: hooanon05g @ 2014-04-29 15:45 UTC (permalink / raw)
To: hch, dchinner, viro; +Cc: linux-fsdevel, J. R. Okajima
From: "J. R. Okajima" <hooanon05g@gmail.com>
It is very rare for get_next_ino() to return zero as a new inode number,
since its type is unsigned int, but it can certainly happen eventually.
Interestingly, ls(1) and find(1) don't show a file whose inum is zero,
so people won't be able to find it. This issue may be especially harmful
for tmpfs.
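For illustration, a tiny user-space sketch (not kernel code) of the
arithmetic behind the wrap-around:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned int ino = UINT_MAX;	/* the last value before the wrap */

	ino++;	/* wraps to 0: an inum that readdir consumers may hide */
	ino++;	/* then 1, 2, ... and previously used numbers repeat */
	printf("%u %u\n", UINT_MAX, ino);	/* prints: 4294967295 1 */
	return 0;
}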
On a very long-lived and busy system, users may frequently create files
on tmpfs. If one of them is unlucky enough to get inum=0, they cannot
see its filename. If they remember its name, they may still be able to
use or unlink it by name, since the file certainly exists. Otherwise,
the file remains on tmpfs silently and no one can touch it. This
behaviour looks like a resource leak.
In a worse case, if a dir gets inum=0 and a user creates several files
under it, the leaked memory will keep growing, since the user cannot see
the names of any files under the dir whose inum=0, regardless of the
inums of the children.
There is another unpleasant effect when get_next_ino() wraps
around. When there is a file whose inum=100 on tmpfs, a new file may get
inum=100. I am not sure what will happen when duplicated inums exist
on tmpfs. Anyway, this is not an issue in get_next_ino(); it should be
fixed in mm/shmem.c if it is really necessary.
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
---
fs/inode.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index f96d2a6..a3e274a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -848,7 +848,11 @@ unsigned int get_next_ino(void)
}
#endif
- *p = ++res;
+ res++;
+ /* never zero */
+ if (unlikely(!res))
+ res++;
+ *p = res;
put_cpu_var(last_ino);
return res;
}
--
1.7.10.4
* Re: [PATCH] vfs: get_next_ino(), never inum=0
2014-04-29 15:45 [PATCH] vfs: get_next_ino(), never inum=0 hooanon05g
@ 2014-04-29 17:42 ` J. R. Okajima
2014-04-29 17:53 ` Christoph Hellwig
2014-08-18 18:21 ` [PATCH v2] " Carlos Maiolino
1 sibling, 1 reply; 9+ messages in thread
From: J. R. Okajima @ 2014-04-29 17:42 UTC (permalink / raw)
To: hch, dchinner, viro, linux-fsdevel
> There is another unpleasant effect when get_next_ino() wraps
> around. When there is a file whose inum=100 on tmpfs, a new file may get
> inum=100. I am not sure what will happen when the duplicated inums exist
> on tmpfs. ...
Non-deterministic behaviour when exporting via NFS?
J. R. Okajima
* Re: [PATCH] vfs: get_next_ino(), never inum=0
2014-04-29 17:42 ` J. R. Okajima
@ 2014-04-29 17:53 ` Christoph Hellwig
2014-04-30 4:08 ` J. R. Okajima
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2014-04-29 17:53 UTC (permalink / raw)
To: J. R. Okajima; +Cc: dchinner, viro, linux-fsdevel
On Wed, Apr 30, 2014 at 02:42:02AM +0900, J. R. Okajima wrote:
>
> > There is another unpleasant effect when get_next_ino() wraps
> > around. When there is a file whose inum=100 on tmpfs, a new file may get
> > inum=100. I am not sure what will happen when the duplicated inums exist
> > on tmpfs. ...
>
> Non-deterministic behaviour when exporting via NFS?
If you care about really unique inode numbers you shouldn't use get_next_ino
but something like an idr allocator. The default i_ino assigned in
new_inode() from which get_next_ino was factored out was mostly intended
for small synthetic filesystems with few enough inodes that it wouldn't
wrap around.
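As a rough sketch of what a per-superblock IDR-based allocator could
look like (hypothetical code, not something posted in this thread; the
foo_* names and the mutex are made up, and the sb_info is assumed to
have been set up with mutex_init()/idr_init() at mount time):

#include <linux/fs.h>
#include <linux/idr.h>
#include <linux/mutex.h>

struct foo_sb_info {
	struct mutex	ino_lock;
	struct idr	ino_idr;	/* hands out unique, reusable inums */
};

static int foo_alloc_ino(struct foo_sb_info *sbi, struct inode *inode)
{
	int ino;

	mutex_lock(&sbi->ino_lock);
	/* inums 0 and 1 are reserved; idr_alloc() returns the lowest free id */
	ino = idr_alloc(&sbi->ino_idr, inode, 2, 0, GFP_KERNEL);
	mutex_unlock(&sbi->ino_lock);
	if (ino < 0)
		return ino;
	inode->i_ino = ino;
	return 0;
}

static void foo_free_ino(struct foo_sb_info *sbi, struct inode *inode)
{
	mutex_lock(&sbi->ino_lock);
	idr_remove(&sbi->ino_idr, inode->i_ino);
	mutex_unlock(&sbi->ino_lock);
}

Because freed ids are reused and the lowest free id is handed out first,
the numbers stay small and never collide within the superblock.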
And yes, file handle based lookups are screwed by duplicated inode numbers,
as are tools trying to do file-level de-duplication, mostly in the backup or
archival space.
* Re: [PATCH] vfs: get_next_ino(), never inum=0
2014-04-29 17:53 ` Christoph Hellwig
@ 2014-04-30 4:08 ` J. R. Okajima
2014-04-30 22:56 ` Andreas Dilger
0 siblings, 1 reply; 9+ messages in thread
From: J. R. Okajima @ 2014-04-30 4:08 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: dchinner, viro, linux-fsdevel
Christoph Hellwig:
> If you care about really unique inode numbers you shouldn't use get_next_ino
> but something like an idr allocator. The default i_ino assigned in
> new_inode() from which get_next_ino was factored out was mostly intended
> for small synthetic filesystems with few enough inodes that it wouldn't
> wrap around.
Grepping for get_next_ino(), I found 30 callers in mainline.
For how many of them is get_next_ino() inappropriate? I don't know. But at
least tmpfs had better manage the inums by itself, since tmpfs must be one
of the biggest consumers of inums and it is NFS-exportable.
Do you think we need a common function in VFS to manage inums per sb, or
is it totally up to each filesystem, making a common function unnecessary?
Instead of idr, I was thinking about a simple bitmap in tmpfs, like the
patch below. It introduces a new mount option "ino" which forces tmpfs to
assign the lowest unused number to a new inode within the mounted
tmpfs. Without "ino", or with "noino", the behaviour is unchanged
(use vfs:get_next_ino()).
But it may not scale well, since the single spinlock is taken on every
allocation.
J. R. Okajima
commit 214d38e8c34fb341fd0f37cc92614b5e93e0803b
Author: J. R. Okajima <hooanon05@yahoo.co.jp>
Date: Mon Sep 2 10:45:42 2013 +0900
shmem: management for inum
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d1771c..39762e1 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -29,6 +29,8 @@ struct shmem_sb_info {
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
+ spinlock_t ino_lock;
+ unsigned long *ino_bitmap;
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
umode_t mode; /* Mount mode for root directory */
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f70e02..bc2c5e4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -197,14 +197,61 @@ static int shmem_reserve_inode(struct super_block *sb)
return 0;
}
-static void shmem_free_inode(struct super_block *sb)
+static void shmem_free_inode(struct inode *inode)
{
+ struct super_block *sb = inode->i_sb;
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+
if (sbinfo->max_inodes) {
spin_lock(&sbinfo->stat_lock);
sbinfo->free_inodes++;
spin_unlock(&sbinfo->stat_lock);
}
+
+ if (!inode->i_nlink) {
+ spin_lock(&sbinfo->ino_lock);
+ if (sbinfo->ino_bitmap)
+ clear_bit(inode->i_ino - 2, sbinfo->ino_bitmap);
+ spin_unlock(&sbinfo->ino_lock);
+ }
+}
+
+/*
+ * This is unsigned int instead of unsigned long.
+ * For details, see fs/inode.c:get_next_ino().
+ */
+unsigned int shmem_next_ino(struct super_block *sb)
+{
+ unsigned long ino;
+ struct shmem_sb_info *sbinfo;
+
+ ino = 0;
+ sbinfo = SHMEM_SB(sb);
+ if (sbinfo->ino_bitmap) {
+ spin_lock(&sbinfo->ino_lock);
+ /*
+ * someone else may remount,
+ * and ino_bitmap might be reset.
+ */
+ if (sbinfo->ino_bitmap
+ && !bitmap_full(sbinfo->ino_bitmap, sbinfo->max_inodes)) {
+ ino = find_first_zero_bit(sbinfo->ino_bitmap,
+ sbinfo->max_inodes);
+ set_bit(ino, sbinfo->ino_bitmap);
+ ino += 2; /* ino 0 and 1 are reserved */
+ }
+ spin_unlock(&sbinfo->ino_lock);
+ }
+
+ /*
+ * someone else did remount,
+ * or ino_bitmap is unused originally,
+ * or ino_bitmap is full.
+ */
+ if (!ino)
+ ino = get_next_ino();
+
+ return ino;
}
/**
@@ -578,7 +625,7 @@ static void shmem_evict_inode(struct inode *inode)
simple_xattrs_free(&info->xattrs);
WARN_ON(inode->i_blocks);
- shmem_free_inode(inode->i_sb);
+ shmem_free_inode(inode);
clear_inode(inode);
}
@@ -1306,7 +1353,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
inode = new_inode(sb);
if (inode) {
- inode->i_ino = get_next_ino();
+ inode->i_ino = shmem_next_ino(sb);
inode_init_owner(inode, dir, mode);
inode->i_blocks = 0;
inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
@@ -1348,7 +1395,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
break;
}
} else
- shmem_free_inode(sb);
+ shmem_free_inode(inode);
return inode;
}
@@ -1945,7 +1992,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
struct inode *inode = dentry->d_inode;
if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode))
- shmem_free_inode(inode->i_sb);
+ shmem_free_inode(inode);
dir->i_size -= BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
@@ -2315,6 +2362,54 @@ static const struct export_operations shmem_export_ops = {
.fh_to_dentry = shmem_fh_to_dentry,
};
+static void shmem_ino_bitmap(struct shmem_sb_info *sbinfo,
+ unsigned long prev_max)
+{
+ unsigned long *p;
+ unsigned long n, d;
+ int do_msg;
+
+ n = sbinfo->max_inodes / BITS_PER_BYTE;
+ if (sbinfo->max_inodes % BITS_PER_BYTE)
+ n++;
+
+ do_msg = 0;
+ if (sbinfo->ino_bitmap) {
+ /*
+ * by shrinking the bitmap, the large inode number in use
+ * may be left. but it is harmless.
+ */
+ d = 0;
+ if (sbinfo->max_inodes > prev_max) {
+ d = sbinfo->max_inodes - prev_max;
+ d /= BITS_PER_BYTE;
+ }
+ spin_lock(&sbinfo->ino_lock);
+ p = krealloc(sbinfo->ino_bitmap, n, GFP_NOWAIT);
+ if (p) {
+ memset(p + n - d, 0, d);
+ sbinfo->ino_bitmap = p;
+ spin_unlock(&sbinfo->ino_lock);
+ } else {
+ p = sbinfo->ino_bitmap;
+ sbinfo->ino_bitmap = NULL;
+ spin_unlock(&sbinfo->ino_lock);
+ kfree(p);
+ do_msg = 1;
+ }
+ } else {
+ p = kzalloc(n, GFP_NOFS);
+ spin_lock(&sbinfo->ino_lock);
+ sbinfo->ino_bitmap = p;
+ spin_unlock(&sbinfo->ino_lock);
+ do_msg = !p;
+ }
+
+ if (unlikely(do_msg))
+ pr_err("%s: ino failed (%lu bytes). Ignored.\n",
+ __func__, n);
+}
+
static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
bool remount)
{
@@ -2322,7 +2417,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
struct mempolicy *mpol = NULL;
uid_t uid;
gid_t gid;
+ bool do_ino;
+ unsigned long old_val = sbinfo->max_inodes;
+ do_ino = 0;
while (options != NULL) {
this_char = options;
for (;;) {
@@ -2342,6 +2440,14 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
}
if (!*this_char)
continue;
+ if (!strcmp(this_char, "ino")) {
+ do_ino = 1;
+ continue;
+ } else if (!strcmp(this_char, "noino")) {
+ do_ino = 0;
+ continue;
+ }
+
if ((value = strchr(this_char,'=')) != NULL) {
*value++ = 0;
} else {
@@ -2370,7 +2476,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
goto bad_val;
} else if (!strcmp(this_char,"nr_inodes")) {
sbinfo->max_inodes = memparse(value, &rest);
- if (*rest)
+ if (*rest || !sbinfo->max_inodes)
goto bad_val;
} else if (!strcmp(this_char,"mode")) {
if (remount)
@@ -2408,6 +2514,16 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
}
}
sbinfo->mpol = mpol;
+
+ if (do_ino)
+ shmem_ino_bitmap(sbinfo, old_val);
+ else if (sbinfo->ino_bitmap) {
+ void *p = sbinfo->ino_bitmap;
+ spin_lock(&sbinfo->ino_lock);
+ sbinfo->ino_bitmap = NULL;
+ spin_unlock(&sbinfo->ino_lock);
+ kfree(p);
+ }
return 0;
bad_val:
@@ -2472,6 +2588,8 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
sbinfo->max_blocks << (PAGE_CACHE_SHIFT - 10));
if (sbinfo->max_inodes != shmem_default_max_inodes())
seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
+ if (sbinfo->ino_bitmap)
+ seq_printf(seq, ",ino");
if (sbinfo->mode != (S_IRWXUGO | S_ISVTX))
seq_printf(seq, ",mode=%03ho", sbinfo->mode);
if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID))
@@ -2491,6 +2609,7 @@ static void shmem_put_super(struct super_block *sb)
percpu_counter_destroy(&sbinfo->used_blocks);
mpol_put(sbinfo->mpol);
+ kfree(sbinfo->ino_bitmap);
kfree(sbinfo);
sb->s_fs_info = NULL;
}
@@ -2510,6 +2629,7 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
sbinfo->mode = S_IRWXUGO | S_ISVTX;
sbinfo->uid = current_fsuid();
sbinfo->gid = current_fsgid();
+ spin_lock_init(&sbinfo->ino_lock);
sb->s_fs_info = sbinfo;
#ifdef CONFIG_TMPFS
* Re: [PATCH] vfs: get_next_ino(), never inum=0
2014-04-30 4:08 ` J. R. Okajima
@ 2014-04-30 22:56 ` Andreas Dilger
2014-05-10 3:18 ` J. R. Okajima
0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2014-04-30 22:56 UTC (permalink / raw)
To: J. R. Okajima; +Cc: Christoph Hellwig, dchinner, viro, linux-fsdevel
On Apr 29, 2014, at 10:08 PM, J. R. Okajima <hooanon05g@gmail.com> wrote:
> Christoph Hellwig wrote:
>> If you care about really unique inode numbers you shouldn't use get_next_ino
>> but something like an idr allocator. The default i_ino assigned in
>> new_inode() from which get_next_ino was factored out was mostly intended
>> for small synthetic filesystems with few enough inodes that it wouldn't
>> wrap around.
>
> Grep-ping get_next_ino, I got 30 calls in mainline.
> How many of them are get_next_ino() inappropriate? I don't know. But at
> least for tmpfs, it is better to manage the inums by itself since tmpfs
> must be one of the biggest consumer of inums and it is NFS-exportable.
>
> Do you think we need a common function in VFS to manage inums per sb, or
> it is totally up to filesystem and the common function is unnecessary?
>
> Instead of idr, I was thinking about a simple bitmap in tmpfs such like
> this. It introduces a new mount option "ino" which forces tmpfs to
> assign the lowest unused number for a new inode within the mounted
> tmpfs. Without "ino" or specifying "noino", the behaviour is unchanged
> (use vfs:get_next_ino()).
> But it may not scale well due to the single spinlock every time.
The simplest solution is to just change get_next_ino() to return an
unsigned long to match i_ino, instead of an int. That avoids any
overhead in the most common cases (i.e. 64-bit systems where I highly
doubt there will ever be a counter wrap).
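A rough sketch of that first option (not a real patch from this thread;
it mirrors the current fs/inode.c structure, with LAST_INO_BATCH assumed
to keep its existing value):

#include <linux/atomic.h>
#include <linux/percpu.h>

#define LAST_INO_BATCH 1024

static DEFINE_PER_CPU(unsigned long, last_ino);

unsigned long get_next_ino(void)
{
	unsigned long *p = &get_cpu_var(last_ino);
	unsigned long res = *p;

#ifdef CONFIG_SMP
	if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
		static atomic64_t shared_last_ino;
		u64 next = atomic64_add_return(LAST_INO_BATCH, &shared_last_ino);

		res = next - LAST_INO_BATCH;
	}
#endif

	res++;
	if (unlikely(!res))	/* keep the "never zero" guarantee */
		res++;
	*p = res;
	put_cpu_var(last_ino);
	return res;
}

On 64-bit the wrap branch is effectively dead code; on 32-bit unsigned
long is still only 32 bits wide, which is what the rest of this mail is
about.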
We've also been discussing changing i_ino to be u64 so that this works
properly on 32-bit systems accessing 64-bit filesystems, but I don't
know where that stands today.
For 32-bit systems it would be possible to use get_next_ino() for the
common case of inode numbers < 2^32, and only fall back to doing a
lookup for an already-used inode in tmpfs if the counter wraps to 1.
That would avoid overhead for 99% of users since they are unlikely
to create more than 2^32 inodes in tmpfs over the lifetime of their
system. Even in the check-if-inum-in-use case after the 2^32 wrap,
it is very unlikely that many inodes would still be in use so the
hash lookup should go relatively quickly.
It could use something like an optimized find_inode() that just
determined quickly if the hash entry was in use. That would avoid
the constant spinlock contention in the most common cases, and only
impose it for systems in rare cases.
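A hedged sketch of that fallback (hypothetical code, not posted in this
thread: the wrapped flag and the function name are invented, ilookup()
stands in for the "optimized find_inode()", and it assumes tmpfs inodes
are actually in the inode hash, which today they are not):

#include <linux/fs.h>

static bool shmem_ino_wrapped;	/* assumption: set once the counter wraps */

static unsigned long shmem_next_ino_checked(struct super_block *sb)
{
	unsigned long ino;
	struct inode *old;

	for (;;) {
		ino = get_next_ino();
		if (!ino) {			/* never hand out inum 0 */
			shmem_ino_wrapped = true;
			continue;
		}
		if (!shmem_ino_wrapped)
			return ino;		/* fast path: no lookup at all */

		old = ilookup(sb, ino);		/* slow path after the 2^32 wrap */
		if (!old)
			return ino;
		iput(old);			/* still in use, try the next inum */
	}
}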
That said, I expect this overhead is more than that of just going to u64
for 32-bit systems.
Cheers, Andreas
* Re: [PATCH] vfs: get_next_ino(), never inum=0
2014-04-30 22:56 ` Andreas Dilger
@ 2014-05-10 3:18 ` J. R. Okajima
0 siblings, 0 replies; 9+ messages in thread
From: J. R. Okajima @ 2014-05-10 3:18 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Christoph Hellwig, dchinner, viro, linux-fsdevel
Andreas Dilger:
> The simplest solution is to just change get_next_ino() to return an
> unsigned long to match i_ino, instead of an int. That avoids any
> overhead in the most common cases (i.e. 64-bit systems where I highly
> doubt there will ever be a counter wrap).
>
> We've also been discussing changing i_ino to be u64 so that this works
> properly on 32-bit systems accessing 64-bit filesystems, but I don't
> know where that stands today.
I agree that such a wrap-around won't happen easily, although it can
happen technically.
At the same time, I am not sure changing to u64 is safe for 32-bit systems.
If nothing goes wrong, I agree get_next_ino() should return u64. Otherwise,
"if (unlikely(!inum)) inum++" is necessary.
> For 32-bit systems it would be possible to use get_next_ino() for the
> common case of inode numbers < 2^32, and only fall back to doing a
> lookup for an already-used inode in tmpfs if the counter wraps to 1.
How can tmpfs detect the wrap-around?
By storing the last largest inum locally?
> That would avoid overhead for 99% of users since they are unlikely
> to create more than 2^32 inodes in tmpfs over the lifetime of their
> system. Even in the check-if-inum-in-use case after the 2^32 wrap,
> it is very unlikely that many inodes would still be in use so the
> hash lookup should go relatively quickly.
I agree that not many inodes will live that long.
By the way, the reason I took the bitmap approach is to keep the inums
small. That is my local requirement and I know it won't be necessary for
generic use.
J. R. Okajima
* [PATCH v2] vfs: get_next_ino(), never inum=0
[not found] <'<CANn89i+PBEGp=9QGRioa7CUDZmApT-UNa=OJTdz4eu7AyO3Kbw@mail.gmail.com>
@ 2014-05-28 14:06 ` J. R. Okajima
0 siblings, 0 replies; 9+ messages in thread
From: J. R. Okajima @ 2014-05-28 14:06 UTC (permalink / raw)
To: linux-fsdevel, dchinner, viro, Eric Dumazet, Hugh Dickins,
Christoph Hellwig, Andreas Dilger, Jan Kara
It is very rare for get_next_ino() to return zero as a new inode number,
since its type is unsigned int, but it can certainly happen eventually.
Interestingly, ls(1) and find(1) (actually readdir(3)) don't show a file
whose inum is zero, so people won't be able to find it. This issue may
be especially harmful for tmpfs.
On a very long-lived and busy system, users may frequently create files
on tmpfs. If one of them is unlucky enough to get inum=0, they cannot
see its filename. If they remember its name, they may still be able to
use or unlink it by name, since the file certainly exists. Otherwise,
the file remains on tmpfs silently and no one can touch it. This
behaviour looks like a resource leak.
In a worse case, if a dir gets inum=0 and a user creates several files
under it, the leaked memory will keep growing, since the user cannot see
the names of any files under the dir whose inum=0, regardless of the
inums of the children.
There is another unpleasant effect when get_next_ino() wraps
around. When there is a file whose inum=100 on tmpfs, a new file may get
inum=100, i.e. duplicated inums. I am not sure what will happen when
duplicated inums exist on tmpfs, but I am afraid some tools, such as
backup tools, won't work correctly. Anyway, this is not an issue in
get_next_ino(); it should be fixed separately in mm/shmem.c if it is
really necessary.
There are many get_next_ino() callers other than tmpfs, such as several
drivers, anon_inode, autofs4, freevxfs, procfs, pipe, hugetlbfs,
configfs, ramfs, fuse, ocfs2, debugfs, securityfs, cgroup, socket, and ipc.
Some of them don't care about the inum, so this issue is harmless for them,
but the others may suffer from inum=0. For example, if procfs gets inum=0
for a task dir (or for one of its children), then several utilities won't
work correctly, including ps(1), lsof(8), etc.
(Essentially, the patch was re-written by Eric Dumazet.)
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: J. R. Okajima <hooanon05g@gmail.com>
---
fs/inode.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index 567296b..58e7c56 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -840,6 +840,8 @@ unsigned int get_next_ino(void)
unsigned int *p = &get_cpu_var(last_ino);
unsigned int res = *p;
+start:
+
#ifdef CONFIG_SMP
if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
static atomic_t shared_last_ino;
@@ -849,7 +851,9 @@ unsigned int get_next_ino(void)
}
#endif
- *p = ++res;
+ if (unlikely(!++res))
+ goto start; /* never zero */
+ *p = res;
put_cpu_var(last_ino);
WARN(!res, "static inum wrapped around");
return res;
--
1.7.10.4
* Re: [PATCH v2] vfs: get_next_ino(), never inum=0
2014-04-29 15:45 [PATCH] vfs: get_next_ino(), never inum=0 hooanon05g
2014-04-29 17:42 ` J. R. Okajima
@ 2014-08-18 18:21 ` Carlos Maiolino
2014-08-19 0:58 ` J. R. Okajima
1 sibling, 1 reply; 9+ messages in thread
From: Carlos Maiolino @ 2014-08-18 18:21 UTC (permalink / raw)
To: linux-fsdevel
This V2 looks very reasonable and fixes the problem with files with inode=0
on tmpfs, which I tested here, so consider it
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cheers
--
Carlos
* Re: [PATCH v2] vfs: get_next_ino(), never inum=0
2014-08-18 18:21 ` [PATCH v2] " Carlos Maiolino
@ 2014-08-19 0:58 ` J. R. Okajima
0 siblings, 0 replies; 9+ messages in thread
From: J. R. Okajima @ 2014-08-19 0:58 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: linux-fsdevel
Carlos Maiolino:
> This V2 looks very reasonable, and fix the problem with files with inode=0 on
> tmpfs which I tested here, so, consider it
>
> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Just out of curiosity, how did you notice the inode=0 problem? I think
it is hard for anyone to actually hit it.
And after posting the patch, some people reported a bug to me related to
SysV shm. This extra patch adds support for SysV shm, but I don't like it
since it introduces an additional condition into the normal path.
J. R. Okajima
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ca658a8..fda816e 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -25,6 +25,7 @@ struct shmem_inode_info {
struct shmem_sb_info {
struct mutex idr_lock;
+ bool idr_nouse;
struct idr idr; /* manages inode-number */
unsigned long max_blocks; /* How many blocks are allowed */
struct percpu_counter used_blocks; /* How many are allocated */
diff --git a/mm/shmem.c b/mm/shmem.c
index 0aa3b85..5eb75e9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -648,7 +648,7 @@ static void shmem_evict_inode(struct inode *inode)
simple_xattrs_free(&info->xattrs);
WARN_ON(inode->i_blocks);
- if (inode->i_ino) {
+ if (!sbinfo->idr_nouse && inode->i_ino) {
mutex_lock(&sbinfo->idr_lock);
idr_remove(&sbinfo->idr, inode->i_ino);
mutex_unlock(&sbinfo->idr_lock);
@@ -1423,19 +1423,24 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
break;
}
- /* inum 0 and 1 are unused */
- mutex_lock(&sbinfo->idr_lock);
- ino = idr_alloc(&sbinfo->idr, inode, 2, INT_MAX, GFP_NOFS);
- if (ino > 0) {
- inode->i_ino = ino;
- mutex_unlock(&sbinfo->idr_lock);
- __insert_inode_hash(inode, inode->i_ino);
- } else {
- inode->i_ino = 0;
- mutex_unlock(&sbinfo->idr_lock);
- iput(inode); /* shmem_free_inode() will be called */
- inode = NULL;
- }
+ if (!sbinfo->idr_nouse) {
+ /* inum 0 and 1 are unused */
+ mutex_lock(&sbinfo->idr_lock);
+ ino = idr_alloc(&sbinfo->idr, inode, 2, INT_MAX,
+ GFP_NOFS);
+ if (ino > 0) {
+ inode->i_ino = ino;
+ mutex_unlock(&sbinfo->idr_lock);
+ __insert_inode_hash(inode, inode->i_ino);
+ } else {
+ inode->i_ino = 0;
+ mutex_unlock(&sbinfo->idr_lock);
+ iput(inode);
+ /* shmem_free_inode() will be called */
+ inode = NULL;
+ }
+ } else
+ inode->i_ino = get_next_ino();
} else
shmem_free_inode(sb);
return inode;
@@ -2560,7 +2565,8 @@ static void shmem_put_super(struct super_block *sb)
{
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
- idr_destroy(&sbinfo->idr);
+ if (!sbinfo->idr_nouse)
+ idr_destroy(&sbinfo->idr);
percpu_counter_destroy(&sbinfo->used_blocks);
mpol_put(sbinfo->mpol);
kfree(sbinfo);
@@ -2682,6 +2688,15 @@ static void shmem_destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
}
+static __init void shmem_no_idr(struct super_block *sb)
+{
+ struct shmem_sb_info *sbinfo;
+
+ sbinfo = SHMEM_SB(sb);
+ sbinfo->idr_nouse = true;
+ idr_destroy(&sbinfo->idr);
+}
+
static const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
@@ -2814,6 +2829,7 @@ int __init shmem_init(void)
printk(KERN_ERR "Could not kern_mount tmpfs\n");
goto out1;
}
+ shmem_no_idr(shm_mnt->mnt_sb);
return 0;
out1: