All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wen Yang <wenyang@linux.alibaba.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Sasha Levin <sashal@kernel.org>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>,
	linux-kernel@vger.kernel.org, Pavel Emelyanov <xemul@openvz.org>,
	Oleg Nesterov <oleg@tv-sign.ru>,
	Sukadev Bhattiprolu <sukadev@us.ibm.com>,
	Paul Menage <menage@google.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
Date: Thu, 17 Dec 2020 10:26:23 +0800	[thread overview]
Message-ID: <06bffff8-ed78-e8f5-191e-ecaaec266d46@linux.alibaba.com> (raw)
In-Reply-To: <20201203183204.63759-1-wenyang@linux.alibaba.com>



在 2020/12/4 上午2:31, Wen Yang 写道:
> The dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag, they
> should be deleted when the process exits.
> 
> Suppose the following race appears:
> 
> release_task                 dput
> -> proc_flush_task
>                               -> dentry->d_op->d_delete(dentry)
> -> __exit_signal
>                               -> dentry->d_lockref.count--  and return.
> 
> In the proc_flush_task(), if another process is using this dentry, it will
> not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
> before __exit_signal(pid has not been hashed), d_delete returns false, so
> this dentry still cannot be deleted.
> 
> This dentry will always be cached (although its count is 0 and the
> DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
> these dentries can only be deleted when drop_caches is manually triggered.
> 
> This will result in wasted memory. What's more troublesome is that these
> dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
> limit on the number of pid namespaces"), if the pid cannot be released, it
> may result in the inability to create a new pid_ns.
> 
> This problem occurred in our cluster environment (Linux 4.9 LTS).
> We could reproduce it by manually constructing a test program + adding some
> debugging switches in the kernel:
> * A test program to open the directory (/proc/<pid>/ns) [1]
> * Adding some debugging switches to the kernel, adding a delay between
>     proc_flush_task and __exit_signal in release_task() [2]
> 
> The test process is as follows:
> 
> A, terminal #1
> 
> Turn on the debug switch:
> echo 1> /proc/sys/vm/dentry_debug_trace
> 
> Execute the following unshare command:
> sudo unshare --pid --fork --mount-proc bash
> 
> 
> B, terminal #2
> 
> Find the pid of the unshare process:
> 
> # pstree -p | grep unshare
>             | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)
> 
> 
> Find the corresponding dentry:
> # dmesg | grep pid=818
> [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8
> 
> 
> C, terminal #3
> 
> Execute the opendir program, it will always open the /proc/818/ns/ directory:
> 
> # ./a.out /proc/818/ns/
> pid: 876
> .
> ..
> net
> uts
> ipc
> pid
> user
> mnt
> cgroup
> 
> D, go back to terminal #2
> 
> Turn on the debugging switches to construct the race:
> # echo 818> /proc/sys/vm/dentry_debug_pid
> # echo 1> /proc/sys/vm/dentry_debug_delay
> 
> Kill the unshare process (pid 818). Since the debugging switches have been
> turned on, it will get stuck in release_task():
> # kill -9 818
> 
> Then kill the process that opened the /proc/818/ns/ directory:
> # kill -9 876
> 
> Then turn off these debugging switches to allow the 818 process to exit:
> # echo 0> /proc/sys/vm/dentry_debug_delay
> # echo 0> /proc/sys/vm/dentry_debug_pid
> 
> Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
> and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still
> cached:
> # dmesg | grep ffff8802a3999548
> …
> [565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached
> 
> 
> It could also be verified via the crash tool:
> 
> crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  ffff8802bea7b528
>    d_flags = 0x2800cc
>    d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk"
>    d_inode = 0xffff8802b38c2010
>    d_lockref = {
>      {
>        lock_count = 0x0,
>        {
>          lock = {
>            {
>              rlock = {
>                raw_lock = {
>                  {
>                    val = {
>                      counter = 0x0
>                    },
>                    {
>                      locked = 0x0,
>                      pending = 0x0
>                    },
>                    {
>                      locked_pending = 0x0,
>                      tail = 0x0
>                    }
>                  }
>                }
>              }
>            }
>          },
>          count = 0x0
>        }
>      }
>    }
> crash> kmem  ffff8802bea7b528
> CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
> ffff8802dd5f5900      192      23663     26130    871    16k  dentry
>    SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>    ffffea000afa9e00  ffff8802bea78000     0     30         25     5
>    FREE / [ALLOCATED]
>    [ffff8802bea7b520]
> 
>        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
> ffffea000afa9ec0 2bea7b000 dead000000000400        0  0 2fffff80000000
> crash>
> 
> This series of patches is to fix this issue.
> 
> Regards,
> Wen
> 
> Alexey Dobriyan (1):
>    proc: use %u for pid printing and slightly less stack
> 
> Andreas Gruenbacher (1):
>    proc: Pass file mode to proc_pid_make_inode
> 
> Christian Brauner (1):
>    clone: add CLONE_PIDFD
> 
> Eric W. Biederman (6):
>    proc: Better ownership of files for non-dumpable tasks in user
>      namespaces
>    proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
>    proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
>    proc: Clear the pieces of proc_inode that proc_evict_inode cares about
>    proc: Use d_invalidate in proc_prune_siblings_dcache
>    proc: Use a list of inodes to flush from proc
> 
> Joel Fernandes (Google) (1):
>    pidfd: add polling support
> 
>   fs/proc/base.c             | 242 ++++++++++++++++++++-------------------------
>   fs/proc/fd.c               |  20 +---
>   fs/proc/inode.c            |  67 ++++++++++++-
>   fs/proc/internal.h         |  22 ++---
>   fs/proc/namespaces.c       |   3 +-
>   fs/proc/proc_sysctl.c      |  45 ++-------
>   fs/proc/self.c             |   6 +-
>   fs/proc/thread_self.c      |   5 +-
>   include/linux/pid.h        |   5 +
>   include/linux/proc_fs.h    |   4 +-
>   include/uapi/linux/sched.h |   1 +
>   kernel/exit.c              |   5 +-
>   kernel/fork.c              | 145 ++++++++++++++++++++++++++-
>   kernel/pid.c               |   3 +
>   kernel/signal.c            |  11 +++
>   security/selinux/hooks.c   |   1 +
>   16 files changed, 357 insertions(+), 228 deletions(-)
> 
> [1] A test program to open the directory (/proc/<pid>/ns)
> #include <stdio.h>
> #include <sys/types.h>
> #include <dirent.h>
> #include <errno.h>
> 
> int main(int argc, char *argv[])
> {
> 	DIR *dip;
> 	struct dirent *dit;
> 
> 	if (argc < 2) {
> 		printf("Usage :%s <directory>\n", argv[0]);
> 		return -1;
> 	}
> 
> 	if ((dip = opendir(argv[1])) == NULL) {
> 		perror("opendir");
> 		return -1;
> 	}
> 
> 	printf("pid: %d\n", getpid());
> 	while((dit = readdir (dip)) != NULL) {
> 		printf("%s\n", dit->d_name);
> 	}
> 
> 	while (1)
> 		sleep (1);
> 
> 	return 0;
> }
> 
> [2] Adding some debugging switches to the kernel, also adding a delay between
>      proc_flush_task and __exit_signal in release_task():
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 05bad55..fafad37 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -84,6 +84,9 @@
>   int sysctl_vfs_cache_pressure __read_mostly = 100;
>   EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
> 
> +int sysctl_dentry_debug_trace __read_mostly = 0;
> +EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace);
> +
>   __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
> 
>   EXPORT_SYMBOL(rename_lock);
> @@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry)
>   	return 0;
>   }
> 
> +#define DENTRY_DEBUG_TRACE(dentry, keywords)                            \
> +do {                                                                    \
> +	if (sysctl_dentry_debug_trace)                                   \
> +		printk("XXX %s:%d "                                      \
> +                	"dentry=%s/%p, flag=%x, cnt=%d, inode=%p, "      \
> +                	"pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, "  \
> +			"keywords: %s\n",                                \
> +			__func__, __LINE__,                              \
> +			dentry->d_name.name,                             \
> +			dentry,                                          \
> +			dentry->d_flags,                                 \
> +			dentry->d_lockref.count,                         \
> +			dentry->d_inode,                                 \
> +			dentry->d_parent->d_name.name,                   \
> +			dentry->d_parent,                                \
> +			dentry->d_parent->d_flags,                       \
> +			dentry->d_parent->d_lockref.count,               \
> +			dentry->d_parent->d_inode,                       \
> +			keywords);                                       \
> +} while (0)
> 
>   /*
>    * This is dput
> @@ -804,6 +827,8 @@ void dput(struct dentry *dentry)
> 
>   	WARN_ON(d_in_lookup(dentry));
> 
> +	DENTRY_DEBUG_TRACE(dentry, "be checked");
> +
>   	/* Unreachable? Get rid of it */
>   	if (unlikely(d_unhashed(dentry)))
>   		goto kill_it;
> @@ -812,8 +837,10 @@ void dput(struct dentry *dentry)
>   		goto kill_it;
> 
>   	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
> -		if (dentry->d_op->d_delete(dentry))
> +		if (dentry->d_op->d_delete(dentry)) {
> +			DENTRY_DEBUG_TRACE(dentry, "be killed");
>   			goto kill_it;
> +		}
>   	}
> 
>   	if (!(dentry->d_flags & DCACHE_REFERENCED))
> @@ -822,6 +849,9 @@ void dput(struct dentry *dentry)
> 
>   	dentry->d_lockref.count--;
>   	spin_unlock(&dentry->d_lock);
> +
> +	DENTRY_DEBUG_TRACE(dentry, "be cached");
> +
>   	return;
> 
>   kill_it:
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index b9e4183..419a409 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task)
>   	}
>   }
> 
> +extern int sysctl_dentry_debug_trace;
> +
>   static int proc_pid_instantiate(struct inode *dir,
>   				   struct dentry * dentry,
>   				   struct task_struct *task, const void *ptr)
> @@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir,
>   	d_set_d_op(dentry, &pid_dentry_operations);
> 
>   	d_add(dentry, inode);
> +
> +	if (sysctl_dentry_debug_trace)
> +		printk("XXX %s:%d pid=%d tid=%d  entry=%s/%p\n",
> +			__func__, __LINE__, task->pid, task->tgid,
> +			dentry->d_name.name, dentry);
> +
>   	/* Close the race of the process dying before we return the dentry */
>   	if (pid_revalidate(dentry, 0))
>   		return 0;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 27f4168..2b3e1b6 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -55,6 +55,8 @@
>   #include <linux/shm.h>
>   #include <linux/kcov.h>
> 
> +#include <linux/delay.h>
> +
>   #include <asm/uaccess.h>
>   #include <asm/unistd.h>
>   #include <asm/pgtable.h>
> @@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>   	put_task_struct(tsk);
>   }
> 
> +int sysctl_dentry_debug_delay __read_mostly = 0;
> +int sysctl_dentry_debug_pid __read_mostly = 0;
> 
>   void release_task(struct task_struct *p)
>   {
> @@ -178,6 +182,11 @@ void release_task(struct task_struct *p)
> 
>   	proc_flush_task(p);
> 
> +	if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
> +		while (sysctl_dentry_debug_delay)
> +			mdelay(1);
> +	}
> +
>   	write_lock_irq(&tasklist_lock);
>   	ptrace_release_task(p);
>   	__exit_signal(p);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 513e6da..27f1395 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>   static int max_extfrag_threshold = 1000;
>   #endif
> 
> +extern int sysctl_dentry_debug_trace;
> +extern int sysctl_dentry_debug_delay;
> +extern int sysctl_dentry_debug_pid;
> +
>   static struct ctl_table kern_table[] = {
>   	{
>   		.procname	= "sched_child_runs_first",
> @@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>   		.proc_handler	= proc_dointvec,
>   		.extra1		= &zero,
>   	},
> +	{
> +		.procname	= "dentry_debug_trace",
> +		.data		= &sysctl_dentry_debug_trace,
> +		.maxlen		= sizeof(sysctl_dentry_debug_trace),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +	},
> +	{
> +		.procname	= "dentry_debug_delay",
> +		.data		= &sysctl_dentry_debug_delay,
> +		.maxlen		= sizeof(sysctl_dentry_debug_delay),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +	},
> +	{
> +		.procname	= "dentry_debug_pid",
> +		.data		= &sysctl_dentry_debug_pid,
> +		.maxlen		= sizeof(sysctl_dentry_debug_pid),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +	},
>   #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
>   	{
>   		.procname	= "legacy_va_layout",
> 
> 
> Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> Cc: Pavel Emelyanov <xemul@openvz.org>
> Cc: Oleg Nesterov <oleg@tv-sign.ru>
> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
> Cc: Paul Menage <menage@google.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: <stable@vger.kernel.org>
> 

Hi Greg,

Could you kindly give some suggestions?

Thanks,



  parent reply	other threads:[~2020-12-17  2:27 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
2020-12-03 18:31 ` [PATCH 01/10] clone: add CLONE_PIDFD Wen Yang
2021-01-04 13:03   ` Greg Kroah-Hartman
2021-01-04 13:13     ` Christian Brauner
2021-01-04 13:17       ` Greg Kroah-Hartman
2021-01-04 13:19         ` Christian Brauner
2020-12-03 18:31 ` [PATCH 02/10] pidfd: add polling support Wen Yang
2020-12-03 18:31 ` [PATCH 03/10] proc: Pass file mode to proc_pid_make_inode Wen Yang
2020-12-03 18:31 ` [PATCH 04/10] proc: Better ownership of files for non-dumpable tasks in user namespaces Wen Yang
2020-12-03 18:31 ` [PATCH 05/10] proc: use %u for pid printing and slightly less stack Wen Yang
2020-12-03 20:08   ` Alexey Dobriyan
2020-12-03 18:32 ` [PATCH 06/10] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Wen Yang
2020-12-03 18:32 ` [PATCH 07/10] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Wen Yang
2020-12-03 18:32 ` [PATCH 08/10] proc: Clear the pieces of proc_inode that proc_evict_inode cares about Wen Yang
2020-12-03 18:32 ` [PATCH 09/10] proc: Use d_invalidate in proc_prune_siblings_dcache Wen Yang
2020-12-03 18:32 ` [PATCH 10/10] proc: Use a list of inodes to flush from proc Wen Yang
2020-12-17  2:26 ` Wen Yang [this message]
2020-12-31  9:22   ` [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Greg Kroah-Hartman
2021-01-04  4:15     ` Wen Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=06bffff8-ed78-e8f5-191e-ecaaec266d46@linux.alibaba.com \
    --to=wenyang@linux.alibaba.com \
    --cc=ebiederm@xmission.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=menage@google.com \
    --cc=oleg@tv-sign.ru \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=sukadev@us.ibm.com \
    --cc=xemul@openvz.org \
    --cc=xlpang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.