Linux userland API discussions

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-12-01 16:48 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: systemd Mailing List,
	Емельянов Павел,
	Linux API, Linux Kernel Mailing List, Cyrill Gorcunov,
	Greg Kroah-Hartman, Eric W. Biederman, Andrew Morton
In-Reply-To: <CALYGNiM0C3RND_3yVzCMUtzoo33kG5jS+C1KL19j8JYf150-Og@mail.gmail.com>

On Mon, Dec 1, 2014 at 8:39 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Mon, Dec 1, 2014 at 7:21 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Sun, Nov 30, 2014 at 11:03 PM, Konstantin Khlebnikov
>> <koct9i@gmail.com> wrote:
>>> Hmm. What about per-task/thread UUID? exported via separate file: /proc/PID/uuid
>>> It could be created at the first access, thus this wouldn't shlowdown clone().
>>> Also it could be droped at execve(), so it'll describe execution
>>> context more precisely than pid.
>>>
>>
>> I'd be all for this if it weren't for two issues:
>>
>> 1. This thing needs to work as soon as init is started, and we can't
>> reliably generate random numbers that early.
>>
>> 2. Migration / CRIU is rather fundamentally at odds with making
>> anything universally unique, as opposed to just per-user-ns.
>>
>> So, given that I don't think we can actually provide a universally
>> unique pid-like thing, I'd rather not even try.
>
> Well... another idea: what about pid generation counter? Of course it
> should be per-pid-ns.
> As I see struct upid has nice 32-bit hole next to 'nr' field. (on
> 64-bit systems)
>

I thought about that, but it has two issues:

1. Implementing it will be pain.  The pid allocation algorithm is
already complicated, and it wasn't obvious to me how to accurately
detect wraparound without racing against other wrapping users.

2. I don't think it will be reliable.  Suppose that there are pid_max
- 1 processes.  One of them can repeatedly clone and exit,
incrementing the generation counter each time.  After 2^32 iterations,
which won't take all that long, the pid generation will wrap.

--Andy

>>
>> That being said, I want to rework this a little bit.  I think that
>> highpid should be restricted to the range 2^32 through 2^64 - 4096.
>> The former prevents anyone from confusing highpid with regular pid,
>> and the latter means that we don't need to worry about confusion
>> between errors and valid highpids (e.g. -1 will never be a highpid).
>>
>> Implementing that will be only mildly annoying.
>>
>> --Andy
>>
>>> On Sat, Nov 29, 2014 at 2:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> Pid reuse is common, which means that it's difficult or impossible
>>>> to read information about a pid from /proc without races.
>>>>
>>>> This introduces a second number associated with each (task, pidns)
>>>> pair called highpid.  Highpid is a 64-bit number, and, barring
>>>> extremely unlikely circumstances or outright error, a (highpid, pid)
>>>> will never be reused.
>>>>
>>>> With just this change, a program can open /proc/PID/status, read the
>>>> "Highpid" field, and confirm that it has the expected value.  If the
>>>> pid has been reused, then highpid will be different.
>>>>
>>>> The initial implementation is straightforward: highpid is simply a
>>>> 64-bit counter. If a high-end system can fork every 3 ns (which
>>>> would be amazing, given that just allocating a pid requires at
>>>> atomic operation), it would take well over 1000 years for highpid to
>>>> wrap.
>>>>
>>>> For CRIU's benefit, the next highpid can be set by a privileged
>>>> user.
>>>>
>>>> NB: The sysctl stuff only works on 64-bit systems.  If the approach
>>>> looks good, I'll fix that somehow.
>>>>
>>>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>>>> ---
>>>>
>>>> If this goes in, there's plenty of room to add new interfaces to
>>>> make this more useful.  For example, we could add a fancier tgkill
>>>> that adds and validates hightgid and highpid, and we might want to
>>>> add a syscall to read one's own hightgid and highpid.  These would
>>>> be quite useful for pidfiles.
>>>>
>>>> David, would this be useful for kdbus?
>>>>
>>>> CRIU people: will this be unduly difficult to support in CRIU?
>>>>
>>>> If you all think this is a good idea, I'll fix the sysctl stuff and
>>>> document it.  It might also be worth adding "Hightgid" to status.
>>>>
>>>>  fs/proc/array.c               |  5 ++++-
>>>>  include/linux/pid.h           |  2 ++
>>>>  include/linux/pid_namespace.h |  1 +
>>>>  kernel/pid.c                  | 19 +++++++++++++++----
>>>>  kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
>>>>  5 files changed, 44 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>>>> index cd3653e4f35c..f1e0e69d19f9 100644
>>>> --- a/fs/proc/array.c
>>>> +++ b/fs/proc/array.c
>>>> @@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>>         int g;
>>>>         struct fdtable *fdt = NULL;
>>>>         const struct cred *cred;
>>>> +       const struct upid *upid;
>>>>         pid_t ppid, tpid;
>>>>
>>>>         rcu_read_lock();
>>>> @@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>>                 if (tracer)
>>>>                         tpid = task_pid_nr_ns(tracer, ns);
>>>>         }
>>>> +       upid = pid_upid_ns(pid, ns);
>>>>         cred = get_task_cred(p);
>>>>         seq_printf(m,
>>>>                 "State:\t%s\n"
>>>>                 "Tgid:\t%d\n"
>>>>                 "Ngid:\t%d\n"
>>>>                 "Pid:\t%d\n"
>>>> +               "Highpid:\t%llu\n"
>>>>                 "PPid:\t%d\n"
>>>>                 "TracerPid:\t%d\n"
>>>>                 "Uid:\t%d\t%d\t%d\t%d\n"
>>>> @@ -183,7 +186,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>>                 get_task_state(p),
>>>>                 task_tgid_nr_ns(p, ns),
>>>>                 task_numa_group_id(p),
>>>> -               pid_nr_ns(pid, ns),
>>>> +               upid ? upid->nr : 0, upid ? upid->highnr : 0,
>>>>                 ppid, tpid,
>>>>                 from_kuid_munged(user_ns, cred->uid),
>>>>                 from_kuid_munged(user_ns, cred->euid),
>>>> diff --git a/include/linux/pid.h b/include/linux/pid.h
>>>> index 23705a53abba..ece70b64d04c 100644
>>>> --- a/include/linux/pid.h
>>>> +++ b/include/linux/pid.h
>>>> @@ -51,6 +51,7 @@ struct upid {
>>>>         /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
>>>>         int nr;
>>>>         struct pid_namespace *ns;
>>>> +       u64 highnr;
>>>>         struct hlist_node pid_chain;
>>>>  };
>>>>
>>>> @@ -170,6 +171,7 @@ static inline pid_t pid_nr(struct pid *pid)
>>>>  }
>>>>
>>>>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
>>>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns);
>>>>  pid_t pid_vnr(struct pid *pid);
>>>>
>>>>  #define do_each_pid_task(pid, type, task)                              \
>>>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>>>> index 1997ffc295a7..1f9f0455d7ef 100644
>>>> --- a/include/linux/pid_namespace.h
>>>> +++ b/include/linux/pid_namespace.h
>>>> @@ -26,6 +26,7 @@ struct pid_namespace {
>>>>         struct rcu_head rcu;
>>>>         int last_pid;
>>>>         unsigned int nr_hashed;
>>>> +       atomic64_t next_highpid;
>>>>         struct task_struct *child_reaper;
>>>>         struct kmem_cache *pid_cachep;
>>>>         unsigned int level;
>>>> diff --git a/kernel/pid.c b/kernel/pid.c
>>>> index 9b9a26698144..863d11a9bfbf 100644
>>>> --- a/kernel/pid.c
>>>> +++ b/kernel/pid.c
>>>> @@ -312,6 +312,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>>>
>>>>                 pid->numbers[i].nr = nr;
>>>>                 pid->numbers[i].ns = tmp;
>>>> +               pid->numbers[i].highnr =
>>>> +                       atomic64_inc_return(&tmp->next_highpid) - 1;
>>>>                 tmp = tmp->parent;
>>>>         }
>>>>
>>>> @@ -492,17 +494,26 @@ struct pid *find_get_pid(pid_t nr)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(find_get_pid);
>>>>
>>>> -pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>>>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns)
>>>>  {
>>>>         struct upid *upid;
>>>> -       pid_t nr = 0;
>>>>
>>>>         if (pid && ns->level <= pid->level) {
>>>>                 upid = &pid->numbers[ns->level];
>>>>                 if (upid->ns == ns)
>>>> -                       nr = upid->nr;
>>>> +                       return upid;
>>>>         }
>>>> -       return nr;
>>>> +       return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(pid_upid_ns);
>>>> +
>>>> +pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>>>> +{
>>>> +       const struct upid *upid = pid_upid_ns(pid, ns);
>>>> +
>>>> +       if (!upid)
>>>> +               return 0;
>>>> +       return upid->nr;
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(pid_nr_ns);
>>>>
>>>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>>>> index db95d8eb761b..e246802b4361 100644
>>>> --- a/kernel/pid_namespace.c
>>>> +++ b/kernel/pid_namespace.c
>>>> @@ -268,6 +268,22 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>>>>         return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>>>>  }
>>>>
>>>> +static int pid_ns_next_highpid_handler(struct ctl_table *table, int write,
>>>> +               void __user *buffer, size_t *lenp, loff_t *ppos)
>>>> +{
>>>> +       struct pid_namespace *pid_ns = task_active_pid_ns(current);
>>>> +       struct ctl_table tmp = *table;
>>>> +
>>>> +       if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
>>>> +               return -EPERM;
>>>> +
>>>> +       /* This needs to be fixed. */
>>>> +       BUILD_BUG_ON(sizeof(u64) != sizeof(unsigned long));
>>>> +
>>>> +       tmp.data = &pid_ns->next_highpid;
>>>> +       return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>>>> +}
>>>> +
>>>>  extern int pid_max;
>>>>  static int zero = 0;
>>>>  static struct ctl_table pid_ns_ctl_table[] = {
>>>> @@ -279,6 +295,12 @@ static struct ctl_table pid_ns_ctl_table[] = {
>>>>                 .extra1 = &zero,
>>>>                 .extra2 = &pid_max,
>>>>         },
>>>> +       {
>>>> +               .procname = "ns_next_highpid",
>>>> +               .maxlen = sizeof(u64),
>>>> +               .mode = 0666, /* permissions are checked in the handler */
>>>> +               .proc_handler = pid_ns_next_highpid_handler,
>>>> +       },
>>>>         { }
>>>>  };
>>>>  static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>>>> --
>>>> 1.9.3
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC

-- 
Andy Lutomirski
AMA Capital Management, LLC
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply