Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-12-01 16:48 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: systemd Mailing List,
	Емельянов Павел,
	Linux API, Linux Kernel Mailing List, Cyrill Gorcunov,
	Greg Kroah-Hartman, Eric W. Biederman, Andrew Morton
In-Reply-To: <CALYGNiM0C3RND_3yVzCMUtzoo33kG5jS+C1KL19j8JYf150-Og@mail.gmail.com>

On Mon, Dec 1, 2014 at 8:39 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Mon, Dec 1, 2014 at 7:21 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Sun, Nov 30, 2014 at 11:03 PM, Konstantin Khlebnikov
>> <koct9i@gmail.com> wrote:
>>> Hmm. What about per-task/thread UUID? exported via separate file: /proc/PID/uuid
>>> It could be created at the first access, thus this wouldn't shlowdown clone().
>>> Also it could be droped at execve(), so it'll describe execution
>>> context more precisely than pid.
>>>
>>
>> I'd be all for this if it weren't for two issues:
>>
>> 1. This thing needs to work as soon as init is started, and we can't
>> reliably generate random numbers that early.
>>
>> 2. Migration / CRIU is rather fundamentally at odds with making
>> anything universally unique, as opposed to just per-user-ns.
>>
>> So, given that I don't think we can actually provide a universally
>> unique pid-like thing, I'd rather not even try.
>
> Well... another idea: what about pid generation counter? Of course it
> should be per-pid-ns.
> As I see struct upid has nice 32-bit hole next to 'nr' field. (on
> 64-bit systems)
>

I thought about that, but it has two issues:

1. Implementing it will be pain.  The pid allocation algorithm is
already complicated, and it wasn't obvious to me how to accurately
detect wraparound without racing against other wrapping users.

2. I don't think it will be reliable.  Suppose that there are pid_max
- 1 processes.  One of them can repeatedly clone and exit,
incrementing the generation counter each time.  After 2^32 iterations,
which won't take all that long, the pid generation will wrap.

--Andy

>>
>> That being said, I want to rework this a little bit.  I think that
>> highpid should be restricted to the range 2^32 through 2^64 - 4096.
>> The former prevents anyone from confusing highpid with regular pid,
>> and the latter means that we don't need to worry about confusion
>> between errors and valid highpids (e.g. -1 will never be a highpid).
>>
>> Implementing that will be only mildly annoying.
>>
>> --Andy
>>
>>> On Sat, Nov 29, 2014 at 2:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> Pid reuse is common, which means that it's difficult or impossible
>>>> to read information about a pid from /proc without races.
>>>>
>>>> This introduces a second number associated with each (task, pidns)
>>>> pair called highpid.  Highpid is a 64-bit number, and, barring
>>>> extremely unlikely circumstances or outright error, a (highpid, pid)
>>>> will never be reused.
>>>>
>>>> With just this change, a program can open /proc/PID/status, read the
>>>> "Highpid" field, and confirm that it has the expected value.  If the
>>>> pid has been reused, then highpid will be different.
>>>>
>>>> The initial implementation is straightforward: highpid is simply a
>>>> 64-bit counter. If a high-end system can fork every 3 ns (which
>>>> would be amazing, given that just allocating a pid requires at
>>>> atomic operation), it would take well over 1000 years for highpid to
>>>> wrap.
>>>>
>>>> For CRIU's benefit, the next highpid can be set by a privileged
>>>> user.
>>>>
>>>> NB: The sysctl stuff only works on 64-bit systems.  If the approach
>>>> looks good, I'll fix that somehow.
>>>>
>>>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>>>> ---
>>>>
>>>> If this goes in, there's plenty of room to add new interfaces to
>>>> make this more useful.  For example, we could add a fancier tgkill
>>>> that adds and validates hightgid and highpid, and we might want to
>>>> add a syscall to read one's own hightgid and highpid.  These would
>>>> be quite useful for pidfiles.
>>>>
>>>> David, would this be useful for kdbus?
>>>>
>>>> CRIU people: will this be unduly difficult to support in CRIU?
>>>>
>>>> If you all think this is a good idea, I'll fix the sysctl stuff and
>>>> document it.  It might also be worth adding "Hightgid" to status.
>>>>
>>>>  fs/proc/array.c               |  5 ++++-
>>>>  include/linux/pid.h           |  2 ++
>>>>  include/linux/pid_namespace.h |  1 +
>>>>  kernel/pid.c                  | 19 +++++++++++++++----
>>>>  kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
>>>>  5 files changed, 44 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>>>> index cd3653e4f35c..f1e0e69d19f9 100644
>>>> --- a/fs/proc/array.c
>>>> +++ b/fs/proc/array.c
>>>> @@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>>         int g;
>>>>         struct fdtable *fdt = NULL;
>>>>         const struct cred *cred;
>>>> +       const struct upid *upid;
>>>>         pid_t ppid, tpid;
>>>>
>>>>         rcu_read_lock();
>>>> @@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>>                 if (tracer)
>>>>                         tpid = task_pid_nr_ns(tracer, ns);
>>>>         }
>>>> +       upid = pid_upid_ns(pid, ns);
>>>>         cred = get_task_cred(p);
>>>>         seq_printf(m,
>>>>                 "State:\t%s\n"
>>>>                 "Tgid:\t%d\n"
>>>>                 "Ngid:\t%d\n"
>>>>                 "Pid:\t%d\n"
>>>> +               "Highpid:\t%llu\n"
>>>>                 "PPid:\t%d\n"
>>>>                 "TracerPid:\t%d\n"
>>>>                 "Uid:\t%d\t%d\t%d\t%d\n"
>>>> @@ -183,7 +186,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>>                 get_task_state(p),
>>>>                 task_tgid_nr_ns(p, ns),
>>>>                 task_numa_group_id(p),
>>>> -               pid_nr_ns(pid, ns),
>>>> +               upid ? upid->nr : 0, upid ? upid->highnr : 0,
>>>>                 ppid, tpid,
>>>>                 from_kuid_munged(user_ns, cred->uid),
>>>>                 from_kuid_munged(user_ns, cred->euid),
>>>> diff --git a/include/linux/pid.h b/include/linux/pid.h
>>>> index 23705a53abba..ece70b64d04c 100644
>>>> --- a/include/linux/pid.h
>>>> +++ b/include/linux/pid.h
>>>> @@ -51,6 +51,7 @@ struct upid {
>>>>         /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
>>>>         int nr;
>>>>         struct pid_namespace *ns;
>>>> +       u64 highnr;
>>>>         struct hlist_node pid_chain;
>>>>  };
>>>>
>>>> @@ -170,6 +171,7 @@ static inline pid_t pid_nr(struct pid *pid)
>>>>  }
>>>>
>>>>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
>>>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns);
>>>>  pid_t pid_vnr(struct pid *pid);
>>>>
>>>>  #define do_each_pid_task(pid, type, task)                              \
>>>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>>>> index 1997ffc295a7..1f9f0455d7ef 100644
>>>> --- a/include/linux/pid_namespace.h
>>>> +++ b/include/linux/pid_namespace.h
>>>> @@ -26,6 +26,7 @@ struct pid_namespace {
>>>>         struct rcu_head rcu;
>>>>         int last_pid;
>>>>         unsigned int nr_hashed;
>>>> +       atomic64_t next_highpid;
>>>>         struct task_struct *child_reaper;
>>>>         struct kmem_cache *pid_cachep;
>>>>         unsigned int level;
>>>> diff --git a/kernel/pid.c b/kernel/pid.c
>>>> index 9b9a26698144..863d11a9bfbf 100644
>>>> --- a/kernel/pid.c
>>>> +++ b/kernel/pid.c
>>>> @@ -312,6 +312,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>>>
>>>>                 pid->numbers[i].nr = nr;
>>>>                 pid->numbers[i].ns = tmp;
>>>> +               pid->numbers[i].highnr =
>>>> +                       atomic64_inc_return(&tmp->next_highpid) - 1;
>>>>                 tmp = tmp->parent;
>>>>         }
>>>>
>>>> @@ -492,17 +494,26 @@ struct pid *find_get_pid(pid_t nr)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(find_get_pid);
>>>>
>>>> -pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>>>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns)
>>>>  {
>>>>         struct upid *upid;
>>>> -       pid_t nr = 0;
>>>>
>>>>         if (pid && ns->level <= pid->level) {
>>>>                 upid = &pid->numbers[ns->level];
>>>>                 if (upid->ns == ns)
>>>> -                       nr = upid->nr;
>>>> +                       return upid;
>>>>         }
>>>> -       return nr;
>>>> +       return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(pid_upid_ns);
>>>> +
>>>> +pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>>>> +{
>>>> +       const struct upid *upid = pid_upid_ns(pid, ns);
>>>> +
>>>> +       if (!upid)
>>>> +               return 0;
>>>> +       return upid->nr;
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(pid_nr_ns);
>>>>
>>>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>>>> index db95d8eb761b..e246802b4361 100644
>>>> --- a/kernel/pid_namespace.c
>>>> +++ b/kernel/pid_namespace.c
>>>> @@ -268,6 +268,22 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>>>>         return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>>>>  }
>>>>
>>>> +static int pid_ns_next_highpid_handler(struct ctl_table *table, int write,
>>>> +               void __user *buffer, size_t *lenp, loff_t *ppos)
>>>> +{
>>>> +       struct pid_namespace *pid_ns = task_active_pid_ns(current);
>>>> +       struct ctl_table tmp = *table;
>>>> +
>>>> +       if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
>>>> +               return -EPERM;
>>>> +
>>>> +       /* This needs to be fixed. */
>>>> +       BUILD_BUG_ON(sizeof(u64) != sizeof(unsigned long));
>>>> +
>>>> +       tmp.data = &pid_ns->next_highpid;
>>>> +       return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>>>> +}
>>>> +
>>>>  extern int pid_max;
>>>>  static int zero = 0;
>>>>  static struct ctl_table pid_ns_ctl_table[] = {
>>>> @@ -279,6 +295,12 @@ static struct ctl_table pid_ns_ctl_table[] = {
>>>>                 .extra1 = &zero,
>>>>                 .extra2 = &pid_max,
>>>>         },
>>>> +       {
>>>> +               .procname = "ns_next_highpid",
>>>> +               .maxlen = sizeof(u64),
>>>> +               .mode = 0666, /* permissions are checked in the handler */
>>>> +               .proc_handler = pid_ns_next_highpid_handler,
>>>> +       },
>>>>         { }
>>>>  };
>>>>  static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>>>> --
>>>> 1.9.3
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* Re: [PATCH v2 02/19] kbuild: kselftest_install - add a new make target to install selftests
From: Shuah Khan @ 2014-12-01 16:39 UTC (permalink / raw)
  To: Michal Marek, gregkh, akpm, davem, keescook, tranmanphong,
	dh.herrmann, hughd, bobby.prani, ebiederm, serge.hallyn
  Cc: linux-kbuild, linux-kernel, linux-api, netdev,
	masami.hiramatsu.pt@hitachi.com >> Masami Hiramatsu,
	Shuah Khan
In-Reply-To: <547C8D81.1030605@suse.cz>

On 12/01/2014 08:47 AM, Michal Marek wrote:
> On 2014-11-11 21:27, Shuah Khan wrote:
>> Add a new make target to install to install kernel selftests.
>> This new target will build and install selftests. kselftest
>> target now depends on kselftest_install and runs the generated
>> kselftest script to reduce duplicate work and for common look
>> and feel when running tests.
>>
>> Approach:
>>
>> make kselftest_target:
>> -- exports kselftest INSTALL_KSFT_PATH
>>    default $(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
>> -- exports path for ksefltest.sh
>> -- runs selftests make install target:
>>
>> selftests make install target
>> -- creates kselftest.sh script in install install dir
>> -- runs install targets for all INSTALL_TARGETS
>>    (Note: ftrace and powerpc aren't included in INSTALL_TARGETS,
>>           to not add more content to patch v1 series. This work
>>           will happen soon. In this series these two targets are
>>           run after running the generated kselftest script, without
>>           any regression in the way these tests are run with
>>           "make kselftest" prior to this work.)
>> -- install target can be run only from top level source dir.
>>
>> Individual test make install targets:
>> -- install test programs and/or scripts in install dir
>> -- append to the ksefltest.sh file to add commands to run test
>> -- install target can be run only from top level source dir.
>>
>> Adds the following new ways to initiate selftests:
>> -- Installing and running kselftest from install directory
>>    by running  "make kselftest"
>> -- Running kselftest script from install directory
>>
>> Maintains the following ways to run tests:
>> -- make -C tools/testing/selftests run_tests
>> -- make -C tools/testing/selftests TARGETS=target run_tests
>>    Ability specify targets: e.g TARGETS=net
>> -- make run_tests from tools/testing/selftests
>> -- make run_tests from individual test directories:
>>    e.g: make run_tests in tools/testing/selftests/breakpoints
>>
>> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
>> ---
>>  Makefile | 21 +++++++++++++++++++--
>>  1 file changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/Makefile b/Makefile
>> index 05d67af..ccbd2e1 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -1071,12 +1071,26 @@ headers_check: headers_install
>>  	$(Q)$(MAKE) $(hdr-inst)=arch/$(hdr-arch)/include/uapi/asm $(hdr-dst) HDRCHECK=1
>>  
>>  # ---------------------------------------------------------------------------
>> -# Kernel selftest
>> +# Kernel selftest targets
>> +
>> +PHONY += __kselftest_configure
>> +INSTALL_KSFT_PATH=$(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
>> +export INSTALL_KSFT_PATH
>> +KSELFTEST=$(INSTALL_KSFT_PATH)/kselftest.sh
>> +export KSELFTEST
> 
> Can this be moved to tools/testing/selftests/Makefile? It's only used in
> this part of the tree.

I looked into doing that. KERNELRELEASE will have to be exported for
tools/testing/selftests/Makefile to use it? Does that sound okay?

> 
> 
>>  PHONY += kselftest
>> -kselftest:
>> +kselftest: kselftest_install
>>  	$(Q)$(MAKE) -C tools/testing/selftests run_tests
>>  
>> +# Kernel selftest install
>> +
>> +PHONY += kselftest_install
>> +kselftest_install: __kselftest_configure
>> +	@rm -rf $(INSTALL_KSFT_PATH)
>> +	@mkdir -p $(INSTALL_KSFT_PATH)
> 
> Please use $(Q) insteaf od hardcoding the @.

I can do that.

> 
> 
>> +	$(Q)$(MAKE) -C tools/testing/selftests install
> 
> The install target is only added by the next patch, which in turn
> depends on changes done by later patches. The order (in the current
> numbering) should be 01/19, 04/19, ... 19/19, 03/19 and 02/19.
> 

Right. I can re-order the patches in my next version and make other
changes you recommended.

Also, it might be easier to get this series in, if you can Ack the main
Makefile patch (when we are ready i.e), so I can take it through
kselftest tree.

thanks,
-- Shuah


-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Konstantin Khlebnikov @ 2014-12-01 16:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Herrmann, Linux API, Linux Kernel Mailing List,
	Eric W. Biederman, Andrew Morton, systemd Mailing List,
	Greg Kroah-Hartman, Cyrill Gorcunov,
	Емельянов Павел
In-Reply-To: <CALCETrW0Kmi+bJSSegjD4pSZoYhDMyQUJALZY72r388giV+ruQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Dec 1, 2014 at 7:21 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Sun, Nov 30, 2014 at 11:03 PM, Konstantin Khlebnikov
> <koct9i-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Hmm. What about per-task/thread UUID? exported via separate file: /proc/PID/uuid
>> It could be created at the first access, thus this wouldn't shlowdown clone().
>> Also it could be droped at execve(), so it'll describe execution
>> context more precisely than pid.
>>
>
> I'd be all for this if it weren't for two issues:
>
> 1. This thing needs to work as soon as init is started, and we can't
> reliably generate random numbers that early.
>
> 2. Migration / CRIU is rather fundamentally at odds with making
> anything universally unique, as opposed to just per-user-ns.
>
> So, given that I don't think we can actually provide a universally
> unique pid-like thing, I'd rather not even try.

Well... another idea: what about pid generation counter? Of course it
should be per-pid-ns.
As I see struct upid has nice 32-bit hole next to 'nr' field. (on
64-bit systems)

>
> That being said, I want to rework this a little bit.  I think that
> highpid should be restricted to the range 2^32 through 2^64 - 4096.
> The former prevents anyone from confusing highpid with regular pid,
> and the latter means that we don't need to worry about confusion
> between errors and valid highpids (e.g. -1 will never be a highpid).
>
> Implementing that will be only mildly annoying.
>
> --Andy
>
>> On Sat, Nov 29, 2014 at 2:05 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> Pid reuse is common, which means that it's difficult or impossible
>>> to read information about a pid from /proc without races.
>>>
>>> This introduces a second number associated with each (task, pidns)
>>> pair called highpid.  Highpid is a 64-bit number, and, barring
>>> extremely unlikely circumstances or outright error, a (highpid, pid)
>>> will never be reused.
>>>
>>> With just this change, a program can open /proc/PID/status, read the
>>> "Highpid" field, and confirm that it has the expected value.  If the
>>> pid has been reused, then highpid will be different.
>>>
>>> The initial implementation is straightforward: highpid is simply a
>>> 64-bit counter. If a high-end system can fork every 3 ns (which
>>> would be amazing, given that just allocating a pid requires at
>>> atomic operation), it would take well over 1000 years for highpid to
>>> wrap.
>>>
>>> For CRIU's benefit, the next highpid can be set by a privileged
>>> user.
>>>
>>> NB: The sysctl stuff only works on 64-bit systems.  If the approach
>>> looks good, I'll fix that somehow.
>>>
>>> Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
>>> ---
>>>
>>> If this goes in, there's plenty of room to add new interfaces to
>>> make this more useful.  For example, we could add a fancier tgkill
>>> that adds and validates hightgid and highpid, and we might want to
>>> add a syscall to read one's own hightgid and highpid.  These would
>>> be quite useful for pidfiles.
>>>
>>> David, would this be useful for kdbus?
>>>
>>> CRIU people: will this be unduly difficult to support in CRIU?
>>>
>>> If you all think this is a good idea, I'll fix the sysctl stuff and
>>> document it.  It might also be worth adding "Hightgid" to status.
>>>
>>>  fs/proc/array.c               |  5 ++++-
>>>  include/linux/pid.h           |  2 ++
>>>  include/linux/pid_namespace.h |  1 +
>>>  kernel/pid.c                  | 19 +++++++++++++++----
>>>  kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
>>>  5 files changed, 44 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>>> index cd3653e4f35c..f1e0e69d19f9 100644
>>> --- a/fs/proc/array.c
>>> +++ b/fs/proc/array.c
>>> @@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>         int g;
>>>         struct fdtable *fdt = NULL;
>>>         const struct cred *cred;
>>> +       const struct upid *upid;
>>>         pid_t ppid, tpid;
>>>
>>>         rcu_read_lock();
>>> @@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>                 if (tracer)
>>>                         tpid = task_pid_nr_ns(tracer, ns);
>>>         }
>>> +       upid = pid_upid_ns(pid, ns);
>>>         cred = get_task_cred(p);
>>>         seq_printf(m,
>>>                 "State:\t%s\n"
>>>                 "Tgid:\t%d\n"
>>>                 "Ngid:\t%d\n"
>>>                 "Pid:\t%d\n"
>>> +               "Highpid:\t%llu\n"
>>>                 "PPid:\t%d\n"
>>>                 "TracerPid:\t%d\n"
>>>                 "Uid:\t%d\t%d\t%d\t%d\n"
>>> @@ -183,7 +186,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>>                 get_task_state(p),
>>>                 task_tgid_nr_ns(p, ns),
>>>                 task_numa_group_id(p),
>>> -               pid_nr_ns(pid, ns),
>>> +               upid ? upid->nr : 0, upid ? upid->highnr : 0,
>>>                 ppid, tpid,
>>>                 from_kuid_munged(user_ns, cred->uid),
>>>                 from_kuid_munged(user_ns, cred->euid),
>>> diff --git a/include/linux/pid.h b/include/linux/pid.h
>>> index 23705a53abba..ece70b64d04c 100644
>>> --- a/include/linux/pid.h
>>> +++ b/include/linux/pid.h
>>> @@ -51,6 +51,7 @@ struct upid {
>>>         /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
>>>         int nr;
>>>         struct pid_namespace *ns;
>>> +       u64 highnr;
>>>         struct hlist_node pid_chain;
>>>  };
>>>
>>> @@ -170,6 +171,7 @@ static inline pid_t pid_nr(struct pid *pid)
>>>  }
>>>
>>>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
>>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns);
>>>  pid_t pid_vnr(struct pid *pid);
>>>
>>>  #define do_each_pid_task(pid, type, task)                              \
>>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>>> index 1997ffc295a7..1f9f0455d7ef 100644
>>> --- a/include/linux/pid_namespace.h
>>> +++ b/include/linux/pid_namespace.h
>>> @@ -26,6 +26,7 @@ struct pid_namespace {
>>>         struct rcu_head rcu;
>>>         int last_pid;
>>>         unsigned int nr_hashed;
>>> +       atomic64_t next_highpid;
>>>         struct task_struct *child_reaper;
>>>         struct kmem_cache *pid_cachep;
>>>         unsigned int level;
>>> diff --git a/kernel/pid.c b/kernel/pid.c
>>> index 9b9a26698144..863d11a9bfbf 100644
>>> --- a/kernel/pid.c
>>> +++ b/kernel/pid.c
>>> @@ -312,6 +312,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>>
>>>                 pid->numbers[i].nr = nr;
>>>                 pid->numbers[i].ns = tmp;
>>> +               pid->numbers[i].highnr =
>>> +                       atomic64_inc_return(&tmp->next_highpid) - 1;
>>>                 tmp = tmp->parent;
>>>         }
>>>
>>> @@ -492,17 +494,26 @@ struct pid *find_get_pid(pid_t nr)
>>>  }
>>>  EXPORT_SYMBOL_GPL(find_get_pid);
>>>
>>> -pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns)
>>>  {
>>>         struct upid *upid;
>>> -       pid_t nr = 0;
>>>
>>>         if (pid && ns->level <= pid->level) {
>>>                 upid = &pid->numbers[ns->level];
>>>                 if (upid->ns == ns)
>>> -                       nr = upid->nr;
>>> +                       return upid;
>>>         }
>>> -       return nr;
>>> +       return NULL;
>>> +}
>>> +EXPORT_SYMBOL_GPL(pid_upid_ns);
>>> +
>>> +pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>>> +{
>>> +       const struct upid *upid = pid_upid_ns(pid, ns);
>>> +
>>> +       if (!upid)
>>> +               return 0;
>>> +       return upid->nr;
>>>  }
>>>  EXPORT_SYMBOL_GPL(pid_nr_ns);
>>>
>>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>>> index db95d8eb761b..e246802b4361 100644
>>> --- a/kernel/pid_namespace.c
>>> +++ b/kernel/pid_namespace.c
>>> @@ -268,6 +268,22 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>>>         return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>>>  }
>>>
>>> +static int pid_ns_next_highpid_handler(struct ctl_table *table, int write,
>>> +               void __user *buffer, size_t *lenp, loff_t *ppos)
>>> +{
>>> +       struct pid_namespace *pid_ns = task_active_pid_ns(current);
>>> +       struct ctl_table tmp = *table;
>>> +
>>> +       if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
>>> +               return -EPERM;
>>> +
>>> +       /* This needs to be fixed. */
>>> +       BUILD_BUG_ON(sizeof(u64) != sizeof(unsigned long));
>>> +
>>> +       tmp.data = &pid_ns->next_highpid;
>>> +       return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>>> +}
>>> +
>>>  extern int pid_max;
>>>  static int zero = 0;
>>>  static struct ctl_table pid_ns_ctl_table[] = {
>>> @@ -279,6 +295,12 @@ static struct ctl_table pid_ns_ctl_table[] = {
>>>                 .extra1 = &zero,
>>>                 .extra2 = &pid_max,
>>>         },
>>> +       {
>>> +               .procname = "ns_next_highpid",
>>> +               .maxlen = sizeof(u64),
>>> +               .mode = 0666, /* permissions are checked in the handler */
>>> +               .proc_handler = pid_ns_next_highpid_handler,
>>> +       },
>>>         { }
>>>  };
>>>  static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>>> --
>>> 1.9.3
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH v2 02/19] kbuild: kselftest_install - add a new make target to install selftests
From: Shuah Khan @ 2014-12-01 16:27 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: gregkh, akpm, mmarek, davem, keescook, tranmanphong, dh.herrmann,
	hughd, bobby.prani, ebiederm, serge.hallyn, linux-kbuild,
	linux-kernel, linux-api, netdev, Shuah Khan
In-Reply-To: <5476B763.1030701@hitachi.com>

On 11/26/2014 10:32 PM, Masami Hiramatsu wrote:
> (2014/11/12 5:27), Shuah Khan wrote:
>> Add a new make target to install to install kernel selftests.
>> This new target will build and install selftests. kselftest
>> target now depends on kselftest_install and runs the generated
>> kselftest script to reduce duplicate work and for common look
>> and feel when running tests.
>>
>> Approach:
>>
>> make kselftest_target:
> 
> kselftest_install?

Thanks for catching this. Will fix it.

> 
>> -- exports kselftest INSTALL_KSFT_PATH
>>    default $(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
>> -- exports path for ksefltest.sh
>> -- runs selftests make install target:
> 
> This direction is OK to me.
> 
> BTW, I've found another path to make selftest in Makefile,
> Actually you can do
> 
> make -C tools/ selftest
> 
> And there are selftest_install and selftest_clean targets (but
> currently it has a bug and doesn't work, anyway)

Thanks for pointing this out. I didn't know this existed. kind of
hidden.

> 
> I think we'd better do subdir make instead of adding these targets.
> This means that "make kselftest*" should be an alias of "make -C tools/ selftest*"
> 

Yes that would be a good way to go and helps leverage these options
at tools/Makefile level.

> Also, I'd like to request passing some options like as O=$(objtree)
> so that we can make test kmodules in selftests.

Sounds good. I will look into it.

thanks,
-- Shuah

-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-12-01 16:21 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: systemd Mailing List, Linux API, Linux Kernel Mailing List,
	Eric W. Biederman, Andrew Morton
In-Reply-To: <CALYGNiOV9Ms=JW5sOhdmh2BmR1v+4obE8DQ0v+3rgQYkh=h2GQ@mail.gmail.com>

On Sun, Nov 30, 2014 at 11:03 PM, Konstantin Khlebnikov
<koct9i@gmail.com> wrote:
> Hmm. What about per-task/thread UUID? exported via separate file: /proc/PID/uuid
> It could be created at the first access, thus this wouldn't shlowdown clone().
> Also it could be droped at execve(), so it'll describe execution
> context more precisely than pid.
>

I'd be all for this if it weren't for two issues:

1. This thing needs to work as soon as init is started, and we can't
reliably generate random numbers that early.

2. Migration / CRIU is rather fundamentally at odds with making
anything universally unique, as opposed to just per-user-ns.

So, given that I don't think we can actually provide a universally
unique pid-like thing, I'd rather not even try.

That being said, I want to rework this a little bit.  I think that
highpid should be restricted to the range 2^32 through 2^64 - 4096.
The former prevents anyone from confusing highpid with regular pid,
and the latter means that we don't need to worry about confusion
between errors and valid highpids (e.g. -1 will never be a highpid).

Implementing that will be only mildly annoying.

--Andy

> On Sat, Nov 29, 2014 at 2:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Pid reuse is common, which means that it's difficult or impossible
>> to read information about a pid from /proc without races.
>>
>> This introduces a second number associated with each (task, pidns)
>> pair called highpid.  Highpid is a 64-bit number, and, barring
>> extremely unlikely circumstances or outright error, a (highpid, pid)
>> will never be reused.
>>
>> With just this change, a program can open /proc/PID/status, read the
>> "Highpid" field, and confirm that it has the expected value.  If the
>> pid has been reused, then highpid will be different.
>>
>> The initial implementation is straightforward: highpid is simply a
>> 64-bit counter. If a high-end system can fork every 3 ns (which
>> would be amazing, given that just allocating a pid requires at
>> atomic operation), it would take well over 1000 years for highpid to
>> wrap.
>>
>> For CRIU's benefit, the next highpid can be set by a privileged
>> user.
>>
>> NB: The sysctl stuff only works on 64-bit systems.  If the approach
>> looks good, I'll fix that somehow.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>
>> If this goes in, there's plenty of room to add new interfaces to
>> make this more useful.  For example, we could add a fancier tgkill
>> that adds and validates hightgid and highpid, and we might want to
>> add a syscall to read one's own hightgid and highpid.  These would
>> be quite useful for pidfiles.
>>
>> David, would this be useful for kdbus?
>>
>> CRIU people: will this be unduly difficult to support in CRIU?
>>
>> If you all think this is a good idea, I'll fix the sysctl stuff and
>> document it.  It might also be worth adding "Hightgid" to status.
>>
>>  fs/proc/array.c               |  5 ++++-
>>  include/linux/pid.h           |  2 ++
>>  include/linux/pid_namespace.h |  1 +
>>  kernel/pid.c                  | 19 +++++++++++++++----
>>  kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
>>  5 files changed, 44 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index cd3653e4f35c..f1e0e69d19f9 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>         int g;
>>         struct fdtable *fdt = NULL;
>>         const struct cred *cred;
>> +       const struct upid *upid;
>>         pid_t ppid, tpid;
>>
>>         rcu_read_lock();
>> @@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>                 if (tracer)
>>                         tpid = task_pid_nr_ns(tracer, ns);
>>         }
>> +       upid = pid_upid_ns(pid, ns);
>>         cred = get_task_cred(p);
>>         seq_printf(m,
>>                 "State:\t%s\n"
>>                 "Tgid:\t%d\n"
>>                 "Ngid:\t%d\n"
>>                 "Pid:\t%d\n"
>> +               "Highpid:\t%llu\n"
>>                 "PPid:\t%d\n"
>>                 "TracerPid:\t%d\n"
>>                 "Uid:\t%d\t%d\t%d\t%d\n"
>> @@ -183,7 +186,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>>                 get_task_state(p),
>>                 task_tgid_nr_ns(p, ns),
>>                 task_numa_group_id(p),
>> -               pid_nr_ns(pid, ns),
>> +               upid ? upid->nr : 0, upid ? upid->highnr : 0,
>>                 ppid, tpid,
>>                 from_kuid_munged(user_ns, cred->uid),
>>                 from_kuid_munged(user_ns, cred->euid),
>> diff --git a/include/linux/pid.h b/include/linux/pid.h
>> index 23705a53abba..ece70b64d04c 100644
>> --- a/include/linux/pid.h
>> +++ b/include/linux/pid.h
>> @@ -51,6 +51,7 @@ struct upid {
>>         /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
>>         int nr;
>>         struct pid_namespace *ns;
>> +       u64 highnr;
>>         struct hlist_node pid_chain;
>>  };
>>
>> @@ -170,6 +171,7 @@ static inline pid_t pid_nr(struct pid *pid)
>>  }
>>
>>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns);
>>  pid_t pid_vnr(struct pid *pid);
>>
>>  #define do_each_pid_task(pid, type, task)                              \
>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> index 1997ffc295a7..1f9f0455d7ef 100644
>> --- a/include/linux/pid_namespace.h
>> +++ b/include/linux/pid_namespace.h
>> @@ -26,6 +26,7 @@ struct pid_namespace {
>>         struct rcu_head rcu;
>>         int last_pid;
>>         unsigned int nr_hashed;
>> +       atomic64_t next_highpid;
>>         struct task_struct *child_reaper;
>>         struct kmem_cache *pid_cachep;
>>         unsigned int level;
>> diff --git a/kernel/pid.c b/kernel/pid.c
>> index 9b9a26698144..863d11a9bfbf 100644
>> --- a/kernel/pid.c
>> +++ b/kernel/pid.c
>> @@ -312,6 +312,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>
>>                 pid->numbers[i].nr = nr;
>>                 pid->numbers[i].ns = tmp;
>> +               pid->numbers[i].highnr =
>> +                       atomic64_inc_return(&tmp->next_highpid) - 1;
>>                 tmp = tmp->parent;
>>         }
>>
>> @@ -492,17 +494,26 @@ struct pid *find_get_pid(pid_t nr)
>>  }
>>  EXPORT_SYMBOL_GPL(find_get_pid);
>>
>> -pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns)
>>  {
>>         struct upid *upid;
>> -       pid_t nr = 0;
>>
>>         if (pid && ns->level <= pid->level) {
>>                 upid = &pid->numbers[ns->level];
>>                 if (upid->ns == ns)
>> -                       nr = upid->nr;
>> +                       return upid;
>>         }
>> -       return nr;
>> +       return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(pid_upid_ns);
>> +
>> +pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>> +{
>> +       const struct upid *upid = pid_upid_ns(pid, ns);
>> +
>> +       if (!upid)
>> +               return 0;
>> +       return upid->nr;
>>  }
>>  EXPORT_SYMBOL_GPL(pid_nr_ns);
>>
>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index db95d8eb761b..e246802b4361 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -268,6 +268,22 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>>         return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>>  }
>>
>> +static int pid_ns_next_highpid_handler(struct ctl_table *table, int write,
>> +               void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +       struct pid_namespace *pid_ns = task_active_pid_ns(current);
>> +       struct ctl_table tmp = *table;
>> +
>> +       if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
>> +               return -EPERM;
>> +
>> +       /* This needs to be fixed. */
>> +       BUILD_BUG_ON(sizeof(u64) != sizeof(unsigned long));
>> +
>> +       tmp.data = &pid_ns->next_highpid;
>> +       return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>> +}
>> +
>>  extern int pid_max;
>>  static int zero = 0;
>>  static struct ctl_table pid_ns_ctl_table[] = {
>> @@ -279,6 +295,12 @@ static struct ctl_table pid_ns_ctl_table[] = {
>>                 .extra1 = &zero,
>>                 .extra2 = &pid_max,
>>         },
>> +       {
>> +               .procname = "ns_next_highpid",
>> +               .maxlen = sizeof(u64),
>> +               .mode = 0666, /* permissions are checked in the handler */
>> +               .proc_handler = pid_ns_next_highpid_handler,
>> +       },
>>         { }
>>  };
>>  static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>> --
>> 1.9.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Andy Lutomirski
AMA Capital Management, LLC
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* Re: [PATCH v2 03/19] selftests: add install target to enable installing selftests
From: Shuah Khan @ 2014-12-01 16:16 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: gregkh, akpm, mmarek, davem, keescook, tranmanphong, dh.herrmann,
	hughd, bobby.prani, ebiederm, serge.hallyn, linux-kbuild,
	linux-kernel, linux-api, netdev, yrl.pp-manager.tt@hitachi.com,
	Shuah Khan
In-Reply-To: <5476BA67.6020305@hitachi.com>

On 11/26/2014 10:45 PM, Masami Hiramatsu wrote:
> (2014/11/12 5:27), Shuah Khan wrote:
>> Add a new make target to enable installing selftests. This
>> new target will call install targets for the tests that are
>> specified in INSTALL_TARGETS. During install, a script is
>> generated to run tests that are installed. This script will
>> be installed in the selftest install directory. Individual
>> test Makefiles are changed to add to the script. This will
>> allow new tests to add install and run test commands to the
>> generated kselftest script. run_tests target runs the
>> generated kselftest script to run tests when it is initiated
>> from from "make kselftest" from top level source directory.
>>
>> Approach:
>>
>> make kselftest_target:
>> -- exports kselftest INSTALL_KSFT_PATH
>>    default $(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
>> -- exports path for ksefltest.sh
>> -- runs selftests make install target:
>>
>> selftests make install target
>> -- creates kselftest.sh script in install install dir
>> -- runs install targets for all INSTALL_TARGETS
>>    (Note: ftrace and powerpc aren't included in INSTALL_TARGETS,
>>           to not add more content to patch v1 series. This work
>>           will happen soon. In this series these two targets are
>>           run after running the generated kselftest script, without
>>           any regression in the way these tests are run with
>>           "make kselftest" prior to this work.)
>> -- install target can be run only from top level source dir.
>>
>> Individual test make install targets:
>> -- install test programs and/or scripts in install dir
>> -- append to the ksefltest.sh file to add commands to run test
>> -- install target can be run only from top level source dir.
>>
>> Adds the following new ways to initiate selftests:
>> -- Installing and running kselftest from install directory
>>    by running  "make kselftest"
>> -- Running kselftest script from install directory
>>
>> Maintains the following ways to run tests:
>> -- make -C tools/testing/selftests run_tests
>> -- make -C tools/testing/selftests TARGETS=target run_tests
>>    Ability specify targets: e.g TARGETS=net
>> -- make run_tests from tools/testing/selftests
>> -- make run_tests from individual test directories:
>>    e.g: make run_tests in tools/testing/selftests/breakpoints
>>
>> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
>> ---
>>  tools/testing/selftests/Makefile | 31 ++++++++++++++++++++++++++++++-
>>  1 file changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
>> index 45f145c..b9bdc1d 100644
>> --- a/tools/testing/selftests/Makefile
>> +++ b/tools/testing/selftests/Makefile
>> @@ -16,6 +16,10 @@ TARGETS += sysctl
>>  TARGETS += firmware
>>  TARGETS += ftrace
>>  
>> +INSTALL_TARGETS = breakpoints cpu-hotplug efivarfs firmware ipc
>> +INSTALL_TARGETS += kcmp memfd memory-hotplug mqueue mount net
>> +INSTALL_TARGETS += ptrace sysctl timers user vm
>> +
>>  TARGETS_HOTPLUG = cpu-hotplug
>>  TARGETS_HOTPLUG += memory-hotplug
>>  
> 
> I think KSELFTEST itself should be defined here, since that is not
> a parameter.

I can do that.

> 
>> @@ -24,10 +28,35 @@ all:
>>  		make -C $$TARGET; \
>>  	done;
>>  
>> -run_tests: all
>> +install:
>> +ifdef INSTALL_KSFT_PATH
>> +	make all
>> +	@echo #!/bin/sh\n# Kselftest Run Tests .... >> $(KSELFTEST)
>> +	@echo # This file is generated during kselftest_install >> $(KSELFTEST)
>> +	@echo # Please don't change it !!\n  >> $(KSELFTEST)
>> +	@echo echo ============================== >> $(KSELFTEST)
>> +	for TARGET in $(INSTALL_TARGETS); do \
>> +		echo Installing $$TARGET; \
>> +		make -C $$TARGET install; \
> 
> Please pass O= option and others here.

I will change that.

> 
>> +	done;
>> +	chmod +x $(KSELFTEST)
>> +else
>> +	@echo Run make kselftest_install in top level source directory
>> +endif
>> +
>> +run_tests:
>> +ifdef INSTALL_KSFT_PATH
>> +	@cd $(INSTALL_KSFT_PATH); ./kselftest.sh; cd -
> 
> We'd better use some macro instead of ./kselftest.sh?
> 

I can play with this and see if there is a better way.

thanks,
-- Shuah

-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* Re: [PATCH v2 11/19] selftests/memory-hotplug: add install target to enable installing test
From: Shuah Khan @ 2014-12-01 16:12 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: gregkh, akpm, mmarek, davem, keescook, tranmanphong, dh.herrmann,
	hughd, bobby.prani, ebiederm, serge.hallyn, linux-kbuild,
	linux-kernel, linux-api, netdev, yrl.pp-manager.tt@hitachi.com,
	Shuah Khan
In-Reply-To: <5476BB87.6010808@hitachi.com>

On 11/26/2014 10:49 PM, Masami Hiramatsu wrote:
> (2014/11/12 5:27), Shuah Khan wrote:
>> Add a new make target to enable installing test. This target
>> installs test in the kselftest install location and add to the
>> kselftest script to run the test. Install target can be run
>> only from top level source dir.
>>
>> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
>> ---
>>  tools/testing/selftests/memory-hotplug/Makefile    |  17 +-
>>  .../selftests/memory-hotplug/mem-on-off-test.sh    | 238 +++++++++++++++++++++
>>  .../selftests/memory-hotplug/on-off-test.sh        | 238 ---------------------
>>  3 files changed, 253 insertions(+), 240 deletions(-)
>>  create mode 100644 tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
>>  delete mode 100644 tools/testing/selftests/memory-hotplug/on-off-test.sh
>>
>> diff --git a/tools/testing/selftests/memory-hotplug/Makefile b/tools/testing/selftests/memory-hotplug/Makefile
>> index d46b8d4..8921631 100644
>> --- a/tools/testing/selftests/memory-hotplug/Makefile
>> +++ b/tools/testing/selftests/memory-hotplug/Makefile
>> @@ -1,9 +1,22 @@
>> +TEST_STR=/bin/bash ./mem-on-off-test.sh -r 2 || echo memory-hotplug selftests: [FAIL]
>> +
>>  all:
>>  
>> +install:
>> +ifdef INSTALL_KSFT_PATH
>> +	install ./mem-on-off-test.sh $(INSTALL_KSFT_PATH)/mem-on-off-test.sh
>> +	@echo echo Start memory hotplug test .... >> $(KSELFTEST)
>> +	@echo "$(TEST_STR)" >> $(KSELFTEST) >> $(KSELFTEST)
>> +	@echo echo End memory hotplug test .... >> $(KSELFTEST)
>> +	@echo echo ============================== >> $(KSELFTEST)
>> +else
>> +	@echo Run make kselftest_install in top level source directory
>> +endif
> 
> I saw this pattern repeated many times in this series.
> Can we make it a macro and include it instead of repeating this code?
> 

I might be able to move this to the selftest Makfile and use target as the
string. I will play with it and see if I can reduce the duplication further.

thanks,
-- Shuah


-- 
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* [PATCH v8 49/50] virtio: make VIRTIO_F_VERSION_1 a transport bit
From: Michael S. Tsirkin @ 2014-12-01 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: thuth, rusty, linux-api, virtualization, dahi, pbonzini,
	David Miller
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

Activate VIRTIO_F_VERSION_1 automatically unless legacy_only
is set.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_config.h | 4 ++--
 drivers/virtio/virtio.c            | 6 +++++-
 drivers/virtio/virtio_ring.c       | 2 ++
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 4d05671..a6d0cde 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -43,11 +43,11 @@
 /* We've given up on this device. */
 #define VIRTIO_CONFIG_S_FAILED		0x80
 
-/* Some virtio feature bits (currently bits 28 through 31) are reserved for the
+/* Some virtio feature bits (currently bits 28 through 32) are reserved for the
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		32
+#define VIRTIO_TRANSPORT_F_END		33
 
 /* Do we get callbacks when the ring is completely used, even if we've
  * suppressed them? */
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f9ad99c..fa6b75d 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -197,7 +197,11 @@ static int virtio_dev_probe(struct device *_d)
 		driver_features_legacy = driver_features;
 	}
 
-	if (driver_features & device_features & (1ULL << VIRTIO_F_VERSION_1))
+	/* Detect legacy-only drivers and disable VIRTIO_F_VERSION_1. */
+	if (drv->legacy_only)
+		device_features &= ~(1ULL << VIRTIO_F_VERSION_1);
+
+	if (device_features & (1ULL << VIRTIO_F_VERSION_1))
 		dev->features = driver_features & device_features;
 	else
 		dev->features = driver_features_legacy & device_features;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 55532a4..00ec6b3 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -780,6 +780,8 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_RING_F_EVENT_IDX:
 			break;
+		case VIRTIO_F_VERSION_1:
+			break;
 		default:
 			/* We don't understand this bit. */
 			__virtio_clear_bit(vdev, i);
-- 
MST

^ permalink raw reply related

* [PATCH v8 47/50] virtio_console: virtio 1.0 support
From: Michael S. Tsirkin @ 2014-12-01 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: thuth, Arnd Bergmann, rusty, Amit Shah, virtualization, dahi,
	Greg Kroah-Hartman, pbonzini, linux-api, David Miller
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

Pretty straight-forward, just use accessors for all fields.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_console.h |  7 ++++---
 drivers/char/virtio_console.c       | 29 +++++++++++++++++------------
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/virtio_console.h b/include/uapi/linux/virtio_console.h
index ba260dd..b7fb108 100644
--- a/include/uapi/linux/virtio_console.h
+++ b/include/uapi/linux/virtio_console.h
@@ -32,6 +32,7 @@
 #ifndef _UAPI_LINUX_VIRTIO_CONSOLE_H
 #define _UAPI_LINUX_VIRTIO_CONSOLE_H
 #include <linux/types.h>
+#include <linux/virtio_types.h>
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
 
@@ -58,9 +59,9 @@ struct virtio_console_config {
  * particular port.
  */
 struct virtio_console_control {
-	__u32 id;		/* Port number */
-	__u16 event;		/* The kind of control event (see below) */
-	__u16 value;		/* Extra information for the key */
+	__virtio32 id;		/* Port number */
+	__virtio16 event;	/* The kind of control event (see below) */
+	__virtio16 value;	/* Extra information for the key */
 };
 
 /* Some events for control messages */
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 8d00aa7..775c898 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -566,9 +566,9 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
 	if (!use_multiport(portdev))
 		return 0;
 
-	cpkt.id = port_id;
-	cpkt.event = event;
-	cpkt.value = value;
+	cpkt.id = cpu_to_virtio32(portdev->vdev, port_id);
+	cpkt.event = cpu_to_virtio16(portdev->vdev, event);
+	cpkt.value = cpu_to_virtio16(portdev->vdev, value);
 
 	vq = portdev->c_ovq;
 
@@ -1602,7 +1602,8 @@ static void unplug_port(struct port *port)
 }
 
 /* Any private messages that the Host and Guest want to share */
-static void handle_control_message(struct ports_device *portdev,
+static void handle_control_message(struct virtio_device *vdev,
+				   struct ports_device *portdev,
 				   struct port_buffer *buf)
 {
 	struct virtio_console_control *cpkt;
@@ -1612,15 +1613,16 @@ static void handle_control_message(struct ports_device *portdev,
 
 	cpkt = (struct virtio_console_control *)(buf->buf + buf->offset);
 
-	port = find_port_by_id(portdev, cpkt->id);
-	if (!port && cpkt->event != VIRTIO_CONSOLE_PORT_ADD) {
+	port = find_port_by_id(portdev, virtio32_to_cpu(vdev, cpkt->id));
+	if (!port &&
+	    cpkt->event != cpu_to_virtio16(vdev, VIRTIO_CONSOLE_PORT_ADD)) {
 		/* No valid header at start of buffer.  Drop it. */
 		dev_dbg(&portdev->vdev->dev,
 			"Invalid index %u in control packet\n", cpkt->id);
 		return;
 	}
 
-	switch (cpkt->event) {
+	switch (virtio16_to_cpu(vdev, cpkt->event)) {
 	case VIRTIO_CONSOLE_PORT_ADD:
 		if (port) {
 			dev_dbg(&portdev->vdev->dev,
@@ -1628,13 +1630,15 @@ static void handle_control_message(struct ports_device *portdev,
 			send_control_msg(port, VIRTIO_CONSOLE_PORT_READY, 1);
 			break;
 		}
-		if (cpkt->id >= portdev->config.max_nr_ports) {
+		if (virtio32_to_cpu(vdev, cpkt->id) >=
+		    portdev->config.max_nr_ports) {
 			dev_warn(&portdev->vdev->dev,
-				"Request for adding port with out-of-bound id %u, max. supported id: %u\n",
+				"Request for adding port with "
+				"out-of-bound id %u, max. supported id: %u\n",
 				cpkt->id, portdev->config.max_nr_ports - 1);
 			break;
 		}
-		add_port(portdev, cpkt->id);
+		add_port(portdev, virtio32_to_cpu(vdev, cpkt->id));
 		break;
 	case VIRTIO_CONSOLE_PORT_REMOVE:
 		unplug_port(port);
@@ -1670,7 +1674,7 @@ static void handle_control_message(struct ports_device *portdev,
 		break;
 	}
 	case VIRTIO_CONSOLE_PORT_OPEN:
-		port->host_connected = cpkt->value;
+		port->host_connected = virtio16_to_cpu(vdev, cpkt->value);
 		wake_up_interruptible(&port->waitqueue);
 		/*
 		 * If the host port got closed and the host had any
@@ -1752,7 +1756,7 @@ static void control_work_handler(struct work_struct *work)
 		buf->len = len;
 		buf->offset = 0;
 
-		handle_control_message(portdev, buf);
+		handle_control_message(vq->vdev, portdev, buf);
 
 		spin_lock(&portdev->c_ivq_lock);
 		if (add_inbuf(portdev->c_ivq, buf) < 0) {
@@ -2126,6 +2130,7 @@ static struct virtio_device_id id_table[] = {
 static unsigned int features[] = {
 	VIRTIO_CONSOLE_F_SIZE,
 	VIRTIO_CONSOLE_F_MULTIPORT,
+	VIRTIO_F_VERSION_1,
 };
 
 static struct virtio_device_id rproc_serial_id_table[] = {
-- 
MST

^ permalink raw reply related

* [PATCH v8 44/50] virtio_scsi: export to userspace
From: Michael S. Tsirkin @ 2014-12-01 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Bjarke Istrup Pedersen, thuth, Anup Patel, rusty,
	Greg Kroah-Hartman, David Drysdale, dahi, Laurent Pinchart,
	Sakari Ailus, linux-api, pbonzini, Andrew Morton, Andy Grover,
	virtualization, David Miller, Alexei Starovoitov
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

Replace uXX by __uXX and _packed by __attribute((packed))
as seems to be the norm for userspace headers.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/uapi/linux/virtio_scsi.h | 74 ++++++++++++++++++++--------------------
 include/uapi/linux/Kbuild        |  1 +
 2 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/include/uapi/linux/virtio_scsi.h b/include/uapi/linux/virtio_scsi.h
index af44864..42b9370 100644
--- a/include/uapi/linux/virtio_scsi.h
+++ b/include/uapi/linux/virtio_scsi.h
@@ -34,78 +34,78 @@
 
 /* SCSI command request, followed by data-out */
 struct virtio_scsi_cmd_req {
-	u8 lun[8];		/* Logical Unit Number */
+	__u8 lun[8];		/* Logical Unit Number */
 	__virtio64 tag;		/* Command identifier */
-	u8 task_attr;		/* Task attribute */
-	u8 prio;		/* SAM command priority field */
-	u8 crn;
-	u8 cdb[VIRTIO_SCSI_CDB_SIZE];
-} __packed;
+	__u8 task_attr;		/* Task attribute */
+	__u8 prio;		/* SAM command priority field */
+	__u8 crn;
+	__u8 cdb[VIRTIO_SCSI_CDB_SIZE];
+} __attribute__((packed));
 
 /* SCSI command request, followed by protection information */
 struct virtio_scsi_cmd_req_pi {
-	u8 lun[8];		/* Logical Unit Number */
+	__u8 lun[8];		/* Logical Unit Number */
 	__virtio64 tag;		/* Command identifier */
-	u8 task_attr;		/* Task attribute */
-	u8 prio;		/* SAM command priority field */
-	u8 crn;
+	__u8 task_attr;		/* Task attribute */
+	__u8 prio;		/* SAM command priority field */
+	__u8 crn;
 	__virtio32 pi_bytesout;	/* DataOUT PI Number of bytes */
 	__virtio32 pi_bytesin;		/* DataIN PI Number of bytes */
-	u8 cdb[VIRTIO_SCSI_CDB_SIZE];
-} __packed;
+	__u8 cdb[VIRTIO_SCSI_CDB_SIZE];
+} __attribute__((packed));
 
 /* Response, followed by sense data and data-in */
 struct virtio_scsi_cmd_resp {
 	__virtio32 sense_len;		/* Sense data length */
 	__virtio32 resid;		/* Residual bytes in data buffer */
 	__virtio16 status_qualifier;	/* Status qualifier */
-	u8 status;		/* Command completion status */
-	u8 response;		/* Response values */
-	u8 sense[VIRTIO_SCSI_SENSE_SIZE];
-} __packed;
+	__u8 status;		/* Command completion status */
+	__u8 response;		/* Response values */
+	__u8 sense[VIRTIO_SCSI_SENSE_SIZE];
+} __attribute__((packed));
 
 /* Task Management Request */
 struct virtio_scsi_ctrl_tmf_req {
 	__virtio32 type;
 	__virtio32 subtype;
-	u8 lun[8];
+	__u8 lun[8];
 	__virtio64 tag;
-} __packed;
+} __attribute__((packed));
 
 struct virtio_scsi_ctrl_tmf_resp {
-	u8 response;
-} __packed;
+	__u8 response;
+} __attribute__((packed));
 
 /* Asynchronous notification query/subscription */
 struct virtio_scsi_ctrl_an_req {
 	__virtio32 type;
-	u8 lun[8];
+	__u8 lun[8];
 	__virtio32 event_requested;
-} __packed;
+} __attribute__((packed));
 
 struct virtio_scsi_ctrl_an_resp {
 	__virtio32 event_actual;
-	u8 response;
-} __packed;
+	__u8 response;
+} __attribute__((packed));
 
 struct virtio_scsi_event {
 	__virtio32 event;
-	u8 lun[8];
+	__u8 lun[8];
 	__virtio32 reason;
-} __packed;
+} __attribute__((packed));
 
 struct virtio_scsi_config {
-	u32 num_queues;
-	u32 seg_max;
-	u32 max_sectors;
-	u32 cmd_per_lun;
-	u32 event_info_size;
-	u32 sense_size;
-	u32 cdb_size;
-	u16 max_channel;
-	u16 max_target;
-	u32 max_lun;
-} __packed;
+	__u32 num_queues;
+	__u32 seg_max;
+	__u32 max_sectors;
+	__u32 cmd_per_lun;
+	__u32 event_info_size;
+	__u32 sense_size;
+	__u32 cdb_size;
+	__u16 max_channel;
+	__u16 max_target;
+	__u32 max_lun;
+} __attribute__((packed));
 
 /* Feature Bits */
 #define VIRTIO_SCSI_F_INOUT                    0
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 44a5581..e61947c 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -428,6 +428,7 @@ header-y += virtio_net.h
 header-y += virtio_pci.h
 header-y += virtio_ring.h
 header-y += virtio_rng.h
+header-y += virtio_scsi.h
 header=y += vm_sockets.h
 header-y += vt.h
 header-y += wait.h
-- 
MST

^ permalink raw reply related

* [PATCH v8 39/50] tun: add VNET_LE flag
From: Michael S. Tsirkin @ 2014-12-01 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Miller, cornelia.huck, rusty, nab, pbonzini, thuth, dahi,
	netdev, linux-api
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

virtio 1.0 modified virtio net header format,
making all fields little endian.

Users can tweak header format before submitting it to tun,
but this means more data copies where none were necessary.
And if the iovec is in RO memory, this means we might
need to split iovec also means we might in theory overflow
iovec max size.

This patch adds a simpler way for applications to handle this,
using new "little endian" flag in tun.
As a result, tun simply byte-swaps header fields as appropriate.
This is a NOP on LE architectures.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/if_tun.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 277a260..18b2403 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -57,6 +57,7 @@
 #define IFF_ONE_QUEUE	0x2000
 #define IFF_VNET_HDR	0x4000
 #define IFF_TUN_EXCL	0x8000
+#define IFF_VNET_LE	0x10000
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_ATTACH_QUEUE 0x0200
 #define IFF_DETACH_QUEUE 0x0400
-- 
MST

^ permalink raw reply related

* [PATCH v8 37/50] tun: move internal flag defines out of uapi
From: Michael S. Tsirkin @ 2014-12-01 16:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Miller, cornelia.huck, rusty, nab, pbonzini, thuth, dahi,
	Jason Wang, Zhi Yong Wu, Tom Herbert, Herbert Xu, Masatake YAMATO,
	Xi Wang, netdev, linux-api
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

TUN_ flags are internal and never exposed
to userspace. Any application using it is almost
certainly buggy.

Move them out to tun.c.

Note: we remove these completely in follow-up patches,
this code movement is split out for ease of review.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/if_tun.h | 16 ++--------
 drivers/net/tun.c           | 72 +++++++++++++--------------------------------
 2 files changed, 24 insertions(+), 64 deletions(-)

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index e9502dd..277a260 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -22,21 +22,11 @@
 
 /* Read queue size */
 #define TUN_READQ_SIZE	500
-
-/* TUN device flags */
-#define TUN_TUN_DEV 	0x0001	
-#define TUN_TAP_DEV	0x0002
+/* TUN device type flags: deprecated. Use IFF_TUN/IFF_TAP instead. */
+#define TUN_TUN_DEV 	IFF_TUN
+#define TUN_TAP_DEV	IFF_TAP
 #define TUN_TYPE_MASK   0x000f
 
-#define TUN_FASYNC	0x0010
-#define TUN_NOCHECKSUM	0x0020
-#define TUN_NO_PI	0x0040
-/* This flag has no real effect */
-#define TUN_ONE_QUEUE	0x0080
-#define TUN_PERSIST 	0x0100	
-#define TUN_VNET_HDR 	0x0200
-#define TUN_TAP_MQ      0x0400
-
 /* Ioctl defines */
 #define TUNSETNOCSUM  _IOW('T', 200, int) 
 #define TUNSETDEBUG   _IOW('T', 201, int) 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9dd3746..b4e4ca0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -103,6 +103,21 @@ do {								\
 } while (0)
 #endif
 
+/* TUN device flags */
+
+/* IFF_ATTACH_QUEUE is never stored in device flags,
+ * overload it to mean fasync when stored there.
+ */
+#define TUN_FASYNC	IFF_ATTACH_QUEUE
+#define TUN_NO_PI	IFF_NO_PI
+/* This flag has no real effect */
+#define TUN_ONE_QUEUE	IFF_ONE_QUEUE
+#define TUN_PERSIST 	IFF_PERSIST
+#define TUN_VNET_HDR 	IFF_VNET_HDR
+#define TUN_TAP_MQ      IFF_MULTI_QUEUE
+
+#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
+		      IFF_MULTI_QUEUE)
 #define GOODCOPY_LEN 128
 
 #define FLT_EXACT_COUNT 8
@@ -1521,32 +1536,7 @@ static struct proto tun_proto = {
 
 static int tun_flags(struct tun_struct *tun)
 {
-	int flags = 0;
-
-	if (tun->flags & TUN_TUN_DEV)
-		flags |= IFF_TUN;
-	else
-		flags |= IFF_TAP;
-
-	if (tun->flags & TUN_NO_PI)
-		flags |= IFF_NO_PI;
-
-	/* This flag has no real effect.  We track the value for backwards
-	 * compatibility.
-	 */
-	if (tun->flags & TUN_ONE_QUEUE)
-		flags |= IFF_ONE_QUEUE;
-
-	if (tun->flags & TUN_VNET_HDR)
-		flags |= IFF_VNET_HDR;
-
-	if (tun->flags & TUN_TAP_MQ)
-		flags |= IFF_MULTI_QUEUE;
-
-	if (tun->flags & TUN_PERSIST)
-		flags |= IFF_PERSIST;
-
-	return flags;
+	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP);
 }
 
 static ssize_t tun_show_flags(struct device *dev, struct device_attribute *attr,
@@ -1706,28 +1696,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
 
-	if (ifr->ifr_flags & IFF_NO_PI)
-		tun->flags |= TUN_NO_PI;
-	else
-		tun->flags &= ~TUN_NO_PI;
-
-	/* This flag has no real effect.  We track the value for backwards
-	 * compatibility.
-	 */
-	if (ifr->ifr_flags & IFF_ONE_QUEUE)
-		tun->flags |= TUN_ONE_QUEUE;
-	else
-		tun->flags &= ~TUN_ONE_QUEUE;
-
-	if (ifr->ifr_flags & IFF_VNET_HDR)
-		tun->flags |= TUN_VNET_HDR;
-	else
-		tun->flags &= ~TUN_VNET_HDR;
-
-	if (ifr->ifr_flags & IFF_MULTI_QUEUE)
-		tun->flags |= TUN_TAP_MQ;
-	else
-		tun->flags &= ~TUN_TAP_MQ;
+	tun->flags = (tun->flags & ~TUN_FEATURES) |
+		(ifr->ifr_flags & TUN_FEATURES);
 
 	/* Make sure persistent devices do not get stuck in
 	 * xoff state.
@@ -1890,9 +1860,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	if (cmd == TUNGETFEATURES) {
 		/* Currently this just means: "what IFF flags are valid?".
 		 * This is needed because we never checked for invalid flags on
-		 * TUNSETIFF. */
-		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE,
+		 * TUNSETIFF.
+		 */
+		return put_user(IFF_TUN | IFF_TAP | TUN_FEATURES,
 				(unsigned int __user*)argp);
 	} else if (cmd == TUNSETQUEUE)
 		return tun_set_queue(file, &ifr);
-- 
MST

^ permalink raw reply related

* [PATCH v8 16/50] virtio_blk: v1.0 support
From: Michael S. Tsirkin @ 2014-12-01 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Miller, cornelia.huck, rusty, nab, pbonzini, thuth, dahi,
	Rusty Russell, virtualization, linux-api
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

Based on patch by Cornelia Huck.

Note: for consistency, and to avoid sparse errors,
      convert all fields, even those no longer in use
      for virtio v1.0.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_blk.h | 15 ++++-----
 drivers/block/virtio_blk.c      | 70 ++++++++++++++++++++++++-----------------
 2 files changed, 49 insertions(+), 36 deletions(-)

diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ad67b2..247c8ba 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -28,6 +28,7 @@
 #include <linux/types.h>
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
 
 /* Feature bits */
 #define VIRTIO_BLK_F_BARRIER	0	/* Does host support barriers? */
@@ -114,18 +115,18 @@ struct virtio_blk_config {
 /* This is the first element of the read scatter-gather list. */
 struct virtio_blk_outhdr {
 	/* VIRTIO_BLK_T* */
-	__u32 type;
+	__virtio32 type;
 	/* io priority. */
-	__u32 ioprio;
+	__virtio32 ioprio;
 	/* Sector (ie. 512 byte offset) */
-	__u64 sector;
+	__virtio64 sector;
 };
 
 struct virtio_scsi_inhdr {
-	__u32 errors;
-	__u32 data_len;
-	__u32 sense_len;
-	__u32 residual;
+	__virtio32 errors;
+	__virtio32 data_len;
+	__virtio32 sense_len;
+	__virtio32 residual;
 };
 
 /* And this is the final byte of the write scatter-gather list. */
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c6a27d5..f601f16 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -80,7 +80,7 @@ static int __virtblk_add_req(struct virtqueue *vq,
 {
 	struct scatterlist hdr, status, cmd, sense, inhdr, *sgs[6];
 	unsigned int num_out = 0, num_in = 0;
-	int type = vbr->out_hdr.type & ~VIRTIO_BLK_T_OUT;
+	__virtio32 type = vbr->out_hdr.type & ~cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT);
 
 	sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
 	sgs[num_out++] = &hdr;
@@ -91,19 +91,19 @@ static int __virtblk_add_req(struct virtqueue *vq,
 	 * block, and before the normal inhdr we put the sense data and the
 	 * inhdr with additional status information.
 	 */
-	if (type == VIRTIO_BLK_T_SCSI_CMD) {
+	if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
 		sg_init_one(&cmd, vbr->req->cmd, vbr->req->cmd_len);
 		sgs[num_out++] = &cmd;
 	}
 
 	if (have_data) {
-		if (vbr->out_hdr.type & VIRTIO_BLK_T_OUT)
+		if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
 			sgs[num_out++] = data_sg;
 		else
 			sgs[num_out + num_in++] = data_sg;
 	}
 
-	if (type == VIRTIO_BLK_T_SCSI_CMD) {
+	if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
 		sg_init_one(&sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
 		sgs[num_out + num_in++] = &sense;
 		sg_init_one(&inhdr, &vbr->in_hdr, sizeof(vbr->in_hdr));
@@ -119,12 +119,13 @@ static int __virtblk_add_req(struct virtqueue *vq,
 static inline void virtblk_request_done(struct request *req)
 {
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+	struct virtio_blk *vblk = req->q->queuedata;
 	int error = virtblk_result(vbr);
 
 	if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
-		req->resid_len = vbr->in_hdr.residual;
-		req->sense_len = vbr->in_hdr.sense_len;
-		req->errors = vbr->in_hdr.errors;
+		req->resid_len = virtio32_to_cpu(vblk->vdev, vbr->in_hdr.residual);
+		req->sense_len = virtio32_to_cpu(vblk->vdev, vbr->in_hdr.sense_len);
+		req->errors = virtio32_to_cpu(vblk->vdev, vbr->in_hdr.errors);
 	} else if (req->cmd_type == REQ_TYPE_SPECIAL) {
 		req->errors = (error != 0);
 	}
@@ -173,25 +174,25 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
 
 	vbr->req = req;
 	if (req->cmd_flags & REQ_FLUSH) {
-		vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
+		vbr->out_hdr.type = cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_FLUSH);
 		vbr->out_hdr.sector = 0;
-		vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+		vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
 	} else {
 		switch (req->cmd_type) {
 		case REQ_TYPE_FS:
 			vbr->out_hdr.type = 0;
-			vbr->out_hdr.sector = blk_rq_pos(vbr->req);
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+			vbr->out_hdr.sector = cpu_to_virtio64(vblk->vdev, blk_rq_pos(vbr->req));
+			vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
 			break;
 		case REQ_TYPE_BLOCK_PC:
-			vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
+			vbr->out_hdr.type = cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_SCSI_CMD);
 			vbr->out_hdr.sector = 0;
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+			vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
 			break;
 		case REQ_TYPE_SPECIAL:
-			vbr->out_hdr.type = VIRTIO_BLK_T_GET_ID;
+			vbr->out_hdr.type = cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_GET_ID);
 			vbr->out_hdr.sector = 0;
-			vbr->out_hdr.ioprio = req_get_ioprio(vbr->req);
+			vbr->out_hdr.ioprio = cpu_to_virtio32(vblk->vdev, req_get_ioprio(vbr->req));
 			break;
 		default:
 			/* We don't put anything else in the queue. */
@@ -204,9 +205,9 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
 	num = blk_rq_map_sg(hctx->queue, vbr->req, vbr->sg);
 	if (num) {
 		if (rq_data_dir(vbr->req) == WRITE)
-			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+			vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_OUT);
 		else
-			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+			vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_IN);
 	}
 
 	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
@@ -476,7 +477,8 @@ static int virtblk_get_cache_mode(struct virtio_device *vdev)
 				   struct virtio_blk_config, wce,
 				   &writeback);
 	if (err)
-		writeback = virtio_has_feature(vdev, VIRTIO_BLK_F_WCE);
+		writeback = virtio_has_feature(vdev, VIRTIO_BLK_F_WCE) ||
+		            virtio_has_feature(vdev, VIRTIO_F_VERSION_1);
 
 	return writeback;
 }
@@ -821,25 +823,35 @@ static const struct virtio_device_id id_table[] = {
 	{ 0 },
 };
 
-static unsigned int features[] = {
+static unsigned int features_legacy[] = {
 	VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
 	VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
 	VIRTIO_BLK_F_WCE, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
 	VIRTIO_BLK_F_MQ,
+}
+;
+static unsigned int features[] = {
+	VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+	VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+	VIRTIO_BLK_F_TOPOLOGY,
+	VIRTIO_BLK_F_MQ,
+	VIRTIO_F_VERSION_1,
 };
 
 static struct virtio_driver virtio_blk = {
-	.feature_table		= features,
-	.feature_table_size	= ARRAY_SIZE(features),
-	.driver.name		= KBUILD_MODNAME,
-	.driver.owner		= THIS_MODULE,
-	.id_table		= id_table,
-	.probe			= virtblk_probe,
-	.remove			= virtblk_remove,
-	.config_changed		= virtblk_config_changed,
+	.feature_table			= features,
+	.feature_table_size		= ARRAY_SIZE(features),
+	.feature_table_legacy		= features_legacy,
+	.feature_table_size_legacy	= ARRAY_SIZE(features_legacy),
+	.driver.name			= KBUILD_MODNAME,
+	.driver.owner			= THIS_MODULE,
+	.id_table			= id_table,
+	.probe				= virtblk_probe,
+	.remove				= virtblk_remove,
+	.config_changed			= virtblk_config_changed,
 #ifdef CONFIG_PM_SLEEP
-	.freeze			= virtblk_freeze,
-	.restore		= virtblk_restore,
+	.freeze				= virtblk_freeze,
+	.restore			= virtblk_restore,
 #endif
 };
 
-- 
MST

^ permalink raw reply related

* [PATCH v8 15/50] virtio_net: v1.0 endianness
From: Michael S. Tsirkin @ 2014-12-01 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: thuth, rusty, netdev, virtualization, dahi, linux-api, pbonzini,
	David Miller
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

Based on patches by Rusty Russell, Cornelia Huck.
Note: more code changes are needed for 1.0 support
(due to different header size).
So we don't advertize support for 1.0 yet.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_net.h | 15 ++++++++-------
 drivers/net/virtio_net.c        | 33 ++++++++++++++++++++-------------
 2 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 172a7f0..b5f1677 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -28,6 +28,7 @@
 #include <linux/types.h>
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
 #include <linux/if_ether.h>
 
 /* The feature bitmap for virtio net */
@@ -84,17 +85,17 @@ struct virtio_net_hdr {
 #define VIRTIO_NET_HDR_GSO_TCPV6	4	// GSO frame, IPv6 TCP
 #define VIRTIO_NET_HDR_GSO_ECN		0x80	// TCP has ECN set
 	__u8 gso_type;
-	__u16 hdr_len;		/* Ethernet + IP + tcp/udp hdrs */
-	__u16 gso_size;		/* Bytes to append to hdr_len per frame */
-	__u16 csum_start;	/* Position to start checksumming from */
-	__u16 csum_offset;	/* Offset after that to place checksum */
+	__virtio16 hdr_len;		/* Ethernet + IP + tcp/udp hdrs */
+	__virtio16 gso_size;		/* Bytes to append to hdr_len per frame */
+	__virtio16 csum_start;	/* Position to start checksumming from */
+	__virtio16 csum_offset;	/* Offset after that to place checksum */
 };
 
 /* This is the version of the header to use when the MRG_RXBUF
  * feature has been negotiated. */
 struct virtio_net_hdr_mrg_rxbuf {
 	struct virtio_net_hdr hdr;
-	__u16 num_buffers;	/* Number of merged rx buffers */
+	__virtio16 num_buffers;	/* Number of merged rx buffers */
 };
 
 /*
@@ -149,7 +150,7 @@ typedef __u8 virtio_net_ctrl_ack;
  * VIRTIO_NET_F_CTRL_MAC_ADDR feature is available.
  */
 struct virtio_net_ctrl_mac {
-	__u32 entries;
+	__virtio32 entries;
 	__u8 macs[][ETH_ALEN];
 } __attribute__((packed));
 
@@ -193,7 +194,7 @@ struct virtio_net_ctrl_mac {
  * specified.
  */
 struct virtio_net_ctrl_mq {
-	__u16 virtqueue_pairs;
+	__virtio16 virtqueue_pairs;
 };
 
 #define VIRTIO_NET_CTRL_MQ   4
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b0bc8ea..c07e030 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -347,13 +347,14 @@ err:
 }
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
+					 struct virtnet_info *vi,
 					 struct receive_queue *rq,
 					 unsigned long ctx,
 					 unsigned int len)
 {
 	void *buf = mergeable_ctx_to_buf_address(ctx);
 	struct skb_vnet_hdr *hdr = buf;
-	int num_buf = hdr->mhdr.num_buffers;
+	u16 num_buf = virtio16_to_cpu(rq->vq->vdev, hdr->mhdr.num_buffers);
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
 	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
@@ -369,7 +370,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
 		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
-				 dev->name, num_buf, hdr->mhdr.num_buffers);
+				 dev->name, num_buf,
+				 virtio16_to_cpu(rq->vq->vdev,
+						 hdr->mhdr.num_buffers));
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
@@ -454,7 +457,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	}
 
 	if (vi->mergeable_rx_bufs)
-		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
+		skb = receive_mergeable(dev, vi, rq, (unsigned long)buf, len);
 	else if (vi->big_packets)
 		skb = receive_big(dev, rq, buf, len);
 	else
@@ -473,8 +476,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 		pr_debug("Needs csum!\n");
 		if (!skb_partial_csum_set(skb,
-					  hdr->hdr.csum_start,
-					  hdr->hdr.csum_offset))
+			  virtio16_to_cpu(vi->vdev, hdr->hdr.csum_start),
+			  virtio16_to_cpu(vi->vdev, hdr->hdr.csum_offset)))
 			goto frame_err;
 	} else if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) {
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -514,7 +517,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		if (hdr->hdr.gso_type & VIRTIO_NET_HDR_GSO_ECN)
 			skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
 
-		skb_shinfo(skb)->gso_size = hdr->hdr.gso_size;
+		skb_shinfo(skb)->gso_size = virtio16_to_cpu(vi->vdev,
+							    hdr->hdr.gso_size);
 		if (skb_shinfo(skb)->gso_size == 0) {
 			net_warn_ratelimited("%s: zero gso size.\n", dev->name);
 			goto frame_err;
@@ -876,16 +880,19 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		hdr->hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-		hdr->hdr.csum_start = skb_checksum_start_offset(skb);
-		hdr->hdr.csum_offset = skb->csum_offset;
+		hdr->hdr.csum_start = cpu_to_virtio16(vi->vdev,
+						skb_checksum_start_offset(skb));
+		hdr->hdr.csum_offset = cpu_to_virtio16(vi->vdev,
+							 skb->csum_offset);
 	} else {
 		hdr->hdr.flags = 0;
 		hdr->hdr.csum_offset = hdr->hdr.csum_start = 0;
 	}
 
 	if (skb_is_gso(skb)) {
-		hdr->hdr.hdr_len = skb_headlen(skb);
-		hdr->hdr.gso_size = skb_shinfo(skb)->gso_size;
+		hdr->hdr.hdr_len = cpu_to_virtio16(vi->vdev, skb_headlen(skb));
+		hdr->hdr.gso_size = cpu_to_virtio16(vi->vdev,
+						    skb_shinfo(skb)->gso_size);
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
 			hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 		else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
@@ -1112,7 +1119,7 @@ static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
 	if (!vi->has_cvq || !virtio_has_feature(vi->vdev, VIRTIO_NET_F_MQ))
 		return 0;
 
-	s.virtqueue_pairs = queue_pairs;
+	s.virtqueue_pairs = cpu_to_virtio16(vi->vdev, queue_pairs);
 	sg_init_one(&sg, &s, sizeof(s));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
@@ -1189,7 +1196,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	sg_init_table(sg, 2);
 
 	/* Store the unicast list and count in the front of the buffer */
-	mac_data->entries = uc_count;
+	mac_data->entries = cpu_to_virtio32(vi->vdev, uc_count);
 	i = 0;
 	netdev_for_each_uc_addr(ha, dev)
 		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
@@ -1200,7 +1207,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	/* multicast list and count fill the end */
 	mac_data = (void *)&mac_data->macs[uc_count][0];
 
-	mac_data->entries = mc_count;
+	mac_data->entries = cpu_to_virtio32(vi->vdev, mc_count);
 	i = 0;
 	netdev_for_each_mc_addr(ha, dev)
 		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
-- 
MST

^ permalink raw reply related

* [PATCH v8 12/50] virtio: set FEATURES_OK
From: Michael S. Tsirkin @ 2014-12-01 16:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: thuth, rusty, linux-api, virtualization, dahi, pbonzini,
	David Miller
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

set FEATURES_OK as per virtio 1.0 spec

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
 include/uapi/linux/virtio_config.h |  2 ++
 drivers/virtio/virtio.c            | 29 ++++++++++++++++++++++-------
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 80e7381..4d05671 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -38,6 +38,8 @@
 #define VIRTIO_CONFIG_S_DRIVER		2
 /* Driver has used its parts of the config, and is happy */
 #define VIRTIO_CONFIG_S_DRIVER_OK	4
+/* Driver has finished configuring features */
+#define VIRTIO_CONFIG_S_FEATURES_OK	8
 /* We've given up on this device. */
 #define VIRTIO_CONFIG_S_FAILED		0x80
 
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 746d350..9248125 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -160,6 +160,7 @@ static int virtio_dev_probe(struct device *_d)
 	struct virtio_device *dev = dev_to_virtio(_d);
 	struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
 	u64 device_features;
+	unsigned status;
 
 	/* We have a driver! */
 	add_status(dev, VIRTIO_CONFIG_S_DRIVER);
@@ -183,18 +184,32 @@ static int virtio_dev_probe(struct device *_d)
 
 	dev->config->finalize_features(dev);
 
+	if (virtio_has_feature(dev, VIRTIO_F_VERSION_1)) {
+		add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
+		status = dev->config->get_status(dev);
+		if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
+			dev_err(_d, "virtio: device refuses features: %x\n",
+			       status);
+			err = -ENODEV;
+			goto err;
+		}
+	}
+
 	err = drv->probe(dev);
 	if (err)
-		add_status(dev, VIRTIO_CONFIG_S_FAILED);
-	else {
-		add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
-		if (drv->scan)
-			drv->scan(dev);
+		goto err;
 
-		virtio_config_enable(dev);
-	}
+	add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
+	if (drv->scan)
+		drv->scan(dev);
+
+	virtio_config_enable(dev);
 
+	return 0;
+err:
+	add_status(dev, VIRTIO_CONFIG_S_FAILED);
 	return err;
+
 }
 
 static int virtio_dev_remove(struct device *_d)
-- 
MST

^ permalink raw reply related

* [PATCH v8 08/50] virtio: memory access APIs
From: Michael S. Tsirkin @ 2014-12-01 16:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Bjarke Istrup Pedersen, thuth, rusty, Greg Kroah-Hartman,
	David Drysdale, dahi, Lad, Prabhakar, Sakari Ailus, linux-api,
	pbonzini, virtualization, David Miller, Alexei Starovoitov
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

virtio 1.0 makes all memory structures LE, so
we need APIs to conditionally do a byteswap on BE
architectures.

To make it easier to check code statically,
add virtio specific types for multi-byte integers
in memory.

Add low level wrappers that do a byteswap conditionally, these will be
useful e.g. for vhost.  Add high level wrappers that
query device endian-ness and act accordingly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
 include/linux/virtio_byteorder.h  | 59 +++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_config.h     | 32 +++++++++++++++++++++
 include/uapi/linux/virtio_ring.h  | 45 ++++++++++++++---------------
 include/uapi/linux/virtio_types.h | 46 ++++++++++++++++++++++++++++++
 include/uapi/linux/Kbuild         |  1 +
 5 files changed, 161 insertions(+), 22 deletions(-)
 create mode 100644 include/linux/virtio_byteorder.h
 create mode 100644 include/uapi/linux/virtio_types.h

diff --git a/include/linux/virtio_byteorder.h b/include/linux/virtio_byteorder.h
new file mode 100644
index 0000000..51865d0
--- /dev/null
+++ b/include/linux/virtio_byteorder.h
@@ -0,0 +1,59 @@
+#ifndef _LINUX_VIRTIO_BYTEORDER_H
+#define _LINUX_VIRTIO_BYTEORDER_H
+#include <linux/types.h>
+#include <uapi/linux/virtio_types.h>
+
+/*
+ * Low-level memory accessors for handling virtio in modern little endian and in
+ * compatibility native endian format.
+ */
+
+static inline u16 __virtio16_to_cpu(bool little_endian, __virtio16 val)
+{
+	if (little_endian)
+		return le16_to_cpu((__force __le16)val);
+	else
+		return (__force u16)val;
+}
+
+static inline __virtio16 __cpu_to_virtio16(bool little_endian, u16 val)
+{
+	if (little_endian)
+		return (__force __virtio16)cpu_to_le16(val);
+	else
+		return (__force __virtio16)val;
+}
+
+static inline u32 __virtio32_to_cpu(bool little_endian, __virtio32 val)
+{
+	if (little_endian)
+		return le32_to_cpu((__force __le32)val);
+	else
+		return (__force u32)val;
+}
+
+static inline __virtio32 __cpu_to_virtio32(bool little_endian, u32 val)
+{
+	if (little_endian)
+		return (__force __virtio32)cpu_to_le32(val);
+	else
+		return (__force __virtio32)val;
+}
+
+static inline u64 __virtio64_to_cpu(bool little_endian, __virtio64 val)
+{
+	if (little_endian)
+		return le64_to_cpu((__force __le64)val);
+	else
+		return (__force u64)val;
+}
+
+static inline __virtio64 __cpu_to_virtio64(bool little_endian, u64 val)
+{
+	if (little_endian)
+		return (__force __virtio64)cpu_to_le64(val);
+	else
+		return (__force __virtio64)val;
+}
+
+#endif /* _LINUX_VIRTIO_BYTEORDER */
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index f517884..02f0acb 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -4,6 +4,7 @@
 #include <linux/err.h>
 #include <linux/bug.h>
 #include <linux/virtio.h>
+#include <linux/virtio_byteorder.h>
 #include <uapi/linux/virtio_config.h>
 
 /**
@@ -199,6 +200,37 @@ int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
 	return 0;
 }
 
+/* Memory accessors */
+static inline u16 virtio16_to_cpu(struct virtio_device *vdev, __virtio16 val)
+{
+	return __virtio16_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio16 cpu_to_virtio16(struct virtio_device *vdev, u16 val)
+{
+	return __cpu_to_virtio16(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline u32 virtio32_to_cpu(struct virtio_device *vdev, __virtio32 val)
+{
+	return __virtio32_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio32 cpu_to_virtio32(struct virtio_device *vdev, u32 val)
+{
+	return __cpu_to_virtio32(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline u64 virtio64_to_cpu(struct virtio_device *vdev, __virtio64 val)
+{
+	return __virtio64_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
+static inline __virtio64 cpu_to_virtio64(struct virtio_device *vdev, u64 val)
+{
+	return __cpu_to_virtio64(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), val);
+}
+
 /* Config space accessors. */
 #define virtio_cread(vdev, structname, member, ptr)			\
 	do {								\
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index a99f9b7..61c818a 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -32,6 +32,7 @@
  *
  * Copyright Rusty Russell IBM Corporation 2007. */
 #include <linux/types.h>
+#include <linux/virtio_types.h>
 
 /* This marks a buffer as continuing via the next field. */
 #define VRING_DESC_F_NEXT	1
@@ -61,32 +62,32 @@
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
 	/* Address (guest-physical). */
-	__u64 addr;
+	__virtio64 addr;
 	/* Length. */
-	__u32 len;
+	__virtio32 len;
 	/* The flags as indicated above. */
-	__u16 flags;
+	__virtio16 flags;
 	/* We chain unused descriptors via this, too */
-	__u16 next;
+	__virtio16 next;
 };
 
 struct vring_avail {
-	__u16 flags;
-	__u16 idx;
-	__u16 ring[];
+	__virtio16 flags;
+	__virtio16 idx;
+	__virtio16 ring[];
 };
 
 /* u32 is used here for ids for padding reasons. */
 struct vring_used_elem {
 	/* Index of start of used descriptor chain. */
-	__u32 id;
+	__virtio32 id;
 	/* Total length of the descriptor chain which was used (written to) */
-	__u32 len;
+	__virtio32 len;
 };
 
 struct vring_used {
-	__u16 flags;
-	__u16 idx;
+	__virtio16 flags;
+	__virtio16 idx;
 	struct vring_used_elem ring[];
 };
 
@@ -109,25 +110,25 @@ struct vring {
  *	struct vring_desc desc[num];
  *
  *	// A ring of available descriptor heads with free-running index.
- *	__u16 avail_flags;
- *	__u16 avail_idx;
- *	__u16 available[num];
- *	__u16 used_event_idx;
+ *	__virtio16 avail_flags;
+ *	__virtio16 avail_idx;
+ *	__virtio16 available[num];
+ *	__virtio16 used_event_idx;
  *
  *	// Padding to the next align boundary.
  *	char pad[];
  *
  *	// A ring of used descriptor heads with free-running index.
- *	__u16 used_flags;
- *	__u16 used_idx;
+ *	__virtio16 used_flags;
+ *	__virtio16 used_idx;
  *	struct vring_used_elem used[num];
- *	__u16 avail_event_idx;
+ *	__virtio16 avail_event_idx;
  * };
  */
 /* We publish the used event index at the end of the available ring, and vice
  * versa. They are at the end for backwards compatibility. */
 #define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
-#define vring_avail_event(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
+#define vring_avail_event(vr) (*(__virtio16 *)&(vr)->used->ring[(vr)->num])
 
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 			      unsigned long align)
@@ -135,15 +136,15 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 	vr->num = num;
 	vr->desc = p;
 	vr->avail = p + num*sizeof(struct vring_desc);
-	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__u16)
+	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(__virtio16)
 		+ align-1) & ~(align - 1));
 }
 
 static inline unsigned vring_size(unsigned int num, unsigned long align)
 {
-	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)
+	return ((sizeof(struct vring_desc) * num + sizeof(__virtio16) * (3 + num)
 		 + align - 1) & ~(align - 1))
-		+ sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
+		+ sizeof(__virtio16) * 3 + sizeof(struct vring_used_elem) * num;
 }
 
 /* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
diff --git a/include/uapi/linux/virtio_types.h b/include/uapi/linux/virtio_types.h
new file mode 100644
index 0000000..e845e8c
--- /dev/null
+++ b/include/uapi/linux/virtio_types.h
@@ -0,0 +1,46 @@
+#ifndef _UAPI_LINUX_VIRTIO_TYPES_H
+#define _UAPI_LINUX_VIRTIO_TYPES_H
+/* Type definitions for virtio implementations.
+ *
+ * This header is BSD licensed so anyone can use the definitions to implement
+ * compatible drivers/servers.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) 2014 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+#include <linux/types.h>
+
+/*
+ * __virtio{16,32,64} have the following meaning:
+ * - __u{16,32,64} for virtio devices in legacy mode, accessed in native endian
+ * - __le{16,32,64} for standard-compliant virtio devices
+ */
+
+typedef __u16 __bitwise__ __virtio16;
+typedef __u32 __bitwise__ __virtio32;
+typedef __u64 __bitwise__ __virtio64;
+
+#endif /* _UAPI_LINUX_VIRTIO_TYPES_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 4c94f31..44a5581 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -423,6 +423,7 @@ header-y += virtio_blk.h
 header-y += virtio_config.h
 header-y += virtio_console.h
 header-y += virtio_ids.h
+header-y += virtio_types.h
 header-y += virtio_net.h
 header-y += virtio_pci.h
 header-y += virtio_ring.h
-- 
MST

^ permalink raw reply related

* [PATCH v8 07/50] virtio: add virtio 1.0 feature bit
From: Michael S. Tsirkin @ 2014-12-01 16:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: thuth, rusty, linux-api, virtualization, dahi, pbonzini,
	David Miller
In-Reply-To: <1417449619-24896-1-git-send-email-mst@redhat.com>

Based on original patches by Rusty Russell, Thomas Huth
and Cornelia Huck.

Note: at this time, we do not negotiate this feature bit
in core, drivers have to declare VERSION_1 support explicitly.

For this reason we treat this bit as a device bit
and not as a transport bit for now.

After all drivers are converted, we will be able to
move VERSION_1 to core and drop it from all drivers.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
 include/uapi/linux/virtio_config.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 3ce768c..80e7381 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -54,4 +54,7 @@
 /* Can the device handle any descriptor layout? */
 #define VIRTIO_F_ANY_LAYOUT		27
 
+/* v1.0 compliant. */
+#define VIRTIO_F_VERSION_1		32
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
-- 
MST

^ permalink raw reply related

* Re: [PATCH v2 02/19] kbuild: kselftest_install - add a new make target to install selftests
From: Michal Marek @ 2014-12-01 15:47 UTC (permalink / raw)
  To: Shuah Khan, gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, keescook-F7+t8E8rja9g9hUCZPvPmw,
	tranmanphong-Re5JQEeQqe8AvxtiuMwx3w,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, hughd-hpIqsD4AKlfQT0dZR+AlfA,
	bobby.prani-Re5JQEeQqe8AvxtiuMwx3w,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA
  Cc: linux-kbuild-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <a2344d4df903d673afe1631118f40917f773cc9a.1415735831.git.shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On 2014-11-11 21:27, Shuah Khan wrote:
> Add a new make target to install to install kernel selftests.
> This new target will build and install selftests. kselftest
> target now depends on kselftest_install and runs the generated
> kselftest script to reduce duplicate work and for common look
> and feel when running tests.
> 
> Approach:
> 
> make kselftest_target:
> -- exports kselftest INSTALL_KSFT_PATH
>    default $(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
> -- exports path for ksefltest.sh
> -- runs selftests make install target:
> 
> selftests make install target
> -- creates kselftest.sh script in install install dir
> -- runs install targets for all INSTALL_TARGETS
>    (Note: ftrace and powerpc aren't included in INSTALL_TARGETS,
>           to not add more content to patch v1 series. This work
>           will happen soon. In this series these two targets are
>           run after running the generated kselftest script, without
>           any regression in the way these tests are run with
>           "make kselftest" prior to this work.)
> -- install target can be run only from top level source dir.
> 
> Individual test make install targets:
> -- install test programs and/or scripts in install dir
> -- append to the ksefltest.sh file to add commands to run test
> -- install target can be run only from top level source dir.
> 
> Adds the following new ways to initiate selftests:
> -- Installing and running kselftest from install directory
>    by running  "make kselftest"
> -- Running kselftest script from install directory
> 
> Maintains the following ways to run tests:
> -- make -C tools/testing/selftests run_tests
> -- make -C tools/testing/selftests TARGETS=target run_tests
>    Ability specify targets: e.g TARGETS=net
> -- make run_tests from tools/testing/selftests
> -- make run_tests from individual test directories:
>    e.g: make run_tests in tools/testing/selftests/breakpoints
> 
> Signed-off-by: Shuah Khan <shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> ---
>  Makefile | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 05d67af..ccbd2e1 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1071,12 +1071,26 @@ headers_check: headers_install
>  	$(Q)$(MAKE) $(hdr-inst)=arch/$(hdr-arch)/include/uapi/asm $(hdr-dst) HDRCHECK=1
>  
>  # ---------------------------------------------------------------------------
> -# Kernel selftest
> +# Kernel selftest targets
> +
> +PHONY += __kselftest_configure
> +INSTALL_KSFT_PATH=$(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
> +export INSTALL_KSFT_PATH
> +KSELFTEST=$(INSTALL_KSFT_PATH)/kselftest.sh
> +export KSELFTEST

Can this be moved to tools/testing/selftests/Makefile? It's only used in
this part of the tree.


>  PHONY += kselftest
> -kselftest:
> +kselftest: kselftest_install
>  	$(Q)$(MAKE) -C tools/testing/selftests run_tests
>  
> +# Kernel selftest install
> +
> +PHONY += kselftest_install
> +kselftest_install: __kselftest_configure
> +	@rm -rf $(INSTALL_KSFT_PATH)
> +	@mkdir -p $(INSTALL_KSFT_PATH)

Please use $(Q) insteaf od hardcoding the @.


> +	$(Q)$(MAKE) -C tools/testing/selftests install

The install target is only added by the next patch, which in turn
depends on changes done by later patches. The order (in the current
numbering) should be 01/19, 04/19, ... 19/19, 03/19 and 02/19.

Michal

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Andy Lutomirski @ 2014-12-01 14:46 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
	linux-kernel@vger.kernel.org, David Herrmann, Djalal Harouni
In-Reply-To: <547C723C.6020305@zonque.org>

On Mon, Dec 1, 2014 at 5:50 AM, Daniel Mack <daniel@zonque.org> wrote:
> Hi Andy,
>
> Sorry for the late response.
>
> On 11/21/2014 08:50 PM, Andy Lutomirski wrote:
>>> +static int kdbus_meta_append_cred(struct kdbus_meta *meta,
>>> +                              const struct kdbus_domain *domain)
>>> +{
>>> +    struct kdbus_creds creds = {
>>> +            .uid = from_kuid_munged(domain->user_namespace, current_uid()),
>>> +            .gid = from_kgid_munged(domain->user_namespace, current_gid()),
>>> +            .pid = task_pid_nr_ns(current, domain->pid_namespace),
>>> +            .tid = task_tgid_nr_ns(current, domain->pid_namespace),
>>
>> This is better than before -- at least it gets translation right part of
>> the way.  But it's still wrong if the receiver's namespace doesn't match
>> the domain.
>
> Alright. The code now translates the items into each receiver's
> namespaces individually, so we can also take possible chroot()
> environments into account to do path translations.

Thanks!

I suspect you'll get a lot of "(unreachable)/foo/bar".

>
>> Also, please move pid and tgid into their own item.  They suck for
>> reasons that have been beaten to death.  Let's make it possible to
>> deprecate them separately in the future.
>
> Ok, done now. As mentioned in the highpid thread, we can easily add
> another item once we have a better source of information.

I'll see if I can get highpid in soon enough that starttime can be
dropped entirely.  Once it shows up, there will be an API that can
probably never be removed that isn't namespaced correctly and can't be
checkpointed.

>
>>> +static int kdbus_meta_append_caps(struct kdbus_meta *meta)
>>> +{
>>> +    struct caps {
>>> +            u32 last_cap;
>>> +            struct {
>>> +                    u32 caps[_KERNEL_CAPABILITY_U32S];
>>> +            } set[4];
>>> +    } caps;
>>> +    unsigned int i;
>>> +    const struct cred *cred = current_cred();
>>> +
>>> +    caps.last_cap = CAP_LAST_CAP;
>>> +
>>> +    for (i = 0; i < _KERNEL_CAPABILITY_U32S; i++) {
>>> +            caps.set[0].caps[i] = cred->cap_inheritable.cap[i];
>>> +            caps.set[1].caps[i] = cred->cap_permitted.cap[i];
>>> +            caps.set[2].caps[i] = cred->cap_effective.cap[i];
>>> +            caps.set[3].caps[i] = cred->cap_bset.cap[i];
>>> +    }
>>
>> Please leave this in so that I can root every single kdbus-using system.
>>  It'll be lots of fun.
>>
>> Snark aside, the correct fix is IMO to delete this function entirely.
>> Even if you could find a way to implement it safely (which will be
>> distinctly nontrivial), it seems like a bad idea to begin with.
>
> Yes, we currently have no way of translating caps from one user
> namespace to another, which means we cannot allow such an item to cross
> user namespaces. We can add this functionality later once we have the
> neccessary APIs.

I still don't expect this metadata item to ever make sense to use.
The best argument I've heard is that it would enable CAP_SYS_BOOT to
ask systemd for a reboot.  I think this is bogus:

Realistically, this probably just means that users of /sbin/reboot
might continue to work as expected.  Except:

1. CAP_SYS_BOOT was never sufficient for /sbin/reboot to reboot
cleanly -- clean reboots historically went through /dev/initctl.

2. This only makes sense anyway if /sbin/reboot gets updated to speak
dbus.  So you have a new reboot binary that you want to work as long
as it ends up with CAP_SYS_BOOT.  But caps are so broken that this
won't work intelligently no matter what kdbus does.  A user can have
CAP_SYS_BOOT or not, but whether or not /sbin/reboot inherits that bit
is almost entirely a function of that user's euid and has very little
to do with whether that user has CAP_SYS_BOOT.  So you might as well
just check euid == 0 in the first place.

And setting fscap bits on /sbin/reboot so that a new reboot binary can
prove to systemd that it has CAP_SYS_BOOT makes no sense.  Either put
the policy in systemd in the first place, or make /sbin/reboot setuid
root.

And passing the permitted, inheritable, and bounding sets makes even
less sense.  Yes, you could pass the entirety of /proc/PID/stat and
/proc/PID/status, but that's silly, and passing caps around via kdbus
for diagnostics seems ridiculous.

If you put it in now, it's here to stay even if it proves to be a bad
idea, whereas if you don't put it in at first, you can always add it
if a good reason comes up.

>
>>> +#ifdef CONFIG_CGROUPS
>>> +static int kdbus_meta_append_cgroup(struct kdbus_meta *meta)
>>> +{
>>> +    char *buf, *path;
>>> +    int ret;
>>> +
>>> +    buf = (char *)__get_free_page(GFP_TEMPORARY | __GFP_ZERO);
>>> +    if (!buf)
>>> +            return -ENOMEM;
>>> +
>>> +    path = task_cgroup_path(current, buf, PAGE_SIZE);
>>
>> This may have strange interactions with cgroupns.  It's fixable, though,
>> but only once you implement translation at receive time, and I think
>> you'll have to do that to get any of this to work right.
>
> Yes, agreed. We'll add translation to this item once cgroup namespaces
> are in place. Until then, we'll expose the same information as other
> parts of the Linux API already do.

I don't expect this to become a problem.

>
>>> +#ifdef CONFIG_AUDITSYSCALL
>>> +static int kdbus_meta_append_audit(struct kdbus_meta *meta,
>>> +                               const struct kdbus_domain *domain)
>>> +{
>>> +    struct kdbus_audit audit;
>>> +
>>> +    audit.loginuid = from_kuid_munged(domain->user_namespace,
>>> +                                      audit_get_loginuid(current));
>>> +    audit.sessionid = audit_get_sessionid(current);
>>> +
>>> +    return kdbus_meta_append_data(meta, KDBUS_ITEM_AUDIT,
>>> +                                  &audit, sizeof(audit));
>>
>> So *that's* what audit means.  Please document this and consider
>> renaming it to something like AUDIT_LOGINUID_AND_SESSIONID.
>
> The documentation was indeed bogus and is fixed now, along with man
> other details. Thanks for spotting this!

:)


>
>>> +#ifdef CONFIG_SECURITY
>>> +static int kdbus_meta_append_seclabel(struct kdbus_meta *meta)
>>> +{
>>> +    u32 len, sid;
>>> +    char *label;
>>> +    int ret;
>>> +
>>> +    security_task_getsecid(current, &sid);
>>> +    ret = security_secid_to_secctx(sid, &label, &len);
>>> +    if (ret == -EOPNOTSUPP)
>>> +            return 0;
>>> +    if (ret < 0)
>>> +            return ret;
>>> +
>>> +    if (label && len > 0)
>>> +            ret = kdbus_meta_append_data(meta, KDBUS_ITEM_SECLABEL,
>>> +                                         label, len);
>>
>> This thing needs a clear, valid use case.  I think that the use case
>> should document how non-enforcing mode is supposed to work, too.
>
> kdbus just passes the seclabel along, if people want to use this for
> security checks, then they need to call into libselinux, and should do
> that by taking the enforcing mode into consideration. This is already
> done in a lot of software that way.
>
>> Also, there should be a justification for why the LSM hooks by
>> themselves aren't good enough to remove the need for this.
>
> SELinux and other MACs might want to do additional per-service security
> checks. For example, a service manager might want to check the security
> label of a service file against the security label of the client process
> using the SELinux database before allow access. For this, we need to be
> able to pass the client's security label race-free to libselinux so that
> it can make its decision.

Yeah, but what happens on systems that use a different LSM?  And isn't
the point of the new LSM kdbus hooks to do this automatically without
userspace's help?

Admittedly, I don't see much of a problem with this feature existing,
as long as the non-selinux people are okay with it.

>
> I've added the above to the documentation now.
>
>
>
> Thanks for your review again - much appreciated!

And thanks for addressing most of the issues.  The code is starting to
look much better to me.

--Andy

^ permalink raw reply

* Re: Why not make kdbus use CUSE?
From: One Thousand Gnomes @ 2014-12-01 14:23 UTC (permalink / raw)
  To: Richard Yao
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141129063416.GE32286-WkJ4KB730YtOk+SH+krGketnjj8NnB7F@public.gmane.org>

> told quite plainly that such distributions are not worth consideration. If kdbus
> is merged despite concerns about security and backward compatibility, could we
> at least have the shim moved to libc netural place, like Linus' tree?

I would expect any other libc would fork the shim anyway (or just not
bother with systemd in most cases).

Alan

^ permalink raw reply

* Re: [PATCH v7 16/46] virtio_blk: v1.0 support
From: Cornelia Huck @ 2014-12-01 14:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	David Miller, rusty-8fk3Idey6ehBDgjK7y7TUQ,
	nab-IzHhD5pYlfBP7FQvKIMDCQ, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	thuth-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Rusty Russell,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141201134719.GA18305-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Mon, 1 Dec 2014 15:47:19 +0200
"Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Mon, Dec 01, 2014 at 02:00:04PM +0100, Cornelia Huck wrote:
> > On Mon, 1 Dec 2014 14:51:26 +0200
> > "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > On Mon, Dec 01, 2014 at 01:40:36PM +0100, Cornelia Huck wrote:
> > > > On Mon, 1 Dec 2014 14:34:55 +0200
> > > > "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > 
> > > > > On Mon, Dec 01, 2014 at 01:02:58PM +0100, Cornelia Huck wrote:
> > > > > > On Mon, 1 Dec 2014 13:46:45 +0200
> > > > > > "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > > > 
> > > > > > > On Mon, Dec 01, 2014 at 12:33:15PM +0100, Cornelia Huck wrote:
> > > > > > > > On Mon, 1 Dec 2014 11:26:58 +0200
> > > > > > > > "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > > > > > 
> > > > > > > > > For some places on data path, it might be worth it
> > > > > > > > > to cache the correct value e.g. as part of device
> > > > > > > > > structure. This replaces a branch with a memory load,
> > > > > > > > > so the gain would have to be measured, best done
> > > > > > > > > separately?
> > > > > > > > 
> > > > > > > > I think we'll want to do some measuring once the basic structure is
> > > > > > > > in place anyway.
> > > > > > > 
> > > > > > > What's meant by in place here?
> > > > > > 
> > > > > > That this patchset is ready :)
> > > > > 
> > > > > Also it's ready to the level where benchmarking is possible, right?  I
> > > > > don't think you should wait until we finish polishing up commit
> > > > > messages.
> > > > 
> > > > My point is that I haven't even found time yet to test this
> > > > thouroughly :(
> > > 
> > > If my experience shows anything, it's unlikely we'll get appropriate
> > > testing without code being upstream first.
> > > That's why I pushed on with sparse tagging btw.
> > > This way we can be reasonably sure we didn't miss some path.
> > 
> > I know that I'm likely the only one to test ccw (unless I manage to get
> > some other also-busy people to try this out).
> > 
> > What's the status of virtio-pci, btw? Can people actually test this
> > sanely?
> 
> Sure, I'm testing that it's not broken by these patches.
> Others can do so, too.

So basically just regression testing, right?

> 
> Once ccw is done on host and guest (will be complete after I
> send v8), it will be easier to add virtio 1.0 for more transports.
> 
> OTOH if we require that everything is ready and perfect before merging
> anything we'll never get anywhere.

I'm not looking for perfect, I'm just trying to juggle testing this +
doing qemu changes + various other stuff that is eating my time :)

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Daniel Mack @ 2014-12-01 13:50 UTC (permalink / raw)
  To: Andy Lutomirski, Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <546F977B.7040500-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

Hi Andy,

Sorry for the late response.

On 11/21/2014 08:50 PM, Andy Lutomirski wrote:
>> +static int kdbus_meta_append_cred(struct kdbus_meta *meta,
>> +				  const struct kdbus_domain *domain)
>> +{
>> +	struct kdbus_creds creds = {
>> +		.uid = from_kuid_munged(domain->user_namespace, current_uid()),
>> +		.gid = from_kgid_munged(domain->user_namespace, current_gid()),
>> +		.pid = task_pid_nr_ns(current, domain->pid_namespace),
>> +		.tid = task_tgid_nr_ns(current, domain->pid_namespace),
> 
> This is better than before -- at least it gets translation right part of
> the way.  But it's still wrong if the receiver's namespace doesn't match
> the domain.

Alright. The code now translates the items into each receiver's
namespaces individually, so we can also take possible chroot()
environments into account to do path translations.

> Also, please move pid and tgid into their own item.  They suck for
> reasons that have been beaten to death.  Let's make it possible to
> deprecate them separately in the future.

Ok, done now. As mentioned in the highpid thread, we can easily add
another item once we have a better source of information.

>> +static int kdbus_meta_append_caps(struct kdbus_meta *meta)
>> +{
>> +	struct caps {
>> +		u32 last_cap;
>> +		struct {
>> +			u32 caps[_KERNEL_CAPABILITY_U32S];
>> +		} set[4];
>> +	} caps;
>> +	unsigned int i;
>> +	const struct cred *cred = current_cred();
>> +
>> +	caps.last_cap = CAP_LAST_CAP;
>> +
>> +	for (i = 0; i < _KERNEL_CAPABILITY_U32S; i++) {
>> +		caps.set[0].caps[i] = cred->cap_inheritable.cap[i];
>> +		caps.set[1].caps[i] = cred->cap_permitted.cap[i];
>> +		caps.set[2].caps[i] = cred->cap_effective.cap[i];
>> +		caps.set[3].caps[i] = cred->cap_bset.cap[i];
>> +	}
> 
> Please leave this in so that I can root every single kdbus-using system.
>  It'll be lots of fun.
> 
> Snark aside, the correct fix is IMO to delete this function entirely.
> Even if you could find a way to implement it safely (which will be
> distinctly nontrivial), it seems like a bad idea to begin with.

Yes, we currently have no way of translating caps from one user
namespace to another, which means we cannot allow such an item to cross
user namespaces. We can add this functionality later once we have the
neccessary APIs.

>> +#ifdef CONFIG_CGROUPS
>> +static int kdbus_meta_append_cgroup(struct kdbus_meta *meta)
>> +{
>> +	char *buf, *path;
>> +	int ret;
>> +
>> +	buf = (char *)__get_free_page(GFP_TEMPORARY | __GFP_ZERO);
>> +	if (!buf)
>> +		return -ENOMEM;
>> +
>> +	path = task_cgroup_path(current, buf, PAGE_SIZE);
> 
> This may have strange interactions with cgroupns.  It's fixable, though,
> but only once you implement translation at receive time, and I think
> you'll have to do that to get any of this to work right.

Yes, agreed. We'll add translation to this item once cgroup namespaces
are in place. Until then, we'll expose the same information as other
parts of the Linux API already do.

>> +#ifdef CONFIG_AUDITSYSCALL
>> +static int kdbus_meta_append_audit(struct kdbus_meta *meta,
>> +				   const struct kdbus_domain *domain)
>> +{
>> +	struct kdbus_audit audit;
>> +
>> +	audit.loginuid = from_kuid_munged(domain->user_namespace,
>> +					  audit_get_loginuid(current));
>> +	audit.sessionid = audit_get_sessionid(current);
>> +
>> +	return kdbus_meta_append_data(meta, KDBUS_ITEM_AUDIT,
>> +				      &audit, sizeof(audit));
> 
> So *that's* what audit means.  Please document this and consider
> renaming it to something like AUDIT_LOGINUID_AND_SESSIONID.

The documentation was indeed bogus and is fixed now, along with man
other details. Thanks for spotting this!

>> +#ifdef CONFIG_SECURITY
>> +static int kdbus_meta_append_seclabel(struct kdbus_meta *meta)
>> +{
>> +	u32 len, sid;
>> +	char *label;
>> +	int ret;
>> +
>> +	security_task_getsecid(current, &sid);
>> +	ret = security_secid_to_secctx(sid, &label, &len);
>> +	if (ret == -EOPNOTSUPP)
>> +		return 0;
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (label && len > 0)
>> +		ret = kdbus_meta_append_data(meta, KDBUS_ITEM_SECLABEL,
>> +					     label, len);
> 
> This thing needs a clear, valid use case.  I think that the use case
> should document how non-enforcing mode is supposed to work, too.

kdbus just passes the seclabel along, if people want to use this for
security checks, then they need to call into libselinux, and should do
that by taking the enforcing mode into consideration. This is already
done in a lot of software that way.

> Also, there should be a justification for why the LSM hooks by
> themselves aren't good enough to remove the need for this.

SELinux and other MACs might want to do additional per-service security
checks. For example, a service manager might want to check the security
label of a service file against the security label of the client process
using the SELinux database before allow access. For this, we need to be
able to pass the client's security label race-free to libselinux so that
it can make its decision.

I've added the above to the documentation now.



Thanks for your review again - much appreciated!

Daniel

^ permalink raw reply

* Re: [PATCH v7 16/46] virtio_blk: v1.0 support
From: Michael S. Tsirkin @ 2014-12-01 13:47 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: thuth, rusty, linux-api, linux-kernel, virtualization,
	David Hildenbrand, pbonzini, David Miller
In-Reply-To: <20141201140004.57d53155.cornelia.huck@de.ibm.com>

On Mon, Dec 01, 2014 at 02:00:04PM +0100, Cornelia Huck wrote:
> On Mon, 1 Dec 2014 14:51:26 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Dec 01, 2014 at 01:40:36PM +0100, Cornelia Huck wrote:
> > > On Mon, 1 Dec 2014 14:34:55 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Mon, Dec 01, 2014 at 01:02:58PM +0100, Cornelia Huck wrote:
> > > > > On Mon, 1 Dec 2014 13:46:45 +0200
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > 
> > > > > > On Mon, Dec 01, 2014 at 12:33:15PM +0100, Cornelia Huck wrote:
> > > > > > > On Mon, 1 Dec 2014 11:26:58 +0200
> > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > 
> > > > > > > > For some places on data path, it might be worth it
> > > > > > > > to cache the correct value e.g. as part of device
> > > > > > > > structure. This replaces a branch with a memory load,
> > > > > > > > so the gain would have to be measured, best done
> > > > > > > > separately?
> > > > > > > 
> > > > > > > I think we'll want to do some measuring once the basic structure is
> > > > > > > in place anyway.
> > > > > > 
> > > > > > What's meant by in place here?
> > > > > 
> > > > > That this patchset is ready :)
> > > > 
> > > > Also it's ready to the level where benchmarking is possible, right?  I
> > > > don't think you should wait until we finish polishing up commit
> > > > messages.
> > > 
> > > My point is that I haven't even found time yet to test this
> > > thouroughly :(
> > 
> > If my experience shows anything, it's unlikely we'll get appropriate
> > testing without code being upstream first.
> > That's why I pushed on with sparse tagging btw.
> > This way we can be reasonably sure we didn't miss some path.
> 
> I know that I'm likely the only one to test ccw (unless I manage to get
> some other also-busy people to try this out).
> 
> What's the status of virtio-pci, btw? Can people actually test this
> sanely?

Sure, I'm testing that it's not broken by these patches.
Others can do so, too.

Once ccw is done on host and guest (will be complete after I
send v8), it will be easier to add virtio 1.0 for more transports.

OTOH if we require that everything is ready and perfect before merging
anything we'll never get anywhere.

-- 
MST

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v7 08/10] tpm: TPM 2.0 CRB Interface
From: Jarkko Sakkinen @ 2014-12-01 13:26 UTC (permalink / raw)
  To: Stefan Berger
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst, christophe.ricard,
	josh.triplett, linux-api, linux-kernel, tpmdd-devel,
	jason.gunthorpe, trousers-tech
In-Reply-To: <5475DE81.50308@linux.vnet.ibm.com>

On Wed, Nov 26, 2014 at 09:06:57AM -0500, Stefan Berger wrote:
> On 11/11/2014 08:45 AM, Jarkko Sakkinen wrote:
> >tpm_crb is a driver for TPM 2.0 Command Response Buffer (CRB) Interface
> >as defined in PC Client Platform TPM Profile (PTP) Specification.
> >
> >Only polling and single locality is supported as these are the limitations
> >of the available hardware, Platform Trust Techonlogy (PTT) in Haswell
> >CPUs.
> >
> >The driver always applies CRB with ACPI start because PTT reports using
> >only ACPI start as start method but as a result of my testing it requires
> >also CRB start.
> >
> >Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> >---
> >  drivers/char/tpm/Kconfig   |   9 ++
> >  drivers/char/tpm/Makefile  |   1 +
> >  drivers/char/tpm/tpm_crb.c | 323 +++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 333 insertions(+)
> >  create mode 100644 drivers/char/tpm/tpm_crb.c
> >
> >diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
> >index c54cac3..10c9419 100644
> >--- a/drivers/char/tpm/Kconfig
> >+++ b/drivers/char/tpm/Kconfig
> >@@ -122,4 +122,13 @@ config TCG_XEN
> >  	  To compile this driver as a module, choose M here; the module
> >  	  will be called xen-tpmfront.
> >
> >+config TCG_CRB
> >+	tristate "TPM 2.0 CRB Interface"
> >+	depends on X86 && ACPI
> >+	---help---
> >+	  If you have a TPM security chip that is compliant with the
> >+	  TCG CRB 2.0 TPM specification say Yes and it will be accessible
> >+	  from within Linux.  To compile this driver as a module, choose
> >+	  M here; the module will be called tpm_crb.
> >+
> >  endif # TCG_TPM
> >diff --git a/drivers/char/tpm/Makefile b/drivers/char/tpm/Makefile
> >index ae56af9..e6d26dd 100644
> >--- a/drivers/char/tpm/Makefile
> >+++ b/drivers/char/tpm/Makefile
> >@@ -22,3 +22,4 @@ obj-$(CONFIG_TCG_INFINEON) += tpm_infineon.o
> >  obj-$(CONFIG_TCG_IBMVTPM) += tpm_ibmvtpm.o
> >  obj-$(CONFIG_TCG_ST33_I2C) += tpm_i2c_stm_st33.o
> >  obj-$(CONFIG_TCG_XEN) += xen-tpmfront.o
> >+obj-$(CONFIG_TCG_CRB) += tpm_crb.o
> >diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
> >new file mode 100644
> >index 0000000..eb221d5
> >--- /dev/null
> >+++ b/drivers/char/tpm/tpm_crb.c
> >@@ -0,0 +1,323 @@
> >+/*
> >+ * Copyright (C) 2014 Intel Corporation
> >+ *
> >+ * Authors:
> >+ * Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> >+ *
> >+ * Maintained by: <tpmdd-devel@lists.sourceforge.net>
> >+ *
> >+ * This device driver implements the TPM interface as defined in
> >+ * the TCG CRB 2.0 TPM specification.
> >+ *
> >+ * This program is free software; you can redistribute it and/or
> >+ * modify it under the terms of the GNU General Public License
> >+ * as published by the Free Software Foundation; version 2
> >+ * of the License.
> >+ */
> >+
> >+#include <linux/acpi.h>
> >+#include <linux/highmem.h>
> >+#include <linux/rculist.h>
> >+#include <linux/module.h>
> >+#include <linux/platform_device.h>
> >+#include "tpm.h"
> >+
> >+#define ACPI_SIG_TPM2 "TPM2"
> >+
> >+static const u8 CRB_ACPI_START_UUID[] = {
> >+	/* 0000 */ 0xAB, 0x6C, 0xBF, 0x6B, 0x63, 0x54, 0x14, 0x47,
> >+	/* 0008 */ 0xB7, 0xCD, 0xF0, 0x20, 0x3C, 0x03, 0x68, 0xD4
> >+};
> >+
> >+enum crb_defaults {
> >+	CRB_ACPI_START_REVISION_ID = 1,
> >+	CRB_ACPI_START_INDEX = 1,
> >+};
> >+
> >+enum crb_start_method {
> >+	CRB_SM_ACPI_START = 2,
> >+	CRB_SM_CRB = 7,
> >+	CRB_SM_CRB_WITH_ACPI_START = 8,
> >+};
> >+
> >+struct acpi_tpm2 {
> >+	struct acpi_table_header hdr;
> >+	u16 platform_class;
> >+	u16 reserved;
> >+	u64 control_area_pa;
> >+	u32 start_method;
> >+};
> >+
> >+enum crb_ca_request {
> >+	CRB_CA_REQ_GO_IDLE	= BIT(0),
> >+	CRB_CA_REQ_CMD_READY	= BIT(1),
> >+};
> >+
> >+enum crb_ca_status {
> >+	CRB_CA_STS_ERROR	= BIT(0),
> >+	CRB_CA_STS_TPM_IDLE	= BIT(1),
> >+};
> >+
> >+struct crb_control_area {
> >+	u32 req;
> >+	u32 sts;
> >+	u32 cancel;
> >+	u32 start;
> >+	u32 int_enable;
> >+	u32 int_sts;
> >+	u32 cmd_size;
> >+	u64 cmd_pa;
> >+	u32 rsp_size;
> >+	u64 rsp_pa;
> >+} __packed;
> >+
> >+enum crb_status {
> >+	CRB_STS_COMPLETE	= BIT(0),
> >+};
> >+
> >+enum crb_flags {
> >+	CRB_FL_ACPI_START	= BIT(0),
> >+	CRB_FL_CRB_START	= BIT(1),
> >+};
> >+
> >+struct crb_priv {
> >+	unsigned int flags;
> >+	struct crb_control_area *cca;
> >+	unsigned long cca_pa;
> >+};
> >+
> >+#ifdef CONFIG_PM_SLEEP
> >+int crb_suspend(struct device *dev)
> >+{
> >+	return 0;
> >+}
> >+
> >+static int crb_resume(struct device *dev)
> >+{
> >+	struct tpm_chip *chip = dev_get_drvdata(dev);
> >+
> >+	(void) tpm2_do_selftest(chip);
> >+
> >+	return 0;
> >+}
> >+#endif
> >+
> >+static SIMPLE_DEV_PM_OPS(crb_pm, crb_suspend, crb_resume);
> >+
> >+static u8 crb_status(struct tpm_chip *chip)
> >+{
> >+	struct crb_priv *priv = chip->vendor.priv;
> >+	u8 sts = 0;
> >+
> >+	if ((le32_to_cpu(priv->cca->start) & 1) != 1)
> 
> Use a #define rather than the magic '1'.
> 
> 
> >+		sts |= CRB_STS_COMPLETE;
> >+
> >+	return sts;
> >+}
> >+
> >+static int crb_recv(struct tpm_chip *chip, u8 *buf, size_t count)
> >+{
> >+	struct crb_priv *priv = chip->vendor.priv;
> >+	struct crb_control_area *cca;
> >+	unsigned int expected;
> >+	unsigned long offset;
> >+	u8 *resp;
> >+
> >+	cca = priv->cca;
> >+	if (le32_to_cpu(cca->sts) & CRB_CA_STS_ERROR)
> >+		return -EIO;
> >+
> >+	offset = le64_to_cpu(cca->rsp_pa) - priv->cca_pa;
> >+	resp = (u8 *) ((unsigned long) cca + offset);
> 
> make sure that count >= 6?
> 
> >+	memcpy(buf, resp, 6);
> >+	expected = be32_to_cpup((__be32 *) &buf[2]);
> >+
> >+	if (expected > count)
> >+		return -EIO;
> >+
> >+	memcpy(&buf[6], &resp[6], expected - 6);
> >+
> >+	return expected;
> >+}
> >+
> >+static int crb_do_acpi_start(struct tpm_chip *chip)
> >+{
> >+	union acpi_object *obj;
> >+	int rc;
> >+
> >+	obj = acpi_evaluate_dsm(chip->acpi_dev_handle,
> >+				CRB_ACPI_START_UUID,
> >+				CRB_ACPI_START_REVISION_ID,
> >+				CRB_ACPI_START_INDEX,
> >+				NULL);
> >+	if (!obj)
> >+		return -ENXIO;
> >+	rc = obj->integer.value == 0 ? 0 : -ENXIO;
> >+	ACPI_FREE(obj);
> >+	return rc;
> >+}
> >+
> >+static int crb_send(struct tpm_chip *chip, u8 *buf, size_t len)
> >+{
> >+	struct crb_priv *priv = chip->vendor.priv;
> >+	struct crb_control_area *cca;
> >+	u8 *cmd;
> >+	int rc = 0;
> >+
> >+	cca = priv->cca;
> >+
> >+	if (len > le32_to_cpu(cca->cmd_size)) {
> >+		dev_err(&chip->dev,
> >+			"invalid command count value %x %zx\n",
> >+			(unsigned int) len,
> >+			(size_t) le32_to_cpu(cca->cmd_size));
> >+		return -E2BIG;
> >+	}
> >+
> >+	cmd = (u8 *) ((unsigned long) cca + le64_to_cpu(cca->cmd_pa) -
> >+		      priv->cca_pa);
> 
> cca = priv->cca per statement above     -> cmd = cca + x - cca = x
> 
> -> cmd = le64_to_cpu(cca->cmd_pa);
> 
> Should do the trick, no ?
> 
> >+	memcpy(cmd, buf, len);
> >+
> >+	/* Make sure that cmd is populated before issuing start. */
> >+	wmb();
> >+
> >+	cca->start = cpu_to_le32(1);
> >+	rc = crb_do_acpi_start(chip);
> 
> I had commented on this already. Your TPM seems to no implement the ACPI
> specs properly, or rather the ACPI table is wrong.
> You have to check whether the ACPI function needs to be called. The next TPM
> from a different vendor for whom the ACPI start function is not necessary
> will need this check here since it will give a return code indicating
> failure. Then your TPM won't work anymore! I think you should add a check
> into the crb_do_acpi_start for whether this function needs to be called or
> whether your TPM is being used (vendor check?) and run this start function
> then anyway.

PTT (4th Core CRB implementation) seems to report needing ACPI start only but
in practice requires both. I fixed this now by flagging ACPI start with
CRB_FL_ACPI_START always doing cca->start assignment. I added comment to 
document this issue.

Maybe it would good idea to check for MSFT0101 and do assignment of 
CRB_FL_CRB_START if that holds? Then all the quirks would be in the
initialization code.

> >+	return rc;
> >+}
> >+
> >+static void crb_cancel(struct tpm_chip *chip)
> >+{
> >+	struct crb_priv *priv = chip->vendor.priv;
> >+	struct crb_control_area *cca;
> >+
> >+	cca = priv->cca;
> >+	cca->cancel = cpu_to_le32(1);
> 
> nit: #define for this ?
> 
> >+
> >+	/* Make sure that cmd is populated before issuing start. */
> >+	wmb();
> >+
> >+	if (crb_do_acpi_start(chip))
> >+		dev_err(&chip->dev, "ACPI Start failed\n");
> >+
> >+	cca->cancel = 0;
> >+}
> >+
> >+static bool crb_req_canceled(struct tpm_chip *chip, u8 status)
> >+{
> >+	struct crb_priv *priv = chip->vendor.priv;
> >+
> >+	return (le32_to_cpu(priv->cca->cancel) & 1) == 1;
> >+}
> >+
> >+static const struct tpm_class_ops tpm_crb = {
> >+	.status = crb_status,
> >+	.recv = crb_recv,
> >+	.send = crb_send,
> >+	.cancel = crb_cancel,
> >+	.req_canceled = crb_req_canceled,
> >+	.req_complete_mask = CRB_STS_COMPLETE,
> >+	.req_complete_val = CRB_STS_COMPLETE,
> >+};
> >+
> >+static int crb_acpi_add(struct acpi_device *device)
> >+{
> >+	struct tpm_chip *chip;
> >+	struct acpi_tpm2 *buf;
> >+	struct crb_priv *priv;
> >+	struct device *dev = &device->dev;
> >+	acpi_status status;
> >+	u32 sm;
> >+	int rc;
> >+
> >+	chip = tpmm_chip_alloc(dev, &tpm_crb);
> >+	if (IS_ERR(chip))
> >+		return PTR_ERR(chip);
> >+
> >+	chip->flags = TPM_CHIP_FLAG_TPM2;
> >+
> >+	status = acpi_get_table(ACPI_SIG_TPM2, 1,
> >+				(struct acpi_table_header **) &buf);
> >+	if (ACPI_FAILURE(status)) {
> >+		dev_err(dev, "failed to get TPM2 ACPI table\n");
> >+		return -ENODEV;
> >+	}
> >+
> >+	priv = (struct crb_priv *) devm_kzalloc(dev, sizeof(struct crb_priv),
> >+						GFP_KERNEL);
> >+	if (!priv) {
> >+		dev_err(dev, "failed to devm_kzalloc for private data\n");
> >+		return -ENOMEM;
> >+	}
> >+
> >+	sm = le32_to_cpu(buf->start_method);
> 
> I wonder whether you should check whether that ACPI table is big enough to
> allow you accessing its start_method.
> 
> if (buf->length < sizeof(struct acpi_tpm2) ) {
>     return -EXYZ;
> }
> 
> >+
> >+	if (sm == CRB_SM_CRB || sm == CRB_SM_CRB_WITH_ACPI_START)
> >+		priv->flags |= CRB_FL_CRB_START;
> 
> You set this flag but you don't seem to check it anywhere.
> 
> >+
> >+	if (sm == CRB_SM_ACPI_START || sm == CRB_SM_CRB_WITH_ACPI_START)
> >+		priv->flags |= CRB_FL_ACPI_START;
> 
> 
> You set this flag but you don't seem to check it anywhere.
> >+
> >+	priv->cca_pa = le32_to_cpu(buf->control_area_pa);
> >+	priv->cca = (struct crb_control_area *)
> >+		devm_ioremap_nocache(dev, buf->control_area_pa, 0x1000);
> >+	if (!priv->cca) {
> >+		dev_err(dev, "allocating memory failed\n");
> >+		return -ENOMEM;
> >+	}
> >+
> >+	chip->vendor.priv = priv;
> >+
> >+	/* Default timeouts and durations */
> >+	chip->vendor.timeout_a = usecs_to_jiffies(TPM2_TIMEOUT_A);
> >+	chip->vendor.timeout_b = usecs_to_jiffies(TPM2_TIMEOUT_B);
> >+	chip->vendor.timeout_c = usecs_to_jiffies(TPM2_TIMEOUT_C);
> >+	chip->vendor.timeout_d = usecs_to_jiffies(TPM2_TIMEOUT_D);
> >+	chip->vendor.duration[TPM_SHORT] =
> >+		usecs_to_jiffies(TPM2_DURATION_SHORT);
> >+	chip->vendor.duration[TPM_MEDIUM] =
> >+		usecs_to_jiffies(TPM2_DURATION_MEDIUM);
> >+	chip->vendor.duration[TPM_LONG] =
> >+		usecs_to_jiffies(TPM2_DURATION_LONG);
> >+
> >+	chip->acpi_dev_handle = device->handle;
> >+
> >+	rc = tpm2_do_selftest(chip);
> >+	if (rc)
> >+		return rc;
> >+
> >+	return tpm_chip_register(chip);
> >+}
> >+
> >+int crb_acpi_remove(struct acpi_device *device)
> >+{
> >+	struct device *dev = &device->dev;
> >+	struct tpm_chip *chip = dev_get_drvdata(dev);
> >+
> >+	tpm_chip_unregister(chip);
> >+	return 0;
> >+}
> >+
> >+static struct acpi_device_id crb_device_ids[] = {
> >+	{"MSFT0101", 0},
> >+	{"", 0},
> >+};
> >+MODULE_DEVICE_TABLE(acpi, crb_device_ids);
> >+
> >+static struct acpi_driver crb_acpi_driver = {
> >+	.name = "tpm_crb",
> >+	.ids = crb_device_ids,
> >+	.ops = {
> >+		.add = crb_acpi_add,
> >+		.remove = crb_acpi_remove,
> >+	},
> >+	.drv = {
> >+		.pm = &crb_pm,
> >+	},
> >+};
> >+
> >+module_acpi_driver(crb_acpi_driver);
> >+MODULE_AUTHOR("Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>");
> >+MODULE_DESCRIPTION("TPM2 Driver");
> >+MODULE_VERSION("0.1");
> >+MODULE_LICENSE("GPL");
> 
> Regards,
>     Stefan
> 

/Jarkko

^ permalink raw reply

* Re: [PATCH v7 16/46] virtio_blk: v1.0 support
From: Cornelia Huck @ 2014-12-01 13:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: thuth, rusty, linux-api, linux-kernel, virtualization,
	David Hildenbrand, pbonzini, David Miller
In-Reply-To: <20141201125126.GB18106@redhat.com>

On Mon, 1 Dec 2014 14:51:26 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Dec 01, 2014 at 01:40:36PM +0100, Cornelia Huck wrote:
> > On Mon, 1 Dec 2014 14:34:55 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Mon, Dec 01, 2014 at 01:02:58PM +0100, Cornelia Huck wrote:
> > > > On Mon, 1 Dec 2014 13:46:45 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Mon, Dec 01, 2014 at 12:33:15PM +0100, Cornelia Huck wrote:
> > > > > > On Mon, 1 Dec 2014 11:26:58 +0200
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > 
> > > > > > > For some places on data path, it might be worth it
> > > > > > > to cache the correct value e.g. as part of device
> > > > > > > structure. This replaces a branch with a memory load,
> > > > > > > so the gain would have to be measured, best done
> > > > > > > separately?
> > > > > > 
> > > > > > I think we'll want to do some measuring once the basic structure is
> > > > > > in place anyway.
> > > > > 
> > > > > What's meant by in place here?
> > > > 
> > > > That this patchset is ready :)
> > > 
> > > Also it's ready to the level where benchmarking is possible, right?  I
> > > don't think you should wait until we finish polishing up commit
> > > messages.
> > 
> > My point is that I haven't even found time yet to test this
> > thouroughly :(
> 
> If my experience shows anything, it's unlikely we'll get appropriate
> testing without code being upstream first.
> That's why I pushed on with sparse tagging btw.
> This way we can be reasonably sure we didn't miss some path.

I know that I'm likely the only one to test ccw (unless I manage to get
some other also-busy people to try this out).

What's the status of virtio-pci, btw? Can people actually test this
sanely?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox