Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-02-03 10:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnd Bergmann, Ted Ts'o, Michael Kerrisk, Linux API,
	One Thousand Gnomes, Austin S Hemmelgarn, Tom Gundersen,
	Greg Kroah-Hartman, linux-kernel, Eric W. Biederman,
	David Herrmann, Djalal Harouni, Johannes Stezenbach,
	Christoph Hellwig
In-Reply-To: <CALCETrUh1Mse4CBQ4bfkJf+ew=kdpn46hMLS2QafLhfRTzQoBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Andy,

On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:

>> That's right, but again - if an application wants to gather this kind of
>> information about tasks it interacts with, it can do so today by looking
>> at /proc or similar sources. Desktop machines do exactly that already,
>> and the kernel code executed in such cases very much resembles that in
>> metadata.c, and is certainly not cheaper. kdbus just makes such
>> information more accessible when requested. Which information is
>> collected is defined by bit-masks on both the sender and the receiver
>> connection, and most applications will effectively only use a very
>> limited set by default if they go through one of the more high-level
>> libraries.
> 
> I should rephrase a bit.  Kdbus doesn't require use of send-time
> metadata.  It does, however, strongly encourage it, and it sounds like

On the kernel level, kdbus just *offers* that, just like sockets offer
SO_PASSCRED. On the userland level, kdbus helps applications get that
information race-free, easier and faster than they would otherwise.

> systemd and other major users will use send-time metadata.  Once that
> happens, it's ABI (even if it's purely in userspace), and changing it
> is asking for security holes to pop up.  So you'll be mostly stuck
> with it.

We know we can't break the ABI. At most, we could deprecate item types
and introduce new ones, but we want to avoid that by all means of
course. However, I fail to see how that is related to send time
metadata, or even to kdbus in general, as all ABIs have to be kept stable.

> Do you have some simple benchmark code you can share?  I'd like to
> play with it a bit.

Sure, it's part of the self-test suite. Call it with "-t benchmark" to
run the benchmark as isolated test with verbose output. The code for
that lives in test-benchmark.c.


Thanks,
Daniel

^ permalink raw reply

* Re: [Patch v3] firmware: dmi-sysfs: add SMBIOS entry point area raw attribute
From: Matt Fleming @ 2015-02-03 10:49 UTC (permalink / raw)
  To: Ivan Khoronzhuk
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	grant.likely-QSEj5FYQhm4dnm+yROfE0A,
	matt.fleming-ral2JQCrhuEAvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dmidecode-devel-qX2TKyscuCcdnm+yROfE0A,
	leif.lindholm-QSEj5FYQhm4dnm+yROfE0A
In-Reply-To: <54C906A9.1050001-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

On Wed, 28 Jan, at 05:56:25PM, Ivan Khoronzhuk wrote:
> >diff --git a/drivers/firmware/dmi-sysfs.c b/drivers/firmware/dmi-sysfs.c
> >index e0f1cb3..61b6a38 100644
> >--- a/drivers/firmware/dmi-sysfs.c
> >+++ b/drivers/firmware/dmi-sysfs.c
> >@@ -29,6 +29,8 @@
> >  #define MAX_ENTRY_TYPE 255 /* Most of these aren't used, but we consider
> >  			      the top entry type is only 8 bits */
> >+static const u8 *smbios_raw_header;

There appears to be a mixture of u8 and unsigned char going on here, cf.
'smbios_header'.

While I'm pretty sure all architectures typedef them to be equivalent,
semantically, as a reviewer this makes me think there are type issues.

Is there any way to use one data type for the SMBIOS header?

> >@@ -669,6 +699,18 @@ static int __init dmi_sysfs_init(void)
> >  		goto err;
> >  	}
> >+	smbios_raw_header = dmi_get_smbios_entry_area(&size);
> >+	if (!smbios_raw_header) {
> >+		pr_debug("dmi-sysfs: SMBIOS raw data is not available.\n");
> >+		error = -ENODATA;
> >+		goto err;

Perhaps this should be -EINVAL? -ENODATA implies that if you try again
in the future data might be available, i.e. it's a temporary failure.
That's not the case here since the header is invalid.

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Kirill A. Shutemov @ 2015-02-03 10:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dave Hansen, Mel Gorman, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-man-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <54D08483.40209-AlSwsSmVLrQ@public.gmane.org>

On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> [CC linux-api, man pages]
> 
> On 02/02/2015 11:22 PM, Dave Hansen wrote:
> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
> >> on the same region of memory and starts ignoring the hint. On an 8-core
> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
> > 
> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> > called:
> > 
> >>      MADV_DONTNEED
> >>               Do  not  expect  access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources
> >>               associated with it.)  Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents  from  the
> >>               underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> > 
> > So if we have anything depending on the behavior that it's _always_
> > zero-filled after an MADV_DONTNEED, this will break it.
> 
> OK, so that's a third person (including me) who understood it as a zero-fill
> guarantee. I think the man page should be clarified (if it's indeed not
> guaranteed), or we have a bug.
> 
> The implementation actually skips MADV_DONTNEED for
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.

It doesn't skip. It fails with -EINVAL. Or I miss something.

> - The word "will result" did sound as a guarantee at least to me. So here it
> could be changed to "may result (unless the advice is ignored)"?

It's too late to fix documentation. Applications already depends on the
beheviour.

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Mel Gorman @ 2015-02-03 11:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dave Hansen, linux-mm, Minchan Kim, Andrew Morton, linux-kernel,
	linux-api, mtk.manpages, linux-man
In-Reply-To: <54D08483.40209@suse.cz>

On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> [CC linux-api, man pages]
> 
> On 02/02/2015 11:22 PM, Dave Hansen wrote:
> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
> >> on the same region of memory and starts ignoring the hint. On an 8-core
> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
> > 
> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> > called:
> > 
> >>      MADV_DONTNEED
> >>               Do  not  expect  access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources
> >>               associated with it.)  Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents  from  the
> >>               underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> > 
> > So if we have anything depending on the behavior that it's _always_
> > zero-filled after an MADV_DONTNEED, this will break it.
> 
> OK, so that's a third person (including me) who understood it as a zero-fill
> guarantee. I think the man page should be clarified (if it's indeed not
> guaranteed), or we have a bug.
> 
> The implementation actually skips MADV_DONTNEED for
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
> 

This was the first reason why I did not consider the zero-filling to be a
guarantee. That said, at this point I'm also not considering pushing this
patch towards the kernel. I agree that this is a glibc bug so I've dropped
a line to some glibc people to see what they think the approach should be.

> I'm not sure about VM_PFNMAP, these are probably special enough. For mlock, one
> could expect that mlocking and MADV_DONTNEED would be in some opposition, but
> it's not documented in the manpage AFAIK. Neither is the hugetlb case, which
> could be really unexpected by the user.
> 

The equivalent posix page also lacks details on how exactly this flag
should behave. hugetlb is sortof special in that it's always backed by
a ram-based file where the contents can be refaulted. It gets hairy when
the mapping has been created to look anonymous but is not anonymous
really. The semantics of hugetlb have always been fuzzy.

> Next, what the man page says about guarantees:
> 
> "The kernel is free to ignore the advice."
> 
> - that would suggest that nothing is guaranteed
> 

Yep, another reason why I did not clear the page when ignoring the hint.

> "This call does not influence the semantics of the application (except in the
> case of MADV_DONTNEED)"
> 
> - that depends if the reader understands it as "does influence by MADV_DONTNEED"
> or "may influence by MADV_DONTNEED"
> 
> - btw, isn't MADV_DONTFORK another exception that does influence the semantics?
> And since it's mentioned as a workaround for some hardware, is it OK to ignore
> this advice?
> 

MADV_DONTFORK is also a Linux-specific extention. It happens to be one
that if it gets ignored then the application will be very surprised.

> And the part you already cited:
> 
> "Subsequent accesses of pages in this range will succeed, but will result either
> in reloading of the memory contents from the underlying mapped file (see
> mmap(2)) or zero-fill on-demand pages for mappings without an underlying file."
> 
> - The word "will result" did sound as a guarantee at least to me. So here it
> could be changed to "may result (unless the advice is ignored)"?
> 

The wording should be "may result" as there are circumstances where it
gets ignored even without this prototype patch.

> And if we agree that there is indeed no guarantee, what's the actual semantic
> difference from MADV_FREE? I guess none? So there's only a possible perfomance
> difference?
> 

Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier
operations and RSS is reduced. MADV_FREE only has an impact in the future
if there is memory pressure.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Vlastimil Babka @ 2015-02-03 11:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Mel Gorman, linux-mm, Minchan Kim, Andrew Morton,
	linux-kernel, linux-api, mtk.manpages, linux-man
In-Reply-To: <20150203105301.GC14259@node.dhcp.inet.fi>

On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>> [CC linux-api, man pages]
>> 
>> On 02/02/2015 11:22 PM, Dave Hansen wrote:
>> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
>> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
>> >> on the same region of memory and starts ignoring the hint. On an 8-core
>> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
>> > 
>> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
>> > called:
>> > 
>> >>      MADV_DONTNEED
>> >>               Do  not  expect  access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources
>> >>               associated with it.)  Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents  from  the
>> >>               underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
>> > 
>> > So if we have anything depending on the behavior that it's _always_
>> > zero-filled after an MADV_DONTNEED, this will break it.
>> 
>> OK, so that's a third person (including me) who understood it as a zero-fill
>> guarantee. I think the man page should be clarified (if it's indeed not
>> guaranteed), or we have a bug.
>> 
>> The implementation actually skips MADV_DONTNEED for
>> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
> 
> It doesn't skip. It fails with -EINVAL. Or I miss something.

No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
this case:

*  The application is attempting to release locked or shared pages (with
MADV_DONTNEED).

- that covers mlocking ok, not sure if the rest fits the "shared pages" case
though. I dont see any check for other kinds of shared pages in the code.

>> - The word "will result" did sound as a guarantee at least to me. So here it
>> could be changed to "may result (unless the advice is ignored)"?
> 
> It's too late to fix documentation. Applications already depends on the
> beheviour.

Right, so as long as they check for EINVAL, it should be safe. It appears that
jemalloc does.

I still wouldnt be sure just by reading the man page that the clearing is
guaranteed whenever I dont get an error return value, though,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [Patch v3] firmware: dmi-sysfs: add SMBIOS entry point area raw attribute
From: Ivan Khoronzhuk @ 2015-02-03 14:47 UTC (permalink / raw)
  To: Matt Fleming
  Cc: linux-kernel, ard.biesheuvel, grant.likely, matt.fleming,
	linux-api, linux-doc@vger.kernel.org, dmidecode-devel,
	leif.lindholm
In-Reply-To: <20150203104953.GA6461@codeblueprint.co.uk>


On 02/03/2015 12:49 PM, Matt Fleming wrote:
> On Wed, 28 Jan, at 05:56:25PM, Ivan Khoronzhuk wrote:
>>> diff --git a/drivers/firmware/dmi-sysfs.c b/drivers/firmware/dmi-sysfs.c
>>> index e0f1cb3..61b6a38 100644
>>> --- a/drivers/firmware/dmi-sysfs.c
>>> +++ b/drivers/firmware/dmi-sysfs.c
>>> @@ -29,6 +29,8 @@
>>>   #define MAX_ENTRY_TYPE 255 /* Most of these aren't used, but we consider
>>>   			      the top entry type is only 8 bits */
>>> +static const u8 *smbios_raw_header;
> There appears to be a mixture of u8 and unsigned char going on here, cf.
> 'smbios_header'.
>
> While I'm pretty sure all architectures typedef them to be equivalent,
> semantically, as a reviewer this makes me think there are type issues.
>
> Is there any way to use one data type for the SMBIOS header?

Let it be u8 in both cases.

>>> @@ -669,6 +699,18 @@ static int __init dmi_sysfs_init(void)
>>>   		goto err;
>>>   	}
>>> +	smbios_raw_header = dmi_get_smbios_entry_area(&size);
>>> +	if (!smbios_raw_header) {
>>> +		pr_debug("dmi-sysfs: SMBIOS raw data is not available.\n");
>>> +		error = -ENODATA;
>>> +		goto err;
> Perhaps this should be -EINVAL? -ENODATA implies that if you try again
> in the future data might be available, i.e. it's a temporary failure.
> That's not the case here since the header is invalid.
>

Yes, -EINVAL is better.
I'll send new patch soon.
Thanks!

^ permalink raw reply

* [RFC PATCH] fork: report pid reservation failure properly
From: Michal Hocko @ 2015-02-03 15:05 UTC (permalink / raw)
  To: Linux API
  Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Michael Kerrisk,
	LKML

Hi,
while debugging an unexpected ENOMEM from fork (there was no memory
pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
ENOMEM even when not short on memory.

In this particular case it was due to depleted pid space which is
documented to return EAGAIN in man pages.

Here is a quick fix up. I am sending that as a RFC because I am not
entirely sure about interaction with pid namespace which might cause
fork to fail. pid_ns_prepare_proc might return few error codes which are
not documented by man page but I am not sure whether that is actually
going to happen on a properly configured system.

There doesn't seem to be a good error code for an already "closed"
namespace (PIDNS_HASH_ADDING disabled) as well.  I've chosen EBUSY but
that might be completely wrong.  I am also wondering how would this
error case happen in the first place because parent should still exist
while fork is going on so the pid namespace shouldn't go away but I
smell this might have something to do with setns or something similar.

Any feedback would be appreciated.
---
>From 2a576175345fffa04da19cd51a25b7d94187b5f9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 30 Jan 2015 14:50:15 +0100
Subject: [PATCH] fork: report pid reservation failure properly

copy_process will report any failure in alloc_pid as ENOMEM currently
which is misleading because the pid allocation might fail not only when
the memory is short but also when the pid space is consumed already.

The current man page even mentions this case:
"
EAGAIN

      A system-imposed limit on the number of threads was encountered.
      There are a number of limits that may trigger this error: the
      RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
      limits the number of processes and threads for a real user ID, was
      reached; the kernel's system-wide limit on the number of processes
      and threads, /proc/sys/kernel/threads-max, was reached (see
      proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
      was reached (see proc(5)).
"

so the current behavior is also incorrect wrt. documentation. This patch
simply propagates error code from alloc_pid and makes sure we return
-EAGAIN due to reservation failure.

There is one side effect of the change which might be unexpected.
alloc_pid might fail even when the repear in the pid namespace is dead
(the namespace basically disallows all new processes) and there is no
good error code which would match documented ones. I've taken EBUSY
because it felt like a best fit for the condition - namespace is busy
tearing down all the infrastructure.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 kernel/fork.c |  5 +++--
 kernel/pid.c  | 19 +++++++++++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3ce8d57cff09..c37b88a162c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 	}
 
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
diff --git a/kernel/pid.c b/kernel/pid.c
index 82430c858d69..c1a5875bd1ab 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return -EAGAIN;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
@@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int retval;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
-		goto out;
+		return ERR_PTR(-ENOMEM);
 
 	tmp = ns;
 	pid->level = ns->level;
 	for (i = ns->level; i >= 0; i--) {
 		nr = alloc_pidmap(tmp);
-		if (nr < 0)
+		if (IS_ERR_VALUE(nr)) {
+			retval = nr;
 			goto out_free;
+		}
 
 		pid->numbers[i].nr = nr;
 		pid->numbers[i].ns = tmp;
@@ -316,7 +319,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	}
 
 	if (unlikely(is_child_reaper(pid))) {
-		if (pid_ns_prepare_proc(ns))
+		if ((retval = pid_ns_prepare_proc(ns)))
 			goto out_free;
 	}
 
@@ -327,8 +330,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
-	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
+	if (!(ns->nr_hashed & PIDNS_HASH_ADDING)) {
+		retval = -EBUSY;
 		goto out_unlock;
+	}
 	for ( ; upid >= pid->numbers; --upid) {
 		hlist_add_head_rcu(&upid->pid_chain,
 				&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
@@ -336,7 +341,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	}
 	spin_unlock_irq(&pidmap_lock);
 
-out:
 	return pid;
 
 out_unlock:
@@ -348,8 +352,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
-	goto out;
+	return ERR_PTR(retval);
 }
 
 void disable_pid_allocation(struct pid_namespace *ns)
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michal Hocko @ 2015-02-03 15:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vlastimil Babka, Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-man-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150203111600.GR2395-l3A5Bk7waGM@public.gmane.org>

On Tue 03-02-15 11:16:00, Mel Gorman wrote:
> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
[...]
> > And if we agree that there is indeed no guarantee, what's the actual semantic
> > difference from MADV_FREE? I guess none? So there's only a possible perfomance
> > difference?
> > 
> 
> Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier
> operations and RSS is reduced. MADV_FREE only has an impact in the future
> if there is memory pressure.

JFTR. the man page for MADV_FREE has been proposed already
(https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). I
do not see it in the man-pages git tree but the patch was not in time
for 3.19 so I guess it will only appear in 3.20.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2] selftests/exec: Check if the syscall exists and bail if not
From: Shuah Khan @ 2015-02-03 15:32 UTC (permalink / raw)
  To: David Drysdale, Michael Ellerman
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Geert Uytterhoeven, davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL
In-Reply-To: <CAHse=S9iLWK68mUSaVMEAwsaOEt4sTaM7XMBY-mv+ofk6sfT+g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 02/03/2015 12:58 AM, David Drysdale wrote:
> On Tue, Feb 3, 2015 at 3:53 AM, Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org> wrote:
>> On systems which don't implement sys_execveat(), this test produces a
>> lot of output.
>>
>> Add a check at the beginning to see if the syscall is present, and if
>> not just note one error and return.
>>
>> When we run on a system that doesn't implement the syscall we will get
>> ENOSYS back from the kernel, so change the logic that handles
>> __NR_execveat not being defined to also use ENOSYS rather than -ENOSYS.
>>
>> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
> 
> Acked-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Thanks Michael, and David. I will queue this for 3.20

-- Shuah


-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* Re: [RFC PATCH] fork: report pid reservation failure properly
From: Michael Kerrisk (man-pages) @ 2015-02-03 15:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux API, Andrew Morton, Oleg Nesterov, Eric W. Biederman, LKML
In-Reply-To: <20150203150557.GB8907-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

Hi Michal,


On 3 February 2015 at 16:05, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> Hi,
> while debugging an unexpected ENOMEM from fork (there was no memory
> pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
> ENOMEM even when not short on memory.
>
> In this particular case it was due to depleted pid space which is
> documented to return EAGAIN in man pages.
>
> Here is a quick fix up.

Could you summarize briefly what the user-space visible change is
here? It is not so obvious from your message. I believe you're turning
some cases of ENOMEM into EAGAIN, right? Note, by the way, that if I
understandwhat you intend, this change would bring the implementation
closer to POSIX, which specifies:

       EAGAIN The  system  lacked  the  necessary  resources  to  create
              another  process, or the system-imposed limit on the total
              number of processes under execution system-wide  or  by  a
              single user {CHILD_MAX} would be exceeded.

Thanks,

Michael

> I am sending that as a RFC because I am not
> entirely sure about interaction with pid namespace which might cause
> fork to fail. pid_ns_prepare_proc might return few error codes which are
> not documented by man page but I am not sure whether that is actually
> going to happen on a properly configured system.
>
> There doesn't seem to be a good error code for an already "closed"
> namespace (PIDNS_HASH_ADDING disabled) as well.  I've chosen EBUSY but
> that might be completely wrong.  I am also wondering how would this
> error case happen in the first place because parent should still exist
> while fork is going on so the pid namespace shouldn't go away but I
> smell this might have something to do with setns or something similar.
>
> Any feedback would be appreciated.
> ---
> From 2a576175345fffa04da19cd51a25b7d94187b5f9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Date: Fri, 30 Jan 2015 14:50:15 +0100
> Subject: [PATCH] fork: report pid reservation failure properly
>
> copy_process will report any failure in alloc_pid as ENOMEM currently
> which is misleading because the pid allocation might fail not only when
> the memory is short but also when the pid space is consumed already.
>
> The current man page even mentions this case:
> "
> EAGAIN
>
>       A system-imposed limit on the number of threads was encountered.
>       There are a number of limits that may trigger this error: the
>       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
>       limits the number of processes and threads for a real user ID, was
>       reached; the kernel's system-wide limit on the number of processes
>       and threads, /proc/sys/kernel/threads-max, was reached (see
>       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
>       was reached (see proc(5)).
> "
>
> so the current behavior is also incorrect wrt. documentation. This patch
> simply propagates error code from alloc_pid and makes sure we return
> -EAGAIN due to reservation failure.
>
> There is one side effect of the change which might be unexpected.
> alloc_pid might fail even when the repear in the pid namespace is dead
> (the namespace basically disallows all new processes) and there is no
> good error code which would match documented ones. I've taken EBUSY
> because it felt like a best fit for the condition - namespace is busy
> tearing down all the infrastructure.
>
> Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> ---
>  kernel/fork.c |  5 +++--
>  kernel/pid.c  | 19 +++++++++++--------
>  2 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3ce8d57cff09..c37b88a162c5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>                 goto bad_fork_cleanup_io;
>
>         if (pid != &init_struct_pid) {
> -               retval = -ENOMEM;
>                 pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> -               if (!pid)
> +               if (IS_ERR(pid)) {
> +                       retval = PTR_ERR(pid);
>                         goto bad_fork_cleanup_io;
> +               }
>         }
>
>         p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 82430c858d69..c1a5875bd1ab 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>                 }
>                 pid = mk_pid(pid_ns, map, offset);
>         }
> -       return -1;
> +       return -EAGAIN;
>  }
>
>  int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
> @@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>         int i, nr;
>         struct pid_namespace *tmp;
>         struct upid *upid;
> +       int retval;
>
>         pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
>         if (!pid)
> -               goto out;
> +               return ERR_PTR(-ENOMEM);
>
>         tmp = ns;
>         pid->level = ns->level;
>         for (i = ns->level; i >= 0; i--) {
>                 nr = alloc_pidmap(tmp);
> -               if (nr < 0)
> +               if (IS_ERR_VALUE(nr)) {
> +                       retval = nr;
>                         goto out_free;
> +               }
>
>                 pid->numbers[i].nr = nr;
>                 pid->numbers[i].ns = tmp;
> @@ -316,7 +319,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>         }
>
>         if (unlikely(is_child_reaper(pid))) {
> -               if (pid_ns_prepare_proc(ns))
> +               if ((retval = pid_ns_prepare_proc(ns)))
>                         goto out_free;
>         }
>
> @@ -327,8 +330,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
>         upid = pid->numbers + ns->level;
>         spin_lock_irq(&pidmap_lock);
> -       if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
> +       if (!(ns->nr_hashed & PIDNS_HASH_ADDING)) {
> +               retval = -EBUSY;
>                 goto out_unlock;
> +       }
>         for ( ; upid >= pid->numbers; --upid) {
>                 hlist_add_head_rcu(&upid->pid_chain,
>                                 &pid_hash[pid_hashfn(upid->nr, upid->ns)]);
> @@ -336,7 +341,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>         }
>         spin_unlock_irq(&pidmap_lock);
>
> -out:
>         return pid;
>
>  out_unlock:
> @@ -348,8 +352,7 @@ out_free:
>                 free_pidmap(pid->numbers + i);
>
>         kmem_cache_free(ns->pid_cachep, pid);
> -       pid = NULL;
> -       goto out;
> +       return ERR_PTR(retval);
>  }
>
>  void disable_pid_allocation(struct pid_namespace *ns)
> --
> 2.1.4
>
> --
> Michal Hocko
> SUSE Labs



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* [PATCH v1 1/4] scsi: ufs: add ioctl interface for query request
From: Gilad Broner @ 2015-02-03 15:37 UTC (permalink / raw)
  To: James.Bottomley
  Cc: linux-kernel, linux-scsi, linux-arm-msm, santoshsy,
	linux-scsi-owner, subhashj, ygardi, draviv, Noa Rubens,
	Raviv Shvili, Vinayak Holikatti, James E.J. Bottomley,
	open list:ABI/API
In-Reply-To: <1422977841-25036-1-git-send-email-gbroner@codeaurora.org>

From: Dolev Raviv <draviv@codeaurora.org>

This patch exposes the ioctl interface for UFS driver via SCSI device
ioctl interface. As of now UFS driver would provide the ioctl for query
interface to connected UFS device.

Signed-off-by: Dolev Raviv <draviv@codeaurora.org>
Signed-off-by: Noa Rubens <noag@codeaurora.org>
Signed-off-by: Raviv Shvili <rshvili@codeaurora.org>
Signed-off-by: Yaniv Gardi <ygardi@codeaurora.org>
---
 drivers/scsi/ufs/ufs.h        |  53 +++-------
 drivers/scsi/ufs/ufshcd.c     | 225 +++++++++++++++++++++++++++++++++++++++++-
 include/uapi/scsi/Kbuild      |   1 +
 include/uapi/scsi/ufs/Kbuild  |   3 +
 include/uapi/scsi/ufs/ioctl.h |  57 +++++++++++
 include/uapi/scsi/ufs/ufs.h   |  66 +++++++++++++
 6 files changed, 361 insertions(+), 44 deletions(-)
 create mode 100644 include/uapi/scsi/ufs/Kbuild
 create mode 100644 include/uapi/scsi/ufs/ioctl.h
 create mode 100644 include/uapi/scsi/ufs/ufs.h

diff --git a/drivers/scsi/ufs/ufs.h b/drivers/scsi/ufs/ufs.h
index 42c459a..1f023c4 100644
--- a/drivers/scsi/ufs/ufs.h
+++ b/drivers/scsi/ufs/ufs.h
@@ -38,6 +38,7 @@
 
 #include <linux/mutex.h>
 #include <linux/types.h>
+#include <scsi/ufs/ufs.h>
 
 #define MAX_CDB_SIZE	16
 #define GENERAL_UPIU_REQUEST_SIZE 32
@@ -71,6 +72,16 @@ enum {
 	UFS_UPIU_RPMB_WLUN		= 0xC4,
 };
 
+/**
+ * ufs_is_valid_unit_desc_lun - checks if the given LUN has a unit descriptor
+ * @lun: LU number to check
+ * @return: true if the lun has a matching unit descriptor, false otherwise
+ */
+static inline bool ufs_is_valid_unit_desc_lun(u8 lun)
+{
+	return (lun == UFS_UPIU_RPMB_WLUN || (lun < UFS_UPIU_MAX_GENERAL_LUN));
+}
+
 /*
  * UFS Protocol Information Unit related definitions
  */
@@ -126,35 +137,6 @@ enum {
 	UPIU_QUERY_FUNC_STANDARD_WRITE_REQUEST          = 0x81,
 };
 
-/* Flag idn for Query Requests*/
-enum flag_idn {
-	QUERY_FLAG_IDN_FDEVICEINIT      = 0x01,
-	QUERY_FLAG_IDN_PWR_ON_WPE	= 0x03,
-	QUERY_FLAG_IDN_BKOPS_EN         = 0x04,
-};
-
-/* Attribute idn for Query requests */
-enum attr_idn {
-	QUERY_ATTR_IDN_ACTIVE_ICC_LVL	= 0x03,
-	QUERY_ATTR_IDN_BKOPS_STATUS	= 0x05,
-	QUERY_ATTR_IDN_EE_CONTROL	= 0x0D,
-	QUERY_ATTR_IDN_EE_STATUS	= 0x0E,
-};
-
-/* Descriptor idn for Query requests */
-enum desc_idn {
-	QUERY_DESC_IDN_DEVICE		= 0x0,
-	QUERY_DESC_IDN_CONFIGURAION	= 0x1,
-	QUERY_DESC_IDN_UNIT		= 0x2,
-	QUERY_DESC_IDN_RFU_0		= 0x3,
-	QUERY_DESC_IDN_INTERCONNECT	= 0x4,
-	QUERY_DESC_IDN_STRING		= 0x5,
-	QUERY_DESC_IDN_RFU_1		= 0x6,
-	QUERY_DESC_IDN_GEOMETRY		= 0x7,
-	QUERY_DESC_IDN_POWER		= 0x8,
-	QUERY_DESC_IDN_MAX,
-};
-
 enum desc_header_offset {
 	QUERY_DESC_LENGTH_OFFSET	= 0x00,
 	QUERY_DESC_DESC_TYPE_OFFSET	= 0x01,
@@ -247,19 +229,6 @@ enum bkops_status {
 	BKOPS_STATUS_MAX		 = BKOPS_STATUS_CRITICAL,
 };
 
-/* UTP QUERY Transaction Specific Fields OpCode */
-enum query_opcode {
-	UPIU_QUERY_OPCODE_NOP		= 0x0,
-	UPIU_QUERY_OPCODE_READ_DESC	= 0x1,
-	UPIU_QUERY_OPCODE_WRITE_DESC	= 0x2,
-	UPIU_QUERY_OPCODE_READ_ATTR	= 0x3,
-	UPIU_QUERY_OPCODE_WRITE_ATTR	= 0x4,
-	UPIU_QUERY_OPCODE_READ_FLAG	= 0x5,
-	UPIU_QUERY_OPCODE_SET_FLAG	= 0x6,
-	UPIU_QUERY_OPCODE_CLEAR_FLAG	= 0x7,
-	UPIU_QUERY_OPCODE_TOGGLE_FLAG	= 0x8,
-};
-
 /* Query response result code */
 enum {
 	QUERY_RESULT_SUCCESS                    = 0x00,
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 5d60a86..cb357f8 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -3,7 +3,7 @@
  *
  * This code is based on drivers/scsi/ufs/ufshcd.c
  * Copyright (C) 2011-2013 Samsung India Software Operations
- * Copyright (c) 2013-2014, The Linux Foundation. All rights reserved.
+ * Copyright (c) 2013-2015, The Linux Foundation. All rights reserved.
  *
  * Authors:
  *	Santosh Yaraganavi <santosh.sy@samsung.com>
@@ -39,6 +39,7 @@
 
 #include <linux/async.h>
 #include <linux/devfreq.h>
+#include <scsi/ufs/ioctl.h>
 
 #include "ufshcd.h"
 #include "unipro.h"
@@ -74,6 +75,9 @@
 /* Interrupt aggregation default timeout, unit: 40us */
 #define INT_AGGR_DEF_TO	0x02
 
+/* IOCTL opcode for command - ufs set device read only */
+#define UFS_IOCTL_BLKROSET      BLKROSET
+
 #define ufshcd_toggle_vreg(_dev, _vreg, _on)				\
 	({                                                              \
 		int _ret;                                               \
@@ -1882,7 +1886,7 @@ static inline int ufshcd_read_unit_desc_param(struct ufs_hba *hba,
 	 * Unit descriptors are only available for general purpose LUs (LUN id
 	 * from 0 to 7) and RPMB Well known LU.
 	 */
-	if (lun != UFS_UPIU_RPMB_WLUN && (lun >= UFS_UPIU_MAX_GENERAL_LUN))
+	if (!ufs_is_valid_unit_desc_lun(lun))
 		return -EOPNOTSUPP;
 
 	return ufshcd_read_desc_param(hba, QUERY_DESC_IDN_UNIT, lun,
@@ -4201,6 +4205,222 @@ static void ufshcd_async_scan(void *data, async_cookie_t cookie)
 	ufshcd_probe_hba(hba);
 }
 
+/**
+ * ufshcd_query_ioctl - perform user read queries
+ * @hba: per-adapter instance
+ * @lun: used for lun specific queries
+ * @buffer: user space buffer for reading and submitting query data and params
+ * @return: 0 for success negative error code otherwise
+ *
+ * Expected/Submitted buffer structure is struct ufs_ioctl_query_data.
+ * It will read the opcode, idn and buf_length parameters, and, put the
+ * response in the buffer field while updating the used size in buf_length.
+ */
+static int ufshcd_query_ioctl(struct ufs_hba *hba, u8 lun, void __user *buffer)
+{
+	struct ufs_ioctl_query_data *ioctl_data;
+	int err = 0;
+	int length = 0;
+	void *data_ptr;
+	bool flag;
+	u32 att;
+	u8 index;
+	u8 *desc = NULL;
+
+	ioctl_data = kzalloc(sizeof(struct ufs_ioctl_query_data), GFP_KERNEL);
+	if (!ioctl_data) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* extract params from user buffer */
+	err = copy_from_user(ioctl_data, buffer,
+			sizeof(struct ufs_ioctl_query_data));
+	if (err) {
+		dev_err(hba->dev,
+			"%s: Failed copying buffer from user, err %d\n",
+			__func__, err);
+		goto out_release_mem;
+	}
+
+	/* verify legal parameters & send query */
+	switch (ioctl_data->opcode) {
+	case UPIU_QUERY_OPCODE_READ_DESC:
+		switch (ioctl_data->idn) {
+		case QUERY_DESC_IDN_DEVICE:
+		case QUERY_DESC_IDN_CONFIGURAION:
+		case QUERY_DESC_IDN_INTERCONNECT:
+		case QUERY_DESC_IDN_GEOMETRY:
+		case QUERY_DESC_IDN_POWER:
+			index = 0;
+			break;
+		case QUERY_DESC_IDN_UNIT:
+			if (!ufs_is_valid_unit_desc_lun(lun)) {
+				dev_err(hba->dev,
+					"%s: No unit descriptor for lun 0x%x\n",
+					__func__, lun);
+				err = -EINVAL;
+				goto out_release_mem;
+			}
+			index = lun;
+			break;
+		default:
+			goto out_einval;
+		}
+		length = min_t(int, QUERY_DESC_MAX_SIZE,
+				ioctl_data->buf_size);
+		desc = kzalloc(length, GFP_KERNEL);
+		if (!desc) {
+			dev_err(hba->dev, "%s: Failed allocating %d bytes\n",
+					__func__, length);
+			err = -ENOMEM;
+			goto out_release_mem;
+		}
+		err = ufshcd_query_descriptor(hba, ioctl_data->opcode,
+				ioctl_data->idn, index, 0, desc, &length);
+		break;
+	case UPIU_QUERY_OPCODE_READ_ATTR:
+		switch (ioctl_data->idn) {
+		case QUERY_ATTR_IDN_BOOT_LU_EN:
+		case QUERY_ATTR_IDN_POWER_MODE:
+		case QUERY_ATTR_IDN_ACTIVE_ICC_LVL:
+		case QUERY_ATTR_IDN_OOO_DATA_EN:
+		case QUERY_ATTR_IDN_BKOPS_STATUS:
+		case QUERY_ATTR_IDN_PURGE_STATUS:
+		case QUERY_ATTR_IDN_MAX_DATA_IN:
+		case QUERY_ATTR_IDN_MAX_DATA_OUT:
+		case QUERY_ATTR_IDN_REF_CLK_FREQ:
+		case QUERY_ATTR_IDN_CONF_DESC_LOCK:
+		case QUERY_ATTR_IDN_MAX_NUM_OF_RTT:
+		case QUERY_ATTR_IDN_EE_CONTROL:
+		case QUERY_ATTR_IDN_EE_STATUS:
+		case QUERY_ATTR_IDN_SECONDS_PASSED:
+			index = 0;
+			break;
+		case QUERY_ATTR_IDN_DYN_CAP_NEEDED:
+		case QUERY_ATTR_IDN_CORR_PRG_BLK_NUM:
+			index = lun;
+			break;
+		default:
+			goto out_einval;
+		}
+		err = ufshcd_query_attr(hba, ioctl_data->opcode,
+					ioctl_data->idn, index, 0, &att);
+		break;
+	case UPIU_QUERY_OPCODE_READ_FLAG:
+		switch (ioctl_data->idn) {
+		case QUERY_FLAG_IDN_FDEVICEINIT:
+		case QUERY_FLAG_IDN_PERMANENT_WPE:
+		case QUERY_FLAG_IDN_PWR_ON_WPE:
+		case QUERY_FLAG_IDN_BKOPS_EN:
+		case QUERY_FLAG_IDN_PURGE_ENABLE:
+		case QUERY_FLAG_IDN_FPHYRESOURCEREMOVAL:
+		case QUERY_FLAG_IDN_BUSY_RTC:
+			break;
+		default:
+			goto out_einval;
+		}
+		err = ufshcd_query_flag(hba, ioctl_data->opcode,
+					ioctl_data->idn, &flag);
+		break;
+	default:
+		goto out_einval;
+	}
+
+	if (err) {
+		dev_err(hba->dev, "%s: Query for idn %d failed\n", __func__,
+					ioctl_data->idn);
+		goto out_release_mem;
+	}
+
+	/*
+	 * copy response data
+	 * As we might end up reading less data then what is specified in
+	 * "ioct_data->buf_size". So we are updating "ioct_data->
+	 * buf_size" to what exactly we have read.
+	 */
+	switch (ioctl_data->opcode) {
+	case UPIU_QUERY_OPCODE_READ_DESC:
+		ioctl_data->buf_size = min_t(int, ioctl_data->buf_size, length);
+		data_ptr = desc;
+		break;
+	case UPIU_QUERY_OPCODE_READ_ATTR:
+		ioctl_data->buf_size = sizeof(u32);
+		data_ptr = &att;
+		break;
+	case UPIU_QUERY_OPCODE_READ_FLAG:
+		ioctl_data->buf_size = 1;
+		data_ptr = &flag;
+		break;
+	default:
+		BUG_ON(true);
+	}
+
+	/* copy to user */
+	err = copy_to_user(buffer, ioctl_data,
+			sizeof(struct ufs_ioctl_query_data));
+	if (err)
+		dev_err(hba->dev, "%s: Failed copying back to user.\n",
+			__func__);
+	err = copy_to_user(buffer + sizeof(struct ufs_ioctl_query_data),
+			data_ptr, ioctl_data->buf_size);
+	if (err)
+		dev_err(hba->dev, "%s: err %d copying back to user.\n",
+				__func__, err);
+	goto out_release_mem;
+
+out_einval:
+	dev_err(hba->dev,
+		"%s: illegal ufs query ioctl data, opcode 0x%x, idn 0x%x\n",
+		__func__, ioctl_data->opcode, (unsigned int)ioctl_data->idn);
+	err = -EINVAL;
+out_release_mem:
+	kfree(ioctl_data);
+	kfree(desc);
+out:
+	return err;
+}
+
+/**
+ * ufshcd_ioctl - ufs ioctl callback registered in scsi_host
+ * @dev: scsi device required for per LUN queries
+ * @cmd: command opcode
+ * @buffer: user space buffer for transferring data
+ *
+ * Supported commands:
+ * UFS_IOCTL_QUERY
+ */
+static int ufshcd_ioctl(struct scsi_device *dev, int cmd, void __user *buffer)
+{
+	struct ufs_hba *hba = shost_priv(dev->host);
+	int err = 0;
+
+	BUG_ON(!hba);
+	if (!buffer) {
+		dev_err(hba->dev, "%s: User buffer is NULL!\n", __func__);
+		return -EINVAL;
+	}
+
+	switch (cmd) {
+	case UFS_IOCTL_QUERY:
+		pm_runtime_get_sync(hba->dev);
+		err = ufshcd_query_ioctl(hba, ufshcd_scsi_to_upiu_lun(dev->lun),
+				buffer);
+		pm_runtime_put_sync(hba->dev);
+		break;
+	case UFS_IOCTL_BLKROSET:
+		err = -ENOIOCTLCMD;
+		break;
+	default:
+		err = -EINVAL;
+		dev_err(hba->dev, "%s: Illegal ufs-IOCTL cmd %d\n", __func__,
+				cmd);
+		break;
+	}
+
+	return err;
+}
+
 static struct scsi_host_template ufshcd_driver_template = {
 	.module			= THIS_MODULE,
 	.name			= UFSHCD,
@@ -4213,6 +4433,7 @@ static struct scsi_host_template ufshcd_driver_template = {
 	.eh_abort_handler	= ufshcd_abort,
 	.eh_device_reset_handler = ufshcd_eh_device_reset_handler,
 	.eh_host_reset_handler   = ufshcd_eh_host_reset_handler,
+	.ioctl			= ufshcd_ioctl,
 	.this_id		= -1,
 	.sg_tablesize		= SG_ALL,
 	.cmd_per_lun		= UFSHCD_CMD_PER_LUN,
diff --git a/include/uapi/scsi/Kbuild b/include/uapi/scsi/Kbuild
index 75746d5..d404525 100644
--- a/include/uapi/scsi/Kbuild
+++ b/include/uapi/scsi/Kbuild
@@ -1,5 +1,6 @@
 # UAPI Header export list
 header-y += fc/
+header-y += ufs/
 header-y += scsi_bsg_fc.h
 header-y += scsi_netlink.h
 header-y += scsi_netlink_fc.h
diff --git a/include/uapi/scsi/ufs/Kbuild b/include/uapi/scsi/ufs/Kbuild
new file mode 100644
index 0000000..cc3ef20
--- /dev/null
+++ b/include/uapi/scsi/ufs/Kbuild
@@ -0,0 +1,3 @@
+# UAPI Header export list
+header-y += ioctl.h
+header-y += ufs.h
diff --git a/include/uapi/scsi/ufs/ioctl.h b/include/uapi/scsi/ufs/ioctl.h
new file mode 100644
index 0000000..bc4eed7
--- /dev/null
+++ b/include/uapi/scsi/ufs/ioctl.h
@@ -0,0 +1,57 @@
+#ifndef UAPI_UFS_IOCTL_H_
+#define UAPI_UFS_IOCTL_H_
+
+#include <linux/types.h>
+
+/*
+ *  IOCTL opcode for ufs queries has the following opcode after
+ *  SCSI_IOCTL_GET_PCI
+ */
+#define UFS_IOCTL_QUERY			0x5388
+
+/**
+ * struct ufs_ioctl_query_data - used to transfer data to and from user via ioctl
+ * @opcode: type of data to query (descriptor/attribute/flag)
+ * @idn: id of the data structure
+ * @buf_size: number of allocated bytes/data size on return
+ * @buffer: data location
+ *
+ * Received: buffer and buf_size (available space for transferred data)
+ * Submitted: opcode, idn, length, buf_size
+ */
+struct ufs_ioctl_query_data {
+	/*
+	 * User should select one of the opcode defined in "enum query_opcode".
+	 * Please check include/uapi/scsi/ufs/ufs.h for the definition of it.
+	 * Note that only UPIU_QUERY_OPCODE_READ_DESC,
+	 * UPIU_QUERY_OPCODE_READ_ATTR & UPIU_QUERY_OPCODE_READ_FLAG are
+	 * supported as of now. All other query_opcode would be considered
+	 * invalid.
+	 * As of now only read query operations are supported.
+	 */
+	__u32 opcode;
+	/*
+	 * User should select one of the idn from "enum flag_idn" or "enum
+	 * attr_idn" or "enum desc_idn" based on whether opcode above is
+	 * attribute, flag or descriptor.
+	 * Please check include/uapi/scsi/ufs/ufs.h for the definition of it.
+	 */
+	__u8 idn;
+	/*
+	 * User should specify the size of the buffer (buffer[0] below) where
+	 * it wants to read the query data (attribute/flag/descriptor).
+	 * As we might end up reading less data then what is specified in
+	 * buf_size. So we are updating buf_size to what exactly we have read.
+	 */
+	__u16 buf_size;
+	/*
+	 * placeholder for the start of the data buffer where kernel will copy
+	 * the query data (attribute/flag/descriptor) read from the UFS device
+	 * Note:
+	 * For Read Attribute you will have to allocate 4 bytes
+	 * For Read Flag you will have to allocate 1 byte
+	 */
+	__u8 buffer[0];
+};
+
+#endif /* UAPI_UFS_IOCTL_H_ */
diff --git a/include/uapi/scsi/ufs/ufs.h b/include/uapi/scsi/ufs/ufs.h
new file mode 100644
index 0000000..894ea45
--- /dev/null
+++ b/include/uapi/scsi/ufs/ufs.h
@@ -0,0 +1,66 @@
+#ifndef UAPI_UFS_H_
+#define UAPI_UFS_H_
+
+/* Flag idn for Query Requests*/
+enum flag_idn {
+	QUERY_FLAG_IDN_FDEVICEINIT		= 0x01,
+	QUERY_FLAG_IDN_PERMANENT_WPE		= 0x02,
+	QUERY_FLAG_IDN_PWR_ON_WPE		= 0x03,
+	QUERY_FLAG_IDN_BKOPS_EN			= 0x04,
+	QUERY_FLAG_IDN_RESERVED1		= 0x05,
+	QUERY_FLAG_IDN_PURGE_ENABLE		= 0x06,
+	QUERY_FLAG_IDN_RESERVED2		= 0x07,
+	QUERY_FLAG_IDN_FPHYRESOURCEREMOVAL      = 0x08,
+	QUERY_FLAG_IDN_BUSY_RTC			= 0x09,
+};
+
+/* Attribute idn for Query requests */
+enum attr_idn {
+	QUERY_ATTR_IDN_BOOT_LU_EN		= 0x00,
+	QUERY_ATTR_IDN_RESERVED			= 0x01,
+	QUERY_ATTR_IDN_POWER_MODE		= 0x02,
+	QUERY_ATTR_IDN_ACTIVE_ICC_LVL		= 0x03,
+	QUERY_ATTR_IDN_OOO_DATA_EN		= 0x04,
+	QUERY_ATTR_IDN_BKOPS_STATUS		= 0x05,
+	QUERY_ATTR_IDN_PURGE_STATUS		= 0x06,
+	QUERY_ATTR_IDN_MAX_DATA_IN		= 0x07,
+	QUERY_ATTR_IDN_MAX_DATA_OUT		= 0x08,
+	QUERY_ATTR_IDN_DYN_CAP_NEEDED		= 0x09,
+	QUERY_ATTR_IDN_REF_CLK_FREQ		= 0x0A,
+	QUERY_ATTR_IDN_CONF_DESC_LOCK		= 0x0B,
+	QUERY_ATTR_IDN_MAX_NUM_OF_RTT		= 0x0C,
+	QUERY_ATTR_IDN_EE_CONTROL		= 0x0D,
+	QUERY_ATTR_IDN_EE_STATUS		= 0x0E,
+	QUERY_ATTR_IDN_SECONDS_PASSED		= 0x0F,
+	QUERY_ATTR_IDN_CNTX_CONF		= 0x10,
+	QUERY_ATTR_IDN_CORR_PRG_BLK_NUM		= 0x11,
+};
+
+/* Descriptor idn for Query requests */
+enum desc_idn {
+	QUERY_DESC_IDN_DEVICE		= 0x0,
+	QUERY_DESC_IDN_CONFIGURAION	= 0x1,
+	QUERY_DESC_IDN_UNIT		= 0x2,
+	QUERY_DESC_IDN_RFU_0		= 0x3,
+	QUERY_DESC_IDN_INTERCONNECT	= 0x4,
+	QUERY_DESC_IDN_STRING		= 0x5,
+	QUERY_DESC_IDN_RFU_1		= 0x6,
+	QUERY_DESC_IDN_GEOMETRY		= 0x7,
+	QUERY_DESC_IDN_POWER		= 0x8,
+	QUERY_DESC_IDN_RFU_2		= 0x9,
+	QUERY_DESC_IDN_MAX,
+};
+
+/* UTP QUERY Transaction Specific Fields OpCode */
+enum query_opcode {
+	UPIU_QUERY_OPCODE_NOP		= 0x0,
+	UPIU_QUERY_OPCODE_READ_DESC	= 0x1,
+	UPIU_QUERY_OPCODE_WRITE_DESC	= 0x2,
+	UPIU_QUERY_OPCODE_READ_ATTR	= 0x3,
+	UPIU_QUERY_OPCODE_WRITE_ATTR	= 0x4,
+	UPIU_QUERY_OPCODE_READ_FLAG	= 0x5,
+	UPIU_QUERY_OPCODE_SET_FLAG	= 0x6,
+	UPIU_QUERY_OPCODE_CLEAR_FLAG	= 0x7,
+	UPIU_QUERY_OPCODE_TOGGLE_FLAG	= 0x8,
+};
+#endif /* UAPI_UFS_H_ */
-- 
Qualcomm Israel, on behalf of Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related

* Re: [RFC PATCH] fork: report pid reservation failure properly
From: Michal Hocko @ 2015-02-03 15:52 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Linux API, Andrew Morton, Oleg Nesterov, Eric W. Biederman, LKML
In-Reply-To: <CAKgNAkjiz6Uop_QVBNyyHFu8kgAR46nB1HWi0yOKt4ZX=666vg@mail.gmail.com>

On Tue 03-02-15 16:33:03, Michael Kerrisk wrote:
> Hi Michal,
> 
> 
> On 3 February 2015 at 16:05, Michal Hocko <mhocko@suse.cz> wrote:
> > Hi,
> > while debugging an unexpected ENOMEM from fork (there was no memory
> > pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
> > ENOMEM even when not short on memory.
> >
> > In this particular case it was due to depleted pid space which is
> > documented to return EAGAIN in man pages.
> >
> > Here is a quick fix up.
> 
> Could you summarize briefly what the user-space visible change is
> here?

The user visible change is that the userspace will get EAGAIN when
calling fork and the pid space is depleted because of a system wide
limit as per man page description rather than ENOMEM which we return
currently.

> It is not so obvious from your message. I believe you're turning
> some cases of ENOMEM into EAGAIN, right?

Yes, except for the case mentioned below which discusses a potential
error code for pid namespace triggered failures.

> Note, by the way, that if I understandwhat you intend, this change
> would bring the implementation closer to POSIX, which specifies:

True.

HTH.

>        EAGAIN The  system  lacked  the  necessary  resources  to  create
>               another  process, or the system-imposed limit on the total
>               number of processes under execution system-wide  or  by  a
>               single user {CHILD_MAX} would be exceeded.
> 
> Thanks,
> 
> Michael
> 
> > I am sending that as a RFC because I am not
> > entirely sure about interaction with pid namespace which might cause
> > fork to fail. pid_ns_prepare_proc might return few error codes which are
> > not documented by man page but I am not sure whether that is actually
> > going to happen on a properly configured system.
> >
> > There doesn't seem to be a good error code for an already "closed"
> > namespace (PIDNS_HASH_ADDING disabled) as well.  I've chosen EBUSY but
> > that might be completely wrong.  I am also wondering how would this
> > error case happen in the first place because parent should still exist
> > while fork is going on so the pid namespace shouldn't go away but I
> > smell this might have something to do with setns or something similar.
> >
> > Any feedback would be appreciated.
> > ---
> > From 2a576175345fffa04da19cd51a25b7d94187b5f9 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Fri, 30 Jan 2015 14:50:15 +0100
> > Subject: [PATCH] fork: report pid reservation failure properly
> >
> > copy_process will report any failure in alloc_pid as ENOMEM currently
> > which is misleading because the pid allocation might fail not only when
> > the memory is short but also when the pid space is consumed already.
> >
> > The current man page even mentions this case:
> > "
> > EAGAIN
> >
> >       A system-imposed limit on the number of threads was encountered.
> >       There are a number of limits that may trigger this error: the
> >       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
> >       limits the number of processes and threads for a real user ID, was
> >       reached; the kernel's system-wide limit on the number of processes
> >       and threads, /proc/sys/kernel/threads-max, was reached (see
> >       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
> >       was reached (see proc(5)).
> > "
> >
> > so the current behavior is also incorrect wrt. documentation. This patch
> > simply propagates error code from alloc_pid and makes sure we return
> > -EAGAIN due to reservation failure.
> >
> > There is one side effect of the change which might be unexpected.
> > alloc_pid might fail even when the repear in the pid namespace is dead
> > (the namespace basically disallows all new processes) and there is no
> > good error code which would match documented ones. I've taken EBUSY
> > because it felt like a best fit for the condition - namespace is busy
> > tearing down all the infrastructure.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > ---
> >  kernel/fork.c |  5 +++--
> >  kernel/pid.c  | 19 +++++++++++--------
> >  2 files changed, 14 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3ce8d57cff09..c37b88a162c5 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1405,10 +1405,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >                 goto bad_fork_cleanup_io;
> >
> >         if (pid != &init_struct_pid) {
> > -               retval = -ENOMEM;
> >                 pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> > -               if (!pid)
> > +               if (IS_ERR(pid)) {
> > +                       retval = PTR_ERR(pid);
> >                         goto bad_fork_cleanup_io;
> > +               }
> >         }
> >
> >         p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 82430c858d69..c1a5875bd1ab 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -207,7 +207,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> >                 }
> >                 pid = mk_pid(pid_ns, map, offset);
> >         }
> > -       return -1;
> > +       return -EAGAIN;
> >  }
> >
> >  int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
> > @@ -298,17 +298,20 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >         int i, nr;
> >         struct pid_namespace *tmp;
> >         struct upid *upid;
> > +       int retval;
> >
> >         pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> >         if (!pid)
> > -               goto out;
> > +               return ERR_PTR(-ENOMEM);
> >
> >         tmp = ns;
> >         pid->level = ns->level;
> >         for (i = ns->level; i >= 0; i--) {
> >                 nr = alloc_pidmap(tmp);
> > -               if (nr < 0)
> > +               if (IS_ERR_VALUE(nr)) {
> > +                       retval = nr;
> >                         goto out_free;
> > +               }
> >
> >                 pid->numbers[i].nr = nr;
> >                 pid->numbers[i].ns = tmp;
> > @@ -316,7 +319,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >         }
> >
> >         if (unlikely(is_child_reaper(pid))) {
> > -               if (pid_ns_prepare_proc(ns))
> > +               if ((retval = pid_ns_prepare_proc(ns)))
> >                         goto out_free;
> >         }
> >
> > @@ -327,8 +330,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >
> >         upid = pid->numbers + ns->level;
> >         spin_lock_irq(&pidmap_lock);
> > -       if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
> > +       if (!(ns->nr_hashed & PIDNS_HASH_ADDING)) {
> > +               retval = -EBUSY;
> >                 goto out_unlock;
> > +       }
> >         for ( ; upid >= pid->numbers; --upid) {
> >                 hlist_add_head_rcu(&upid->pid_chain,
> >                                 &pid_hash[pid_hashfn(upid->nr, upid->ns)]);
> > @@ -336,7 +341,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >         }
> >         spin_unlock_irq(&pidmap_lock);
> >
> > -out:
> >         return pid;
> >
> >  out_unlock:
> > @@ -348,8 +352,7 @@ out_free:
> >                 free_pidmap(pid->numbers + i);
> >
> >         kmem_cache_free(ns->pid_cachep, pid);
> > -       pid = NULL;
> > -       goto out;
> > +       return ERR_PTR(retval);
> >  }
> >
> >  void disable_pid_allocation(struct pid_namespace *ns)
> > --
> > 2.1.4
> >
> > --
> > Michal Hocko
> > SUSE Labs
> 
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michael Kerrisk (man-pages) @ 2015-02-03 16:20 UTC (permalink / raw)
  To: Vlastimil Babka, Kirill A. Shutemov
  Cc: mtk.manpages, Dave Hansen, Mel Gorman, linux-mm, Minchan Kim,
	Andrew Morton, linux-kernel, linux-api, linux-man, Hugh Dickins
In-Reply-To: <54D0B43D.8000209@suse.cz>

Hello Vlastimil

Thanks for CCing me into this thread.

On 02/03/2015 12:42 PM, Vlastimil Babka wrote:
> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>>> [CC linux-api, man pages]
>>>
>>> On 02/02/2015 11:22 PM, Dave Hansen wrote:
>>>> On 02/02/2015 08:55 AM, Mel Gorman wrote:
>>>>> This patch identifies when a thread is frequently calling MADV_DONTNEED
>>>>> on the same region of memory and starts ignoring the hint. On an 8-core
>>>>> single-socket machine this was the impact on ebizzy using glibc 2.19.
>>>>
>>>> The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
>>>> called:
>>>>
>>>>>      MADV_DONTNEED
>>>>>               Do  not  expect  access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources
>>>>>               associated with it.)  Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents  from  the
>>>>>               underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
>>>>
>>>> So if we have anything depending on the behavior that it's _always_
>>>> zero-filled after an MADV_DONTNEED, this will break it.
>>>
>>> OK, so that's a third person (including me) who understood it as a zero-fill
>>> guarantee. I think the man page should be clarified (if it's indeed not
>>> guaranteed), or we have a bug.
>>>
>>> The implementation actually skips MADV_DONTNEED for
>>> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
>>
>> It doesn't skip. It fails with -EINVAL. Or I miss something.
> 
> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
> this case:
> 
> *  The application is attempting to release locked or shared pages (with
> MADV_DONTNEED).

Yes, there is that. But the page could be more explicit when discussing
MADV_DONTNEED in the main text. I've done that.

> - that covers mlocking ok, not sure if the rest fits the "shared pages" case
> though. I dont see any check for other kinds of shared pages in the code.

Agreed. "shared" here seems confused. I've removed it. And I've
added mention of "Huge TLB pages" for this error.

>>> - The word "will result" did sound as a guarantee at least to me. So here it
>>> could be changed to "may result (unless the advice is ignored)"?
>>
>> It's too late to fix documentation. Applications already depends on the
>> beheviour.
> 
> Right, so as long as they check for EINVAL, it should be safe. It appears that
> jemalloc does.

So, first a brief question: in the cases where the call does not error out,
are we agreed that in the current implementation, MADV_DONTNEED will
always result in zero-filled pages when the region is faulted back in
(when we consider pages that are not backed by a file)?

> I still wouldnt be sure just by reading the man page that the clearing is
> guaranteed whenever I dont get an error return value, though,

I'm not quite sure what you want here. I mean: if there's an error,
then the DONTNEED action didn't occur, right? Therefore, there won't
be zero-filled pages. But, for what it's worth, I added "If the
operation succeeds" at the start of that sentence beginning "Subsequent
accesses...".

Now, some history, explaining why the page is a bit of a mess,
and for that matter why I could really use more help on it from MM
folk (especially in the form of actual patches [1], rather than notes
about deficiencies in the documentation), because:

    ***I simply cannot keep up with all of the details***.

Once upon a time (Linux 2.4), there was madvise() with just 5 flags:

       MADV_NORMAL
       MADV_RANDOM
       MADV_SEQUENTIAL
       MADV_WILLNEED
       MADV_DONTNEED

And already a dozen years ago, *I* added the text about MADV_DONTNEED.
Back then, I believe it was true. I'm not sure if it's still true now,
but I assume for the moment that it is, and await feedback. And the 
text saying that the call does not affect the semantics of memory 
access dates back even further (and was then true, MADV_DONTNEED aside).

Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there
is a semantic mismatch between the destructive MADV_DONTNEED and
POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear
on most other implementations.

Since the original implementation, numerous pieces of cruft^W^W^W
excellent new flags have been overloaded into this one system call.
Some of those certainly violated the "does not change the semantics
of the application" statement, but, sadly, the kernel developers who
implemented MADV_REMOVE or MADV_DONTFORK did not think to send a
patch to the man page for those new flags, one that might have noted
that the semantics of the application are changed by such flags. Equally
sadly, I did overlook to scan the bigger page when *I* added 
documentation of these flags to those pages, otherwise I might have 
caught that detail.

So, just to repeat, I  could really use more help on it from MM
folk in the form of actual patches to the man page.

Thanks,

Michael

[1] https://www.kernel.org/doc/man-pages/patches.html

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michael Kerrisk (man-pages) @ 2015-02-03 16:25 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, minchan Kim, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg
In-Reply-To: <20150203152121.GC8914-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

On 02/03/2015 04:21 PM, Michal Hocko wrote:
> On Tue 03-02-15 11:16:00, Mel Gorman wrote:
>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> [...]
>>> And if we agree that there is indeed no guarantee, what's the actual semantic
>>> difference from MADV_FREE? I guess none? So there's only a possible perfomance
>>> difference?
>>>
>>
>> Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier
>> operations and RSS is reduced. MADV_FREE only has an impact in the future
>> if there is memory pressure.
> 
> JFTR. the man page for MADV_FREE has been proposed already
> (https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). I
> do not see it in the man-pages git tree but the patch was not in time
> for 3.19 so I guess it will only appear in 3.20.
> 

Yikes! That patch was buried in the bottom of a locked filing cabinet
in a disused lavatory. I unfortunately don't read every thread that comes
my way, especially if it doesn't look like a man-pages patch (i.e., falls
in the middle of an LKML thread that starts on another topic, and doesn't 
see linux-man@). I'll respond to that patch soon. (There are some problems
that mean I could not accept it, AFAICT.)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Michael Kerrisk (man-pages) @ 2015-02-03 16:39 UTC (permalink / raw)
  To: Michal Hocko, Minchan Kim
  Cc: mtk.manpages, Andrew Morton, linux-kernel, linux-mm, linux-api,
	Hugh Dickins, Johannes Weiner, Rik van Riel, KOSAKI Motohiro,
	Mel Gorman, Jason Evans, zhangyanfei, Kirill A. Shutemov,
	Kirill A. Shutemov
In-Reply-To: <20141205083249.GA2321@dhcp22.suse.cz>

Hello Minchan (and Michal)

I did not see this patch until just now when Michael explicitly
mentioned it in another discussion because
(a) it was buried in an LMKL thread that started a topic
    that was not about a man-pages patch.
(b) linux-man@ was not CCed.

When resubmitting this patch, could you please To:me and CC linux-man@
and give the mail a suitable subject line indicating a man-pages patch.

On 12/05/2014 09:32 AM, Michal Hocko wrote:
> On Fri 05-12-14 16:08:16, Minchan Kim wrote:
> [...]
>> From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
>> From: Minchan Kim <minchan@kernel.org>
>> Date: Mon, 1 Dec 2014 08:53:55 +0900
>> Subject: [PATCH] madvise.2: Document MADV_FREE
>>
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks!
> 
>> ---
>>  man2/madvise.2 | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/man2/madvise.2 b/man2/madvise.2
>> index 032ead7..fc1aaca 100644
>> --- a/man2/madvise.2
>> +++ b/man2/madvise.2
>> @@ -265,6 +265,18 @@ file (see
>>  .BR MADV_DODUMP " (since Linux 3.4)"
>>  Undo the effect of an earlier
>>  .BR MADV_DONTDUMP .
>> +.TP
>> +.BR MADV_FREE " (since Linux 3.19)"
>> +Tell the kernel that contents in the specified address range are no
>> +longer important and the range will be overwritten. When there is
>> +demand for memory, the system will free pages associated with the
>> +specified address range. In this instance, the next time a page in the
>> +address range is referenced, it will contain all zeroes.  Otherwise,
>> +it will contain the data that was there prior to the MADV_FREE call.
>> +References made to the address range will not make the system read
>> +from backing store (swap space) until the page is modified again.
>> +It works only with private anonymous pages (see
>> +.BR mmap (2)).
>>  .SH RETURN VALUE
>>  On success
>>  .BR madvise ()

If I'm reading the conversation right, the initially proposed text 
was from the BSD man page (which would be okay), but most of the 
text above seems  to have come straight from the page here:
http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3

Right?

Unfortunately, I don't think we can use that text. It's from the 
Solaris man page as far as I can tell, and I doubt that it's 
under a license that we can use.

If that's the case, we need to go back and come up with an
original text. It might draw inspiration from the Solaris page,
and take actual text from the BSD page (which is under a free
license), and it might also draw inspiration from Jon Corbet's 
description at http://lwn.net/Articles/590991/. 

Could you take another shot this please!

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH] fork: report pid reservation failure properly
From: Eric W. Biederman @ 2015-02-03 20:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Kerrisk (man-pages), Linux API, Andrew Morton,
	Oleg Nesterov, LKML
In-Reply-To: <20150203155248.GD8907@dhcp22.suse.cz>

Michal Hocko <mhocko@suse.cz> writes:

> On Tue 03-02-15 16:33:03, Michael Kerrisk wrote:
>> Hi Michal,
>> 
>> 
>> On 3 February 2015 at 16:05, Michal Hocko <mhocko@suse.cz> wrote:
>> > Hi,
>> > while debugging an unexpected ENOMEM from fork (there was no memory
>> > pressure and OVERCOMMIT_ALWAYS) I have found out that fork returns
>> > ENOMEM even when not short on memory.
>> >
>> > In this particular case it was due to depleted pid space which is
>> > documented to return EAGAIN in man pages.
>> >
>> > Here is a quick fix up.
>> 
>> Could you summarize briefly what the user-space visible change is
>> here?
>
> The user visible change is that the userspace will get EAGAIN when
> calling fork and the pid space is depleted because of a system wide
> limit as per man page description rather than ENOMEM which we return
> currently.

I don't think that EAGAIN is any better than ENOMEM,
nor do I know that it is safe to return EBUSY from fork.  What nonsense
will applications do when they see an unexpected error code.

>> It is not so obvious from your message. I believe you're turning
>> some cases of ENOMEM into EAGAIN, right?
>
> Yes, except for the case mentioned below which discusses a potential
> error code for pid namespace triggered failures.
>
>> Note, by the way, that if I understandwhat you intend, this change
>> would bring the implementation closer to POSIX, which specifies:
>
> True.
>
> HTH.
>
>>        EAGAIN The  system  lacked  the  necessary  resources  to  create
>>               another  process, or the system-imposed limit on the total
>>               number of processes under execution system-wide  or  by  a
>>               single user {CHILD_MAX} would be exceeded.
>> 

Note.  All of those documented errors documented to return EAGAIN
are the kind of errors that if you wait a while you can reasonably
expect fork to succeed later.

With respecting to dealing with errors from fork, fork
is a major pain.  Fork only has only two return codes documented,
and fork is one of the most complicated system calls in the kernel with
the most failure modes of any system call I am familiar with.  Mapping 
a plethora of failure modes onto two error codes is always going to be
problematic from some view point.

EAGAIN is a bad idea in general because that means try again and if you
have hit a fixed limit trying again is wrong.

Frankly I think posix is probably borked to recommend EAGAIN instead of
ENOMEM.

Everyone in the world uses fork which makes is quite tricky to figure
out which assumptions on the return values of fork exist in the wild,
so it is not clear if it is safe to add new more descriptive return
messages.

With respect to the case where PIDNS_HASH_ADDING would cause fork to
fail, that only happens after init has exited in a pid namespace, so it
is very much a permanent failure, and there are no longer any processes
in the specific pid namespace nor will there ever be any more processes
in that pid namespace.  EINVAL might actually makes sense.  Of course
a sensible error code from fork does not seem to be allowed.

Of the two return codes that are allowed for fork, EAGAIN and ENOMEM
ENOMEM seems to be better as it is a more permanement failure.

I agree it is a little confusing, but I don't see anything that is other
than a little confusing.

Other than someone doing:

unshare(CLONE_NEWPID);
pid = fork();
waitpid(pid);
fork();   /* returns ENOMEM */

Was there any other real world issue that started this goal to fix fork?

I think there is a reasonable argument for digging into the fork return
code situation.  Perhaps it is just a matter of returning exotic return
codes for the weird and strange cases like trying to create a pid in a
dead pid namespace.

But what we have works, and I don't know of anything bad that happens
except when people are developing new code they get confused.

Further we can't count on people to read their man pages because this
behavior of returning ENOMEM is documented in pid_namespaces(7).  Which
makes me really thinking changing the code to match the manpage is more
likely to break code than to fix code.

Eric

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Minchan Kim @ 2015-02-03 23:47 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm, linux-api,
	Hugh Dickins, Johannes Weiner, Rik van Riel, KOSAKI Motohiro,
	Mel Gorman, Jason Evans, zhangyanfei, Kirill A. Shutemov,
	Kirill A. Shutemov
In-Reply-To: <54D0F9BC.4060306@gmail.com>

Hello, Michael

On Tue, Feb 03, 2015 at 05:39:24PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Minchan (and Michal)
> 
> I did not see this patch until just now when Michael explicitly
> mentioned it in another discussion because
> (a) it was buried in an LMKL thread that started a topic
>     that was not about a man-pages patch.
> (b) linux-man@ was not CCed.

Sorry about that.

> 
> When resubmitting this patch, could you please To:me and CC linux-man@
> and give the mail a suitable subject line indicating a man-pages patch.

Sure.

> 
> On 12/05/2014 09:32 AM, Michal Hocko wrote:
> > On Fri 05-12-14 16:08:16, Minchan Kim wrote:
> > [...]
> >> From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
> >> From: Minchan Kim <minchan@kernel.org>
> >> Date: Mon, 1 Dec 2014 08:53:55 +0900
> >> Subject: [PATCH] madvise.2: Document MADV_FREE
> >>
> >> Signed-off-by: Minchan Kim <minchan@kernel.org>
> > 
> > Reviewed-by: Michal Hocko <mhocko@suse.cz>
> > 
> > Thanks!
> > 
> >> ---
> >>  man2/madvise.2 | 12 ++++++++++++
> >>  1 file changed, 12 insertions(+)
> >>
> >> diff --git a/man2/madvise.2 b/man2/madvise.2
> >> index 032ead7..fc1aaca 100644
> >> --- a/man2/madvise.2
> >> +++ b/man2/madvise.2
> >> @@ -265,6 +265,18 @@ file (see
> >>  .BR MADV_DODUMP " (since Linux 3.4)"
> >>  Undo the effect of an earlier
> >>  .BR MADV_DONTDUMP .
> >> +.TP
> >> +.BR MADV_FREE " (since Linux 3.19)"
> >> +Tell the kernel that contents in the specified address range are no
> >> +longer important and the range will be overwritten. When there is
> >> +demand for memory, the system will free pages associated with the
> >> +specified address range. In this instance, the next time a page in the
> >> +address range is referenced, it will contain all zeroes.  Otherwise,
> >> +it will contain the data that was there prior to the MADV_FREE call.
> >> +References made to the address range will not make the system read
> >> +from backing store (swap space) until the page is modified again.
> >> +It works only with private anonymous pages (see
> >> +.BR mmap (2)).
> >>  .SH RETURN VALUE
> >>  On success
> >>  .BR madvise ()
> 
> If I'm reading the conversation right, the initially proposed text 
> was from the BSD man page (which would be okay), but most of the 
> text above seems  to have come straight from the page here:
> http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3
> 
> Right?

True. Solaris man page was really straightforward/clear rather than BSD.

> 
> Unfortunately, I don't think we can use that text. It's from the 
> Solaris man page as far as I can tell, and I doubt that it's 
> under a license that we can use.
> 
> If that's the case, we need to go back and come up with an
> original text. It might draw inspiration from the Solaris page,
> and take actual text from the BSD page (which is under a free
> license), and it might also draw inspiration from Jon Corbet's 
> description at http://lwn.net/Articles/590991/. 
> 
> Could you take another shot this please!

No problem. I will test my essay writing skill.
Thanks. 

> 
> Thanks,
> 
> Michael
> 
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Minchan Kim @ 2015-02-04  0:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Dave Hansen, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-man-u79uwXL29TY76Z2rM5mHXA, Rik van Riel
In-Reply-To: <54D0B43D.8000209-AlSwsSmVLrQ@public.gmane.org>

On Tue, Feb 03, 2015 at 12:42:53PM +0100, Vlastimil Babka wrote:
> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
> > On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> >> [CC linux-api, man pages]
> >> 
> >> On 02/02/2015 11:22 PM, Dave Hansen wrote:
> >> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
> >> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
> >> >> on the same region of memory and starts ignoring the hint. On an 8-core
> >> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
> >> > 
> >> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> >> > called:
> >> > 
> >> >>      MADV_DONTNEED
> >> >>               Do  not  expect  access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources
> >> >>               associated with it.)  Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents  from  the
> >> >>               underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> >> > 
> >> > So if we have anything depending on the behavior that it's _always_
> >> > zero-filled after an MADV_DONTNEED, this will break it.
> >> 
> >> OK, so that's a third person (including me) who understood it as a zero-fill
> >> guarantee. I think the man page should be clarified (if it's indeed not
> >> guaranteed), or we have a bug.
> >> 
> >> The implementation actually skips MADV_DONTNEED for
> >> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
> > 
> > It doesn't skip. It fails with -EINVAL. Or I miss something.
> 
> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
> this case:
> 
> *  The application is attempting to release locked or shared pages (with
> MADV_DONTNEED).
> 
> - that covers mlocking ok, not sure if the rest fits the "shared pages" case
> though. I dont see any check for other kinds of shared pages in the code.
> 
> >> - The word "will result" did sound as a guarantee at least to me. So here it
> >> could be changed to "may result (unless the advice is ignored)"?
> > 
> > It's too late to fix documentation. Applications already depends on the
> > beheviour.
> 
> Right, so as long as they check for EINVAL, it should be safe. It appears that
> jemalloc does.
> 
> I still wouldnt be sure just by reading the man page that the clearing is
> guaranteed whenever I dont get an error return value, though,
> 

IMHO,

Man page said
"MADV_DONTNEED: Subsequent accesses of pages in this range will succeed,
 but will result either in reloading of  the memory contents from the
 underlying mapped file (see mmap(2)) or  zero-fill-on-demand pages
 for mappings without an underlying file."

Heap by allocated by malloc(3) is anonymous page so it's a mapping
withtout an underlying file so userspace can expect zero-fill.

Man page said
"EINVAL: The application is attempting to release locked or
shared pages (with MADV_DONTNEED)"

So, user can expect the call on area by allocated by malloc(3)
if he doesn't call mlock will always be successful.

Man page said
"madivse: This call does not influence the semantics of the application
(except in the case of MADV_DONTNEED)"

So, we shouldn't break MADV_DONTNEED's semantic which free pages
instantly. It's a long time semantic and it was one of arguable issues
on MADV_FREE Rik had tried long time ago to replace MADV_DONTNEED
with MADV_FREE.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Andy Lutomirski @ 2015-02-04  0:41 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Arnd Bergmann, Ted Ts'o, Michael Kerrisk, Linux API,
	One Thousand Gnomes, Austin S Hemmelgarn, Tom Gundersen,
	Greg Kroah-Hartman, linux-kernel, Eric W. Biederman,
	David Herrmann, Djalal Harouni, Johannes Stezenbach,
	Christoph Hellwig
In-Reply-To: <54D09E6B.2020903-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

On Tue, Feb 3, 2015 at 2:09 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> Hi Andy,
>
> On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
>> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>
>>> That's right, but again - if an application wants to gather this kind of
>>> information about tasks it interacts with, it can do so today by looking
>>> at /proc or similar sources. Desktop machines do exactly that already,
>>> and the kernel code executed in such cases very much resembles that in
>>> metadata.c, and is certainly not cheaper. kdbus just makes such
>>> information more accessible when requested. Which information is
>>> collected is defined by bit-masks on both the sender and the receiver
>>> connection, and most applications will effectively only use a very
>>> limited set by default if they go through one of the more high-level
>>> libraries.
>>
>> I should rephrase a bit.  Kdbus doesn't require use of send-time
>> metadata.  It does, however, strongly encourage it, and it sounds like
>
> On the kernel level, kdbus just *offers* that, just like sockets offer
> SO_PASSCRED. On the userland level, kdbus helps applications get that
> information race-free, easier and faster than they would otherwise.
>
>> systemd and other major users will use send-time metadata.  Once that
>> happens, it's ABI (even if it's purely in userspace), and changing it
>> is asking for security holes to pop up.  So you'll be mostly stuck
>> with it.
>
> We know we can't break the ABI. At most, we could deprecate item types
> and introduce new ones, but we want to avoid that by all means of
> course. However, I fail to see how that is related to send time
> metadata, or even to kdbus in general, as all ABIs have to be kept stable.

I should have said it differently.  ABI is the wrong term -- it's more
of a protocol issue.

It looks like, with the current code, the kernel will provide
(optional) send-time metadata, and the sd-bus library will use it.
The result will be that the communication protocol between clients and
udev, systemd, systemd-logind, g-s-d, etc, will likely involve
send-time metadata.  This may end up being a bottleneck.

Once this happens, changing the protocol will be very hard without
introducing security bugs.  If people start switching to
connection-time metadata to gain performance, then they'll break both
the communication protocol and the expectations of client code.  (In
fact, it'll break twice, sort of, since I think that the current
protocols are connect-time.)

To me, this seems like a down-side of using send-time metadata, albeit
possibly not a huge downside at least in the near term.  I don't see a
corresponding benefit, though.

>
>> Do you have some simple benchmark code you can share?  I'd like to
>> play with it a bit.
>
> Sure, it's part of the self-test suite. Call it with "-t benchmark" to
> run the benchmark as isolated test with verbose output. The code for
> that lives in test-benchmark.c.
>

I'll try to play with this soon.  Thanks.

--Andy

>
> Thanks,
> Daniel
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH v2] selftests/exec: Check if the syscall exists and bail if not
From: Michael Ellerman @ 2015-02-04  2:46 UTC (permalink / raw)
  To: Shuah Khan
  Cc: David Drysdale,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Geert Uytterhoeven, davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL
In-Reply-To: <54D0EA1B.1000508-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On Tue, 2015-02-03 at 08:32 -0700, Shuah Khan wrote:
> On 02/03/2015 12:58 AM, David Drysdale wrote:
> > On Tue, Feb 3, 2015 at 3:53 AM, Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org> wrote:
> >> On systems which don't implement sys_execveat(), this test produces a
> >> lot of output.
> >>
> >> Add a check at the beginning to see if the syscall is present, and if
> >> not just note one error and return.
> >>
> >> When we run on a system that doesn't implement the syscall we will get
> >> ENOSYS back from the kernel, so change the logic that handles
> >> __NR_execveat not being defined to also use ENOSYS rather than -ENOSYS.
> >>
> >> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
> > 
> > Acked-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> Thanks Michael, and David. I will queue this for 3.20

Thanks.

cheers

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Eric W. Biederman @ 2015-02-04  2:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Mack, Arnd Bergmann, Ted Ts'o, Michael Kerrisk,
	Linux API, One Thousand Gnomes, Austin S Hemmelgarn,
	Tom Gundersen, Greg Kroah-Hartman, linux-kernel, David Herrmann,
	Djalal Harouni, Johannes Stezenbach, Christoph Hellwig
In-Reply-To: <CALCETrV_q9Y4OWC6fA78WsD+XFhsdGHrqH6OK-hc=Vvj2F5C5w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Tue, Feb 3, 2015 at 2:09 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>> Hi Andy,
>>
>> On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
>>> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>>
>>>> That's right, but again - if an application wants to gather this kind of
>>>> information about tasks it interacts with, it can do so today by looking
>>>> at /proc or similar sources. Desktop machines do exactly that already,
>>>> and the kernel code executed in such cases very much resembles that in
>>>> metadata.c, and is certainly not cheaper. kdbus just makes such
>>>> information more accessible when requested. Which information is
>>>> collected is defined by bit-masks on both the sender and the receiver
>>>> connection, and most applications will effectively only use a very
>>>> limited set by default if they go through one of the more high-level
>>>> libraries.
>>>
>>> I should rephrase a bit.  Kdbus doesn't require use of send-time
>>> metadata.  It does, however, strongly encourage it, and it sounds like
>>
>> On the kernel level, kdbus just *offers* that, just like sockets offer
>> SO_PASSCRED. On the userland level, kdbus helps applications get that
>> information race-free, easier and faster than they would otherwise.
>>
>>> systemd and other major users will use send-time metadata.  Once that
>>> happens, it's ABI (even if it's purely in userspace), and changing it
>>> is asking for security holes to pop up.  So you'll be mostly stuck
>>> with it.
>>
>> We know we can't break the ABI. At most, we could deprecate item types
>> and introduce new ones, but we want to avoid that by all means of
>> course. However, I fail to see how that is related to send time
>> metadata, or even to kdbus in general, as all ABIs have to be kept stable.
>
> I should have said it differently.  ABI is the wrong term -- it's more
> of a protocol issue.
>
> It looks like, with the current code, the kernel will provide
> (optional) send-time metadata, and the sd-bus library will use it.
> The result will be that the communication protocol between clients and
> udev, systemd, systemd-logind, g-s-d, etc, will likely involve
> send-time metadata.  This may end up being a bottleneck.

A quick note on a couple of things I have seen in this conversation.

- The reason for kdbus is performance.

- pipes rather than unix domain sockets are likely the standard to meet.
  If you can't equal unix domain sockets for simple things you are
  likely leaving a lot of stops in.  Last I looked pipes in general were
  notiably faster than unix domain sockets.

  The performance numbers I saw posted up-thread were horrible.  I have
  seen faster numbers across a network of machines.  If your ping-pong
  latency isn't measured in nano-seconds you are probably doing
  something wrong.

- syscalls remove overhead.  So since performance is kdbus's reason for existence
  let's remove some ridiculous stops, and get a fast path into the kernel.

- send-time metadata is a performance nightmare.  SO_PASSCRED is hard
  to implement in a fast performant way, especially when namespaces
  get involved.  Over the long term if you use send-time metadata
  you will grow the kind of compatibility hacks that the user
  namespace and the pid namespace have on SO_PASSCRED and things will
  slow down.

  A similar effect that is more performant in general is to enforce that
  the sender has the expected attributes.

> Once this happens, changing the protocol will be very hard without
> introducing security bugs.  If people start switching to
> connection-time metadata to gain performance, then they'll break both
> the communication protocol and the expectations of client code.  (In
> fact, it'll break twice, sort of, since I think that the current
> protocols are connect-time.)
>
> To me, this seems like a down-side of using send-time metadata, albeit
> possibly not a huge downside at least in the near term.  I don't see a
> corresponding benefit, though.

I think send-time metadata verification is less bad in this regard.

Eric

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Greg Kroah-Hartman @ 2015-02-04  3:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Daniel Mack, Arnd Bergmann, Ted Ts'o,
	Michael Kerrisk, Linux API, One Thousand Gnomes,
	Austin S Hemmelgarn, Tom Gundersen, linux-kernel, David Herrmann,
	Djalal Harouni, Johannes Stezenbach, Christoph Hellwig
In-Reply-To: <87siemgzoo.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Tue, Feb 03, 2015 at 08:47:51PM -0600, Eric W. Biederman wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
> 
> > On Tue, Feb 3, 2015 at 2:09 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> >> Hi Andy,
> >>
> >> On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
> >>> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> >>
> >>>> That's right, but again - if an application wants to gather this kind of
> >>>> information about tasks it interacts with, it can do so today by looking
> >>>> at /proc or similar sources. Desktop machines do exactly that already,
> >>>> and the kernel code executed in such cases very much resembles that in
> >>>> metadata.c, and is certainly not cheaper. kdbus just makes such
> >>>> information more accessible when requested. Which information is
> >>>> collected is defined by bit-masks on both the sender and the receiver
> >>>> connection, and most applications will effectively only use a very
> >>>> limited set by default if they go through one of the more high-level
> >>>> libraries.
> >>>
> >>> I should rephrase a bit.  Kdbus doesn't require use of send-time
> >>> metadata.  It does, however, strongly encourage it, and it sounds like
> >>
> >> On the kernel level, kdbus just *offers* that, just like sockets offer
> >> SO_PASSCRED. On the userland level, kdbus helps applications get that
> >> information race-free, easier and faster than they would otherwise.
> >>
> >>> systemd and other major users will use send-time metadata.  Once that
> >>> happens, it's ABI (even if it's purely in userspace), and changing it
> >>> is asking for security holes to pop up.  So you'll be mostly stuck
> >>> with it.
> >>
> >> We know we can't break the ABI. At most, we could deprecate item types
> >> and introduce new ones, but we want to avoid that by all means of
> >> course. However, I fail to see how that is related to send time
> >> metadata, or even to kdbus in general, as all ABIs have to be kept stable.
> >
> > I should have said it differently.  ABI is the wrong term -- it's more
> > of a protocol issue.
> >
> > It looks like, with the current code, the kernel will provide
> > (optional) send-time metadata, and the sd-bus library will use it.
> > The result will be that the communication protocol between clients and
> > udev, systemd, systemd-logind, g-s-d, etc, will likely involve
> > send-time metadata.  This may end up being a bottleneck.
> 
> A quick note on a couple of things I have seen in this conversation.
> 
> - The reason for kdbus is performance.

No, that's not the only reason for kdbus, don't focus only on this.  I
set out a long list of things for why we created kdbus, speed was only
one of the things.  Security is also one, and the ability to gather
these attributes in an atomic and secure way is very important as
userspace wants this.

> - pipes rather than unix domain sockets are likely the standard to meet.
>   If you can't equal unix domain sockets for simple things you are
>   likely leaving a lot of stops in.  Last I looked pipes in general were
>   notiably faster than unix domain sockets.
> 
>   The performance numbers I saw posted up-thread were horrible.  I have
>   seen faster numbers across a network of machines.  If your ping-pong
>   latency isn't measured in nano-seconds you are probably doing
>   something wrong.

It all depends on what you are passing on that "ping-pong", a real
D-Bus connection has real data and meta data that has to be sent.
Trying to make a fake benchmark number isn't going to show anything.

> - syscalls remove overhead.  So since performance is kdbus's reason for existence
>   let's remove some ridiculous stops, and get a fast path into the kernel.

Again, not the only reason, see my first post in this thread for
details.

thanks,

greg k-h

^ permalink raw reply

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
From: Calvin Owens @ 2015-02-04  3:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team-b10kYP2dOMg,
	Pavel Emelyanov, Linux API
In-Reply-To: <CALCETrUufe3USocUDpkBdwx6SyGKVgVUTh4rg2H9Xn91u+6iHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Monday 02/02 at 12:16 -0800, Andy Lutomirski wrote:
> On Fri, Jan 30, 2015 at 5:58 PM, Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org> wrote:
> > On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
> >> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org> wrote:
> >> > On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> >> >> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> >>
> >> >> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> >> >> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> >> >> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >> >> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> >> >> > > > is very useful for enumerating the files mapped into a process when
> >> >> > > > the more verbose information in /proc/<pid>/maps is not needed.
> >> >>
> >> >> This is the main (actually only) justification for the patch, and it it
> >> >> far too thin.  What does "not needed" mean.  Why can't people just use
> >> >> /proc/pid/maps?
> >> >
> >> > The biggest difference is that if you do something like this:
> >> >
> >> >         fd = open("/stuff", O_BLAH);
> >> >         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
> >> >         close(fd);
> >> >         unlink("/stuff");
> >> >
> >> > ...then map_files/ gives you a way to get a file descriptor for
> >> > "/stuff", which you couldn't do with /proc/pid/maps.
> >> >
> >> > It's also something of a win if you just want to see what is mapped at a
> >> > specific address, since you can just readlink() the symlink for the
> >> > address range you care about and it will go grab the appropriate VMA and
> >> > give you the answer. /proc/pid/maps requires walking the VMA tree, which
> >> > is quite expensive for processes with many thousands of threads, even
> >> > without the O(N^2) issue.
> >> >
> >> > (You have to know what address range you want though, since readdir() on
> >> > map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
> >> >
> >> >> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> >> >> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> >> >> > > > the ability to ptrace the process in question, so this doesn't allow
> >> >> > > > an attacker to do anything they couldn't already do before.
> >> >> > > >
> >> >> > > > Signed-off-by: Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org>
> >> >> > >
> >> >> > > Cc +linux-api@
> >> >> >
> >> >> > Looks good to me, thanks! Though I would really appreciate if someone
> >> >> > from security camp take a look as well.
> >> >>
> >> >> hm, who's that.  Kees comes to mind.
> >> >>
> >> >> And reviewers' task would be a heck of a lot easier if they knew what
> >> >> /proc/pid/map_files actually does.  This:
> >> >>
> >> >> akpm3:/usr/src/25> grep -r map_files Documentation
> >> >> akpm3:/usr/src/25>
> >> >>
> >> >> does not help.
> >> >>
> >> >> The 640708a2cff7f81 changelog says:
> >> >>
> >> >> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> >> >> :     symlinks one for each mapping with file, the name of a symlink is
> >> >> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> >> >> :     results in a file that point exactly to the same inode as them vma's one.
> >> >> :
> >> >> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> >> >> :
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> >> >>
> >> >> afacit this info is also available in /proc/pid/maps, so things
> >> >> shouldn't get worse if the /proc/pid/map_files permissions are at least
> >> >> as restrictive as the /proc/pid/maps permissions.  Is that the case?
> >> >> (Please add to changelog).
> >> >
> >> > Yes, the only difference is that you can follow the link as per above.
> >> > I'll resend with a new message explaining that and the deletion thing.
> >> >
> >> >> There's one other problem here: we're assuming that the map_files
> >> >> implementation doesn't have bugs.  If it does have bugs then relaxing
> >> >> permissions like this will create new vulnerabilities.  And the
> >> >> map_files implementation is surprisingly complex.  Is it bug-free?
> >> >
> >> > While I was messing with it I used it a good bit and didn't see any
> >> > issues, although I didn't actively try to fuzz it or anything. I'd be
> >> > happy to write something to test hammering it in weird ways if you like.
> >> > I'm also happy to write testcases for namespaces.
> >> >
> >> > So far as security issues, as others have pointed out you can't follow
> >> > the links unless you can ptrace the process in question, which seems
> >> > like a pretty solid guarantee. As Cyrill pointed out in the discussion
> >> > about the documentation, that's the same protection as /proc/N/fd/*, and
> >> > those links function in the same way.
> >>
> >> My concern here is that fd/* are connected as streams, and while that
> >> has a certain level of badness as an external-to-the-process attacker,
> >> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
> >> required for access to /proc/N/mem). Since these fds are the things
> >> mapped into memory on a process, writing to them is a subset of access
> >> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
> >
> > If you haven't done close() on a mmapped file, doesn't fd/* allow the
> > same access to the corresponding regions of memory? Or am I missing
> > something?
> >
> 
> But if you have called close(), then you can't currently do things
> like ftruncate or ioctl on the mapped file.  These things don't
> persist across execve(), but the do persist across calls to setresuid,
> etc that drop privileges.  The latter part makes me a tiny bit
> nervous.

Hmm, in that scenario you would have to open() the map_files symlink,
and since you've dropped privileges that would only succeed if the user
you dropped to has permission to access that file anyway, right? 

In the deleted file case it does actually allow something that used to
be impossible, but relying on open/map/close/unlink to prevent a user
from opening a file they have permission to open is just buggy in
general.

But, O_TMPFILE lets you end up in that position without the race. The
manpage says that O_TMPFILE files "can never be reached via any
pathname", which isn't strictly true since you can get them from fd/* in
proc. But if you close() after mapping it they are currently truly
inaccessible via any path, and given the language in the manpage it
seems reasonable that somebody might rely on that and be lazy with the
permissions.

I hadn't thought about O_TMPFILE thing: I'm definitely convinced now
that PTRACE_MODE_ATTACH is the right thing here. But I think having to
reopen the file saves you even if you "leak" maps of files across a call
to setresuid/etc. 

> It also might be worth checking for drivers or arch code that creates
> vmas that are backed by a different struct file than the struct file
> that was mmapped in the first place.

Interesting, I'll look into this before I resend.

Thanks,
Calvin
 
> --Andy

^ permalink raw reply

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
From: Calvin Owens @ 2015-02-04  3:53 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team-b10kYP2dOMg,
	Pavel Emelyanov, Linux API
In-Reply-To: <54CF832A.7010707-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Monday 02/02 at 09:01 -0500, Austin S Hemmelgarn wrote:
> On 2015-01-30 20:58, Calvin Owens wrote:
> >On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
> >>On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org> wrote:
> >>>On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> >>>>On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>
> >>>>>On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> >>>>>>On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> >>>>>>>Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >>>>>>>is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> >>>>>>>is very useful for enumerating the files mapped into a process when
> >>>>>>>the more verbose information in /proc/<pid>/maps is not needed.
> >>>>
> >>>>This is the main (actually only) justification for the patch, and it it
> >>>>far too thin.  What does "not needed" mean.  Why can't people just use
> >>>>/proc/pid/maps?
> >>>
> >>>The biggest difference is that if you do something like this:
> >>>
> >>>         fd = open("/stuff", O_BLAH);
> >>>         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
> >>>         close(fd);
> >>>         unlink("/stuff");
> >>>
> >>>...then map_files/ gives you a way to get a file descriptor for
> >>>"/stuff", which you couldn't do with /proc/pid/maps.
> >>>
> >>>It's also something of a win if you just want to see what is mapped at a
> >>>specific address, since you can just readlink() the symlink for the
> >>>address range you care about and it will go grab the appropriate VMA and
> >>>give you the answer. /proc/pid/maps requires walking the VMA tree, which
> >>>is quite expensive for processes with many thousands of threads, even
> >>>without the O(N^2) issue.
> >>>
> >>>(You have to know what address range you want though, since readdir() on
> >>>map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
> >>>
> >>>>>>>This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> >>>>>>>removes the CAP_SYS_ADMIN restrictions. Following the links requires
> >>>>>>>the ability to ptrace the process in question, so this doesn't allow
> >>>>>>>an attacker to do anything they couldn't already do before.
> >>>>>>>
> >>>>>>>Signed-off-by: Calvin Owens <calvinowens-b10kYP2dOMg@public.gmane.org>
> >>>>>>
> >>>>>>Cc +linux-api@
> >>>>>
> >>>>>Looks good to me, thanks! Though I would really appreciate if someone
> >>>>>from security camp take a look as well.
> >>>>
> >>>>hm, who's that.  Kees comes to mind.
> >>>>
> >>>>And reviewers' task would be a heck of a lot easier if they knew what
> >>>>/proc/pid/map_files actually does.  This:
> >>>>
> >>>>akpm3:/usr/src/25> grep -r map_files Documentation
> >>>>akpm3:/usr/src/25>
> >>>>
> >>>>does not help.
> >>>>
> >>>>The 640708a2cff7f81 changelog says:
> >>>>
> >>>>:     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> >>>>:     symlinks one for each mapping with file, the name of a symlink is
> >>>>:     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> >>>>:     results in a file that point exactly to the same inode as them vma's one.
> >>>>:
> >>>>:     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> >>>>:
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> >>>>
> >>>>afacit this info is also available in /proc/pid/maps, so things
> >>>>shouldn't get worse if the /proc/pid/map_files permissions are at least
> >>>>as restrictive as the /proc/pid/maps permissions.  Is that the case?
> >>>>(Please add to changelog).
> >>>
> >>>Yes, the only difference is that you can follow the link as per above.
> >>>I'll resend with a new message explaining that and the deletion thing.
> >>>
> >>>>There's one other problem here: we're assuming that the map_files
> >>>>implementation doesn't have bugs.  If it does have bugs then relaxing
> >>>>permissions like this will create new vulnerabilities.  And the
> >>>>map_files implementation is surprisingly complex.  Is it bug-free?
> >>>
> >>>While I was messing with it I used it a good bit and didn't see any
> >>>issues, although I didn't actively try to fuzz it or anything. I'd be
> >>>happy to write something to test hammering it in weird ways if you like.
> >>>I'm also happy to write testcases for namespaces.
> >>>
> >>>So far as security issues, as others have pointed out you can't follow
> >>>the links unless you can ptrace the process in question, which seems
> >>>like a pretty solid guarantee. As Cyrill pointed out in the discussion
> >>>about the documentation, that's the same protection as /proc/N/fd/*, and
> >>>those links function in the same way.
> >>
> >>My concern here is that fd/* are connected as streams, and while that
> >>has a certain level of badness as an external-to-the-process attacker,
> >>PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
> >>required for access to /proc/N/mem). Since these fds are the things
> >>mapped into memory on a process, writing to them is a subset of access
> >>to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
> >
> >If you haven't done close() on a mmapped file, doesn't fd/* allow the
> >same access to the corresponding regions of memory? Or am I missing
> >something?
> >
> >But that said, I can't think of any reason making it MODE_ATTACH would
> >be a problem. Would you rather that be enforced on follow_link() like
> >the original patch did, or enforce it for the whole directory?
> >
> >
> Whole directory would probably be better, as even just the mapped
> ranges could be considered sensitive information. 

You can already get the ranges that are mapped from /proc/N/maps with
PTRACE_MODE_READ, so that part isn't new information.

> Ideally, the check should be done on both follow_link(), and the
> directory itself.

Oh, I didn't mean restricting readdir(), I meant restricting any access
through the directory similar to how the original CAP_SYS_ADMIN check
was done.
 
Thanks,
Calvin

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Eric W. Biederman @ 2015-02-04  6:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Andy Lutomirski, Daniel Mack, Arnd Bergmann, Ted Ts'o,
	Michael Kerrisk, Linux API, One Thousand Gnomes,
	Austin S Hemmelgarn, Tom Gundersen, linux-kernel, David Herrmann,
	Djalal Harouni, Johannes Stezenbach, Christoph Hellwig
In-Reply-To: <20150204031437.GB32207-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:

> On Tue, Feb 03, 2015 at 08:47:51PM -0600, Eric W. Biederman wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>> 
>> > On Tue, Feb 3, 2015 at 2:09 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>> >> Hi Andy,
>> >>
>> >> On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
>> >>> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>> >>
>> >>>> That's right, but again - if an application wants to gather this kind of
>> >>>> information about tasks it interacts with, it can do so today by looking
>> >>>> at /proc or similar sources. Desktop machines do exactly that already,
>> >>>> and the kernel code executed in such cases very much resembles that in
>> >>>> metadata.c, and is certainly not cheaper. kdbus just makes such
>> >>>> information more accessible when requested. Which information is
>> >>>> collected is defined by bit-masks on both the sender and the receiver
>> >>>> connection, and most applications will effectively only use a very
>> >>>> limited set by default if they go through one of the more high-level
>> >>>> libraries.
>> >>>
>> >>> I should rephrase a bit.  Kdbus doesn't require use of send-time
>> >>> metadata.  It does, however, strongly encourage it, and it sounds like
>> >>
>> >> On the kernel level, kdbus just *offers* that, just like sockets offer
>> >> SO_PASSCRED. On the userland level, kdbus helps applications get that
>> >> information race-free, easier and faster than they would otherwise.
>> >>
>> >>> systemd and other major users will use send-time metadata.  Once that
>> >>> happens, it's ABI (even if it's purely in userspace), and changing it
>> >>> is asking for security holes to pop up.  So you'll be mostly stuck
>> >>> with it.
>> >>
>> >> We know we can't break the ABI. At most, we could deprecate item types
>> >> and introduce new ones, but we want to avoid that by all means of
>> >> course. However, I fail to see how that is related to send time
>> >> metadata, or even to kdbus in general, as all ABIs have to be kept stable.
>> >
>> > I should have said it differently.  ABI is the wrong term -- it's more
>> > of a protocol issue.
>> >
>> > It looks like, with the current code, the kernel will provide
>> > (optional) send-time metadata, and the sd-bus library will use it.
>> > The result will be that the communication protocol between clients and
>> > udev, systemd, systemd-logind, g-s-d, etc, will likely involve
>> > send-time metadata.  This may end up being a bottleneck.
>> 
>> A quick note on a couple of things I have seen in this conversation.
>> 
>> - The reason for kdbus is performance.
>
> No, that's not the only reason for kdbus, don't focus only on this.  I
> set out a long list of things for why we created kdbus, speed was only
> one of the things.  Security is also one, and the ability to gather
> these attributes in an atomic and secure way is very important as
> userspace wants this.

Perhaps I should have said the predominant reason.  Certainly that seems
to be most of what I have seen talked about.

Regardless looking at the performance in the design and removing any
substantial obstacle to making things go fast.

Further.  I had this conversation earlier in an earlier round of the
review and I was told that in fact existing dbus applications do not
want or need these attributes.   I think I heard journald wants them for
pretty printing things.

If security is your concern I really think per message attributes
collected and sent when a message is sent is a bad idea.  It has been a
nasty anti-pattern in the kernel code.  Lots and lots of meta-data
copyed from a task and sent to someone else has significant performance,
maintenance, and security impacts.

Code written in that pattern is complex and hard to analyze, and hard to
think about.  Consider debugging why a message does not get the expected
treatment from your suid application because someone changed the euid
over that particular call and had not thought about it's consequences.
Frankly I have been there and done that and it is a mess. 

So no I do not think breaking encapsulation and having weird side
effects affecting your new primitive will have any security benefits
whatsover.   It will just result in brittle complex code.

If you want to avoid the races causing sends through a file descriptor
to fail that don't have the expected attributes (my constructive
suggestion earlier) is a very different thing from a performance and
mainteance standpoint.  That does not increase the code complexity
nearly as much in the implementation or in use, and unexpected failures
happen right away.

>> - pipes rather than unix domain sockets are likely the standard to meet.
>>   If you can't equal unix domain sockets for simple things you are
>>   likely leaving a lot of stops in.  Last I looked pipes in general were
>>   notiably faster than unix domain sockets.
>> 
>>   The performance numbers I saw posted up-thread were horrible.  I have
>>   seen faster numbers across a network of machines.  If your ping-pong
>>   latency isn't measured in nano-seconds you are probably doing
>>   something wrong.
>
> It all depends on what you are passing on that "ping-pong", a real
> D-Bus connection has real data and meta data that has to be sent.
> Trying to make a fake benchmark number isn't going to show anything.

All that I was intending to convey is that the numbers I have seen have
been orders of magnitude slower than I would expect.  And 10x to 100x
slower than the code should be is a reason to ask why.

In my experience being efficient with small messages are important
because (a) they are the hardest to make go fast (b) they are surprising
common.  Remote X application start-up times are very slow because of
these.

People have a distressing habit of writing applications that
send a small message and synchronously waits for it.  Over time these
small ipc calls build up and you are limited by  how fast they will go.

>> - syscalls remove overhead.  So since performance is kdbus's reason for existence
>>   let's remove some ridiculous stops, and get a fast path into the kernel.
>
> Again, not the only reason, see my first post in this thread for
> details.

But performance is important, and performance is a good reason to use
system calls.

Security is another reason to have real system calls, as there is less
going on (compared to an ioctl multiplexer) so the code is easier to
audit.

Eric

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox