Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH] coresight-stm: adding driver for CoreSight STM component
From: Paul Bolle @ 2015-02-05  9:26 UTC (permalink / raw)
  To: mathieu.poirier-QSEj5FYQhm4dnm+yROfE0A
  Cc: corbet-T1hC0tSOHrs, pratikp-sgV2jX0FEOL9JmXXK+q4OQ,
	al.grant-5wv7dgnIgG8, liviu.dudau-5wv7dgnIgG8,
	kaixu.xia-QSEj5FYQhm4dnm+yROfE0A, jeenu.viswambharan-5wv7dgnIgG8,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1423088550-15780-1-git-send-email-mathieu.poirier-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

On Wed, 2015-02-04 at 15:22 -0700, mathieu.poirier-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org wrote:
> From: Pratik Patel <pratikp-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
> 
> This driver adds support for the STM CoreSight IP block,
> allowing any system compoment (HW or SW) to log and
> aggregate messages via a single entity.
> 
> The STM exposes an application defined number of channels
> called stimulus port.  Configuration is done using entries
> in sysfs and channels made available to userspace via devfs.
> 
> Signed-off-by: Pratik Patel <pratikp-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
> Signed-off-by: Mathieu Poirier <mathieu.poirier-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
>[...]
> +/**
> + * struct stm_node - aggregation of channel information for userspace access
> + * @channel_id:		the channel number associated to this file descriptor.
> + * @options:		options for this channel - none, timestamped,
> +i			guaranteed.

Did you perhaps use vim?

> + * @drvdata:		STM driver specifics.
> + */
> +struct stm_node {
> +	int			channel_id;
> +	u32			options;
> +	struct stm_drvdata	*drvdata;
> +};


Paul Bolle

^ permalink raw reply

* Re: [v8 4/5] ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
From: Konstantin Khlebnikov @ 2015-02-05  9:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andy Lutomirski, Li Xi, Linux FS Devel,
	linux-ext4@vger.kernel.org, Linux API, Theodore Ts'o,
	Andreas Dilger, Jan Kara, Al Viro, Christoph Hellwig, dmonakhov,
	Eric W. Biederman
In-Reply-To: <20150204225844.GA12722@dastard>

On 05.02.2015 01:58, Dave Chinner wrote:
> On Wed, Feb 04, 2015 at 06:22:01PM +0300, Konstantin Khlebnikov wrote:
>> On 28.01.2015 03:37, Dave Chinner wrote:
>>> On Tue, Jan 27, 2015 at 01:45:17PM +0300, Konstantin Khlebnikov wrote:
>>>> On 27.01.2015 11:02, Dave Chinner wrote:
>>>>> On Fri, Jan 23, 2015 at 03:59:04PM -0800, Andy Lutomirski wrote:
>>>>>> On Fri, Jan 23, 2015 at 3:30 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>>> On Fri, Jan 23, 2015 at 02:58:09PM +0300, Konstantin Khlebnikov wrote:
>>>>>>
>>>>>> I think I must be missing something simple here.  In a hypothetical
>>>>>> world where the code used nsown_capable, if an admin wants to stick a
>>>>>> container in /mnt/container1 with associated prid 1 and a userns,
>>>>>> shouldn't it just map only prid 1 into the user ns?  Then a user in
>>>>>> that userns can't try to change the prid of a file to 2 because the
>>>>>> number "2" is unmapped for that user and translation will fail.
>>>>>
>>>>> You've effectively said "yes, project quotas are enabled, but you
>>>>> only have a single ID, it's always turned on and you can't change it
>>>>> to anything else.
>>>>>
>>>>> So, why do they need to be mapped via user namespaces to enable
>>>>> this? Think about it a little harder:
>>>>>
>>>>> 	- Project IDs are not user IDs.
>>>>> 	- Project IDs are not a security/permission mechanism.
>
> First, I'll just point this out again...

Ok, I get it.

>>>> This might be useful even without containers : normal user quota has
>>>> two levels and admins might classify users into groups and set group
>>>> quota for them. Project quota is flat and cannot provide any control
>>>> if we want classify projects.
>>>
>>> I don't follow. project ID is exactly what allows you to control
>>> project classification.
>>
>> I mean hierarchy allows to group several projects into one super-project
>> which sums all disk usage and could have its own limit too.
>
> Yes, I know, but you can also do this resource management from
> userspace with the existing project quota tools. It's just a matter
> of layering heirarchical limit management on top of the existing
> infrastructure.

Yes but not in all cases: it's impossible to overcommit disk limits on 
project level without overcommiting on super-project level.
Hierarchical quotas can handle this [ hypothetically useful ] use case.

>>
>> For now I'm more interested in participation disk space among services
>> in one system. As I see security model of project quota in XFS almost
>> non-existent for this case: it forbids linking/renaming files between
>> different projects but any unprivileged user might change project id
>> for its own files. That's strange, this operation should be privileged.
>
> <sigh>
>
> It's clear you don't understand the design/architecture of project
> quotas. You've clearly read the code, but you haven't understood
> the design that lead to the specific implementation in XFS.
>
> Users have *always* been allowed to set the project ID of
> their own files. How else are they going to set the project ID on
> files they create in random directories so to account them to the
> correct project they are working on?

In this case project disk limits are almost useless and even dangerous 
because any unprivileged user could add files into limited project
witch belongs to other user.

>
> However, you keep making the assumption that project quotas ==
> directory subtree quotas.  Project quotas are *not limited* to
> directory subtrees - the subtree quota implementation is just an
> implementation that *sets the default project ID* on files as they
> are created.
>
> e.g. there are production systems out there where project quotas are
> used to track home directory space usage rather than user quotas.
> This means users can take actions like "this file actually belongs
> to project X and it shouldn't be accounted against my home
> directory". Users can create their own sub directories that account
> everything by default to project X rather than their own home
> directory.
>
> Again: project quotas are an *accounting* mechanism, not a security
> mechanism.
>
> Containers are *security mechanism* and hence we need a security
> model for container resource controller mechanisms. Project quotas
> do not provide a directory heirarchy access security model - that's
> what we use mount namespaces for. The resource controller security
> model only has to prevent users inside the container from subverting
> the resource controller mechanism, not anything else.
>
> Not surprisingly, we've implemented *exactly* the model you are
> suggesting: that modification of the resource accounting mechanism
> is a privileged operation that cannot be accessed from within the
> container. i.e. inside a userns container you can't change the
> project ID on a file, not even as root.
>
>> Also if user have permission for changing project id he could be
>> permitted to link and rename file into directory with any project
>> id, because he anyway could change project, move, and revert it
>> back.
>
> You don't appear to understand why XFS forbids linking/renaming
> across directories different project IDs. Hint: it's resource
> accounting simplification, *not a security mechanism*.
>
> Linking is obvious: you can't have the same inode accounted to
> multiple projects - it belongs to a single project and so can't be
> accounted to multiple projects. Hence if you want to link across
> different directory-based project quotas, you have to use symlinks.
>
> That's much simpler than having to decide what project the inode is
> accounted to, especially when removing links and link that owns the
> project ID is removed. How do you even know the link you are
> removing is the last link in the current project? IOWs, you have to
> search for the other owners of the inode to determine who the
> project quota is now accounted to...

But you have to search hardlinks everywhere (inode owner can hardlink it 
into any directory where he has write access because project can be 
changed temporary). And after that you have to search broken symlinks.
Also symlinks cannot share file between isolated containers which run in 
chroot while creating hardlinks is still possible but requires some
extra steps like changing project id or creating temporary directories
even if you're root.

Not so useful too. Probably that's the reason why this feature seems
never been implemented anywhere except xfs.

Could we change that? For example by adding flag into quota-info block
which makes project id more restrictive and useful?

>
> Same for rename: there are a multitude of nasty corner cases when it
> comes to accounting the quotas correctly. So, either we try to do
> something complex and likely expensive and buggy, or we can return
> EXDEV. EXDEV was very carefully chosen here, and it's not for
> security reasons. It was chosen because applications know that if a
> rename returns EXDEV, they've got to *copy* the file instead. And,
> well, that create/write/unlink process results in correct project
> quota accounting at both the source and destination.
>
> IOWs: EXDEV not a security mechanism, it's an accounting mechanism.
>
> If you can implement project quota rename accounting and handle the
> multiple handlinks problem efficiently, then you can allow those
> things to be done directly in the filesystem rather than returning
> EXDEV.
>
>> For me perfect interface looks like couple fcntls for
>> getting/changing project id:
>>
>> int fcntl(fd, F_GET_PROJECT, projid_t *);
>> int fcntl(fd, F_SET_PROJECT, projid_t);
>>
>> F_GET_PROJECT is allowed for everybody
>> F_SET_PROJECT requires CAP_SYS_ADMIN (or maybe CAP_FOWNER?)
>
> Sure, it's nice, but you're ignoring the entire the point of making
> FS_IOC_SETXATTR generic: so that the *existing tools* that manage
> project quotas work on all project quota enabled filesystems.
> i.e. so that all filesystems *behave the same* and can *run
> identical regression tests*.

As i see quota tools in xfsprogs checks file-system name and doesn't
work for anything except "xfs", so we have to patch it anywas.
xfstests are cool but I think fixing one ioctl isn't a problem.
Something else?

>
> We do not want different project quota implementations on different
> filesystems. Like user and group quotas, they need to be
> consistently implemented across all filesystems. If you want
> something new, different and incompatible with existing
> infrastructure, then that's a separate line of development and
> discussion....
>
> Cheers,
>
> Dave.
>


-- 
Konstantin

^ permalink raw reply

* Re: [PATCH v4] gpio: lib-sysfs: Add 'wakeup' attribute
From: Johan Hovold @ 2015-02-05 10:33 UTC (permalink / raw)
  To: Sören Brinkmann
  Cc: Linus Walleij, Alexandre Courbot, Johan Hovold,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List,
	linux-gpio-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <ce91ced9e1f0481f8af03a168eda48b9-reflc3kr++M/rzWiRNbYG+hlVc3/7hDbVaz/vdPVXQ4@public.gmane.org>

On Wed, Feb 04, 2015 at 10:27:15AM -0800, Sören Brinkmann wrote:
> On Wed, 2015-02-04 at 10:19AM +0100, Linus Walleij wrote:

> > What needs to happen (IMHO) is to make gpio_chips properly obeying
> > the device model, and then add the attributes for fiddling around with
> > GPIOs to either the *real* device or create a new char device interface.
> > Whatever works best. These mock devices are fragile and never
> > worked properly especially in the removal path as Johans recent
> > fixes has shown.
> 
> Sure, that would be a nice long-term solution. But until then this patch
> would probably be welcomed by some people, without making the brokenness
> of this interface much worse.

We don't knowingly add broken functionality, and especially not when it
will become ABI that needs to be maintained forever.

And "this might be useful for someone at some point" is certainly not a
reason to break that rule.

Johan

^ permalink raw reply

* Re: [PATCH] coresight-stm: adding driver for CoreSight STM component
From: Paul Bolle @ 2015-02-05 11:27 UTC (permalink / raw)
  To: mathieu.poirier
  Cc: corbet, pratikp, al.grant, liviu.dudau, kaixu.xia,
	jeenu.viswambharan, gregkh, linux-arm-kernel, linux-api,
	linux-kernel, linux-doc
In-Reply-To: <1423088550-15780-1-git-send-email-mathieu.poirier@linaro.org>

On Wed, 2015-02-04 at 15:22 -0700, mathieu.poirier@linaro.org wrote:
> From: Pratik Patel <pratikp@codeaurora.org>
> 
> This driver adds support for the STM CoreSight IP block,
> allowing any system compoment (HW or SW) to log and
> aggregate messages via a single entity.
> 
> The STM exposes an application defined number of channels
> called stimulus port.  Configuration is done using entries
> in sysfs and channels made available to userspace via devfs.
> 
> Signed-off-by: Pratik Patel <pratikp@codeaurora.org>
> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>

This needs "coresight: Adding coresight support for arm64
architecture" (https://lkml.org/lkml/2015/2/3/677 ) in order to get
applied. Perhaps that's obvious to the people working on this.

A few comments follow.

> ---
>  .../ABI/testing/sysfs-bus-coresight-devices-stm    |   62 ++
>  Documentation/trace/coresight.txt                  |   88 +-
>  drivers/coresight/Kconfig                          |   10 +
>  drivers/coresight/Makefile                         |    1 +
>  drivers/coresight/coresight-stm.c                  | 1090 ++++++++++++++++++++
>  include/linux/coresight-stm.h                      |   35 +
>  include/uapi/linux/coresight-stm.h                 |   23 +
>  7 files changed, 1307 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
>  create mode 100644 drivers/coresight/coresight-stm.c
>  create mode 100644 include/linux/coresight-stm.h
>  create mode 100644 include/uapi/linux/coresight-stm.h
> 
>[...]
> diff --git a/drivers/coresight/Kconfig b/drivers/coresight/Kconfig
> index fc1f1ae7a49d..08806cc7d737 100644
> --- a/drivers/coresight/Kconfig
> +++ b/drivers/coresight/Kconfig
> @@ -58,4 +58,14 @@ config CORESIGHT_SOURCE_ETM3X
>  	  which allows tracing the instructions that a processor is executing
>  	  This is primarily useful for instruction level tracing.  Depending
>  	  the ETM version data tracing may also be available.
> +
> +config CORESIGHT_STM
> +	bool "CoreSight System Trace Macrocell driver"
> +	depends on (ARM && !(CPU_32v4 || CPU_32v4T)) || ARM64 || (64BIT && COMPILE_TEST)

I'm _guessing_ that CPU_32v4 and CPU_32v4T are needed for the ldrd and
strd assembler instructions. If that's right a next _guess_ would be
that you also need to mention CPU_32v3 here.

Furthermore, this file is only sourced by arch/arm/Kconfig.debug and
arch/arm64/Kconfig.debug. So 64BIT should always be equal to ARM64 and
the 
     || (64BIT && COMPILE_TEST)

part shouldn't be needed, isn't it?

> +	select CORESIGHT_LINKS_AND_SINKS
> +	help
> +	  This driver provides support for hardware assisted software
> +	  instrumentation based tracing. This is primarily used for
> +	  logging useful software events or data coming from various entities
> +	  in the system, possibly running different OSs
>  endif
>[...]
> diff --git a/drivers/coresight/coresight-stm.c b/drivers/coresight/coresight-stm.c
> new file mode 100644
> index 000000000000..e59b0fe01d87
> --- /dev/null
> +++ b/drivers/coresight/coresight-stm.c
> @@ -0,0 +1,1090 @@
>[...]
> +#ifndef CONFIG_64BIT
> +static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
> +{
> +	asm volatile("strd %1, %0"
> +		     : "+Qo" (*(volatile u64 __force *)addr)
> +		     : "r" (val));
> +}
> +
> +static inline u64 __raw_readq(const volatile void __iomem *addr)
> +{
> +	u64 val;
> +
> +	asm volatile("ldrd %1, %0"
> +		     : "+Qo" (*(volatile u64 __force *)addr),
> +		       "=r" (val));
> +	return val;
> +}
> +
> +#undef readq_relaxed
> +#define readq_relaxed(c) ({ u64 __r = le64_to_cpu((__force __le64) \
> +					__raw_readq(c)); __r; })

I spotted no users of readq_relaxed. Is it needed?

> +#undef writeq_relaxed
> +#define writeq_relaxed(v, c)	__raw_writeq((__force u64) cpu_to_le64(v), c)
> +#endif
> +
> [...]

> +static ssize_t entities_show(struct device *dev,
> +			     struct device_attribute *attr, char *buf)
> +{
> +	struct stm_drvdata *drvdata = dev_get_drvdata(dev->parent);
> +	ssize_t len;
> +
> +	len = bitmap_scnprintf(buf, PAGE_SIZE, drvdata->entities,
> +			       STM_ENTITY_MAX);
> +

bitmap_scnprintf is gone in current linux-next. I changed it to
	len = scnprintf(buf, PAGE_SIZE, "%*pb", STM_ENTITY_MAX,
			drvdata->entities);

to get this file to compile. (On x86_64, that is, but please don't tell
anybody!)


Paul Bolle

^ permalink raw reply

* Re: [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-02-05 15:10 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, 'Linux API'
In-Reply-To: <2035690.30evPElJOm-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

Am Donnerstag, 29. Januar 2015, 21:24:45 schrieb Stephan Mueller:

Hi Herbert,

> This patch adds the AEAD support for AF_ALG.
> 
> The implementation is based on algif_skcipher, but contains heavy
> modifications to streamline the interface for AEAD uses.
> 
> To use AEAD, the user space consumer has to use the salg_type named
> "aead".

I just saw Al Viro's patch to use the iov_iter API in algif_skcipher.c. I 
looked at it but lacked documentation for using it properly. Now I have a 
template that I will incorporate into algif_aead.c

-- 
Ciao
Stephan

^ permalink raw reply

* Re: [capabilities] Allow normal inheritance for a configurable set of capabilities
From: Serge E. Hallyn @ 2015-02-05 15:23 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andrew G. Morgan, Christoph Lameter, casey, Andy Lutomirski,
	Jonathan Corbet, Aaron Jones, Ted Ts'o, linux-security-module,
	lkml, akpm, linux-api
In-Reply-To: <20150205003434.GC23013@mail.hallyn.com>

Quoting Serge E. Hallyn (serge@hallyn.com):
> Quoting Andrew G. Morgan (morgan@kernel.org):
> > I'm not generally in favor of this. Mostly because this seems to be a
> > mini-root kind of inheritance that propagates privilege to binaries
> > that aren't prepared for privilege.
> 
> Earlier in this thread, Casey said:
> 
> | One of the holes in the 1003.1e spec is what to do with a program file
> | that does not have a capability set attached to it. The two options are
> | drop all capabilities and leave the capabilities alone. The latter gives
> | you what you're asking for. The former is arguably safer.
> 
> and
> 
> | It's what we did in Trusted Irix. It made life much easier.
> 
> I'm going to need to clear my head a bit before I try to compare that to
> the root cause of the sendmail capabilities bug.
> 
> >  I don't really buy the mmap code
> > concern because the model as it stands says that you trust the binary
> > (and all of the various ways it was programmed to execute code) with
> > specific privileges. If the binary mmap's some code (PAM modules come
> > to mind) then that is part of what it was programmed to/allowed to do.
> 
> That's not really the point...  The point is that yes, a mini-root is
> exactly what is being asked for :)  I'm not saying I expect an adversary
> to do the mmap+jump, but that currently it is a, and the only, way to
> do unprivileged userid with retaining some privileges to run legacy
> programs.
> 
> > That being said, if you really really want this kind of thing, then
> > make it a single secure bit (with another lock on/off bit) which, when
> > set, makes: fI default to X.
> > 
> >    pP' = (X & fP) | (pI & fI)
> > 
> > That way the per-process bounding set still works as advertised and
> > you don't need to worry about the existing semantics being violated.
> 
> Maybe that is the way to go...

We could require nnp to set the new securebit, and add a
CONFIG_SECURITY_LULZ_I_DONT_CARE to skip that requirement.
(Or maybe that just makes things worse by having more
different sets of rules...)

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michal Hocko @ 2015-02-05 15:41 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Vlastimil Babka, Kirill A. Shutemov, Dave Hansen, Mel Gorman,
	linux-mm@kvack.org, Minchan Kim, Andrew Morton, lkml, Linux API,
	linux-man, Hugh Dickins
In-Reply-To: <CAKgNAkhNbHQX7RukSsSe3bMqY11f493rYbDpTOA2jH7vsziNww@mail.gmail.com>

On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
[...]
> So, how about this text:
> 
>               After a successful MADV_DONTNEED operation, the seman‐
>               tics  of  memory  access  in  the specified region are
>               changed: subsequent accesses of  pages  in  the  range
>               will  succeed,  but will result in either reloading of
>               the memory contents from the  underlying  mapped  file

"
result in either providing the up-to-date contents of the underlying
mapped file
"

Would be more precise IMO because reload might be interpreted as a major
fault which is not necessarily the case (see below).

>               (for  shared file mappings, shared anonymous mappings,
>               and shmem-based techniques such  as  System  V  shared
>               memory  segments)  or  zero-fill-on-demand  pages  for
>               anonymous private mappings.

Yes, this wording is better because many users are not aware of
MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
mention that.

I am just wondering whether it makes sense to mention that MADV_DONTNEED
for shared mappings might be surprising and not freeing the backing
pages thus not really freeing memory until there is a memory
pressure. But maybe this is too implementation specific for a man
page. What about the following wording on top of yours?
"
Please note that the MADV_DONTNEED hint on shared mappings might not
lead to immediate freeing of pages in the range. The kernel is free to
delay this until an appropriate moment. RSS of the calling process will
be reduced however.
"
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] coresight-stm: adding driver for CoreSight STM component
From: Mathieu Poirier @ 2015-02-05 15:51 UTC (permalink / raw)
  To: Paul Bolle
  Cc: corbet, Pratik Patel, Al Grant, Liviu Dudau, Kaixu Xia,
	Jeenu Viswambharan, Greg KH, linux-arm-kernel@lists.infradead.org,
	linux-api, linux-kernel@vger.kernel.org, linux-doc
In-Reply-To: <1423135651.27378.32.camel@x220>

On 5 February 2015 at 04:27, Paul Bolle <pebolle@tiscali.nl> wrote:
> On Wed, 2015-02-04 at 15:22 -0700, mathieu.poirier@linaro.org wrote:
>> From: Pratik Patel <pratikp@codeaurora.org>
>>
>> This driver adds support for the STM CoreSight IP block,
>> allowing any system compoment (HW or SW) to log and
>> aggregate messages via a single entity.
>>
>> The STM exposes an application defined number of channels
>> called stimulus port.  Configuration is done using entries
>> in sysfs and channels made available to userspace via devfs.
>>
>> Signed-off-by: Pratik Patel <pratikp@codeaurora.org>
>> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
>
> This needs "coresight: Adding coresight support for arm64
> architecture" (https://lkml.org/lkml/2015/2/3/677 ) in order to get
> applied. Perhaps that's obvious to the people working on this.

Right, in my tree the code is tailored to use the other patch you're
referencing but having received no 'ack' for it yet I made this set
for 32 bit only (aside from the 64bit flags in Kconfig).  As such you
don't need https://lkml.org/lkml/2015/2/3/677 for this to compile and
run.

>
> A few comments follow.
>
>> ---
>>  .../ABI/testing/sysfs-bus-coresight-devices-stm    |   62 ++
>>  Documentation/trace/coresight.txt                  |   88 +-
>>  drivers/coresight/Kconfig                          |   10 +
>>  drivers/coresight/Makefile                         |    1 +
>>  drivers/coresight/coresight-stm.c                  | 1090 ++++++++++++++++++++
>>  include/linux/coresight-stm.h                      |   35 +
>>  include/uapi/linux/coresight-stm.h                 |   23 +
>>  7 files changed, 1307 insertions(+), 2 deletions(-)
>>  create mode 100644 Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
>>  create mode 100644 drivers/coresight/coresight-stm.c
>>  create mode 100644 include/linux/coresight-stm.h
>>  create mode 100644 include/uapi/linux/coresight-stm.h
>>
>>[...]
>> diff --git a/drivers/coresight/Kconfig b/drivers/coresight/Kconfig
>> index fc1f1ae7a49d..08806cc7d737 100644
>> --- a/drivers/coresight/Kconfig
>> +++ b/drivers/coresight/Kconfig
>> @@ -58,4 +58,14 @@ config CORESIGHT_SOURCE_ETM3X
>>         which allows tracing the instructions that a processor is executing
>>         This is primarily useful for instruction level tracing.  Depending
>>         the ETM version data tracing may also be available.
>> +
>> +config CORESIGHT_STM
>> +     bool "CoreSight System Trace Macrocell driver"
>> +     depends on (ARM && !(CPU_32v4 || CPU_32v4T)) || ARM64 || (64BIT && COMPILE_TEST)
>
> I'm _guessing_ that CPU_32v4 and CPU_32v4T are needed for the ldrd and
> strd assembler instructions. If that's right a next _guess_ would be
> that you also need to mention CPU_32v3 here.
>
> Furthermore, this file is only sourced by arch/arm/Kconfig.debug and
> arch/arm64/Kconfig.debug. So 64BIT should always be equal to ARM64 and
> the
>      || (64BIT && COMPILE_TEST)
>
> part shouldn't be needed, isn't it?

Things have changed in the driver (and accessor functions) since I
wrote this Kconfig condition - Let me think on the above 2 comments.

>
>> +     select CORESIGHT_LINKS_AND_SINKS
>> +     help
>> +       This driver provides support for hardware assisted software
>> +       instrumentation based tracing. This is primarily used for
>> +       logging useful software events or data coming from various entities
>> +       in the system, possibly running different OSs
>>  endif
>>[...]
>> diff --git a/drivers/coresight/coresight-stm.c b/drivers/coresight/coresight-stm.c
>> new file mode 100644
>> index 000000000000..e59b0fe01d87
>> --- /dev/null
>> +++ b/drivers/coresight/coresight-stm.c
>> @@ -0,0 +1,1090 @@
>>[...]
>> +#ifndef CONFIG_64BIT
>> +static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
>> +{
>> +     asm volatile("strd %1, %0"
>> +                  : "+Qo" (*(volatile u64 __force *)addr)
>> +                  : "r" (val));
>> +}
>> +
>> +static inline u64 __raw_readq(const volatile void __iomem *addr)
>> +{
>> +     u64 val;
>> +
>> +     asm volatile("ldrd %1, %0"
>> +                  : "+Qo" (*(volatile u64 __force *)addr),
>> +                    "=r" (val));
>> +     return val;
>> +}
>> +
>> +#undef readq_relaxed
>> +#define readq_relaxed(c) ({ u64 __r = le64_to_cpu((__force __le64) \
>> +                                     __raw_readq(c)); __r; })
>
> I spotted no users of readq_relaxed. Is it needed?
>
>> +#undef writeq_relaxed
>> +#define writeq_relaxed(v, c) __raw_writeq((__force u64) cpu_to_le64(v), c)
>> +#endif
>> +
>> [...]
>
>> +static ssize_t entities_show(struct device *dev,
>> +                          struct device_attribute *attr, char *buf)
>> +{
>> +     struct stm_drvdata *drvdata = dev_get_drvdata(dev->parent);
>> +     ssize_t len;
>> +
>> +     len = bitmap_scnprintf(buf, PAGE_SIZE, drvdata->entities,
>> +                            STM_ENTITY_MAX);
>> +
>
> bitmap_scnprintf is gone in current linux-next. I changed it to
>         len = scnprintf(buf, PAGE_SIZE, "%*pb", STM_ENTITY_MAX,
>                         drvdata->entities);
>
> to get this file to compile. (On x86_64, that is, but please don't tell
> anybody!)

I was not aware that "bitmap_scnprintf()" had disappeared - thanks for
pointing this out.  Did you mean cross-compile from an x86_64 host or
compile for a x86_64 target?  I don't get how you could have in the
latter case.

Thanks for the review,
Mathieu

>
>
> Paul Bolle
>

^ permalink raw reply

* Re: [PATCH] coresight-stm: adding driver for CoreSight STM component
From: Paul Bolle @ 2015-02-05 16:28 UTC (permalink / raw)
  To: Mathieu Poirier
  Cc: corbet, Pratik Patel, Al Grant, Liviu Dudau, Kaixu Xia,
	Jeenu Viswambharan, Greg KH, linux-arm-kernel@lists.infradead.org,
	linux-api, linux-kernel@vger.kernel.org, linux-doc
In-Reply-To: <CANLsYkx-XeiWAMxgMs=dAZc5oMf0NmF_2_6TzYXQtd1JdL5LrA@mail.gmail.com>

On Thu, 2015-02-05 at 08:51 -0700, Mathieu Poirier wrote:
> On 5 February 2015 at 04:27, Paul Bolle <pebolle@tiscali.nl> wrote:
> > On Wed, 2015-02-04 at 15:22 -0700, mathieu.poirier@linaro.org wrote:
> >> @@ -58,4 +58,14 @@ config CORESIGHT_SOURCE_ETM3X
> >>         which allows tracing the instructions that a processor is executing
> >>         This is primarily useful for instruction level tracing.  Depending
> >>         the ETM version data tracing may also be available.
> >> +
> >> +config CORESIGHT_STM
> >> +     bool "CoreSight System Trace Macrocell driver"
> >> +     depends on (ARM && !(CPU_32v4 || CPU_32v4T)) || ARM64 || (64BIT && COMPILE_TEST)
> >
> > I'm _guessing_ that CPU_32v4 and CPU_32v4T are needed for the ldrd and
> > strd assembler instructions. If that's right a next _guess_ would be
> > that you also need to mention CPU_32v3 here.
> >
> > Furthermore, this file is only sourced by arch/arm/Kconfig.debug and
> > arch/arm64/Kconfig.debug. So 64BIT should always be equal to ARM64 and
> > the
> >      || (64BIT && COMPILE_TEST)
> >
> > part shouldn't be needed, isn't it?
> 
> Things have changed in the driver (and accessor functions) since I
> wrote this Kconfig condition - Let me think on the above 2 comments.

In the meantime it occurred to me that the current dependency could be
written as just
    !CPU_32v4 && !CPU_32v4T

That shouldn't change anything, neither for both ARM nor for ARM64.
(This is because exactly one of ARM and ARM64 will always be set.)
Anyhow, I'll see what you'll come up with.

> >> +     select CORESIGHT_LINKS_AND_SINKS
> >> +     help
> >> +       This driver provides support for hardware assisted software
> >> +       instrumentation based tracing. This is primarily used for
> >> +       logging useful software events or data coming from various entities
> >> +       in the system, possibly running different OSs
> >>  endif

> >> +static ssize_t entities_show(struct device *dev,
> >> +                          struct device_attribute *attr, char *buf)
> >> +{
> >> +     struct stm_drvdata *drvdata = dev_get_drvdata(dev->parent);
> >> +     ssize_t len;
> >> +
> >> +     len = bitmap_scnprintf(buf, PAGE_SIZE, drvdata->entities,
> >> +                            STM_ENTITY_MAX);
> >> +
> >
> > bitmap_scnprintf is gone in current linux-next. I changed it to
> >         len = scnprintf(buf, PAGE_SIZE, "%*pb", STM_ENTITY_MAX,
> >                         drvdata->entities);
> >
> > to get this file to compile. (On x86_64, that is, but please don't tell
> > anybody!)
> 
> I was not aware that "bitmap_scnprintf()" had disappeared - thanks for
> pointing this out.  Did you mean cross-compile from an x86_64 host or
> compile for a x86_64 target?  I don't get how you could have in the
> latter case.

I compiled for the x86_64 target! I used a hack, not sure where it's
documented, to build object files by bypassing most of the build
infrastructure:
     make drivers/coresight/coresight-stm.o

But, since in this case that doesn't handle <linux/coresight-stm.h>
correctly, I used an ever bigger hack:
    make EXTRA_CFLAGS="-DCONFIG_CORESIGHT_STM" drivers/coresight/coresight-stm.o

This only works as long as the code doesn't use stuff that can't
possibly be reached in x86_64 (eg, it doesn't use platform specific
headers). It's a quick and dirty way to test changes to files normally
not compiled for your target.


Paul Bolle

^ permalink raw reply

* Re: [v8 4/5] ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
From: Jan Kara @ 2015-02-05 16:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Konstantin Khlebnikov, Andy Lutomirski, Li Xi, Linux FS Devel,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Theodore Ts'o, Andreas Dilger, Jan Kara, Al Viro,
	Christoph Hellwig, dmonakhov-GEFAQzZX7r8dnm+yROfE0A,
	Eric W. Biederman
In-Reply-To: <20150204225844.GA12722@dastard>

  Hello,

> Users have *always* been allowed to set the project ID of
> their own files. How else are they going to set the project ID on
> files they create in random directories so to account them to the
> correct project they are working on?
> 
> However, you keep making the assumption that project quotas ==
> directory subtree quotas.  Project quotas are *not limited* to
> directory subtrees - the subtree quota implementation is just an
> implementation that *sets the default project ID* on files as they
> are created.
> 
> e.g. there are production systems out there where project quotas are
> used to track home directory space usage rather than user quotas.
> This means users can take actions like "this file actually belongs
> to project X and it shouldn't be accounted against my home
> directory". Users can create their own sub directories that account
> everything by default to project X rather than their own home
> directory.
> 
> Again: project quotas are an *accounting* mechanism, not a security
> mechanism.
  OK, but now I got confused ;) So if users can change project ID of files
they own, what's the point of project quotas? If I need to create a file
and project quota doesn't allow me, I just set its project ID to some
random number and I'm done with that... So are really project quotas just
"advisory" - i.e., all users of a system cooperate so that project X
doesn't use more space than it should (and project quotas make this
cooperation somewhat simpler) or is there something which limits which
project IDs user can set? I didn't find anything...

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply

* Re: futex(2) man page update help request
From: Darren Hart @ 2015-02-05 19:57 UTC (permalink / raw)
  To: Thomas Gleixner, Torvald Riegel
  Cc: Michael Kerrisk (man-pages), Carlos O'Donell, Ingo Molnar,
	Jakub Jelinek, linux-man@vger.kernel.org, lkml, Davidlohr Bueso,
	Arnd Bergmann, Steven Rostedt, Peter Zijlstra, Linux API,
	Darren Hart, Anton Blanchard, Petr Baudis, Eric Dumazet,
	bill o gallmeister, Jan Kiszka, Daniel Wagner, Rich Felker
In-Reply-To: <alpine.DEB.2.11.1501241116160.5526@nanos>

On 1/24/15, 3:35 AM, "Thomas Gleixner" <tglx@linutronix.de> wrote:

>On Fri, 23 Jan 2015, Torvald Riegel wrote:
>> Second, the current documentation for EINTR is that it can happen due to
>> receiving a signal *or* due to a spurious wake-up.  This is difficult to
>
>I don't think so. I went through all callchains again with a fine comb.
>
>futex_wait()
>retry:
>	ret = futex_wait_setup();
>	if (ret) {
>		 /*
>		  * Possible return codes related to uaddr:
>		  * -EINVAL:    Not u32 aligned uaddr
>		  * -EFAULT:    No mapping, no RW
>		  * -ENOMEM:    Paging ran out of memory
>		  * -EHWPOISON: Memory hardware error
>		  *
>		  * Others:
>		  * -EWOULDBLOCK: value at uaddr has changed
>		  */
>		return ret;
>	}
>
>	futex_wait_queue_me();
>
>	if (woken by futex_wake/requeue)
>	   	return 0;
>
>	if (timeout)
>		return -ETIMEOUT;
>
>	/*
>	 * Spurious wakeup, i.e. no signal pending
>	 */
>	if (!signal_pending())
>		goto retry;
>
>	/* Handled in the low level syscall exit code */
>	if (!timed_wait)
>		return -ERESTARTSYS;
>	else
>		return -ERESTARTBLOCK;
>
>Now in the low level syscall exit we try to deliver the signal
>
>	if (!signal_delivered())
>	      restart_syscall();
>
>	if (sigaction->flags & SA_RESTART)
>	      restart_syscall();
>
>	ret_to_userspace -EINTR;
>
>So we should never see -EINTR in the case of a spurious wakeup here.
>
>But, here is the not so good news:
>
> I did some archaeology. The restart handling of futex_wait() got
> introduced in kernel 2.6.22, so anything older than that will have
> the spurious -EINTR issues.
>
>futex_wait_pi() always had the restart handling and glibc folks back
>then (2006) requested that it should never return -EINTR, so it
>unconditionally restarts the syscall whether a signal had been
>delivered or not.
>
>So kernels >= 2.6.22 should never return -EINTR spuriously. If that
>happens it's a bug and needs to be fixed.
>
>> Third, I think it would be useful to -- somewhere -- explain which
>> behavior the futex operations would have conceptually when expressed by
>> C11 code.  We currently say that they wake up, sleep, etc, and which
>> values they return.  But we never say how to properly synchronize with
>> them on the userspace side.  The C11 memory model is probably the best
>> model to use on the userspace side, so that's why I'm arguing for this.
>> Basically, I think we need to (1) tell people that they should use
>> memory_order_relaxed accesses to the futex variable (ie, the memory
>> location associated with the whole futex construct on the kernel side --
>> or do we have another name for this?), and (2) give some conceptual
>> guarantees for the kernel-side synchronization so that one use this to
>> derive how to use them correctly in userspace.
>> 
>> The man pages might not be the right place for this, and maybe we just
>> need a revision of "Futexes are tricky".  If you have other suggestions
>> for where to document this, or on the content, let me know.  (I'm also
>> willing to spend time on this :) ).
>
>The current futex code in the kernel has gained documentation about
>the required memory ordering recently. That should be a good starting
>point.

Lots of paging in here... If I recall correctly there was something about
not being able to return to userspace in these events without owning the
lock (waiters but no owner, breaking pi chains and promotion, etc.), so
restarting was the preferable path.

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply

* Re: [v8 4/5] ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
From: Dave Chinner @ 2015-02-05 21:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Konstantin Khlebnikov, Andy Lutomirski, Li Xi, Linux FS Devel,
	linux-ext4@vger.kernel.org, Linux API, Theodore Ts'o,
	Andreas Dilger, Al Viro, Christoph Hellwig, dmonakhov,
	Eric W. Biederman
In-Reply-To: <20150205163815.GK4258@quack.suse.cz>

On Thu, Feb 05, 2015 at 05:38:15PM +0100, Jan Kara wrote:
>   Hello,
> 
> > Users have *always* been allowed to set the project ID of
> > their own files. How else are they going to set the project ID on
> > files they create in random directories so to account them to the
> > correct project they are working on?
> > 
> > However, you keep making the assumption that project quotas ==
> > directory subtree quotas.  Project quotas are *not limited* to
> > directory subtrees - the subtree quota implementation is just an
> > implementation that *sets the default project ID* on files as they
> > are created.
> > 
> > e.g. there are production systems out there where project quotas are
> > used to track home directory space usage rather than user quotas.
> > This means users can take actions like "this file actually belongs
> > to project X and it shouldn't be accounted against my home
> > directory". Users can create their own sub directories that account
> > everything by default to project X rather than their own home
> > directory.
> > 
> > Again: project quotas are an *accounting* mechanism, not a security
> > mechanism.
>   OK, but now I got confused ;) So if users can change project ID of files
> they own, what's the point of project quotas? If I need to create a file
> and project quota doesn't allow me, I just set its project ID to some
> random number and I'm done with that...

Sure, but the admin is going to notice those random numbers in the
next quota report they run and then a user is going to get
re-educated with a clue bat.

> So are really project quotas just
> "advisory" - i.e., all users of a system cooperate so that project X
> doesn't use more space than it should (and project quotas make this
> cooperation somewhat simpler) or is there something which limits which
> project IDs user can set? I didn't find anything...

Not directly.  Project quotas have historically been used in tandem
with user quotas - user quotas cannot be escaped and that puts a
limit on the shenanigans that users can play with project quota.

[ Indeed, Irix only had user and project quotas - it *never* had group
quotas. e.g. when XFS was ported to Linux the project quota inode
was re-purposed for group quotas in 2001:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=749b2bf3ed5ff064efd69370e6b31ea44c4a78a6
]

However, if you have a fileystem system that users can't directly
access (e.g. an NFS server) then you can use project quotas as a
space enforcement mechanism because users can't change the project
ID on the files on the server. We've used the same model with
containers - for the host to be able to use project quotas as space
resource controller, users inside a container must not be able to
change the project ID of a file....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply

* [PATCH] Revert "IB/core: Add support for extended query device caps"
From: Yann Droneaud @ 2015-02-05 21:10 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Yann Droneaud, Eli Cohen,
	Haggai Eran, Ira Weiny, Jason Gunthorpe, Sagi Grimberg,
	Shachar Raindel

While commit 7e36ef8205ff ("IB/core: Temporarily disable ex_query_device uverb")
is correct as it make the extended QUERY_DEVICE uverb as integrated as
part of commit 5a77abf9a97a ("IB/core: Add support for extended query device caps")
and commit 860f10a799c8 ("IB/core: Add flags for on demand paging support")
not available to userspace, it doesn't address the initial issue
regarding ib_copy_to_udata() [1][2].

Additionally, further discussions around this new uverb seems to
conclude it would require a different data structure than the one
currently described in <rdma/ib_user_verbs.h> [3].

Both of these issues require a revert of the changes,
so this patch partially revert commit 8cdd312cfed7 ("IB/mlx5: Implement the ODP capability query verb"),
commit 860f10a799c8 ("IB/core: Add flags for on demand paging support")
and fully revert commit 5a77abf9a97a ("IB/core: Add support for extended query device caps").

[1] "Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps"
    http://mid.gmane.org/1418733236.2779.26.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org

[2] "Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb"
    http://mid.gmane.org/1423067503.3030.83.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org

[3] "RE: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask"
    http://mid.gmane.org/2807E5FD2F6FDA4886F6618EAC48510E0CC12C30-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org

Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/core/uverbs.h     |   1 -
 drivers/infiniband/core/uverbs_cmd.c | 137 +++++++++++------------------------
 drivers/infiniband/hw/mlx5/main.c    |   2 -
 include/rdma/ib_verbs.h              |   5 +-
 include/uapi/rdma/ib_user_verbs.h    |  27 -------
 5 files changed, 42 insertions(+), 130 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index b716b0815644..643c08a025a5 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -258,6 +258,5 @@ IB_UVERBS_DECLARE_CMD(close_xrcd);
 
 IB_UVERBS_DECLARE_EX_CMD(create_flow);
 IB_UVERBS_DECLARE_EX_CMD(destroy_flow);
-IB_UVERBS_DECLARE_EX_CMD(query_device);
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 532d8eba8b02..b7943ff16ed3 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -400,52 +400,6 @@ err:
 	return ret;
 }
 
-static void copy_query_dev_fields(struct ib_uverbs_file *file,
-				  struct ib_uverbs_query_device_resp *resp,
-				  struct ib_device_attr *attr)
-{
-	resp->fw_ver		= attr->fw_ver;
-	resp->node_guid		= file->device->ib_dev->node_guid;
-	resp->sys_image_guid	= attr->sys_image_guid;
-	resp->max_mr_size	= attr->max_mr_size;
-	resp->page_size_cap	= attr->page_size_cap;
-	resp->vendor_id		= attr->vendor_id;
-	resp->vendor_part_id	= attr->vendor_part_id;
-	resp->hw_ver		= attr->hw_ver;
-	resp->max_qp		= attr->max_qp;
-	resp->max_qp_wr		= attr->max_qp_wr;
-	resp->device_cap_flags	= attr->device_cap_flags;
-	resp->max_sge		= attr->max_sge;
-	resp->max_sge_rd	= attr->max_sge_rd;
-	resp->max_cq		= attr->max_cq;
-	resp->max_cqe		= attr->max_cqe;
-	resp->max_mr		= attr->max_mr;
-	resp->max_pd		= attr->max_pd;
-	resp->max_qp_rd_atom	= attr->max_qp_rd_atom;
-	resp->max_ee_rd_atom	= attr->max_ee_rd_atom;
-	resp->max_res_rd_atom	= attr->max_res_rd_atom;
-	resp->max_qp_init_rd_atom	= attr->max_qp_init_rd_atom;
-	resp->max_ee_init_rd_atom	= attr->max_ee_init_rd_atom;
-	resp->atomic_cap		= attr->atomic_cap;
-	resp->max_ee			= attr->max_ee;
-	resp->max_rdd			= attr->max_rdd;
-	resp->max_mw			= attr->max_mw;
-	resp->max_raw_ipv6_qp		= attr->max_raw_ipv6_qp;
-	resp->max_raw_ethy_qp		= attr->max_raw_ethy_qp;
-	resp->max_mcast_grp		= attr->max_mcast_grp;
-	resp->max_mcast_qp_attach	= attr->max_mcast_qp_attach;
-	resp->max_total_mcast_qp_attach	= attr->max_total_mcast_qp_attach;
-	resp->max_ah			= attr->max_ah;
-	resp->max_fmr			= attr->max_fmr;
-	resp->max_map_per_fmr		= attr->max_map_per_fmr;
-	resp->max_srq			= attr->max_srq;
-	resp->max_srq_wr		= attr->max_srq_wr;
-	resp->max_srq_sge		= attr->max_srq_sge;
-	resp->max_pkeys			= attr->max_pkeys;
-	resp->local_ca_ack_delay	= attr->local_ca_ack_delay;
-	resp->phys_port_cnt		= file->device->ib_dev->phys_port_cnt;
-}
-
 ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 			       const char __user *buf,
 			       int in_len, int out_len)
@@ -466,7 +420,47 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 		return ret;
 
 	memset(&resp, 0, sizeof resp);
-	copy_query_dev_fields(file, &resp, &attr);
+
+	resp.fw_ver 		       = attr.fw_ver;
+	resp.node_guid 		       = file->device->ib_dev->node_guid;
+	resp.sys_image_guid 	       = attr.sys_image_guid;
+	resp.max_mr_size 	       = attr.max_mr_size;
+	resp.page_size_cap 	       = attr.page_size_cap;
+	resp.vendor_id 		       = attr.vendor_id;
+	resp.vendor_part_id 	       = attr.vendor_part_id;
+	resp.hw_ver 		       = attr.hw_ver;
+	resp.max_qp 		       = attr.max_qp;
+	resp.max_qp_wr 		       = attr.max_qp_wr;
+	resp.device_cap_flags 	       = attr.device_cap_flags;
+	resp.max_sge 		       = attr.max_sge;
+	resp.max_sge_rd 	       = attr.max_sge_rd;
+	resp.max_cq 		       = attr.max_cq;
+	resp.max_cqe 		       = attr.max_cqe;
+	resp.max_mr 		       = attr.max_mr;
+	resp.max_pd 		       = attr.max_pd;
+	resp.max_qp_rd_atom 	       = attr.max_qp_rd_atom;
+	resp.max_ee_rd_atom 	       = attr.max_ee_rd_atom;
+	resp.max_res_rd_atom 	       = attr.max_res_rd_atom;
+	resp.max_qp_init_rd_atom       = attr.max_qp_init_rd_atom;
+	resp.max_ee_init_rd_atom       = attr.max_ee_init_rd_atom;
+	resp.atomic_cap 	       = attr.atomic_cap;
+	resp.max_ee 		       = attr.max_ee;
+	resp.max_rdd 		       = attr.max_rdd;
+	resp.max_mw 		       = attr.max_mw;
+	resp.max_raw_ipv6_qp 	       = attr.max_raw_ipv6_qp;
+	resp.max_raw_ethy_qp 	       = attr.max_raw_ethy_qp;
+	resp.max_mcast_grp 	       = attr.max_mcast_grp;
+	resp.max_mcast_qp_attach       = attr.max_mcast_qp_attach;
+	resp.max_total_mcast_qp_attach = attr.max_total_mcast_qp_attach;
+	resp.max_ah 		       = attr.max_ah;
+	resp.max_fmr 		       = attr.max_fmr;
+	resp.max_map_per_fmr 	       = attr.max_map_per_fmr;
+	resp.max_srq 		       = attr.max_srq;
+	resp.max_srq_wr 	       = attr.max_srq_wr;
+	resp.max_srq_sge 	       = attr.max_srq_sge;
+	resp.max_pkeys 		       = attr.max_pkeys;
+	resp.local_ca_ack_delay        = attr.local_ca_ack_delay;
+	resp.phys_port_cnt	       = file->device->ib_dev->phys_port_cnt;
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp))
@@ -3293,52 +3287,3 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 
 	return ret ? ret : in_len;
 }
-
-int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
-			      struct ib_udata *ucore,
-			      struct ib_udata *uhw)
-{
-	struct ib_uverbs_ex_query_device_resp resp;
-	struct ib_uverbs_ex_query_device  cmd;
-	struct ib_device_attr attr;
-	struct ib_device *device;
-	int err;
-
-	device = file->device->ib_dev;
-	if (ucore->inlen < sizeof(cmd))
-		return -EINVAL;
-
-	err = ib_copy_from_udata(&cmd, ucore, sizeof(cmd));
-	if (err)
-		return err;
-
-	if (cmd.reserved)
-		return -EINVAL;
-
-	err = device->query_device(device, &attr);
-	if (err)
-		return err;
-
-	memset(&resp, 0, sizeof(resp));
-	copy_query_dev_fields(file, &resp.base, &attr);
-	resp.comp_mask = 0;
-
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-	if (cmd.comp_mask & IB_USER_VERBS_EX_QUERY_DEVICE_ODP) {
-		resp.odp_caps.general_caps = attr.odp_caps.general_caps;
-		resp.odp_caps.per_transport_caps.rc_odp_caps =
-			attr.odp_caps.per_transport_caps.rc_odp_caps;
-		resp.odp_caps.per_transport_caps.uc_odp_caps =
-			attr.odp_caps.per_transport_caps.uc_odp_caps;
-		resp.odp_caps.per_transport_caps.ud_odp_caps =
-			attr.odp_caps.per_transport_caps.ud_odp_caps;
-		resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
-	}
-#endif
-
-	err = ib_copy_to_udata(ucore, &resp, sizeof(resp));
-	if (err)
-		return err;
-
-	return 0;
-}
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 8a87404e9c76..03bf81211a54 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1331,8 +1331,6 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_CREATE_XSRQ)		|
 		(1ull << IB_USER_VERBS_CMD_OPEN_QP);
-	dev->ib_dev.uverbs_ex_cmd_mask =
-		(1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE);
 
 	dev->ib_dev.query_device	= mlx5_ib_query_device;
 	dev->ib_dev.query_port		= mlx5_ib_query_port;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0d74f1de99aa..65994a19e840 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1707,10 +1707,7 @@ static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t
 
 static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len)
 {
-	size_t copy_sz;
-
-	copy_sz = min_t(size_t, len, udata->outlen);
-	return copy_to_user(udata->outbuf, src, copy_sz) ? -EFAULT : 0;
+	return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0;
 }
 
 /**
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 4275b961bf60..867cc5084afb 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -90,7 +90,6 @@ enum {
 };
 
 enum {
-	IB_USER_VERBS_EX_CMD_QUERY_DEVICE = IB_USER_VERBS_CMD_QUERY_DEVICE,
 	IB_USER_VERBS_EX_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_THRESHOLD,
 	IB_USER_VERBS_EX_CMD_DESTROY_FLOW,
 };
@@ -202,32 +201,6 @@ struct ib_uverbs_query_device_resp {
 	__u8  reserved[4];
 };
 
-enum {
-	IB_USER_VERBS_EX_QUERY_DEVICE_ODP =		1ULL << 0,
-};
-
-struct ib_uverbs_ex_query_device {
-	__u32 comp_mask;
-	__u32 reserved;
-};
-
-struct ib_uverbs_odp_caps {
-	__u64 general_caps;
-	struct {
-		__u32 rc_odp_caps;
-		__u32 uc_odp_caps;
-		__u32 ud_odp_caps;
-	} per_transport_caps;
-	__u32 reserved;
-};
-
-struct ib_uverbs_ex_query_device_resp {
-	struct ib_uverbs_query_device_resp base;
-	__u32 comp_mask;
-	__u32 reserved;
-	struct ib_uverbs_odp_caps odp_caps;
-};
-
 struct ib_uverbs_query_port {
 	__u64 response;
 	__u8  port_num;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] capabilities: Ambient capability set V1
From: Christoph Lameter @ 2015-02-05 21:56 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Andy Lutomirski, Aaron Jones, Ted Ts'o,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	akpm-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r, Andrew G. Morgan,
	Mimi Zohar, Austin S Hemmelgarn, Markku Savela, Jarkko Sakkinen,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	Jonathan Corbet

Ambient caps are something like restricted root privileges.
A process has a set of additional capabilities and those
are inherited without have to set capabilites in other
binaries involved. This allow the partial use of root
like features in a controlled way. It is often useful
to do this for user space device drivers or software that
needs increased priviledges for networking or to control
its own scheduling. Ambient caps allow one to avoid
having to run these with full root priviledges.

Control over this feature is avaialable via a new
prctl option called PR_CAP_AMBIENT. The second argument to prctl
is a the capability number and the third the desired state.
0 for off. Otherwise on.

Ambient bits are enabled regardless of the inheritance
mask of the target binary. They are only restricted
by the bounding set.

History:

Linux capabilities have suffered from the problem that they are not
inheritable like unregular process characteristics under Unix. This is
behavior that is counter intuitive to the expected behavior of processes
in Unix.

In particular there has been recently software that controls NICs from user
space and provides IP stack like behavior also in user space (DPDK and RDMA
kernel API based implementations). Those typically need either capabilities
to allow raw network access or have to be run setsuid. There is scripting and
LD_PREFLOAD etc involved, arbitrary binaries may be run from those scripts
including those setting additional capabilites or requiring root access.

That does not go well with having file capabilities set that would enable
the capabilities. Maybe it would work if one would setup capabilities on
all executables but that would also defeat a secure design since these
binaries may only need those caps for certain situations. Ok setting the
inheritable flags on everything may also get one there (if there would not
be the issues with LD_PRELOAD, debugging etc etc).

The easy solution is to allow some capabilities be inherited like setsuid
is. We really prefer to use capabilities instead of setsuid (we want to
limit what damage someone can do after all!). Therefore we have been
running a patch like this in production for the last 6 years. At some
point it becomes tedious to run your own custom kernel so we would like
to have this functionality upstream.

See some of the earlier related discussions on the problems with capability
inheritance:

0. Recent surprise:
                https://lkml.org/lkml/2014/1/21/175

1. Attempt to revise caps
                http://www.madore.org/~david/linux/newcaps/

2. Problems of passing caps through exec
                http://unix.stackexchange.com/questions/128394/passing-capabilities-through-exec

3. Problems of binding to privileged ports
                http://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-1024-on-l

4. Reviving capabilities
                http://lwn.net/Articles/199004/

There does not seem to be an alternative on the horizon. Some involved
in security development under Linux have even stated that they want to
rip out the whole thing and replace it. Its been a couple of years now
and we are still suffering from the capabilities mess. Let us just
fix it. Others have already done implementations like this like Nokia
for the N900.

This patch does not change the default behavior but it allows to set up
a list of capabilities via prctl that will enable regular
unix inheritance only for the selected group of capabilities.

With that it is then possible to do something trivial like setting
CAP_NET_RAW on an executable that can then allow that capability to
be inherited by others.

Lets have a look at a coding example of a wrapper that enables
a couple of capabilities:

------------------------------ ambient_test.c
/*
 * Test program for the ambient capabilities
 *
 *
 * Compile using:
 *	gcc -o ambient_test ambient_test.o
 *
 * This program must have the following capabilities to run properly:
 * CAP_SETPCAP, CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
 *
 * A command to equip this with the right caps is:
 *
 *	setcap cap_setpcap,cap_net_raw,cap_net_admin,cap_sys_nice+eip ambient_test
 *
 * To get a shell with additional caps that can be inherited do:
 *
 * ./ambient_test /bin/bash
 *
 */

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/* Defintion to be updated in the user space include files */
#define PR_CAP_AMBIENT 45

int main(int argc, char **argv)
{
	int rc;

	if (prctl(PR_CAP_AMBIENT, CAP_NET_RAW))
		perror("Cannot set CAP_NET_RAW");

	if (prctl(PR_CAP_AMBIENT, CAP_NET_ADMIN))
		perror("Cannot set CAP_NET_ADMIN");

	if (prctl(PR_CAP_AMBIENT, CAP_SYS_NICE))
		perror("Cannot set CAP_SYS_NICE");

	printf("Ambient_test forking shell\n");
	if (execv(argv[1], argv + 1))
		perror("Cannot exec");

	return 0;
}
-------------------------------- ambient_test.c

Allows the inheritance of CAP_SYS_NICE, CAP_NET_RAW and CAP_NET_ADMIN.
With that device raw access is possible and also real time priorities
can be set from user space. This is a frequently needed set of
priviledged operations in HPC and HFT applications. User space
processes need to be able to directly access devices as well as
have full control over scheduling.

Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Index: linux/security/commoncap.c
===================================================================
--- linux.orig/security/commoncap.c	2015-02-05 09:53:43.442883383 -0600
+++ linux/security/commoncap.c	2015-02-05 15:40:31.387388142 -0600
@@ -347,14 +347,17 @@ static inline int bprm_caps_from_vfs_cap
 		*has_cap = true;

 	CAP_FOR_EACH_U32(i) {
+		__u32 ambient = current_cred()->cap_ambient.cap[i];
 		__u32 permitted = caps->permitted.cap[i];
 		__u32 inheritable = caps->inheritable.cap[i];
+		__u32 x = new->cap_bset.cap[i];

 		/*
-		 * pP' = (X & fP) | (pI & fI)
+		 * pP' = (X & pA) | (X & fP) | (pI & fI)
 		 */
 		new->cap_permitted.cap[i] =
-			(new->cap_bset.cap[i] & permitted) |
+			(x & ambient) |
+			(x & permitted) |
 			(new->cap_inheritable.cap[i] & inheritable);

 		if (permitted & ~new->cap_permitted.cap[i])
@@ -453,9 +456,13 @@ static int get_file_caps(struct linux_bi
 		if (rc == -EINVAL)
 			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
 				__func__, rc, bprm->filename);
-		else if (rc == -ENODATA)
-			rc = 0;
-		goto out;
+		else if (rc != -ENODATA)
+			goto out;
+		rc = 0;
+		if (cap_isclear(current_cred()->cap_ambient))
+			goto out;
+		/* Make sure that the ambient caps are enabled */
+		*effective = true;
 	}

 	rc = bprm_caps_from_vfs_caps(&vcaps, bprm, effective, has_cap);
@@ -577,6 +584,7 @@ skip:
 	}

 	new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
+	new->cap_ambient = old->cap_ambient;
 	return 0;
 }

@@ -933,6 +941,23 @@ int cap_task_prctl(int option, unsigned
 			new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
 		return commit_creds(new);

+	case PR_CAP_AMBIENT:
+		if (!ns_capable(current_user_ns(), CAP_SETPCAP))
+			return -EPERM;
+
+		if (!cap_valid(arg2))
+			return -EINVAL;
+
+		if (!ns_capable(current_user_ns(), arg2))
+			return -EPERM;
+
+		new = prepare_creds();
+		if (arg3 == 0)
+			cap_lower(new->cap_ambient, arg2);
+		else
+			cap_raise(new->cap_ambient, arg2);
+		return commit_creds(new);
+
 	default:
 		/* No functionality available - continue with default */
 		return -ENOSYS;
Index: linux/include/linux/cred.h
===================================================================
--- linux.orig/include/linux/cred.h	2015-02-05 09:53:43.442883383 -0600
+++ linux/include/linux/cred.h	2015-02-05 09:53:43.438883512 -0600
@@ -122,6 +122,7 @@ struct cred {
 	kernel_cap_t	cap_permitted;	/* caps we're permitted */
 	kernel_cap_t	cap_effective;	/* caps we can actually use */
 	kernel_cap_t	cap_bset;	/* capability bounding set */
+	kernel_cap_t	cap_ambient;	/* Ambient capability set */
 #ifdef CONFIG_KEYS
 	unsigned char	jit_keyring;	/* default keyring to attach requested
 					 * keys to */
Index: linux/include/uapi/linux/prctl.h
===================================================================
--- linux.orig/include/uapi/linux/prctl.h	2015-02-05 09:53:43.442883383 -0600
+++ linux/include/uapi/linux/prctl.h	2015-02-05 09:53:43.438883512 -0600
@@ -185,4 +185,7 @@ struct prctl_mm_map {
 #define PR_MPX_ENABLE_MANAGEMENT  43
 #define PR_MPX_DISABLE_MANAGEMENT 44

+/* Control the ambient capability set */
+#define PR_CAP_AMBIENT 45
+
 #endif /* _LINUX_PRCTL_H */
Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c	2014-12-12 10:27:49.304801274 -0600
+++ linux/fs/proc/array.c	2015-02-05 11:04:38.546429870 -0600
@@ -302,7 +302,8 @@ static void render_cap_t(struct seq_file
 static inline void task_cap(struct seq_file *m, struct task_struct *p)
 {
 	const struct cred *cred;
-	kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset;
+	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
+			cap_bset, cap_ambient;

 	rcu_read_lock();
 	cred = __task_cred(p);
@@ -310,12 +311,14 @@ static inline void task_cap(struct seq_f
 	cap_permitted	= cred->cap_permitted;
 	cap_effective	= cred->cap_effective;
 	cap_bset	= cred->cap_bset;
+	cap_ambient	= cred->cap_ambient;
 	rcu_read_unlock();

 	render_cap_t(m, "CapInh:\t", &cap_inheritable);
 	render_cap_t(m, "CapPrm:\t", &cap_permitted);
 	render_cap_t(m, "CapEff:\t", &cap_effective);
 	render_cap_t(m, "CapBnd:\t", &cap_bset);
+	render_cap_t(m, "CapAmb:\t", &cap_ambient);
 }

 static inline void task_seccomp(struct seq_file *m, struct task_struct *p)

^ permalink raw reply

* [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Seymour, Shane M @ 2015-02-06  0:20 UTC (permalink / raw)
  To: linux-api@vger.kernel.org, linux-scsi@vger.kernel.org
  Cc: Kai.Makisara@kolumbus.fi,
	James E.J. Bottomley (JBottomley@parallels.com),
	Laurence Oberman (loberman@redhat.com)

Hello linux-api'ers

There has been some ongoing discussion about the best way to implement tape statistics. The original method suggested a long time ago used a single file in sysfs similar to block statistics in sysfs. That lead to an impass about the code on the linux-scsi mailing list.

The sysfs documentation says that files should contain one item per file (with some small exceptions):

> "Attributes should be ASCII text files, preferably with only one value
> per file. It is noted that it may not be efficient to contain only one
> value per file, so it is socially acceptable to express an array of
> values of the same type."

The current patch that implements tape statistics is here:

http://marc.info/?l=linux-scsi&m=142112067313723&w=2

Recently there was was another discussion here about one file vs a collection of files for tape statistics:

http://marc.info/?l=linux-scsi&m=142316255501550&w=2

The result was that I should ask here what method I should use. I would like to get feedback in relation to tape statistics and one file vs multi-file in sysfs. I'm happy to keep the existing code or change to a single file approach.

Thanks
Shane

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Shaohua Li @ 2015-02-06  0:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michael Kerrisk (man-pages), Michal Hocko, Andrew Morton,
	linux-kernel, linux-mm, linux-api, Hugh Dickins, Johannes Weiner,
	Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
	zhangyanfei, Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150203234722.GB3583@blaptop>


Hi Minchan,

Sorry to jump in this thread so later, and if some issues are discussed before.
I'm interesting in this patch, so tried it here. I use a simple test with
jemalloc. Obviously this can improve performance when there is no memory
pressure. Did you try setup with memory pressure?

In my test, jemalloc will map 61G vma, and use about 32G memory without
MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
slightly to make it work without swap), I got oom. If swap is enabled, my
system is totally stalled because of swap activity. Without the MADV_FREE,
everything is ok. Considering we definitely don't want to waste too much
memory, a system with memory pressure is normal, so sounds MADV_FREE will
introduce big trouble here.

Did you think about move the MADV_FREE pages to the head of inactive LRU, so
they can be reclaimed easily?

Thanks,
Shaohua

On Wed, Feb 04, 2015 at 08:47:22AM +0900, Minchan Kim wrote:
> Hello, Michael
> 
> On Tue, Feb 03, 2015 at 05:39:24PM +0100, Michael Kerrisk (man-pages) wrote:
> > Hello Minchan (and Michal)
> > 
> > I did not see this patch until just now when Michael explicitly
> > mentioned it in another discussion because
> > (a) it was buried in an LMKL thread that started a topic
> >     that was not about a man-pages patch.
> > (b) linux-man@ was not CCed.
> 
> Sorry about that.
> 
> > 
> > When resubmitting this patch, could you please To:me and CC linux-man@
> > and give the mail a suitable subject line indicating a man-pages patch.
> 
> Sure.
> 
> > 
> > On 12/05/2014 09:32 AM, Michal Hocko wrote:
> > > On Fri 05-12-14 16:08:16, Minchan Kim wrote:
> > > [...]
> > >> From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
> > >> From: Minchan Kim <minchan@kernel.org>
> > >> Date: Mon, 1 Dec 2014 08:53:55 +0900
> > >> Subject: [PATCH] madvise.2: Document MADV_FREE
> > >>
> > >> Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > 
> > > Reviewed-by: Michal Hocko <mhocko@suse.cz>
> > > 
> > > Thanks!
> > > 
> > >> ---
> > >>  man2/madvise.2 | 12 ++++++++++++
> > >>  1 file changed, 12 insertions(+)
> > >>
> > >> diff --git a/man2/madvise.2 b/man2/madvise.2
> > >> index 032ead7..fc1aaca 100644
> > >> --- a/man2/madvise.2
> > >> +++ b/man2/madvise.2
> > >> @@ -265,6 +265,18 @@ file (see
> > >>  .BR MADV_DODUMP " (since Linux 3.4)"
> > >>  Undo the effect of an earlier
> > >>  .BR MADV_DONTDUMP .
> > >> +.TP
> > >> +.BR MADV_FREE " (since Linux 3.19)"
> > >> +Tell the kernel that contents in the specified address range are no
> > >> +longer important and the range will be overwritten. When there is
> > >> +demand for memory, the system will free pages associated with the
> > >> +specified address range. In this instance, the next time a page in the
> > >> +address range is referenced, it will contain all zeroes.  Otherwise,
> > >> +it will contain the data that was there prior to the MADV_FREE call.
> > >> +References made to the address range will not make the system read
> > >> +from backing store (swap space) until the page is modified again.
> > >> +It works only with private anonymous pages (see
> > >> +.BR mmap (2)).
> > >>  .SH RETURN VALUE
> > >>  On success
> > >>  .BR madvise ()
> > 
> > If I'm reading the conversation right, the initially proposed text 
> > was from the BSD man page (which would be okay), but most of the 
> > text above seems  to have come straight from the page here:
> > http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3
> > 
> > Right?
> 
> True. Solaris man page was really straightforward/clear rather than BSD.
> 
> > 
> > Unfortunately, I don't think we can use that text. It's from the 
> > Solaris man page as far as I can tell, and I doubt that it's 
> > under a license that we can use.
> > 
> > If that's the case, we need to go back and come up with an
> > original text. It might draw inspiration from the Solaris page,
> > and take actual text from the BSD page (which is under a free
> > license), and it might also draw inspiration from Jon Corbet's 
> > description at http://lwn.net/Articles/590991/. 
> > 
> > Could you take another shot this please!
> 
> No problem. I will test my essay writing skill.
> Thanks. 
> 
> > 
> > Thanks,
> > 
> > Michael
> > 
> > 
> > 
> > -- 
> > Michael Kerrisk
> > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > Linux/UNIX System Programming Training: http://man7.org/training/
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Minchan Kim @ 2015-02-06  5:51 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Michael Kerrisk (man-pages), Michal Hocko, Andrew Morton,
	linux-kernel, linux-mm, linux-api, Hugh Dickins, Johannes Weiner,
	Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
	zhangyanfei, Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150206003311.GA2347@kernel.org>

Hi Shaohua,

On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> 
> Hi Minchan,
> 
> Sorry to jump in this thread so later, and if some issues are discussed before.
> I'm interesting in this patch, so tried it here. I use a simple test with

No problem at all. Interest is always win over ignorance.

> jemalloc. Obviously this can improve performance when there is no memory
> pressure. Did you try setup with memory pressure?

Sure but it was not a huge memory system like yours.

> 
> In my test, jemalloc will map 61G vma, and use about 32G memory without
> MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch

Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.

> slightly to make it work without swap), I got oom. If swap is enabled, my

You mean you modified anon aging logic so it works although there is no swap?
If so, I have no idea why OOM happens. I guess it should free all of freeable
pages during the aging so although system stall happens more, I don't expect
OOM. Anyway, with MADV_FREE with no swap, we should consider more things
about anonymous aging.

> system is totally stalled because of swap activity. Without the MADV_FREE,
> everything is ok. Considering we definitely don't want to waste too much
> memory, a system with memory pressure is normal, so sounds MADV_FREE will
> introduce big trouble here.
> 
> Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> they can be reclaimed easily?

I think it's desirable if the page lived in active LRU.
The reason I didn't that was caused by volatile ranges system call which
was motivaion for MADV_FREE in my mind.
In last LSF/MM, there was concern about data's hotness.
Some of users want to keep that as it is in LRU position, others want to
handle that as cold(tail of inactive list)/warm(head of inactive list)/
hot(head of active list), for example.
The vrange syscall was just about volatiltiy, not depends on page hotness
so the decision on my head was not to change LRU order and let's make new
hotness advise if we need it later.

However, MADV_FREE's main customer is allocators and afaik, they want
to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
but we couldn't make sure so head of inactive is good compromise.
Another concern about tail of inactive list is that there could be
plenty of pages in there, which was asynchromos write-backed in
previous reclaim path, not-yet reclaimed because of not being able
to free the in softirq context of writeback. It means we ends up
freeing more potential pages to become workingset in advance
than pages VM already decided to evict.

In summary, I like your suggestion.

Thanks.

> 
> Thanks,
> Shaohua
> 
> On Wed, Feb 04, 2015 at 08:47:22AM +0900, Minchan Kim wrote:
> > Hello, Michael
> > 
> > On Tue, Feb 03, 2015 at 05:39:24PM +0100, Michael Kerrisk (man-pages) wrote:
> > > Hello Minchan (and Michal)
> > > 
> > > I did not see this patch until just now when Michael explicitly
> > > mentioned it in another discussion because
> > > (a) it was buried in an LMKL thread that started a topic
> > >     that was not about a man-pages patch.
> > > (b) linux-man@ was not CCed.
> > 
> > Sorry about that.
> > 
> > > 
> > > When resubmitting this patch, could you please To:me and CC linux-man@
> > > and give the mail a suitable subject line indicating a man-pages patch.
> > 
> > Sure.
> > 
> > > 
> > > On 12/05/2014 09:32 AM, Michal Hocko wrote:
> > > > On Fri 05-12-14 16:08:16, Minchan Kim wrote:
> > > > [...]
> > > >> From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
> > > >> From: Minchan Kim <minchan@kernel.org>
> > > >> Date: Mon, 1 Dec 2014 08:53:55 +0900
> > > >> Subject: [PATCH] madvise.2: Document MADV_FREE
> > > >>
> > > >> Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > 
> > > > Reviewed-by: Michal Hocko <mhocko@suse.cz>
> > > > 
> > > > Thanks!
> > > > 
> > > >> ---
> > > >>  man2/madvise.2 | 12 ++++++++++++
> > > >>  1 file changed, 12 insertions(+)
> > > >>
> > > >> diff --git a/man2/madvise.2 b/man2/madvise.2
> > > >> index 032ead7..fc1aaca 100644
> > > >> --- a/man2/madvise.2
> > > >> +++ b/man2/madvise.2
> > > >> @@ -265,6 +265,18 @@ file (see
> > > >>  .BR MADV_DODUMP " (since Linux 3.4)"
> > > >>  Undo the effect of an earlier
> > > >>  .BR MADV_DONTDUMP .
> > > >> +.TP
> > > >> +.BR MADV_FREE " (since Linux 3.19)"
> > > >> +Tell the kernel that contents in the specified address range are no
> > > >> +longer important and the range will be overwritten. When there is
> > > >> +demand for memory, the system will free pages associated with the
> > > >> +specified address range. In this instance, the next time a page in the
> > > >> +address range is referenced, it will contain all zeroes.  Otherwise,
> > > >> +it will contain the data that was there prior to the MADV_FREE call.
> > > >> +References made to the address range will not make the system read
> > > >> +from backing store (swap space) until the page is modified again.
> > > >> +It works only with private anonymous pages (see
> > > >> +.BR mmap (2)).
> > > >>  .SH RETURN VALUE
> > > >>  On success
> > > >>  .BR madvise ()
> > > 
> > > If I'm reading the conversation right, the initially proposed text 
> > > was from the BSD man page (which would be okay), but most of the 
> > > text above seems  to have come straight from the page here:
> > > http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3
> > > 
> > > Right?
> > 
> > True. Solaris man page was really straightforward/clear rather than BSD.
> > 
> > > 
> > > Unfortunately, I don't think we can use that text. It's from the 
> > > Solaris man page as far as I can tell, and I doubt that it's 
> > > under a license that we can use.
> > > 
> > > If that's the case, we need to go back and come up with an
> > > original text. It might draw inspiration from the Solaris page,
> > > and take actual text from the BSD page (which is under a free
> > > license), and it might also draw inspiration from Jon Corbet's 
> > > description at http://lwn.net/Articles/590991/. 
> > > 
> > > Could you take another shot this please!
> > 
> > No problem. I will test my essay writing skill.
> > Thanks. 
> > 
> > > 
> > > Thanks,
> > > 
> > > Michael
> > > 
> > > 
> > > 
> > > -- 
> > > Michael Kerrisk
> > > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > > Linux/UNIX System Programming Training: http://man7.org/training/
> > 
> > -- 
> > Kind regards,
> > Minchan Kim
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Bryn M. Reeves @ 2015-02-06  9:13 UTC (permalink / raw)
  To: Seymour, Shane M
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kai.Makisara-9Aww8k/80nUxHbG02/KK1g@public.gmane.org,
	James E.J. Bottomley (JBottomley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org),
	Laurence Oberman (loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org)
In-Reply-To: <DDB9C85B850785449757F9914A034FCB3BF3EADA-4I1V4pQFGigSZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>

On Fri, Feb 06, 2015 at 12:20:53AM +0000, Seymour, Shane M wrote:
> There has been some ongoing discussion about the best way to implement
> tape statistics. The original method suggested a long time ago used a
> single file in sysfs similar to block statistics in sysfs. That lead to
> an impass about the code on the linux-scsi mailing list.

I would have a strong preference for a single file containing an array
of integer counters. This is in keeping with other statistics attributes
in sysfs and follows the principle of least surprise: it's essentially
the same general format as /proc/diskstats and sysfs disk and partition
stats (dm statistics also follow this convention via the @print_stats
message).

This simplifies userspace code to read and parse the counters and avoids
additional sample jitter when reading stats for very large numbers of
devices; each device would require at least eleven open()/read()/close()
cycles.

For a small number of devices this shouldn't matter too much but
eventually the additional syscall overhead could become significant (I
think you mentioned users with ~20k devices?).

The sync file mechanism in the v2 patch that addresses this problem is
kinda cute but also significantly more complex than a plain old array
and as you pointed out adds hundreds of lines to the patch..

Sticking to arrays also allows existing tools like sysstat to be easily
adapted to the new data source.

> The sysfs documentation says that files should contain one item per file (with some small exceptions):
> 
> > "Attributes should be ASCII text files, preferably with only one value
> > per file. It is noted that it may not be efficient to contain only one
> > value per file, so it is socially acceptable to express an array of
> > values of the same type."

Right: I think there's good precedent for the array file style when
dealing with counter sets.

Regards,
Bryn.

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Michal Hocko @ 2015-02-06 12:58 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Minchan Kim, Michael Kerrisk (man-pages), Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Hugh Dickins, Johannes Weiner, Rik van Riel, KOSAKI Motohiro,
	Mel Gorman, Jason Evans, zhangyanfei-BthXqXjhjHXQFUHtdCDX3A,
	Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150206003311.GA2347-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On Thu 05-02-15 16:33:11, Shaohua Li wrote:
[...]
> Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> they can be reclaimed easily?

Yes this makes sense for pages living on the active LRU list. I would
preserve LRU ordering on the inactive list because there is no good
reason to make the operation more costly for inactive pages. On the
other hand having tons of to-be-freed pages on the active list clearly
sucks. Care to send a patch?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Greg KH @ 2015-02-06 12:59 UTC (permalink / raw)
  To: Seymour, Shane M
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kai.Makisara-9Aww8k/80nUxHbG02/KK1g@public.gmane.org,
	James E.J. Bottomley (JBottomley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org),
	Laurence Oberman (loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org)
In-Reply-To: <DDB9C85B850785449757F9914A034FCB3BF3EADA-4I1V4pQFGigSZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>

On Fri, Feb 06, 2015 at 12:20:53AM +0000, Seymour, Shane M wrote:
> Hello linux-api'ers
> 
> There has been some ongoing discussion about the best way to implement tape statistics. The original method suggested a long time ago used a single file in sysfs similar to block statistics in sysfs. That lead to an impass about the code on the linux-scsi mailing list.
> 
> The sysfs documentation says that files should contain one item per file (with some small exceptions):
> 
> > "Attributes should be ASCII text files, preferably with only one value
> > per file. It is noted that it may not be efficient to contain only one
> > value per file, so it is socially acceptable to express an array of
> > values of the same type."
> 
> The current patch that implements tape statistics is here:
> 
> http://marc.info/?l=linux-scsi&m=142112067313723&w=2

Aside from the "do we want to do this all in a single file" issue that I
will say more on below, this patch has issues.  Please don't use a
kobject for _ANYTHING_ in sysfs that has a struct device as a parent.
If you do that, it can't be seen by userspace tools very well, if at
all.

Instead, if you want to create a directory, just use an attribute group,
which will be created at the proper time (before the device is announced
to userspace), and then userspace can see it with the tools it is used
to using (i.e. libudev and friends.)

That should simplify your code a lot, please make that change no matter
what happens here with the content of the files.

> Recently there was was another discussion here about one file vs a collection of files for tape statistics:
> 
> http://marc.info/?l=linux-scsi&m=142316255501550&w=2
> 
> The result was that I should ask here what method I should use. I
> would like to get feedback in relation to tape statistics and one file
> vs multi-file in sysfs. I'm happy to keep the existing code or change
> to a single file approach.

One of the primary reasons we created sysfs and the "one value per file"
rule is that multi-value files just do not work well.  Yes, you get an
atomic snapshot, and you save some open/read/close syscall roundtrips,
but you do so at the expense of forcing userspace to "know" what the
format of the file is.  And once you create it, you can NEVER CHANGE IT
AGAIN.

Yes, that's right, if you come up with some new statistic in the future,
or realize that one of the ones you have now is wrong, you can't change
it, you have to make a whole new file, otherwise you could break
userspace tools.

Instead, a one-value-per-file rule forces userspace to know that if the
file is present, it can be opened and read from for a single value.  If
it isn't there, it should fail properly and move on to the next
statistic.  If you want to add a new statistic, great, just add a new
file and you are set.

You aren't dealing with performance-sensitive numbers here that "have"
to have an atomic snapshot of the state at a specific point in time in
order to work properly, so just have a bunch of files, all one value per
file, then all userspace tools will "just work" (i.e. libudev), and
everyone is happy.

And yes, open/read/close does take take a few extra cycles, but you
can't really measure it for a virtual filesystem like this on any modern
system.

Hope this helps explain why we have the sysfs rule, and why you should
continue to follow it as well.

Yes, it's not always followed, but that's usually because people forgot
why we had this rule, and no one noticed or pointed it out to me that it
was wrong.

thanks,

greg k-h

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Greg KH @ 2015-02-06 13:01 UTC (permalink / raw)
  To: Bryn M. Reeves
  Cc: Seymour, Shane M,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kai.Makisara-9Aww8k/80nUxHbG02/KK1g@public.gmane.org,
	James E.J. Bottomley (JBottomley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org),
	Laurence Oberman (loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org)
In-Reply-To: <20150206091354.GA1143-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

On Fri, Feb 06, 2015 at 09:13:55AM +0000, Bryn M. Reeves wrote:
> > The sysfs documentation says that files should contain one item per
> > file (with some small exceptions):
> > 
> > > "Attributes should be ASCII text files, preferably with only one value
> > > per file. It is noted that it may not be efficient to contain only one
> > > value per file, so it is socially acceptable to express an array of
> > > values of the same type."
> 
> Right: I think there's good precedent for the array file style when
> dealing with counter sets.

See my previous reply for why I strongly feel this is incorrect, sorry.

greg k-h

^ permalink raw reply

* Re: [PATCH] Revert "IB/core: Add support for extended query device caps"
From: Yann Droneaud @ 2015-02-06 13:47 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Eli Cohen, Haggai Eran,
	Ira Weiny, Jason Gunthorpe, Sagi Grimberg, Shachar Raindel
In-Reply-To: <CAL1RGDWhNy2h1gqg0FYbnXX0J=ZBXy5uwf_N+gnXreJkLgLCpw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi,

Le vendredi 06 février 2015 à 00:55 -0800, Roland Dreier a écrit :
> Thanks for noticing this at the last moment.
> 

I should have noticed this earlier, but, as small is beautiful,
I was impressed by the patch from Haggai and thought it would better 
than an ugly revert ... Foolish I was.

> I will send this to Linus on Friday.

Thanks a lot. And sorry about this epic adventure in ABI land :/

Regards.

-- 
Yann Droneaud
OPTEYA



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michael Kerrisk (man-pages) @ 2015-02-06 15:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Vlastimil Babka,
	Kirill A. Shutemov, Dave Hansen, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Andrew Morton,
	lkml, Linux API, linux-man, Hugh Dickins
In-Reply-To: <20150205010757.GA20996@blaptop>

On 02/05/2015 02:07 AM, Minchan Kim wrote:
> Hello,
> 
> On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote:
>> On 4 February 2015 at 18:02, Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
>>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote:
>>>>
>>>> Hello Vlastimil,
>>>>
>>>> On 4 February 2015 at 14:46, Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
>>>>>>>
>>>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages"
>>>>>>> case
>>>>>>> though. I dont see any check for other kinds of shared pages in the
>>>>>>> code.
>>>>>>
>>>>>>
>>>>>> Agreed. "shared" here seems confused. I've removed it. And I've
>>>>>> added mention of "Huge TLB pages" for this error.
>>>>>
>>>>>
>>>>> Thanks.
>>>>
>>>>
>>>> I also added those cases for MADV_REMOVE, BTW.
>>>
>>>
>>> Right. There's also the following for MADV_REMOVE that needs updating:
>>>
>>> "Currently, only shmfs/tmpfs supports this; other filesystems return with
>>> the error ENOSYS."
>>>
>>> - it's not just shmem/tmpfs anymore. It should be best to refer to
>>> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to
>>> date.
>>>
>>> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is
>>> listed in the ERRORS section.
>>
>> Yup, I recently added that as well, based on a patch from Jan Chaloupka.
>>
>>>>>>>>> - The word "will result" did sound as a guarantee at least to me. So
>>>>>>>>> here it
>>>>>>>>> could be changed to "may result (unless the advice is ignored)"?
>>>>>>>>
>>>>>>>> It's too late to fix documentation. Applications already depends on
>>>>>>>> the
>>>>>>>> beheviour.
>>>>>>>
>>>>>>> Right, so as long as they check for EINVAL, it should be safe. It
>>>>>>> appears
>>>>>>> that
>>>>>>> jemalloc does.
>>>>>>
>>>>>> So, first a brief question: in the cases where the call does not error
>>>>>> out,
>>>>>> are we agreed that in the current implementation, MADV_DONTNEED will
>>>>>> always result in zero-filled pages when the region is faulted back in
>>>>>> (when we consider pages that are not backed by a file)?
>>>>>
>>>>> I'd agree at this point.
>>>>
>>>> Thanks for the confirmation.
>>>>
>>>>> Also we should probably mention anonymously shared pages (shmem). I think
>>>>> they behave the same as file here.
>>>>
>>>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.)
>>>
>>> shmem is tmpfs (that by itself would fit under "files" just fine), but also
>>> sys V segments created by shmget(2) and also mappings created by mmap with
>>> MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to
>>> refer to the full list.
>>
>> So, how about this text:
>>
>>               After a successful MADV_DONTNEED operation, the seman‐
>>               tics  of  memory  access  in  the specified region are
>>               changed: subsequent accesses of  pages  in  the  range
>>               will  succeed,  but will result in either reloading of
>>               the memory contents from the  underlying  mapped  file
>>               (for  shared file mappings, shared anonymous mappings,
>>               and shmem-based techniques such  as  System  V  shared
>>               memory  segments)  or  zero-fill-on-demand  pages  for
>>               anonymous private mappings.
> 
> Hmm, I'd like to clarify.
> 
> Whether it was intention or not, some of userspace developers thought
> about that syscall drop pages instantly if was no-error return so that
> they will see more free pages(ie, rss for the process will be decreased)
> with keeping the VMA. Can we rely on it?

I do not know. Michael?

> And we should make error section, too.
> "locked" covers mlock(2) and you said you will add hugetlb. Then,
> VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP?
> special mapping for some drivers?

I'm open for offers on what to add.
 
> One more thing, "The kernel is free to ignore the advice".
> It conflicts "This call does not influence the semantics of the
> application (except in the case of MADV_DONTNEED)" so
> is it okay we can believe "The kernel is free to ingmore the advise
> except MADV_DONTNEED"?

I decided to just drop the sentence

     The kernel is free to ignore the advice.

It creates misunderstandings, and does not really add information.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Bryn M. Reeves @ 2015-02-06 15:41 UTC (permalink / raw)
  To: Greg KH
  Cc: Seymour, Shane M,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kai.Makisara-9Aww8k/80nUxHbG02/KK1g@public.gmane.org,
	James E.J. Bottomley (JBottomley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org),
	Laurence Oberman (loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org)
In-Reply-To: <20150206125916.GB26247-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Fri, Feb 06, 2015 at 04:59:16AM -0800, Greg KH wrote:
> On Fri, Feb 06, 2015 at 12:20:53AM +0000, Seymour, Shane M wrote:
> > The current patch that implements tape statistics is here:
> > 
> > http://marc.info/?l=linux-scsi&m=142112067313723&w=2
> 
> Aside from the "do we want to do this all in a single file" issue that I
> will say more on below, this patch has issues.  Please don't use a
> kobject for _ANYTHING_ in sysfs that has a struct device as a parent.
> If you do that, it can't be seen by userspace tools very well, if at
> all.

I can't speak for Shane but wouldn't spend too much time looking at the
current v2 patch: it's the result of a pretty ugly compromise suggested
on linux-scsi.

This thread was really to try to settle the discussion on the structure
of the stats files.

> > Recently there was was another discussion here about one file vs a collection of files for tape statistics:
> > 
> > http://marc.info/?l=linux-scsi&m=142316255501550&w=2
> > 
> > The result was that I should ask here what method I should use. I
> > would like to get feedback in relation to tape statistics and one file
> > vs multi-file in sysfs. I'm happy to keep the existing code or change
> > to a single file approach.
> 
> One of the primary reasons we created sysfs and the "one value per file"
> rule is that multi-value files just do not work well.  Yes, you get an
> atomic snapshot, and you save some open/read/close syscall roundtrips,
> but you do so at the expense of forcing userspace to "know" what the
> format of the file is.  And once you create it, you can NEVER CHANGE IT
> AGAIN.

I am not convinced this is a concern for tape statistics: they are pretty
much a solved problem. The commercial *nixes have had this for decades.

Likewise for disk stats: although fluff like maj:min/name etc. has been
shuffled a few times the basic fields have remained unchanged for a very
long time and sysfs already removes the need to include an identity
field.
 
> Yes, that's right, if you come up with some new statistic in the future,
> or realize that one of the ones you have now is wrong, you can't change
> it, you have to make a whole new file, otherwise you could break
> userspace tools.

I understand the fact that you can't change them; I just don't think it's
a big problem in this specific case (and much less than some of the
more imaginative sysfs content - 2d int arrays with column headers
anyone?).

> And yes, open/read/close does take take a few extra cycles, but you
> can't really measure it for a virtual filesystem like this on any modern
> system.

I'll try to get some numbers when I get back home next week - Shane is
talking about use cases involving tens of thousands of tape devices. I
am not certain that the overhead would be unmeasurable in that case: the
additional context switching & TLB flushes alone seem like they would
add up.

> Hope this helps explain why we have the sysfs rule, and why you should
> continue to follow it as well.
>
> Yes, it's not always followed, but that's usually because people forgot
> why we had this rule, and no one noticed or pointed it out to me that it
> was wrong.

Perhaps sysfs.txt should be updated to make the position more clear? The
current wording seems rather more liberal than this thread would
suggest. Maybe something like the patch below?

This would help people who are trying to dtrt by reading the documentation.

Regards,
Bryn.


  From 3081aad4cc4d19b68f39499dbeb3837f0642f70e Mon Sep 17 00:00:00 2001
  From: "Bryn M. Reeves" <bmr-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  Date: Fri, 6 Feb 2015 15:19:39 +0000
  Subject: [PATCH] docs/sysfs: Specify array valued attribute review
   requirements
  
  Although the linux-api position that one-value-per-file is a strong rule
  is very clear in mailing list discussions the sysfs.txt documentation
  suggests a rather more liberal stance:
  
  "... it is socially acceptable to express an array of values of the same
  type."
  
  Fix the documentation to make it clear that such uses should be
  discussed on linux-api first.

Signed-off-by: Bryn M. Reeves <bmr-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 Documentation/filesystems/sysfs.txt | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index b35a64b..494fa78 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -57,8 +57,15 @@ attributes.
 
 Attributes should be ASCII text files, preferably with only one value
 per file. It is noted that it may not be efficient to contain only one
-value per file, so it is socially acceptable to express an array of
-values of the same type. 
+value per file, so it may be socially acceptable to express an array of
+values of the same type.
+
+If you are considering adding such an array attribute to sysfs please
+discuss it via the linux-api mailing list first to ensure that your
+proposed use is acceptable:
+
+  https://www.kernel.org/doc/man-pages/linux-api-ml.html
+  linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
 
 Mixing types, expressing multiple lines of data, and doing fancy
 formatting of data is heavily frowned upon. Doing these things may get
-- 
1.9.3

^ permalink raw reply related

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michael Kerrisk (man-pages) @ 2015-02-06 15:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: mtk.manpages, Vlastimil Babka, Kirill A. Shutemov, Dave Hansen,
	Mel Gorman, linux-mm@kvack.org, Minchan Kim, Andrew Morton, lkml,
	Linux API, linux-man, Hugh Dickins
In-Reply-To: <20150205154102.GA20607@dhcp22.suse.cz>

Hi Michael

On 02/05/2015 04:41 PM, Michal Hocko wrote:
> On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
> [...]
>> So, how about this text:
>>
>>               After a successful MADV_DONTNEED operation, the seman‐
>>               tics  of  memory  access  in  the specified region are
>>               changed: subsequent accesses of  pages  in  the  range
>>               will  succeed,  but will result in either reloading of
>>               the memory contents from the  underlying  mapped  file
> 
> "
> result in either providing the up-to-date contents of the underlying
> mapped file
> "

Thanks! I did something like that. See below.

> Would be more precise IMO because reload might be interpreted as a major
> fault which is not necessarily the case (see below).
> 
>>               (for  shared file mappings, shared anonymous mappings,
>>               and shmem-based techniques such  as  System  V  shared
>>               memory  segments)  or  zero-fill-on-demand  pages  for
>>               anonymous private mappings.
> 
> Yes, this wording is better because many users are not aware of
> MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> mention that.

(Michal, would you have a text to propose to add to the mmap(2) page?
Maybe it would be useful to add something there.)

> 
> I am just wondering whether it makes sense to mention that MADV_DONTNEED
> for shared mappings might be surprising and not freeing the backing
> pages thus not really freeing memory until there is a memory
> pressure. But maybe this is too implementation specific for a man
> page. What about the following wording on top of yours?
> "
> Please note that the MADV_DONTNEED hint on shared mappings might not
> lead to immediate freeing of pages in the range. The kernel is free to
> delay this until an appropriate moment. RSS of the calling process will
> be reduced however.
> "

Thanks! I added this, but dropped in the word "immediately" in the last 
sentence, since I assume that was implied. So now we have:

              After  a  successful MADV_DONTNEED operation, the seman‐
              tics of  memory  access  in  the  specified  region  are
              changed:  subsequent accesses of pages in the range will
              succeed, but will result in either repopulating the mem‐
              ory  contents from the up-to-date contents of the under‐
              lying mapped file  (for  shared  file  mappings,  shared
              anonymous  mappings,  and shmem-based techniques such as
              System V shared memory segments) or  zero-fill-on-demand
              pages for anonymous private mappings.

              Note  that,  when applied to shared mappings, MADV_DONT‐
              NEED might not lead to immediate freeing of the pages in
              the  range.   The  kernel  is  free to delay freeing the
              pages until an appropriate  moment.   The  resident  set
              size  (RSS)  of  the calling process will be immediately
              reduced however.

The current draft of the page can be found in a branch,
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox