Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH] selftests: Fix build failures when invoked from kselftest target
From: Shuah Khan @ 2015-03-18 16:46 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Shuah Khan
In-Reply-To: <1426466329.21900.2.camel-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>

On 03/15/2015 06:38 PM, Michael Ellerman wrote:
> On Fri, 2015-03-13 at 19:45 -0600, Shuah Khan wrote:
>> Several tests that rely on implicit build rules fail to build,
>> when invoked from the main Makefile kselftest target. These
>> failures are due to --no-builtin-rules and --no-builtin-variables
>> options set in the inherited MAKEFLAGS.
>>
>> --no-builtin-rules eliminates the use of built-in implicit rules
>> and --no-builtin-variables is for not defining built-in variables.
>> These two options override the use of implicit rules resulting in
>> build failures. In addition, inherited LDFLAGS result in build
>> failures and there is no need to define LDFLAGS.  Clear LDFLAGS
>> and MAKEFLAG when make is invoked from the main Makefile kselftest
>> target. Fixing this at selftests Makefile avoids changing the main
>> Makefile and keeps this change self contained at selftests level.
>>
>> Signed-off-by: Shuah Khan <shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
>> ---
>>  tools/testing/selftests/Makefile | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
>> index 4e51122..8e09db7 100644
>> --- a/tools/testing/selftests/Makefile
>> +++ b/tools/testing/selftests/Makefile
>> @@ -22,6 +22,15 @@ TARGETS += vm
>>  TARGETS_HOTPLUG = cpu-hotplug
>>  TARGETS_HOTPLUG += memory-hotplug
>>  
>> +# Clear LDFLAGS and MAKEFLAGS if called from main
>> +# Makefile to avoid test build failures when test
>> +# Makefile doesn't have explicit build rules.
>> +ifeq (1,$(MAKELEVEL))
>> +undefine LDFLAGS
>> +override define MAKEFLAGS =
>> +endef
>> +endif
> 
> You shouldn't need to use define/endef here, that is just for multi-line
> variables.
> 
> This should be equivalent:
> 
>   ifeq (1,$(MAKELEVEL))
>   undefine LDFLAGS
>   override MAKEFLAGS =
>   endif
> 

ok. Will send a patch v2 with that change.

-- Shuah

-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978

^ permalink raw reply

* Re: [PATCH v2] add support for Freescale's MMA8653FC 10 bit accelerometer
From: Bastien Nocera @ 2015-03-18 16:59 UTC (permalink / raw)
  To: Martin Kepplinger
  Cc: Alexander Stein, robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	pawel.moll-5wv7dgnIgG8, mark.rutland-5wv7dgnIgG8,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dmitry.torokhov-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-input-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Martin Kepplinger,
	Christoph Muellner
In-Reply-To: <5509AAE5.1000503-1KBjaw7Xf1+zQB+pC5nmwQ@public.gmane.org>

On Wed, 2015-03-18 at 17:42 +0100, Martin Kepplinger wrote:
> 
<snip>
> It could have gone to drivers/iio/accel if it would use an iio 
> interface, which would make more sense, you are right, but I simply 
> don't have the time to merge it in to iio.
> 
> It doesn't use an input interface either but I don't see a good 
> place for an accelerometer that uses sysfs only.
> 
> It works well, is a relatively recent chip and a clean dirver. But 
> this is all I can provide.

As a person who works on the user-space interaction of those with 
desktops [1]: Urgh.

I already have 3 (probably 4) types of accelerometers to contend with, 
I'm not fond of adding yet another type.

Is there any way to get this hardware working outside the SoCs it's 
designed for (say, a device with I2C like a Raspberry Pi), so that a 
kind soul could handle getting this using the right interfaces?

Cheers

[1]: https://github.com/hadess/iio-sensor-proxy

^ permalink raw reply

* Re: [PATCH v2] add support for Freescale's MMA8653FC 10 bit accelerometer
From: Martin Kepplinger @ 2015-03-18 18:02 UTC (permalink / raw)
  To: Bastien Nocera
  Cc: Alexander Stein, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dmitry.torokhov, akpm, gregkh, linux-api,
	devicetree, linux-input, linux-kernel, Martin Kepplinger,
	Christoph Muellner
In-Reply-To: <1426697978.6764.8.camel@hadess.net>

Am 2015-03-18 um 17:59 schrieb Bastien Nocera:
> On Wed, 2015-03-18 at 17:42 +0100, Martin Kepplinger wrote:
>>
> <snip>
>> It could have gone to drivers/iio/accel if it would use an iio 
>> interface, which would make more sense, you are right, but I simply 
>> don't have the time to merge it in to iio.
>>
>> It doesn't use an input interface either but I don't see a good 
>> place for an accelerometer that uses sysfs only.
>>
>> It works well, is a relatively recent chip and a clean dirver. But 
>> this is all I can provide.
> 
> As a person who works on the user-space interaction of those with 
> desktops [1]: Urgh.
> 
> I already have 3 (probably 4) types of accelerometers to contend with, 
> I'm not fond of adding yet another type.
> 
> Is there any way to get this hardware working outside the SoCs it's 
> designed for (say, a device with I2C like a Raspberry Pi), so that a 
> kind soul could handle getting this using the right interfaces?
> 

It works on basically any SoC and is in no way limited in this regard.
Sure, userspace has to expicitely support it and I hear you. Using the
iio interface would make more sense. I can only say I'd love to have the
time to move this driver over. I'm very sorry.

> Cheers
> 
> [1]: https://github.com/hadess/iio-sensor-proxy
> 


^ permalink raw reply

* Re: [PATCH v2] add support for Freescale's MMA8653FC 10 bit accelerometer
From: Bastien Nocera @ 2015-03-18 18:05 UTC (permalink / raw)
  To: Martin Kepplinger
  Cc: Alexander Stein, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dmitry.torokhov, akpm, gregkh, linux-api,
	devicetree, linux-input, linux-kernel, Martin Kepplinger,
	Christoph Muellner
In-Reply-To: <5509BDA0.9000806@posteo.de>

On Wed, 2015-03-18 at 19:02 +0100, Martin Kepplinger wrote:
> Am 2015-03-18 um 17:59 schrieb Bastien Nocera:
> > On Wed, 2015-03-18 at 17:42 +0100, Martin Kepplinger wrote:
> > > 
> > <snip>
> > > It could have gone to drivers/iio/accel if it would use an iio  
> > > interface, which would make more sense, you are right, but I 
> > > simply  don't have the time to merge it in to iio.
> > > 
> > > It doesn't use an input interface either but I don't see a good  
> > > place for an accelerometer that uses sysfs only.
> > > 
> > > It works well, is a relatively recent chip and a clean dirver. 
> > > But  this is all I can provide.
> > 
> > As a person who works on the user-space interaction of those with  
> > desktops [1]: Urgh.
> > 
> > I already have 3 (probably 4) types of accelerometers to contend 
> > with,  I'm not fond of adding yet another type.
> > 
> > Is there any way to get this hardware working outside the SoCs 
> > it's  designed for (say, a device with I2C like a Raspberry Pi), 
> > so that a  kind soul could handle getting this using the right 
> > interfaces?
> > 
> 
> It works on basically any SoC and is in no way limited in this 
> regard. Sure, userspace has to expicitely support it and I hear you. 
> Using the iio interface would make more sense. I can only say I'd 
> love to have the time to move this driver over. I'm very sorry.

How can we get the hardware for somebody to use on their own 
laptops/embedded boards to implement this driver?

^ permalink raw reply

* [PATCH v2] selftests: Fix build failures when invoked from kselftest target
From: Shuah Khan @ 2015-03-18 18:05 UTC (permalink / raw)
  To: linux-api, linux-kernel; +Cc: Shuah Khan

Several tests that rely on implicit build rules fail to build,
when invoked from the main Makefile kselftest target. These
failures are due to --no-builtin-rules and --no-builtin-variables
options set in the inherited MAKEFLAGS.

--no-builtin-rules eliminates the use of built-in implicit rules
and --no-builtin-variables is for not defining built-in variables.
These two options override the use of implicit rules resulting in
build failures. In addition, inherited LDFLAGS result in build
failures and there is no need to define LDFLAGS.  Clear LDFLAGS
and MAKEFLAG when make is invoked from the main Makefile kselftest
target. Fixing this at selftests Makefile avoids changing the main
Makefile and keeps this change self contained at selftests level.

Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
---
 tools/testing/selftests/Makefile | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4e51122..0db5713 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -22,6 +22,14 @@ TARGETS += vm
 TARGETS_HOTPLUG = cpu-hotplug
 TARGETS_HOTPLUG += memory-hotplug
 
+# Clear LDFLAGS and MAKEFLAGS if called from main
+# Makefile to avoid test build failures when test
+# Makefile doesn't have explicit build rules.
+ifeq (1,$(MAKELEVEL))
+undefine LDFLAGS
+override MAKEFLAGS =
+endif
+
 all:
 	for TARGET in $(TARGETS); do \
 		make -C $$TARGET; \
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH v4 00/14] Add kdbus implementation
From: Andy Lutomirski @ 2015-03-18 18:24 UTC (permalink / raw)
  To: David Herrmann
  Cc: Greg Kroah-Hartman, Arnd Bergmann, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Daniel Mack,
	Djalal Harouni
In-Reply-To: <CANq1E4Tf_6Cn+sxY=BPEbRpEr5WY+-rRX5gipz-_=4PNLa9bnQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, Mar 18, 2015 at 6:54 AM, David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi
>
> On Tue, Mar 17, 2015 at 8:24 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> [...]
>> Can anyone comment on how fast it actually is.  I'm curious
>> about:
>>
>>  - The time it takes to do the ioctl to send a message
>>
>>  - The time it takes to receive a message (poll + whatever ioctls)
>
> I'm not sure I can gather useful absolute data here. This highly
> depends on how you call it, how often you call it, what payloads you
> pass, what machine you're on.. you know all that.
>
> So here's a flamegraph for you (which I use for comparisons to UDS):
>   http://people.freedesktop.org/~dvdhrm/kdbus_8kb.svg
>
> This program sends unicast messages on kdbus and UDS, exactly the same
> number of times with the same 8kb payload. No parsing, no marshaling
> is done, just plain message passing. The interesting spikes are
> sys_read(), sys_write() and the 3 kdbus sys_ioctl(). Everything else
> should be ignored.
>
> sys_read() and sys_ioctl(KDBUS_CMD_RECV) are about the same. But note
> that we don't copy any payload in RECV, so it scales O(1) compared to
> message-size.
>
> sys_write() is about 3x faster than sys_ioctl(KDBUS_CMD_WRITE).

Is that factor of 3 for 8 kb payloads?  If so, I expect it's a factor
of much worse than 3 for small payloads.

>
> I see lots of room for improvement in both RECV and SEND. Caching the
> namespaces on a connection, would get rid of
> kdbus_queue_entry_install() in RECV, thus speeding it up by ~30%. In
> SEND, we could merge the kvec and iovec copying, to avoid calling
> shmem_begin_write() twice. We should also stop allocating management
> structures that are not used (like for metadata, if no metadata is
> transmitted). We should use stack-space for small ioctl objects,
> instead of memdup_user(). And so on.. Oh, and locking can be reduced.
> We haven't even looked at rcu, yet (though that's mostly interesting
> for policy and broadcasts, not unicasts).
>
>>  - The time it takes to transfer a memfd (I don't care about how long
>> it takes to create or map a memfd -- that's exactly the same between
>> kdbus and any other memfds user, I imagine)
>
> The time to transmit a memfd is the same as to transmit a 64-byte
> payload. Ok, you also get to install the fd into the fd-table, but
> that's true regardless of the transport.
> Here's a graph for 64byte transfers (otherwise, same as above):
>   http://people.freedesktop.org/~dvdhrm/kdbus_64b.svg
>
>>  - The time it takes to connect
>
> No idea, never measured it. Why is it of interest?

Gah, sorry, bad terminology.  I mean the time it takes to send a
message to a receiver that you haven't sent to before.

(The kdbus terminology is weird.  You don't send to "endpoints", you
don't "connect" to other participants, and it's not even clear to me
what a participant in the bus is called.)

>
>> I'm also interested in whether the current design is actually amenable
>> to good performance.  I'm told that it is, but ISTM there's a lot of
>> heavyweight stuff going on with each send operation that will be hard
>> to remove.
>
> I disagree. What heavyweight stuff is going on?

At least metadata generation, metadata free, and policy db checks seem
expensive.  It could be worth running a bunch of copies of your
benchmark on different cores and seeing how it scales.

--Andy

^ permalink raw reply

* Re: [PATCH v2] add support for Freescale's MMA8653FC 10 bit accelerometer
From: Martin Kepplinger @ 2015-03-18 18:28 UTC (permalink / raw)
  To: Bastien Nocera
  Cc: Alexander Stein, robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	pawel.moll-5wv7dgnIgG8, mark.rutland-5wv7dgnIgG8,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dmitry.torokhov-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-input-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Martin Kepplinger,
	Christoph Muellner
In-Reply-To: <1426701934.6764.10.camel-0MeiytkfxGOsTnJN9+BGXg@public.gmane.org>

Am 2015-03-18 um 19:05 schrieb Bastien Nocera:
> On Wed, 2015-03-18 at 19:02 +0100, Martin Kepplinger wrote:
>> Am 2015-03-18 um 17:59 schrieb Bastien Nocera:
>>> On Wed, 2015-03-18 at 17:42 +0100, Martin Kepplinger wrote:
>>>>
>>> <snip>
>>>> It could have gone to drivers/iio/accel if it would use an iio  
>>>> interface, which would make more sense, you are right, but I 
>>>> simply  don't have the time to merge it in to iio.
>>>>
>>>> It doesn't use an input interface either but I don't see a good  
>>>> place for an accelerometer that uses sysfs only.
>>>>
>>>> It works well, is a relatively recent chip and a clean dirver. 
>>>> But  this is all I can provide.
>>>
>>> As a person who works on the user-space interaction of those with  
>>> desktops [1]: Urgh.
>>>
>>> I already have 3 (probably 4) types of accelerometers to contend 
>>> with,  I'm not fond of adding yet another type.
>>>
>>> Is there any way to get this hardware working outside the SoCs 
>>> it's  designed for (say, a device with I2C like a Raspberry Pi), 
>>> so that a  kind soul could handle getting this using the right 
>>> interfaces?
>>>
>>
>> It works on basically any SoC and is in no way limited in this 
>> regard. Sure, userspace has to expicitely support it and I hear you. 
>> Using the iio interface would make more sense. I can only say I'd 
>> love to have the time to move this driver over. I'm very sorry.
> 
> How can we get the hardware for somebody to use on their own 
> laptops/embedded boards to implement this driver?
> 

It's connected over I2C. If the included documentation is not clear
please tell me what exacly. Thanks!

^ permalink raw reply

* [PATCH v3] add support for Freescale's MMA8653FC 10 bit accelerometer
From: Martin Kepplinger @ 2015-03-18 18:32 UTC (permalink / raw)
  To: robh+dt-DgEjT+Ai2ygdnm+yROfE0A, pawel.moll-5wv7dgnIgG8,
	mark.rutland-5wv7dgnIgG8, ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dmitry.torokhov-Re5JQEeQqe8AvxtiuMwx3w
  Cc: alexander.stein-93q1YBGzJSMe9JSWTWOYM3xStJ4P+DSV,
	hadess-0MeiytkfxGOsTnJN9+BGXg,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-input-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Martin Kepplinger,
	Christoph Muellner

From: Martin Kepplinger <martin.kepplinger-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org>

The MMA8653FC is a low-power, three-axis, capacitive micromachined
accelerometer with 10 bits of resolution with flexible user-programmable
options.

Embedded interrupt functions enable overall power savings, by relieving the
host processor from continuously polling data, for example using the poll()
system call.

The device can be configured to generate wake-up interrupt signals from any
combination of the configurable embedded functions, enabling the MMA8653FC
to monitor events while remaining in a low-power mode during periods of
inactivity.

This driver provides devicetree properties to program the device's behaviour
and a simple, tested and documented sysfs interface. The data sheet and more
information is available on Freescale's website.

Signed-off-by: Martin Kepplinger <martin.kepplinger-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org>
Signed-off-by: Christoph Muellner <christoph.muellner-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org>
---
applies to v4.0-rc4 and the current -next.

patch revision history
......................
v3 moves the driver from drivers/input/misc to drivers/misc
v2 corrects licensing and commit messages and adds appropriate recipients

 .../testing/sysfs-bus-i2c-devices-fsl-mma8653fc    |  39 +
 .../devicetree/bindings/misc/fsl,mma8653fc.txt     |  96 +++
 MAINTAINERS                                        |   5 +
 drivers/misc/Kconfig                               |  11 +
 drivers/misc/Makefile                              |   1 +
 drivers/misc/mma8653fc.c                           | 913 +++++++++++++++++++++
 6 files changed, 1065 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-i2c-devices-fsl-mma8653fc
 create mode 100644 Documentation/devicetree/bindings/misc/fsl,mma8653fc.txt
 create mode 100644 drivers/misc/mma8653fc.c

diff --git a/Documentation/ABI/testing/sysfs-bus-i2c-devices-fsl-mma8653fc b/Documentation/ABI/testing/sysfs-bus-i2c-devices-fsl-mma8653fc
new file mode 100644
index 0000000..8172c27
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-i2c-devices-fsl-mma8653fc
@@ -0,0 +1,39 @@
+What:		/sys/bus/i2c/drivers/mma8653fc/*/standby
+Date:		March 2015
+Contact:	Martin Kepplinger <martin.kepplinger-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org>
+Description:
+		Write 0 to this in order to turn on the device, and 1 to turn
+		it off. Read to see if it is turned on or off.
+
+
+What:		/sys/bus/i2c/drivers/mma8653fc/*/currentmode
+Date:		March 2015
+Contact:	Martin Kepplinger <martin.kepplinger-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org>
+Description:
+		Reading this provides the current state of the device, read
+		directly from a register. This can be "standby", "wake" or
+		"sleep".
+
+
+What:		/sys/bus/i2c/drivers/mma8653fc/*/position
+Date:		March 2015
+Contact:	Martin Kepplinger <martin.kepplinger-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org>
+Description:
+		Read only. Without interrupts enabled gets current position
+		values by reading. Poll "position" with interrupt conditions
+		set, to get notified; see Documentation/.../fsl,mma8653fc.txt
+
+		position file format:
+		"x y z [landscape/portrait status] [front/back status]"
+
+		x y z values:
+			in mg
+		landscape/portrait status char:
+			r	landscape right
+			d	portrait down
+			u	portrait up
+			l	landscape left
+		front/back status char:
+			f	front facing
+			b	back facing
+
diff --git a/Documentation/devicetree/bindings/misc/fsl,mma8653fc.txt b/Documentation/devicetree/bindings/misc/fsl,mma8653fc.txt
new file mode 100644
index 0000000..3921acb
--- /dev/null
+++ b/Documentation/devicetree/bindings/misc/fsl,mma8653fc.txt
@@ -0,0 +1,96 @@
+Freescale MMA8653FC 3-axis Accelerometer
+
+Required properties:
+- compatible
+	"fsl,mma8653fc"
+- reg
+	I2C address
+
+Optional properties:
+
+- interrupt-parent
+	a phandle for the interrupt controller (see
+	Documentation/devicetree/bindings/interrupt-controller/interrupts.txt)
+- interrupts
+	interrupt line to which the chip is connected
+- int1
+	set to use interrupt line 1 instead of 2
+- int_active_high
+	set interrupt line active high
+- ir_freefall_motion_x
+	activate freefall/motion interrupts on x axis
+- ir_freefall_motion_y
+	activate freefall/motion interrupts on y axis
+- ir_freefall_motion_z
+	activate freefall/motion interrupts on z axis
+- irq_threshold
+	0 < value < 8000: threshold for motion interrupts in mg
+- ir_landscape_portrait
+	activate landscape/portrait interrupts
+- ir_data_ready:
+	activate data-ready interrupts
+	Interrupt events can be activated in any combination.
+- range
+	2, 4, or 8: range in g, default: 2
+- auto_wake_sleep
+	auto sleep mode (lower frequency)
+- motion_mode
+	use motion mode instead of freefall mode (trigger if >threshold).
+	per default an interrupt occurs if motion values fall below the
+	value set in "threshold" and therefore can detect free fall on the
+	vertical axis (depending on the position of the device).
+	Setting this values inverts the behaviour and an interrupt occurs
+	above the threshold value, so usually activate horizontal axis in
+	this case.
+
+- x-offset
+	0 < value < 500: calibration offset in mg
+	this value has an offset of 250 itself:
+	0 is -250mg, 250 is 0 mg, 500 is 250mg
+- y-offset
+	see x-offset
+- z-offset
+	see x-offset
+
+Example 1:
+for a device laying on flat ground to recognize acceleration over 100mg.
+x-axis is calibrated to +10mg. Adapt interrupt line to your device.
+
+mma8653fc@1d {
+		compatible = "fsl,mma8653fc";
+		interrupt-parent = <&gpio1>;
+		interrupts = <5 0>;
+		reg = <0x1d>;
+
+		range = <2>;
+		motion_mode;
+		ir_freefall_motion_x;
+		ir_freefall_motion_y;
+		irq_threshold = <100>;
+		x-offset = <160>;
+};
+
+Example 2:
+for a device mounted on a wall with y being the vertical axis. This recognizes
+y-acceleration below 800mg, so free fall or changing the orientation of the
+device (y not being the vertical axis and having ~1000mg anymore).
+
+mma8653fc@1d {
+		compatible = "fsl,mma8653fc";
+		interrupt-parent = <&gpio1>;
+		interrupts = <5 0>;
+		reg = <0x1d>;
+
+		range = <2>;
+		ir_freefall_motion_y;
+		irq_threshold = <800>;
+};
+
+Example 3:
+minimal example. lets users read current acceleration values. No polling
+is available.
+
+mma8653fc@1d {
+		compatible = "fsl,mma8653fc";
+		reg = <0x1d>;
+};
diff --git a/MAINTAINERS b/MAINTAINERS
index 0e1abe8..04c080d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4104,6 +4104,11 @@ S:	Maintained
 F:	include/linux/platform_data/video-imxfb.h
 F:	drivers/video/fbdev/imxfb.c
 
+FREESCALE MMA8653FC ACCELEROMETER I2C DRIVER
+M:	Martin Kepplinger <martin.kepplinger-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org>
+S:	Maintained
+F:	drivers/input/misc/mma8653fc.c
+
 FREESCALE QUAD SPI DRIVER
 M:	Han Xu <han.xu-KZfg59tc24xl57MIdRCFDg@public.gmane.org>
 L:	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 006242c..05ef1c1 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -324,6 +324,17 @@ config ISL29020
 	  This driver can also be built as a module.  If so, the module
 	  will be called isl29020.
 
+config MMA8653FC
+	tristate "MMA8653FC - Freescale's 3-Axis, 10-bit Digital Accelerometer"
+	depends on I2C
+	default n
+	help
+	  Say Y here if you want to support Freescale's MMA8653FC Accelerometer
+	  through I2C interface.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called mma8653fc.
+
 config SENSORS_TSL2550
 	tristate "Taos TSL2550 ambient light sensor"
 	depends on I2C && SYSFS
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 7d5c4cd..044c8fc 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_ICS932S401)	+= ics932s401.o
 obj-$(CONFIG_LKDTM)		+= lkdtm.o
 obj-$(CONFIG_TIFM_CORE)       	+= tifm_core.o
 obj-$(CONFIG_TIFM_7XX1)       	+= tifm_7xx1.o
+obj-$(CONFIG_MMA8653FC)		+= mma8653fc.o
 obj-$(CONFIG_PHANTOM)		+= phantom.o
 obj-$(CONFIG_SENSORS_BH1780)	+= bh1780gli.o
 obj-$(CONFIG_SENSORS_BH1770)	+= bh1770glc.o
diff --git a/drivers/misc/mma8653fc.c b/drivers/misc/mma8653fc.c
new file mode 100644
index 0000000..aba1e5a
--- /dev/null
+++ b/drivers/misc/mma8653fc.c
@@ -0,0 +1,913 @@
+/*
+ * mma8653fc.c - Support for Freescale MMA8653FC 3-axis 10-bit accelerometer
+ *
+ * Copyright (c) 2014 Theobroma Systems Design and Consulting GmbH
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/device.h>
+#include <linux/delay.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/i2c.h>
+#include <linux/types.h>
+#include <linux/of_device.h>
+#include <linux/of_irq.h>
+
+#define DRV_NAME	"mma8653fc"
+#define MMA8653FC_DEVICE_ID	0x5a
+
+#define MMA8653FC_STATUS	0x00
+
+#define ZYXOW_MASK	0x80
+#define ZYXDR		0x08
+
+#define MMA8653FC_WHO_AM_I	0x0d
+
+#define MMA8653FC_SYSMOD	0x0b
+#define STATE_STANDBY	0x00
+#define STATE_WAKE	0x01
+#define STATE_SLEEP	0x02
+
+#define MMA8450_STATUS_ZXYDR	0x08
+
+/*
+ * 10 bit output data registers
+ * MSB: 7:0 bits 9:2 of data word
+ * LSB: 7:6 bits 1:0 of data word
+ */
+#define MMA8653FC_OUT_X_MSB	0x01
+#define MMA8653FC_OUT_X_LSB	0x02
+#define MMA8653FC_OUT_Y_MSB	0x03
+#define MMA8653FC_OUT_Y_LSB	0x04
+#define MMA8653FC_OUT_Z_MSB	0x05
+#define MMA8653FC_OUT_Z_LSB	0x06
+
+/*
+ * Portrait/Landscape Status
+ */
+#define MMA8653FC_PL_STATUS	0x10
+
+#define NEWLP		0x80
+#define LAPO_HIGH	0x04
+#define LAPO_LOW	0x02
+#define BAFRO		0x01
+
+/*
+ * Portrait/Landscape Configuration
+ */
+#define MMA8653FC_PL_CFG	0x11
+
+#define PL_EN		(1 << 6)
+
+/*
+ * Data calibration registers
+ */
+#define MMA8653FC_OFF_X		0x2f
+#define MMA8653FC_OFF_Y		0x30
+#define MMA8653FC_OFF_Z		0x31
+
+/* 0 to 500 for dts, but -250 to 250 in mg */
+#define DEFAULT_OFF	250
+
+/*
+ * bits 1:0 dynamic range
+ * 00 +/- 2g
+ * 01 +/- 4g
+ * 10 +/- 8g
+ *
+ * HPF_Out bit 4 - data high pass or low pass filtered
+ */
+#define MMA8653FC_XYZ_DATA_CFG	0x0e
+
+#define RANGE_MASK	0x03
+#define RANGE2G		0x00
+#define RANGE4G		0x01
+#define RANGE8G		0x02
+/* values for calculation */
+#define SHIFT_2G	8
+#define INCR_2G		128
+#define SHIFT_4G	7
+#define INCR_4G		64
+#define SHIFT_8G	6
+#define INCR_8G		32
+#define DYN_RANGE_2G	2
+#define DYN_RANGE_4G	4
+#define DYN_RANGE_8G	8
+
+/*
+ * System Control Reg 1
+ */
+#define MMA8653FC_CTRL_REG1	0x2a
+
+#define ACTIVE_BIT	(1 << 0)
+#define ODR_MASK	0x38
+#define ODR_DEFAULT	0x20 /* 50 Hz */
+#define ASLP_RATE_MASK	0xc0
+#define ASLP_RATE_DEFAULT 0x80 /* 6.25 Hz */
+
+/*
+ * Sys Control Reg 2
+ *
+ * auto-sleep enable, software reset
+ */
+#define MMA8653FC_CTRL_REG2	0x2b
+
+#define SLPE		(1 << 2)
+#define SELFTEST	(1 << 7)
+#define SOFT_RESET	(1 << 6)
+
+/*
+ * Interrupt Source
+ */
+#define MMA8653FC_INT_SOURCE	0x0c
+
+#define SRC_ASLP	(1 << 7)
+#define SRC_LNDPRT	(1 << 4)
+#define SRC_FF_MT	(1 << 2)
+#define SRC_DRDY	(1 << 0)
+
+/*
+ * Interrupt Control Register
+ *
+ * default: active low
+ * default push-pull, not open-drain
+ */
+#define MMA8653FC_CTRL_REG3	0x2c
+
+#define WAKE_LNDPRT	(1 << 5)
+#define WAKE_FF_MT	(1 << 3)
+#define IPOL		(1 << 1)
+#define PP_OD		(1 << 0)
+
+/*
+ * Interrupt Enable Register
+ */
+#define MMA8653FC_CTRL_REG4	0x2d
+
+#define INT_EN_ASLP	(1 << 7)
+#define INT_EN_LNDPRT	(1 << 4)
+#define INT_EN_FF_MT	(1 << 2)
+#define INT_EN_DRDY	(1 << 0)
+
+/*
+ * Interrupt Configuration Register
+ * bit value 0 ... INT2 (default)
+ * bit value 1 ... INT1
+ */
+#define MMA8653FC_CTRL_REG5	0x2e
+
+#define INT_CFG_ASLP	(1 << 7)
+#define INT_CFG_LNDPRT	(1 << 4)
+#define INT_CFG_FF_MT	(1 << 2)
+#define INT_CFG_DRDY	(1 << 0)
+
+/*
+ * Freefall / Motion Configuration Register
+ *
+ * Event Latch enable/disable, motion or freefall mode
+ * and event flag enable per axis
+ */
+#define MMA8653FC_FF_MT_CFG	0x15
+
+#define FF_MT_CFG_ELE	(1 << 7)
+#define FF_MT_CFG_OAE	(1 << 6)
+#define FF_MT_CFG_ZEFE	(1 << 5)
+#define FF_MT_CFG_YEFE	(1 << 4)
+#define FF_MT_CFG_XEFE	(1 << 3)
+
+/*
+ * Freefall / Motion Source Register
+ */
+#define MMA8653FC_FF_MT_SRC	0x16
+
+/*
+ * Freefall / Motion Threshold Register
+ *
+ * define motion threshold
+ * 0.063 g/LSB, 127 counts(0x7f) (7 bit from LSB)
+ * range: 0.063g - 8g
+ */
+#define MMA8653FC_FF_MT_THS	0x17
+
+struct axis_triple {
+	s16 x;
+	s16 y;
+	s16 z;
+};
+
+struct mma8653fc_pdata {
+	s8 x_axis_offset;
+	s8 y_axis_offset;
+	s8 z_axis_offset;
+	bool auto_wake_sleep;
+	u32 range;
+	bool int1;
+	bool int_active_high;
+	bool motion_mode;
+	u8 freefall_motion_thr;
+	bool int_src_data_ready;
+	bool int_src_ff_mt_x;
+	bool int_src_ff_mt_y;
+	bool int_src_ff_mt_z;
+	bool int_src_lndprt;
+	bool int_src_aslp;
+};
+
+struct mma8653fc {
+	struct i2c_client *client;
+	struct mutex mutex;
+	struct mma8653fc_pdata pdata;
+	struct axis_triple saved;
+	char orientation;
+	char bafro;
+	bool standby;
+	int irq;
+	unsigned int_mask;
+	u8 (*read)(struct mma8653fc *, unsigned char);
+	int (*write)(struct mma8653fc *, unsigned char, unsigned char);
+};
+
+/* defaults */
+static const struct mma8653fc_pdata mma8653fc_default_init = {
+	.range = 2,
+	.x_axis_offset = DEFAULT_OFF,
+	.y_axis_offset = DEFAULT_OFF,
+	.z_axis_offset = DEFAULT_OFF,
+	.auto_wake_sleep = false,
+	.int1 = false,
+	.int_active_high = false,
+	.motion_mode = false,
+	.freefall_motion_thr = 5,
+	.int_src_data_ready = false,
+	.int_src_ff_mt_x = false,
+	.int_src_ff_mt_y = false,
+	.int_src_ff_mt_z = false,
+	.int_src_lndprt = false,
+	.int_src_aslp = false,
+};
+
+static void mma8653fc_get_triple(struct mma8653fc *mma)
+{
+	u8 buf[6];
+	u8 status;
+
+	buf[0] = 0;
+
+	status = mma->read(mma, MMA8653FC_STATUS);
+	if (status & ZYXOW_MASK)
+		dev_dbg(&mma->client->dev, "previous read not completed\n");
+
+	buf[0] = mma->read(mma, MMA8653FC_OUT_X_MSB);
+	buf[1] = mma->read(mma, MMA8653FC_OUT_X_LSB);
+	buf[2] = mma->read(mma, MMA8653FC_OUT_Y_MSB);
+	buf[3] = mma->read(mma, MMA8653FC_OUT_Y_LSB);
+	buf[4] = mma->read(mma, MMA8653FC_OUT_Z_MSB);
+	buf[5] = mma->read(mma, MMA8653FC_OUT_Z_LSB);
+
+	mutex_lock(&mma->mutex);
+	/* move from registers to s16 */
+	mma->saved.x = (buf[1] | (buf[0] << 8)) >> 6;
+	mma->saved.y = (buf[3] | (buf[2] << 8)) >> 6;
+	mma->saved.z = (buf[5] | (buf[4] << 8)) >> 6;
+	mma->saved.x = sign_extend32(mma->saved.x, 9);
+	mma->saved.y = sign_extend32(mma->saved.y, 9);
+	mma->saved.z = sign_extend32(mma->saved.z, 9);
+
+	/* calc g, see data sheet and application note */
+	switch (mma->pdata.range) {
+	case DYN_RANGE_2G:
+		mma->saved.x = le16_to_cpu((1000 * mma->saved.x +
+					    INCR_2G) >> SHIFT_2G);
+		mma->saved.y = le16_to_cpu((1000 * mma->saved.y +
+					   INCR_2G) >> SHIFT_2G);
+		mma->saved.z = le16_to_cpu((1000 * mma->saved.z +
+					   INCR_2G) >> SHIFT_2G);
+		break;
+	case DYN_RANGE_4G:
+		mma->saved.x = le16_to_cpu((1000 * mma->saved.x +
+					   INCR_4G) >> SHIFT_4G);
+		mma->saved.y = le16_to_cpu((1000 * mma->saved.y +
+					   INCR_4G) >> SHIFT_4G);
+		mma->saved.z = le16_to_cpu((1000 * mma->saved.z +
+					   INCR_4G) >> SHIFT_4G);
+		break;
+	case DYN_RANGE_8G:
+		mma->saved.x = le16_to_cpu((1000 * mma->saved.x +
+					   INCR_8G) >> SHIFT_8G);
+		mma->saved.y = le16_to_cpu((1000 * mma->saved.y +
+					   INCR_8G) >> SHIFT_8G);
+		mma->saved.z = le16_to_cpu((1000 * mma->saved.z +
+					   INCR_8G) >> SHIFT_8G);
+		break;
+	default:
+		dev_err(&mma->client->dev, "internal data corrupt\n");
+	}
+	mutex_unlock(&mma->mutex);
+}
+
+static void mma8653fc_get_orientation(struct mma8653fc *mma, u8 byte)
+{
+	if ((byte & LAPO_HIGH) && !(LAPO_LOW))
+		mma->orientation = 'r'; /* landscape right */
+	if (!(byte & LAPO_HIGH) && (byte & LAPO_LOW))
+		mma->orientation = 'd'; /* portrait down */
+	if (!(byte & LAPO_HIGH) && !(byte & LAPO_LOW))
+		mma->orientation = 'u'; /* portrait up */
+	if ((byte & LAPO_HIGH) && (byte & LAPO_LOW))
+		mma->orientation = 'l'; /* landscape left */
+
+	if (byte & BAFRO)
+		mma->bafro = 'b'; /* back facing */
+	else
+		mma->bafro = 'f'; /* front facing */
+}
+
+static irqreturn_t mma8653fc_irq(int irq, void *handle)
+{
+	struct mma8653fc *mma = handle;
+	u8 int_src;
+	u8 byte;
+
+	int_src = mma->read(mma, MMA8653FC_INT_SOURCE);
+	if (int_src & SRC_DRDY)
+		/* data ready handle */
+	if (int_src & SRC_FF_MT) {
+		/* freefall/motion change handle */
+		dev_dbg(&mma->client->dev,
+			"freefall or motion change\n");
+		byte = mma->read(mma, MMA8653FC_FF_MT_SRC);
+	}
+	if (int_src & SRC_LNDPRT) {
+		/* landscape/portrait change handle */
+		dev_dbg(&mma->client->dev,
+			"landscape / portrait change\n");
+		byte = mma->read(mma, MMA8653FC_PL_STATUS);
+		mma8653fc_get_orientation(mma, byte);
+	}
+	if (int_src & SRC_ASLP)
+		/* autosleep change handle */
+	mma8653fc_get_triple(mma);
+
+	sysfs_notify(&mma->client->dev.kobj, NULL, "position");
+
+	return IRQ_HANDLED;
+}
+
+static void __mma8653fc_enable(struct mma8653fc *mma)
+{
+	u8 byte;
+
+	byte = mma->read(mma, MMA8653FC_CTRL_REG1);
+	mma->write(mma, MMA8653FC_CTRL_REG1, byte | ACTIVE_BIT);
+}
+
+static void __mma8653fc_disable(struct mma8653fc *mma)
+{
+	u8 byte;
+
+	byte = mma->read(mma, MMA8653FC_CTRL_REG1);
+	mma->write(mma, MMA8653FC_CTRL_REG1, byte & ~ACTIVE_BIT);
+}
+
+static ssize_t mma8653fc_standby_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct i2c_client *client = to_i2c_client(dev);
+	struct mma8653fc *mma = i2c_get_clientdata(client);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", mma->standby);
+}
+
+static ssize_t mma8653fc_standby_store(struct device *dev,
+				     struct device_attribute *attr,
+				     const char *buf, size_t count)
+{
+	struct i2c_client *client = to_i2c_client(dev);
+	struct mma8653fc *mma = i2c_get_clientdata(client);
+	unsigned int val;
+	int error;
+
+	error = kstrtouint(buf, 10, &val);
+	if (error)
+		return error;
+
+	mutex_lock(&mma->mutex);
+
+	if (val) {
+		if (!mma->standby)
+			__mma8653fc_disable(mma);
+	} else {
+		if (mma->standby)
+			__mma8653fc_enable(mma);
+	}
+
+	mma->standby = !!val;
+
+	mutex_unlock(&mma->mutex);
+
+	return count;
+}
+
+static DEVICE_ATTR(standby, 0664, mma8653fc_standby_show,
+				  mma8653fc_standby_store);
+
+static ssize_t mma8653fc_currentmode_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct i2c_client *client = to_i2c_client(dev);
+	struct mma8653fc *mma = i2c_get_clientdata(client);
+	ssize_t count;
+	u8 byte;
+
+	byte = mma->read(mma, MMA8653FC_SYSMOD);
+	if (byte < 0)
+		return byte;
+
+	switch (byte) {
+	case STATE_STANDBY:
+		count = scnprintf(buf, PAGE_SIZE, "standby\n");
+		break;
+	case STATE_WAKE:
+		count = scnprintf(buf, PAGE_SIZE, "wake\n");
+		break;
+	case STATE_SLEEP:
+		count = scnprintf(buf, PAGE_SIZE, "sleep\n");
+		break;
+	default:
+		count = scnprintf(buf, PAGE_SIZE, "unknown\n");
+	}
+	return count;
+}
+static DEVICE_ATTR(currentmode, S_IRUGO, mma8653fc_currentmode_show, NULL);
+
+static ssize_t mma8653fc_position_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct i2c_client *client = to_i2c_client(dev);
+	struct mma8653fc *mma = i2c_get_clientdata(client);
+	ssize_t count;
+	u8 byte;
+
+	if (!mma->irq) {
+		byte = mma->read(mma, MMA8653FC_PL_STATUS);
+		if (byte & NEWLP)
+			mma8653fc_get_orientation(mma, byte);
+
+		byte = mma->read(mma, MMA8653FC_STATUS);
+		if (byte & ZYXDR)
+			mma8653fc_get_triple(mma);
+
+		msleep(20);
+		dev_dbg(&client->dev, "data polled\n");
+	}
+	mutex_lock(&mma->mutex);
+	count = scnprintf(buf, PAGE_SIZE, "%d %d %d %c %c\n",
+			mma->saved.x, mma->saved.y, mma->saved.z,
+			mma->orientation, mma->bafro);
+	mutex_unlock(&mma->mutex);
+
+	return count;
+}
+static DEVICE_ATTR(position, S_IRUGO, mma8653fc_position_show, NULL);
+
+static struct attribute *mma8653fc_attributes[] = {
+	&dev_attr_position.attr,
+	&dev_attr_standby.attr,
+	&dev_attr_currentmode.attr,
+	NULL,
+};
+
+static const struct attribute_group mma8653fc_attr_group = {
+	.attrs = mma8653fc_attributes,
+};
+
+static u8 mma8653fc_read(struct mma8653fc *mma, unsigned char reg)
+{
+	u8 val;
+
+	val = i2c_smbus_read_byte_data(mma->client, reg);
+	if (val < 0) {
+		dev_err(&mma->client->dev,
+			"failed to read %x register\n", reg);
+	}
+	return val;
+}
+
+static int mma8653fc_write(struct mma8653fc *mma, unsigned char reg,
+			   unsigned char val)
+{
+	int ret;
+
+	ret = i2c_smbus_write_byte_data(mma->client, reg, val);
+	if (ret) {
+		dev_err(&mma->client->dev,
+			"failed to write %x register\n", reg);
+	}
+	return ret;
+}
+
+static const struct of_device_id mma8653fc_dt_ids[] = {
+	{ .compatible = "fsl,mma8653fc", },
+	{ }
+};
+MODULE_DEVICE_TABLE(of, mma8653fc_dt_ids);
+
+static const struct mma8653fc_pdata *mma8653fc_probe_dt(struct device *dev)
+{
+	struct mma8653fc_pdata *pdata;
+	struct device_node *node = dev->of_node;
+	const struct of_device_id *match;
+	u32 testu32;
+	s32 tests32;
+
+	if (!node) {
+		dev_err(dev, "no associated DT data\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	match = of_match_device(mma8653fc_dt_ids, dev);
+	if (!match) {
+		dev_err(dev, "unknown device model\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	pdata = devm_kzalloc(dev, sizeof(*pdata), GFP_KERNEL);
+	if (!pdata)
+		return ERR_PTR(-ENOMEM);
+
+	*pdata = mma8653fc_default_init;
+
+	/* overwrite from dts */
+	testu32 = pdata->x_axis_offset;
+	tests32 = 0;
+	of_property_read_u32(node, "x-offset", &testu32);
+	tests32 = testu32 - DEFAULT_OFF;
+	if ((tests32) && (tests32 <= DEFAULT_OFF) &&
+	    (tests32 >= -DEFAULT_OFF)) {
+		dev_info(dev, "use %dmg offset on X axis\n", tests32);
+		/* calc register value, resolution: 1.96mg */
+		pdata->x_axis_offset = (s8) (tests32 / 2);
+	}
+	testu32 = pdata->y_axis_offset;
+	tests32 = 0;
+	of_property_read_u32(node, "y-offset", &testu32);
+	tests32 = testu32 - DEFAULT_OFF;
+	if ((tests32) && (tests32 <= DEFAULT_OFF) &&
+	    (tests32 >= -DEFAULT_OFF)) {
+		dev_info(dev, "use %dmg offset on Y axis\n", tests32);
+		pdata->y_axis_offset = (s8) (tests32 / 2);
+	}
+	testu32 = pdata->z_axis_offset;
+	tests32 = 0;
+	of_property_read_u32(node, "z-offset", &testu32);
+	tests32 = testu32 - DEFAULT_OFF;
+	if ((tests32) && (tests32 <= DEFAULT_OFF) &&
+	    (tests32 >= -DEFAULT_OFF)) {
+		dev_info(dev, "use %dmg offset on Z axis\n", tests32);
+		pdata->z_axis_offset = (s8) (tests32 / 2);
+	}
+
+	testu32 = 0;
+	of_property_read_u32(node, "range", &testu32);
+	if ((testu32) && (testu32 != 2) && (testu32 != 4) && (testu32 != 8)) {
+		dev_warn(dev, "wrong value for full scale range in dtb\n");
+	} else {
+		if (testu32)
+			pdata->range = testu32;
+	}
+
+	if (of_property_read_bool(node, "auto_wake_sleep"))
+		pdata->auto_wake_sleep = true;
+
+	if (of_property_read_bool(node, "int1"))
+		pdata->int1 = true;
+
+	if (of_property_read_bool(node, "int_active_high"))
+		pdata->int_active_high = true;
+
+	if (of_property_read_bool(node, "motion_mode"))
+		pdata->motion_mode = true;
+
+	if (of_property_read_bool(node, "ir_data_ready"))
+		pdata->int_src_data_ready = true;
+
+	if (of_property_read_bool(node, "ir_freefall_motion_x"))
+		pdata->int_src_ff_mt_x = true;
+
+	if (of_property_read_bool(node, "ir_freefall_motion_y"))
+		pdata->int_src_ff_mt_y = true;
+
+	if (of_property_read_bool(node, "ir_freefall_motion_z"))
+		pdata->int_src_ff_mt_z = true;
+
+	if (of_property_read_bool(node, "ir_auto_wake"))
+		pdata->int_src_aslp = true;
+
+	if (of_property_read_bool(node, "ir_landscape_portrait"))
+		pdata->int_src_lndprt = true;
+
+	testu32 = 0;
+	of_property_read_u32(node, "irq_threshold", &testu32);
+	/* always uses maximum range +/- 8000g, resolution 63mg */
+	if ((testu32 <= 8000) && (testu32 > 0)) {
+		dev_dbg(dev, "use freefall / motion threshold %dmg\n",
+			 testu32);
+		/* calculate register value from mg */
+		pdata->freefall_motion_thr = (testu32 / 63) + 1;
+	}
+
+	return pdata;
+}
+static int mma8653fc_i2c_probe(struct i2c_client *client,
+				       const struct i2c_device_id *id)
+{
+	struct mma8653fc *mma;
+	const struct mma8653fc_pdata *pdata;
+	int err;
+	u8 byte;
+
+	err = i2c_check_functionality(client->adapter,
+			I2C_FUNC_SMBUS_BYTE_DATA);
+	if (!err) {
+		dev_err(&client->dev, "SMBUS Byte Data not Supported\n");
+		return -EIO;
+	}
+
+	mma = kzalloc(sizeof(*mma), GFP_KERNEL);
+	if (!mma) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+
+	pdata = dev_get_platdata(&client->dev);
+	if (!pdata) {
+		pdata = mma8653fc_probe_dt(&client->dev);
+		if (IS_ERR(pdata)) {
+			err = PTR_ERR(pdata);
+			pr_err("pdata from DT failed\n");
+			goto err_free_mem;
+		}
+	}
+	mma->pdata = *pdata;
+	pdata = &mma->pdata;
+	mma->client = client;
+	mma->read = &mma8653fc_read;
+	mma->write = &mma8653fc_write;
+
+	mutex_init(&mma->mutex);
+
+	i2c_set_clientdata(client, mma);
+
+	err = sysfs_create_group(&client->dev.kobj, &mma8653fc_attr_group);
+	if (err)
+		goto err_free_mem;
+
+	mma->irq = irq_of_parse_and_map(client->dev.of_node, 0);
+	if (!mma->irq) {
+		dev_err(&client->dev, "Unable to parse or map IRQ\n");
+		goto no_irq;
+	}
+
+	err = irq_set_irq_type(mma->irq, IRQ_TYPE_EDGE_FALLING);
+	if (err) {
+		dev_err(&client->dev, "set_irq_type failed\n");
+		goto err_free_mem;
+	}
+
+	err = request_threaded_irq(mma->irq, NULL, mma8653fc_irq, IRQF_ONESHOT,
+				   dev_name(&mma->client->dev), mma);
+	if (err) {
+		dev_err(&client->dev, "irq %d busy?\n", mma->irq);
+		goto err_free_mem;
+	}
+
+	if (mma->write(mma, MMA8653FC_CTRL_REG2, SOFT_RESET))
+		goto err_free_irq;
+
+	__mma8653fc_disable(mma);
+	mma->standby = true;
+
+	/* enable desired interrupts */
+	mma->orientation = '\0';
+	mma->bafro = '\0';
+	byte = 0;
+	if (pdata->int_src_data_ready) {
+		byte |= INT_EN_DRDY;
+		dev_dbg(&client->dev, "DATA READY interrupt source enabled\n");
+	}
+	if (pdata->int_src_ff_mt_x || pdata->int_src_ff_mt_y ||
+	    pdata->int_src_ff_mt_z) {
+		byte |= INT_EN_FF_MT;
+		dev_dbg(&client->dev, "FF MT interrupt source enabled\n");
+	}
+	if (pdata->int_src_lndprt) {
+		if (mma->write(mma, MMA8653FC_PL_CFG, PL_EN))
+			goto err_free_irq;
+		byte |= INT_EN_LNDPRT;
+		dev_dbg(&client->dev, "LNDPRT interrupt source enabled\n");
+	}
+	if (pdata->int_src_aslp) {
+		byte |= INT_EN_ASLP;
+		dev_dbg(&client->dev, "ASLP interrupt source enabled\n");
+	}
+	if (mma->write(mma, MMA8653FC_CTRL_REG4, byte))
+		goto err_free_irq;
+
+	/* force everything to line 1 */
+	if (pdata->int1) {
+		if (mma->write(mma, MMA8653FC_CTRL_REG5,
+				 (INT_CFG_ASLP | INT_CFG_LNDPRT |
+				 INT_CFG_FF_MT | INT_CFG_DRDY)))
+			goto err_free_irq;
+		dev_dbg(&client->dev, "using interrupt line 1\n");
+	}
+
+	/* force active high */
+	if (pdata->int_active_high) {
+		byte = mma->read(mma, MMA8653FC_CTRL_REG3);
+		if (byte < 0)
+			goto err_free_irq;
+		byte |= IPOL;
+		if (mma->write(mma, MMA8653FC_CTRL_REG3, byte))
+			goto err_free_irq;
+		dev_dbg(&client->dev, "using active high on interrupt line\n");
+	}
+no_irq:
+	/* range mode */
+	byte = mma->read(mma, MMA8653FC_XYZ_DATA_CFG);
+	byte &= ~RANGE_MASK;
+	switch (pdata->range) {
+	case DYN_RANGE_2G:
+		byte |= RANGE2G;
+		dev_dbg(&client->dev, "use 2g range\n");
+		break;
+	case DYN_RANGE_4G:
+		byte |= RANGE4G;
+		dev_dbg(&client->dev, "use 4g range\n");
+		break;
+	case DYN_RANGE_8G:
+		byte |= RANGE8G;
+		dev_dbg(&client->dev, "use 8g range\n");
+		break;
+	default:
+		dev_err(&client->dev, "wrong range mode value\n");
+		goto err_free_irq;
+	}
+	if (mma->write(mma, MMA8653FC_XYZ_DATA_CFG, byte))
+		goto err_free_irq;
+
+	/* data calibration offsets */
+	if (pdata->x_axis_offset) {
+		if (mma->write(mma, MMA8653FC_OFF_X, pdata->x_axis_offset))
+			goto err_free_irq;
+	}
+	if (pdata->y_axis_offset) {
+		if (mma->write(mma, MMA8653FC_OFF_Y, pdata->y_axis_offset))
+			goto err_free_irq;
+	}
+	if (pdata->z_axis_offset) {
+		if (mma->write(mma, MMA8653FC_OFF_Z, pdata->z_axis_offset))
+			goto err_free_irq;
+	}
+
+	/* if autosleep, wake on both landscape and motion changes */
+	if (pdata->auto_wake_sleep) {
+		byte = 0;
+		byte |= WAKE_LNDPRT;
+		byte |= WAKE_FF_MT;
+		if (mma->write(mma, MMA8653FC_CTRL_REG3, byte))
+			goto err_free_irq;
+		if (mma->write(mma, MMA8653FC_CTRL_REG2, SLPE))
+			goto err_free_irq;
+		dev_dbg(&client->dev, "auto sleep enabled\n");
+	}
+
+	/* data rates */
+	byte = 0;
+	byte = mma->read(mma, MMA8653FC_CTRL_REG1);
+	if (byte < 0)
+		goto err_free_irq;
+	byte &= ~ODR_MASK;
+	byte |= ODR_DEFAULT;
+	byte &= ~ASLP_RATE_MASK;
+	byte |= ASLP_RATE_DEFAULT;
+	if (mma->write(mma, MMA8653FC_CTRL_REG1, byte))
+		goto err_free_irq;
+
+	/* freefall / motion config */
+	byte = 0;
+	if (pdata->motion_mode) {
+		byte |= FF_MT_CFG_OAE;
+		dev_dbg(&client->dev, "detect motion instead of freefall\n");
+	}
+	byte |= FF_MT_CFG_ELE;
+	if (pdata->int_src_ff_mt_x)
+		byte |= FF_MT_CFG_XEFE;
+	if (pdata->int_src_ff_mt_y)
+		byte |= FF_MT_CFG_YEFE;
+	if (pdata->int_src_ff_mt_z)
+		byte |= FF_MT_CFG_ZEFE;
+	if (mma->write(mma, MMA8653FC_FF_MT_CFG, byte))
+		goto err_free_irq;
+
+	if (pdata->freefall_motion_thr) {
+		if (mma->write(mma, MMA8653FC_FF_MT_THS,
+				 pdata->freefall_motion_thr))
+			goto err_free_irq;
+		/* calculate back to mg */
+		dev_dbg(&client->dev, "threshold set to %dmg\n",
+			 (63 * pdata->freefall_motion_thr) - 1);
+	}
+
+	byte = mma->read(mma, MMA8653FC_WHO_AM_I);
+	if (byte != MMA8653FC_DEVICE_ID) {
+		dev_err(&client->dev, "wrong device for driver\n");
+		goto err_free_irq;
+	}
+	dev_info(&client->dev,
+		 "accelerometer driver loaded. device id %x\n", byte);
+
+	return 0;
+
+ err_free_irq:
+	if (mma->irq)
+		free_irq(mma->irq, mma);
+ err_free_mem:
+	kfree(mma);
+ err_out:
+	return err;
+}
+
+static int mma8653fc_remove(struct i2c_client *client)
+{
+	struct mma8653fc *mma = i2c_get_clientdata(client);
+
+	free_irq(mma->irq, mma);
+	dev_dbg(&client->dev, "unregistered accelerometer\n");
+	kfree(mma);
+
+	return 0;
+}
+
+#ifdef CONFIG_PM_SLEEP
+static int mma8653fc_suspend(struct device *dev)
+{
+	struct i2c_client *client = to_i2c_client(dev);
+	struct mma8653fc *mma = i2c_get_clientdata(client);
+
+	__mma8653fc_disable(mma);
+
+	return 0;
+}
+
+static int mma8653fc_resume(struct device *dev)
+{
+	struct i2c_client *client = to_i2c_client(dev);
+	struct mma8653fc *mma = i2c_get_clientdata(client);
+
+	__mma8653fc_enable(mma);
+
+	return 0;
+}
+
+static SIMPLE_DEV_PM_OPS(mma8653fc_pm_ops, mma8653fc_suspend, mma8653fc_resume);
+#define MMA8653FC_PM_OPS (&mma8653fc_pm_ops)
+#else
+#define MMA8653FC_PM_OPS NULL
+#endif
+
+static const struct i2c_device_id mma8653fc_id[] = {
+	{ DRV_NAME, 0 },
+	{ }
+};
+MODULE_DEVICE_TABLE(i2c, mma8653fc_id);
+
+static struct i2c_driver mma8653fc_driver = {
+	.driver = {
+		.name = DRV_NAME,
+		.owner = THIS_MODULE,
+		.of_match_table = mma8653fc_dt_ids,
+		.pm = MMA8653FC_PM_OPS,
+	},
+	.probe    = mma8653fc_i2c_probe,
+	.remove   = mma8653fc_remove,
+	.id_table = mma8653fc_id,
+};
+
+module_i2c_driver(mma8653fc_driver);
+
+MODULE_AUTHOR("Martin Kepplinger <martin.kepplinger-SN7IsUiht6C/RdPyistoZJqQE7yCjDx5@public.gmane.org");
+MODULE_DESCRIPTION("Freescale's MMA8653FC Three-Axis Accelerometer I2C Driver");
+MODULE_LICENSE("GPL");
-- 
2.1.4

^ permalink raw reply related

* [v10 0/5] ext4: add project quota support
From: Li Xi @ 2015-03-18 19:04 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A

The following patches propose an implementation of project quota
support for ext4. A project is an aggregate of unrelated inodes
which might scatter in different directories. Inodes that belong
to the same project possess an identical identification i.e.
'project ID', just like every inode has its user/group
identification. The following patches add project quota as
supplement to the former uer/group quota types.

The semantics of ext4 project quota is consistent with XFS. Each
directory can have EXT4_INODE_PROJINHERIT flag set. When the
EXT4_INODE_PROJINHERIT flag of a parent directory is not set, a
newly created inode under that directory will have a default project
ID (i.e. 0). And its EXT4_INODE_PROJINHERIT flag is not set either.
When this flag is set on a directory, following rules will be kept:

1) The newly created inode under that directory will inherit both
the EXT4_INODE_PROJINHERIT flag and the project ID from its parent
directory.

2) Hard-linking a inode with different project ID into that directory
will fail with errno EXDEV.

3) Renaming a inode with different project ID into that directory
will fail with errno EXDEV. However, 'mv' command will detect this
failure and copy the renamed inode to a new inode in the directory.
Thus, this new inode will inherit both the project ID and
EXT4_INODE_PROJINHERIT flag.

4) If the project quota of that ID is being enforced, statfs() on
that directory will take the quotas as another upper limits along
with the capacity of the file system, i.e. the total block/inode
number will be the minimum of the quota limits and file system
capacity.

Changelog:
* v10 <- v9:
 - Remove non-journaled project quota interface;
 - Only allow admin to read project quota info;
 - Cleanup FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface.
* v9 <- v8:
 - Remove non-journaled project quota;
 - Rebase to newest dev branch of ext4 repository (3.19.0-rc3).
* v8 <- v7:
 - Rebase to newest dev branch of ext4 repository (3.18.0_rc3).
* v7 <- v6:
 - Map ext4 inode flags to xflags of struct fsxattr;
 - Add patch to cleanup ext4 inode flag definitions.
* v6 <- v5:
 - Add project ID check for cross rename;
 - Remove patch of EXT4_IOC_GETPROJECT/EXT4_IOC_SETPROJECT ioctl
* v5 <- v4:
 - Check project feature when set/get project ID;
 - Do not check project feature for project quota;
 - Add support of FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR.
* v4 <- v3:
 - Do not check project feature when set/get project ID;
 - Use EXT4_MAXQUOTAS instead of MAXQUOTAS in ext4 patches;
 - Remove unnecessary change of fs/quota/dquot.c;
 - Remove CONFIG_QUOTA_PROJECT.
* v3 <- v2:
 - Add EXT4_INODE_PROJINHERIT semantics.
* v2 <- v1:
 - Add ioctl interface for setting/getting project;
 - Add EXT4_FEATURE_RO_COMPAT_PROJECT;
 - Add get_projid() method in struct dquot_operations;
 - Add error check of ext4_inode_projid_set/get().

v9: http://www.spinics.net/lists/linux-ext4/msg47326.html
v8: http://www.spinics.net/lists/linux-ext4/msg46545.html
v7: http://www.spinics.net/lists/linux-fsdevel/msg80404.html
v6: http://www.spinics.net/lists/linux-fsdevel/msg80022.html
v5: http://www.spinics.net/lists/linux-api/msg04840.html
v4: http://lwn.net/Articles/612972/
v3: http://www.spinics.net/lists/linux-ext4/msg45184.html
v2: http://www.spinics.net/lists/linux-ext4/msg44695.html
v1: http://article.gmane.org/gmane.comp.file-systems.ext4/45153

Any comments or feedbacks are appreciated.

Regards,
                                         - Li Xi

Li Xi (5):
  vfs: adds general codes to enforces project quota limits
  ext4: adds project ID support
  ext4: adds project quota support
  ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
  ext4: cleanup inode flag definitions

 fs/ext4/ext4.h             |   89 +++++++----
 fs/ext4/ialloc.c           |    5 +
 fs/ext4/inode.c            |   29 ++++
 fs/ext4/ioctl.c            |  366 +++++++++++++++++++++++++++++++++-----------
 fs/ext4/namei.c            |   18 +++
 fs/ext4/super.c            |   77 ++++++++-
 fs/quota/dquot.c           |   35 ++++-
 fs/quota/quota.c           |    5 +-
 fs/quota/quotaio_v2.h      |    6 +-
 fs/xfs/xfs_fs.h            |   47 ++----
 include/linux/quota.h      |    2 +
 include/uapi/linux/fs.h    |   33 ++++
 include/uapi/linux/quota.h |    6 +-
 13 files changed, 550 insertions(+), 168 deletions(-)

^ permalink raw reply

* [v10 1/5] vfs: adds general codes to enforces project quota limits
From: Li Xi @ 2015-03-18 19:04 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-api, tytso, adilger, jack, viro,
	hch, dmonakhov
In-Reply-To: <1426705497-22158-1-git-send-email-lixi@ddn.com>

This patch adds support for a new quota type PRJQUOTA for project quota
enforcement. Also a new method get_projid() is added into dquot_operations
structure.

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/quota/dquot.c           |   35 ++++++++++++++++++++++++++++++-----
 fs/quota/quota.c           |    5 ++++-
 fs/quota/quotaio_v2.h      |    6 ++++--
 include/linux/quota.h      |    2 ++
 include/uapi/linux/quota.h |    6 ++++--
 5 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 8f0acef..a02bb68 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1159,8 +1159,8 @@ static int need_print_warning(struct dquot_warn *warn)
 			return uid_eq(current_fsuid(), warn->w_dq_id.uid);
 		case GRPQUOTA:
 			return in_group_p(warn->w_dq_id.gid);
-		case PRJQUOTA:	/* Never taken... Just make gcc happy */
-			return 0;
+		case PRJQUOTA:
+			return 1;
 	}
 	return 0;
 }
@@ -1399,6 +1399,9 @@ static void __dquot_initialize(struct inode *inode, int type)
 	/* First get references to structures we might need. */
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		struct kqid qid;
+		kprojid_t projid;
+		int rc;
+
 		got[cnt] = NULL;
 		if (type != -1 && cnt != type)
 			continue;
@@ -1409,6 +1412,10 @@ static void __dquot_initialize(struct inode *inode, int type)
 		 */
 		if (i_dquot(inode)[cnt])
 			continue;
+
+		if (!sb_has_quota_active(sb, cnt))
+			continue;
+
 		init_needed = 1;
 
 		switch (cnt) {
@@ -1418,6 +1425,12 @@ static void __dquot_initialize(struct inode *inode, int type)
 		case GRPQUOTA:
 			qid = make_kqid_gid(inode->i_gid);
 			break;
+		case PRJQUOTA:
+			rc = inode->i_sb->dq_op->get_projid(inode, &projid);
+			if (rc)
+				continue;
+			qid = make_kqid_projid(projid);
+			break;
 		}
 		got[cnt] = dqget(sb, qid);
 	}
@@ -2161,7 +2174,8 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
 		error = -EROFS;
 		goto out_fmt;
 	}
-	if (!sb->s_op->quota_write || !sb->s_op->quota_read) {
+	if (!sb->s_op->quota_write || !sb->s_op->quota_read ||
+	    (type == PRJQUOTA && sb->dq_op->get_projid == NULL)) {
 		error = -EINVAL;
 		goto out_fmt;
 	}
@@ -2402,8 +2416,19 @@ static void do_get_dqblk(struct dquot *dquot, struct fs_disk_quota *di)
 
 	memset(di, 0, sizeof(*di));
 	di->d_version = FS_DQUOT_VERSION;
-	di->d_flags = dquot->dq_id.type == USRQUOTA ?
-			FS_USER_QUOTA : FS_GROUP_QUOTA;
+	switch (dquot->dq_id.type) {
+	case USRQUOTA:
+		di->d_flags = FS_USER_QUOTA;
+		break;
+	case GRPQUOTA:
+		di->d_flags = FS_GROUP_QUOTA;
+		break;
+	case PRJQUOTA:
+		di->d_flags = FS_PROJ_QUOTA;
+		break;
+	default:
+		BUG();
+	}
 	di->d_id = from_kqid_munged(current_user_ns(), dquot->dq_id);
 
 	spin_lock(&dq_data_lock);
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 2aa4151..33b30b1 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -30,7 +30,10 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,
 	case Q_XGETQSTATV:
 	case Q_XQUOTASYNC:
 		break;
-	/* allow to query information for dquots we "own" */
+	/*
+	 * allow to query information for dquots we "own"
+	 * always allow querying project quota
+	 */
 	case Q_GETQUOTA:
 	case Q_XGETQUOTA:
 		if ((type == USRQUOTA && uid_eq(current_euid(), make_kuid(current_user_ns(), id))) ||
diff --git a/fs/quota/quotaio_v2.h b/fs/quota/quotaio_v2.h
index f1966b4..4e95430 100644
--- a/fs/quota/quotaio_v2.h
+++ b/fs/quota/quotaio_v2.h
@@ -13,12 +13,14 @@
  */
 #define V2_INITQMAGICS {\
 	0xd9c01f11,	/* USRQUOTA */\
-	0xd9c01927	/* GRPQUOTA */\
+	0xd9c01927,	/* GRPQUOTA */\
+	0xd9c03f14,	/* PRJQUOTA */\
 }
 
 #define V2_INITQVERSIONS {\
 	1,		/* USRQUOTA */\
-	1		/* GRPQUOTA */\
+	1,		/* GRPQUOTA */\
+	1,		/* PRJQUOTA */\
 }
 
 /* First generic header */
diff --git a/include/linux/quota.h b/include/linux/quota.h
index 50978b7..ba51f7e 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -50,6 +50,7 @@
 
 #undef USRQUOTA
 #undef GRPQUOTA
+#undef PRJQUOTA
 enum quota_type {
 	USRQUOTA = 0,		/* element used for user quotas */
 	GRPQUOTA = 1,		/* element used for group quotas */
@@ -317,6 +318,7 @@ struct dquot_operations {
 	/* get reserved quota for delayed alloc, value returned is managed by
 	 * quota code only */
 	qsize_t *(*get_reserved_space) (struct inode *);
+	int (*get_projid) (struct inode *, kprojid_t *);/* Get project ID */
 };
 
 struct path;
diff --git a/include/uapi/linux/quota.h b/include/uapi/linux/quota.h
index 3b6cfbe..b2d9486 100644
--- a/include/uapi/linux/quota.h
+++ b/include/uapi/linux/quota.h
@@ -36,11 +36,12 @@
 #include <linux/errno.h>
 #include <linux/types.h>
 
-#define __DQUOT_VERSION__	"dquot_6.5.2"
+#define __DQUOT_VERSION__	"dquot_6.6.0"
 
-#define MAXQUOTAS 2
+#define MAXQUOTAS 3
 #define USRQUOTA  0		/* element used for user quotas */
 #define GRPQUOTA  1		/* element used for group quotas */
+#define PRJQUOTA  2		/* element used for project quotas */
 
 /*
  * Definitions for the default names of the quotas files.
@@ -48,6 +49,7 @@
 #define INITQFNAMES { \
 	"user",    /* USRQUOTA */ \
 	"group",   /* GRPQUOTA */ \
+	"project", /* PRJQUOTA */ \
 	"undefined", \
 };
 
-- 
1.7.1


^ permalink raw reply related

* [v10 2/5] ext4: adds project ID support
From: Li Xi @ 2015-03-18 19:04 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-api, tytso, adilger, jack, viro,
	hch, dmonakhov
In-Reply-To: <1426705497-22158-1-git-send-email-lixi@ddn.com>

This patch adds a new internal field of ext4 inode to save project
identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
inheriting project ID from parent directory.

Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h          |   21 +++++++++++++++++----
 fs/ext4/ialloc.c        |    5 +++++
 fs/ext4/inode.c         |   29 +++++++++++++++++++++++++++++
 fs/ext4/namei.c         |   18 ++++++++++++++++++
 fs/ext4/super.c         |    1 +
 include/uapi/linux/fs.h |    1 +
 6 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7fec2ef..7acb2da 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -378,16 +378,18 @@ struct flex_groups {
 #define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
 #define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
 #define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
+#define EXT4_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
 #define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */
 
-#define EXT4_FL_USER_VISIBLE		0x004BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE		0x004380FF /* User modifiable flags */
+#define EXT4_FL_USER_VISIBLE		0x204BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE		0x204380FF /* User modifiable flags */
 
 /* Flags that should be inherited by new inodes from their parent. */
 #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
 			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
 			   EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
-			   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
+			   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
+			   EXT4_PROJINHERIT_FL)
 
 /* Flags that are appropriate for regular files (all but dir-specific ones). */
 #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
@@ -435,6 +437,7 @@ enum {
 	EXT4_INODE_EA_INODE	= 21,	/* Inode used for large EA */
 	EXT4_INODE_EOFBLOCKS	= 22,	/* Blocks allocated beyond EOF */
 	EXT4_INODE_INLINE_DATA	= 28,	/* Data in inode. */
+	EXT4_INODE_PROJINHERIT	= 29,	/* Create with parents projid */
 	EXT4_INODE_RESERVED	= 31,	/* reserved for ext4 lib */
 };
 
@@ -684,6 +687,7 @@ struct ext4_inode {
 	__le32  i_crtime;       /* File Creation time */
 	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
 	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
+	__le32  i_projid;	/* Project ID */
 };
 
 struct move_extent {
@@ -939,6 +943,7 @@ struct ext4_inode_info {
 
 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
+	kprojid_t i_projid;
 };
 
 /*
@@ -1531,6 +1536,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
  */
 #define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM	0x0400
 #define EXT4_FEATURE_RO_COMPAT_READONLY		0x1000
+#define EXT4_FEATURE_RO_COMPAT_PROJECT		0x2000
 
 #define EXT4_FEATURE_INCOMPAT_COMPRESSION	0x0001
 #define EXT4_FEATURE_INCOMPAT_FILETYPE		0x0002
@@ -1581,7 +1587,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
 					 EXT4_FEATURE_RO_COMPAT_BIGALLOC |\
 					 EXT4_FEATURE_RO_COMPAT_METADATA_CSUM|\
-					 EXT4_FEATURE_RO_COMPAT_QUOTA)
+					 EXT4_FEATURE_RO_COMPAT_QUOTA |\
+					 EXT4_FEATURE_RO_COMPAT_PROJECT)
 
 /*
  * Default values for user and/or group using reserved blocks
@@ -1589,6 +1596,11 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 #define	EXT4_DEF_RESUID		0
 #define	EXT4_DEF_RESGID		0
 
+/*
+ * Default project ID
+ */
+#define	EXT4_DEF_PROJID		0
+
 #define EXT4_DEF_INODE_READAHEAD_BLKS	32
 
 /*
@@ -2141,6 +2153,7 @@ extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t lend);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
+extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
 extern void ext4_da_update_reserve_space(struct inode *inode,
 					int used, int quota_claim);
 
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ac644c3..10ca9dd 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -756,6 +756,11 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		inode->i_gid = dir->i_gid;
 	} else
 		inode_init_owner(inode, dir, mode);
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT) &&
+	    ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT))
+		ei->i_projid = EXT4_I(dir)->i_projid;
+	else
+		ei->i_projid = make_kprojid(&init_user_ns, EXT4_DEF_PROJID);
 	dquot_initialize(inode);
 
 	if (!goal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4df6d01..6e4833f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3870,6 +3870,14 @@ static inline void ext4_iget_extra_inode(struct inode *inode,
 		EXT4_I(inode)->i_inline_off = 0;
 }
 
+int ext4_get_projid(struct inode *inode, kprojid_t *projid)
+{
+	if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
+		return -EOPNOTSUPP;
+	*projid = EXT4_I(inode)->i_projid;
+	return 0;
+}
+
 struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 {
 	struct ext4_iloc iloc;
@@ -3881,6 +3889,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	int block;
 	uid_t i_uid;
 	gid_t i_gid;
+	projid_t i_projid;
 
 	inode = iget_locked(sb, ino);
 	if (!inode)
@@ -3930,12 +3939,18 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	inode->i_mode = le16_to_cpu(raw_inode->i_mode);
 	i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
 	i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
+		i_projid = (projid_t)le32_to_cpu(raw_inode->i_projid);
+	else
+		i_projid = EXT4_DEF_PROJID;
+
 	if (!(test_opt(inode->i_sb, NO_UID32))) {
 		i_uid |= le16_to_cpu(raw_inode->i_uid_high) << 16;
 		i_gid |= le16_to_cpu(raw_inode->i_gid_high) << 16;
 	}
 	i_uid_write(inode, i_uid);
 	i_gid_write(inode, i_gid);
+	ei->i_projid = make_kprojid(&init_user_ns, i_projid);;
 	set_nlink(inode, le16_to_cpu(raw_inode->i_links_count));
 
 	ext4_clear_state_flags(ei);	/* Only relevant on 32-bit archs */
@@ -4165,6 +4180,7 @@ static int ext4_do_update_inode(handle_t *handle,
 	int need_datasync = 0, set_large_file = 0;
 	uid_t i_uid;
 	gid_t i_gid;
+	projid_t i_projid;
 
 	spin_lock(&ei->i_raw_lock);
 
@@ -4177,6 +4193,7 @@ static int ext4_do_update_inode(handle_t *handle,
 	raw_inode->i_mode = cpu_to_le16(inode->i_mode);
 	i_uid = i_uid_read(inode);
 	i_gid = i_gid_read(inode);
+	i_projid = from_kprojid(&init_user_ns, ei->i_projid);
 	if (!(test_opt(inode->i_sb, NO_UID32))) {
 		raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid));
 		raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid));
@@ -4256,6 +4273,18 @@ static int ext4_do_update_inode(handle_t *handle,
 		}
 	}
 
+	BUG_ON(!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+			EXT4_FEATURE_RO_COMPAT_PROJECT) &&
+	       i_projid != EXT4_DEF_PROJID);
+	if (i_projid != EXT4_DEF_PROJID &&
+	    (EXT4_INODE_SIZE(inode->i_sb) <= EXT4_GOOD_OLD_INODE_SIZE ||
+	     (!EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)))) {
+		spin_unlock(&ei->i_raw_lock);
+		err = -EFBIG;
+		goto out_brelse;
+	}
+	raw_inode->i_projid = cpu_to_le32(i_projid);
+
 	ext4_inode_csum_set(inode, raw_inode, ei);
 
 	spin_unlock(&ei->i_raw_lock);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2291923..63a9623 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2938,6 +2938,11 @@ static int ext4_link(struct dentry *old_dentry,
 	if (inode->i_nlink >= EXT4_LINK_MAX)
 		return -EMLINK;
 
+	if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
+	    (!projid_eq(EXT4_I(dir)->i_projid,
+			EXT4_I(old_dentry->d_inode)->i_projid)))
+		return -EXDEV;
+
 	dquot_initialize(dir);
 
 retry:
@@ -3217,6 +3222,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
 	int credits;
 	u8 old_file_type;
 
+	if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT)) &&
+	    (!projid_eq(EXT4_I(new_dir)->i_projid,
+			EXT4_I(old_dentry->d_inode)->i_projid)))
+		return -EXDEV;
+
 	dquot_initialize(old.dir);
 	dquot_initialize(new.dir);
 
@@ -3395,6 +3405,14 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
 	u8 new_file_type;
 	int retval;
 
+	if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT) &&
+	     !projid_eq(EXT4_I(new_dir)->i_projid,
+			EXT4_I(old_dentry->d_inode)->i_projid)) ||
+	    (ext4_test_inode_flag(old_dir, EXT4_INODE_PROJINHERIT) &&
+	     !projid_eq(EXT4_I(old_dir)->i_projid,
+			EXT4_I(new_dentry->d_inode)->i_projid)))
+		return -EXDEV;
+
 	dquot_initialize(old.dir);
 	dquot_initialize(new.dir);
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bff3427..04c6cc3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1073,6 +1073,7 @@ static const struct dquot_operations ext4_quota_operations = {
 	.write_info	= ext4_write_info,
 	.alloc_dquot	= dquot_alloc,
 	.destroy_dquot	= dquot_destroy,
+	.get_projid	= ext4_get_projid,
 };
 
 static const struct quotactl_ops ext4_qctl_operations = {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..fcbf647 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -195,6 +195,7 @@ struct inodes_stat_t {
 #define FS_EXTENT_FL			0x00080000 /* Extents */
 #define FS_DIRECTIO_FL			0x00100000 /* Use direct i/o */
 #define FS_NOCOW_FL			0x00800000 /* Do not cow file */
+#define FS_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */
 
 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
-- 
1.7.1


^ permalink raw reply related

* [v10 3/5] ext4: adds project quota support
From: Li Xi @ 2015-03-18 19:04 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1426705497-22158-1-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

This patch adds mount options for enabling/disabling project quota
accounting and enforcement. A new specific inode is also used for
project quota accounting.

Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
Signed-off-by: Dmitry Monakhov <dmonakhov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/ext4/ext4.h  |    9 +++++-
 fs/ext4/super.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7acb2da..ad650d4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -208,6 +208,7 @@ struct ext4_io_submit {
 #define EXT4_UNDEL_DIR_INO	 6	/* Undelete directory inode */
 #define EXT4_RESIZE_INO		 7	/* Reserved group descriptors inode */
 #define EXT4_JOURNAL_INO	 8	/* Journal inode */
+#define EXT4_PRJ_QUOTA_INO	11	/* Project quota inode */
 
 /* First non-reserved inode for old ext4 filesystems */
 #define EXT4_GOOD_OLD_FIRST_INO	11
@@ -987,6 +988,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DIOREAD_NOLOCK	0x400000 /* Enable support for dio read nolocking */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
+#define EXT4_MOUNT_PRJJQUOTA		0x2000000 /* Journaled Project quota */
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
@@ -1169,7 +1171,8 @@ struct ext4_super_block {
 	__le32	s_overhead_clusters;	/* overhead blocks/clusters in fs */
 	__le32	s_backup_bgs[2];	/* groups with sparse_super2 SBs */
 	__u8	s_encrypt_algos[4];	/* Encryption algorithms in use  */
-	__le32	s_reserved[105];	/* Padding to the end of the block */
+	__le32  s_prj_quota_inum;       /* inode for tracking project quota */
+	__le32	s_reserved[104];	/* Padding to the end of the block */
 	__le32	s_checksum;		/* crc32c(superblock) */
 };
 
@@ -1184,7 +1187,7 @@ struct ext4_super_block {
 #define EXT4_MF_FS_ABORTED	0x0002	/* Fatal error detected */
 
 /* Number of quota types we support */
-#define EXT4_MAXQUOTAS 2
+#define EXT4_MAXQUOTAS 3
 
 /*
  * fourth extended-fs super-block data in memory
@@ -1376,6 +1379,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 		ino == EXT4_BOOT_LOADER_INO ||
 		ino == EXT4_JOURNAL_INO ||
 		ino == EXT4_RESIZE_INO ||
+		(EXT4_FIRST_INO(sb) > EXT4_PRJ_QUOTA_INO &&
+		 ino == EXT4_PRJ_QUOTA_INO) ||
 		(ino >= EXT4_FIRST_INO(sb) &&
 		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 04c6cc3..4077932 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1036,8 +1036,8 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 }
 
 #ifdef CONFIG_QUOTA
-#define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group")
-#define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
+static char *quotatypes[] = INITQFNAMES;
+#define QTYPE2NAME(t) (quotatypes[t])
 
 static int ext4_write_dquot(struct dquot *dquot);
 static int ext4_acquire_dquot(struct dquot *dquot);
@@ -1135,7 +1135,8 @@ enum {
 	Opt_journal_path, Opt_journal_checksum, Opt_journal_async_commit,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
 	Opt_data_err_abort, Opt_data_err_ignore,
-	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
+	Opt_usrjquota, Opt_grpjquota, Opt_prjjquota,
+	Opt_offusrjquota, Opt_offgrpjquota, Opt_offprjjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
 	Opt_usrquota, Opt_grpquota, Opt_i_version,
@@ -1190,6 +1191,8 @@ static const match_table_t tokens = {
 	{Opt_usrjquota, "usrjquota=%s"},
 	{Opt_offgrpjquota, "grpjquota="},
 	{Opt_grpjquota, "grpjquota=%s"},
+	{Opt_prjjquota, "prjjquota"},
+	{Opt_offprjjquota, "offprjjquota"},
 	{Opt_jqfmt_vfsold, "jqfmt=vfsold"},
 	{Opt_jqfmt_vfsv0, "jqfmt=vfsv0"},
 	{Opt_jqfmt_vfsv1, "jqfmt=vfsv1"},
@@ -1412,11 +1415,16 @@ static const struct mount_opts {
 	{Opt_grpquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_GRPQUOTA,
 							MOPT_SET | MOPT_Q},
 	{Opt_noquota, (EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
-		       EXT4_MOUNT_GRPQUOTA), MOPT_CLEAR | MOPT_Q},
+		       EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJJQUOTA),
+							MOPT_CLEAR | MOPT_Q},
 	{Opt_usrjquota, 0, MOPT_Q},
 	{Opt_grpjquota, 0, MOPT_Q},
+	{Opt_prjjquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_PRJJQUOTA,
+							MOPT_SET | MOPT_Q},
 	{Opt_offusrjquota, 0, MOPT_Q},
 	{Opt_offgrpjquota, 0, MOPT_Q},
+	{Opt_offprjjquota, 0, EXT4_MOUNT_PRJJQUOTA,
+							MOPT_CLEAR | MOPT_Q},
 	{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
 	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
 	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
@@ -1673,7 +1681,9 @@ static int parse_options(char *options, struct super_block *sb,
 			 "feature is enabled");
 		return 0;
 	}
-	if (sbi->s_qf_names[USRQUOTA] || sbi->s_qf_names[GRPQUOTA]) {
+	if (sbi->s_qf_names[USRQUOTA] ||
+	    sbi->s_qf_names[GRPQUOTA] ||
+	    test_opt(sb, PRJJQUOTA)) {
 		if (test_opt(sb, USRQUOTA) && sbi->s_qf_names[USRQUOTA])
 			clear_opt(sb, USRQUOTA);
 
@@ -3944,7 +3954,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		sb->s_qcop = &ext4_qctl_sysfile_operations;
 	else
 		sb->s_qcop = &ext4_qctl_operations;
-	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP;
+	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
 #endif
 	memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
 
@@ -5040,6 +5050,46 @@ restore_opts:
 	return err;
 }
 
+static int ext4_statfs_project(struct super_block *sb,
+			       kprojid_t projid, struct kstatfs *buf)
+{
+	struct kqid qid;
+	struct dquot *dquot;
+	u64 limit;
+	u64 curblock;
+
+	qid = make_kqid_projid(projid);
+	dquot = dqget(sb, qid);
+	if (!dquot)
+		return -ESRCH;
+	spin_lock(&dq_data_lock);
+
+	limit = dquot->dq_dqb.dqb_bsoftlimit ?
+		dquot->dq_dqb.dqb_bsoftlimit :
+		dquot->dq_dqb.dqb_bhardlimit;
+	if (limit && buf->f_blocks * buf->f_bsize > limit) {
+		curblock = dquot->dq_dqb.dqb_curspace / buf->f_bsize;
+		buf->f_blocks = limit / buf->f_bsize;
+		buf->f_bfree = buf->f_bavail =
+			(buf->f_blocks > curblock) ?
+			 (buf->f_blocks - curblock) : 0;
+	}
+
+	limit = dquot->dq_dqb.dqb_isoftlimit ?
+		dquot->dq_dqb.dqb_isoftlimit :
+		dquot->dq_dqb.dqb_ihardlimit;
+	if (limit && buf->f_files > limit) {
+		buf->f_files = limit;
+		buf->f_ffree =
+			(buf->f_files > dquot->dq_dqb.dqb_curinodes) ?
+			 (buf->f_files - dquot->dq_dqb.dqb_curinodes) : 0;
+	}
+
+	spin_unlock(&dq_data_lock);
+	dqput(dquot);
+	return 0;
+}
+
 static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct super_block *sb = dentry->d_sb;
@@ -5048,6 +5098,7 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 	ext4_fsblk_t overhead = 0, resv_blocks;
 	u64 fsid;
 	s64 bfree;
+	struct inode *inode = dentry->d_inode;
 	resv_blocks = EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters));
 
 	if (!test_opt(sb, MINIX_DF))
@@ -5072,6 +5123,9 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
 	buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;
 
+	if (ext4_test_inode_flag(inode, EXT4_INODE_PROJINHERIT) &&
+	    sb_has_quota_limits_enabled(sb, PRJQUOTA))
+		ext4_statfs_project(sb, EXT4_I(inode)->i_projid, buf);
 	return 0;
 }
 
@@ -5152,7 +5206,9 @@ static int ext4_mark_dquot_dirty(struct dquot *dquot)
 
 	/* Are we journaling quotas? */
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_QUOTA) ||
-	    sbi->s_qf_names[USRQUOTA] || sbi->s_qf_names[GRPQUOTA]) {
+	    sbi->s_qf_names[USRQUOTA] ||
+	    sbi->s_qf_names[GRPQUOTA] ||
+	    test_opt(sb, PRJJQUOTA)) {
 		dquot_mark_dquot_dirty(dquot);
 		return ext4_write_dquot(dquot);
 	} else {
@@ -5236,7 +5292,8 @@ static int ext4_quota_enable(struct super_block *sb, int type, int format_id,
 	struct inode *qf_inode;
 	unsigned long qf_inums[EXT4_MAXQUOTAS] = {
 		le32_to_cpu(EXT4_SB(sb)->s_es->s_usr_quota_inum),
-		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum)
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum),
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_prj_quota_inum)
 	};
 
 	BUG_ON(!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_QUOTA));
@@ -5264,7 +5321,8 @@ static int ext4_enable_quotas(struct super_block *sb)
 	int type, err = 0;
 	unsigned long qf_inums[EXT4_MAXQUOTAS] = {
 		le32_to_cpu(EXT4_SB(sb)->s_es->s_usr_quota_inum),
-		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum)
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum),
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_prj_quota_inum)
 	};
 
 	sb_dqopt(sb)->flags |= DQUOT_QUOTA_SYS_FILE;
-- 
1.7.1

^ permalink raw reply related

* [v10 4/5] ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
From: Li Xi @ 2015-03-18 19:04 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1426705497-22158-1-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface
support for ext4. The interface is kept consistent with
XFS_IOC_FSGETXATTR/XFS_IOC_FSGETXATTR.

Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
---
 fs/ext4/ext4.h          |    9 ++
 fs/ext4/ioctl.c         |  366 +++++++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_fs.h         |   47 +++----
 include/uapi/linux/fs.h |   32 ++++
 4 files changed, 336 insertions(+), 118 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ad650d4..7d67864 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -385,6 +385,13 @@ struct flex_groups {
 #define EXT4_FL_USER_VISIBLE		0x204BDFFF /* User visible flags */
 #define EXT4_FL_USER_MODIFIABLE		0x204380FF /* User modifiable flags */
 
+#define EXT4_FL_XFLAG_VISIBLE		(EXT4_SYNC_FL | \
+					 EXT4_IMMUTABLE_FL | \
+					 EXT4_APPEND_FL | \
+					 EXT4_NODUMP_FL | \
+					 EXT4_NOATIME_FL | \
+					 EXT4_PROJINHERIT_FL)
+
 /* Flags that should be inherited by new inodes from their parent. */
 #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
 			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
@@ -607,6 +614,8 @@ enum {
 #define EXT4_IOC_RESIZE_FS		_IOW('f', 16, __u64)
 #define EXT4_IOC_SWAP_BOOT		_IO('f', 17)
 #define EXT4_IOC_PRECACHE_EXTENTS	_IO('f', 18)
+#define EXT4_IOC_FSGETXATTR		FS_IOC_FSGETXATTR
+#define EXT4_IOC_FSSETXATTR		FS_IOC_FSSETXATTR
 
 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index f58a0d1..b634a19 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -14,6 +14,8 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/file.h>
+#include <linux/quotaops.h>
+#include <linux/quota.h>
 #include <asm/uaccess.h>
 #include "ext4_jbd2.h"
 #include "ext4.h"
@@ -196,6 +198,226 @@ journal_err_out:
 	return err;
 }
 
+static int ext4_ioctl_setflags(struct inode *inode,
+			       unsigned int flags)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	handle_t *handle = NULL;
+	int err = EPERM, migrate = 0;
+	struct ext4_iloc iloc;
+	unsigned int oldflags, mask, i;
+	unsigned int jflag;
+
+	/* Is it quota file? Do not allow user to mess with it */
+	if (IS_NOQUOTA(inode))
+		goto flags_out;
+
+	oldflags = ei->i_flags;
+
+	/* The JOURNAL_DATA flag is modifiable only by root */
+	jflag = flags & EXT4_JOURNAL_DATA_FL;
+
+	/*
+	 * The IMMUTABLE and APPEND_ONLY flags can only be changed by
+	 * the relevant capability.
+	 *
+	 * This test looks nicer. Thanks to Pauline Middelink
+	 */
+	if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
+		if (!capable(CAP_LINUX_IMMUTABLE))
+			goto flags_out;
+	}
+
+	/*
+	 * The JOURNAL_DATA flag can only be changed by
+	 * the relevant capability.
+	 */
+	if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
+		if (!capable(CAP_SYS_RESOURCE))
+			goto flags_out;
+	}
+	if ((flags ^ oldflags) & EXT4_EXTENTS_FL)
+		migrate = 1;
+
+	if (flags & EXT4_EOFBLOCKS_FL) {
+		/* we don't support adding EOFBLOCKS flag */
+		if (!(oldflags & EXT4_EOFBLOCKS_FL)) {
+			err = -EOPNOTSUPP;
+			goto flags_out;
+		}
+	} else if (oldflags & EXT4_EOFBLOCKS_FL)
+		ext4_truncate(inode);
+
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
+	if (IS_ERR(handle)) {
+		err = PTR_ERR(handle);
+		goto flags_out;
+	}
+	if (IS_SYNC(inode))
+		ext4_handle_sync(handle);
+	err = ext4_reserve_inode_write(handle, inode, &iloc);
+	if (err)
+		goto flags_err;
+
+	for (i = 0, mask = 1; i < 32; i++, mask <<= 1) {
+		if (!(mask & EXT4_FL_USER_MODIFIABLE))
+			continue;
+		if (mask & flags)
+			ext4_set_inode_flag(inode, i);
+		else
+			ext4_clear_inode_flag(inode, i);
+	}
+
+	ext4_set_inode_flags(inode);
+	inode->i_ctime = ext4_current_time(inode);
+
+	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+flags_err:
+	ext4_journal_stop(handle);
+	if (err)
+		goto flags_out;
+
+	if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL))
+		err = ext4_change_inode_journal_flag(inode, jflag);
+	if (err)
+		goto flags_out;
+	if (migrate) {
+		if (flags & EXT4_EXTENTS_FL)
+			err = ext4_ext_migrate(inode);
+		else
+			err = ext4_ind_migrate(inode);
+	}
+
+flags_out:
+	return err;
+}
+
+static int ext4_ioctl_setproject(struct file *filp, __u32 projid)
+{
+	struct inode *inode = file_inode(filp);
+	struct super_block *sb = inode->i_sb;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int err;
+	handle_t *handle;
+	kprojid_t kprojid;
+	struct ext4_iloc iloc;
+	struct ext4_inode *raw_inode;
+
+	struct dquot *transfer_to[EXT4_MAXQUOTAS] = { };
+
+	/* Make sure caller can change project. */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+			EXT4_FEATURE_RO_COMPAT_PROJECT)) {
+		BUG_ON(__kprojid_val(EXT4_I(inode)->i_projid)
+		       != EXT4_DEF_PROJID);
+		if (projid != EXT4_DEF_PROJID)
+			return -EOPNOTSUPP;
+		else
+			return 0;
+	}
+
+	kprojid = make_kprojid(&init_user_ns, (projid_t)projid);
+
+	if (projid_eq(kprojid, EXT4_I(inode)->i_projid))
+		return 0;
+
+	err = mnt_want_write_file(filp);
+	if (err)
+		return err;
+
+	err = -EPERM;
+	mutex_lock(&inode->i_mutex);
+	/* Is it quota file? Do not allow user to mess with it */
+	if (IS_NOQUOTA(inode))
+		goto project_out;
+
+	dquot_initialize(inode);
+
+	handle = ext4_journal_start(inode, EXT4_HT_QUOTA,
+		EXT4_QUOTA_INIT_BLOCKS(sb) +
+		EXT4_QUOTA_DEL_BLOCKS(sb) + 3);
+	if (IS_ERR(handle)) {
+		err = PTR_ERR(handle);
+		goto project_out;
+	}
+
+	err = ext4_reserve_inode_write(handle, inode, &iloc);
+	if (err)
+		goto project_stop;
+
+	raw_inode = ext4_raw_inode(&iloc);
+	if ((EXT4_INODE_SIZE(sb) <=
+	     EXT4_GOOD_OLD_INODE_SIZE) ||
+	    (!EXT4_FITS_IN_INODE(raw_inode, ei, i_projid))) {
+	    	err = -EFBIG;
+	    	goto project_stop;
+	}
+
+	transfer_to[PRJQUOTA] = dqget(sb, make_kqid_projid(kprojid));
+	if (!transfer_to[PRJQUOTA])
+		goto project_set;
+
+	err = __dquot_transfer(inode, transfer_to);
+	dqput(transfer_to[PRJQUOTA]);
+	if (err)
+		goto project_stop;
+
+project_set:
+	EXT4_I(inode)->i_projid = kprojid;
+	inode->i_ctime = ext4_current_time(inode);
+	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+project_stop:
+	ext4_journal_stop(handle);
+project_out:
+	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(filp);
+	return err;
+}
+
+/* Transfer internal flags to xflags */
+static inline __u32 ext4_iflags_to_xflags(unsigned long iflags)
+{
+	__u32 xflags = 0;
+
+	if (iflags & EXT4_SYNC_FL)
+		xflags |= FS_XFLAG_SYNC;
+	if (iflags & EXT4_IMMUTABLE_FL)
+		xflags |= FS_XFLAG_IMMUTABLE;
+	if (iflags & EXT4_APPEND_FL)
+		xflags |= FS_XFLAG_APPEND;
+	if (iflags & EXT4_NODUMP_FL)
+		xflags |= FS_XFLAG_NODUMP;
+	if (iflags & EXT4_NOATIME_FL)
+		xflags |= FS_XFLAG_NOATIME;
+	if (iflags & EXT4_PROJINHERIT_FL)
+		xflags |= FS_XFLAG_PROJINHERIT;
+	return xflags;
+}
+
+/* Transfer xflags flags to internal */
+static inline unsigned long ext4_xflags_to_iflags(__u32 xflags)
+{
+	unsigned long iflags = 0;
+
+	if (xflags & FS_XFLAG_SYNC)
+		iflags |= EXT4_SYNC_FL;
+	if (xflags & FS_XFLAG_IMMUTABLE)
+		iflags |= EXT4_IMMUTABLE_FL;
+	if (xflags & FS_XFLAG_APPEND)
+		iflags |= EXT4_APPEND_FL;
+	if (xflags & FS_XFLAG_NODUMP)
+		iflags |= EXT4_NODUMP_FL;
+	if (xflags & FS_XFLAG_NOATIME)
+		iflags |= EXT4_NOATIME_FL;
+	if (xflags & FS_XFLAG_PROJINHERIT)
+		iflags |= EXT4_PROJINHERIT_FL;
+
+	return iflags;
+}
+
 long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
@@ -211,11 +433,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
 		return put_user(flags, (int __user *) arg);
 	case EXT4_IOC_SETFLAGS: {
-		handle_t *handle = NULL;
-		int err, migrate = 0;
-		struct ext4_iloc iloc;
-		unsigned int oldflags, mask, i;
-		unsigned int jflag;
+		int err;
 
 		if (!inode_owner_or_capable(inode))
 			return -EACCES;
@@ -229,91 +447,10 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 		flags = ext4_mask_flags(inode->i_mode, flags);
 
-		err = -EPERM;
 		mutex_lock(&inode->i_mutex);
-		/* Is it quota file? Do not allow user to mess with it */
-		if (IS_NOQUOTA(inode))
-			goto flags_out;
-
-		oldflags = ei->i_flags;
-
-		/* The JOURNAL_DATA flag is modifiable only by root */
-		jflag = flags & EXT4_JOURNAL_DATA_FL;
-
-		/*
-		 * The IMMUTABLE and APPEND_ONLY flags can only be changed by
-		 * the relevant capability.
-		 *
-		 * This test looks nicer. Thanks to Pauline Middelink
-		 */
-		if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
-			if (!capable(CAP_LINUX_IMMUTABLE))
-				goto flags_out;
-		}
-
-		/*
-		 * The JOURNAL_DATA flag can only be changed by
-		 * the relevant capability.
-		 */
-		if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
-			if (!capable(CAP_SYS_RESOURCE))
-				goto flags_out;
-		}
-		if ((flags ^ oldflags) & EXT4_EXTENTS_FL)
-			migrate = 1;
-
-		if (flags & EXT4_EOFBLOCKS_FL) {
-			/* we don't support adding EOFBLOCKS flag */
-			if (!(oldflags & EXT4_EOFBLOCKS_FL)) {
-				err = -EOPNOTSUPP;
-				goto flags_out;
-			}
-		} else if (oldflags & EXT4_EOFBLOCKS_FL)
-			ext4_truncate(inode);
-
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
-		if (IS_ERR(handle)) {
-			err = PTR_ERR(handle);
-			goto flags_out;
-		}
-		if (IS_SYNC(inode))
-			ext4_handle_sync(handle);
-		err = ext4_reserve_inode_write(handle, inode, &iloc);
-		if (err)
-			goto flags_err;
-
-		for (i = 0, mask = 1; i < 32; i++, mask <<= 1) {
-			if (!(mask & EXT4_FL_USER_MODIFIABLE))
-				continue;
-			if (mask & flags)
-				ext4_set_inode_flag(inode, i);
-			else
-				ext4_clear_inode_flag(inode, i);
-		}
-
-		ext4_set_inode_flags(inode);
-		inode->i_ctime = ext4_current_time(inode);
-
-		err = ext4_mark_iloc_dirty(handle, inode, &iloc);
-flags_err:
-		ext4_journal_stop(handle);
-		if (err)
-			goto flags_out;
-
-		if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL))
-			err = ext4_change_inode_journal_flag(inode, jflag);
-		if (err)
-			goto flags_out;
-		if (migrate) {
-			if (flags & EXT4_EXTENTS_FL)
-				err = ext4_ext_migrate(inode);
-			else
-				err = ext4_ind_migrate(inode);
-		}
-
-flags_out:
+		err = ext4_ioctl_setflags(inode, flags);
 		mutex_unlock(&inode->i_mutex);
-		mnt_drop_write_file(filp);
+		mnt_drop_write(filp->f_path.mnt);
 		return err;
 	}
 	case EXT4_IOC_GETVERSION:
@@ -615,7 +752,60 @@ resizefs_out:
 	}
 	case EXT4_IOC_PRECACHE_EXTENTS:
 		return ext4_ext_precache(inode);
+	case EXT4_IOC_FSGETXATTR:
+	{
+		struct fsxattr fa;
 
+		memset(&fa, 0, sizeof(struct fsxattr));
+
+		ext4_get_inode_flags(ei);
+		fa.fsx_xflags = ext4_iflags_to_xflags(ei->i_flags & EXT4_FL_USER_VISIBLE);
+
+		if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+				EXT4_FEATURE_RO_COMPAT_PROJECT)) {
+			fa.fsx_projid = (__u32)from_kprojid(&init_user_ns,
+				EXT4_I(inode)->i_projid);
+		}
+
+		if (copy_to_user((struct fsxattr __user *)arg,
+				 &fa, sizeof(fa)))
+			return -EFAULT;
+		return 0;
+	}
+	case EXT4_IOC_FSSETXATTR:
+	{
+		struct fsxattr fa;
+		int err;
+
+		if (!inode_owner_or_capable(inode))
+			return -EACCES;
+
+		if (copy_from_user(&fa, (struct fsxattr __user *)arg,
+				   sizeof(fa)))
+			return -EFAULT;
+
+		err = mnt_want_write_file(filp);
+		if (err)
+			return err;
+
+		flags = ext4_xflags_to_iflags(fa.fsx_xflags);
+		flags = ext4_mask_flags(inode->i_mode, flags);
+
+		mutex_lock(&inode->i_mutex);
+		flags = (ei->i_flags & ~EXT4_FL_XFLAG_VISIBLE) |
+			 (flags & EXT4_FL_XFLAG_VISIBLE);
+		err = ext4_ioctl_setflags(inode, flags);
+		mutex_unlock(&inode->i_mutex);
+		mnt_drop_write(filp->f_path.mnt);
+		if (err)
+			return err;
+
+		err = ext4_ioctl_setproject(filp, fa.fsx_projid);
+		if (err)
+			return err;
+
+		return 0;
+	}
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index 18dc721..64c7ae6 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -36,38 +36,25 @@ struct dioattr {
 #endif
 
 /*
- * Structure for XFS_IOC_FSGETXATTR[A] and XFS_IOC_FSSETXATTR.
- */
-#ifndef HAVE_FSXATTR
-struct fsxattr {
-	__u32		fsx_xflags;	/* xflags field value (get/set) */
-	__u32		fsx_extsize;	/* extsize field value (get/set)*/
-	__u32		fsx_nextents;	/* nextents field value (get)	*/
-	__u32		fsx_projid;	/* project identifier (get/set) */
-	unsigned char	fsx_pad[12];
-};
-#endif
-
-/*
  * Flags for the bs_xflags/fsx_xflags field
  * There should be a one-to-one correspondence between these flags and the
  * XFS_DIFLAG_s.
  */
-#define XFS_XFLAG_REALTIME	0x00000001	/* data in realtime volume */
-#define XFS_XFLAG_PREALLOC	0x00000002	/* preallocated file extents */
-#define XFS_XFLAG_IMMUTABLE	0x00000008	/* file cannot be modified */
-#define XFS_XFLAG_APPEND	0x00000010	/* all writes append */
-#define XFS_XFLAG_SYNC		0x00000020	/* all writes synchronous */
-#define XFS_XFLAG_NOATIME	0x00000040	/* do not update access time */
-#define XFS_XFLAG_NODUMP	0x00000080	/* do not include in backups */
-#define XFS_XFLAG_RTINHERIT	0x00000100	/* create with rt bit set */
-#define XFS_XFLAG_PROJINHERIT	0x00000200	/* create with parents projid */
-#define XFS_XFLAG_NOSYMLINKS	0x00000400	/* disallow symlink creation */
-#define XFS_XFLAG_EXTSIZE	0x00000800	/* extent size allocator hint */
-#define XFS_XFLAG_EXTSZINHERIT	0x00001000	/* inherit inode extent size */
-#define XFS_XFLAG_NODEFRAG	0x00002000  	/* do not defragment */
-#define XFS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
-#define XFS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
+#define XFS_XFLAG_REALTIME	FS_XFLAG_REALTIME	/* data in realtime volume */
+#define XFS_XFLAG_PREALLOC	FS_XFLAG_PREALLOC	/* preallocated file extents */
+#define XFS_XFLAG_IMMUTABLE	FS_XFLAG_IMMUTABLE	/* file cannot be modified */
+#define XFS_XFLAG_APPEND	FS_XFLAG_APPEND		/* all writes append */
+#define XFS_XFLAG_SYNC		FS_XFLAG_SYNC		/* all writes synchronous */
+#define XFS_XFLAG_NOATIME	FS_XFLAG_NOATIME	/* do not update access time */
+#define XFS_XFLAG_NODUMP	FS_XFLAG_NODUMP		/* do not include in backups */
+#define XFS_XFLAG_RTINHERIT	FS_XFLAG_RTINHERIT	/* create with rt bit set */
+#define XFS_XFLAG_PROJINHERIT	FS_XFLAG_PROJINHERIT	/* create with parents projid */
+#define XFS_XFLAG_NOSYMLINKS	FS_XFLAG_NOSYMLINKS	/* disallow symlink creation */
+#define XFS_XFLAG_EXTSIZE	FS_XFLAG_EXTSIZE	/* extent size allocator hint */
+#define XFS_XFLAG_EXTSZINHERIT	FS_XFLAG_EXTSZINHERIT	/* inherit inode extent size */
+#define XFS_XFLAG_NODEFRAG	FS_XFLAG_NODEFRAG  	/* do not defragment */
+#define XFS_XFLAG_FILESTREAM	FS_XFLAG_FILESTREAM	/* use filestream allocator */
+#define XFS_XFLAG_HASATTR	FS_XFLAG_HASATTR	/* no DIFLAG for this	*/
 
 /*
  * Structure for XFS_IOC_GETBMAP.
@@ -503,8 +490,8 @@ typedef struct xfs_swapext
 #define XFS_IOC_ALLOCSP		_IOW ('X', 10, struct xfs_flock64)
 #define XFS_IOC_FREESP		_IOW ('X', 11, struct xfs_flock64)
 #define XFS_IOC_DIOINFO		_IOR ('X', 30, struct dioattr)
-#define XFS_IOC_FSGETXATTR	_IOR ('X', 31, struct fsxattr)
-#define XFS_IOC_FSSETXATTR	_IOW ('X', 32, struct fsxattr)
+#define XFS_IOC_FSGETXATTR	FS_IOC_FSGETXATTR
+#define XFS_IOC_FSSETXATTR	FS_IOC_FSSETXATTR
 #define XFS_IOC_ALLOCSP64	_IOW ('X', 36, struct xfs_flock64)
 #define XFS_IOC_FREESP64	_IOW ('X', 37, struct xfs_flock64)
 #define XFS_IOC_GETBMAP		_IOWR('X', 38, struct getbmap)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index fcbf647..69dda62 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -58,6 +58,36 @@ struct inodes_stat_t {
 	long dummy[5];		/* padding for sysctl ABI compatibility */
 };
 
+/*
+ * Structure for FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR.
+ */
+struct fsxattr {
+	__u32		fsx_xflags;	/* xflags field value (get/set) */
+	__u32		fsx_extsize;	/* extsize field value (get/set)*/
+	__u32		fsx_nextents;	/* nextents field value (get)	*/
+	__u32		fsx_projid;	/* project identifier (get/set) */
+	unsigned char	fsx_pad[12];
+};
+
+/*
+ * Flags for the fsx_xflags field
+ */
+#define FS_XFLAG_REALTIME	0x00000001	/* data in realtime volume */
+#define FS_XFLAG_PREALLOC	0x00000002	/* preallocated file extents */
+#define FS_XFLAG_IMMUTABLE	0x00000008	/* file cannot be modified */
+#define FS_XFLAG_APPEND		0x00000010	/* all writes append */
+#define FS_XFLAG_SYNC		0x00000020	/* all writes synchronous */
+#define FS_XFLAG_NOATIME	0x00000040	/* do not update access time */
+#define FS_XFLAG_NODUMP		0x00000080	/* do not include in backups */
+#define FS_XFLAG_RTINHERIT	0x00000100	/* create with rt bit set */
+#define FS_XFLAG_PROJINHERIT	0x00000200	/* create with parents projid */
+#define FS_XFLAG_NOSYMLINKS	0x00000400	/* disallow symlink creation */
+#define FS_XFLAG_EXTSIZE	0x00000800	/* extent size allocator hint */
+#define FS_XFLAG_EXTSZINHERIT	0x00001000	/* inherit inode extent size */
+#define FS_XFLAG_NODEFRAG	0x00002000  	/* do not defragment */
+#define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
+#define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this */
+
 
 #define NR_FILE  8192	/* this can well be larger on a larger system */
 
@@ -163,6 +193,8 @@ struct inodes_stat_t {
 #define	FS_IOC_GETVERSION		_IOR('v', 1, long)
 #define	FS_IOC_SETVERSION		_IOW('v', 2, long)
 #define FS_IOC_FIEMAP			_IOWR('f', 11, struct fiemap)
+#define FS_IOC_FSGETXATTR		_IOR('X', 31, struct fsxattr)
+#define FS_IOC_FSSETXATTR		_IOW('X', 32, struct fsxattr)
 #define FS_IOC32_GETFLAGS		_IOR('f', 1, int)
 #define FS_IOC32_SETFLAGS		_IOW('f', 2, int)
 #define FS_IOC32_GETVERSION		_IOR('v', 1, int)
-- 
1.7.1

^ permalink raw reply related

* [v10 5/5] ext4: cleanup inode flag definitions
From: Li Xi @ 2015-03-18 19:04 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1426705497-22158-1-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

The inode flags defined in uapi/linux/fs.h were migrated from
ext4.h. This patch changes the inode flag definitions in ext4.h
to VFS definitions to make the gaps between them clearer.

Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
Reviewed-by: Andreas Dilger <adilger-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org>
---
 fs/ext4/ext4.h |   50 +++++++++++++++++++++++++-------------------------
 1 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7d67864..41646ad 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -353,34 +353,34 @@ struct flex_groups {
 /*
  * Inode flags
  */
-#define	EXT4_SECRM_FL			0x00000001 /* Secure deletion */
-#define	EXT4_UNRM_FL			0x00000002 /* Undelete */
-#define	EXT4_COMPR_FL			0x00000004 /* Compress file */
-#define EXT4_SYNC_FL			0x00000008 /* Synchronous updates */
-#define EXT4_IMMUTABLE_FL		0x00000010 /* Immutable file */
-#define EXT4_APPEND_FL			0x00000020 /* writes to file may only append */
-#define EXT4_NODUMP_FL			0x00000040 /* do not dump file */
-#define EXT4_NOATIME_FL			0x00000080 /* do not update atime */
+#define	EXT4_SECRM_FL			FS_SECRM_FL        /* Secure deletion */
+#define	EXT4_UNRM_FL			FS_UNRM_FL         /* Undelete */
+#define	EXT4_COMPR_FL			FS_COMPR_FL        /* Compress file */
+#define EXT4_SYNC_FL			FS_SYNC_FL         /* Synchronous updates */
+#define EXT4_IMMUTABLE_FL		FS_IMMUTABLE_FL    /* Immutable file */
+#define EXT4_APPEND_FL			FS_APPEND_FL       /* writes to file may only append */
+#define EXT4_NODUMP_FL			FS_NODUMP_FL       /* do not dump file */
+#define EXT4_NOATIME_FL			FS_NOATIME_FL      /* do not update atime */
 /* Reserved for compression usage... */
-#define EXT4_DIRTY_FL			0x00000100
-#define EXT4_COMPRBLK_FL		0x00000200 /* One or more compressed clusters */
-#define EXT4_NOCOMPR_FL			0x00000400 /* Don't compress */
+#define EXT4_DIRTY_FL			FS_DIRTY_FL
+#define EXT4_COMPRBLK_FL		FS_COMPRBLK_FL     /* One or more compressed clusters */
+#define EXT4_NOCOMPR_FL			FS_NOCOMP_FL       /* Don't compress */
 	/* nb: was previously EXT2_ECOMPR_FL */
-#define EXT4_ENCRYPT_FL			0x00000800 /* encrypted file */
+#define EXT4_ENCRYPT_FL			0x00000800         /* encrypted file */
 /* End compression flags --- maybe not all used */
-#define EXT4_INDEX_FL			0x00001000 /* hash-indexed directory */
-#define EXT4_IMAGIC_FL			0x00002000 /* AFS directory */
-#define EXT4_JOURNAL_DATA_FL		0x00004000 /* file data should be journaled */
-#define EXT4_NOTAIL_FL			0x00008000 /* file tail should not be merged */
-#define EXT4_DIRSYNC_FL			0x00010000 /* dirsync behaviour (directories only) */
-#define EXT4_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
-#define EXT4_HUGE_FILE_FL               0x00040000 /* Set to each huge file */
-#define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
-#define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
-#define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
-#define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
-#define EXT4_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
-#define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */
+#define EXT4_INDEX_FL			FS_INDEX_FL        /* hash-indexed directory */
+#define EXT4_IMAGIC_FL			FS_IMAGIC_FL       /* AFS directory */
+#define EXT4_JOURNAL_DATA_FL		FS_JOURNAL_DATA_FL /* file data should be journaled */
+#define EXT4_NOTAIL_FL			FS_NOTAIL_FL       /* file tail should not be merged */
+#define EXT4_DIRSYNC_FL			FS_DIRSYNC_FL      /* dirsync behaviour (directories only) */
+#define EXT4_TOPDIR_FL			FS_TOPDIR_FL       /* Top of directory hierarchies*/
+#define EXT4_HUGE_FILE_FL               0x00040000         /* Set to each huge file */
+#define EXT4_EXTENTS_FL			FS_EXTENT_FL       /* Inode uses extents */
+#define EXT4_EA_INODE_FL	        0x00200000         /* Inode used for large EA */
+#define EXT4_EOFBLOCKS_FL		0x00400000         /* Blocks allocated beyond EOF */
+#define EXT4_INLINE_DATA_FL		0x10000000         /* Inode has inline data. */
+#define EXT4_PROJINHERIT_FL		FS_PROJINHERIT_FL  /* Create with parents projid */
+#define EXT4_RESERVED_FL		FS_RESERVED_FL     /* reserved for ext4 lib */
 
 #define EXT4_FL_USER_VISIBLE		0x204BDFFF /* User visible flags */
 #define EXT4_FL_USER_MODIFIABLE		0x204380FF /* User modifiable flags */
-- 
1.7.1

^ permalink raw reply related

* [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage
From: Pavel Emelyanov @ 2015-03-18 19:34 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap

Hi,

On the recent LSF Andrea presented his userfault-fd patches and
I had shown some issues that appear in usage scenarios when the
monitor task and mm task do not cooperate to each other on VM
changes (and fork()-s).

Here's the implementation of the extended uffd API that would help 
us to address those issues.

As proof of concept I've implemented only fix for fork() case, but
I also plan to add the mremap() and exit() notifications, both are
also required for such non-cooperative usage.

More details about the extension itself is in patch #2 and the fork()
notification description is in patch #3.

Comments and suggestion are warmly welcome :)


Andrea, what's the best way to go on with the patches -- would you
prefer to include them in your git tree or should I instead continue
with them on my own, re-sending them when required? Either way would
be OK for me.

Thanks,
Pavel

^ permalink raw reply

* [PATCH 1/3] uffd: Tossing bits around
From: Pavel Emelyanov @ 2015-03-18 19:34 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap
In-Reply-To: <5509D342.7000403-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

Reformat the existing code a bit to make it easier for
further patching.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
---
 fs/userfaultfd.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b4c7f25..6c9a2d6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -61,6 +61,23 @@ struct userfaultfd_wake_range {
 	unsigned long len;
 };
 
+static const struct file_operations userfaultfd_fops;
+
+static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void)
+{
+	struct userfaultfd_ctx *ctx;
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (ctx) {
+		atomic_set(&ctx->refcount, 1);
+		init_waitqueue_head(&ctx->fault_wqh);
+		init_waitqueue_head(&ctx->fd_wqh);
+		ctx->released = false;
+	}
+
+	return ctx;
+}
+
 static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 				     int wake_flags, void *key)
 {
@@ -307,15 +324,15 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+static inline unsigned int do_find_userfault(wait_queue_head_t *wqh,
 					  struct userfaultfd_wait_queue **uwq)
 {
 	wait_queue_t *wq;
 	struct userfaultfd_wait_queue *_uwq;
 	unsigned int ret = 0;
 
-	spin_lock(&ctx->fault_wqh.lock);
-	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+	spin_lock(&wqh->lock);
+	list_for_each_entry(wq, &wqh->task_list, task_list) {
 		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 		if (_uwq->pending) {
 			ret = POLLIN;
@@ -324,11 +341,17 @@ static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
 			break;
 		}
 	}
-	spin_unlock(&ctx->fault_wqh.lock);
+	spin_unlock(&wqh->lock);
 
 	return ret;
 }
 
+static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue **uwq)
+{
+	return do_find_userfault(&ctx->fault_wqh, uwq);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -1080,16 +1103,12 @@ static struct file *userfaultfd_file_create(int flags)
 		goto out;
 
 	file = ERR_PTR(-ENOMEM);
-	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	ctx = userfaultfd_ctx_alloc();
 	if (!ctx)
 		goto out;
 
-	atomic_set(&ctx->refcount, 1);
-	init_waitqueue_head(&ctx->fault_wqh);
-	init_waitqueue_head(&ctx->fd_wqh);
 	ctx->flags = flags;
 	ctx->state = UFFD_STATE_WAIT_API;
-	ctx->released = false;
 	ctx->mm = current->mm;
 	/* prevent the mm struct to be freed */
 	atomic_inc(&ctx->mm->mm_count);
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH 2/3] uffd: Introduce the v2 API
From: Pavel Emelyanov @ 2015-03-18 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap
In-Reply-To: <5509D342.7000403-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

The new API will report more than just the page-faults. The
reason for this is -- when the task whose mm we monitor with 
uffd and the monitor task itself cannot cooperate with each
other, the former one can screw things up. Like this.

If task fork()-s the child process is detached from uffd and
thus all not-yet-faulted-in memory gets mapped with zero-pages
on touch.

Another example is mremap(). When the victim remaps the uffd-ed
region and starts touching it the monitor would receive fault
messages with addresses that were not register-ed with uffd
ioctl before it. Thus monitor will have no idea how to handle
those faults.

To address both we can send more events to the monitor. In
particular, on fork() we can create another uffd context,
register the same set of regions in it and "send" the descriptor
to monitor.

For mremap() we can send the message describing what change
has been performed.

So this patch prepares to ground for the described above feature
by introducing the v2 API of uffd. With new API the kernel would
respond with a message containing the event type (pagefault,
fork or remap) and argument (fault address, new uffd descriptor
or region change respectively).

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
---
 fs/userfaultfd.c                 | 56 ++++++++++++++++++++++++++++++----------
 include/uapi/linux/userfaultfd.h | 21 ++++++++++++++-
 2 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6c9a2d6..bd629b4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -41,6 +41,8 @@ struct userfaultfd_ctx {
 	wait_queue_head_t fd_wqh;
 	/* userfaultfd syscall flags */
 	unsigned int flags;
+	/* features flags */
+	unsigned int features;
 	/* state machine */
 	enum userfaultfd_state state;
 	/* released */
@@ -49,6 +51,8 @@ struct userfaultfd_ctx {
 	struct mm_struct *mm;
 };
 
+#define UFFD_FEATURE_LONGMSG	0x1
+
 struct userfaultfd_wait_queue {
 	unsigned long address;
 	wait_queue_t wq;
@@ -369,7 +373,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-				    __u64 *addr)
+				    __u64 *mtype, __u64 *addr)
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
@@ -383,6 +387,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 		if (find_userfault(ctx, &uwq)) {
 			uwq->pending = false;
 			/* careful to always initialize addr if ret == 0 */
+			*mtype = UFFD_PAGEFAULT;
 			*addr = uwq->address;
 			ret = 0;
 			break;
@@ -411,8 +416,6 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
 	ssize_t _ret, ret = 0;
-	/* careful to always initialize addr if ret == 0 */
-	__u64 uninitialized_var(addr);
 	int no_wait = file->f_flags & O_NONBLOCK;
 
 	if (ctx->state == UFFD_STATE_WAIT_API)
@@ -420,16 +423,34 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 	BUG_ON(ctx->state != UFFD_STATE_RUNNING);
 
 	for (;;) {
-		if (count < sizeof(addr))
-			return ret ? ret : -EINVAL;
-		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
-		if (_ret < 0)
-			return ret ? ret : _ret;
-		if (put_user(addr, (__u64 __user *) buf))
-			return ret ? ret : -EFAULT;
-		ret += sizeof(addr);
-		buf += sizeof(addr);
-		count -= sizeof(addr);
+		if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
+			/* careful to always initialize addr if ret == 0 */
+			__u64 uninitialized_var(addr);
+			__u64 uninitialized_var(mtype);
+			if (count < sizeof(addr))
+				return ret ? ret : -EINVAL;
+			_ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr);
+			if (_ret < 0)
+				return ret ? ret : _ret;
+			BUG_ON(mtype != UFFD_PAGEFAULT);
+			if (put_user(addr, (__u64 __user *) buf))
+				return ret ? ret : -EFAULT;
+			_ret = sizeof(addr);
+		} else {
+			struct uffd_v2_msg msg;
+			if (count < sizeof(msg))
+				return ret ? ret : -EINVAL;
+			_ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg);
+			if (_ret < 0)
+				return ret ? ret : _ret;
+			if (copy_to_user(buf, &msg, sizeof(msg)))
+				return ret ? ret : -EINVAL;
+			_ret = sizeof(msg);
+		}
+
+		ret += _ret;
+		buf += _ret;
+		count -= _ret;
 		/*
 		 * Allow to read more than one fault at time but only
 		 * block if waiting for the very first one.
@@ -981,7 +1002,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
 		goto out;
-	if (uffdio_api.api != UFFD_API) {
+	if (uffdio_api.api != UFFD_API && uffdio_api.api != UFFD_API_V2) {
 		/* careful not to leak info, we only read the first 8 bytes */
 		memset(&uffdio_api, 0, sizeof(uffdio_api));
 		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
@@ -992,6 +1013,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	/* careful not to leak info, we only read the first 8 bytes */
 	uffdio_api.bits = UFFD_API_BITS;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
+
+	if (uffdio_api.api == UFFD_API_V2) {
+		ctx->features |= UFFD_FEATURE_LONGMSG;
+		uffdio_api.bits |= UFFD_API_V2_BITS;
+	}
+
 	ret = -EFAULT;
 	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 		goto out;
@@ -1109,6 +1136,7 @@ static struct file *userfaultfd_file_create(int flags)
 
 	ctx->flags = flags;
 	ctx->state = UFFD_STATE_WAIT_API;
+	ctx->features = 0;
 	ctx->mm = current->mm;
 	/* prevent the mm struct to be freed */
 	atomic_inc(&ctx->mm->mm_count);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index db6e99a..4e169b8 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -9,7 +9,9 @@
 #ifndef _LINUX_USERFAULTFD_H
 #define _LINUX_USERFAULTFD_H
 
-#define UFFD_API ((__u64)0xAA)
+#define UFFD_API 	((__u64)0xAA)
+#define UFFD_API_V2	((__u64)0xAB)
+
 /* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
 #define UFFD_API_BITS (UFFD_BIT_WRITE)
 #define UFFD_API_IOCTLS				\
@@ -147,4 +149,21 @@ struct uffdio_remap {
 	__s64 wake;
 };
 
+struct uffd_v2_msg {
+	__u64	type;
+	__u64	arg;
+};
+
+#define UFFD_PAGEFAULT	0x1
+
+#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
+#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
+
+/*
+ * Lower PAGE_SHIFT bits are used to report those supported
+ * by the pagefault message itself. Other bits are used to
+ * report the message types v2 API supports
+ */
+#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH 3/3] uffd: Introduce fork() notification
From: Pavel Emelyanov @ 2015-03-18 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap
In-Reply-To: <5509D342.7000403-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

As described in previous e-mail, we need to get informed when
the task, whose mm is monitored with uffd, calls fork().

The fork notification is the new uffd with the same regions
and flags as those on parent. When read()-ing from uffd the
monitor task would receive the new uffd's descriptor number
and will be able to start reading events from the new task.

The fork() of mm with uffd attached doesn't finish until the 
monitor "acks" the message by reading the new uffd descriptor.

Signed-off-by: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
---
 fs/userfaultfd.c                 | 175 ++++++++++++++++++++++++++++++++++++++-
 include/linux/userfaultfd_k.h    |  12 +++
 include/uapi/linux/userfaultfd.h |   4 +-
 kernel/fork.c                    |   9 +-
 4 files changed, 194 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bd629b4..265f031 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
  *  mm/ksm.c (mm hashing).
  */
 
+#include <linux/list.h>
 #include <linux/hashtable.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
 	atomic_t refcount;
 	/* waitqueue head for the userfaultfd page faults */
 	wait_queue_head_t fault_wqh;
+	/* waitqueue head for fork-s */
+	wait_queue_head_t fork_wqh;
 	/* waitqueue head for the pseudo fd to wakeup poll/read */
 	wait_queue_head_t fd_wqh;
 	/* userfaultfd syscall flags */
@@ -51,10 +54,21 @@ struct userfaultfd_ctx {
 	struct mm_struct *mm;
 };
 
+struct userfaultfd_fork_ctx {
+	struct userfaultfd_ctx *orig;
+	struct userfaultfd_ctx *new;
+	struct list_head list;
+};
+
 #define UFFD_FEATURE_LONGMSG	0x1
+#define UFFD_FEATURE_FORK	0x2
 
 struct userfaultfd_wait_queue {
-	unsigned long address;
+	union {
+		unsigned long address;
+		struct userfaultfd_ctx *nctx;
+		int fd;
+	};
 	wait_queue_t wq;
 	bool pending;
 	struct userfaultfd_ctx *ctx;
@@ -75,6 +89,7 @@ static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void)
 	if (ctx) {
 		atomic_set(&ctx->refcount, 1);
 		init_waitqueue_head(&ctx->fault_wqh);
+		init_waitqueue_head(&ctx->fork_wqh);
 		init_waitqueue_head(&ctx->fd_wqh);
 		ctx->released = false;
 	}
@@ -270,6 +285,111 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	return VM_FAULT_RETRY;
 }
 
+int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
+{
+	struct userfaultfd_ctx *ctx = NULL, *octx;
+	struct userfaultfd_fork_ctx *fctx;
+
+	octx = vma->vm_userfaultfd_ctx.ctx;
+	if (!octx || !(octx->features & UFFD_FEATURE_FORK)) {
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		return 0;
+	}
+
+	list_for_each_entry(fctx, fcs, list)
+		if (fctx->orig == octx) {
+			ctx = fctx->new;
+			break;
+		}
+
+	if (!ctx) {
+		fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+		if (!fctx)
+			return -ENOMEM;
+
+		ctx = userfaultfd_ctx_alloc();
+		if (!ctx) {
+			kfree(fctx);
+			return -ENOMEM;
+		}
+
+		ctx->flags = octx->flags;
+		ctx->state = UFFD_STATE_RUNNING;
+		ctx->features = UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG;
+		ctx->mm = vma->vm_mm;
+		atomic_inc(&ctx->mm->mm_count);
+
+		userfaultfd_ctx_get(octx);
+		fctx->orig = octx;
+		fctx->new = ctx;
+		list_add_tail(&fctx->list, fcs);
+	}
+
+	vma->vm_userfaultfd_ctx.ctx = ctx;
+	return 0;
+}
+
+static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
+{
+	int ret = 0;
+	struct userfaultfd_ctx *ctx = fctx->orig;
+	struct userfaultfd_wait_queue uwq;
+
+	init_waitqueue_entry(&uwq.wq, current);
+	uwq.pending = true;
+	uwq.ctx = ctx;
+	uwq.nctx = fctx->new;
+
+	spin_lock(&ctx->fork_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fork_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		if (!uwq.pending)
+			break;
+		if (ACCESS_ONCE(ctx->released) ||
+				fatal_signal_pending(current)) {
+			ret = -1;
+			break;
+		}
+
+		spin_unlock(&ctx->fork_wqh.lock);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		spin_lock(&ctx->fork_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fork_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fork_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * already released.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return ret;
+}
+
+void dup_userfaultfd_complete(struct list_head *fcs)
+{
+	int ret = 0;
+	struct userfaultfd_fork_ctx *fctx, *n;
+
+	list_for_each_entry_safe(fctx, n, fcs, list) {
+		if (!ret)
+			ret = dup_fctx(fctx);
+		list_del(&fctx->list);
+		kfree(fctx);
+	}
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -356,6 +476,12 @@ static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
 	return do_find_userfault(&ctx->fault_wqh, uwq);
 }
 
+static inline unsigned int find_userfault_fork(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue **uwq)
+{
+	return do_find_userfault(&ctx->fork_wqh, uwq);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -366,12 +492,40 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 	case UFFD_STATE_WAIT_API:
 		return POLLERR;
 	case UFFD_STATE_RUNNING:
-		return find_userfault(ctx, NULL);
+		return find_userfault(ctx, NULL) || find_userfault_fork(ctx, NULL);
 	default:
 		BUG();
 	}
 }
 
+static ssize_t resolve_userfault_fork(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue **uwq)
+{
+	struct userfaultfd_ctx *new;
+	int fd;
+	struct file *file;
+
+	if (!find_userfault_fork(ctx, uwq))
+		return 0;
+
+	new = (*uwq)->nctx;
+	fd = get_unused_fd_flags(new->flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (fd < 0)
+		return fd;
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, new,
+				  O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		return PTR_ERR(file);
+	}
+
+	fd_install(fd, file);
+	(*uwq)->fd = fd;
+
+	return 1;
+}
+
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 				    __u64 *mtype, __u64 *addr)
 {
@@ -392,6 +546,21 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			ret = 0;
 			break;
 		}
+
+		ret = resolve_userfault_fork(ctx, &uwq);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			*mtype = UFFD_FORK;
+			*addr = uwq->fd;
+
+			uwq->pending = false;
+			wake_up(&ctx->fork_wqh);
+
+			ret = 0;
+			break;
+		}
+
 		if (signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
@@ -1015,7 +1184,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 
 	if (uffdio_api.api == UFFD_API_V2) {
-		ctx->features |= UFFD_FEATURE_LONGMSG;
+		ctx->features |= UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG;
 		uffdio_api.bits |= UFFD_API_V2_BITS;
 	}
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 81f0d11..44827f7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -75,6 +75,9 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
 }
 
+extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
+extern void dup_userfaultfd_complete(struct list_head *);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -107,6 +110,15 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline int dup_userfaultfd(struct vm_area_struct *, struct list_head *)
+{
+	return 0;
+}
+
+static inline void dup_userfaultfd_complete(struct list_head *)
+{
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 4e169b8..f6cfea3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -155,9 +155,11 @@ struct uffd_v2_msg {
 };
 
 #define UFFD_PAGEFAULT	0x1
+#define UFFD_FORK	0x2
 
 #define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
-#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
+#define UFFD_FORK_BIT		(1 << (UFFD_FORK - 1))
+#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT | UFFD_FORK_BIT)
 
 /*
  * Lower PAGE_SHIFT bits are used to report those supported
diff --git a/kernel/fork.c b/kernel/fork.c
index cfab6e9..532882d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -55,6 +55,7 @@
 #include <linux/rmap.h>
 #include <linux/ksm.h>
 #include <linux/acct.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
@@ -370,6 +371,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge;
+	LIST_HEAD(uf);
 
 	uprobe_start_dup_mmap();
 	down_write(&oldmm->mmap_sem);
@@ -421,11 +423,13 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		if (retval)
 			goto fail_nomem_policy;
 		tmp->vm_mm = mm;
+		retval = dup_userfaultfd(tmp, &uf);
+		if (retval)
+			goto fail_nomem_anon_vma_fork;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
+		tmp->vm_flags &= ~(VM_LOCKED);
 		tmp->vm_next = tmp->vm_prev = NULL;
-		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);
@@ -481,6 +485,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	dup_userfaultfd_complete(&uf);
 	uprobe_end_dup_mmap();
 	return retval;
 fail_nomem_anon_vma_fork:
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH 0/3] idle memory tracking
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet,
	linux-api, linux-doc, linux-mm, linux-kernel

Hi,

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently. Currently, the only means to estimate the amount of idle
memory provided by the kernel is /proc/PID/clear_refs. However, it has
two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

Back in 2011 an attempt was made by Michel Lespinasse to improve the
situation (see http://lwn.net/Articles/459269/). He proposed a kernel
space daemon which would periodically scan physical address range,
testing and clearing ACCESS/YOUNG PTE bits, and counting pages that had
not been referenced since the last scan. The daemon avoided interference
with the page reclaimer by setting the new PG_young flag on referenced
pages and making page_referenced() return >= 1 if PG_young was set.

This patch set reuses the idea of Michel's patch set, but the
implementation is quite different. Instead of introducing a kernel space
daemon, it only provides the userspace with the necessary means to
estimate the amount of idle memory, leaving the daemon to be implemented
in the userspace. In order to achieve that, it adds two new proc files,
/proc/kpagecgroup and /proc/sys/vm/set_idle, and extends the clear_refs
interface.

Usage:

 1. Write 1 to /proc/sys/vm/set_idle.

    This will set the IDLE flag for all user pages. The IDLE flag is cleared
    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
    to the page. It is also cleared when the page is freed.

 2. Wait some time.

 3. Write 6 to /proc/PID/clear_refs for each PID of interest.

    This will clear the IDLE flag for recently accessed pages.

 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
    certain application/container.

An example of using this new interface is below. It is a script that
counts the number of pages charged to a specified cgroup that have not
been accessed for a given time interval.

---- BEGIN SCRIPT ----
#! /usr/bin/python
#

import struct
import sys
import os
import stat
import time

def get_end_pfn():
    f = open("/proc/zoneinfo", "r")
    end_pfn = 0
    for l in f.readlines():
        l = l.split()
        if l[0] == "spanned":
            end_pfn = int(l[1])
        elif l[0] == "start_pfn:":
            end_pfn += int(l[1])
    return end_pfn

def set_idle():
    open("/proc/sys/vm/set_idle", "w").write("1")

def clear_refs(target_cg_path):
    procs = open(target_cg_path + "/cgroup.procs", "r")
    for pid in procs.readlines():
        try:
            with open("/proc/" + pid.rstrip() + "/clear_refs", "w") as f:
                f.write("6")
        except IOError as e:
            print "Failed to clear_refs for pid " + pid + ": " + str(e)

def count_idle(target_cg_path):
    target_cg_ino = os.stat(target_cg_path)[stat.ST_INO]

    pgflags = open("/proc/kpageflags", "rb")
    pgcgroup = open("/proc/kpagecgroup", "rb")

    nidle = 0

    for i in range(0, get_end_pfn()):
        cg_ino = struct.unpack('Q', pgcgroup.read(8))[0]
        flags = struct.unpack('Q', pgflags.read(8))[0]

        if cg_ino != target_cg_ino:
            continue

        # unevictable?
        if (flags >> 18) & 1 != 0:
            continue

        # huge?
        if (flags >> 22) & 1 != 0:
            npages = 512
        else:
            npages = 1

        # idle?
        if (flags >> 25) & 1 != 0:
            nidle += npages

    return nidle

if len(sys.argv) <> 3:
    print "Usage: %s cgroup_path scan_interval" % sys.argv[0]
    exit(1)

cg_path = sys.argv[1]
scan_interval = int(sys.argv[2])

while True:
    set_idle()
    time.sleep(scan_interval)
    clear_refs(cg_path)
    print count_idle(cg_path)
---- END SCRIPT ----

Thanks,

Vladimir Davydov (3):
  memcg: add page_cgroup_ino helper
  proc: add kpagecgroup file
  mm: idle memory tracking

 Documentation/filesystems/proc.txt     |    3 +
 Documentation/vm/00-INDEX              |    2 +
 Documentation/vm/idle_mem_tracking.txt |   23 +++++++
 Documentation/vm/pagemap.txt           |   10 ++-
 fs/proc/Kconfig                        |    5 +-
 fs/proc/page.c                         |  107 ++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c                     |   22 ++++++-
 include/linux/memcontrol.h             |    8 +--
 include/linux/page-flags.h             |   12 ++++
 include/uapi/linux/kernel-page-flags.h |    1 +
 kernel/sysctl.c                        |   14 +++++
 mm/Kconfig                             |   12 ++++
 mm/debug.c                             |    4 ++
 mm/hwpoison-inject.c                   |    5 +-
 mm/memcontrol.c                        |   61 ++++++++----------
 mm/memory-failure.c                    |   16 +----
 mm/rmap.c                              |    7 +++
 mm/swap.c                              |    2 +
 18 files changed, 248 insertions(+), 66 deletions(-)
 create mode 100644 Documentation/vm/idle_mem_tracking.txt

-- 
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 1/3] memcg: add page_cgroup_ino helper
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet,
	linux-api, linux-doc, linux-mm, linux-kernel
In-Reply-To: <cover.1426706637.git.vdavydov@parallels.com>

Hwpoison allows to filter pages by memory cgroup ino. To ahieve that, it
calls try_get_mem_cgroup_from_page(), then mem_cgroup_css(), and finally
cgroup_ino() on the cgroup returned. This looks bulky. Since in the next
patch I need to get the ino of the memory cgroup a page is charged to
too, in this patch I introduce the page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers those pages that are charged
to mem_cgroup->memory (i.e. page->mem_cgroup != NULL), and for others it
returns 0, while try_get_mem_cgroup_page(), used by hwpoison before, may
extract the cgroup from a swapcache readahead page too. Ignoring
swapcache readahead pages allows to call page_cgroup_ino() on unlocked
pages, which is nice. Hwpoison users will hardly see any difference.

Another difference between try_get_mem_cgroup_page() and
page_cgroup_ino() is that the latter works for offline memory cgroups
while the former does not, which is crucial for the next patch.

Since try_get_mem_cgroup_page() is not used by anyone else, this patch
removes this function. Also, it makes hwpoison memcg filter depend on
CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I've no idea why it was made
dependant on CONFIG_MEMCG_SWAP initially).

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |    8 ++----
 mm/hwpoison-inject.c       |    5 +---
 mm/memcontrol.c            |   61 ++++++++++++++++++--------------------------
 mm/memory-failure.c        |   16 ++----------
 4 files changed, 30 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..9262a8407af7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -91,7 +91,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
 			      struct mem_cgroup *root);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -192,6 +191,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+unsigned long page_cgroup_ino(struct page *page);
+
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -252,11 +253,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 329caf56df22..df63c3133d70 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * do a racy check with elevated page count, to make sure PG_hwpoison
 	 * will only be set for the targeted owner (or on a free page).
-	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
-	lock_page(hpage);
 	err = hwpoison_filter(hpage);
-	unlock_page(hpage);
 	if (err)
 		return 0;
 
@@ -123,7 +120,7 @@ static int pfn_inject_init(void)
 	if (!dentry)
 		goto fail;
 
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 	dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
 				    hwpoison_dir, &hwpoison_filter_memcg);
 	if (!dentry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a9641d8f9f..3ecbeda0a3f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2349,40 +2349,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges.  If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	struct mem_cgroup *memcg;
-	unsigned short id;
-	swp_entry_t ent;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	memcg = page->mem_cgroup;
-	if (memcg) {
-		if (!css_tryget_online(&memcg->css))
-			memcg = NULL;
-	} else if (PageSwapCache(page)) {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
-		rcu_read_lock();
-		memcg = mem_cgroup_from_id(id);
-		if (memcg && !css_tryget_online(&memcg->css))
-			memcg = NULL;
-		rcu_read_unlock();
-	}
-	return memcg;
-}
-
 static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
@@ -2774,6 +2740,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+unsigned long page_cgroup_ino(struct page *page)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ino = 0;
+
+	rcu_read_lock();
+	memcg = ACCESS_ONCE(page->mem_cgroup);
+	if (memcg)
+		ino = cgroup_ino(memcg->css.cgroup);
+	rcu_read_unlock();
+	return ino;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
 					 bool charge)
@@ -5482,8 +5461,18 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	if (do_swap_account && PageSwapCache(page))
-		memcg = try_get_mem_cgroup_from_page(page);
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t ent = { .val = page_private(page), };
+		unsigned short id = lookup_swap_cgroup_id(ent);
+
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(id);
+		if (memcg && !css_tryget_online(&memcg->css))
+			memcg = NULL;
+		rcu_read_unlock();
+	}
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d487f8dc6d39..7414f24cefdf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -128,27 +128,15 @@ static int hwpoison_filter_flags(struct page *p)
  * can only guarantee that the page either belongs to the memcg tasks, or is
  * a freed page.
  */
-#ifdef	CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 u64 hwpoison_filter_memcg;
 EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
 static int hwpoison_filter_task(struct page *p)
 {
-	struct mem_cgroup *mem;
-	struct cgroup_subsys_state *css;
-	unsigned long ino;
-
 	if (!hwpoison_filter_memcg)
 		return 0;
 
-	mem = try_get_mem_cgroup_from_page(p);
-	if (!mem)
-		return -EINVAL;
-
-	css = mem_cgroup_css(mem);
-	ino = cgroup_ino(css->cgroup);
-	css_put(css);
-
-	if (ino != hwpoison_filter_memcg)
+	if (page_cgroup_ino(p) != hwpoison_filter_memcg)
 		return -EINVAL;
 
 	return 0;
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 2/3] proc: add kpagecgroup file
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet,
	linux-api, linux-doc, linux-mm, linux-kernel
In-Reply-To: <cover.1426706637.git.vdavydov@parallels.com>

/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/vm/pagemap.txt |    6 ++++-
 fs/proc/Kconfig              |    5 ++--
 fs/proc/page.c               |   53 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6fbd55ef6b45..1ddfa1367b03 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are three components to pagemap:
+There are four components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
    physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
     23. BALLOON
     24. ZERO_PAGE
 
+ * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
+   memory cgroup each page is charged to, indexed by PFN. Only available when
+   CONFIG_MEMCG is set.
+
 Short descriptions to the page flags:
 
  0. LOCKED
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 2183fcf41d59..5021a2935bb9 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -69,5 +69,6 @@ config PROC_PAGE_MONITOR
  	help
 	  Various /proc files exist to monitor process memory utilization:
 	  /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
-	  /proc/kpagecount, and /proc/kpageflags. Disabling these
-          interfaces will reduce the size of the kernel by approximately 4kb.
+	  /proc/kpagecount, /proc/kpageflags, and /proc/kpagecgroup.
+	  Disabling these interfaces will reduce the size of the kernel
+	  by approximately 4kb.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
 #include <linux/kernel-page-flags.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	struct page *ppage;
+	unsigned long src = *ppos;
+	unsigned long pfn;
+	ssize_t ret = 0;
+	u64 ino;
+
+	pfn = src / KPMSIZE;
+	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+	if (src & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	while (count > 0) {
+		if (pfn_valid(pfn))
+			ppage = pfn_to_page(pfn);
+		else
+			ppage = NULL;
+
+		if (ppage)
+			ino = page_cgroup_ino(ppage);
+		else
+			ino = 0;
+
+		if (put_user(ino, out)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		pfn++;
+		out++;
+		count -= KPMSIZE;
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+	.llseek = mem_lseek,
+	.read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+	proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
 	return 0;
 }
 fs_initcall(proc_page_init);
-- 
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 3/3] mm: idle memory tracking
From: Vladimir Davydov @ 2015-03-18 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse,
	David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet,
	linux-api, linux-doc, linux-mm, linux-kernel
In-Reply-To: <cover.1426706637.git.vdavydov@parallels.com>

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently. Currently, the only means to estimate the amount of idle
memory provided by the kernel is /proc/PID/clear_refs. However, it has
two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

This patch attempts to provide the userspace with the means to track
idle memory without the above mentioned limitations.

Usage:

 1. Write 1 to /proc/sys/vm/set_idle.

    This will set the IDLE flag for all user pages. The IDLE flag is cleared
    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
    to the page. It is also cleared when the page is freed.

 2. Wait some time.

 3. Write 6 to /proc/PID/clear_refs for each PID of interest.

    This will clear the IDLE flag for recently accessed pages.

 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
    certain application/container.

To avoid interference with the memory reclaimer, this patch adds the
PG_young flag in addition to PG_idle. The PG_young flag is set if the
ACCESS/YOUNG bit is cleared at step 3. page_referenced() returns >= 1 if
the page has the PG_young flag set.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 Documentation/filesystems/proc.txt     |    3 ++
 Documentation/vm/00-INDEX              |    2 ++
 Documentation/vm/idle_mem_tracking.txt |   23 ++++++++++++++
 Documentation/vm/pagemap.txt           |    4 +++
 fs/proc/page.c                         |   54 ++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c                     |   22 +++++++++++--
 include/linux/page-flags.h             |   12 +++++++
 include/uapi/linux/kernel-page-flags.h |    1 +
 kernel/sysctl.c                        |   14 +++++++++
 mm/Kconfig                             |   12 +++++++
 mm/debug.c                             |    4 +++
 mm/rmap.c                              |    7 +++++
 mm/swap.c                              |    2 ++
 13 files changed, 157 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/vm/idle_mem_tracking.txt

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 8e36c7e3c345..9880ddb0383f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -500,6 +500,9 @@ To reset the peak resident set size ("high water mark") to the process's
 current value:
     > echo 5 > /proc/PID/clear_refs
 
+To clear the idle bit (see Documentation/vm/idle_mem_tracking.txt)
+    > echo 6 > /proc/PID/clear_refs
+
 Any other value written to /proc/PID/clear_refs will have no effect.
 
 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 081c49777abb..bab92cf7e2e4 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -14,6 +14,8 @@ hugetlbpage.txt
 	- a brief summary of hugetlbpage support in the Linux kernel.
 hwpoison.txt
 	- explains what hwpoison is
+idle_memory_tracking.txt
+	- explains how to track idle memory
 ksm.txt
 	- how to use the Kernel Samepage Merging feature.
 numa
diff --git a/Documentation/vm/idle_mem_tracking.txt b/Documentation/vm/idle_mem_tracking.txt
new file mode 100644
index 000000000000..4ca9bfafc560
--- /dev/null
+++ b/Documentation/vm/idle_mem_tracking.txt
@@ -0,0 +1,23 @@
+Idle memory tracking
+--------------------
+
+Knowing the portion of user memory that has not been touched for a given period
+of time can be useful to tune memory cgroup limits and/or for job placement
+within a compute cluster. CONFIG_IDLE_MEM_TRACKING provides the userspace with
+the means to estimate the amount of idle memory. In order to do this one should
+
+ 1. Write 1 to /proc/sys/vm/set_idle.
+
+    This will set the IDLE flag for all user pages. The IDLE flag is cleared
+    when the page is read or the ACCESS/YOUNG bit is cleared in any PTE pointing
+    to the page. It is also cleared when the page is freed.
+
+ 2. Wait some time.
+
+ 3. Write 6 to /proc/PID/clear_refs for each PID of interest.
+
+    This will clear the IDLE flag for recently accessed pages.
+
+ 4. Count the number of idle pages as reported by /proc/kpageflags. One may use
+    /proc/PID/pagemap and/or /proc/kpagecgroup to filter pages that belong to a
+    certain application/container.
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 1ddfa1367b03..4202e1d57d8c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are four components to pagemap:
     22. THP
     23. BALLOON
     24. ZERO_PAGE
+    25. IDLE
 
  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
    memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -114,6 +115,9 @@ Short descriptions to the page flags:
 24. ZERO_PAGE
     zero page for pfn_zero or huge_zero page
 
+25. IDLE
+    page is idle (see Documentation/vm/idle_mem_tracking.txt)
+
     [IO related page flags]
  1. ERROR     IO error occurred
  3. UPTODATE  page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..766478d66458 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -182,6 +182,10 @@ u64 stable_page_flags(struct page *page)
 	u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE,	PG_owner_priv_1);
 	u |= kpf_copy_bit(k, KPF_ARCH,		PG_arch_1);
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	u |= kpf_copy_bit(k, KPF_IDLE,		PG_idle);
+#endif
+
 	return u;
 };
 
@@ -275,6 +279,56 @@ static const struct file_operations proc_kpagecgroup_operations = {
 };
 #endif /* CONFIG_MEMCG */
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+static void set_mem_idle_node(int nid)
+{
+	unsigned long start_pfn, end_pfn, pfn;
+	struct page *page;
+
+	start_pfn = node_start_pfn(nid);
+	end_pfn = node_end_pfn(nid);
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		if (need_resched())
+			cond_resched();
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (page_count(page) == 0 || !PageLRU(page))
+			continue;
+
+		if (unlikely(!get_page_unless_zero(page)))
+			continue;
+		if (unlikely(!PageLRU(page)))
+			goto next;
+
+		SetPageIdle(page);
+next:
+		put_page(page);
+	}
+}
+
+static void set_mem_idle(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		set_mem_idle_node(nid);
+}
+
+int sysctl_set_mem_idle; /* unused */
+
+int sysctl_set_mem_idle_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		set_mem_idle();
+	return 0;
+}
+#endif /* CONFIG_IDLE_MEM_TRACKING */
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 956b75d61809..b2b5ed1e10bb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
-	if (young || PageReferenced(page))
+	if (young || PageYoung(page) || PageReferenced(page))
 		mss->referenced += size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
@@ -733,6 +733,7 @@ enum clear_refs_types {
 	CLEAR_REFS_MAPPED,
 	CLEAR_REFS_SOFT_DIRTY,
 	CLEAR_REFS_MM_HIWATER_RSS,
+	CLEAR_REFS_IDLE,
 	CLEAR_REFS_LAST,
 };
 
@@ -806,6 +807,14 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		page = pmd_page(*pmd);
 
+		if (cp->type == CLEAR_REFS_IDLE) {
+			if (pmdp_test_and_clear_young(vma, addr, pmd)) {
+				ClearPageIdle(page);
+				SetPageYoung(page);
+			}
+			goto out;
+		}
+
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
 		ClearPageReferenced(page);
@@ -833,6 +842,14 @@ out:
 		if (!page)
 			continue;
 
+		if (cp->type == CLEAR_REFS_IDLE) {
+			if (ptep_test_and_clear_young(vma, addr, pte)) {
+				ClearPageIdle(page);
+				SetPageYoung(page);
+			}
+			continue;
+		}
+
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
 		ClearPageReferenced(page);
@@ -852,10 +869,9 @@ static int clear_refs_test_walk(unsigned long start, unsigned long end,
 		return 1;
 
 	/*
-	 * Writing 1 to /proc/pid/clear_refs affects all pages.
+	 * Writing 1, 4, or 6 to /proc/pid/clear_refs affects all pages.
 	 * Writing 2 to /proc/pid/clear_refs only affects anonymous pages.
 	 * Writing 3 to /proc/pid/clear_refs only affects file mapped pages.
-	 * Writing 4 to /proc/pid/clear_refs affects all pages.
 	 */
 	if (cp->type == CLEAR_REFS_ANON && vma->vm_file)
 		return 1;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c851ff92d5b3..8e06d11eb723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	PG_young,
+	PG_idle,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -289,6 +293,14 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#ifdef CONFIG_IDLE_MEM_TRACKING
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+#else
+PAGEFLAG_FALSE(Young)
+PAGEFLAG_FALSE(Idle)
+#endif
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
 #define KPF_THP			22
 #define KPF_BALLOON		23
 #define KPF_ZERO_PAGE		24
+#define KPF_IDLE		25
 
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c1552633e159..54b9a0aa290f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -112,6 +112,11 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+extern int sysctl_set_mem_idle;
+extern int sysctl_set_mem_idle_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos);
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1512,6 +1517,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	{
+		.procname	= "set_idle",
+		.data		= &sysctl_set_mem_idle,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_set_mem_idle_handler,
+	},
+#endif
 	{ }
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 390214da4546..c6d5e931f62c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -635,3 +635,15 @@ config MAX_STACK_SIZE_MB
 	  changed to a smaller value in which case that is used.
 
 	  A sane initial value is 80 MB.
+
+config IDLE_MEM_TRACKING
+	bool "Enable idle memory tracking"
+	depends on 64BIT
+	select PROC_PAGE_MONITOR
+	help
+	  This feature allows a userspace process to estimate the size of user
+	  memory that has not been touched during a given period of time. This
+	  information can be useful to tune memory cgroup limits and/or for job
+	  placement within a compute cluster.
+
+	  See Documentation/vm/idle_mem_tracking.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..88468485a1f3 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#ifdef CONFIG_IDLE_MEM_TRACKING
+	{1UL << PG_young,		"young"		},
+	{1UL << PG_idle,		"idle"		},
+#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/rmap.c b/mm/rmap.c
index 8030382bbf5f..1afcc4db31e0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -799,6 +799,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	if (referenced) {
 		pra->referenced++;
 		pra->vm_flags |= vma->vm_flags;
+		if (PageIdle(page))
+			ClearPageIdle(page);
+	}
+
+	if (PageYoung(page)) {
+		ClearPageYoung(page);
+		pra->referenced++;
 	}
 
 	if (dirty)
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..94e591ecd64b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -615,6 +615,8 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	if (PageIdle(page))
+		ClearPageIdle(page);
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-- 
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH] Input: Add MT_TOOL_PALM
From: Charlie Mooney @ 2015-03-18 22:26 UTC (permalink / raw)
  To: linux-input
  Cc: Benjamin Tissoires, Hans de Goede, Peter Hutterer,
	Andrew De Los Reyes, Henrik Rydberg, Jonathan Corbet,
	Dmitry Torokhov, Charlie Mooney, Masanari Iida, Jiri Kosina,
	linux-doc, linux-kernel, linux-api

Currently there are only two "tools" that can be specified by a
multi-touch driver: MT_TOOL_FINGER and MT_TOOL_PEN.  In working with
Elan (The touch vendor) and discussing their next-gen devices it
seems that it will be useful to have more tools so that their devices
can give the upper layers of the stack hints as to what is touching
the sensor.

In particular they have new experimental firmware that can better
differentiate between palms vs fingertips and would like to plumb a
patch so that we can use their hints in higher-level gesture soft-
ware.  The firmware on the device can reasonably do a better job of
palm detection because it has access to all of the raw sensor readings
as opposed to just the width/pressure/etc that are exposed by the
driver.  As such, the firmware can characterize what a palm looks like
in much finer-grained detail and this change would allow such a
device to share its findings with the kernel.

Signed-off-by: Charlie Mooney <charliemooney@chromium.org>
---
 Documentation/input/multi-touch-protocol.txt | 9 ++++++---
 include/uapi/linux/input.h                   | 3 ++-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/input/multi-touch-protocol.txt b/Documentation/input/multi-touch-protocol.txt
index 7b4f59c..b85d000 100644
--- a/Documentation/input/multi-touch-protocol.txt
+++ b/Documentation/input/multi-touch-protocol.txt
@@ -312,9 +312,12 @@ ABS_MT_TOOL_TYPE
 
 The type of approaching tool. A lot of kernel drivers cannot distinguish
 between different tool types, such as a finger or a pen. In such cases, the
-event should be omitted. The protocol currently supports MT_TOOL_FINGER and
-MT_TOOL_PEN [2]. For type B devices, this event is handled by input core;
-drivers should instead use input_mt_report_slot_state().
+event should be omitted. The protocol currently supports MT_TOOL_FINGER,
+MT_TOOL_PEN, and MT_TOOL_PALM [2]. For type B devices, this event is handled
+by input core; drivers should instead use input_mt_report_slot_state().
+A contact's ABS_MT_TOOL_TYPE may change over time while still touching the
+device, because the firmware may not be able to determine which tool is being
+used when it first appears.
 
 ABS_MT_BLOB_ID
 
diff --git a/include/uapi/linux/input.h b/include/uapi/linux/input.h
index b0a8130..2f62ab2 100644
--- a/include/uapi/linux/input.h
+++ b/include/uapi/linux/input.h
@@ -973,7 +973,8 @@ struct input_keymap_entry {
  */
 #define MT_TOOL_FINGER		0
 #define MT_TOOL_PEN		1
-#define MT_TOOL_MAX		1
+#define MT_TOOL_PALM		2
+#define MT_TOOL_MAX		2
 
 /*
  * Values describing the status of a force-feedback effect
-- 
2.1.2


^ permalink raw reply related

* Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend
From: Andrew Morton @ 2015-03-18 22:31 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	danielmicay-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Rik van Riel, Hugh Dickins,
	Mel Gorman, Johannes Weiner, Michal Hocko, Andy Lutomirski
In-Reply-To: <deaa4139de6e6422a0cec1e3282553aed3495e94.1426626497.git.shli-b10kYP2dOMg@public.gmane.org>

On Tue, 17 Mar 2015 14:09:39 -0700 Shaohua Li <shli-b10kYP2dOMg@public.gmane.org> wrote:

> There was a similar patch posted before, but it doesn't get merged. I'd like
> to try again if there are more discussions.
> http://marc.info/?l=linux-mm&m=141230769431688&w=2
> 
> mremap can be used to accelerate realloc. The problem is mremap will
> punch a hole in original VMA, which makes specific memory allocator
> unable to utilize it. Jemalloc is an example. It manages memory in 4M
> chunks. mremap a range of the chunk will punch a hole, which other
> mmap() syscall can fill into. The 4M chunk is then fragmented, jemalloc
> can't handle it.

Daniel's changelog had additional details regarding the userspace
allocators' behaviour.  It would be best to incorporate that into your
changelog.

Daniel also had microbenchmark testing results for glibc and jemalloc. 
Can you please do this?

I'm not seeing any testing results for tcmalloc and I'm not seeing
confirmation that this patch will be useful for tcmalloc.  Has anyone
tried it, or sought input from tcmalloc developers?

> This patch adds a new flag for mremap. With it, mremap will not punch the
> hole. page tables of original vma will be zapped in the same way, but
> vma is still there. That is original vma will look like a vma without
> pagefault. Behavior of new vma isn't changed.
> 
> For private vma, accessing original vma will cause
> page fault and just like the address of the vma has never been accessed.
> So for anonymous, new page/zero page will be fault in. For file mapping,
> new page will be allocated with file reading for cow, or pagefault will
> use existing page cache.
> 
> For shared vma, original and new vma will map to the same file. We can
> optimize this without zaping original vma's page table in this case, but
> this patch doesn't do it yet.
> 
> Since with MREMAP_NOHOLE, original vma still exists. pagefault handler
> for special vma might not able to handle pagefault for mremap'd area.
> The patch doesn't allow vmas with VM_PFNMAP|VM_MIXEDMAP flags do NOHOLE
> mremap.

At some point (preferably an early point) we'd like a manpage update
and a cc: to linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org please.

^ permalink raw reply

* Re: [PATCH v3  00/15] Add support to STMicroelectronics STM32 family
From: Chanwoo Choi @ 2015-03-18 23:35 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle,
	Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo
In-Reply-To: <1426197361-19290-1-git-send-email-maxime.coquelin@st.com>

Hi Maxime,

I tested this patch-set on Linux 4.0-rc4 for STM32F429IDISCOVERY board.
I completed the kernel booting successfully without any modification .

Looks good to me for this patch-set.

Tested-by: Chanwoo Choi <cw00.choi@samsung.com>

Best Regards,
Chanwoo Choi

On 03/13/2015 06:55 AM, Maxime Coquelin wrote:
> From: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> 
> This third round tries to address most of the comments made on previous series.
> 
> It contains few less patches, as the reset_controller_of_init() patch has been
> removed, now that the bootlaoder handles the reset of the timers.
> 
> The pinctrl driver has also been removed after Linus review.
> It will be reworked to use the generic pinconf bindings, and may contain
> changes for other machines (Mediatek), to add support for pinmux property
> handling directly in pinconf-generic.
> 
> STM32 MCUs are Cortex-M CPU, used in various applications (consumer
> electronics, industrial applications, hobbyists...).
> Datasheets, user and programming manuals are publicly available on
> STMicroelectronics website.
> 
> With this series applied, the STM32F419 Discovery can boot succesfully.
> 
> 
> Changes since v2:
> -----------------
>  - Remove pinctrl driver from the series. 
>  - Remove reset_controller_of_init(), and reset the timers in the bootloader
>  - Add HW flow contrl property for serial driver
>  - Lots of changes in the DTS file, as per Andreas recommendations
>  - Some Kconfig clean-ups
>  - Adapt the config to be compatible with Andreas' bootwrapper, except UART port.
>  - Various fixes in documentation
> 
> Changes since v1:
> -----------------
>  - Move bindings documentation in their own patches (Andreas)
>  - Rename ARM System timer to armv7m-systick (Rob)
>  - Add clock-frequency property handling in armv7m-systick (Rob)
>  - Re-factor the reset controllers into a single controller (Philipp)
>  - Add kerneldoc to reset_controller_of_init (Philipp)
>  - Add named constants in include/dt-bindings/reset/ (Philipp)
>  - Make pinctrl driver to depend on ARCH_STM32 or COMPILE_TEST (Geert)
>  - Introduce CPUV7M_NUM_IRQ config flag to indicate the number of interrupts
> supported by the MCU, in order to limit memory waste in vectors' table (Uwe)
> 
> 
> Maxime Coquelin (15):
>   scripts: link-vmlinux: Don't pass page offset to kallsyms if XIP
>     Kernel
>   ARM: ARMv7-M: Enlarge vector table up to 256 entries
>   dt-bindings: Document the ARM System timer bindings
>   clocksource: Add ARM System timer driver
>   dt-bindings: Document the STM32 reset bindings
>   drivers: reset: Add STM32 reset driver
>   dt-bindings: Document the STM32 timer bindings
>   clockevent: Add STM32 Timer driver
>   dt-bindings: Document the STM32 USART bindings
>   serial: stm32-usart: Add STM32 USART Driver
>   ARM: Add STM32 family machine
>   ARM: dts: Add ARM System timer as clockevent in armv7m
>   ARM: dts: Introduce STM32F429 MCU
>   ARM: configs: Add STM32 defconfig
>   MAINTAINERS: Add entry for STM32 MCUs
> 
>  Documentation/arm/stm32/overview.txt               |  32 +
>  Documentation/arm/stm32/stm32f429-overview.txt     |  22 +
>  .../devicetree/bindings/arm/armv7m_systick.txt     |  26 +
>  .../devicetree/bindings/reset/st,stm32-rcc.txt     | 102 +++
>  .../devicetree/bindings/serial/st,stm32-usart.txt  |  32 +
>  .../devicetree/bindings/timer/st,stm32-timer.txt   |  22 +
>  MAINTAINERS                                        |   8 +
>  arch/arm/Kconfig                                   |  18 +
>  arch/arm/Makefile                                  |   1 +
>  arch/arm/boot/dts/Makefile                         |   1 +
>  arch/arm/boot/dts/armv7-m.dtsi                     |   6 +
>  arch/arm/boot/dts/stm32f429-disco.dts              |  71 +++
>  arch/arm/boot/dts/stm32f429.dtsi                   | 226 +++++++
>  arch/arm/configs/stm32_defconfig                   |  71 +++
>  arch/arm/kernel/entry-v7m.S                        |  13 +-
>  arch/arm/mach-stm32/Makefile                       |   1 +
>  arch/arm/mach-stm32/Makefile.boot                  |   3 +
>  arch/arm/mach-stm32/board-dt.c                     |  19 +
>  arch/arm/mm/Kconfig                                |  15 +
>  drivers/clocksource/Kconfig                        |  15 +
>  drivers/clocksource/Makefile                       |   2 +
>  drivers/clocksource/armv7m_systick.c               |  78 +++
>  drivers/clocksource/timer-stm32.c                  | 184 ++++++
>  drivers/reset/Makefile                             |   1 +
>  drivers/reset/reset-stm32.c                        | 125 ++++
>  drivers/tty/serial/Kconfig                         |  17 +
>  drivers/tty/serial/Makefile                        |   1 +
>  drivers/tty/serial/stm32-usart.c                   | 695 +++++++++++++++++++++
>  include/uapi/linux/serial_core.h                   |   3 +
>  scripts/link-vmlinux.sh                            |   2 +-
>  30 files changed, 1807 insertions(+), 5 deletions(-)
>  create mode 100644 Documentation/arm/stm32/overview.txt
>  create mode 100644 Documentation/arm/stm32/stm32f429-overview.txt
>  create mode 100644 Documentation/devicetree/bindings/arm/armv7m_systick.txt
>  create mode 100644 Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
>  create mode 100644 Documentation/devicetree/bindings/serial/st,stm32-usart.txt
>  create mode 100644 Documentation/devicetree/bindings/timer/st,stm32-timer.txt
>  create mode 100644 arch/arm/boot/dts/stm32f429-disco.dts
>  create mode 100644 arch/arm/boot/dts/stm32f429.dtsi
>  create mode 100644 arch/arm/configs/stm32_defconfig
>  create mode 100644 arch/arm/mach-stm32/Makefile
>  create mode 100644 arch/arm/mach-stm32/Makefile.boot
>  create mode 100644 arch/arm/mach-stm32/board-dt.c
>  create mode 100644 drivers/clocksource/armv7m_systick.c
>  create mode 100644 drivers/clocksource/timer-stm32.c
>  create mode 100644 drivers/reset/reset-stm32.c
>  create mode 100644 drivers/tty/serial/stm32-usart.c
> 


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox