Linux userland API discussions
* [PATCH v2 0/7] Add simple EEPROM Framework via regmap.
From: Srinivas Kandagatla @ 2015-03-13  9:49 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Maxime Ripard, Rob Herring, Pawel Moll, Kumar Gala, linux-api,
	linux-kernel, devicetree, Stephen Boyd, Arnd Bergmann, broonie,
	Greg Kroah-Hartman, linux-arm-msm, Srinivas Kandagatla
In-Reply-To: <1425548685-12887-1-git-send-email-srinivas.kandagatla@linaro.org>

Thank you all for providing input and comments on previous versions of this
patchset. Here is v2 of the patchset, addressing all the issues raised in the
review of the previous versions.

This patchset adds a new simple EEPROM framework to the kernel.

Up until now, EEPROM drivers were stored in drivers/misc, where they all had to
duplicate pretty much the same code to register a sysfs file, allow in-kernel
users to access the content of the devices they were driving, etc.
    
This was also a problem for other in-kernel users: the solutions differed
from one driver to another, which made for a rather big abstraction leak.

The introduction of this framework aims to solve this. It also introduces a DT
representation for consumer devices to get the data they require (MAC
addresses, SoC/revision ID, part numbers, and so on) from the EEPROMs.

Having a regmap interface to this framework gives a much better
abstraction for EEPROMs on different buses.

Patches 1-3 introduce the EEPROM framework.
Patch 4 migrates an existing driver to the EEPROM framework.
Patches 5-6 add a Qualcomm-specific qfprom driver.
Patch 7 adds an entry in MAINTAINERS.

It's also possible to migrate other EEPROM drivers to this framework.
Patch 6 can also be made a generic mmio-eeprom driver.

Provider APIs:
	eeprom_register()/eeprom_unregister();

Consumer APIs:
	eeprom_cell_get()/of_eeprom_cell_get()/of_eeprom_cell_get_byname();
	eeprom_cell_read()/eeprom_cell_write();

Device Tree:

	/* Provider */
	qfprom: qfprom@00700000 {
		compatible 	= "qcom,qfprom";
		reg		= <0x00700000 0x1000>;
		...

		/* Data cells */
		tsens_calibration: calib@404 {
			reg = <0x404 0x10>;
		};

		serial_number: sn {
			reg = <0x104 0x4>, <0x204 0x4>, <0x30c 0x4>;

		};
		...
	};
	
	/* Consumer node */
	tsens: tsens {
		...
		eeproms = <&tsens_calibration>;
		eeprom-names = "calib";
		...
	};
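
The "serial_number" cell above is described by three reg ranges, so a read of
that cell has to gather bytes from three places. A minimal userspace sketch of
that gathering step follows. It is only an illustration of the idea: struct
cell_range and cell_read_scattered() are hypothetical names, not the
framework's API, and the EEPROM is modelled as a plain byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: one <offset size> pair from a cell's "reg" property. */
struct cell_range {
	size_t offset;
	size_t size;
};

/*
 * Copy every range of a cell out of the raw EEPROM image into buf,
 * back to back. Returns the number of bytes copied, or -1 if a range
 * falls outside the image or buf is too small.
 */
static long cell_read_scattered(const uint8_t *image, size_t image_len,
				const struct cell_range *ranges,
				size_t nranges,
				uint8_t *buf, size_t buf_len)
{
	size_t done = 0;
	size_t i;

	assert(image != NULL && ranges != NULL && buf != NULL);

	for (i = 0; i < nranges; i++) {
		const struct cell_range *r = &ranges[i];

		if (r->offset > image_len || r->size > image_len - r->offset)
			return -1;	/* range lies outside the device */
		if (r->size > buf_len - done)
			return -1;	/* caller's buffer is too small */

		memcpy(buf + done, image + r->offset, r->size);
		done += r->size;
	}

	return (long)done;
}
```

For the serial_number cell, the ranges would be {0x104, 4}, {0x204, 4} and
{0x30c, 4}, and a successful read returns 12 contiguous bytes.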

userspace interface:

hexdump /sys/class/eeprom/qfprom0/eeprom

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00000a0 db10 2240 0000 e000 0c00 0c00 0000 0c00
0000000 0000 0000 0000 0000 0000 0000 0000 0000
...
*
0001000
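
When relating this dump to cell offsets, note that the layout groups bytes
into 16-bit words. A minimal sketch that reproduces that layout follows,
assuming a little-endian host; format_hexdump_line() is a hypothetical helper
written for this illustration, and the byte values in the note below are
inferred from the dump rather than read from hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format len bytes (len must be even) as a 7-digit hex offset followed
 * by 16-bit words, each word built low byte first (little-endian).
 * Returns the string length, or -1 if out is too small.
 */
static int format_hexdump_line(char *out, size_t out_len, size_t offset,
			       const uint8_t *bytes, size_t len)
{
	size_t i;
	int n;

	assert(len % 2 == 0);

	n = snprintf(out, out_len, "%07zx", offset);
	for (i = 0; i < len; i += 2) {
		unsigned int word;

		if (n < 0 || (size_t)n >= out_len)
			return -1;	/* output buffer too small */

		word = (unsigned int)bytes[i] |
		       ((unsigned int)bytes[i + 1] << 8);
		n += snprintf(out + n, out_len - (size_t)n, " %04x", word);
	}

	return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

Under that assumption, the bytes 10 db 40 22 at offset 0xa0 are printed as
"00000a0 db10 2240", so the first byte of the device at 0xa0 is 0x10, not 0xdb.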


Changes since v1 (https://lkml.org/lkml/2015/3/5/153)
 * Fix various licensing issues spotted by Paul Bolle and Mark Brown.
 * Allow the eeprom core to build as a module, spotted by Paul Bolle.
 * Fix various Kconfig issues spotted by Paul Bolle.
 * Remove unnecessary atomic variable spotted by Mark Brown.
 * A few cleanups, and common up some of the code in core.
 * Add qfprom bindings.

Changes since RFC (https://lkml.org/lkml/2015/2/19/307)
 * Fix documentation and error checks in read/write spotted by Andrew Lunn.
 * Kconfig fix suggested by Stephen Boyd.
 * Add module owner suggested by Stephen Boyd and others.
 * Fix unsafe handling of eeprom in unregister spotted by Russell and Mark Brown.
 * Separate bindings patch as suggested by Rob.
 * Add MAINTAINERS as suggested by Rob.
 * Added support for reading eeprom data, such as a serial number, that
   can be scattered across the EEPROM.
 * Added eeprom data using reg property suggested by Sascha and Stephen.
 * Added non-DT support.
 * Move kerneldoc to the src files spotted by Mark Brown.
 * Remove local list and do eeprom lookup by using class_find_device().


Thanks,
srini


Maxime Ripard (1):
  eeprom: sunxi: Move the SID driver to the eeprom framework

Srinivas Kandagatla (6):
  eeprom: Add a simple EEPROM framework for eeprom providers
  eeprom: Add a simple EEPROM framework for eeprom consumers
  eeprom: Add bindings for simple eeprom framework
  eeprom: qfprom: Add Qualcomm QFPROM support.
  eeprom: qfprom: Add bindings for qfprom
  eeprom: Add to MAINTAINERS for eeprom framework

 Documentation/ABI/testing/sysfs-driver-sunxi-sid   |  22 -
 .../bindings/eeprom/allwinner,sunxi-sid.txt        |  21 +
 .../devicetree/bindings/eeprom/eeprom.txt          |  70 +++
 .../devicetree/bindings/eeprom/qfprom.txt          |  23 +
 .../bindings/misc/allwinner,sunxi-sid.txt          |  17 -
 MAINTAINERS                                        |   9 +
 drivers/Kconfig                                    |   2 +
 drivers/Makefile                                   |   1 +
 drivers/eeprom/Kconfig                             |  33 ++
 drivers/eeprom/Makefile                            |   9 +
 drivers/eeprom/core.c                              | 517 +++++++++++++++++++++
 drivers/eeprom/eeprom-sunxi-sid.c                  | 136 ++++++
 drivers/eeprom/qfprom.c                            |  87 ++++
 drivers/misc/eeprom/Kconfig                        |  13 -
 drivers/misc/eeprom/Makefile                       |   1 -
 drivers/misc/eeprom/sunxi_sid.c                    | 156 -------
 include/linux/eeprom-consumer.h                    |  67 +++
 include/linux/eeprom-provider.h                    |  47 ++
 18 files changed, 1022 insertions(+), 209 deletions(-)
 delete mode 100644 Documentation/ABI/testing/sysfs-driver-sunxi-sid
 create mode 100644 Documentation/devicetree/bindings/eeprom/allwinner,sunxi-sid.txt
 create mode 100644 Documentation/devicetree/bindings/eeprom/eeprom.txt
 create mode 100644 Documentation/devicetree/bindings/eeprom/qfprom.txt
 delete mode 100644 Documentation/devicetree/bindings/misc/allwinner,sunxi-sid.txt
 create mode 100644 drivers/eeprom/Kconfig
 create mode 100644 drivers/eeprom/Makefile
 create mode 100644 drivers/eeprom/core.c
 create mode 100644 drivers/eeprom/eeprom-sunxi-sid.c
 create mode 100644 drivers/eeprom/qfprom.c
 delete mode 100644 drivers/misc/eeprom/sunxi_sid.c
 create mode 100644 include/linux/eeprom-consumer.h
 create mode 100644 include/linux/eeprom-provider.h

-- 
1.9.1


* Re: [PATCH v3  10/15] serial: stm32-usart: Add STM32 USART Driver
From: Paul Bolle @ 2015-03-13  9:41 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, Jonathan Corbet,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Andrew Morton, David S. Miller, Mauro Carvalho Chehab,
	Joe Perches, Antti Palosaari, Tejun Heo, Will Deacon <wi>
In-Reply-To: <1426197361-19290-11-git-send-email-maxime.coquelin@st.com>

Just a license nit, I'm afraid.

On Thu, 2015-03-12 at 22:55 +0100, Maxime Coquelin wrote:
> --- /dev/null
> +++ b/drivers/tty/serial/stm32-usart.c
> @@ -0,0 +1,695 @@
> +/*
> + * Copyright (C) Maxime Coquelin 2015
> + * Author:  Maxime Coquelin <mcoquelin.stm32@gmail.com>
> + * License terms:  GNU General Public License (GPL), version 2
> + *
> + * Inspired by st-asc.c from STMicroelectronics (c)
> + */

This states the license is GPL v2.

> +MODULE_LICENSE("GPL");

And
    MODULE_LICENSE("GPL v2");

would match that statement.


Paul Bolle



* Re: [PATCH v3  06/15] drivers: reset: Add STM32 reset driver
From: Philipp Zabel @ 2015-03-13  8:54 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: u.kleine-koenig, afaerber, geert, Rob Herring, Linus Walleij,
	Arnd Bergmann, stefan, pmeerw, pebolle, Jonathan Corbet,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Andrew Morton, David S. Miller, Mauro Carvalho Chehab,
	Joe Perches, Antti Palosaari, Tejun Heo, Will Deacon
In-Reply-To: <1426197361-19290-7-git-send-email-maxime.coquelin@st.com>

Am Donnerstag, den 12.03.2015, 22:55 +0100 schrieb Maxime Coquelin:
> From: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> 
> The STM32 MCUs family IP can be reset by accessing some shared registers.
> 
> The specificity is that some reset lines are used by the timers.
> At timer initialization time, the timer has to be reset, that's why
> we cannot use a regular driver.

But this is a regular driver now. Should this comment be updated?

> Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> ---
>  drivers/reset/Makefile      |   1 +
>  drivers/reset/reset-stm32.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 126 insertions(+)
>  create mode 100644 drivers/reset/reset-stm32.c
> 
> diff --git a/drivers/reset/Makefile b/drivers/reset/Makefile
> index 157d421..aed12d1 100644
> --- a/drivers/reset/Makefile
> +++ b/drivers/reset/Makefile
> @@ -1,5 +1,6 @@
>  obj-$(CONFIG_RESET_CONTROLLER) += core.o
>  obj-$(CONFIG_ARCH_SOCFPGA) += reset-socfpga.o
>  obj-$(CONFIG_ARCH_BERLIN) += reset-berlin.o
> +obj-$(CONFIG_ARCH_STM32) += reset-stm32.o
>  obj-$(CONFIG_ARCH_SUNXI) += reset-sunxi.o
>  obj-$(CONFIG_ARCH_STI) += sti/
> diff --git a/drivers/reset/reset-stm32.c b/drivers/reset/reset-stm32.c
> new file mode 100644
> index 0000000..0d389b1
> --- /dev/null
> +++ b/drivers/reset/reset-stm32.c
> @@ -0,0 +1,125 @@
> +/*
> + * Copyright (C) Maxime Coquelin 2015
> + * Author:  Maxime Coquelin <mcoquelin.stm32@gmail.com>
> + * License terms:  GNU General Public License (GPL), version 2
> + *
> + * Heavily based on sunxi driver from Maxime Ripard.
> + */
> +
> +#include <linux/err.h>
> +#include <linux/io.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_address.h>
> +#include <linux/platform_device.h>
> +#include <linux/reset-controller.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +
> +struct stm32_reset_data {
> +	spinlock_t			lock;
> +	void __iomem			*membase;
> +	struct reset_controller_dev	rcdev;
> +};
> +
> +static int stm32_reset_assert(struct reset_controller_dev *rcdev,
> +			      unsigned long id)
> +{
> +	struct stm32_reset_data *data = container_of(rcdev,
> +						     struct stm32_reset_data,
> +						     rcdev);
> +	int bank = id / BITS_PER_LONG;
> +	int offset = id % BITS_PER_LONG;
> +	unsigned long flags;
> +	u32 reg;
> +
> +	spin_lock_irqsave(&data->lock, flags);
> +
> +	reg = readl_relaxed(data->membase + (bank * 4));
> +	writel_relaxed(reg | BIT(offset), data->membase + (bank * 4));
> +
> +	spin_unlock_irqrestore(&data->lock, flags);
> +
> +	return 0;
> +}
> +
> +static int stm32_reset_deassert(struct reset_controller_dev *rcdev,
> +				unsigned long id)
> +{
> +	struct stm32_reset_data *data = container_of(rcdev,
> +						     struct stm32_reset_data,
> +						     rcdev);
> +	int bank = id / BITS_PER_LONG;
> +	int offset = id % BITS_PER_LONG;
> +	unsigned long flags;
> +	u32 reg;
> +
> +	spin_lock_irqsave(&data->lock, flags);
> +
> +	reg = readl_relaxed(data->membase + (bank * 4));
> +	writel_relaxed(reg & ~BIT(offset), data->membase + (bank * 4));
> +
> +	spin_unlock_irqrestore(&data->lock, flags);
> +
> +	return 0;
> +}
> +
> +static struct reset_control_ops stm32_reset_ops = {
> +	.assert		= stm32_reset_assert,
> +	.deassert	= stm32_reset_deassert,
> +};
> +
> +static const struct of_device_id stm32_reset_dt_ids[] = {
> +	 { .compatible = "st,stm32-rcc", },
> +	 { /* sentinel */ },
> +};
> +MODULE_DEVICE_TABLE(of, stm32_reset_dt_ids);
> +
> +static int stm32_reset_probe(struct platform_device *pdev)
> +{
> +	struct stm32_reset_data *data;
> +	struct resource *res;
> +
> +	data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> +	data->membase = devm_ioremap_resource(&pdev->dev, res);
> +	if (IS_ERR(data->membase))
> +		return PTR_ERR(data->membase);
> +
> +	spin_lock_init(&data->lock);
> +
> +	data->rcdev.owner = THIS_MODULE;
> +	data->rcdev.nr_resets = resource_size(res) * 8;
> +	data->rcdev.ops = &stm32_reset_ops;
> +	data->rcdev.of_node = pdev->dev.of_node;
> +
> +	return reset_controller_register(&data->rcdev);
> +}
> +
> +static int stm32_reset_remove(struct platform_device *pdev)
> +{
> +	struct stm32_reset_data *data = platform_get_drvdata(pdev);
> +
> +	reset_controller_unregister(&data->rcdev);
> +
> +	return 0;
> +}
> +
> +static struct platform_driver stm32_reset_driver = {
> +	.probe	= stm32_reset_probe,
> +	.remove	= stm32_reset_remove,
> +	.driver = {
> +		.name		= "stm32-rcc-reset",
> +		.of_match_table	= stm32_reset_dt_ids,
> +	},
> +};
> +module_platform_driver(stm32_reset_driver);
> +
> +MODULE_AUTHOR("Maxime Coquelin <maxime.coquelin@gmail.com>");
> +MODULE_DESCRIPTION("STM32 MCUs Reset Controller Driver");
> +MODULE_LICENSE("GPL");
> +

regards
Philipp



* Re: [PATCH v3  05/15] dt-bindings: Document the STM32 reset bindings
From: Philipp Zabel @ 2015-03-13  8:50 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: u.kleine-koenig, afaerber, geert, Rob Herring, Linus Walleij,
	Arnd Bergmann, stefan, pmeerw, pebolle, Jonathan Corbet,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, Russell King,
	Daniel Lezcano, Thomas Gleixner, Greg Kroah-Hartman, Jiri Slaby,
	Andrew Morton, David S. Miller, Mauro Carvalho Chehab,
	Joe Perches, Antti Palosaari, Tejun Heo, Will Deacon
In-Reply-To: <1426197361-19290-6-git-send-email-maxime.coquelin@st.com>

Hi Maxime,

Am Donnerstag, den 12.03.2015, 22:55 +0100 schrieb Maxime Coquelin:
> From: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> 
> This adds documentation of device tree bindings for the
> STM32 reset controller.
> 
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> ---
>  .../devicetree/bindings/reset/st,stm32-rcc.txt     | 102 +++++++++++++++++++++
>  1 file changed, 102 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
> 
> diff --git a/Documentation/devicetree/bindings/reset/st,stm32-rcc.txt b/Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
> new file mode 100644
> index 0000000..962f961
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
> @@ -0,0 +1,102 @@
> +STMicroelectronics STM32 Peripheral Reset Controller
> +====================================================
> +
> +The RCC IP is both a reset and a clock controller. This documentation only
> +document the reset part.
> +
> +Please also refer to reset.txt in this directory for common reset
> +controller binding usage.
> +
> +Required properties:
> +- compatible: Should be "st,stm32-rcc"
> +- reg: should be register base and length as documented in the
> +  datasheet
> +- #reset-cells: 1, see below
> +
> +example:
> +
> +rcc: reset@40023800 {
> +	#reset-cells = <1>;
> +	compatible = "st,stm32-rcc";
> +	reg = <0x40023800 0x400>;
> +};
> +
> +Specifying softreset control of devices
> +=======================================
> +
> +Device nodes should specify the reset channel required in their "resets"
> +property, containing a phandle to the reset device node and an index specifying
> +which channel to use.

Using a single value as index is OK, but it should be documented how
this corresponds to the register and bit offsets in the reference
manual.
Maybe add a comment that the index is in fact (register offset / 4) * 32 +
bit offset in that register, and that not all registers are dedicated to
the reset controller? Otherwise it is confusing (to me at least) that the
indices start at some arbitrary value.
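
The index arithmetic described here can be written down as a short sketch.
reset_index() is a hypothetical helper used only for illustration; it is not
part of the binding or the driver, and it assumes 32-bit registers spaced
4 bytes apart.

```c
#include <assert.h>

/*
 * Reset index as described in this review:
 * (register offset / 4) * 32 + bit offset in that register.
 */
static unsigned int reset_index(unsigned int reg_offset, unsigned int bit)
{
	assert(reg_offset % 4 == 0);	/* registers are 4 bytes apart */
	assert(bit < 32);		/* 32 reset bits per register */

	return (reg_offset / 4) * 32 + bit;
}
```

For a register at offset 0x10, bit 0 gives index 128, which matches the
"gpioa: 128" entry in the patch. Index 256, used by the timer2 example,
corresponds to bit 0 of the register at offset 0x20.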

> +example:
> +
> +	timer2 {
> +		resets			= <&rcc 256>;
> +	};
> +
> +List of indexes for STM32F429:

"List of valid indices", to point out that any other index is invalid?

> + - gpioa: 128

I had to look at the RM0090 Reference manual V8.0, Chapter 6, "Reset and
clock control for STM32F42xx and STM32F43xxx (RCC)" to see that the
reset registers indeed start at 0x10 (RCC_AHB1RSTR), ...

> + - gpiob: 129
> + - gpioc: 130
> + - gpiod: 131
> + - gpioe: 132
> + - gpiof: 133
> + - gpiog: 134
> + - gpioh: 135
> + - gpioi: 136
> + - gpioj: 137
> + - gpiok: 138
> + - crc: 140
> + - dma1: 149
> + - dma2: 150
> + - dma2d: 151
> + - ethmac: 153
> + - otghs: 157
> + - dcmi: 160
> + - cryp: 164
> + - hash: 165
> + - rng: 166
> + - otgfs: 167
> + - fmc: 192
> + - tim2: 256
> + - tim3: 257
> + - tim4: 258
> + - tim5: 259
> + - tim6: 260
> + - tim7: 261
> + - tim12: 262
> + - tim13: 263
> + - tim14: 264
> + - wwdg: 267
> + - spi2: 270
> + - spi3: 271
> + - uart2: 273
> + - uart3: 274
> + - uart4: 275
> + - uart5: 276
> + - i2c1: 277
> + - i2c2: 278
> + - i2c3: 279
> + - can1: 281
> + - can2: 282
> + - pwr: 284
> + - dac: 285
> + - uart7: 286
> + - uart8: 287
> + - tim1: 288
> + - tim8: 289
> + - usart1: 292
> + - usart6: 293
> + - adc: 296
> + - sdio: 299
> + - spi1: 300
> + - spi4: 301
> + - syscfg: 302
> + - tim9: 304
> + - tim10: 305
> + - tim11: 306
> + - spi5: 308
> + - spi6: 309
> + - sai1: 310
> + - ltdc: 31

That last one should say "ltdc: 314", right?
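
The suggested value can be checked by decoding it back into a register and a
bit. A small sketch, assuming 32-bit registers (this mirrors the driver's
bank = id / BITS_PER_LONG on a 32-bit build); reset_decode() is a hypothetical
helper, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split a reset index into a register byte offset and a bit number,
 * assuming index = (reg_offset / 4) * 32 + bit.
 */
static void reset_decode(unsigned int index,
			 unsigned int *reg_offset, unsigned int *bit)
{
	assert(reg_offset != NULL && bit != NULL);

	*reg_offset = (index / 32) * 4;
	*bit = index % 32;
}
```

Index 314 decodes to offset 0x24, bit 26, in the same register as sai1
(310, bit 22). The value 31 as posted decodes to offset 0x00, bit 31, which
lies below 0x10, where the reset registers start according to the reference
manual cited above.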

regards
Philipp



* [PATCH net-next 2/2] samples: bpf: add skb->field examples and tests
From: Alexei Starovoitov @ 2015-03-13  2:21 UTC (permalink / raw)
  To: David S. Miller
  Cc: Daniel Borkmann, Thomas Graf, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426213271-8363-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

- modify sockex1 example to count number of bytes in outgoing packets
- modify sockex2 example to count number of bytes and packets per flow
- add 4 stress tests that exercise 'skb->field' code path of verifier

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 samples/bpf/sockex1_kern.c  |    8 +++--
 samples/bpf/sockex1_user.c  |    2 +-
 samples/bpf/sockex2_kern.c  |   26 +++++++++------
 samples/bpf/sockex2_user.c  |   11 +++++--
 samples/bpf/test_verifier.c |   73 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 104 insertions(+), 16 deletions(-)

diff --git a/samples/bpf/sockex1_kern.c b/samples/bpf/sockex1_kern.c
index 066892662915..ed18e9a4909c 100644
--- a/samples/bpf/sockex1_kern.c
+++ b/samples/bpf/sockex1_kern.c
@@ -1,5 +1,6 @@
 #include <uapi/linux/bpf.h>
 #include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
 #include <uapi/linux/ip.h>
 #include "bpf_helpers.h"
 
@@ -11,14 +12,17 @@ struct bpf_map_def SEC("maps") my_map = {
 };
 
 SEC("socket1")
-int bpf_prog1(struct sk_buff *skb)
+int bpf_prog1(struct __sk_buff *skb)
 {
 	int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
 	long *value;
 
+	if (skb->pkt_type != PACKET_OUTGOING)
+		return 0;
+
 	value = bpf_map_lookup_elem(&my_map, &index);
 	if (value)
-		__sync_fetch_and_add(value, 1);
+		__sync_fetch_and_add(value, skb->len);
 
 	return 0;
 }
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
index 34a443ff3831..678ce4693551 100644
--- a/samples/bpf/sockex1_user.c
+++ b/samples/bpf/sockex1_user.c
@@ -40,7 +40,7 @@ int main(int ac, char **argv)
 		key = IPPROTO_ICMP;
 		assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
 
-		printf("TCP %lld UDP %lld ICMP %lld packets\n",
+		printf("TCP %lld UDP %lld ICMP %lld bytes\n",
 		       tcp_cnt, udp_cnt, icmp_cnt);
 		sleep(1);
 	}
diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c
index 6f0135f0f217..ba0e177ff561 100644
--- a/samples/bpf/sockex2_kern.c
+++ b/samples/bpf/sockex2_kern.c
@@ -42,13 +42,13 @@ static inline int proto_ports_offset(__u64 proto)
 	}
 }
 
-static inline int ip_is_fragment(struct sk_buff *ctx, __u64 nhoff)
+static inline int ip_is_fragment(struct __sk_buff *ctx, __u64 nhoff)
 {
 	return load_half(ctx, nhoff + offsetof(struct iphdr, frag_off))
 		& (IP_MF | IP_OFFSET);
 }
 
-static inline __u32 ipv6_addr_hash(struct sk_buff *ctx, __u64 off)
+static inline __u32 ipv6_addr_hash(struct __sk_buff *ctx, __u64 off)
 {
 	__u64 w0 = load_word(ctx, off);
 	__u64 w1 = load_word(ctx, off + 4);
@@ -58,7 +58,7 @@ static inline __u32 ipv6_addr_hash(struct sk_buff *ctx, __u64 off)
 	return (__u32)(w0 ^ w1 ^ w2 ^ w3);
 }
 
-static inline __u64 parse_ip(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+static inline __u64 parse_ip(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
 			     struct flow_keys *flow)
 {
 	__u64 verlen;
@@ -82,7 +82,7 @@ static inline __u64 parse_ip(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
 	return nhoff;
 }
 
-static inline __u64 parse_ipv6(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+static inline __u64 parse_ipv6(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
 			       struct flow_keys *flow)
 {
 	*ip_proto = load_byte(skb,
@@ -96,7 +96,7 @@ static inline __u64 parse_ipv6(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto
 	return nhoff;
 }
 
-static inline bool flow_dissector(struct sk_buff *skb, struct flow_keys *flow)
+static inline bool flow_dissector(struct __sk_buff *skb, struct flow_keys *flow)
 {
 	__u64 nhoff = ETH_HLEN;
 	__u64 ip_proto;
@@ -183,18 +183,23 @@ static inline bool flow_dissector(struct sk_buff *skb, struct flow_keys *flow)
 	return true;
 }
 
+struct pair {
+	long packets;
+	long bytes;
+};
+
 struct bpf_map_def SEC("maps") hash_map = {
 	.type = BPF_MAP_TYPE_HASH,
 	.key_size = sizeof(__be32),
-	.value_size = sizeof(long),
+	.value_size = sizeof(struct pair),
 	.max_entries = 1024,
 };
 
 SEC("socket2")
-int bpf_prog2(struct sk_buff *skb)
+int bpf_prog2(struct __sk_buff *skb)
 {
 	struct flow_keys flow;
-	long *value;
+	struct pair *value;
 	u32 key;
 
 	if (!flow_dissector(skb, &flow))
@@ -203,9 +208,10 @@ int bpf_prog2(struct sk_buff *skb)
 	key = flow.dst;
 	value = bpf_map_lookup_elem(&hash_map, &key);
 	if (value) {
-		__sync_fetch_and_add(value, 1);
+		__sync_fetch_and_add(&value->packets, 1);
+		__sync_fetch_and_add(&value->bytes, skb->len);
 	} else {
-		long val = 1;
+		struct pair val = {1, skb->len};
 
 		bpf_map_update_elem(&hash_map, &key, &val, BPF_ANY);
 	}
diff --git a/samples/bpf/sockex2_user.c b/samples/bpf/sockex2_user.c
index d2d5f5a790d3..29a276d766fc 100644
--- a/samples/bpf/sockex2_user.c
+++ b/samples/bpf/sockex2_user.c
@@ -6,6 +6,11 @@
 #include <unistd.h>
 #include <arpa/inet.h>
 
+struct pair {
+	__u64 packets;
+	__u64 bytes;
+};
+
 int main(int ac, char **argv)
 {
 	char filename[256];
@@ -29,13 +34,13 @@ int main(int ac, char **argv)
 
 	for (i = 0; i < 5; i++) {
 		int key = 0, next_key;
-		long long value;
+		struct pair value;
 
 		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
 			bpf_lookup_elem(map_fd[0], &next_key, &value);
-			printf("ip %s count %lld\n",
+			printf("ip %s bytes %lld packets %lld\n",
 			       inet_ntoa((struct in_addr){htonl(next_key)}),
-			       value);
+			       value.bytes, value.packets);
 			key = next_key;
 		}
 		sleep(1);
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 7b56b59fad8e..33beae615c5e 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -14,6 +14,7 @@
 #include <linux/unistd.h>
 #include <string.h>
 #include <linux/filter.h>
+#include <stddef.h>
 #include "libbpf.h"
 
 #define MAX_INSNS 512
@@ -642,6 +643,78 @@ static struct bpf_test tests[] = {
 		},
 		.result = ACCEPT,
 	},
+	{
+		"access skb fields ok",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, len)),
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, ifindex)),
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, pkt_type)),
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, queue_mapping)),
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+	},
+	{
+		"access skb fields bad1",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, -4),
+			BPF_EXIT_INSN(),
+		},
+		.errstr = "invalid bpf_context access",
+		.result = REJECT,
+	},
+	{
+		"access skb fields bad2",
+		.insns = {
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_1, 0, 9),
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+			BPF_EXIT_INSN(),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, pkt_type)),
+			BPF_EXIT_INSN(),
+		},
+		.fixup = {4},
+		.errstr = "different pointers",
+		.result = REJECT,
+	},
+	{
+		"access skb fields bad3",
+		.insns = {
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_1, 0, 2),
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, pkt_type)),
+			BPF_EXIT_INSN(),
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+			BPF_EXIT_INSN(),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+			BPF_JMP_IMM(BPF_JA, 0, 0, -12),
+		},
+		.fixup = {6},
+		.errstr = "different pointers",
+		.result = REJECT,
+	},
 };
 
 static int probe_filter_length(struct bpf_insn *fp)
-- 
1.7.9.5


* [PATCH net-next 1/2] bpf: allow extended BPF programs access skb fields
From: Alexei Starovoitov @ 2015-03-13  2:21 UTC (permalink / raw)
  To: David S. Miller
  Cc: Daniel Borkmann, Thomas Graf, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426213271-8363-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

introduce user accessible mirror of in-kernel 'struct sk_buff':
struct __sk_buff {
    __u32 len;
    __u32 pkt_type;
    __u32 mark;
    __u32 ifindex;
    __u32 queue_mapping;
};

bpf programs can do:
struct __sk_buff *ptr;
var = ptr->pkt_type;

which will be compiled to bpf assembler as:
dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)

bpf verifier will check validity of access and will convert it to:
dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
dst_reg &= 7

since 'pkt_type' is a bitfield.

If the pkt_type field is moved, placed in a different structure, removed, or
changes size, the function sk_filter_convert_ctx_access() will need to be
updated, just like the function convert_bpf_extensions() in the case of
classic BPF.
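
The two halves of this rewrite can be illustrated in userspace. The sketch
below is not kernel code: struct mirror_sk_buff is a local copy of the struct
shown above (renamed, with uint32_t standing in for __u32), and
load_pkt_type() is a hypothetical stand-in for the converted load. It assumes
pkt_type occupies the low three bits of its byte, as the "&= 7" in this
message implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the user-visible struct from the commit message. */
struct mirror_sk_buff {
	uint32_t len;
	uint32_t pkt_type;
	uint32_t mark;
	uint32_t ifindex;
	uint32_t queue_mapping;
};

/* The program-side load uses offset 4, as stated in the commit message. */
static_assert(offsetof(struct mirror_sk_buff, pkt_type) == 4,
	      "pkt_type must sit at offset 4");

/*
 * Stand-in for the converted access: load the byte that holds the
 * pkt_type bitfield, then mask away the neighbouring bitfields.
 */
static uint32_t load_pkt_type(const uint8_t *pkt_type_byte)
{
	uint32_t dst_reg = *pkt_type_byte;	/* dst_reg = *(u8 *)(src_reg + off) */

	dst_reg &= 7;				/* keep only the pkt_type bits */
	return dst_reg;
}
```

The program only ever sees the stable offsets of the mirror struct, while the
byte offset and mask used in the converted load can change with the in-kernel
layout.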

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 include/linux/bpf.h      |    5 +-
 include/uapi/linux/bpf.h |    8 +++
 kernel/bpf/syscall.c     |    2 +-
 kernel/bpf/verifier.c    |  152 +++++++++++++++++++++++++++++++++++++++++-----
 net/core/filter.c        |   58 +++++++++++++++++-
 5 files changed, 205 insertions(+), 20 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 80f2e0fc3d02..2c17ebdfb5ae 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -103,6 +103,9 @@ struct bpf_verifier_ops {
 	 * with 'type' (read or write) is allowed
 	 */
 	bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
+
+	u32 (*convert_ctx_access)(int dst_reg, int src_reg, int ctx_off,
+				  struct bpf_insn *insn);
 };
 
 struct bpf_prog_type_list {
@@ -133,7 +136,7 @@ struct bpf_map *bpf_map_get(struct fd f);
 void bpf_map_put(struct bpf_map *map);
 
 /* verify correctness of eBPF program */
-int bpf_check(struct bpf_prog *fp, union bpf_attr *attr);
+int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
 #else
 static inline void bpf_register_prog_type(struct bpf_prog_type_list *tl)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3fa1af8a58d7..66a82d6cd75b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -168,4 +168,12 @@ enum bpf_func_id {
 	__BPF_FUNC_MAX_ID,
 };
 
+struct __sk_buff {
+	__u32 len;
+	__u32 pkt_type;
+	__u32 mark;
+	__u32 ifindex;
+	__u32 queue_mapping;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 669719ccc9ee..ea75c654af1b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -519,7 +519,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 		goto free_prog;
 
 	/* run eBPF verifier */
-	err = bpf_check(prog, attr);
+	err = bpf_check(&prog, attr);
 	if (err < 0)
 		goto free_used_maps;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e6b522496250..c22ebd36fa4b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1620,11 +1620,10 @@ static int do_check(struct verifier_env *env)
 				return err;
 
 		} else if (class == BPF_LDX) {
-			if (BPF_MODE(insn->code) != BPF_MEM ||
-			    insn->imm != 0) {
-				verbose("BPF_LDX uses reserved fields\n");
-				return -EINVAL;
-			}
+			enum bpf_reg_type src_reg_type;
+
+			/* check for reserved fields is already done */
+
 			/* check src operand */
 			err = check_reg_arg(regs, insn->src_reg, SRC_OP);
 			if (err)
@@ -1643,6 +1642,29 @@ static int do_check(struct verifier_env *env)
 			if (err)
 				return err;
 
+			src_reg_type = regs[insn->src_reg].type;
+
+			if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
+				/* saw a valid insn
+				 * dst_reg = *(u32 *)(src_reg + off)
+				 * use reserved 'imm' field to mark this insn
+				 */
+				insn->imm = src_reg_type;
+
+			} else if (src_reg_type != insn->imm &&
+				   (src_reg_type == PTR_TO_CTX ||
+				    insn->imm == PTR_TO_CTX)) {
+				/* ABuser program is trying to use the same insn
+				 * dst_reg = *(u32*) (src_reg + off)
+				 * with different pointer types:
+				 * src_reg == ctx in one branch and
+				 * src_reg == stack|map in some other branch.
+				 * Reject it.
+				 */
+				verbose("same insn cannot be used with different pointers\n");
+				return -EINVAL;
+			}
+
 		} else if (class == BPF_STX) {
 			if (BPF_MODE(insn->code) == BPF_XADD) {
 				err = check_xadd(env, insn);
@@ -1790,6 +1812,13 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
 	int i, j;
 
 	for (i = 0; i < insn_cnt; i++, insn++) {
+		if (BPF_CLASS(insn->code) == BPF_LDX &&
+		    (BPF_MODE(insn->code) != BPF_MEM ||
+		     insn->imm != 0)) {
+			verbose("BPF_LDX uses reserved fields\n");
+			return -EINVAL;
+		}
+
 		if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
 			struct bpf_map *map;
 			struct fd f;
@@ -1881,6 +1910,92 @@ static void convert_pseudo_ld_imm64(struct verifier_env *env)
 			insn->src_reg = 0;
 }
 
+static void adjust_branches(struct bpf_prog *prog, int pos, int delta)
+{
+	struct bpf_insn *insn = prog->insnsi;
+	int insn_cnt = prog->len;
+	int i;
+
+	for (i = 0; i < insn_cnt; i++, insn++) {
+		if (BPF_CLASS(insn->code) != BPF_JMP ||
+		    BPF_OP(insn->code) == BPF_CALL ||
+		    BPF_OP(insn->code) == BPF_EXIT)
+			continue;
+
+		/* adjust offset of jmps if necessary */
+		if (i < pos && i + insn->off + 1 > pos)
+			insn->off += delta;
+		else if (i > pos && i + insn->off + 1 < pos)
+			insn->off -= delta;
+	}
+}
+
+/* convert load instructions that access fields of 'struct __sk_buff'
+ * into sequence of instructions that access fields of 'struct sk_buff'
+ */
+static int convert_ctx_accesses(struct verifier_env *env)
+{
+	struct bpf_insn *insn = env->prog->insnsi;
+	int insn_cnt = env->prog->len;
+	struct bpf_insn insn_buf[16];
+	struct bpf_prog *new_prog;
+	u32 cnt;
+	int i;
+
+	if (!env->prog->aux->ops->convert_ctx_access)
+		return 0;
+
+	for (i = 0; i < insn_cnt; i++, insn++) {
+		if (insn->code != (BPF_LDX | BPF_MEM | BPF_W))
+			continue;
+
+		if (insn->imm != PTR_TO_CTX) {
+			/* clear internal mark */
+			insn->imm = 0;
+			continue;
+		}
+
+		cnt = env->prog->aux->ops->
+			convert_ctx_access(insn->dst_reg, insn->src_reg,
+					   insn->off, insn_buf);
+		if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
+			verbose("bpf verifier is misconfigured\n");
+			return -EINVAL;
+		}
+
+		if (cnt == 1) {
+			memcpy(insn, insn_buf, sizeof(*insn));
+			continue;
+		}
+
+		/* several new insns need to be inserted. Make room for them */
+		insn_cnt += cnt - 1;
+		new_prog = bpf_prog_realloc(env->prog,
+					    bpf_prog_size(insn_cnt),
+					    GFP_USER);
+		if (!new_prog)
+			return -ENOMEM;
+
+		new_prog->len = insn_cnt;
+
+		memmove(new_prog->insnsi + i + cnt, new_prog->insns + i + 1,
+			sizeof(*insn) * (insn_cnt - i - cnt));
+
+		/* copy substitute insns in place of load instruction */
+		memcpy(new_prog->insnsi + i, insn_buf, sizeof(*insn) * cnt);
+
+		/* adjust branches in the whole program */
+		adjust_branches(new_prog, i, cnt - 1);
+
+		/* keep walking new program and skip insns we just inserted */
+		env->prog = new_prog;
+		insn = new_prog->insnsi + i + cnt - 1;
+		i += cnt - 1;
+	}
+
+	return 0;
+}
+
 static void free_states(struct verifier_env *env)
 {
 	struct verifier_state_list *sl, *sln;
@@ -1903,13 +2018,13 @@ static void free_states(struct verifier_env *env)
 	kfree(env->explored_states);
 }
 
-int bpf_check(struct bpf_prog *prog, union bpf_attr *attr)
+int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
 {
 	char __user *log_ubuf = NULL;
 	struct verifier_env *env;
 	int ret = -EINVAL;
 
-	if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
+	if ((*prog)->len <= 0 || (*prog)->len > BPF_MAXINSNS)
 		return -E2BIG;
 
 	/* 'struct verifier_env' can be global, but since it's not small,
@@ -1919,7 +2034,7 @@ int bpf_check(struct bpf_prog *prog, union bpf_attr *attr)
 	if (!env)
 		return -ENOMEM;
 
-	env->prog = prog;
+	env->prog = *prog;
 
 	/* grab the mutex to protect few globals used by verifier */
 	mutex_lock(&bpf_verifier_lock);
@@ -1951,7 +2066,7 @@ int bpf_check(struct bpf_prog *prog, union bpf_attr *attr)
 	if (ret < 0)
 		goto skip_full_check;
 
-	env->explored_states = kcalloc(prog->len,
+	env->explored_states = kcalloc(env->prog->len,
 				       sizeof(struct verifier_state_list *),
 				       GFP_USER);
 	ret = -ENOMEM;
@@ -1968,6 +2083,10 @@ skip_full_check:
 	while (pop_stack(env, NULL) >= 0);
 	free_states(env);
 
+	if (ret == 0)
+		/* program is valid, convert *(u32*)(ctx + off) accesses */
+		ret = convert_ctx_accesses(env);
+
 	if (log_level && log_len >= log_size - 1) {
 		BUG_ON(log_len >= log_size);
 		/* verifier log exceeded user supplied buffer */
@@ -1983,18 +2102,18 @@ skip_full_check:
 
 	if (ret == 0 && env->used_map_cnt) {
 		/* if program passed verifier, update used_maps in bpf_prog_info */
-		prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
-						     sizeof(env->used_maps[0]),
-						     GFP_KERNEL);
+		env->prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
+							  sizeof(env->used_maps[0]),
+							  GFP_KERNEL);
 
-		if (!prog->aux->used_maps) {
+		if (!env->prog->aux->used_maps) {
 			ret = -ENOMEM;
 			goto free_log_buf;
 		}
 
-		memcpy(prog->aux->used_maps, env->used_maps,
+		memcpy(env->prog->aux->used_maps, env->used_maps,
 		       sizeof(env->used_maps[0]) * env->used_map_cnt);
-		prog->aux->used_map_cnt = env->used_map_cnt;
+		env->prog->aux->used_map_cnt = env->used_map_cnt;
 
 		/* program is valid. Convert pseudo bpf_ld_imm64 into generic
 		 * bpf_ld_imm64 instructions
@@ -2006,11 +2125,12 @@ free_log_buf:
 	if (log_level)
 		vfree(log_buf);
 free_env:
-	if (!prog->aux->used_maps)
+	if (!env->prog->aux->used_maps)
 		/* if we didn't copy map pointers into bpf_prog_info, release
 		 * them now. Otherwise free_bpf_prog_info() will release them.
 		 */
 		release_maps(env);
+	*prog = env->prog;
 	kfree(env);
 	mutex_unlock(&bpf_verifier_lock);
 	return ret;
diff --git a/net/core/filter.c b/net/core/filter.c
index 7a4eb7030dba..b5fcc7e2b608 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1147,13 +1147,67 @@ sk_filter_func_proto(enum bpf_func_id func_id)
 static bool sk_filter_is_valid_access(int off, int size,
 				      enum bpf_access_type type)
 {
-	/* skb fields cannot be accessed yet */
-	return false;
+	/* only read is allowed */
+	if (type != BPF_READ)
+		return false;
+
+	/* check bounds */
+	if (off < 0 || off >= sizeof(struct __sk_buff))
+		return false;
+
+	/* disallow misaligned access */
+	if (off % size != 0)
+		return false;
+
+	/* all __sk_buff fields are __u32 */
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static u32 sk_filter_convert_ctx_access(int dst_reg, int src_reg, int ctx_off,
+					struct bpf_insn *insn_buf)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct __sk_buff, len):
+		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sk_buff, len));
+		break;
+
+	case offsetof(struct __sk_buff, mark):
+		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sk_buff, mark));
+		break;
+
+	case offsetof(struct __sk_buff, ifindex):
+		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sk_buff, skb_iif));
+		break;
+
+	case offsetof(struct __sk_buff, pkt_type):
+		*insn++ = BPF_LDX_MEM(BPF_B, dst_reg, src_reg, PKT_TYPE_OFFSET());
+		*insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, PKT_TYPE_MAX);
+#ifdef __BIG_ENDIAN_BITFIELD
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, 5);
+#endif
+		break;
+
+	case offsetof(struct __sk_buff, queue_mapping):
+		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sk_buff, queue_mapping));
+		break;
+	}
+
+	return insn - insn_buf;
 }
 
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto = sk_filter_func_proto,
 	.is_valid_access = sk_filter_is_valid_access,
+	.convert_ctx_access = sk_filter_convert_ctx_access,
 };
 
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
-- 
1.7.9.5

* [PATCH net-next 0/2] bpf: allow extended BPF programs access skb fields
From: Alexei Starovoitov @ 2015-03-13  2:21 UTC (permalink / raw)
  To: David S. Miller
  Cc: Daniel Borkmann, Thomas Graf, linux-api, netdev, linux-kernel

Hi All,

Classic BPF has a way to access skb fields, whereas extended BPF does not.
This patch set introduces this ability.

Classic BPF can access fields via negative SKF_AD_OFF offsets.
A positive bpf_ld_abs N is treated as a load from the packet, whereas
bpf_ld_abs -0x1000 + N is treated as an skb field access.
Many offsets were hard-coded over the years: SKF_AD_PROTOCOL, SKF_AD_PKTTYPE, etc.
The problem with this approach is that for every new field the classic BPF
assembler has to be tweaked.

I've considered doing the same for extended BPF, but then the LLVM compiler
would have to be modified for every new field, since each one would need a new
intrinsic. It could be done with a single intrinsic and a magic offset, or with
inline assembler, but neither is clean from the compiler backend's point of
view, since they look like calls but shouldn't scratch caller-saved registers.

Another approach was to introduce new helper functions like bpf_get_pkt_type()
for every field that we want to access, but that is equally ugly for the kernel
and slow, since helpers are calls and calls are slower than plain loads.
In theory helper calls can be 'inlined' inside the kernel into direct loads, but
since they are calls as far as user space is concerned, the compiler would have
to spill registers around such calls anyway. Teaching the compiler to treat
such helpers differently is even uglier.

There were a few other ideas considered. In the end the best seems to be to
introduce a user-accessible mirror of the in-kernel sk_buff structure:

struct __sk_buff {
    __u32 len;
    __u32 pkt_type;
    __u32 mark;
    __u32 ifindex;
    __u32 queue_mapping;
};

bpf programs will do:

int bpf_prog1(struct __sk_buff *skb)
{
    __u32 var = skb->pkt_type;

which will be compiled to bpf assembler as:

dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)

bpf verifier will check validity of access and will convert it to:

dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
dst_reg &= 7

since 'pkt_type' is a bitfield.

llvm doesn't need to be modified at all, JITs don't change either, and the
verifier already knows when it accesses the 'ctx' pointer.
The only thing needed was to convert the user-visible offset within __sk_buff
to the kernel-internal offset within sk_buff.
For 'len' and other fields the conversion is trivial.
Converting 'pkt_type' takes 2 or 3 instructions depending on endianness.
More fields can be exposed by adding them to the end of 'struct __sk_buff';
vlan_tci and others can be added later.

When the pkt_type field is moved around, goes into a different structure, is
removed, or changes size, the function sk_filter_convert_ctx_access() will need
to be updated, just like the function convert_bpf_extensions() in the case of
classic BPF.
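
As a rough illustration of the offset conversion described above, here is a
small user-space sketch. It is not the kernel code: 'struct kern_skb' is a
made-up stand-in for the real sk_buff layout, load_ctx_field() is a
hypothetical helper, and only the little-endian bitfield case (mask with 7,
no shift) is modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* User-visible mirror; same field order as struct __sk_buff above. */
struct user_skb {
    uint32_t len;
    uint32_t pkt_type;
    uint32_t mark;
    uint32_t ifindex;
    uint32_t queue_mapping;
};

/* Hypothetical internal layout; NOT the real struct sk_buff. */
struct kern_skb {
    void    *next;
    uint32_t len;
    uint8_t  pkt_type_byte;  /* assumed: low 3 bits hold pkt_type */
    uint16_t queue_mapping;
    uint32_t mark;
    int      skb_iif;
};

/*
 * Emulate one converted load: translate an offset into the user-visible
 * mirror into a read of the matching internal field.
 * Returns 0 on success, -1 if the offset names no exposed field.
 */
static int load_ctx_field(const struct kern_skb *skb, size_t user_off,
                          uint32_t *out)
{
    switch (user_off) {
    case offsetof(struct user_skb, len):
        *out = skb->len;
        return 0;
    case offsetof(struct user_skb, pkt_type):
        /* byte load followed by a mask, since pkt_type is a bitfield */
        *out = skb->pkt_type_byte & 7;
        return 0;
    case offsetof(struct user_skb, mark):
        *out = skb->mark;
        return 0;
    case offsetof(struct user_skb, ifindex):
        *out = (uint32_t)skb->skb_iif;
        return 0;
    case offsetof(struct user_skb, queue_mapping):
        *out = skb->queue_mapping;
        return 0;
    default:
        return -1;
    }
}
```

The point of the sketch is that the program only ever sees offsets into the
mirror, so the internal layout can change without breaking loaded programs as
long as the switch is kept up to date.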

Patch 2 updates the examples to demonstrate how fields are accessed and
adds new tests for the verifier, since it needs to detect a corner case where
an attacker uses a single bpf instruction in two branches with different
register types.

The 5 fields of __sk_buff are already exposed to user space via classic BPF and
I believe they're useful to access from extended BPF.
I don't think we need to expose skb->protocol or skb->dev->type, but that's
a separate discussion.

Daniel,
patch 1 includes a bit of code that does prog_realloc and branch adjustment
to make room for new instructions. I think you'd need the same for your
'constant blinding' work. If indeed that would be the case, we'll make it
into a helper function.
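
For readers following along, the "make room and adjust branches" step can be
sketched in user space roughly as below. This is only an illustration under
simplifying assumptions: 'struct insn' is a toy type rather than struct
bpf_insn, expand_insn() is a hypothetical name, and the sketch recomputes each
jump offset from its source and target indices instead of patching offsets in
place as the patch does.

```c
#include <string.h>

/* Toy instruction; a jump at index i targets index i + off + 1. */
struct insn {
    int is_jmp;  /* nonzero for a relative jump */
    int off;     /* relative jump offset */
    int tag;     /* payload, only used to tell instructions apart */
};

/* Map an index in the old program to its index after the expansion. */
static int remap(int idx, int pos, int delta)
{
    return idx > pos ? idx + delta : idx;
}

/*
 * Replace the instruction at 'pos' with 'cnt' instructions from 'repl'.
 * 'out' must have room for len + cnt - 1 entries. The replaced instruction
 * is assumed not to be a jump. Returns the new program length.
 */
static int expand_insn(const struct insn *in, int len, int pos,
                       const struct insn *repl, int cnt, struct insn *out)
{
    int delta = cnt - 1;
    int i;

    memcpy(out, in, sizeof(*in) * pos);
    memcpy(out + pos, repl, sizeof(*repl) * cnt);
    memcpy(out + pos + cnt, in + pos + 1, sizeof(*in) * (len - pos - 1));

    /* Recompute every jump that existed in the old program. */
    for (i = 0; i < len; i++) {
        int target, new_i;

        if (i == pos || !in[i].is_jmp)
            continue;
        target = i + in[i].off + 1;
        new_i = remap(i, pos, delta);
        out[new_i].off = remap(target, pos, delta) - new_i - 1;
    }
    return len + delta;
}
```

A jump that targets the replaced instruction itself keeps pointing at the
first instruction of the expansion, which is the behaviour one would want for
a converted load.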

Since sk_filter_ops are shared between BPF_PROG_TYPE_SOCKET_FILTER and
BPF_PROG_TYPE_SCHED_CLS types, cls_bpf will be able to see packet length :)

Alexei Starovoitov (2):
  bpf: allow extended BPF programs access skb fields
  samples: bpf: add skb->field examples and tests

 include/linux/bpf.h         |    5 +-
 include/uapi/linux/bpf.h    |    8 +++
 kernel/bpf/syscall.c        |    2 +-
 kernel/bpf/verifier.c       |  152 ++++++++++++++++++++++++++++++++++++++-----
 net/core/filter.c           |   58 ++++++++++++++++-
 samples/bpf/sockex1_kern.c  |    8 ++-
 samples/bpf/sockex1_user.c  |    2 +-
 samples/bpf/sockex2_kern.c  |   26 +++++---
 samples/bpf/sockex2_user.c  |   11 +++-
 samples/bpf/test_verifier.c |   73 +++++++++++++++++++++
 10 files changed, 309 insertions(+), 36 deletions(-)

-- 
1.7.9.5

* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Thiago Macieira @ 2015-03-13  2:07 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Michael Kerrisk, linux-kernel, linux-api,
	linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

On Thursday 12 March 2015 18:40:03 Josh Triplett wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the
> caller handle child process exit notification via a file descriptor rather
> than SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch
> and manage child processes on behalf of their caller, *without* taking over
> process-wide SIGCHLD handling (either via signal handler or signalfd).

FYI, the matching use of this feature in Qt can be found at:

	https://codereview.qt-project.org/108455
	https://codereview.qt-project.org/108456

The forkfd.c file this modifies aims at implementing the semantics of CLONE_FD 
for the fork case when support for CLONE_FD is missing in the kernel.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

* [PATCH] clone4.2: New manpage documenting clone4(2)
From: Josh Triplett @ 2015-03-13  1:41 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-man, linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

Also includes new cross-reference from clone.2.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 man2/clone.2  |   1 +
 man2/clone4.2 | 332 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 333 insertions(+)
 create mode 100644 man2/clone4.2

diff --git a/man2/clone.2 b/man2/clone.2
index 752c01e..7013885 100644
--- a/man2/clone.2
+++ b/man2/clone.2
@@ -1209,6 +1209,7 @@ main(int argc, char *argv[])
 }
 .fi
 .SH SEE ALSO
+.BR clone4 (2),
 .BR fork (2),
 .BR futex (2),
 .BR getpid (2),
diff --git a/man2/clone4.2 b/man2/clone4.2
new file mode 100644
index 0000000..c2ce188
--- /dev/null
+++ b/man2/clone4.2
@@ -0,0 +1,332 @@
+.\" Based on clone.2:
+.\" Copyright (c) 1992 Drew Eckhardt <drew@cs.colorado.edu>, March 28, 1992
+.\" and Copyright (c) Michael Kerrisk, 2001, 2002, 2005, 2013
+.\"
+.\" %%%LICENSE_START(GPL_NOVERSION_ONELINE)
+.\" May be distributed under the GNU General Public License.
+.\" %%%LICENSE_END
+.TH CLONE4 2 2015-03-01 "Linux" "Linux Programmer's Manual"
+.SH NAME
+clone4 \- create a child process
+.SH SYNOPSIS
+.nf
+/* Prototype for the glibc wrapper function */
+
+.B #define _GNU_SOURCE
+.B #include <sched.h>
+
+.BI "int clone4(uint64_t " flags ,
+.BI "           size_t " args_size ,
+.BI "           struct clone4_args *" args ,
+.BI "           int (*" "fn" ")(void *), void *" arg );
+
+/* Prototype for the raw system call */
+
+.BI "int clone4(unsigned " flags_high ", unsigned " flags_low ,
+.BI "           unsigned long " args_size ,
+.BI "           struct clone4_args *" args );
+
+struct clone4_args {
+    pid_t *ptid;
+    pid_t *ctid;
+    unsigned long stack_start;
+    unsigned long stack_size;
+    unsigned long tls;
+};
+
+.SH DESCRIPTION
+.BR clone4 ()
+creates a new process, similar to
+.BR clone (2)
+and
+.BR fork (2).
+.BR clone4 ()
+supports additional flags that
+.BR clone (2)
+does not, and accepts arguments via an extensible structure.
+
+.I args
+points to a
+.I clone4_args
+structure, and
+.I args_size
+must contain the size of that structure, as understood by the caller.  If the
+caller passes a shorter structure than the kernel expects, the remaining fields
+will default to 0.  If the caller passes a larger structure than the kernel
+expects (such as one from a newer kernel),
+.BR clone4 ()
+will return
+.BR EINVAL .
+The
+.I clone4_args
+structure may gain additional fields at the end in the future, and callers must
+only pass a size that encompasses the number of fields they understand.  If the
+caller passes 0 for
+.IR args_size ,
+.I args
+is ignored and may be NULL.
+
+In the
+.I clone4_args
+structure,
+.IR ptid ,
+.IR ctid ,
+.IR stack_start ,
+.IR stack_size ,
+and
+.I tls
+have the same semantics as they do with
+.BR clone (2)
+and
+.BR clone2 (2).
+
+In the glibc wrapper,
+.I fn
+and
+.I arg
+have the same semantics as they do with
+.BR clone (2).
+As with
+.BR clone (2),
+the underlying system call works more like
+.BR fork (2),
+returning 0 in the child process; the glibc wrapper simplifies thread execution
+by calling
+.IR fn ( arg )
+and exiting the child when that function exits.
+
+The 64-bit
+.I flags
+argument (split into the 32-bit
+.I flags_high
+and
+.I flags_low
+arguments in the kernel interface)
+accepts all the same flags as
+.BR clone (2),
+with the exception of the obsolete
+.BR CLONE_PID ,
+.BR CLONE_DETACHED ,
+and
+.BR CLONE_STOPPED .
+In addition,
+.I flags
+accepts the following flags:
+
+.TP
+.B CLONE_FD
+Instead of returning a process ID,
+.BR clone4 ()
+with the
+.B CLONE_FD
+flag returns a file descriptor associated with the new process.
+When the new process exits, the kernel will not send a signal to the parent
+process, and will not keep the new process around as a "zombie" process until a
+call to
+.BR waitpid (2)
+or similar.  Instead, the file descriptor will become available for reading,
+and the new process will be immediately reaped.
+
+Unlike using
+.BR signalfd (2)
+for the
+.B SIGCHLD
+signal,
+the file descriptor returned by
+.BR clone4 ()
+with the
+.B CLONE_FD
+flag works even with
+.B SIGCHLD
+unblocked in one or more threads of the parent process, and allows the process
+to have different handlers for different child processes, such as those created
+by a library, without introducing race conditions around process-wide signal
+handling.
+
+.BR clone4 ()
+will never return a file descriptor in the range 0-2 to the caller, to avoid
+ambiguity with the return of 0 in the child process.  Only the calling process
+will have the new file descriptor open; the child process will not.
+
+Since the kernel does not send a termination signal when a child process
+created with
+.B CLONE_FD
+exits, the low byte of flags does not contain a signal number.  Instead, the
+low byte of flags can contain the following additional flags for use with
+.BR CLONE_FD :
+
+.RS
+.TP
+.B CLONEFD_CLOEXEC
+Set the
+.B O_CLOEXEC
+flag on the new open file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+
+.TP
+.B CLONEFD_NONBLOCK
+Set the
+.B O_NONBLOCK
+flag on the new open file descriptor.
+Using this flag saves extra calls to
+.BR fcntl (2)
+to achieve the same result.
+.RE
+
+.IP
+.BR clone4 ()
+with the
+.B CLONE_FD
+flag returns a file descriptor that supports the following operations:
+.RS
+.TP
+.BR read "(2) (and similar)"
+When the new process exits, reading from the file descriptor produces
+a single
+.I clonefd_info
+structure:
+.nf
+
+struct clonefd_info {
+    uint32_t code;   /* Signal code */
+    uint32_t status; /* Exit status or signal */
+    uint64_t utime;  /* User CPU time */
+    uint64_t stime;  /* System CPU time */
+};
+
+.fi
+.IP
+If the new process has not yet exited,
+.BR read (2)
+either blocks until it does,
+or fails with the error
+.B EAGAIN
+if the file descriptor has been made nonblocking.
+.IP
+Future kernels may extend
+.I clonefd_info
+by appending additional fields to the end.  Callers should read as many bytes
+as they understand; unread data will be discarded, and subsequent reads after
+the first will return 0 to indicate end-of-file.  Callers requesting more bytes
+than the kernel provides (such as callers expecting a newer
+.I clonefd_info
+structure) will receive a shorter structure from older kernels.
+.TP
+.BR poll "(2), " select "(2), " epoll "(7) (and similar)"
+The file descriptor is readable
+(the
+.BR select (2)
+.I readfds
+argument; the
+.BR poll (2)
+.B POLLIN
+flag)
+if the new process has exited.
+.TP
+.BR close (2)
+When the file descriptor is no longer required it should be closed.  If no
+process has a file descriptor open for the new process, no process will receive
+any notification when the new process exits.  The new process will still be
+immediately reaped.
+.RE
+
+.SS C library/kernel ABI differences
+As with
+.BR clone (2),
+the raw
+.BR clone4 ()
+system call corresponds more closely to
+.BR fork (2)
+in that execution in the child continues from the point of the call.
+
+Unlike
+.BR clone (2),
+the raw system call interface for
+.BR clone4 ()
+accepts arguments in the same order on all architectures.
+
+The raw system call accepts
+.I flags
+as two 32-bit arguments,
+.I flags_high
+and
+.IR flags_low ,
+to simplify portability across 32-bit and 64-bit architectures and calling
+conventions.  The glibc wrapper accepts
+.I flags
+as a single 64-bit argument for convenience.
+
+.SH RETURN VALUE
+For the glibc wrapper, on success,
+.BR clone4 ()
+returns the file descriptor (with
+.BR CLONE_FD )
+or new process ID
+(without
+.BR CLONE_FD ),
+and the child process begins running at the specified function.
+
+For the raw syscall, on success,
+.BR clone4 ()
+returns the file descriptor or new process ID to the calling process, and
+returns 0 in the new child process.
+
+On failure,
+.BR clone4 ()
+returns \-1 and sets
+.I errno
+accordingly.
+
+.SH ERRORS
+.BR clone4 ()
+can return any error from
+.BR clone (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+.I flags
+contained an unknown flag.
+.TP
+.B EINVAL
+.I flags
+included
+.BR CLONE_FD ,
+but the kernel configuration does not have the
+.B CONFIG_CLONEFD
+option enabled.
+.TP
+.B EMFILE
+.I flags
+included
+.BR CLONE_FD ,
+but the new file descriptor would exceed the process limit on open file descriptors.
+.TP
+.B ENFILE
+.I flags
+included
+.BR CLONE_FD ,
+but the new file descriptor would exceed the system-wide limit on open file descriptors.
+.TP
+.B ENODEV
+.I flags
+included
+.BR CLONE_FD ,
+but
+.BR clone4 ()
+could not mount the (internal) anonymous inode device.
+
+.SH CONFORMING TO
+.BR clone4 ()
+is Linux-specific and should not be used in programs intended to be portable.
+
+.SH SEE ALSO
+.BR clone (2),
+.BR epoll (7),
+.BR poll (2),
+.BR pthreads (7),
+.BR read (2),
+.BR select (2)
-- 
2.1.4

* [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Josh Triplett @ 2015-03-13  1:41 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

When passed CLONE_FD, clone4 will return a file descriptor rather than a
PID.  When the child process exits, it gets automatically reaped, and
the file descriptor becomes readable, producing a structure containing
the exit code and user/system time.  The file descriptor also works in
epoll, poll, or select.
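
As a sketch of how a caller might consume that structure, the snippet below
decodes the bytes returned by read() on the descriptor. The struct layout is
the clonefd_info this patch adds to include/uapi/linux/sched.h;
clonefd_decode() is a hypothetical helper, and treating anything shorter than
code plus status as an error is an assumption of the sketch, not something the
patch specifies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout as added to the uapi header by this patch. */
struct clonefd_info {
    int32_t  code;
    int32_t  status;
    uint64_t utime;
    uint64_t stime;
};

/*
 * Decode 'n' bytes obtained from read() on a CLONE_FD descriptor.
 * Fields the kernel did not supply are left as zero, and any extra bytes
 * beyond the structure known to the caller are ignored.
 * Returns 0 on success, -1 if even code and status are incomplete.
 */
static int clonefd_decode(const void *buf, size_t n, struct clonefd_info *info)
{
    if (n < offsetof(struct clonefd_info, utime))
        return -1;
    if (n > sizeof(*info))
        n = sizeof(*info);
    memset(info, 0, sizeof(*info));
    memcpy(info, buf, n);
    return 0;
}
```

Zero-filling first means a caller built against a larger future structure
would still see well-defined values when running on an older kernel.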

This allows libraries to safely launch and manage child processes on
behalf of a caller, without taking over or interfering with process-wide
signal handling.  Without this, such a library would need to take over
or cooperate with the entire process's SIGCHLD handling, either via a
signal handler or a signalfd.

CLONE_FD will never return a file descriptor in the 0-2 range; thus, a 0
return from clone4 still indicates the child process.

Since a process created with CLONE_FD does not send any exit signal, the
low byte of the clone flags no longer needs to contain a signal number,
freeing it up for use as CLONE_FD-specific flags; use that to provide
the usual CLOEXEC and NONBLOCK flags.
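
The low-byte check this implies can be sketched in isolation as follows. The
flag values are the ones defined in this patch, CSIGNAL is the existing
low-byte mask from the uapi header, and check_clonefd_low_byte() is a
hypothetical name used only for this illustration; the patch itself performs
the check inline in copy_process().

```c
#include <stdint.h>

#define CSIGNAL          0x000000ffULL  /* low byte: exit signal mask */
#define CLONE_FD         0x00001000ULL
#define CLONEFD_CLOEXEC  0x00000001ULL
#define CLONEFD_NONBLOCK 0x00000002ULL

/*
 * Returns 0 if the low byte of clone_flags is acceptable, -1 (standing in
 * for -EINVAL) if CLONE_FD is set and the low byte carries unknown bits.
 */
static int check_clonefd_low_byte(uint64_t clone_flags)
{
    if (!(clone_flags & CLONE_FD))
        return 0;  /* without CLONE_FD the low byte is a signal number */

    if (clone_flags & CSIGNAL & ~(CLONEFD_CLOEXEC | CLONEFD_NONBLOCK))
        return -1;

    return 0;
}
```

Note that passing a signal number such as SIGCHLD together with CLONE_FD is
rejected by this check, since its bits fall outside the two known flags.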

CLONE_FD takes the value of the unused CLONE_PID, so CLONE4_VALID_FLAGS
now includes CLONE_FD; CLONE_VALID_FLAGS still doesn't, and sys_clone
still ignores that flag, as only clone4 can use it.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 include/linux/sched.h      |   5 ++
 include/uapi/linux/sched.h |  23 ++++++++-
 init/Kconfig               |  11 ++++
 kernel/Makefile            |   1 +
 kernel/clonefd.c           | 123 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/clonefd.h           |  27 ++++++++++
 kernel/exit.c              |  10 +++-
 kernel/fork.c              |  40 ++++++++++++---
 8 files changed, 231 insertions(+), 9 deletions(-)
 create mode 100644 kernel/clonefd.c
 create mode 100644 kernel/clonefd.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 668c58f..55cf10bb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,6 +1351,9 @@ struct task_struct {
 #if defined(SPLIT_RSS_COUNTING)
 	struct task_rss_stat	rss_stat;
 #endif
+#ifdef CONFIG_CLONEFD
+	wait_queue_head_t clonefd_wqh;
+#endif
 /* task state */
 	int exit_state;
 	int exit_code, exit_signal;
@@ -1372,6 +1375,8 @@ struct task_struct {
 	unsigned memcg_kmem_skip_account:1;
 #endif
 
+	unsigned autoreap:1; /* Do not become a zombie on exit */
+
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
 	struct restart_block restart_block;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b5b8012..d2082c61 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -38,10 +38,31 @@
 #define CLONE_STOPPED	0x02000000
 
 /*
+ * Flags that only work with clone4.
+ */
+#define CLONE_FD	0x00001000	/* set if we want a file descriptor rather than a PID */
+
+/*
  * Valid flags for clone and for clone4
  */
 #define CLONE_VALID_FLAGS	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
-#define CLONE4_VALID_FLAGS	CLONE_VALID_FLAGS
+#define CLONE4_VALID_FLAGS	(CLONE_VALID_FLAGS | CLONE_FD)
+
+/*
+ * Flags passed in the low byte when using CLONE_FD, in place of the signal.
+ */
+#define CLONEFD_CLOEXEC		0x00000001	/* Used with CLONE_FD to set O_CLOEXEC on new fd */
+#define CLONEFD_NONBLOCK	0x00000002	/* Used with CLONE_FD to set O_NONBLOCK on new fd */
+
+/*
+ * Structure read from CLONE_FD file descriptor after process exits
+ */
+struct clonefd_info {
+        __s32 code;
+        __s32 status;
+        __u64 utime;
+        __u64 stime;
+};
 
 /*
  * Structure passed to clone4 for additional arguments.  Initialized to 0,
diff --git a/init/Kconfig b/init/Kconfig
index 3ab6649..b444280 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1521,6 +1521,17 @@ config CLONE4
 
 	  If unsure, say Y.
 
+config CLONEFD
+	bool "Enable CLONE_FD flag for clone4()" if EXPERT
+	depends on CLONE4
+	select ANON_INODES
+	default y
+	help
+	  Enable the CLONE_FD flag for clone4(), which creates a file descriptor
+	  to receive child exit events rather than receiving a signal.
+
+	  If unsure, say Y.
+
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call" if EXPERT
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..368986c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -29,6 +29,7 @@ obj-y += rcu/
 obj-y += livepatch/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
+obj-$(CONFIG_CLONEFD) += clonefd.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff --git a/kernel/clonefd.c b/kernel/clonefd.c
new file mode 100644
index 0000000..78fb776
--- /dev/null
+++ b/kernel/clonefd.c
@@ -0,0 +1,123 @@
+/*
+ * Support functions for CLONE_FD
+ *
+ * Copyright (c) 2015 Intel Corporation
+ * Original authors: Josh Triplett <josh@joshtriplett.org>
+ *                   Thiago Macieira <thiago@macieira.org>
+ */
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include "clonefd.h"
+
+static int clonefd_release(struct inode *inode, struct file *file)
+{
+	put_task_struct(file->private_data);
+	return 0;
+}
+
+static unsigned int clonefd_poll(struct file *file, poll_table *wait)
+{
+	struct task_struct *p = file->private_data;
+	poll_wait(file, &p->clonefd_wqh, wait);
+	return p->exit_state == EXIT_DEAD ? (POLLIN | POLLRDNORM) : 0;
+}
+
+static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct task_struct *p = file->private_data;
+	int ret = 0;
+
+	/* EOF after first read */
+	if (*ppos)
+		return 0;
+
+	if (file->f_flags & O_NONBLOCK)
+		ret = -EAGAIN;
+	else
+		ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state == EXIT_DEAD);
+
+	if (p->exit_state == EXIT_DEAD) {
+		struct clonefd_info info = {};
+		cputime_t utime, stime;
+		task_exit_code_status(p->exit_code, &info.code, &info.status);
+		info.code &= ~__SI_MASK;
+		task_cputime(p, &utime, &stime);
+		info.utime = cputime_to_clock_t(utime + p->signal->utime);
+		info.stime = cputime_to_clock_t(stime + p->signal->stime);
+		ret = simple_read_from_buffer(buf, count, ppos, &info, sizeof(info));
+	}
+	return ret;
+}
+
+static struct file_operations clonefd_fops = {
+	.release = clonefd_release,
+	.poll = clonefd_poll,
+	.read = clonefd_read,
+	.llseek = no_llseek,
+};
+
+/* Do process exit notification for clonefd. */
+void clonefd_do_notify(struct task_struct *p)
+{
+	if (p->autoreap)
+		wake_up_all(&p->clonefd_wqh);
+}
+
+/* Handle the CLONE_FD case for copy_process. */
+int clonefd_do_clone(u64 clone_flags, struct task_struct *p, struct clonefd_setup *setup)
+{
+	int flags;
+	struct file *file;
+	int fd;
+
+	if (!(clone_flags & CLONE_FD))
+		return 0;
+
+	p->autoreap = 1;
+	init_waitqueue_head(&p->clonefd_wqh);
+
+	get_task_struct(p);
+	flags = O_RDONLY | FMODE_ATOMIC_POS
+	      | (clone_flags & CLONEFD_CLOEXEC ? O_CLOEXEC : 0)
+	      | (clone_flags & CLONEFD_NONBLOCK ? O_NONBLOCK : 0);
+	file = anon_inode_getfile("[process]", &clonefd_fops, p, flags);
+	if (IS_ERR(file)) {
+		put_task_struct(p);
+		return PTR_ERR(file);
+	}
+
+	/*
+	 * We avoid allocating a low fd so that clone can still return 0 in the
+	 * child; the child shouldn't have to change just because the parent
+	 * used CLONE_FD.
+	 */
+	fd = alloc_fd(3, flags);
+	if (fd < 0) {
+		fput(file);
+		return fd;
+	}
+
+	setup->fd = fd;
+	setup->file = file;
+
+	return 0;
+}
+
+/* Clean up clonefd information after a partially complete clone */
+void clonefd_cleanup_failed_clone(struct task_struct *p, struct clonefd_setup *setup)
+{
+	if (setup->fd)
+		put_unused_fd(setup->fd);
+	if (setup->file)
+		fput(setup->file);
+}
+
+/* Finish setting up the clonefd */
+int clonefd_install_fd(struct task_struct *p, struct clonefd_setup *setup)
+{
+	fd_install(setup->fd, setup->file);
+	return setup->fd;
+}
diff --git a/kernel/clonefd.h b/kernel/clonefd.h
new file mode 100644
index 0000000..07bd31f
--- /dev/null
+++ b/kernel/clonefd.h
@@ -0,0 +1,27 @@
+/*
+ * Support functions for CLONE_FD
+ *
+ * Copyright (c) 2015 Intel Corporation
+ * Original authors: Josh Triplett <josh@joshtriplett.org>
+ *                   Thiago Macieira <thiago@macieira.org>
+ */
+#pragma once
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_CLONEFD
+struct clonefd_setup {
+	int fd;
+	struct file *file;
+};
+int clonefd_do_clone(u64 clone_flags, struct task_struct *p, struct clonefd_setup *setup);
+void clonefd_cleanup_failed_clone(struct task_struct *p, struct clonefd_setup *setup);
+int clonefd_install_fd(struct task_struct *p, struct clonefd_setup *setup);
+void clonefd_do_notify(struct task_struct *p);
+#else /* CONFIG_CLONEFD */
+struct clonefd_setup {};
+static inline int clonefd_do_clone(u64 clone_flags, struct task_struct *p, struct clonefd_setup *setup) { return 0; }
+static inline void clonefd_cleanup_failed_clone(struct task_struct *p, struct clonefd_setup *setup) {}
+static inline int clonefd_install_fd(struct task_struct *p, struct clonefd_setup *setup) { return -EINVAL; }
+static inline void clonefd_do_notify(struct task_struct *p) {}
+#endif /* CONFIG_CLONEFD */
diff --git a/kernel/exit.c b/kernel/exit.c
index feff10b..a2c8520 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -59,6 +59,8 @@
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>
 
+#include "clonefd.h"
+
 static void exit_mm(struct task_struct *tsk);
 
 static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	if (group_dead)
 		kill_orphaned_pgrp(tsk->group_leader, NULL);
 
-	if (unlikely(tsk->ptrace)) {
+	if (tsk->autoreap) {
+		autoreap = true;
+	} else if (unlikely(tsk->ptrace)) {
 		int sig = thread_group_leader(tsk) &&
 				thread_group_empty(tsk) &&
 				!ptrace_reparented(tsk) ?
@@ -612,8 +616,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	}
 
 	tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
-	if (tsk->exit_state == EXIT_DEAD)
+	if (tsk->exit_state == EXIT_DEAD) {
 		list_add(&tsk->ptrace_entry, &dead);
+		clonefd_do_notify(tsk);
+	}
 
 	/* mt-exec, de_thread() is waiting for group leader */
 	if (unlikely(tsk->signal->notify_count < 0))
diff --git a/kernel/fork.c b/kernel/fork.c
index e29edea..00cab05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -87,6 +87,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+#include "clonefd.h"
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -321,6 +323,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	if (err)
 		goto free_ti;
 
+	tsk->autoreap = 0;
+
 	tsk->stack = ti;
 #ifdef CONFIG_SECCOMP
 	/*
@@ -1193,7 +1197,8 @@ static struct task_struct *copy_process(u64 clone_flags,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
-					unsigned long tls)
+					unsigned long tls,
+					struct clonefd_setup *clonefd_setup)
 {
 	int retval;
 	struct task_struct *p;
@@ -1244,6 +1249,16 @@ static struct task_struct *copy_process(u64 clone_flags,
 			return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * If using CLONE_FD, the low byte is used for additional flags; check
+	 * for unknown flags.
+	 */
+	if (clone_flags & CLONE_FD) {
+		if (!IS_ENABLED(CONFIG_CLONEFD) ||
+		    (clone_flags & CSIGNAL & ~(CLONEFD_CLOEXEC | CLONEFD_NONBLOCK)))
+			return ERR_PTR(-EINVAL);
+	}
+
 	retval = security_task_create(clone_flags);
 	if (retval)
 		goto fork_out;
@@ -1416,6 +1431,10 @@ static struct task_struct *copy_process(u64 clone_flags,
 			goto bad_fork_cleanup_io;
 	}
 
+	retval = clonefd_do_clone(clone_flags, p, clonefd_setup);
+	if (retval)
+		goto bad_fork_free_pid;
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
 	/*
 	 * Clear TID on mm_release()?
@@ -1456,7 +1475,9 @@ static struct task_struct *copy_process(u64 clone_flags,
 		p->group_leader = current->group_leader;
 		p->tgid = current->tgid;
 	} else {
-		if (clone_flags & CLONE_PARENT)
+		if (clone_flags & CLONE_FD)
+			p->exit_signal = 0;
+		else if (clone_flags & CLONE_PARENT)
 			p->exit_signal = current->group_leader->exit_signal;
 		else
 			p->exit_signal = (clone_flags & CSIGNAL);
@@ -1508,7 +1529,7 @@ static struct task_struct *copy_process(u64 clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cleanup_clonefd;
 	}
 
 	if (likely(p->pid)) {
@@ -1560,6 +1581,8 @@ static struct task_struct *copy_process(u64 clone_flags,
 
 	return p;
 
+bad_fork_cleanup_clonefd:
+	clonefd_cleanup_failed_clone(p, clonefd_setup);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
@@ -1617,7 +1640,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0, NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1643,6 +1666,7 @@ static long _do_fork(
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	struct clonefd_setup clonefd_setup = {};
 
 	/*
 	 * Determine whether and which event to report to ptracer.  When
@@ -1653,7 +1677,8 @@ static long _do_fork(
 	if (!(clone_flags & CLONE_UNTRACED)) {
 		if (clone_flags & CLONE_VFORK)
 			trace = PTRACE_EVENT_VFORK;
-		else if ((clone_flags & CSIGNAL) != SIGCHLD)
+		else if ((clone_flags & CLONE_FD) ||
+			 (clone_flags & CSIGNAL) != SIGCHLD)
 			trace = PTRACE_EVENT_CLONE;
 		else
 			trace = PTRACE_EVENT_FORK;
@@ -1663,7 +1688,7 @@ static long _do_fork(
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls);
+			 child_tidptr, NULL, trace, tls, &clonefd_setup);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1686,6 +1711,9 @@ static long _do_fork(
 			get_task_struct(p);
 		}
 
+		if (clone_flags & CLONE_FD)
+			nr = clonefd_install_fd(p, &clonefd_setup);
+
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */
-- 
2.1.4


* [PATCH 5/6] fs: Make alloc_fd non-private
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk,
	linux-kernel, linux-api, linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

This allows callers to allocate a file descriptor with a defined minimum
value, without directly calling the lower-level __alloc_fd.
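
For readers more familiar with the userspace side: fcntl(F_DUPFD) exposes
the same "lowest free descriptor at or above a minimum" behaviour. The
sketch below is only an illustration of that semantic, not part of this
patch; sketch_dup_at_least is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Userspace illustration only: duplicate fd onto the lowest free
 * descriptor number that is >= min_fd.  Returns the new descriptor,
 * or -1 with errno set on failure.
 */
static int sketch_dup_at_least(int fd, int min_fd)
{
	assert(min_fd >= 0);
	return fcntl(fd, F_DUPFD, min_fd);
}
```

Calling sketch_dup_at_least(fd, 100) on an open descriptor yields a
descriptor numbered 100 or higher, provided RLIMIT_NOFILE allows it.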

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 fs/file.c            | 2 +-
 include/linux/file.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index ee738ea..583ba46 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -500,7 +500,7 @@ out:
 	return error;
 }
 
-static int alloc_fd(unsigned start, unsigned flags)
+int alloc_fd(unsigned start, unsigned flags)
 {
 	return __alloc_fd(current->files, start, rlimit(RLIMIT_NOFILE), flags);
 }
diff --git a/include/linux/file.h b/include/linux/file.h
index f87d308..d49f3bd 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -65,6 +65,7 @@ extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
 extern void put_filp(struct file *);
+extern int alloc_fd(unsigned start, unsigned flags);
 extern int get_unused_fd_flags(unsigned flags);
 extern void put_unused_fd(unsigned int fd);
 
-- 
2.1.4


* [PATCH 4/6] signal: Factor out a helper function to process task_struct exit_code
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

do_notify_parent includes the code to convert the exit_code field of
struct task_struct to the code and status fields that accompany SIGCHLD.
Factor that out into a new helper function task_exit_code_status, to
allow other methods of task exit notification to share that code.
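
As a standalone illustration of the translation being factored out, here
is a userspace sketch of the same logic. The SK_CLD_* constants are local
stand-ins carrying the numeric values of the Linux CLD_* codes, and
sketch_exit_code_status is a hypothetical name, not the kernel helper
itself.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for CLD_EXITED, CLD_KILLED and CLD_DUMPED. */
enum { SK_CLD_EXITED = 1, SK_CLD_KILLED = 2, SK_CLD_DUMPED = 3 };

/* Split a wait-style exit_code into the (code, status) pair. */
static void sketch_exit_code_status(int exit_code, int32_t *code,
				    int32_t *status)
{
	assert(code && status);
	*status = exit_code & 0x7f;		/* terminating signal, if any */
	if (exit_code & 0x80)
		*code = SK_CLD_DUMPED;		/* killed by signal, core dumped */
	else if (exit_code & 0x7f)
		*code = SK_CLD_KILLED;		/* killed by signal, no core */
	else {
		*code = SK_CLD_EXITED;		/* normal exit */
		*status = exit_code >> 8;	/* value passed to exit() */
	}
}
```

For example, an exit_code of (3 << 8) decodes to "exited with status 3",
while an exit_code of 9 decodes to "killed by signal 9".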

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/signal.c       | 24 +++++++++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9ec36fd..668c58f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2386,6 +2386,7 @@ extern int kill_pid_info_as_cred(int, struct siginfo *, struct pid *,
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
 extern int kill_pid(struct pid *pid, int sig, int priv);
 extern int kill_proc_info(int, struct siginfo *, pid_t);
+extern void task_exit_code_status(int exit_code, s32 *code, s32 *status);
 extern __must_check bool do_notify_parent(struct task_struct *, int);
 extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
 extern void force_sig(int, struct task_struct *);
diff --git a/kernel/signal.c b/kernel/signal.c
index a390499..f959d07 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1613,6 +1613,20 @@ ret:
 	return ret;
 }
 
+/* Translate exit_code to code and status. */
+void task_exit_code_status(int exit_code, s32 *code, s32 *status)
+{
+	*status = exit_code & 0x7f;
+	if (exit_code & 0x80)
+		*code = CLD_DUMPED;
+	else if (exit_code & 0x7f)
+		*code = CLD_KILLED;
+	else {
+		*code = CLD_EXITED;
+		*status = exit_code >> 8;
+	}
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1668,15 +1682,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	info.si_utime = cputime_to_clock_t(utime + tsk->signal->utime);
 	info.si_stime = cputime_to_clock_t(stime + tsk->signal->stime);
 
-	info.si_status = tsk->exit_code & 0x7f;
-	if (tsk->exit_code & 0x80)
-		info.si_code = CLD_DUMPED;
-	else if (tsk->exit_code & 0x7f)
-		info.si_code = CLD_KILLED;
-	else {
-		info.si_code = CLD_EXITED;
-		info.si_status = tsk->exit_code >> 8;
-	}
+	task_exit_code_status(tsk->exit_code, &info.si_code, &info.si_status);
 
 	psig = tsk->parent->sighand;
 	spin_lock_irqsave(&psig->siglock, flags);
-- 
2.1.4


* [PATCH 3/6] Introduce a new clone4 syscall with more flag bits and extensible arguments
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

clone() has no more usable flags available.  It has three now-unused
flags (CLONE_PID, CLONE_DETACHED, and CLONE_STOPPED), but current
kernels just ignore those flags without returning an error like EINVAL,
so reusing those flags would not allow userspace to detect the
availability of the new functionality.

Introduce a new system call, clone4, which accepts a second 32-bit flags
field.  clone4 also returns EINVAL for the currently unused flags in
clone, allowing their reuse.

To process these new flags, change the flags argument of _do_fork to a
u64.  sys_clone and do_fork both still use "unsigned long" for flags as
they did before, truncating it to 32-bit and masking out the obsolete
flags to behave like clone currently does.
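
A userspace sketch of the two flag paths described above, for
illustration only. The SK_* macros copy the flag values from the uapi
header in this series; the function names are hypothetical and the code
is not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_CLONE_PID		0x00001000u
#define SK_CLONE_DETACHED	0x00400000u
#define SK_CLONE_STOPPED	0x02000000u
#define SK_CLONE_VALID_FLAGS \
	(0xffffffffULL & ~(uint64_t)(SK_CLONE_PID | SK_CLONE_DETACHED | \
				     SK_CLONE_STOPPED))

/* Legacy entry points: keep the low 32 bits, silently drop obsolete flags. */
static uint64_t sketch_squelch(unsigned long flags)
{
	return (uint64_t)flags & SK_CLONE_VALID_FLAGS;
}

/* clone4-style entry: join the two 32-bit halves, reject invalid bits. */
static int sketch_check_clone4(uint32_t high, uint32_t low, uint64_t *out)
{
	uint64_t flags = (uint64_t)high << 32 | low;

	assert(out);
	if (flags & ~SK_CLONE_VALID_FLAGS)
		return -EINVAL;
	*out = flags;
	return 0;
}
```

With this split, a legacy caller passing CLONE_DETACHED still succeeds
with the flag ignored, whereas the clone4-style path reports EINVAL for
the same bit, which is what makes the bit reusable later.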

clone4 accepts its remaining arguments as a structure, and userspace
passes in the size of that structure.  clone4 has well-defined semantics
that allow extending that structure in the future.  New userspace
passing in a larger structure than the kernel expects will receive
EINVAL, and can use a smaller structure to work with old kernels.  New
kernels accept smaller argument structures passed by userspace, and any
un-passed arguments default to 0.
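
The size negotiation can be sketched in userspace as follows. This is a
model under stated assumptions: struct sketch_args is a hypothetical
stand-in for the argument structure, and memcpy stands in for
copy_from_user.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's view of the arguments. */
struct sketch_args {
	unsigned long ptid;
	unsigned long ctid;
	unsigned long stack_start;
	unsigned long stack_size;
	unsigned long tls;
};

/* Accept any caller-supplied prefix of the structure. */
static int sketch_fetch_args(struct sketch_args *kargs,
			     const void *uargs, size_t uargs_size)
{
	assert(kargs);
	memset(kargs, 0, sizeof(*kargs));	/* un-passed fields stay 0 */
	if (uargs_size > sizeof(*kargs))
		return -EINVAL;			/* caller newer than "kernel" */
	if (uargs_size)
		memcpy(kargs, uargs, uargs_size);
	return 0;
}
```

An older caller that only knows the first two fields passes a smaller
size and gets zeros for the rest; a caller passing a larger structure
than this model knows about receives EINVAL and can retry with a
smaller one.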

clone4 handles arguments in the same order on all architectures, with no
backwards variations; to do so, it depends on the new
HAVE_COPY_THREAD_TLS.

The new system call currently accepts exactly the same flags as clone;
future commits will introduce new flags for additional functionality.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 arch/x86/ia32/ia32entry.S        |  1 +
 arch/x86/kernel/entry_64.S       |  1 +
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  2 ++
 include/linux/compat.h           | 12 ++++++++
 include/uapi/linux/sched.h       | 33 ++++++++++++++++++++--
 init/Kconfig                     | 10 +++++++
 kernel/fork.c                    | 60 +++++++++++++++++++++++++++++++++++++---
 kernel/sys_ni.c                  |  1 +
 9 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0286735..ba28306 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -483,6 +483,7 @@ GLOBAL(\label)
 	PTREGSCALL stub32_execveat, compat_sys_execveat
 	PTREGSCALL stub32_fork, sys_fork
 	PTREGSCALL stub32_vfork, sys_vfork
+	PTREGSCALL stub32_clone4, compat_sys_clone4
 
 	ALIGN
 GLOBAL(stub32_clone)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d16..ead143f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -520,6 +520,7 @@ END(\label)
 	FORK_LIKE  clone
 	FORK_LIKE  fork
 	FORK_LIKE  vfork
+	FORK_LIKE  clone4
 	FIXED_FRAME stub_iopl, sys_iopl
 
 ENTRY(stub_execve)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..56fcc90 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	clone4			sys_clone4			stub32_clone4
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..af15b0f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	clone4			stub_clone4
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -368,3 +369,4 @@
 543	x32	io_setup		compat_sys_io_setup
 544	x32	io_submit		compat_sys_io_submit
 545	x32	execveat		stub_x32_execveat
+546	x32	clone4			stub32_clone4
diff --git a/include/linux/compat.h b/include/linux/compat.h
index ab25814..6c4a68d 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -293,6 +293,14 @@ struct compat_old_sigaction {
 };
 #endif
 
+struct compat_clone4_args {
+	compat_uptr_t ptid;
+	compat_uptr_t ctid;
+	compat_ulong_t stack_start;
+	compat_ulong_t stack_size;
+	compat_ulong_t tls;
+};
+
 struct compat_statfs;
 struct compat_statfs64;
 struct compat_old_linux_dirent;
@@ -713,6 +721,10 @@ asmlinkage long compat_sys_sched_rr_get_interval(compat_pid_t pid,
 
 asmlinkage long compat_sys_fanotify_mark(int, unsigned int, __u32, __u32,
 					    int, const char __user *);
+
+asmlinkage long compat_sys_clone4(unsigned, unsigned, compat_ulong_t,
+				  struct compat_clone4_args __user *);
+
 #else
 
 #define is_compat_task() (0)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..b5b8012 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -1,6 +1,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -18,11 +20,8 @@
 #define CLONE_SETTLS	0x00080000	/* create a new TLS for the child */
 #define CLONE_PARENT_SETTID	0x00100000	/* set the TID in the parent */
 #define CLONE_CHILD_CLEARTID	0x00200000	/* clear the TID in the child */
-#define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
 #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
 #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
@@ -31,6 +30,34 @@
 #define CLONE_IO		0x80000000	/* Clone io context */
 
 /*
+ * Old flags, unused by current clone.  clone does not return EINVAL for these
+ * flags, so they can't easily be reused.  clone4 can use them.
+ */
+#define CLONE_PID	0x00001000
+#define CLONE_DETACHED	0x00400000
+#define CLONE_STOPPED	0x02000000
+
+/*
+ * Valid flags for clone and for clone4
+ */
+#define CLONE_VALID_FLAGS	(0xffffffffULL & ~(CLONE_PID | CLONE_DETACHED | CLONE_STOPPED))
+#define CLONE4_VALID_FLAGS	CLONE_VALID_FLAGS
+
+/*
+ * Structure passed to clone4 for additional arguments.  Initialized to 0,
+ * then overwritten with arguments from userspace, so arguments not supplied by
+ * userspace will remain 0.  New versions of the kernel may safely append new
+ * arguments to the end.
+ */
+struct clone4_args {
+	__kernel_pid_t __user *ptid;
+	__kernel_pid_t __user *ctid;
+	__kernel_ulong_t stack_start;
+	__kernel_ulong_t stack_size;
+	__kernel_ulong_t tls;
+};
+
+/*
  * Scheduling policies
  */
 #define SCHED_NORMAL		0
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..3ab6649 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1511,6 +1511,16 @@ config EVENTFD
 
 	  If unsure, say Y.
 
+config CLONE4
+	bool "Enable clone4() system call" if EXPERT
+	depends on HAVE_COPY_THREAD_TLS
+	default y
+	help
+	  Enable the clone4() system call, which supports passing additional
+	  flags.
+
+	  If unsure, say Y.
+
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call" if EXPERT
diff --git a/kernel/fork.c b/kernel/fork.c
index b3dadf4..e29edea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1187,7 +1187,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
  * parts of the process environment (as per the clone
  * flags). The actual kick-off is left to the caller.
  */
-static struct task_struct *copy_process(unsigned long clone_flags,
+static struct task_struct *copy_process(u64 clone_flags,
 					unsigned long stack_start,
 					unsigned long stack_size,
 					int __user *child_tidptr,
@@ -1198,6 +1198,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 
+	if (clone_flags & ~CLONE4_VALID_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
@@ -1630,7 +1633,7 @@ struct task_struct *fork_idle(int cpu)
  * it and waits for it to finish using the VM if required.
  */
 static long _do_fork(
-		unsigned long clone_flags,
+		u64 clone_flags,
 		unsigned long stack_start,
 		unsigned long stack_size,
 		int __user *parent_tidptr,
@@ -1701,6 +1704,15 @@ static long _do_fork(
 	return nr;
 }
 
+/*
+ * Convenience function for callers passing unsigned long flags, to prevent old
+ * syscall entry points from unexpectedly returning EINVAL.
+ */
+static inline u64 squelch_clone_flags(unsigned long clone_flags)
+{
+	return (u32)(clone_flags & CLONE_VALID_FLAGS);
+}
+
 #ifndef CONFIG_HAVE_COPY_THREAD_TLS
 /* For compatibility with architectures that call do_fork directly rather than
  * using the syscall entry points below. */
@@ -1710,7 +1722,8 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(clone_flags, stack_start, stack_size,
+	return _do_fork(squelch_clone_flags(clone_flags),
+			stack_start, stack_size,
 			parent_tidptr, child_tidptr, 0);
 }
 #endif
@@ -1768,10 +1781,49 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	return _do_fork(squelch_clone_flags(clone_flags), newsp, 0,
+			parent_tidptr, child_tidptr, tls);
 }
 #endif
 
+#ifdef CONFIG_CLONE4
+SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
+		unsigned long, args_size, struct clone4_args __user *, args)
+{
+	struct clone4_args kargs = {};
+	if (args_size > sizeof(kargs)) {
+		return -EINVAL;
+	} else if (args_size) {
+		unsigned long uncopied = copy_from_user(&kargs, args, args_size);
+		if (uncopied)
+			return -EFAULT;
+	}
+	return _do_fork((u64)flags_high << 32 | flags_low,
+			kargs.stack_start, kargs.stack_size,
+			kargs.ptid, kargs.ctid, kargs.tls);
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE4(clone4, unsigned, flags_high, unsigned, flags_low,
+			compat_ulong_t, args_size,
+			struct compat_clone4_args __user *, args)
+{
+	struct compat_clone4_args kargs = {};
+	if (args_size > sizeof(kargs)) {
+		return -EINVAL;
+	} else if (args_size) {
+		unsigned long uncopied = copy_from_user(&kargs, args, args_size);
+		if (uncopied)
+			return -EFAULT;
+	}
+	return _do_fork((u64)flags_high << 32 | flags_low,
+			kargs.stack_start, kargs.stack_size,
+			compat_ptr(kargs.ptid), compat_ptr(kargs.ctid),
+			kargs.tls);
+}
+#endif /* CONFIG_COMPAT */
+#endif /* CONFIG_CLONE4 */
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..5b5d2b9 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -159,6 +159,7 @@ cond_syscall(sys_uselib);
 cond_syscall(sys_fadvise64);
 cond_syscall(sys_fadvise64_64);
 cond_syscall(sys_madvise);
+cond_syscall(sys_clone4);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
2.1.4


* [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk,
	linux-kernel, linux-api, linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

For 32-bit userspace on a 64-bit kernel, this requires modifying
stub32_clone to actually swap the appropriate arguments to match
CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
broken.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 arch/x86/Kconfig             | 1 +
 arch/x86/ia32/ia32entry.S    | 2 +-
 arch/x86/kernel/process_32.c | 6 +++---
 arch/x86/kernel/process_64.c | 8 ++++----
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..4960b0d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
 	select MODULES_USE_ELF_REL if X86_32
 	select MODULES_USE_ELF_RELA if X86_64
 	select CLONE_BACKWARDS if X86_32
+	select HAVE_COPY_THREAD_TLS
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUE_RWLOCK
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 156ebca..0286735 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -487,7 +487,7 @@ GLOBAL(\label)
 	ALIGN
 GLOBAL(stub32_clone)
 	leaq sys_clone(%rip),%rax
-	mov	%r8, %rcx
+	xchg %r8, %rcx
 	jmp  ia32_ptregs_common	
 
 	ALIGN
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 603c4f9..ead28ff 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -129,8 +129,8 @@ void release_thread(struct task_struct *dead_task)
 	release_vm86_irqs(dead_task);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-	unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+	unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct task_struct *tsk;
@@ -185,7 +185,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	 */
 	if (clone_flags & CLONE_SETTLS)
 		err = do_set_thread_area(p, -1,
-			(struct user_desc __user *)childregs->si, 0);
+			(struct user_desc __user *)tls, 0);
 
 	if (err && p->thread.io_bitmap_ptr) {
 		kfree(p->thread.io_bitmap_ptr);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 67fcc43..c69cabc 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -151,8 +151,8 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls)
 	return get_desc_base(&t->thread.tls_array[tls]);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
+		unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	int err;
 	struct pt_regs *childregs;
@@ -209,10 +209,10 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 #ifdef CONFIG_IA32_EMULATION
 		if (test_thread_flag(TIF_IA32))
 			err = do_set_thread_area(p, -1,
-				(struct user_desc __user *)childregs->si, 0);
+				(struct user_desc __user *)tls, 0);
 		else
 #endif
-			err = do_arch_prctl(p, ARCH_SET_FS, childregs->r8);
+			err = do_arch_prctl(p, ARCH_SET_FS, tls);
 		if (err)
 			goto out;
 	}
-- 
2.1.4


* [PATCH 1/6] clone: Support passing tls argument via C rather than pt_regs magic
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86
In-Reply-To: <cover.1426180120.git.josh@joshtriplett.org>

clone with CLONE_SETTLS accepts an argument to set the thread-local
storage area for the new thread.  sys_clone declares an int argument
tls_val at the appropriate point in the argument list (based on the
various CLONE_BACKWARDS variants), but doesn't actually use or pass
along that argument.  Instead, sys_clone calls do_fork, which calls
copy_process, which calls the arch-specific copy_thread, and copy_thread
pulls the corresponding syscall argument out of the pt_regs captured at
kernel entry (knowing what argument of clone that architecture passes
tls in).

Apart from being awful and inscrutable, that also only works because
only one code path into copy_thread can pass the CLONE_SETTLS flag, and
that code path comes from sys_clone with its architecture-specific
argument-passing order.  This prevents introducing a new version of the
clone system call without propagating the same architecture-specific
position of the tls argument.

However, there's no reason to pull the argument out of pt_regs when
sys_clone could just pass it down via C function call arguments.

Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt
into, and a new copy_thread_tls that accepts the tls parameter as an
additional unsigned long (syscall-argument-sized) argument.
Change sys_clone's tls argument to an unsigned long (which does
not change the ABI), and pass that down to copy_thread_tls.

Architectures that don't opt into copy_thread_tls will continue to
ignore the C argument to sys_clone in favor of the pt_regs captured at
kernel entry, and thus will be unable to introduce new versions of the
clone syscall.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Thiago Macieira <thiago.macieira@intel.com>
---
 arch/Kconfig             |  7 ++++++
 include/linux/sched.h    | 14 ++++++++++++
 include/linux/syscalls.h |  6 +++---
 kernel/fork.c            | 55 +++++++++++++++++++++++++++++++-----------------
 4 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 05d7a8a..4834a58 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -484,6 +484,13 @@ config HAVE_IRQ_EXIT_ON_IRQ_STACK
 	  This spares a stack switch and improves cache usage on softirq
 	  processing.
 
+config HAVE_COPY_THREAD_TLS
+	bool
+	help
+	  Architecture provides copy_thread_tls to accept tls argument via
+	  normal C parameter passing, rather than extracting the syscall
+	  argument from pt_regs.
+
 #
 # ABI hall of shame
 #
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..9ec36fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2479,8 +2479,22 @@ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
 /* Remove the current tasks stale references to the old mm_struct */
 extern void mm_release(struct task_struct *, struct mm_struct *);
 
+#ifdef CONFIG_HAVE_COPY_THREAD_TLS
+extern int copy_thread_tls(unsigned long, unsigned long, unsigned long,
+			struct task_struct *, unsigned long);
+#else
 extern int copy_thread(unsigned long, unsigned long, unsigned long,
 			struct task_struct *);
+
+/* Architectures that haven't opted into copy_thread_tls get the tls argument
+ * via pt_regs, so ignore the tls argument passed via C. */
+static inline int copy_thread_tls(
+		unsigned long clone_flags, unsigned long sp, unsigned long arg,
+		struct task_struct *p, unsigned long tls)
+{
+	return copy_thread(clone_flags, sp, arg, p);
+}
+#endif
 extern void flush_thread(void);
 extern void exit_thread(void);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..bb51bec 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -827,15 +827,15 @@ asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_fork(void);
 asmlinkage long sys_vfork(void);
 #ifdef CONFIG_CLONE_BACKWARDS
-asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, int,
+asmlinkage long sys_clone(unsigned long, unsigned long, int __user *, unsigned long,
 	       int __user *);
 #else
 #ifdef CONFIG_CLONE_BACKWARDS3
 asmlinkage long sys_clone(unsigned long, unsigned long, int, int __user *,
-			  int __user *, int);
+			  int __user *, unsigned long);
 #else
 asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
-	       int __user *, int);
+	       int __user *, unsigned long);
 #endif
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..b3dadf4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1192,7 +1192,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
-					int trace)
+					int trace,
+					unsigned long tls)
 {
 	int retval;
 	struct task_struct *p;
@@ -1401,7 +1402,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
-	retval = copy_thread(clone_flags, stack_start, stack_size, p);
+	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -1613,7 +1614,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1628,11 +1629,13 @@ struct task_struct *fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
-	      unsigned long stack_start,
-	      unsigned long stack_size,
-	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+static long _do_fork(
+		unsigned long clone_flags,
+		unsigned long stack_start,
+		unsigned long stack_size,
+		int __user *parent_tidptr,
+		int __user *child_tidptr,
+		unsigned long tls)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1657,7 +1660,7 @@ long do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, trace, tls);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1698,20 +1701,34 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+#ifndef CONFIG_HAVE_COPY_THREAD_TLS
+/* For compatibility with architectures that call do_fork directly rather than
+ * using the syscall entry points below. */
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return _do_fork(clone_flags, stack_start, stack_size,
+			parent_tidptr, child_tidptr, 0);
+}
+#endif
+
 /*
  * Create a kernel thread.
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL);
+	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
+		(unsigned long)arg, NULL, NULL, 0);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return do_fork(SIGCHLD, 0, 0, NULL, NULL);
+	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -1722,8 +1739,8 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL);
+	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
+			0, NULL, NULL, 0);
 }
 #endif
 
@@ -1731,27 +1748,27 @@ SYSCALL_DEFINE0(vfork)
 #ifdef CONFIG_CLONE_BACKWARDS
 SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 int __user *, parent_tidptr,
-		 int, tls_val,
+		 unsigned long, tls,
 		 int __user *, child_tidptr)
 #elif defined(CONFIG_CLONE_BACKWARDS2)
 SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
 		 int __user *, parent_tidptr,
 		 int __user *, child_tidptr,
-		 int, tls_val)
+		 unsigned long, tls)
 #elif defined(CONFIG_CLONE_BACKWARDS3)
 SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
 		int, stack_size,
 		int __user *, parent_tidptr,
 		int __user *, child_tidptr,
-		int, tls_val)
+		unsigned long, tls)
 #else
 SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 int __user *, parent_tidptr,
 		 int __user *, child_tidptr,
-		 int, tls_val)
+		 unsigned long, tls)
 #endif
 {
-	return do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr);
+	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
 }
 #endif
 
-- 
2.1.4


* [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Josh Triplett @ 2015-03-13  1:40 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
	Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
	Thomas Gleixner, Thiago Macieira, Michael Kerrisk, linux-kernel,
	linux-api, linux-fsdevel, x86

This patch series introduces a new clone flag, CLONE_FD, which lets the caller
handle child process exit notification via a file descriptor rather than
SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
child processes on behalf of their caller, *without* taking over process-wide
SIGCHLD handling (either via signal handler or signalfd).

Note that signalfd for SIGCHLD does not suffice here, because that still
receives notification for all child processes, and interferes with process-wide
signal handling.

The CLONE_FD file descriptor uniquely identifies a process on the system in a
race-free way, by holding a reference to the task_struct.  In the future, we
may introduce APIs that support using process file descriptors instead of PIDs.

Introducing CLONE_FD required two additional bits of yak shaving: Since clone
has no more usable flags (with the three currently unused flags unusable
because old kernels ignore them without EINVAL), also introduce a new clone4
system call with more flag bits and an extensible argument structure.  And
since the magic pt_regs-based syscall argument processing for clone's tls
argument would otherwise prevent introducing a sane clone4 system call, fix
that too.

I tested the CLONE_SETTLS changes with a thread-local storage test program (two
threads independently reading and writing a __thread variable), on both 32-bit
and 64-bit, and I observed no issues there.

I tested clone4 and the new CLONE_FD call with several additional test
programs, launching either a process or thread (in the former case using
syscall(), in the latter case by calling clone4 via assembly and returning to
C), sleeping in parent and child to test the case of either exiting first, and
then printing the received clonefd_info structure.  Thiago also tested clone4
with CLONE_FD with a modified version of libqt's process handling, which
includes a test suite.

I've also included the manpages patch at the end of this series.  (Note that
the manpage documents the behavior of the future glibc wrapper as well as the
raw syscall.)  Here's a formatted plain-text version of the manpage for
reference:

CLONE4(2)                  Linux Programmer's Manual                 CLONE4(2)



NAME
       clone4 - create a child process

SYNOPSIS
       /* Prototype for the glibc wrapper function */

       #define _GNU_SOURCE
       #include <sched.h>

       int clone4(uint64_t flags,
                  size_t args_size,
                  struct clone4_args *args,
                  int (*fn)(void *), void *arg);

       /* Prototype for the raw system call */

       int clone4(unsigned flags_high, unsigned flags_low,
                  unsigned long args_size,
                  struct clone4_args *args);

       struct clone4_args {
           pid_t *ptid;
           pid_t *ctid;
           unsigned long stack_start;
           unsigned long stack_size;
           unsigned long tls;
       };


DESCRIPTION
       clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
       clone4() supports additional flags that clone(2) does not, and  accepts
       arguments via an extensible structure.

       args  points to a clone4_args structure, and args_size must contain the
       size of that structure, as understood by the  caller.   If  the  caller
       passes  a  shorter  structure  than  the  kernel expects, the remaining
       fields will default to 0.  If the caller passes a larger structure than
       the  kernel  expects  (such  as one from a newer kernel), clone4() will
       return EINVAL.  The clone4_args structure may gain additional fields at
       the  end  in  the future, and callers must only pass a size that encom‐
       passes the number of fields they understand.  If the  caller  passes  0
       for args_size, args is ignored and may be NULL.
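
The size rules above can be sketched in userspace C.  This is only an
illustration under stated assumptions, not the code from the patch:
clone4_args_fetch() is a hypothetical helper name, the pid_t pointers are
shown as plain void * to avoid extra headers, and memcpy() stands in for the
user-to-kernel copy a real syscall would need.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Field order copied from the SYNOPSIS; pointer types simplified. */
struct clone4_args {
	void *ptid;
	void *ctid;
	unsigned long stack_start;
	unsigned long stack_size;
	unsigned long tls;
};

/*
 * Hypothetical helper (assumption, not from the patch): expand a
 * caller-supplied structure of args_size bytes into the full structure.
 * Returns 0 on success, or -EINVAL if the caller passed a structure
 * larger than the one understood here.
 */
static int clone4_args_fetch(struct clone4_args *kargs,
			     const void *uargs, size_t args_size)
{
	assert(kargs != NULL);

	if (args_size > sizeof(*kargs))
		return -EINVAL;

	/* Fields the caller did not supply default to 0. */
	memset(kargs, 0, sizeof(*kargs));

	/* args_size == 0 means args is ignored and may be NULL. */
	if (args_size)
		memcpy(kargs, uargs, args_size);

	return 0;
}
```

A caller built against an older, shorter layout would pass that smaller size
and see the newer fields treated as 0; a caller passing a size beyond the
known structure gets -EINVAL back.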

       In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
       tls have the same semantics as they do with clone(2) and clone2(2).

       In the glibc wrapper, fn and arg have the same  semantics  as  they  do
       with clone(2).  As with clone(2), the underlying system call works more
       like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
       plifies  thread execution by calling fn(arg) and exiting the child when
       that function exits.

       The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
       flags_low arguments in the kernel interface) accepts all the same flags
       as  clone(2),  with  the   exception   of   the   obsolete   CLONE_PID,
       CLONE_DETACHED, and CLONE_STOPPED.  In addition, flags accepts the fol‐
       lowing flags:


       CLONE_FD
              Instead of returning a process ID, clone4()  with  the  CLONE_FD
              flag  returns a file descriptor associated with the new process.
              When the new process exits, the kernel will not send a signal to
              the  parent process, and will not keep the new process around as
              a "zombie" process  until  a  call  to  waitpid(2)  or  similar.
              Instead,  the file descriptor will become available for reading,
              and the new process will be immediately reaped.

              Unlike using  signalfd(2)  for  the  SIGCHLD  signal,  the  file
              descriptor  returned  by  clone4()  with the CLONE_FD flag works
              even with SIGCHLD unblocked in one or more threads of the parent
              process,  and  allows the process to have different handlers for
              different child processes, such as those created by  a  library,
              without  introducing  race conditions around process-wide signal
              handling.

              clone4() will never return a file descriptor in the range 0-2 to
              the caller, to avoid ambiguity with the return of 0 in the child
              process.  Only the  calling  process  will  have  the  new  file
              descriptor open; the child process will not.

              Since the kernel does not send a termination signal when a child
              process created with CLONE_FD exits, the low byte of flags  does
              not contain a signal number.  Instead, the low byte of flags can
              contain the following additional flags for use with CLONE_FD:


              CLONEFD_CLOEXEC
                     Set the O_CLOEXEC flag on the new open  file  descriptor.
                     See  the description of the O_CLOEXEC flag in open(2) for
                     reasons why this may be useful.


              CLONEFD_NONBLOCK
                     Set the O_NONBLOCK flag on the new open file  descriptor.
                     Using  this flag saves extra calls to fcntl(2) to achieve
                     the same result.


              clone4() with the CLONE_FD flag returns a file  descriptor  that
              supports the following operations:

              read(2) (and similar)
                     When  the  new  process  exits,  reading  from  the  file
                     descriptor produces a single clonefd_info structure:

                     struct clonefd_info {
                         uint32_t code;   /* Signal code */
                         uint32_t status; /* Exit status or signal */
                         uint64_t utime;  /* User CPU time */
                         uint64_t stime;  /* System CPU time */
                     };


                     If the new process has not  yet  exited,  read(2)  either
                     blocks  until  it does, or fails with the error EAGAIN if
                     the file descriptor has been made nonblocking.

                     Future kernels may extend clonefd_info by appending addi‐
                     tional  fields  to  the end.  Callers should read as many
                     bytes as they understand; unread data will be  discarded,
                     and  subsequent  reads  after  the first will return 0 to
                     indicate end-of-file.  Callers requesting more bytes than
                     the  kernel  provides  (such as callers expecting a newer
                     clonefd_info structure) will receive a shorter  structure
                     from older kernels.

              poll(2), select(2), epoll(7) (and similar)
                     The  file  descriptor  is readable (the select(2) readfds
                     argument; the poll(2) POLLIN flag) if the new process has
                     exited.

              close(2)
                     When  the file descriptor is no longer required it should
                     be closed.  If no process has a file descriptor open  for
                     the new process, no process will receive any notification
                     when the new process exits.  The new process  will  still
                     be immediately reaped.
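
As a rough sketch of how a caller might consume such a descriptor, the helper
below waits with poll(2) and then reads one clonefd_info.  The structure
layout is copied from the text above; clonefd_wait() is a hypothetical name,
and nothing here calls clone4() itself, so the descriptor is assumed to have
been obtained elsewhere.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct clonefd_info {
	uint32_t code;   /* Signal code */
	uint32_t status; /* Exit status or signal */
	uint64_t utime;  /* User CPU time */
	uint64_t stime;  /* System CPU time */
};

/*
 * Hypothetical helper: block until fd is readable, then read one
 * clonefd_info.  Returns the number of bytes supplied (possibly fewer
 * than sizeof(*info) on an older kernel), 0 on end-of-file, or -1 with
 * errno set.
 */
static ssize_t clonefd_wait(int fd, struct clonefd_info *info)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t n;

	assert(info != NULL);

	/* Zero first, so fields missing from a shorter structure read as 0. */
	memset(info, 0, sizeof(*info));

	/* Wait for readability, retrying if a signal interrupts poll(). */
	while (poll(&pfd, 1, -1) < 0) {
		if (errno != EINTR)
			return -1;
	}

	/* Read the structure, retrying if a signal interrupts read(). */
	do {
		n = read(fd, info, sizeof(*info));
	} while (n < 0 && errno == EINTR);

	return n;
}
```

Because the helper only relies on read(2) and poll(2), it can be exercised
against any readable descriptor, such as a pipe, without kernel support for
CLONE_FD.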


   C library/kernel ABI differences
       As with clone(2), the raw clone4() system call corresponds more closely
       to fork(2) in that execution in the child continues from the  point  of
       the call.

       Unlike  clone(2),  the  raw  system call interface for clone4() accepts
       arguments in the same order on all architectures.

       The raw system call accepts flags as two 32-bit  arguments,  flags_high
       and  flags_low, to simplify portability across 32-bit and 64-bit archi‐
       tectures and calling conventions.  The glibc wrapper accepts flags as a
       single 64-bit argument for convenience.


RETURN VALUE
       For the glibc wrapper, on success, clone4() returns the file descriptor
       (with CLONE_FD) or new process ID (without  CLONE_FD),  and  the  child
       process begins running at the specified function.

       For  the  raw syscall, on success, clone4() returns the file descriptor
       or new process ID to the calling process, and  returns  0  in  the  new
       child process.

       On failure, clone4() returns -1 and sets errno accordingly.


ERRORS
       clone4()  can  return any error from clone(2), as well as the following
       additional errors:

       EINVAL flags contained an unknown flag.

       EINVAL flags included CLONE_FD, but the kernel configuration  does  not
              have the CONFIG_CLONEFD option enabled.

       EMFILE flags  included  CLONE_FD,  but  the  new  file descriptor would
              exceed the process limit on open file descriptors.

       ENFILE flags included CLONE_FD,  but  the  new  file  descriptor  would
              exceed the system-wide limit on open file descriptors.

       ENODEV flags  included  CLONE_FD,  but  clone4()  could  not  mount the
              (internal) anonymous inode device.


CONFORMING TO
       clone4() is Linux-specific and should not be used in programs  intended
       to be portable.


SEE ALSO
       clone(2), epoll(7), poll(2), pthreads(7), read(2), select(2)



Linux                             2015-03-01                         CLONE4(2)


Josh Triplett and Thiago Macieira (6):
  clone: Support passing tls argument via C rather than pt_regs magic
  x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
  Introduce a new clone4 syscall with more flag bits and extensible arguments
  signal: Factor out a helper function to process task_struct exit_code
  fs: Make alloc_fd non-private
  clone4: Introduce new CLONE_FD flag to get task exit notification via fd

 arch/Kconfig                     |   7 ++
 arch/x86/Kconfig                 |   1 +
 arch/x86/ia32/ia32entry.S        |   3 +-
 arch/x86/kernel/entry_64.S       |   1 +
 arch/x86/kernel/process_32.c     |   6 +-
 arch/x86/kernel/process_64.c     |   8 +--
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   2 +
 fs/file.c                        |   2 +-
 include/linux/compat.h           |  12 ++++
 include/linux/file.h             |   1 +
 include/linux/sched.h            |  20 ++++++
 include/linux/syscalls.h         |   6 +-
 include/uapi/linux/sched.h       |  54 ++++++++++++++-
 init/Kconfig                     |  21 ++++++
 kernel/Makefile                  |   1 +
 kernel/clonefd.c                 | 123 +++++++++++++++++++++++++++++++++
 kernel/clonefd.h                 |  27 ++++++++
 kernel/exit.c                    |  10 ++-
 kernel/fork.c                    | 143 ++++++++++++++++++++++++++++++++-------
 kernel/signal.c                  |  24 ++++---
 kernel/sys_ni.c                  |   1 +
 22 files changed, 425 insertions(+), 49 deletions(-)
 create mode 100644 kernel/clonefd.c
 create mode 100644 kernel/clonefd.h

-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v3  06/15] drivers: reset: Add STM32 reset driver
From: Chanwoo Choi @ 2015-03-13  0:11 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle,
	Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo
In-Reply-To: <1426197361-19290-7-git-send-email-maxime.coquelin@st.com>

Hi Maxime,

On 03/13/2015 06:55 AM, Maxime Coquelin wrote:
> From: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> 
> The STM32 MCUs family IP can be reset by accessing some shared registers.
> 
> The specificity is that some reset lines are used by the timers.
> At timer initialization time, the timer has to be reset, that's why
> we cannot use a regular driver.
> 
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> ---
>  drivers/reset/Makefile      |   1 +
>  drivers/reset/reset-stm32.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 126 insertions(+)
>  create mode 100644 drivers/reset/reset-stm32.c
> 
> diff --git a/drivers/reset/Makefile b/drivers/reset/Makefile
> index 157d421..aed12d1 100644
> --- a/drivers/reset/Makefile
> +++ b/drivers/reset/Makefile
> @@ -1,5 +1,6 @@
>  obj-$(CONFIG_RESET_CONTROLLER) += core.o
>  obj-$(CONFIG_ARCH_SOCFPGA) += reset-socfpga.o
>  obj-$(CONFIG_ARCH_BERLIN) += reset-berlin.o
> +obj-$(CONFIG_ARCH_STM32) += reset-stm32.o
>  obj-$(CONFIG_ARCH_SUNXI) += reset-sunxi.o
>  obj-$(CONFIG_ARCH_STI) += sti/
> diff --git a/drivers/reset/reset-stm32.c b/drivers/reset/reset-stm32.c
> new file mode 100644
> index 0000000..0d389b1
> --- /dev/null
> +++ b/drivers/reset/reset-stm32.c
> @@ -0,0 +1,125 @@
> +/*
> + * Copyright (C) Maxime Coquelin 2015
> + * Author:  Maxime Coquelin <mcoquelin.stm32@gmail.com>
> + * License terms:  GNU General Public License (GPL), version 2
> + *
> + * Heavily based on sunxi driver from Maxime Ripard.
> + */
> +
> +#include <linux/err.h>
> +#include <linux/io.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_address.h>
> +#include <linux/platform_device.h>
> +#include <linux/reset-controller.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +
> +struct stm32_reset_data {
> +	spinlock_t			lock;
> +	void __iomem			*membase;
> +	struct reset_controller_dev	rcdev;
> +};
> +
> +static int stm32_reset_assert(struct reset_controller_dev *rcdev,
> +			      unsigned long id)
> +{
> +	struct stm32_reset_data *data = container_of(rcdev,
> +						     struct stm32_reset_data,
> +						     rcdev);
> +	int bank = id / BITS_PER_LONG;
> +	int offset = id % BITS_PER_LONG;
> +	unsigned long flags;
> +	u32 reg;
> +
> +	spin_lock_irqsave(&data->lock, flags);
> +
> +	reg = readl_relaxed(data->membase + (bank * 4));
> +	writel_relaxed(reg | BIT(offset), data->membase + (bank * 4));
> +
> +	spin_unlock_irqrestore(&data->lock, flags);
> +
> +	return 0;
> +}
> +
> +static int stm32_reset_deassert(struct reset_controller_dev *rcdev,
> +				unsigned long id)
> +{
> +	struct stm32_reset_data *data = container_of(rcdev,
> +						     struct stm32_reset_data,
> +						     rcdev);
> +	int bank = id / BITS_PER_LONG;
> +	int offset = id % BITS_PER_LONG;
> +	unsigned long flags;
> +	u32 reg;
> +
> +	spin_lock_irqsave(&data->lock, flags);
> +
> +	reg = readl_relaxed(data->membase + (bank * 4));
> +	writel_relaxed(reg & ~BIT(offset), data->membase + (bank * 4));
> +
> +	spin_unlock_irqrestore(&data->lock, flags);
> +
> +	return 0;
> +}
> +
> +static struct reset_control_ops stm32_reset_ops = {
> +	.assert		= stm32_reset_assert,
> +	.deassert	= stm32_reset_deassert,
> +};
> +
> +static const struct of_device_id stm32_reset_dt_ids[] = {
> +	 { .compatible = "st,stm32-rcc", },
> +	 { /* sentinel */ },
> +};
> +MODULE_DEVICE_TABLE(of, stm32_reset_dt_ids);
> +
> +static int stm32_reset_probe(struct platform_device *pdev)
> +{
> +	struct stm32_reset_data *data;
> +	struct resource *res;
> +
> +	data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> +	data->membase = devm_ioremap_resource(&pdev->dev, res);
> +	if (IS_ERR(data->membase))
> +		return PTR_ERR(data->membase);
> +
> +	spin_lock_init(&data->lock);
> +
> +	data->rcdev.owner = THIS_MODULE;
> +	data->rcdev.nr_resets = resource_size(res) * 8;
> +	data->rcdev.ops = &stm32_reset_ops;
> +	data->rcdev.of_node = pdev->dev.of_node;
> +
> +	return reset_controller_register(&data->rcdev);
> +}
> +
> +static int stm32_reset_remove(struct platform_device *pdev)
> +{
> +	struct stm32_reset_data *data = platform_get_drvdata(pdev);
> +
> +	reset_controller_unregister(&data->rcdev);
> +
> +	return 0;
> +}
> +
> +static struct platform_driver stm32_reset_driver = {
> +	.probe	= stm32_reset_probe,
> +	.remove	= stm32_reset_remove,
> +	.driver = {
> +		.name		= "stm32-rcc-reset",
> +		.of_match_table	= stm32_reset_dt_ids,
> +	},
> +};
> +module_platform_driver(stm32_reset_driver);
> +
> +MODULE_AUTHOR("Maxime Coquelin <maxime.coquelin@gmail.com>");
> +MODULE_DESCRIPTION("STM32 MCUs Reset Controller Driver");
> +MODULE_LICENSE("GPL");
> +
> 

The last blank line is unnecessary. When I applied this patch for testing,
a "new blank line at EOF" warning appeared.

Thanks,
Chanwoo Choi




* Re: [PATCH v3  05/15] dt-bindings: Document the STM32 reset bindings
From: Chanwoo Choi @ 2015-03-13  0:09 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle,
	Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo
In-Reply-To: <1426197361-19290-6-git-send-email-maxime.coquelin@st.com>

Hi Maxime,

On 03/13/2015 06:55 AM, Maxime Coquelin wrote:
> From: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> 
> This adds documentation of device tree bindings for the
> STM32 reset controller.
> 
> Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> ---
>  .../devicetree/bindings/reset/st,stm32-rcc.txt     | 102 +++++++++++++++++++++
>  1 file changed, 102 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
> 
> diff --git a/Documentation/devicetree/bindings/reset/st,stm32-rcc.txt b/Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
> new file mode 100644
> index 0000000..962f961
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
> @@ -0,0 +1,102 @@
> +STMicroelectronics STM32 Peripheral Reset Controller
> +====================================================
> +
> +The RCC IP is both a reset and a clock controller. This documentation only
> +documents the reset part.
> +
> +Please also refer to reset.txt in this directory for common reset
> +controller binding usage.
> +
> +Required properties:
> +- compatible: Should be "st,stm32-rcc"
> +- reg: should be register base and length as documented in the
> +  datasheet
> +- #reset-cells: 1, see below
> +
> +example:
> +
> +rcc: reset@40023800 {
> +	#reset-cells = <1>;
> +	compatible = "st,stm32-rcc";
> +	reg = <0x40023800 0x400>;
> +};
> +
> +Specifying softreset control of devices
> +=======================================
> +
> +Device nodes should specify the reset channel required in their "resets"
> +property, containing a phandle to the reset device node and an index specifying
> +which channel to use.
> +
> +example:
> +
> +	timer2 {
> +		resets			= <&rcc 256>;
> +	};
> +
> +List of indexes for STM32F429:
> + - gpioa: 128
> + - gpiob: 129
> + - gpioc: 130
> + - gpiod: 131
> + - gpioe: 132
> + - gpiof: 133
> + - gpiog: 134
> + - gpioh: 135
> + - gpioi: 136
> + - gpioj: 137
> + - gpiok: 138
> + - crc: 140
> + - dma1: 149
> + - dma2: 150
> + - dma2d: 151
> + - ethmac: 153
> + - otghs: 157
> + - dcmi: 160
> + - cryp: 164
> + - hash: 165
> + - rng: 166
> + - otgfs: 167
> + - fmc: 192
> + - tim2: 256
> + - tim3: 257
> + - tim4: 258
> + - tim5: 259
> + - tim6: 260
> + - tim7: 261
> + - tim12: 262
> + - tim13: 263
> + - tim14: 264
> + - wwdg: 267
> + - spi2: 270
> + - spi3: 271
> + - uart2: 273
> + - uart3: 274
> + - uart4: 275
> + - uart5: 276
> + - i2c1: 277
> + - i2c2: 278
> + - i2c3: 279
> + - can1: 281
> + - can2: 282
> + - pwr: 284
> + - dac: 285
> + - uart7: 286
> + - uart8: 287
> + - tim1: 288
> + - tim8: 289
> + - usart1: 292
> + - usart6: 293
> + - adc: 296
> + - sdio: 299
> + - spi1: 300
> + - spi4: 301
> + - syscfg: 302
> + - tim9: 304
> + - tim10: 305
> + - tim11: 306
> + - spi5: 308
> + - spi6: 309
> + - sai1: 310
> + - ltdc: 31
> +
> 

The last line is unnecessary. When I applied this patch for testing,
a "new blank line at EOF" warning appeared.

+++ b/Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
@@ -99,4 +99,3 @@ List of indexes for STM32F429:
  - spi6: 309
  - sai1: 310
  - ltdc: 31
-

Thanks,
Chanwoo Choi
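
For reference, the indexes in the binding map onto registers through the
bank/offset arithmetic used by the reset driver in patch 06/15.  The sketch
below reproduces that arithmetic in plain C.  It assumes BITS_PER_LONG is 32
(as on the 32-bit STM32), and stm32_reset_locate() and struct reset_loc are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: BITS_PER_LONG is 32 on the STM32 (32-bit ARM). */
#define RESET_BITS_PER_BANK 32UL

struct reset_loc {
	unsigned long reg_offset;	/* byte offset from the RCC base */
	unsigned int bit;		/* bit inside that 32-bit register */
};

/* Mirrors the bank/offset computation in stm32_reset_assert(). */
static void stm32_reset_locate(unsigned long id, struct reset_loc *loc)
{
	assert(loc != NULL);
	loc->reg_offset = (id / RESET_BITS_PER_BANK) * 4;
	loc->bit = (unsigned int)(id % RESET_BITS_PER_BANK);
}
```

Under that assumption, tim2 (index 256) lands on offset 0x20, bit 0, and
sai1 (index 310) on offset 0x24, bit 22.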


* Re: [PATCH v3  00/15] Add support to STMicroelectronics STM32 family
From: Chanwoo Choi @ 2015-03-12 23:45 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle,
	Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo
In-Reply-To: <1426197361-19290-1-git-send-email-maxime.coquelin@st.com>

Dear Maxime,

I'm working on an STM32 SoC, so I'm very interested in this patch set.
If possible, please add me to the Cc list for the next version.

Best Regards,
Chanwoo Choi


On 03/13/2015 06:55 AM, Maxime Coquelin wrote:
> From: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> 
> This third round tries to address most of the comments made on previous series.
> 
> It contains a few fewer patches, as the reset_controller_of_init() patch has
> been removed, now that the bootloader handles the reset of the timers.
> 
> The pinctrl driver has also been removed after Linus' review.
> It will be reworked to use the generic pinconf bindings, and may contain
> changes for other machines (Mediatek), to add support for pinmux property
> handling directly in pinconf-generic.
> 
> STM32 MCUs are based on Cortex-M CPUs and are used in various applications
> (consumer electronics, industrial applications, hobbyists...).
> Datasheets, user and programming manuals are publicly available on
> STMicroelectronics website.
> 
> With this series applied, the STM32F429 Discovery can boot successfully.
> 
> 
> Changes since v2:
> -----------------
>  - Remove pinctrl driver from the series. 
>  - Remove reset_controller_of_init(), and reset the timers in the bootloader
>  - Add HW flow control property for serial driver
>  - Lots of changes in the DTS file, as per Andreas recommendations
>  - Some Kconfig clean-ups
>  - Adapt the config to be compatible with Andreas' bootwrapper, except UART port.
>  - Various fixes in documentation
> 
> Changes since v1:
> -----------------
>  - Move bindings documentation in their own patches (Andreas)
>  - Rename ARM System timer to armv7m-systick (Rob)
>  - Add clock-frequency property handling in armv7m-systick (Rob)
>  - Re-factor the reset controllers into a single controller (Philipp)
>  - Add kerneldoc to reset_controller_of_init (Philipp)
>  - Add named constants in include/dt-bindings/reset/ (Philipp)
>  - Make pinctrl driver to depend on ARCH_STM32 or COMPILE_TEST (Geert)
>  - Introduce CPUV7M_NUM_IRQ config flag to indicate the number of interrupts
> supported by the MCU, in order to limit memory waste in vectors' table (Uwe)
> 
> 
> Maxime Coquelin (15):
>   scripts: link-vmlinux: Don't pass page offset to kallsyms if XIP
>     Kernel
>   ARM: ARMv7-M: Enlarge vector table up to 256 entries
>   dt-bindings: Document the ARM System timer bindings
>   clocksource: Add ARM System timer driver
>   dt-bindings: Document the STM32 reset bindings
>   drivers: reset: Add STM32 reset driver
>   dt-bindings: Document the STM32 timer bindings
>   clockevent: Add STM32 Timer driver
>   dt-bindings: Document the STM32 USART bindings
>   serial: stm32-usart: Add STM32 USART Driver
>   ARM: Add STM32 family machine
>   ARM: dts: Add ARM System timer as clockevent in armv7m
>   ARM: dts: Introduce STM32F429 MCU
>   ARM: configs: Add STM32 defconfig
>   MAINTAINERS: Add entry for STM32 MCUs
> 
>  Documentation/arm/stm32/overview.txt               |  32 +
>  Documentation/arm/stm32/stm32f429-overview.txt     |  22 +
>  .../devicetree/bindings/arm/armv7m_systick.txt     |  26 +
>  .../devicetree/bindings/reset/st,stm32-rcc.txt     | 102 +++
>  .../devicetree/bindings/serial/st,stm32-usart.txt  |  32 +
>  .../devicetree/bindings/timer/st,stm32-timer.txt   |  22 +
>  MAINTAINERS                                        |   8 +
>  arch/arm/Kconfig                                   |  18 +
>  arch/arm/Makefile                                  |   1 +
>  arch/arm/boot/dts/Makefile                         |   1 +
>  arch/arm/boot/dts/armv7-m.dtsi                     |   6 +
>  arch/arm/boot/dts/stm32f429-disco.dts              |  71 +++
>  arch/arm/boot/dts/stm32f429.dtsi                   | 226 +++++++
>  arch/arm/configs/stm32_defconfig                   |  71 +++
>  arch/arm/kernel/entry-v7m.S                        |  13 +-
>  arch/arm/mach-stm32/Makefile                       |   1 +
>  arch/arm/mach-stm32/Makefile.boot                  |   3 +
>  arch/arm/mach-stm32/board-dt.c                     |  19 +
>  arch/arm/mm/Kconfig                                |  15 +
>  drivers/clocksource/Kconfig                        |  15 +
>  drivers/clocksource/Makefile                       |   2 +
>  drivers/clocksource/armv7m_systick.c               |  78 +++
>  drivers/clocksource/timer-stm32.c                  | 184 ++++++
>  drivers/reset/Makefile                             |   1 +
>  drivers/reset/reset-stm32.c                        | 125 ++++
>  drivers/tty/serial/Kconfig                         |  17 +
>  drivers/tty/serial/Makefile                        |   1 +
>  drivers/tty/serial/stm32-usart.c                   | 695 +++++++++++++++++++++
>  include/uapi/linux/serial_core.h                   |   3 +
>  scripts/link-vmlinux.sh                            |   2 +-
>  30 files changed, 1807 insertions(+), 5 deletions(-)
>  create mode 100644 Documentation/arm/stm32/overview.txt
>  create mode 100644 Documentation/arm/stm32/stm32f429-overview.txt
>  create mode 100644 Documentation/devicetree/bindings/arm/armv7m_systick.txt
>  create mode 100644 Documentation/devicetree/bindings/reset/st,stm32-rcc.txt
>  create mode 100644 Documentation/devicetree/bindings/serial/st,stm32-usart.txt
>  create mode 100644 Documentation/devicetree/bindings/timer/st,stm32-timer.txt
>  create mode 100644 arch/arm/boot/dts/stm32f429-disco.dts
>  create mode 100644 arch/arm/boot/dts/stm32f429.dtsi
>  create mode 100644 arch/arm/configs/stm32_defconfig
>  create mode 100644 arch/arm/mach-stm32/Makefile
>  create mode 100644 arch/arm/mach-stm32/Makefile.boot
>  create mode 100644 arch/arm/mach-stm32/board-dt.c
>  create mode 100644 drivers/clocksource/armv7m_systick.c
>  create mode 100644 drivers/clocksource/timer-stm32.c
>  create mode 100644 drivers/reset/reset-stm32.c
>  create mode 100644 drivers/tty/serial/stm32-usart.c
> 



* Re: [RFC] capabilities: Ambient capabilities
From: Andrew Lutomirski @ 2015-03-12 22:27 UTC (permalink / raw)
  To: Andrew G. Morgan
  Cc: Kees Cook, Christoph Lameter, Serge Hallyn, Andy Lutomirski,
	Jonathan Corbet, Aaron Jones, Ted Ts'o, linux-security-module,
	LKML, Linux API, Andrew Morton, Mimi Zohar, Austin S Hemmelgarn,
	Markku Savela, Jarkko Sakkinen, Michael Kerrisk
In-Reply-To: <CALQRfL7b8CjYgUnVy3jykNwv48fOc03T385RKo--cfv25YenBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, Mar 12, 2015 at 3:10 PM, Andrew G. Morgan <morgan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> I'm unclear why you refer to the inheritable set in this test:
>
> +               } else {
> +                       if (arg2 == PR_CAP_AMBIENT_RAISE &&
> +                           (!cap_raised(current_cred()->cap_permitted, arg3) ||
> +                            !cap_raised(current_cred()->cap_inheritable,
> +                                        arg3)))
> +                               return -EPERM;

It's to preserve the invariant that pA is always a subset of pI.

>
> I'm also unclear how you can turn off this new 'feature' for a process
> tree? As it is, the code creates an exploit path for a capable (pP !=
> 0) program with an exploitable flaw to create a privilege escalation
> for an arbitrary child program.

Huh?  If you exploit the parent, you already win.  Yes, if a kiddie
injects shellcode that does system("/bin/bash") into some pP != 0
program, they don't actually elevate their privileges.  On the other
hand, by the time an attacker injected shellcode for:

prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_SYS_ADMIN);
system("/bin/bash");

into a target, they can already do whatever they want.

> While I understand that everyone
> 'knows what they are doing' in implementing this change, I'm convinced
> that folk that are up to no good also do... Why not provide a lockable
> secure bit to selectively disable this support?

Show me a legitimate use case and I'll gladly implement a secure bit.
In the meantime, I don't even believe that there's a legitimate use
for any of the other secure bits (except keepcaps, and I don't know
why that's a securebit in the first place).

Also, see CVE-2014-3215 for an example of why securebits
are probably more trouble than they're worth.

--Andy


* Re: [RFC] capabilities: Ambient capabilities
From: Andrew G. Morgan @ 2015-03-12 22:10 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Christoph Lameter, Serge Hallyn, Andy Lutomirski,
	Jonathan Corbet, Aaron Jones, Ted Ts'o, linux-security-module,
	LKML, Linux API, Andrew Morton, Mimi Zohar, Austin S Hemmelgarn,
	Markku Savela, Jarkko Sakkinen, Michael Kerrisk
In-Reply-To: <CAGXu5jKfvswdUMcrGbr6+wn1ME_JkySMvdNd0wvvHpbQ1C7HUg@mail.gmail.com>

I'm unclear why you refer to the inheritable set in this test:

+               } else {
+                       if (arg2 == PR_CAP_AMBIENT_RAISE &&
+                           (!cap_raised(current_cred()->cap_permitted, arg3) ||
+                            !cap_raised(current_cred()->cap_inheritable,
+                                        arg3)))
+                               return -EPERM;

I'm also unclear how you can turn off this new 'feature' for a process
tree? As it is, the code creates an exploit path for a capable (pP !=
0) program with an exploitable flaw to create a privilege escalation
for an arbitrary child program. While I understand that everyone
'knows what they are doing' in implementing this change, I'm convinced
that folk that are up to no good also do... Why not provide a lockable
secure bit to selectively disable this support?

Nacked-by: Andrew G. Morgan <morgan@kernel.org>

Cheers

Andrew


On Thu, Mar 12, 2015 at 2:49 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Mar 12, 2015 at 11:08 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> Credit where credit is due: this idea comes from Christoph Lameter
>> with a lot of valuable input from Serge Hallyn.  This patch is
>> heavily based on Christoph's patch.
>>
>> ===== The status quo =====
>>
>> On Linux, there are a number of capabilities defined by the kernel.
>> To perform various privileged tasks, processes can wield
>> capabilities that they hold.
>>
>> Each task has four capability masks: effective (pE), permitted (pP),
>> inheritable (pI), and a bounding set (X).  When the kernel checks
>> for a capability, it checks pE.  The other capability masks serve to
>> modify what capabilities can be in pE.
>>
>> Any task can remove capabilities from pE, pP, or pI at any time.  If
>> a task has a capability in pP, it can add that capability to pE
>> and/or pI.  If a task has CAP_SETPCAP, then it can add any
>> capability to pI, and it can remove capabilities from X.
>>
>> Tasks are not the only things that can have capabilities; files can
>> also have capabilities.  A file can have no capability information at
>> all [1].  If a file has capability information, then it has a
>> permitted mask (fP) and an inheritable mask (fI) as well as a single
>> effective bit (fE) [2].  File capabilities modify the capabilities
>> of tasks that execve(2) them.
>>
>> A task that successfully calls execve has its capabilities modified
>> for the file ultimately being executed (i.e. the binary itself if
>> that binary is ELF or for the interpreter if the binary is a
>> script.) [3] In the capability evolution rules, for each mask Z, pZ
>> represents the old value and pZ' represents the new value.  The
>> rules are:
>>
>>   pP' = (X & fP) | (pI & fI)
>>   pI' = pI
>>   pE' = (fE ? pP' : 0)
>>   X is unchanged
>>
>> For setuid binaries, fP, fI, and fE are modified by a moderately
>> complicated set of rules that emulate POSIX behavior.  Similarly, if
>> euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
>> (primarily, fP and fI usually end up being the full set).  For nonroot
>> users executing binaries with neither setuid nor file caps, fI and
>> fP are empty and fE is false.
>>
>> As an extra complication, if you execute a process as nonroot and fE
>> is set, then the "secure exec" rules are in effect: AT_SECURE gets
>> set, LD_PRELOAD doesn't work, etc.
>>
>> This is rather messy.  We've learned that making any changes is
>> dangerous, though: if a new kernel version allows an unprivileged
>> program to change its security state in a way that persists across
>> execution of a setuid program or a program with file caps, this
>> persistent state is surprisingly likely to allow setuid or
>> file-capped programs to be exploited for privilege escalation.
>>
>> ===== The problem =====
>>
>> Capability inheritance is basically useless.
>>
>> If you aren't root and you execute an ordinary binary, fI is zero,
>> so your capabilities have no effect whatsoever on pP'.  This means
>> that you can't usefully execute a helper process or a shell command
>> with elevated capabilities if you aren't root.
>>
>> On current kernels, you can sort of work around this by setting fI
>> to the full set for most or all non-setuid executable files.  This
>> causes pP' = pI for nonroot, and inheritance works.  No one does
>> this because it's a PITA and it isn't even supported on most
>> filesystems.
>>
>> If you try this, you'll discover that every nonroot program ends up
>> with secure exec rules, breaking many things.
>>
>> This is a problem that has bitten many people who have tried to use
>> capabilities for anything useful.
>>
>> ===== The proposed change =====
>>
>> This patch adds a fifth capability mask called the ambient mask
>> (pA).  pA does what pI should have done.
>>
>> pA obeys the invariant that no bit can ever be set in pA if it is
>> not set in both pP and pI.  Dropping a bit from pP or pI drops that
>> bit from pA.  This ensures that existing programs that try to drop
>> capabilities still do so, with a complication.  Because capability
>> inheritance is so broken, setting KEEPCAPS, using setresuid to
>> switch to nonroot uids, and calling execve effectively drops
>> capabilities.  Therefore, setresuid from root to nonroot
>> unconditionally clears pA.  Processes that don't like this can
>> re-add bits to pA afterwards.
>>
>> The capability evolution rules are changed:
>>
>>   pA' = (file caps or setuid or setgid ? 0 : pA)
>>   pP' = (X & fP) | (pI & fI) | pA'
>>   pI' = pI
>>   pE' = (fE ? pP' : pA')
>>   X is unchanged
>>
>> If you are nonroot but you have a capability, you can add it to pA.
>> If you do so, your children get that capability in pA, pP, and pE.
>> For example, you can set pA = CAP_NET_BIND_SERVICE, and your
>> children can automatically bind low-numbered ports.  Hallelujah!
>>
>> Unprivileged users can create user namespaces, map themselves to a
>> nonzero uid, and create both privileged (relative to their
>> namespace) and unprivileged process trees.  This is currently more
>> or less impossible.  Hallelujah!
>>
>> You cannot use pA to try to subvert a setuid, setgid, or file-capped
>> program: if you execute any such program, pA gets cleared and the
>> resulting evolution rules are unchanged by this patch.
>>
>> Users with nonzero pA are unlikely to unintentionally leak that
>> capability.  If they run programs that try to drop privileges,
>> dropping privileges will still work.
>>
>> It's worth noting that the degree of paranoia in this patch could
>> possibly be relaxed without causing serious problems.  Specifically,
>> if we allowed pA to persist across executing non-pA-aware setuid
>> binaries and across setresuid, then, naively, the only capabilities
>> that could leak as a result would be the capabilities in pA, and any
>> attacker *already* has those capabilities.  This would make me
>> nervous, though -- setuid binaries that tried to privilege-separate
>> might fail to do so, and putting CAP_DAC_READ_SEARCH or
>> CAP_DAC_OVERRIDE into pA could have unexpected side effects.
>> (Whether these unexpected side effects would be exploitable is an
>> open question.)  I've therefore taken the more paranoid route.
>>
>> An alternative would be to require PR_SET_NO_NEW_PRIVS before
>> setting ambient capabilities.  I think that this would be annoying
>> and would make granting otherwise unprivileged users minor ambient
>> capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much
>> less useful than it is with this patch.
>>
>> ===== Footnotes =====
>>
>> [1] Files that are missing the "security.capability" xattr or that
>> have unrecognized values for that xattr end up with has_cap ==
>> false.  The code that does that appears to be complicated for no
>> good reason.
>>
>> [2] The libcap capability mask parsers and formatters are
>> dangerously misleading and the documentation is flat-out wrong.  fE
>> is *not* a mask; it's a single bit.  This has probably confused
>> every single person who has tried to use file capabilities.
>>
>> [3] Linux very confusingly processes the script and the interpreter if
>> applicable, for reasons that escape me.  The results from thinking
>> about a script's file capabilities and/or setuid bits are mostly discarded.
>>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Serge Hallyn <serge.hallyn@canonical.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Aaron Jones <aaronmdjones@gmail.com>
>> CC: Ted Ts'o <tytso@mit.edu>
>> Cc: linux-security-module@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-api@vger.kernel.org
>> Cc: akpm@linuxfoundation.org
>> Cc: Andrew G. Morgan <morgan@kernel.org>
>> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
>> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
>> Cc: Markku Savela <msa@moth.iki.fi>
>> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> This would be quite welcome for things we're doing in Chrome OS.
> Presently, we're able to use fscaps to keep non-root caps across exec
> and haven't encountered issues with AT_SECURE (yet), but using pA
> would be much nicer and exactly matches how we want to use it: a
> launcher is creating a tree of processes that are non-root but need
> some capabilities. Right now the tree is very small and we're able to
> sprinkle our fscaps lightly. :) This would be better.
>
> -Kees
>
>> ---
>>
>> Preliminary userspace code is here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=860c73ac1acaaae976bdd3bb83b89b0180f0702a
>>
>> fs/proc/array.c              |  5 ++-
>>  include/linux/cred.h         | 15 +++++++++
>>  include/uapi/linux/prctl.h   |  6 ++++
>>  kernel/user_namespace.c      |  1 +
>>  security/commoncap.c         | 75 ++++++++++++++++++++++++++++++++++++++------
>>  security/keys/process_keys.c |  1 +
>>  6 files changed, 92 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index 1295a00ca316..bc15356d6551 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -282,7 +282,8 @@ static void render_cap_t(struct seq_file *m, const char *header,
>>  static inline void task_cap(struct seq_file *m, struct task_struct *p)
>>  {
>>         const struct cred *cred;
>> -       kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset;
>> +       kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
>> +                       cap_bset, cap_ambient;
>>
>>         rcu_read_lock();
>>         cred = __task_cred(p);
>> @@ -290,12 +291,14 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>>         cap_permitted   = cred->cap_permitted;
>>         cap_effective   = cred->cap_effective;
>>         cap_bset        = cred->cap_bset;
>> +       cap_ambient     = cred->cap_ambient;
>>         rcu_read_unlock();
>>
>>         render_cap_t(m, "CapInh:\t", &cap_inheritable);
>>         render_cap_t(m, "CapPrm:\t", &cap_permitted);
>>         render_cap_t(m, "CapEff:\t", &cap_effective);
>>         render_cap_t(m, "CapBnd:\t", &cap_bset);
>> +       render_cap_t(m, "CapAmb:\t", &cap_ambient);
>>  }
>>
>>  static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
>> diff --git a/include/linux/cred.h b/include/linux/cred.h
>> index 2fb2ca2127ed..a21bcba6ef84 100644
>> --- a/include/linux/cred.h
>> +++ b/include/linux/cred.h
>> @@ -122,6 +122,7 @@ struct cred {
>>         kernel_cap_t    cap_permitted;  /* caps we're permitted */
>>         kernel_cap_t    cap_effective;  /* caps we can actually use */
>>         kernel_cap_t    cap_bset;       /* capability bounding set */
>> +       kernel_cap_t    cap_ambient;    /* Ambient capability set */
>>  #ifdef CONFIG_KEYS
>>         unsigned char   jit_keyring;    /* default keyring to attach requested
>>                                          * keys to */
>> @@ -197,6 +198,20 @@ static inline void validate_process_creds(void)
>>  }
>>  #endif
>>
>> +static inline void cap_enforce_ambient_invariants(struct cred *cred)
>> +{
>> +       cred->cap_ambient = cap_intersect(cred->cap_ambient,
>> +                                         cap_intersect(cred->cap_permitted,
>> +                                                       cred->cap_inheritable));
>> +}
>> +
>> +static inline bool cap_ambient_invariant_ok(const struct cred *cred)
>> +{
>> +       return cap_issubset(cred->cap_ambient,
>> +                           cap_intersect(cred->cap_permitted,
>> +                                         cred->cap_inheritable));
>> +}
>> +
>>  /**
>>   * get_new_cred - Get a reference on a new set of credentials
>>   * @cred: The new credentials to reference
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 31891d9535e2..65407f867e82 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -190,4 +190,10 @@ struct prctl_mm_map {
>>  # define PR_FP_MODE_FR         (1 << 0)        /* 64b FP registers */
>>  # define PR_FP_MODE_FRE                (1 << 1)        /* 32b compatibility */
>>
>> +/* Control the ambient capability set */
>> +#define PR_CAP_AMBIENT         47
>> +# define PR_CAP_AMBIENT_GET    1
>> +# define PR_CAP_AMBIENT_RAISE  2
>> +# define PR_CAP_AMBIENT_LOWER  3
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 4109f8320684..dab0f808235a 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -39,6 +39,7 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>>         cred->cap_inheritable = CAP_EMPTY_SET;
>>         cred->cap_permitted = CAP_FULL_SET;
>>         cred->cap_effective = CAP_FULL_SET;
>> +       cred->cap_ambient = CAP_EMPTY_SET;
>>         cred->cap_bset = CAP_FULL_SET;
>>  #ifdef CONFIG_KEYS
>>         key_put(cred->request_key_auth);
>> diff --git a/security/commoncap.c b/security/commoncap.c
>> index f66713bd7450..b3253886ecad 100644
>> --- a/security/commoncap.c
>> +++ b/security/commoncap.c
>> @@ -272,6 +272,7 @@ int cap_capset(struct cred *new,
>>         new->cap_effective   = *effective;
>>         new->cap_inheritable = *inheritable;
>>         new->cap_permitted   = *permitted;
>> +       cap_enforce_ambient_invariants(new);
>>         return 0;
>>  }
>>
>> @@ -352,6 +353,7 @@ static inline int bprm_caps_from_vfs_caps(struct cpu_vfs_cap_data *caps,
>>
>>                 /*
>>                  * pP' = (X & fP) | (pI & fI)
>> +                * The addition of pA' is handled later.
>>                  */
>>                 new->cap_permitted.cap[i] =
>>                         (new->cap_bset.cap[i] & permitted) |
>> @@ -479,10 +481,12 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
>>  {
>>         const struct cred *old = current_cred();
>>         struct cred *new = bprm->cred;
>> -       bool effective, has_cap = false;
>> +       bool effective, has_cap = false, is_setid;
>>         int ret;
>>         kuid_t root_uid;
>>
>> +       BUG_ON(!cap_ambient_invariant_ok(old));
>> +
>>         effective = false;
>>         ret = get_file_caps(bprm, &effective, &has_cap);
>>         if (ret < 0)
>> @@ -527,8 +531,9 @@ skip:
>>          *
>>          * In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
>>          */
>> -       if ((!uid_eq(new->euid, old->uid) ||
>> -            !gid_eq(new->egid, old->gid) ||
>> +       is_setid = !uid_eq(new->euid, old->uid) || !gid_eq(new->egid, old->gid);
>> +
>> +       if ((is_setid ||
>>              !cap_issubset(new->cap_permitted, old->cap_permitted)) &&
>>             bprm->unsafe & ~LSM_UNSAFE_PTRACE_CAP) {
>>                 /* downgrade; they get no more than they had, and maybe less */
>> @@ -544,10 +549,23 @@ skip:
>>         new->suid = new->fsuid = new->euid;
>>         new->sgid = new->fsgid = new->egid;
>>
>> +       /* File caps or setid cancel ambient. */
>> +       if (has_cap || is_setid)
>> +               cap_clear(new->cap_ambient);
>> +
>> +       /*
>> +        * Now that we've computed pA', update pP' to give:
>> +        *   pP' = (X & fP) | (pI & fI) | pA'
>> +        */
>> +       new->cap_permitted = cap_combine(new->cap_permitted, new->cap_ambient);
>> +
>>         if (effective)
>>                 new->cap_effective = new->cap_permitted;
>>         else
>> -               cap_clear(new->cap_effective);
>> +               new->cap_effective = new->cap_ambient;
>> +
>> +       BUG_ON(!cap_ambient_invariant_ok(new));
>> +
>>         bprm->cap_effective = effective;
>>
>>         /*
>> @@ -562,7 +580,7 @@ skip:
>>          * Number 1 above might fail if you don't have a full bset, but I think
>>          * that is interesting information to audit.
>>          */
>> -       if (!cap_isclear(new->cap_effective)) {
>> +       if (!cap_issubset(new->cap_effective, new->cap_ambient)) {
>>                 if (!cap_issubset(CAP_FULL_SET, new->cap_effective) ||
>>                     !uid_eq(new->euid, root_uid) || !uid_eq(new->uid, root_uid) ||
>>                     issecure(SECURE_NOROOT)) {
>> @@ -573,6 +591,9 @@ skip:
>>         }
>>
>>         new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
>> +
>> +       BUG_ON(!cap_ambient_invariant_ok(new));
>> +
>>         return 0;
>>  }
>>
>> @@ -594,7 +615,7 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
>>         if (!uid_eq(cred->uid, root_uid)) {
>>                 if (bprm->cap_effective)
>>                         return 1;
>> -               if (!cap_isclear(cred->cap_permitted))
>> +               if (!cap_issubset(cred->cap_permitted, cred->cap_ambient))
>>                         return 1;
>>         }
>>
>> @@ -696,10 +717,18 @@ static inline void cap_emulate_setxuid(struct cred *new, const struct cred *old)
>>              uid_eq(old->suid, root_uid)) &&
>>             (!uid_eq(new->uid, root_uid) &&
>>              !uid_eq(new->euid, root_uid) &&
>> -            !uid_eq(new->suid, root_uid)) &&
>> -           !issecure(SECURE_KEEP_CAPS)) {
>> -               cap_clear(new->cap_permitted);
>> -               cap_clear(new->cap_effective);
>> +            !uid_eq(new->suid, root_uid))) {
>> +               if (!issecure(SECURE_KEEP_CAPS)) {
>> +                       cap_clear(new->cap_permitted);
>> +                       cap_clear(new->cap_effective);
>> +               }
>> +
>> +               /*
>> +                * Pre-ambient programs expect setresuid to nonroot followed
>> +                * by exec to drop capabilities.  We should make sure that
>> +                * this remains the case.
>> +                */
>> +               cap_clear(new->cap_ambient);
>>         }
>>         if (uid_eq(old->euid, root_uid) && !uid_eq(new->euid, root_uid))
>>                 cap_clear(new->cap_effective);
>> @@ -929,6 +958,32 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>>                         new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
>>                 return commit_creds(new);
>>
>> +       case PR_CAP_AMBIENT:
>> +               if (((!cap_valid(arg3)) | arg4 | arg5))
>> +                       return -EINVAL;
>> +
>> +               if (arg2 == PR_CAP_AMBIENT_GET) {
>> +                       return !!cap_raised(current_cred()->cap_ambient, arg3);
>> +               } else if (arg2 != PR_CAP_AMBIENT_RAISE &&
>> +                          arg2 != PR_CAP_AMBIENT_LOWER) {
>> +                       return -EINVAL;
>> +               } else {
>> +                       if (arg2 == PR_CAP_AMBIENT_RAISE &&
>> +                           (!cap_raised(current_cred()->cap_permitted, arg3) ||
>> +                            !cap_raised(current_cred()->cap_inheritable,
>> +                                        arg3)))
>> +                               return -EPERM;
>> +
>> +                       new = prepare_creds();
>> +                       if (!new)
>> +                               return -ENOMEM;
>> +                       if (arg2 == PR_CAP_AMBIENT_RAISE)
>> +                               cap_raise(new->cap_ambient, arg3);
>> +                       else
>> +                               cap_lower(new->cap_ambient, arg3);
>> +                       return commit_creds(new);
>> +               }
>> +
>>         default:
>>                 /* No functionality available - continue with default */
>>                 return -ENOSYS;
>> diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
>> index bd536cb221e2..43b4cddbf2b3 100644
>> --- a/security/keys/process_keys.c
>> +++ b/security/keys/process_keys.c
>> @@ -848,6 +848,7 @@ void key_change_session_keyring(struct callback_head *twork)
>>         new->cap_inheritable    = old->cap_inheritable;
>>         new->cap_permitted      = old->cap_permitted;
>>         new->cap_effective      = old->cap_effective;
>> +       new->cap_ambient        = old->cap_ambient;
>>         new->cap_bset           = old->cap_bset;
>>
>>         new->jit_keyring        = old->jit_keyring;
>> --
>> 2.3.0
>>
>
>
>
> --
> Kees Cook
> Chrome OS Security


* [PATCH v3  15/15] MAINTAINERS: Add entry for STM32 MCUs
From: Maxime Coquelin @ 2015-03-12 21:56 UTC (permalink / raw)
  To: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle
  Cc: Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo,
	Will Deacon, Nikolay Borisov, Rusty Russell, Kees Cook,
	Michal Marek, linux-doc, linux-arm-kernel, linux-kernel
In-Reply-To: <1426197361-19290-1-git-send-email-maxime.coquelin@st.com>

From: Maxime Coquelin <mcoquelin.stm32@gmail.com>

Add a MAINTAINERS entry covering all STM32 machine and driver files.

Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ddc5a8c..08c08c4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1467,6 +1467,14 @@ F:	drivers/usb/host/ehci-st.c
 F:	drivers/usb/host/ohci-st.c
 F:	drivers/ata/ahci_st.c
 
+ARM/STM32 ARCHITECTURE
+M:	Maxime Coquelin <mcoquelin.stm32@gmail.com>
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/mcoquelin/stm32.git
+N:	stm32
+F:	drivers/clocksource/armv7m_systick.c
+
 ARM/TECHNOLOGIC SYSTEMS TS7250 MACHINE SUPPORT
 M:	Lennert Buytenhek <kernel@wantstofly.org>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-- 
1.9.1



* [PATCH v3  14/15] ARM: configs: Add STM32 defconfig
From: Maxime Coquelin @ 2015-03-12 21:56 UTC (permalink / raw)
  To: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle
  Cc: Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo,
	Will Deacon, Nikolay Borisov, Rusty Russell, Kees Cook,
	Michal Marek, linux-doc, linux-arm-kernel, linux-kernel
In-Reply-To: <1426197361-19290-1-git-send-email-maxime.coquelin@st.com>

From: Maxime Coquelin <mcoquelin.stm32@gmail.com>

This patch adds a new config for STM32 MCUs.
STM32F429 Discovery board boots successfully with this config applied.

Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
---
 arch/arm/configs/stm32_defconfig | 71 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 arch/arm/configs/stm32_defconfig

diff --git a/arch/arm/configs/stm32_defconfig b/arch/arm/configs/stm32_defconfig
new file mode 100644
index 0000000..412a9f9
--- /dev/null
+++ b/arch/arm/configs/stm32_defconfig
@@ -0,0 +1,71 @@
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_LOG_BUF_SHIFT=16
+CONFIG_BLK_DEV_INITRD=y
+CONFIG_INITRAMFS_SOURCE="./rootfs.cpio"
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+# CONFIG_UID16 is not set
+# CONFIG_BASE_FULL is not set
+# CONFIG_FUTEX is not set
+# CONFIG_EPOLL is not set
+# CONFIG_SIGNALFD is not set
+# CONFIG_EVENTFD is not set
+# CONFIG_AIO is not set
+CONFIG_EMBEDDED=y
+# CONFIG_VM_EVENT_COUNTERS is not set
+# CONFIG_SLUB_DEBUG is not set
+# CONFIG_LBDAF is not set
+# CONFIG_BLK_DEV_BSG is not set
+# CONFIG_IOSCHED_DEADLINE is not set
+# CONFIG_IOSCHED_CFQ is not set
+# CONFIG_MMU is not set
+CONFIG_ARCH_STM32=y
+CONFIG_SET_MEM_PARAM=y
+CONFIG_DRAM_BASE=0x90000000
+CONFIG_FLASH_MEM_BASE=0x08000000
+CONFIG_FLASH_SIZE=0x00200000
+CONFIG_PREEMPT=y
+# CONFIG_ATAGS is not set
+CONFIG_ZBOOT_ROM_TEXT=0x0
+CONFIG_ZBOOT_ROM_BSS=0x0
+CONFIG_XIP_KERNEL=y
+CONFIG_XIP_PHYS_ADDR=0x08008000
+CONFIG_BINFMT_FLAT=y
+CONFIG_BINFMT_SHARED_FLAT=y
+# CONFIG_COREDUMP is not set
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+# CONFIG_FW_LOADER is not set
+# CONFIG_BLK_DEV is not set
+CONFIG_EEPROM_93CX6=y
+# CONFIG_INPUT is not set
+# CONFIG_SERIO is not set
+# CONFIG_VT is not set
+# CONFIG_UNIX98_PTYS is not set
+# CONFIG_LEGACY_PTYS is not set
+CONFIG_SERIAL_NONSTANDARD=y
+# CONFIG_DEVKMEM is not set
+CONFIG_SERIAL_STM32=y
+CONFIG_SERIAL_STM32_CONSOLE=y
+# CONFIG_HW_RANDOM is not set
+CONFIG_GPIO_SYSFS=y
+# CONFIG_HWMON is not set
+CONFIG_USB=y
+CONFIG_USB_DWC2=y
+CONFIG_NEW_LEDS=y
+CONFIG_LEDS_CLASS=y
+CONFIG_LEDS_GPIO=y
+CONFIG_LEDS_TRIGGERS=y
+CONFIG_LEDS_TRIGGER_HEARTBEAT=y
+# CONFIG_FILE_LOCKING is not set
+# CONFIG_DNOTIFY is not set
+# CONFIG_INOTIFY_USER is not set
+CONFIG_PRINTK_TIME=y
+CONFIG_DEBUG_INFO=y
+# CONFIG_ENABLE_WARN_DEPRECATED is not set
+# CONFIG_ENABLE_MUST_CHECK is not set
+CONFIG_MAGIC_SYSRQ=y
+# CONFIG_SCHED_DEBUG is not set
+# CONFIG_DEBUG_BUGVERBOSE is not set
+# CONFIG_FTRACE is not set
+CONFIG_CRC_ITU_T=y
+CONFIG_CRC7=y
-- 
1.9.1


* [PATCH v3  13/15] ARM: dts: Introduce STM32F429 MCU
From: Maxime Coquelin @ 2015-03-12 21:55 UTC (permalink / raw)
  To: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle
  Cc: Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo,
	Will Deacon, Nikolay Borisov, Rusty Russell, Kees Cook,
	Michal Marek, linux-doc, linux-arm-kernel, linux-kernel
In-Reply-To: <1426197361-19290-1-git-send-email-maxime.coquelin@st.com>

From: Maxime Coquelin <mcoquelin.stm32@gmail.com>

The STMicroelectronics STM32F429 MCU has the following main features:
 - Cortex-M4 core running at up to 180MHz
 - 2MB internal flash, 256KBytes internal RAM
 - FMC controller to connect SDRAM, NOR and NAND memories
 - SD/MMC/SDIO support
 - Ethernet controller
 - USB OTG FS & HS controllers
 - I2C, SPI, CAN buses support
 - Several 16-bit and 32-bit general purpose timers
 - Serial Audio interface
 - LCD controller

Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
---
 arch/arm/boot/dts/Makefile            |   1 +
 arch/arm/boot/dts/stm32f429-disco.dts |  71 +++++++++++
 arch/arm/boot/dts/stm32f429.dtsi      | 226 ++++++++++++++++++++++++++++++++++
 3 files changed, 298 insertions(+)
 create mode 100644 arch/arm/boot/dts/stm32f429-disco.dts
 create mode 100644 arch/arm/boot/dts/stm32f429.dtsi

diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile
index a1c776b..e5dbd03 100644
--- a/arch/arm/boot/dts/Makefile
+++ b/arch/arm/boot/dts/Makefile
@@ -509,6 +509,7 @@ dtb-$(CONFIG_ARCH_STI) += \
 	stih416-b2020.dtb \
 	stih416-b2020e.dtb \
 	stih418-b2199.dtb
+dtb-$(CONFIG_ARCH_STM32)+= stm32f429-disco.dtb
 dtb-$(CONFIG_MACH_SUN4I) += \
 	sun4i-a10-a1000.dtb \
 	sun4i-a10-ba10-tvbox.dtb \
diff --git a/arch/arm/boot/dts/stm32f429-disco.dts b/arch/arm/boot/dts/stm32f429-disco.dts
new file mode 100644
index 0000000..6b9aa59
--- /dev/null
+++ b/arch/arm/boot/dts/stm32f429-disco.dts
@@ -0,0 +1,71 @@
+/*
+ * Copyright 2015 - Maxime Coquelin <mcoquelin.stm32@gmail.com>
+ *
+ * This file is dual-licensed: you can use it either under the terms
+ * of the GPL or the X11 license, at your option. Note that this dual
+ * licensing only applies to this file, and not this project as a
+ * whole.
+ *
+ *  a) This file is free software; you can redistribute it and/or
+ *     modify it under the terms of the GNU General Public License as
+ *     published by the Free Software Foundation; either version 2 of the
+ *     License, or (at your option) any later version.
+ *
+ *     This file is distributed in the hope that it will be useful,
+ *     but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *     GNU General Public License for more details.
+ *
+ *     You should have received a copy of the GNU General Public
+ *     License along with this file; if not, write to the Free
+ *     Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
+ *     MA 02110-1301 USA
+ *
+ * Or, alternatively,
+ *
+ *  b) Permission is hereby granted, free of charge, to any person
+ *     obtaining a copy of this software and associated documentation
+ *     files (the "Software"), to deal in the Software without
+ *     restriction, including without limitation the rights to use,
+ *     copy, modify, merge, publish, distribute, sublicense, and/or
+ *     sell copies of the Software, and to permit persons to whom the
+ *     Software is furnished to do so, subject to the following
+ *     conditions:
+ *
+ *     The above copyright notice and this permission notice shall be
+ *     included in all copies or substantial portions of the Software.
+ *
+ *     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ *     EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+ *     OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ *     NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+ *     HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+ *     WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ *     FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ *     OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+/dts-v1/;
+#include "stm32f429.dtsi"
+
+/ {
+	model = "STMicroelectronics STM32F429i-DISCO board";
+	compatible = "st,stm32f429i-disco", "st,stm32f429";
+
+	chosen {
+		bootargs = "console=ttyS0,115200 root=/dev/ram rdinit=/linuxrc";
+		linux,stdout-path = &usart1;
+	};
+
+	memory {
+		reg = <0x90000000 0x800000>;
+	};
+
+	aliases {
+		serial0 = &usart1;
+	};
+};
+
+&usart1 {
+	status = "okay";
+};
diff --git a/arch/arm/boot/dts/stm32f429.dtsi b/arch/arm/boot/dts/stm32f429.dtsi
new file mode 100644
index 0000000..39ffdb8
--- /dev/null
+++ b/arch/arm/boot/dts/stm32f429.dtsi
@@ -0,0 +1,226 @@
+/*
+ * Copyright 2015 - Maxime Coquelin <mcoquelin.stm32@gmail.com>
+ *
+ * This file is dual-licensed: you can use it either under the terms
+ * of the GPL or the X11 license, at your option. Note that this dual
+ * licensing only applies to this file, and not this project as a
+ * whole.
+ *
+ *  a) This file is free software; you can redistribute it and/or
+ *     modify it under the terms of the GNU General Public License as
+ *     published by the Free Software Foundation; either version 2 of the
+ *     License, or (at your option) any later version.
+ *
+ *     This file is distributed in the hope that it will be useful,
+ *     but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *     GNU General Public License for more details.
+ *
+ *     You should have received a copy of the GNU General Public
+ *     License along with this file; if not, write to the Free
+ *     Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
+ *     MA 02110-1301 USA
+ *
+ * Or, alternatively,
+ *
+ *  b) Permission is hereby granted, free of charge, to any person
+ *     obtaining a copy of this software and associated documentation
+ *     files (the "Software"), to deal in the Software without
+ *     restriction, including without limitation the rights to use,
+ *     copy, modify, merge, publish, distribute, sublicense, and/or
+ *     sell copies of the Software, and to permit persons to whom the
+ *     Software is furnished to do so, subject to the following
+ *     conditions:
+ *
+ *     The above copyright notice and this permission notice shall be
+ *     included in all copies or substantial portions of the Software.
+ *
+ *     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ *     EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+ *     OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ *     NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+ *     HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+ *     WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ *     FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ *     OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#include "armv7-m.dtsi"
+
+/ {
+	clocks {
+		clk_sysclk: clk-sysclk {
+			#clock-cells = <0>;
+			compatible = "fixed-clock";
+			clock-frequency = <180000000>;
+		};
+
+		clk_hclk: clk-hclk {
+			#clock-cells = <0>;
+			compatible = "fixed-clock";
+			clock-frequency = <180000000>;
+		};
+
+		clk_pclk1: clk-pclk1 {
+			#clock-cells = <0>;
+			compatible = "fixed-clock";
+			clock-frequency = <45000000>;
+		};
+
+		clk_pclk2: clk-pclk2 {
+			#clock-cells = <0>;
+			compatible = "fixed-clock";
+			clock-frequency = <90000000>;
+		};
+
+		clk_pmtr1: clk-pmtr1 {
+			#clock-cells = <0>;
+			compatible = "fixed-clock";
+			clock-frequency = <90000000>;
+		};
+
+		clk_pmtr2: clk-pmtr2 {
+			#clock-cells = <0>;
+			compatible = "fixed-clock";
+			clock-frequency = <180000000>;
+		};
+
+		clk_systick: clk-systick {
+			compatible = "fixed-factor-clock";
+			clocks = <&clk_hclk>;
+			#clock-cells = <0>;
+			clock-div = <8>;
+			clock-mult = <1>;
+		};
+	};
+
+	soc {
+		timer2: timer@40000000 {
+			compatible = "st,stm32-timer";
+			reg = <0x40000000 0x400>;
+			interrupts = <28>;
+			resets = <&rcc 256>;
+			clocks = <&clk_pmtr1>;
+			status = "disabled";
+		};
+
+		timer3: timer@40000400 {
+			compatible = "st,stm32-timer";
+			reg = <0x40000400 0x400>;
+			interrupts = <29>;
+			resets = <&rcc 257>;
+			clocks = <&clk_pmtr1>;
+			status = "disabled";
+		};
+
+		timer4: timer@40000800 {
+			compatible = "st,stm32-timer";
+			reg = <0x40000800 0x400>;
+			interrupts = <30>;
+			resets = <&rcc 258>;
+			clocks = <&clk_pmtr1>;
+			status = "disabled";
+		};
+
+		timer5: timer@40000c00 {
+			compatible = "st,stm32-timer";
+			reg = <0x40000c00 0x400>;
+			interrupts = <50>;
+			resets = <&rcc 259>;
+			clocks = <&clk_pmtr1>;
+		};
+
+		timer6: timer@40001000 {
+			compatible = "st,stm32-timer";
+			reg = <0x40001000 0x400>;
+			interrupts = <54>;
+			resets = <&rcc 260>;
+			clocks = <&clk_pmtr1>;
+			status = "disabled";
+		};
+
+		timer7: timer@40001400 {
+			compatible = "st,stm32-timer";
+			reg = <0x40001400 0x400>;
+			interrupts = <55>;
+			resets = <&rcc 261>;
+			clocks = <&clk_pmtr1>;
+			status = "disabled";
+		};
+
+		usart2: serial@40004400 {
+			compatible = "st,stm32-usart", "st,stm32-uart";
+			reg = <0x40004400 0x400>;
+			interrupts = <38>;
+			clocks = <&clk_pclk1>;
+			status = "disabled";
+		};
+
+		usart3: serial@40004800 {
+			compatible = "st,stm32-usart", "st,stm32-uart";
+			reg = <0x40004800 0x400>;
+			interrupts = <39>;
+			clocks = <&clk_pclk1>;
+			status = "disabled";
+		};
+
+		usart4: serial@40004c00 {
+			compatible = "st,stm32-uart";
+			reg = <0x40004c00 0x400>;
+			interrupts = <52>;
+			clocks = <&clk_pclk1>;
+			status = "disabled";
+		};
+
+		usart5: serial@40005000 {
+			compatible = "st,stm32-uart";
+			reg = <0x40005000 0x400>;
+			interrupts = <53>;
+			clocks = <&clk_pclk1>;
+			status = "disabled";
+		};
+
+		usart7: serial@40007800 {
+			compatible = "st,stm32-usart", "st,stm32-uart";
+			reg = <0x40007800 0x400>;
+			interrupts = <82>;
+			clocks = <&clk_pclk1>;
+			status = "disabled";
+		};
+
+		usart8: serial@40007c00 {
+			compatible = "st,stm32-usart", "st,stm32-uart";
+			reg = <0x40007c00 0x400>;
+			interrupts = <83>;
+			clocks = <&clk_pclk1>;
+			status = "disabled";
+		};
+
+		usart1: serial@40011000 {
+			compatible = "st,stm32-usart", "st,stm32-uart";
+			reg = <0x40011000 0x400>;
+			interrupts = <37>;
+			clocks = <&clk_pclk2>;
+			status = "disabled";
+		};
+
+		usart6: serial@40011400 {
+			compatible = "st,stm32-usart", "st,stm32-uart";
+			reg = <0x40011400 0x400>;
+			interrupts = <71>;
+			clocks = <&clk_pclk2>;
+			status = "disabled";
+		};
+
+		rcc: rcc@40023800 {
+			#reset-cells = <1>;
+			compatible = "st,stm32-rcc";
+			reg = <0x40023800 0x400>;
+		};
+	};
+};
+
+&systick {
+	clocks = <&clk_systick>;
+	status = "okay";
+};
-- 
1.9.1


* [PATCH v3  12/15] ARM: dts: Add ARM System timer as clockevent in armv7m
From: Maxime Coquelin @ 2015-03-12 21:55 UTC (permalink / raw)
  To: u.kleine-koenig, afaerber, geert, Rob Herring, Philipp Zabel,
	Linus Walleij, Arnd Bergmann, stefan, pmeerw, pebolle
  Cc: Jonathan Corbet, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, Russell King, Daniel Lezcano, Thomas Gleixner,
	Greg Kroah-Hartman, Jiri Slaby, Andrew Morton, David S. Miller,
	Mauro Carvalho Chehab, Joe Perches, Antti Palosaari, Tejun Heo,
	Will Deacon, Nikolay Borisov, Rusty Russell, Kees Cook,
	Michal Marek, linux-doc, linux-arm-kernel, linux-kernel
In-Reply-To: <1426197361-19290-1-git-send-email-maxime.coquelin@st.com>

From: Maxime Coquelin <mcoquelin.stm32@gmail.com>

Signed-off-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
---
 arch/arm/boot/dts/armv7-m.dtsi | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm/boot/dts/armv7-m.dtsi b/arch/arm/boot/dts/armv7-m.dtsi
index 5a660d0..b1ad7cf 100644
--- a/arch/arm/boot/dts/armv7-m.dtsi
+++ b/arch/arm/boot/dts/armv7-m.dtsi
@@ -8,6 +8,12 @@
 		reg = <0xe000e100 0xc00>;
 	};
 
+	systick: timer@e000e010 {
+		compatible = "arm,armv7m-systick";
+		reg = <0xe000e010 0x10>;
+		status = "disabled";
+	};
+
 	soc {
 		#address-cells = <1>;
 		#size-cells = <1>;
-- 
1.9.1

