Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* Re: [PATCH 02/10] x86/cet: Introduce WRUSS instruction
From: Yu-cheng Yu @ 2018-06-11 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, LKML, linux-doc, Linux-MM, linux-arch, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H. J. Lu,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <20180611081704.GI12180@hirez.programming.kicks-ass.net>

On Mon, 2018-06-11 at 10:17 +0200, Peter Zijlstra wrote:
> On Thu, Jun 07, 2018 at 09:40:02AM -0700, Andy Lutomirski wrote:
> > On Thu, Jun 7, 2018 at 7:41 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> 
> > Peterz, isn't there some fancy better way we're supposed to handle the
> > error return these days?
> 
> > > +       asm volatile("1:.byte 0x66, 0x0f, 0x38, 0xf5, 0x37\n"
> > > +                    "xor %[err],%[err]\n"
> > > +                    "2:\n"
> > > +                    ".section .fixup,\"ax\"\n"
> > > +                    "3: mov $-1,%[err]; jmp 2b\n"
> > > +                    ".previous\n"
> > > +                    _ASM_EXTABLE(1b, 3b)
> > > +               : [err] "=a" (err)
> > > +               : [val] "S" (val), [addr] "D" (addr)
> > > +               : "memory");
> 
> So the alternative is something like:
> 
> __visible bool ex_handler_wuss(const struct exception_table_entry *fixup,
> 			       struct pt_regs *regs, int trapnr)
> {
> 	regs->ip = ex_fixup_addr(fixup);
> 	regs->ax = -1L;
> 
> 	return true;
> }
> 
> 
> 	int err = 0;
> 
> 	asm volatile("1: INSN_WUSS\n"
> 		     "2:\n"
> 		     _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wuss)
> 		     : "=a" (err)
> 		     : "S" (val), "D" (addr));
> 
> But I'm not at all sure that's actually better.

Thanks!  I will fix it.

Yu-cheng


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] Documentation: Fix reference to stm.txt
From: Mathieu Poirier @ 2018-06-11 16:15 UTC (permalink / raw)
  To: corbet, joe
  Cc: tom.saeger, changbin.du, alexander.shishkin, linux-arm-kernel,
	linux-doc, linux-kernel

Commit "1606f8d8e75b trace doc: convert trace/stm.txt to rst format"
changed stm.txt to stm.rst but references to it in other files have not
been modified, something that is corrected in this patch.

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
---
 Documentation/trace/coresight.txt | 2 +-
 Documentation/trace/intel_th.rst  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt
index 1d74ad0202b6..efbc832146e7 100644
--- a/Documentation/trace/coresight.txt
+++ b/Documentation/trace/coresight.txt
@@ -426,5 +426,5 @@ root@genericarmv8:~#
 Details on how to use the generic STM API can be found here [2].
 
 [1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
-[2]. Documentation/trace/stm.txt
+[2]. Documentation/trace/stm.rst
 [3]. https://github.com/Linaro/perf-opencsd
diff --git a/Documentation/trace/intel_th.rst b/Documentation/trace/intel_th.rst
index 990f13265178..fdb34114ee20 100644
--- a/Documentation/trace/intel_th.rst
+++ b/Documentation/trace/intel_th.rst
@@ -38,7 +38,7 @@ description is at Documentation/ABI/testing/sysfs-bus-intel_th-devices-gth.
 
 STH registers an stm class device, through which it provides interface
 to userspace and kernelspace software trace sources. See
-Documentation/trace/stm.txt for more information on that.
+Documentation/trace/stm.rst for more information on that.
 
 MSU can be configured to collect trace data into a system memory
 buffer, which can later on be read from its device nodes via read() or
@@ -89,7 +89,7 @@ Quick example
 
 	$ echo 1 > /sys/bus/intel_th/devices/0-msc0/active
 
-# .. send data to master 33, see stm.txt for more details ..
+# .. send data to master 33, see stm.rst for more details ..
 # .. wait for traces to pile up ..
 # .. and stop the trace::
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: Thomas Gleixner @ 2018-06-12 10:03 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Andy Lutomirski, Yu-cheng Yu, LKML, linux-doc, Linux-MM,
	linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CAMe9rOr49V8rqRa_KVsw61PWd+crkQvPDgPKtvowazjmsfgWWQ@mail.gmail.com>

On Thu, 7 Jun 2018, H.J. Lu wrote:
> On Thu, Jun 7, 2018 at 2:01 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > Why is the lockout necessary?  If user code enables CET and tries to
> > run code that doesn't support CET, it will crash.  I don't see why we
> > need special code in the kernel to prevent a user program from calling
> > arch_prctl() and crashing itself.  There are already plenty of ways to
> > do that :)
> 
> On CET enabled machine, not all programs nor shared libraries are
> CET enabled.  But since ld.so is CET enabled, all programs start
> as CET enabled.  ld.so will disable CET if a program or any of its shared
> libraries aren't CET enabled.  ld.so will lock up CET once it is done CET
> checking so that CET can't no longer be disabled afterwards.

That works for stuff which loads all libraries at start time, but what
happens if the program uses dlopen() later on? If CET is force locked and
the library is not CET enabled, it will fail.

I don't see the point of trying to support CET by magic. It adds complexity
and you'll never be able to handle all corner cases correctly. dlopen() is
not even a corner case.

Occasionally stuff needs to be recompiled to utilize new mechanisms, see
retpoline ...

Thanks,

	tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 4/6] net: ethernet: ti: cpsw: add CBS Qdisc offload
From: Ilias Apalodimas @ 2018-06-12 10:18 UTC (permalink / raw)
  To: Ivan Khoronzhuk
  Cc: grygorii.strashko, davem, corbet, akpm, netdev, linux-doc,
	linux-kernel, linux-omap, vinicius.gomes, henrik,
	jesus.sanchez-palencia, p-varis, spatton, francois.ozog, yogeshs,
	nsekhar
In-Reply-To: <20180611133047.4818-5-ivan.khoronzhuk@linaro.org>

On Mon, Jun 11, 2018 at 04:30:45PM +0300, Ivan Khoronzhuk wrote:
> The cpsw has up to 4 FIFOs per port and upper 3 FIFOs can feed rate
> limited queue with shaping. In order to set and enable shaping for
> those 3 FIFOs queues the network device with CBS qdisc attached is
> needed. The CBS configuration is added for dual-emac/single port mode
> only, but potentially can be used in switch mode also, based on
> switchdev for instance.
> 
> Despite the FIFO shapers can work w/o cpdma level shapers the base
> usage must be in combine with cpdma level shapers as described in TRM,
> that are set as maximum rates for interface queues with sysfs.
> 
> One of the possible configuration with txq shapers and CBS shapers:
> 
>                       Configured with echo RATE >
>                   /sys/class/net/eth0/queues/tx-0/tx_maxrate
>              /---------------------------------------------------
>             /
>            /            cpdma level shapers
>         +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
>         | c7 | | c6 | | c5 | | c4 | | c3 | | c2 | | c1 | | c0 |
>         \    / \    / \    / \    / \    / \    / \    / \    /
>          \  /   \  /   \  /   \  /   \  /   \  /   \  /   \  /
>           \/     \/     \/     \/     \/     \/     \/     \/
> +---------|------|------|------|-------------------------------------+
> |    +----+      |      |  +---+                                     |
> |    |      +----+      |  |                                         |
> |    v      v           v  v                                         |
> | +----+ +----+ +----+ +----+ p        p+----+ +----+ +----+ +----+  |
> | |    | |    | |    | |    | o        o|    | |    | |    | |    |  |
> | | f3 | | f2 | | f1 | | f0 | r  CPSW  r| f3 | | f2 | | f1 | | f0 |  |
> | |    | |    | |    | |    | t        t|    | |    | |    | |    |  |
> | \    / \    / \    / \    / 0        1\    / \    / \    / \    /  |
> |  \  X   \  /   \  /   \  /             \  /   \  /   \  /   \  /   |
> |   \/ \   \/     \/     \/               \/     \/     \/     \/    |
> +-------\------------------------------------------------------------+
>          \
>           \ FIFO shaper, set with CBS offload added in this patch,
>            \ FIFO0 cannot be rate limited
>             ------------------------------------------------------
> 
> CBS shaper configuration is supposed to be used with root MQPRIO Qdisc
> offload allowing to add sk_prio->tc->txq maps that direct traffic to
> appropriate tx queue and maps L2 priority to FIFO shaper.
> 
> The CBS shaper is intended to be used for AVB where L2 priority
> (pcp field) is used to differentiate class of traffic. So additionally
> vlan needs to be created with appropriate egress sk_prio->l2 prio map.
> 
> If CBS has several tx queues assigned to it, the sum of their
> bandwidth has not overlap bandwidth set for CBS. It's recomended the
> CBS bandwidth to be a little bit more.
> 
> The CBS shaper is configured with CBS qdisc offload interface using tc
> tool from iproute2 packet.
> 
> For instance:
> 
> $ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
> 
> $ tc -g class show dev eth0
> +---(100:ffe2) mqprio
> |    +---(100:3) mqprio
> |    +---(100:4) mqprio
> |    
> +---(100:ffe1) mqprio
> |    +---(100:2) mqprio
> |    
> +---(100:ffe0) mqprio
>      +---(100:1) mqprio
> 
> $ tc qdisc add dev eth0 parent 100:1 cbs locredit -1440 \
> hicredit 60 sendslope -960000 idleslope 40000 offload 1
> 
> $ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
> hicredit 62 sendslope -980000 idleslope 20000 offload 1
> 
> The above code set CBS shapers for tc0 and tc1, for that txq0 and
> txq1 is used. Pay attention, the real set bandwidth can differ a bit
> due to discreteness of configuration parameters.
> 
> Here parameters like locredit, hicredit and sendslope are ignored
> internally and are supposed to be set with assumption that maximum
> frame size for frame - 1500.
> 
> It's supposed that interface speed is not changed while reconnection,
> not always is true, so inform user in case speed of interface was
> changed, as it can impact on dependent shapers configuration.
> 
> For more examples see Documentation.
> 
> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> ---
>  drivers/net/ethernet/ti/cpsw.c | 221 +++++++++++++++++++++++++++++++++
>  1 file changed, 221 insertions(+)
> 
> diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
> index fd967d2bce5d..87a5586c5ea5 100644
> --- a/drivers/net/ethernet/ti/cpsw.c
> +++ b/drivers/net/ethernet/ti/cpsw.c
> @@ -46,6 +46,8 @@
>  #include "cpts.h"
>  #include "davinci_cpdma.h"
>  
> +#include <net/pkt_sched.h>
> +
>  #define CPSW_DEBUG	(NETIF_MSG_HW		| NETIF_MSG_WOL		| \
>  			 NETIF_MSG_DRV		| NETIF_MSG_LINK	| \
>  			 NETIF_MSG_IFUP		| NETIF_MSG_INTR	| \
> @@ -154,8 +156,12 @@ do {								\
>  #define IRQ_NUM			2
>  #define CPSW_MAX_QUEUES		8
>  #define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
> +#define CPSW_FIFO_QUEUE_TYPE_SHIFT	16
> +#define CPSW_FIFO_SHAPE_EN_SHIFT	16
> +#define CPSW_FIFO_RATE_EN_SHIFT		20
>  #define CPSW_TC_NUM			4
>  #define CPSW_FIFO_SHAPERS_NUM		(CPSW_TC_NUM - 1)
> +#define CPSW_PCT_MASK			0x7f
>  
>  #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_SHIFT	29
>  #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_MSK		GENMASK(2, 0)
> @@ -457,6 +463,8 @@ struct cpsw_priv {
>  	bool				rx_pause;
>  	bool				tx_pause;
>  	bool				mqprio_hw;
> +	int				fifo_bw[CPSW_TC_NUM];
> +	int				shp_cfg_speed;
>  	u32 emac_port;
>  	struct cpsw_common *cpsw;
>  };
> @@ -1081,6 +1089,38 @@ static void cpsw_set_slave_mac(struct cpsw_slave *slave,
>  	slave_write(slave, mac_lo(priv->mac_addr), SA_LO);
>  }
>  
> +static bool cpsw_shp_is_off(struct cpsw_priv *priv)
> +{
> +	struct cpsw_common *cpsw = priv->cpsw;
> +	struct cpsw_slave *slave;
> +	u32 shift, mask, val;
> +
> +	val = readl_relaxed(&cpsw->regs->ptype);
> +
> +	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
> +	shift = CPSW_FIFO_SHAPE_EN_SHIFT + 3 * slave->slave_num;
> +	mask = 7 << shift;
> +	val = val & mask;
> +
> +	return !val;
> +}
> +
> +static void cpsw_fifo_shp_on(struct cpsw_priv *priv, int fifo, int on)
> +{
> +	struct cpsw_common *cpsw = priv->cpsw;
> +	struct cpsw_slave *slave;
> +	u32 shift, mask, val;
> +
> +	val = readl_relaxed(&cpsw->regs->ptype);
> +
> +	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
> +	shift = CPSW_FIFO_SHAPE_EN_SHIFT + 3 * slave->slave_num;
> +	mask = (1 << --fifo) << shift;
> +	val = on ? val | mask : val & ~mask;
> +
> +	writel_relaxed(val, &cpsw->regs->ptype);
> +}
> +
>  static void _cpsw_adjust_link(struct cpsw_slave *slave,
>  			      struct cpsw_priv *priv, bool *link)
>  {
> @@ -1120,6 +1160,12 @@ static void _cpsw_adjust_link(struct cpsw_slave *slave,
>  			mac_control |= BIT(4);
>  
>  		*link = true;
> +
> +		if (priv->shp_cfg_speed &&
> +		    priv->shp_cfg_speed != slave->phy->speed &&
> +		    !cpsw_shp_is_off(priv))
> +			dev_warn(priv->dev,
> +				 "Speed was changed, CBS sahper speeds are changed!");
>  	} else {
>  		mac_control = 0;
>  		/* disable forwarding */
> @@ -1589,6 +1635,178 @@ static int cpsw_tc_to_fifo(int tc, int num_tc)
>  	return CPSW_FIFO_SHAPERS_NUM - tc;
>  }
>  
> +static int cpsw_set_fifo_bw(struct cpsw_priv *priv, int fifo, int bw)
> +{
> +	struct cpsw_common *cpsw = priv->cpsw;
> +	u32 val = 0, send_pct, shift;
> +	struct cpsw_slave *slave;
> +	int pct = 0, i;
> +
> +	if (bw > priv->shp_cfg_speed * 1000)
> +		goto err;
> +
> +	/* shaping has to stay enabled for highest fifos linearly
> +	 * and fifo bw no more then interface can allow
> +	 */
> +	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
> +	send_pct = slave_read(slave, SEND_PERCENT);
> +	for (i = CPSW_FIFO_SHAPERS_NUM; i > 0; i--) {
> +		if (!bw) {
> +			if (i >= fifo || !priv->fifo_bw[i])
> +				continue;
> +
> +			dev_warn(priv->dev, "Prev FIFO%d is shaped", i);
> +			continue;
> +		}
> +
> +		if (!priv->fifo_bw[i] && i > fifo) {
> +			dev_err(priv->dev, "Upper FIFO%d is not shaped", i);
> +			return -EINVAL;
> +		}
> +
> +		shift = (i - 1) * 8;
> +		if (i == fifo) {
> +			send_pct &= ~(CPSW_PCT_MASK << shift);
> +			val = DIV_ROUND_UP(bw, priv->shp_cfg_speed * 10);
> +			if (!val)
> +				val = 1;
> +
> +			send_pct |= val << shift;
> +			pct += val;
> +			continue;
> +		}
> +
> +		if (priv->fifo_bw[i])
> +			pct += (send_pct >> shift) & CPSW_PCT_MASK;
> +	}
> +
> +	if (pct >= 100)
> +		goto err;
> +
> +	slave_write(slave, send_pct, SEND_PERCENT);
> +	priv->fifo_bw[fifo] = bw;
> +
> +	dev_warn(priv->dev, "set FIFO%d bw = %d\n", fifo,
> +		 DIV_ROUND_CLOSEST(val * priv->shp_cfg_speed, 100));
> +
> +	return 0;
> +err:
> +	dev_err(priv->dev, "Bandwidth doesn't fit in tc configuration");
> +	return -EINVAL;
> +}
> +
> +static int cpsw_set_fifo_rlimit(struct cpsw_priv *priv, int fifo, int bw)
> +{
> +	struct cpsw_common *cpsw = priv->cpsw;
> +	struct cpsw_slave *slave;
> +	u32 tx_in_ctl_rg, val;
> +	int ret;
> +
> +	ret = cpsw_set_fifo_bw(priv, fifo, bw);
> +	if (ret)
> +		return ret;
> +
> +	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
> +	tx_in_ctl_rg = cpsw->version == CPSW_VERSION_1 ?
> +		       CPSW1_TX_IN_CTL : CPSW2_TX_IN_CTL;
> +
> +	if (!bw)
> +		cpsw_fifo_shp_on(priv, fifo, bw);
> +
> +	val = slave_read(slave, tx_in_ctl_rg);
> +	if (cpsw_shp_is_off(priv)) {
> +		/* disable FIFOs rate limited queues */
> +		val &= ~(0xf << CPSW_FIFO_RATE_EN_SHIFT);
> +
> +		/* set type of FIFO queues to normal priority mode */
> +		val &= ~(3 << CPSW_FIFO_QUEUE_TYPE_SHIFT);
> +
> +		/* set type of FIFO queues to be rate limited */
> +		if (bw)
> +			val |= 2 << CPSW_FIFO_QUEUE_TYPE_SHIFT;
> +		else
> +			priv->shp_cfg_speed = 0;
> +	}
> +
> +	/* toggle a FIFO rate limited queue */
> +	if (bw)
> +		val |= BIT(fifo + CPSW_FIFO_RATE_EN_SHIFT);
> +	else
> +		val &= ~BIT(fifo + CPSW_FIFO_RATE_EN_SHIFT);
> +	slave_write(slave, val, tx_in_ctl_rg);
> +
> +	/* FIFO transmit shape enable */
> +	cpsw_fifo_shp_on(priv, fifo, bw);
> +	return 0;
> +}
> +
> +/* Defaults:
> + * class A - prio 3
> + * class B - prio 2
> + * shaping for class A should be set first
> + */
> +static int cpsw_set_cbs(struct net_device *ndev,
> +			struct tc_cbs_qopt_offload *qopt)
> +{
> +	struct cpsw_priv *priv = netdev_priv(ndev);
> +	struct cpsw_common *cpsw = priv->cpsw;
> +	struct cpsw_slave *slave;
> +	int prev_speed = 0;
> +	int tc, ret, fifo;
> +	u32 bw = 0;
> +
> +	tc = netdev_txq_to_tc(priv->ndev, qopt->queue);
> +
> +	/* enable channels in backward order, as highest FIFOs must be rate
> +	 * limited first and for compliance with CPDMA rate limited channels
> +	 * that also used in bacward order. FIFO0 cannot be rate limited.
> +	 */
> +	fifo = cpsw_tc_to_fifo(tc, ndev->num_tc);
> +	if (!fifo) {
> +		dev_err(priv->dev, "Last tc%d can't be rate limited", tc);
> +		return -EINVAL;
> +	}
> +
> +	/* do nothing, it's disabled anyway */
> +	if (!qopt->enable && !priv->fifo_bw[fifo])
> +		return 0;
> +
> +	/* shapers can be set if link speed is known */
> +	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
> +	if (slave->phy && slave->phy->link) {
> +		if (priv->shp_cfg_speed &&
> +		    priv->shp_cfg_speed != slave->phy->speed)
> +			prev_speed = priv->shp_cfg_speed;
> +
> +		priv->shp_cfg_speed = slave->phy->speed;
> +	}
> +
> +	if (!priv->shp_cfg_speed) {
> +		dev_err(priv->dev, "Link speed is not known");
> +		return -1;
> +	}
> +
> +	ret = pm_runtime_get_sync(cpsw->dev);
> +	if (ret < 0) {
> +		pm_runtime_put_noidle(cpsw->dev);
> +		return ret;
> +	}
> +
> +	bw = qopt->enable ? qopt->idleslope : 0;
> +	ret = cpsw_set_fifo_rlimit(priv, fifo, bw);
> +	if (ret) {
> +		priv->shp_cfg_speed = prev_speed;
> +		prev_speed = 0;
> +	}
> +
> +	if (bw && prev_speed)
> +		dev_warn(priv->dev,
> +			 "Speed was changed, CBS sahper speeds are changed!");
> +
> +	pm_runtime_put_sync(cpsw->dev);
> +	return ret;
> +}
> +
>  static int cpsw_ndo_open(struct net_device *ndev)
>  {
>  	struct cpsw_priv *priv = netdev_priv(ndev);
> @@ -2263,6 +2481,9 @@ static int cpsw_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
>  			     void *type_data)
>  {
>  	switch (type) {
> +	case TC_SETUP_QDISC_CBS:
> +		return cpsw_set_cbs(ndev, type_data);
> +
>  	case TC_SETUP_QDISC_MQPRIO:
>  		return cpsw_set_tc(ndev, type_data);
>  
> -- 
> 2.17.1
> 

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v11 00/13] Intel SGX1 support
From: Pavel Machek @ 2018-06-12 10:50 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, platform-driver-x86, dave.hansen, sean.j.christopherson,
	nhorman, npmccallum, Alexei Starovoitov, Andi Kleen,
	Andrew Morton, Andy Lutomirski, Borislav Petkov, David S. Miller,
	David Woodhouse, Greg Kroah-Hartman, H. Peter Anvin, Ingo Molnar,
	open list:INTEL SGX, Janakarajan Natarajan, Kirill A. Shutemov,
	Konrad Rzeszutek Wilk,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Len Brown,
	Linus Walleij, open list:CRYPTO API, open list:DOCUMENTATION,
	open list, open list:SPARSE CHECKER, Mauro Carvalho Chehab,
	Peter Zijlstra, Rafael J. Wysocki, Randy Dunlap, Ricardo Neri,
	Thomas Gleixner, Tom Lendacky, Vikas Shivappa
In-Reply-To: <20180608171216.26521-1-jarkko.sakkinen@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 769 bytes --]

On Fri 2018-06-08 19:09:35, Jarkko Sakkinen wrote:
> Intel(R) SGX is a set of CPU instructions that can be used by applications
> to set aside private regions of code and data. The code outside the enclave
> is disallowed to access the memory inside the enclave by the CPU access
> control.  In a way you can think that SGX provides inverted sandbox. It
> protects the application from a malicious host.

Do you intend to allow non-root applications to use SGX?

What are non-evil uses for SGX?

...because it is quite useful for some kinds of evil:

https://taesoo.kim/pubs/2017/jang:sgx-bomb.pdf

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Balbir Singh @ 2018-06-12 10:56 UTC (permalink / raw)
  To: Yu-cheng Yu, linux-kernel, linux-doc, linux-mm, linux-arch, x86,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H.J. Lu,
	Vedvyas Shanbhogue, Ravi V. Shankar, Dave Hansen, Andy Lutomirski,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, Mike Kravetz
In-Reply-To: <20180607143807.3611-1-yu-cheng.yu@intel.com>



On 08/06/18 00:37, Yu-cheng Yu wrote:
> This series introduces CET - Shadow stack
> 
> At the high level, shadow stack is:
> 
> 	Allocated from a task's address space with vm_flags VM_SHSTK;
> 	Its PTEs must be read-only and dirty;
> 	Fixed sized, but the default size can be changed by sys admin.
> 
> For a forked child, the shadow stack is duplicated when the next
> shadow stack access takes place.
> 
> For a pthread child, a new shadow stack is allocated.
> 
> The signal handler uses the same shadow stack as the main program.
> 

Even with sigaltstack()?


Balbir Singh.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: H.J. Lu @ 2018-06-12 11:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Yu-cheng Yu, LKML, linux-doc, Linux-MM,
	linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <alpine.DEB.2.21.1806121155450.2157@nanos.tec.linutronix.de>

On Tue, Jun 12, 2018 at 3:03 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 7 Jun 2018, H.J. Lu wrote:
>> On Thu, Jun 7, 2018 at 2:01 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> > Why is the lockout necessary?  If user code enables CET and tries to
>> > run code that doesn't support CET, it will crash.  I don't see why we
>> > need special code in the kernel to prevent a user program from calling
>> > arch_prctl() and crashing itself.  There are already plenty of ways to
>> > do that :)
>>
>> On CET enabled machine, not all programs nor shared libraries are
>> CET enabled.  But since ld.so is CET enabled, all programs start
>> as CET enabled.  ld.so will disable CET if a program or any of its shared
>> libraries aren't CET enabled.  ld.so will lock up CET once it is done CET
>> checking so that CET can't no longer be disabled afterwards.
>
> That works for stuff which loads all libraries at start time, but what
> happens if the program uses dlopen() later on? If CET is force locked and
> the library is not CET enabled, it will fail.

That is to prevent disabling CET by dlopening a legacy shared library.

> I don't see the point of trying to support CET by magic. It adds complexity
> and you'll never be able to handle all corner cases correctly. dlopen() is
> not even a corner case.

That is a price we pay for security.  To enable CET, especially shadow
shack, the program and all of shared libraries it uses should be CET
enabled.  Most of programs can be enabled with CET by compiling them
with -fcf-protection.

> Occasionally stuff needs to be recompiled to utilize new mechanisms, see
> retpoline ...
>
> Thanks,
>
>         tglx
>



-- 
H.J.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 01/10] x86/cet: User-mode shadow stack support
From: Balbir Singh @ 2018-06-12 11:56 UTC (permalink / raw)
  To: Yu-cheng Yu, linux-kernel, linux-doc, linux-mm, linux-arch, x86,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H.J. Lu,
	Vedvyas Shanbhogue, Ravi V. Shankar, Dave Hansen, Andy Lutomirski,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, Mike Kravetz
In-Reply-To: <20180607143807.3611-2-yu-cheng.yu@intel.com>



On 08/06/18 00:37, Yu-cheng Yu wrote:
> This patch adds basic shadow stack enabling/disabling routines.
> A task's shadow stack is allocated from memory with VM_SHSTK
> flag set and read-only protection.  The shadow stack is
> allocated to a fixed size and that can be changed by the system
> admin.
> 

I presume a read-only permission on the kernel side, but it
can be written by hardware?

> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/cet.h               |  32 ++++++++
>  arch/x86/include/asm/disabled-features.h |   8 +-
>  arch/x86/include/asm/msr-index.h         |  14 ++++
>  arch/x86/include/asm/processor.h         |   5 ++
>  arch/x86/kernel/Makefile                 |   2 +
>  arch/x86/kernel/cet.c                    | 123 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/common.c             |  24 ++++++
>  arch/x86/kernel/process.c                |   2 +
>  fs/proc/task_mmu.c                       |   3 +
>  9 files changed, 212 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/include/asm/cet.h
>  create mode 100644 arch/x86/kernel/cet.c
> 
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> new file mode 100644
> index 000000000000..9d5bc1efc9b7
> --- /dev/null
> +++ b/arch/x86/include/asm/cet.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_CET_H
> +#define _ASM_X86_CET_H
> +
> +#ifndef __ASSEMBLY__
> +#include <linux/types.h>
> +
> +struct task_struct;
> +/*
> + * Per-thread CET status
> + */
> +struct cet_stat {

stat sounds like statistics, just expand out to status please

> +	unsigned long	shstk_base;
> +	unsigned long	shstk_size;
> +	unsigned int	shstk_enabled:1;
> +};
> +
> +#ifdef CONFIG_X86_INTEL_CET
> +unsigned long cet_get_shstk_ptr(void);

For the current task? Why does _ptr routine return an unsigned long?

> +int cet_setup_shstk(void);
> +void cet_disable_shstk(void);
> +void cet_disable_free_shstk(struct task_struct *p);
> +#else
> +static inline unsigned long cet_get_shstk_ptr(void) { return 0; }
> +static inline int cet_setup_shstk(void) { return 0; }
> +static inline void cet_disable_shstk(void) {}
> +static inline void cet_disable_free_shstk(struct task_struct *p) {}
> +#endif
> +
> +#endif /* __ASSEMBLY__ */
> +
> +#endif /* _ASM_X86_CET_H */
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 33833d1909af..3624a11e5ba6 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -56,6 +56,12 @@
>  # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
>  #endif
>  
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +#define DISABLE_SHSTK	0
> +#else
> +#define DISABLE_SHSTK	(1<<(X86_FEATURE_SHSTK & 31))
> +#endif
> +
>  /*
>   * Make sure to add features to the correct mask
>   */
> @@ -75,7 +81,7 @@
>  #define DISABLED_MASK13	0
>  #define DISABLED_MASK14	0
>  #define DISABLED_MASK15	0
> -#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
> +#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
>  #define DISABLED_MASK17	0
>  #define DISABLED_MASK18	0
>  #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index fda2114197b3..428d13828ba9 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -770,4 +770,18 @@
>  #define MSR_VM_IGNNE                    0xc0010115
>  #define MSR_VM_HSAVE_PA                 0xc0010117
>  
> +/* Control-flow Enforcement Technology MSRs */
> +#define MSR_IA32_U_CET		0x6a0
> +#define MSR_IA32_S_CET		0x6a2
> +#define MSR_IA32_PL0_SSP	0x6a4
> +#define MSR_IA32_PL3_SSP	0x6a7
> +#define MSR_IA32_INT_SSP_TAB	0x6a8

some comments on the purpose of the MSR would be nice

> +
> +/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
> +#define MSR_IA32_CET_SHSTK_EN		0x0000000000000001
> +#define MSR_IA32_CET_WRSS_EN		0x0000000000000002
> +#define MSR_IA32_CET_ENDBR_EN		0x0000000000000004
> +#define MSR_IA32_CET_LEG_IW_EN		0x0000000000000008
> +#define MSR_IA32_CET_NO_TRACK_EN	0x0000000000000010
> +

Same as above

>  #endif /* _ASM_X86_MSR_INDEX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 21a114914ba4..e632dd7adaac 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -24,6 +24,7 @@ struct vm86;
>  #include <asm/special_insns.h>
>  #include <asm/fpu/types.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/cet.h>
>  
>  #include <linux/personality.h>
>  #include <linux/cache.h>
> @@ -507,6 +508,10 @@ struct thread_struct {
>  	unsigned int		sig_on_uaccess_err:1;
>  	unsigned int		uaccess_err:1;	/* uaccess failed */
>  
> +#ifdef CONFIG_X86_INTEL_CET
> +	struct cet_stat		cet;
> +#endif
> +
>  	/* Floating point and extended processor state */
>  	struct fpu		fpu;
>  	/*
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 02d6f5cf4e70..7ea5e099d558 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -138,6 +138,8 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
>  obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
>  obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
>  
> +obj-$(CONFIG_X86_INTEL_CET)		+= cet.o
> +
>  ###
>  # 64 bit specific files
>  ifeq ($(CONFIG_X86_64),y)
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> new file mode 100644
> index 000000000000..8abbfd44322a
> --- /dev/null
> +++ b/arch/x86/kernel/cet.c
> @@ -0,0 +1,123 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * cet.c - Control Flow Enforcement (CET)
> + *
> + * Copyright (c) 2018, Intel Corporation.
> + * Yu-cheng Yu <yu-cheng.yu@intel.com>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/signal.h>
> +#include <asm/msr.h>
> +#include <asm/user.h>
> +#include <asm/fpu/xstate.h>
> +#include <asm/fpu/types.h>
> +#include <asm/cet.h>
> +
> +#define SHSTK_SIZE (0x8000 * (test_thread_flag(TIF_IA32) ? 4 : 8))
> +
> +static inline int cet_set_shstk_ptr(unsigned long addr)
> +{
> +	u64 r;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return -1;
> +
> +	if ((addr >= TASK_SIZE) || (!IS_ALIGNED(addr, 4)))
> +		return -1;

I think there was a comment about this being TASK_SIZE_MAX

> +
> +	rdmsrl(MSR_IA32_U_CET, r);
> +	wrmsrl(MSR_IA32_U_CET, r | MSR_IA32_CET_SHSTK_EN);
> +	wrmsrl(MSR_IA32_PL3_SSP, addr);

Should the enable happen before setting addr? I would expect to do this in the opposite order.

> +	return 0;
> +}
> +
> +unsigned long cet_get_shstk_ptr(void)
> +{
> +	unsigned long ptr;
> +
> +	if (!current->thread.cet.shstk_enabled)
> +		return 0;
> +
> +	rdmsrl(MSR_IA32_PL3_SSP, ptr);
> +	return ptr;
> +}
> +
> +static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long populate;
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap(NULL, addr, len, PROT_READ,
> +		       MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
> +		       0, &populate, NULL);
> +	up_write(&mm->mmap_sem);

What happens if the mmap fails for any reason? I presume the caller disables shadow stack on this process?

> +
> +	if (populate)
> +		mm_populate(addr, populate);
> +
> +	return addr;
> +}
> +
> +int cet_setup_shstk(void)
> +{
> +	unsigned long addr, size;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return -EOPNOTSUPP;
> +
> +	size = SHSTK_SIZE;
> +	addr = shstk_mmap(0, size);
> +
> +	if (addr >= TASK_SIZE)
> +		return -ENOMEM;
> +

TASK_SIZE_MAX?

> +	cet_set_shstk_ptr(addr + size - sizeof(void *));
> +	current->thread.cet.shstk_base = addr;
> +	current->thread.cet.shstk_size = size;
> +	current->thread.cet.shstk_enabled = 1;
> +	return 0;
> +}
> +
> +void cet_disable_shstk(void)
> +{
> +	u64 r;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return;
> +
> +	rdmsrl(MSR_IA32_U_CET, r);
> +	r &= ~(MSR_IA32_CET_SHSTK_EN);
> +	wrmsrl(MSR_IA32_U_CET, r);
> +	wrmsrl(MSR_IA32_PL3_SSP, 0);

Again, I'd expect the order to be the reverse

> +	current->thread.cet.shstk_enabled = 0;
> +}
> +
> +void cet_disable_free_shstk(struct task_struct *tsk)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> +	    !tsk->thread.cet.shstk_enabled)
> +		return;
> +
> +	if (tsk == current)
> +		cet_disable_shstk();
> +
> +	/*
> +	 * Free only when tsk is current or shares mm
> +	 * with current but has its own shstk.
> +	 */
> +	if (tsk->mm && (tsk->mm == current->mm) &&
> +	    (tsk->thread.cet.shstk_base)) {

Does the caller hold a reference to tsk->mm?

> +		vm_munmap(tsk->thread.cet.shstk_base,
> +			  tsk->thread.cet.shstk_size);
> +		tsk->thread.cet.shstk_base = 0;
> +		tsk->thread.cet.shstk_size = 0;
> +	}
> +
> +	tsk->thread.cet.shstk_enabled = 0;
> +}
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 38276f58d3bf..f54fabdaef60 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -401,6 +401,29 @@ static __init int setup_disable_pku(char *arg)
>  __setup("nopku", setup_disable_pku);
>  #endif /* CONFIG_X86_64 */
>  
> +static __always_inline void setup_cet(struct cpuinfo_x86 *c)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		cr4_set_bits(X86_CR4_CET);
> +}
> +
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +static __init int setup_disable_shstk(char *s)
> +{
> +	/* require an exact match without trailing characters */
> +	if (strlen(s))
> +		return 0;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SHSTK))
> +		return 1;
> +
> +	setup_clear_cpu_cap(X86_FEATURE_SHSTK);
> +	pr_info("x86: 'noshstk' specified, disabling Shadow Stack\n");
> +	return 1;
> +}
> +__setup("noshstk", setup_disable_shstk);
> +#endif
> +
>  /*
>   * Some CPU features depend on higher CPUID levels, which may not always
>   * be available due to CPUID level capping or broken virtualization
> @@ -1313,6 +1336,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  	x86_init_rdrand(c);
>  	x86_init_cache_qos(c);
>  	setup_pku(c);
> +	setup_cet(c);
>  
>  	/*
>  	 * Clear/Set all flags overridden by options, need do it
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 30ca2d1a9231..b3b0b482983a 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,7 @@
>  #include <asm/desc.h>
>  #include <asm/prctl.h>
>  #include <asm/spec-ctrl.h>
> +#include <asm/cet.h>
>  
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
> @@ -136,6 +137,7 @@ void flush_thread(void)
>  	flush_ptrace_hw_breakpoint(tsk);
>  	memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
>  
> +	cet_disable_shstk();
>  	fpu__clear(&tsk->thread.fpu);
>  }
>  
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c486ad4b43f0..6aca93ecec0e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -679,6 +679,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  		[ilog2(VM_PKEY_BIT1)]	= "",
>  		[ilog2(VM_PKEY_BIT2)]	= "",
>  		[ilog2(VM_PKEY_BIT3)]	= "",
> +#endif
> +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
> +		[ilog2(VM_SHSTK)]	= "ss"
>  #endif
>  	};
>  	size_t i;
> 

Balbir Singh.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RESEND v4 2/2] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: gengdongjiu @ 2018-06-12 12:42 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: rkrcmar, corbet, marc.zyngier, linux, catalin.marinas,
	will.deacon, kvm, linux-doc, james.morse, linux-arm-kernel,
	linux-kernel, linux-acpi
In-Reply-To: <20180609111745.GJ5097@C02W217FHV2R.local>

Christoffer,
   Thanks for the review.

On 2018/6/9 19:17, Christoffer Dall wrote:
> On Sat, Jun 09, 2018 at 03:48:40AM +0800, Dongjiu Geng wrote:
>> For the migrating VMs, user space may need to know the exception
>> state. For example, in the machine A, KVM make an SError pending,
>> when migrate to B, KVM also needs to pend an SError.
>>
>> This new IOCTL exports user-invisible states related to SError.
>> Together with appropriate user space changes, user space can get/set
>> the SError exception state to do migrate/snapshot/suspend.
>>
>> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
>> ---
>> change since v3:
>> 1. Fix the memset() issue in the kvm_arm_vcpu_get_events()
>>
>> change since v2:
>> 1. Add kvm_vcpu_events structure definition for arm platform to avoid the build errors.
>>
>> change since v1:
>> Address Marc's comments, thanks Marc's review
>> 1. serror_has_esr always true when ARM64_HAS_RAS_EXTN is set
>> 2. remove Spurious blank line in kvm_arm_vcpu_set_events()
>> 3. rename pend_guest_serror() to kvm_set_sei_esr()
>> 4. Make kvm_arm_vcpu_get_events() did all the work rather than having this split responsibility.
>> 5.  using sizeof(events) instead of sizeof(struct kvm_vcpu_events)
>>
>> this series patch is separated from https://www.spinics.net/lists/kvm/msg168917.html
>> The user space patch is here: https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg06965.html
>>
>> change since V12:
>> 1. change (vcpu->arch.hcr_el2 & HCR_VSE) to !!(vcpu->arch.hcr_el2 & HCR_VSE) in kvm_arm_vcpu_get_events()
>>
>> Change since V11:
>> Address James's comments, thanks James
>> 1. Align the struct of kvm_vcpu_events to 64 bytes
>> 2. Avoid exposing the stale ESR value in the kvm_arm_vcpu_get_events()
>> 3. Change variables 'injected' name to 'serror_pending' in the kvm_arm_vcpu_set_events()
>> 4. Change to sizeof(events) from sizeof(struct kvm_vcpu_events) in kvm_arch_vcpu_ioctl()
>>
>> Change since V10:
>> Address James's comments, thanks James
>> 1. Merge the helper function with the user.
>> 2. Move the ISS_MASK into pend_guest_serror() to clear top bits
>> 3. Make kvm_vcpu_events struct align to 4 bytes
>> 4. Add something check in the kvm_arm_vcpu_set_events()
>> 5. Check kvm_arm_vcpu_get/set_events()'s return value.
>> 6. Initialise kvm_vcpu_events to 0 so that padding transferred to user-space doesn't
>>    contain kernel stack.
>> ---
>>  Documentation/virtual/kvm/api.txt    | 31 ++++++++++++++++++++++++++++---
>>  arch/arm/include/asm/kvm_host.h      |  6 ++++++
>>  arch/arm/include/uapi/asm/kvm.h      | 12 ++++++++++++
>>  arch/arm/kvm/guest.c                 | 12 ++++++++++++
>>  arch/arm64/include/asm/kvm_emulate.h |  5 +++++
>>  arch/arm64/include/asm/kvm_host.h    |  7 +++++++
>>  arch/arm64/include/uapi/asm/kvm.h    | 13 +++++++++++++
>>  arch/arm64/kvm/guest.c               | 36 ++++++++++++++++++++++++++++++++++++
>>  arch/arm64/kvm/inject_fault.c        |  6 +++---
>>  arch/arm64/kvm/reset.c               |  1 +
>>  virt/kvm/arm/arm.c                   | 19 +++++++++++++++++++
>>  11 files changed, 142 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index fdac969..8896737 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -835,11 +835,13 @@ struct kvm_clock_data {
>>  
>>  Capability: KVM_CAP_VCPU_EVENTS
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
>>  Parameters: struct kvm_vcpu_event (out)
>>  Returns: 0 on success, -1 on error
>>  
>> +X86:
>> +
>>  Gets currently pending exceptions, interrupts, and NMIs as well as related
>>  states of the vcpu.
>>  
>> @@ -881,15 +883,32 @@ Only two fields are defined in the flags field:
>>  - KVM_VCPUEVENT_VALID_SMM may be set in the flags field to signal that
>>    smi contains a valid state.
>>  
>> +ARM, ARM64:
>> +
>> +Gets currently pending SError exceptions as well as related states of the vcpu.
>> +
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
>>  4.32 KVM_SET_VCPU_EVENTS
>>  
>> -Capability: KVM_CAP_VCPU_EVENTS
>> +Capebility: KVM_CAP_VCPU_EVENTS
> 
> nit: unintended change?
> 
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
>>  Parameters: struct kvm_vcpu_event (in)
>>  Returns: 0 on success, -1 on error
>>  
>> +X86:
>> +
>>  Set pending exceptions, interrupts, and NMIs as well as related states of the
>>  vcpu.
>>  
>> @@ -910,6 +929,12 @@ shall be written into the VCPU.
>>  
>>  KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
>>  
>> +ARM, ARM64:
>> +
>> +Set pending SError exceptions as well as related states of the vcpu.
>> +
>> +See KVM_GET_VCPU_EVENTS for the data structure.
>> +
>>  
>>  4.33 KVM_GET_DEBUGREGS
>>  
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index c7c28c8..39f9901 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -213,6 +213,12 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
>>  int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
>>  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>>  int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>> +
>>  unsigned long kvm_call_hyp(void *hypfn, ...);
>>  void force_vm_exit(const cpumask_t *mask);
>>  
>> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
>> index caae484..c3e6975 100644
>> --- a/arch/arm/include/uapi/asm/kvm.h
>> +++ b/arch/arm/include/uapi/asm/kvm.h
>> @@ -124,6 +124,18 @@ struct kvm_sync_regs {
>>  struct kvm_arch_memory_slot {
>>  };
>>  
>> +/* for KVM_GET/SET_VCPU_EVENTS */
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
>>  #define KVM_REG_ARM_COPROC_SHIFT	16
>> diff --git a/arch/arm/kvm/guest.c b/arch/arm/kvm/guest.c
>> index a18f33e..c685f0e 100644
>> --- a/arch/arm/kvm/guest.c
>> +++ b/arch/arm/kvm/guest.c
>> @@ -261,6 +261,18 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>>  	return -EINVAL;
>>  }
>>  
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>>  int __attribute_const__ kvm_target_cpu(void)
>>  {
>>  	switch (read_cpuid_part()) {
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 1dab3a9..18f61ff 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -81,6 +81,11 @@ static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
>>  	return (unsigned long *)&vcpu->arch.hcr_el2;
>>  }
>>  
>> +static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
>> +{
>> +	return vcpu->arch.vsesr_el2;
>> +}
>> +
>>  static inline void vcpu_set_vsesr(struct kvm_vcpu *vcpu, u64 vsesr)
>>  {
>>  	vcpu->arch.vsesr_el2 = vsesr;
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 469de8a..357304a 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -335,6 +335,11 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
>>  int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
>>  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>>  int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>>  
>>  #define KVM_ARCH_WANT_MMU_NOTIFIER
>>  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
>> @@ -363,6 +368,8 @@ void handle_exit_early(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>  int kvm_perf_init(void);
>>  int kvm_perf_teardown(void);
>>  
>> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
>> +
>>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
>>  
>>  void __kvm_set_tpidr_el2(u64 tpidr_el2);
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 04b3256..df4faee 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -39,6 +39,7 @@
>>  #define __KVM_HAVE_GUEST_DEBUG
>>  #define __KVM_HAVE_IRQ_LINE
>>  #define __KVM_HAVE_READONLY_MEM
>> +#define __KVM_HAVE_VCPU_EVENTS
>>  
>>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>  
>> @@ -153,6 +154,18 @@ struct kvm_sync_regs {
>>  struct kvm_arch_memory_slot {
>>  };
>>  
>> +/* for KVM_GET/SET_VCPU_EVENTS */
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
>>  #define KVM_REG_ARM_COPROC_SHIFT	16
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 56a0260..4426915 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -289,6 +289,42 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>>  	return -EINVAL;
>>  }
>>  
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	memset(events, 0, sizeof(*events));
>> +
>> +	events->exception.serror_pending = !!(vcpu->arch.hcr_el2 & HCR_VSE);
>> +	events->exception.serror_has_esr =
>> +					cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
> 
> nit: no need to wrap this line so strangely, just keep it on a single
> line (regardless of going slightly over the 80 chars limit).
  Ok, will fix it.
> 
>> +
>> +	if (events->exception.serror_pending &&
>> +		events->exception.serror_has_esr)
> 
> same here
   OK, will fix it.
> 
>> +		events->exception.serror_esr = vcpu_get_vsesr(vcpu);
>> +	else
>> +		events->exception.serror_esr = 0;
>> +
>> +	return 0;
>> +}
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	bool serror_pending = events->exception.serror_pending;
>> +	bool has_esr = events->exception.serror_has_esr;
>> +
>> +	if (serror_pending && has_esr) {
>> +		if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
>> +			return -EINVAL;
>> +
>> +		kvm_set_sei_esr(vcpu, events->exception.serror_esr);
>> +	} else if (serror_pending) {
>> +		kvm_inject_vabt(vcpu);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  int __attribute_const__ kvm_target_cpu(void)
>>  {
>>  	unsigned long implementor = read_cpuid_implementor();
>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>> index d8e7165..a55e91d 100644
>> --- a/arch/arm64/kvm/inject_fault.c
>> +++ b/arch/arm64/kvm/inject_fault.c
>> @@ -164,9 +164,9 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>  		inject_undef64(vcpu);
>>  }
>>  
>> -static void pend_guest_serror(struct kvm_vcpu *vcpu, u64 esr)
>> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 esr)
>>  {
>> -	vcpu_set_vsesr(vcpu, esr);
>> +	vcpu_set_vsesr(vcpu, esr & ESR_ELx_ISS_MASK);
>>  	*vcpu_hcr(vcpu) |= HCR_VSE;
>>  }
>>  
>> @@ -184,5 +184,5 @@ static void pend_guest_serror(struct kvm_vcpu *vcpu, u64 esr)
>>   */
>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>>  {
>> -	pend_guest_serror(vcpu, ESR_ELx_ISV);
>> +	kvm_set_sei_esr(vcpu, ESR_ELx_ISV);
>>  }
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 38c8a64..20e919a 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -82,6 +82,7 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
>>  		break;
>>  	case KVM_CAP_SET_GUEST_DEBUG:
>>  	case KVM_CAP_VCPU_ATTRIBUTES:
>> +	case KVM_CAP_VCPU_EVENTS:
>>  		r = 1;
>>  		break;
>>  	default:
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index a4c1b76..79ecba9 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -1107,6 +1107,25 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>  		r = kvm_arm_vcpu_has_attr(vcpu, &attr);
>>  		break;
>>  	}
>> +	case KVM_GET_VCPU_EVENTS: {
>> +		struct kvm_vcpu_events events;
>> +
>> +		if (kvm_arm_vcpu_get_events(vcpu, &events))
>> +			return -EINVAL;
>> +
>> +		if (copy_to_user(argp, &events, sizeof(events)))
>> +			return -EFAULT;
>> +
>> +		return 0;
>> +	}
>> +	case KVM_SET_VCPU_EVENTS: {
>> +		struct kvm_vcpu_events events;
>> +
>> +		if (copy_from_user(&events, argp, sizeof(events)))
>> +			return -EFAULT;
>> +
>> +		return kvm_arm_vcpu_set_events(vcpu, &events);
>> +	}
>>  	default:
>>  		r = -EINVAL;
>>  	}
>> -- 
>> 2.7.4
>>
> 
> I'll leave it to James to comment on the specifics of the RAS
> interaction, but I think the two patches should be re-ordered, so that
> the capability patch comes last, after the functionality has been
> introduced.
ok, I will reorder them in the next version.

> 
> Otherwise this looks reasonable enough.
> 
> Thanks,
> -Christoffer
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RESEND v4 2/2] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: gengdongjiu @ 2018-06-12 13:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: rkrcmar, corbet, christoffer.dall, linux, catalin.marinas,
	will.deacon, kvm, linux-doc, james.morse, linux-arm-kernel,
	linux-kernel, linux-acpi
In-Reply-To: <86zi04875t.wl-marc.zyngier@arm.com>

Hi Marc
   thanks for the review.

On 2018/6/9 20:40, Marc Zyngier wrote:
> On Fri, 08 Jun 2018 20:48:40 +0100,
> Dongjiu Geng wrote:
>>
>> For the migrating VMs, user space may need to know the exception
>> state. For example, in the machine A, KVM make an SError pending,
>> when migrate to B, KVM also needs to pend an SError.
>>
>> This new IOCTL exports user-invisible states related to SError.
>> Together with appropriate user space changes, user space can get/set
>> the SError exception state to do migrate/snapshot/suspend.
>>
>> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
>> ---
>> change since v3:
>> 1. Fix the memset() issue in the kvm_arm_vcpu_get_events()
>>
>> change since v2:
>> 1. Add kvm_vcpu_events structure definition for arm platform to avoid the build errors.
>>
>> change since v1:
>> Address Marc's comments, thanks Marc's review
>> 1. serror_has_esr always true when ARM64_HAS_RAS_EXTN is set
>> 2. remove Spurious blank line in kvm_arm_vcpu_set_events()
>> 3. rename pend_guest_serror() to kvm_set_sei_esr()
>> 4. Make kvm_arm_vcpu_get_events() did all the work rather than having this split responsibility.
>> 5.  using sizeof(events) instead of sizeof(struct kvm_vcpu_events)
>>
>> this series patch is separated from https://www.spinics.net/lists/kvm/msg168917.html
>> The user space patch is here: https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg06965.html
>>
>> change since V12:
>> 1. change (vcpu->arch.hcr_el2 & HCR_VSE) to !!(vcpu->arch.hcr_el2 & HCR_VSE) in kvm_arm_vcpu_get_events()
>>
>> Change since V11:
>> Address James's comments, thanks James
>> 1. Align the struct of kvm_vcpu_events to 64 bytes
>> 2. Avoid exposing the stale ESR value in the kvm_arm_vcpu_get_events()
>> 3. Change variables 'injected' name to 'serror_pending' in the kvm_arm_vcpu_set_events()
>> 4. Change to sizeof(events) from sizeof(struct kvm_vcpu_events) in kvm_arch_vcpu_ioctl()
>>
>> Change since V10:
>> Address James's comments, thanks James
>> 1. Merge the helper function with the user.
>> 2. Move the ISS_MASK into pend_guest_serror() to clear top bits
>> 3. Make kvm_vcpu_events struct align to 4 bytes
>> 4. Add something check in the kvm_arm_vcpu_set_events()
>> 5. Check kvm_arm_vcpu_get/set_events()'s return value.
>> 6. Initialise kvm_vcpu_events to 0 so that padding transferred to user-space doesn't
>>    contain kernel stack.
>> ---
>>  Documentation/virtual/kvm/api.txt    | 31 ++++++++++++++++++++++++++++---
>>  arch/arm/include/asm/kvm_host.h      |  6 ++++++
>>  arch/arm/include/uapi/asm/kvm.h      | 12 ++++++++++++
>>  arch/arm/kvm/guest.c                 | 12 ++++++++++++
>>  arch/arm64/include/asm/kvm_emulate.h |  5 +++++
>>  arch/arm64/include/asm/kvm_host.h    |  7 +++++++
>>  arch/arm64/include/uapi/asm/kvm.h    | 13 +++++++++++++
>>  arch/arm64/kvm/guest.c               | 36 ++++++++++++++++++++++++++++++++++++
>>  arch/arm64/kvm/inject_fault.c        |  6 +++---
>>  arch/arm64/kvm/reset.c               |  1 +
>>  virt/kvm/arm/arm.c                   | 19 +++++++++++++++++++
>>  11 files changed, 142 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index fdac969..8896737 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -835,11 +835,13 @@ struct kvm_clock_data {
>>  
>>  Capability: KVM_CAP_VCPU_EVENTS
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
>>  Parameters: struct kvm_vcpu_event (out)
>>  Returns: 0 on success, -1 on error
>>  
>> +X86:
>> +
>>  Gets currently pending exceptions, interrupts, and NMIs as well as related
>>  states of the vcpu.
>>  
>> @@ -881,15 +883,32 @@ Only two fields are defined in the flags field:
>>  - KVM_VCPUEVENT_VALID_SMM may be set in the flags field to signal that
>>    smi contains a valid state.
>>  
>> +ARM, ARM64:
>> +
>> +Gets currently pending SError exceptions as well as related states of the vcpu.
>> +
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
>>  4.32 KVM_SET_VCPU_EVENTS
>>  
>> -Capability: KVM_CAP_VCPU_EVENTS
>> +Capebility: KVM_CAP_VCPU_EVENTS
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
>>  Parameters: struct kvm_vcpu_event (in)
>>  Returns: 0 on success, -1 on error
>>  
>> +X86:
>> +
>>  Set pending exceptions, interrupts, and NMIs as well as related states of the
>>  vcpu.
>>  
>> @@ -910,6 +929,12 @@ shall be written into the VCPU.
>>  
>>  KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
>>  
>> +ARM, ARM64:
>> +
>> +Set pending SError exceptions as well as related states of the vcpu.
>> +
>> +See KVM_GET_VCPU_EVENTS for the data structure.
>> +
>>  
>>  4.33 KVM_GET_DEBUGREGS
>>  
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index c7c28c8..39f9901 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -213,6 +213,12 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
>>  int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
>>  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>>  int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>> +
>>  unsigned long kvm_call_hyp(void *hypfn, ...);
>>  void force_vm_exit(const cpumask_t *mask);
>>  
>> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
>> index caae484..c3e6975 100644
>> --- a/arch/arm/include/uapi/asm/kvm.h
>> +++ b/arch/arm/include/uapi/asm/kvm.h
>> @@ -124,6 +124,18 @@ struct kvm_sync_regs {
>>  struct kvm_arch_memory_slot {
>>  };
>>  
>> +/* for KVM_GET/SET_VCPU_EVENTS */
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
>>  #define KVM_REG_ARM_COPROC_SHIFT	16
>> diff --git a/arch/arm/kvm/guest.c b/arch/arm/kvm/guest.c
>> index a18f33e..c685f0e 100644
>> --- a/arch/arm/kvm/guest.c
>> +++ b/arch/arm/kvm/guest.c
>> @@ -261,6 +261,18 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>>  	return -EINVAL;
>>  }
>>  
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>>  int __attribute_const__ kvm_target_cpu(void)
>>  {
>>  	switch (read_cpuid_part()) {
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 1dab3a9..18f61ff 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -81,6 +81,11 @@ static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
>>  	return (unsigned long *)&vcpu->arch.hcr_el2;
>>  }
>>  
>> +static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
>> +{
>> +	return vcpu->arch.vsesr_el2;
>> +}
>> +
>>  static inline void vcpu_set_vsesr(struct kvm_vcpu *vcpu, u64 vsesr)
>>  {
>>  	vcpu->arch.vsesr_el2 = vsesr;
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 469de8a..357304a 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -335,6 +335,11 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
>>  int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
>>  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>>  int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events);
>>  
>>  #define KVM_ARCH_WANT_MMU_NOTIFIER
>>  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
>> @@ -363,6 +368,8 @@ void handle_exit_early(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>  int kvm_perf_init(void);
>>  int kvm_perf_teardown(void);
>>  
>> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
>> +
>>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
>>  
>>  void __kvm_set_tpidr_el2(u64 tpidr_el2);
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 04b3256..df4faee 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -39,6 +39,7 @@
>>  #define __KVM_HAVE_GUEST_DEBUG
>>  #define __KVM_HAVE_IRQ_LINE
>>  #define __KVM_HAVE_READONLY_MEM
>> +#define __KVM_HAVE_VCPU_EVENTS
>>  
>>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>  
>> @@ -153,6 +154,18 @@ struct kvm_sync_regs {
>>  struct kvm_arch_memory_slot {
>>  };
>>  
>> +/* for KVM_GET/SET_VCPU_EVENTS */
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
>>  #define KVM_REG_ARM_COPROC_SHIFT	16
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 56a0260..4426915 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -289,6 +289,42 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>>  	return -EINVAL;
>>  }
>>  
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	memset(events, 0, sizeof(*events));
>> +
>> +	events->exception.serror_pending = !!(vcpu->arch.hcr_el2 & HCR_VSE);
>> +	events->exception.serror_has_esr =
>> +					cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
>> +
>> +	if (events->exception.serror_pending &&
>> +		events->exception.serror_has_esr)
>> +		events->exception.serror_esr = vcpu_get_vsesr(vcpu);
>> +	else
>> +		events->exception.serror_esr = 0;
> 
> Other than the alignment issues that Christoffer already commented on,
> you can perfectly remove the "else" clause altogether (we've just
> zeroed the whole structure).
yes, you are right. it should be moved, thanks for the pointing out.

> 
>> +
>> +	return 0;
>> +}
>> +
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	bool serror_pending = events->exception.serror_pending;
>> +	bool has_esr = events->exception.serror_has_esr;
>> +
>> +	if (serror_pending && has_esr) {
>> +		if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
>> +			return -EINVAL;
>> +
>> +		kvm_set_sei_esr(vcpu, events->exception.serror_esr);
>> +	} else if (serror_pending) {
>> +		kvm_inject_vabt(vcpu);
>> +	}
>> +
>> +	return 0;
> 
> There was an earlier request to check that all the padding is set to
> zero. I still think this makes sense.
previously I think it may be not needed to check that all the padding is set to
zero. so not added it. but now I consider it again, it should be make sense to check that.
so I will added them. thanks for the mention again.

> 
> Thanks,
> 
> 	M.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RESEND v4 2/2] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: gengdongjiu @ 2018-06-12 14:50 UTC (permalink / raw)
  To: James Morse
  Cc: rkrcmar, corbet, christoffer.dall, marc.zyngier, linux,
	catalin.marinas, will.deacon, kvm, linux-doc, linux-arm-kernel,
	linux-kernel, linux-acpi
In-Reply-To: <45e94aae-ed9f-1fb7-f10e-d95c2f969ddd@arm.com>

Hi James,
  thanks for the review.

On 2018/6/11 21:36, James Morse wrote:
> Hi Dongjiu Geng,
> 
> Please only put 'RESEND' in the subject if the patch content is identical.
> This patch is not the same as v4.

Yes, it should

> 
> On 08/06/18 20:48, Dongjiu Geng wrote:
>> For the migrating VMs, user space may need to know the exception
>> state. For example, in the machine A, KVM make an SError pending,
>> when migrate to B, KVM also needs to pend an SError.
>>
>> This new IOCTL exports user-invisible states related to SError.
>> Together with appropriate user space changes, user space can get/set
>> the SError exception state to do migrate/snapshot/suspend.
> 
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index fdac969..8896737 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -835,11 +835,13 @@ struct kvm_clock_data {
>>  
>>  Capability: KVM_CAP_VCPU_EVENTS
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
> 
> Isn't this actually a per-vcpu ioctl? Can we fix the documentation?
  I will modify the original documentation
> 
> 
>>  Parameters: struct kvm_vcpu_event (out)
>>  Returns: 0 on success, -1 on error
>>  
>> +X86:
>> +
>>  Gets currently pending exceptions, interrupts, and NMIs as well as related
>>  states of the vcpu.
>>  
>> @@ -881,15 +883,32 @@ Only two fields are defined in the flags field:
>>  - KVM_VCPUEVENT_VALID_SMM may be set in the flags field to signal that
>>    smi contains a valid state.
>>  
>> +ARM, ARM64:
>> +
>> +Gets currently pending SError exceptions as well as related states of the vcpu.
>> +
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
>>  4.32 KVM_SET_VCPU_EVENTS
>>  
>> -Capability: KVM_CAP_VCPU_EVENTS
>> +Capebility: KVM_CAP_VCPU_EVENTS
> 
> (please fix this)
   Ok, will fix this
> 
> 
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
> 
> (this is also a vcpu ioctl)
  will fix

> 
> 
>>  Parameters: struct kvm_vcpu_event (in)
>>  Returns: 0 on success, -1 on error
>>  
>> +X86:
>> +
>>  Set pending exceptions, interrupts, and NMIs as well as related states of the
>>  vcpu.
>>  
>> @@ -910,6 +929,12 @@ shall be written into the VCPU.
>>  
>>  KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
>>  
>> +ARM, ARM64:
>> +
>> +Set pending SError exceptions as well as related states of the vcpu.
> 
> There are some deliberate choices here I think we should document:
> | This API can't be used to clear a pending SError.
> 
> If there already was an SError pending, this API just overwrites it with the new
> one. The architecture has some rules about merging multiple SError. (details in
> 2.5.3 Multiple SError interrupts of [0])
> 
> I don't think KVM needs to enforce these, as they are implementation-defined if
> one of the ESR is implementation-defined... the part that matters is reporting
> the 'most severe' RAS ESR if there are multiple pending. As only user-space ever
> sets these, let's make it user-spaces problem to do.
> 
> I think we should recommend user-space always reads the pending values and
> applies its merging-multiple-SError logic. (I assume your Qemu patches do this).


 I will check whether QEMU can be possible to do such things, anyway this patch
 not need to do such merging.

> Something like:
> | User-space should first use KVM_GET_VCPU_EVENTS in case KVM has made an SError
> | pending as part of its device emulation. When both values are architected RAS
> | SError ESR values, the new ESR should reflect the combined effect of both
> | errors.>
> 
>> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
>> index caae484..c3e6975 100644
>> --- a/arch/arm/include/uapi/asm/kvm.h
>> +++ b/arch/arm/include/uapi/asm/kvm.h
>> @@ -124,6 +124,18 @@ struct kvm_sync_regs {
>>  struct kvm_arch_memory_slot {
>>  };
>>  
>> +/* for KVM_GET/SET_VCPU_EVENTS */
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 8 bytes */
>> +		__u8 pad[6];
>> +		__u64 serror_esr;
>> +	} exception;
>> +	__u32 reserved[12];
>> +};
>> +
> 
> You haven't defined __KVM_HAVE_VCPU_EVENTS for 32bit, so presumably this struct
> will never be used. Why is it here?
  if not add it for 32 bits. the 32 arm platform will build Fail, whether you have good
   idea to avoid this Failure if not add this struct for the 32 bit?
> (I agree if we ever provide it on 32bit, the struct layout should be the same.
> Is this only here to force that to happen?)
> 
> [...]
> 
> 
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	bool serror_pending = events->exception.serror_pending;
>> +	bool has_esr = events->exception.serror_has_esr;
>> +
>> +	if (serror_pending && has_esr) {
>> +		if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
>> +			return -EINVAL;
>> +
>> +		kvm_set_sei_esr(vcpu, events->exception.serror_esr);
> 
> kvm_set_sei_esr() will silently discard the top 40 bits of serror_esr, (which is
> correct, we shouldn't copy them into hardware without know what they do).
> 
> Could we please force user-space to zero these bits, we can advertise extra CAPs
> if new features turn up in that space, instead of user-space passing <something>
> and relying on the kernel to remove it.

  yes, I can zero these bits in the  user-space and not depend on kernel to remove it.

> 
> (Background: VSESR is a 64bit register that holds the value to go in a 32bit
> register. I suspect the top-half could get re-used for control values or
> something we don't want to give user-space)
  do you mean when user-space get the VSESR value through KVM_GET_VCPU_EVENTS it only return the low-half 32 bits?

> 
> 
>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>> index d8e7165..a55e91d 100644
>> --- a/arch/arm64/kvm/inject_fault.c
>> +++ b/arch/arm64/kvm/inject_fault.c
>> @@ -164,9 +164,9 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>  		inject_undef64(vcpu);
>>  }
>>  
>> -static void pend_guest_serror(struct kvm_vcpu *vcpu, u64 esr)
>> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 esr)
>>  {
>> -	vcpu_set_vsesr(vcpu, esr);
>> +	vcpu_set_vsesr(vcpu, esr & ESR_ELx_ISS_MASK);
>>  	*vcpu_hcr(vcpu) |= HCR_VSE;
>>  }
>>  
> 
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index a4c1b76..79ecba9 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -1107,6 +1107,25 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> 
>> +	case KVM_SET_VCPU_EVENTS: {
>> +		struct kvm_vcpu_events events;
>> +
>> +		if (copy_from_user(&events, argp, sizeof(events)))
>> +			return -EFAULT;
>> +
>> +		return kvm_arm_vcpu_set_events(vcpu, &events);
>> +	}
> 
> Please check the padding[] and reserved[] are zero, otherwise we can't re-use these.
  Ok, thanks
> 
> 
> Thanks,
> 
> James
> 
> [0]
> https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RESEND v4 2/2] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: gengdongjiu @ 2018-06-12 14:53 UTC (permalink / raw)
  To: James Morse, Marc Zyngier
  Cc: rkrcmar, corbet, christoffer.dall, linux, catalin.marinas,
	will.deacon, kvm, linux-doc, linux-arm-kernel, linux-kernel,
	linux-acpi
In-Reply-To: <af93e4b1-80d5-33be-d5ed-312c3ea1d715@arm.com>



On 2018/6/11 21:36, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 09/06/18 13:40, Marc Zyngier wrote:
>> On Fri, 08 Jun 2018 20:48:40 +0100, Dongjiu Geng wrote:
>>> For the migrating VMs, user space may need to know the exception
>>> state. For example, in the machine A, KVM make an SError pending,
>>> when migrate to B, KVM also needs to pend an SError.
>>>
>>> This new IOCTL exports user-invisible states related to SError.
>>> Together with appropriate user space changes, user space can get/set
>>> the SError exception state to do migrate/snapshot/suspend.
> 
>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>>> index 04b3256..df4faee 100644
>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>> @@ -153,6 +154,18 @@ struct kvm_sync_regs {
>>>  struct kvm_arch_memory_slot {
>>>  };
>>>  
>>> +/* for KVM_GET/SET_VCPU_EVENTS */
>>> +struct kvm_vcpu_events {
>>> +	struct {
>>> +		__u8 serror_pending;
>>> +		__u8 serror_has_esr;
>>> +		/* Align it to 8 bytes */
>>> +		__u8 pad[6];
>>> +		__u64 serror_esr;
>>> +	} exception;
>>> +	__u32 reserved[12];
>>> +};
> 
>>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>>> index 56a0260..4426915 100644
>>> --- a/arch/arm64/kvm/guest.c
>>> +++ b/arch/arm64/kvm/guest.c
> 
>>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>>> +			struct kvm_vcpu_events *events)
>>> +{
>>> +	bool serror_pending = events->exception.serror_pending;
>>> +	bool has_esr = events->exception.serror_has_esr;
>>> +
>>> +	if (serror_pending && has_esr) {
>>> +		if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
>>> +			return -EINVAL;
>>> +
>>> +		kvm_set_sei_esr(vcpu, events->exception.serror_esr);
>>> +	} else if (serror_pending) {
>>> +		kvm_inject_vabt(vcpu);
>>> +	}
>>> +
>>> +	return 0;
>>
>> There was an earlier request to check that all the padding is set to
>> zero. I still think this makes sense.
> 
> I agree, not just the exception.padding[], but reserved[] too.
  Ok, thanks for the reminder again.

> 
> 
> Thanks,
> 
> James
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 01/10] x86/cet: User-mode shadow stack support
From: Yu-cheng Yu @ 2018-06-12 15:03 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, linux-doc, linux-mm, linux-arch, x86,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H.J. Lu,
	Vedvyas Shanbhogue, Ravi V. Shankar, Dave Hansen, Andy Lutomirski,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, Mike Kravetz
In-Reply-To: <0e80c181-83b2-457f-a419-01e79f94ca1c@gmail.com>

On Tue, 2018-06-12 at 21:56 +1000, Balbir Singh wrote:
> 
> On 08/06/18 00:37, Yu-cheng Yu wrote:
> > This patch adds basic shadow stack enabling/disabling routines.
> > A task's shadow stack is allocated from memory with VM_SHSTK
> > flag set and read-only protection.  The shadow stack is
> > allocated to a fixed size and that can be changed by the system
> > admin.
> > 
> 
> I presume a read-only permission on the kernel side, but it
> can be written by hardware?

Yes, the shadow stack is written by the processor when a call
instruction is executed.

...

> > 
> > diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> > new file mode 100644
> > index 000000000000..9d5bc1efc9b7
> > --- /dev/null
> > +++ b/arch/x86/include/asm/cet.h
> > @@ -0,0 +1,32 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_CET_H
> > +#define _ASM_X86_CET_H
> > +
> > +#ifndef __ASSEMBLY__
> > +#include <linux/types.h>
> > +
> > +struct task_struct;
> > +/*
> > + * Per-thread CET status
> > + */
> > +struct cet_stat {
> 
> stat sounds like statistics, just expand out to status please

I will make it 'cet_status'.

> > +	unsigned long	shstk_base;
> > +	unsigned long	shstk_size;
> > +	unsigned int	shstk_enabled:1;
> > +};
> > +
> > +#ifdef CONFIG_X86_INTEL_CET
> > +unsigned long cet_get_shstk_ptr(void);
> 
> For the current task? Why does _ptr routine return an unsigned long?

What about cet_get_shstk_addr()?

...

> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index fda2114197b3..428d13828ba9 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -770,4 +770,18 @@
> >  #define MSR_VM_IGNNE                    0xc0010115
> >  #define MSR_VM_HSAVE_PA                 0xc0010117
> >  
> > +/* Control-flow Enforcement Technology MSRs */
> > +#define MSR_IA32_U_CET		0x6a0
> > +#define MSR_IA32_S_CET		0x6a2
> > +#define MSR_IA32_PL0_SSP	0x6a4
> > +#define MSR_IA32_PL3_SSP	0x6a7
> > +#define MSR_IA32_INT_SSP_TAB	0x6a8
> 
> some comments on the purpose of the MSR would be nice

Sure.

...

> 
> I think there was a comment about this being TASK_SIZE_MAX
> 
> > +
> > +	rdmsrl(MSR_IA32_U_CET, r);
> > +	wrmsrl(MSR_IA32_U_CET, r | MSR_IA32_CET_SHSTK_EN);
> > +	wrmsrl(MSR_IA32_PL3_SSP, addr);
> 
> Should the enable happen before setting addr? I would expect to do this in the opposite order.

I will check.

> > +	return 0;
> > +}
> > +
> > +unsigned long cet_get_shstk_ptr(void)
> > +{
> > +	unsigned long ptr;
> > +
> > +	if (!current->thread.cet.shstk_enabled)
> > +		return 0;
> > +
> > +	rdmsrl(MSR_IA32_PL3_SSP, ptr);
> > +	return ptr;
> > +}
> > +
> > +static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	unsigned long populate;
> > +
> > +	down_write(&mm->mmap_sem);
> > +	addr = do_mmap(NULL, addr, len, PROT_READ,
> > +		       MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
> > +		       0, &populate, NULL);
> > +	up_write(&mm->mmap_sem);
> 
> What happens if the mmap fails for any reason? I presume the caller disables shadow stack on this process?

This is from exec(), and that fails.

> > +
> > +	if (populate)
> > +		mm_populate(addr, populate);
> > +
> > +	return addr;
> > +}
> > +
> > +int cet_setup_shstk(void)
> > +{
> > +	unsigned long addr, size;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +		return -EOPNOTSUPP;
> > +
> > +	size = SHSTK_SIZE;
> > +	addr = shstk_mmap(0, size);
> > +
> > +	if (addr >= TASK_SIZE)
> > +		return -ENOMEM;
> > +
> 
> TASK_SIZE_MAX?

Yes.

> 
> > +	cet_set_shstk_ptr(addr + size - sizeof(void *));
> > +	current->thread.cet.shstk_base = addr;
> > +	current->thread.cet.shstk_size = size;
> > +	current->thread.cet.shstk_enabled = 1;
> > +	return 0;
> > +}
> > +
> > +void cet_disable_shstk(void)
> > +{
> > +	u64 r;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +		return;
> > +
> > +	rdmsrl(MSR_IA32_U_CET, r);
> > +	r &= ~(MSR_IA32_CET_SHSTK_EN);
> > +	wrmsrl(MSR_IA32_U_CET, r);
> > +	wrmsrl(MSR_IA32_PL3_SSP, 0);
> 
> Again, I'd expect the order to be the reverse
> 
> > +	current->thread.cet.shstk_enabled = 0;
> > +}
> > +
> > +void cet_disable_free_shstk(struct task_struct *tsk)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> > +	    !tsk->thread.cet.shstk_enabled)
> > +		return;
> > +
> > +	if (tsk == current)
> > +		cet_disable_shstk();
> > +
> > +	/*
> > +	 * Free only when tsk is current or shares mm
> > +	 * with current but has its own shstk.
> > +	 */
> > +	if (tsk->mm && (tsk->mm == current->mm) &&
> > +	    (tsk->thread.cet.shstk_base)) {
> 
> Does the caller hold a reference to tsk->mm?

If (tsk->mm == current->mm), i.e. it is current or it is a pthread of
current, then yes.

Yu-cheng

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Yu-cheng Yu @ 2018-06-12 15:03 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, linux-doc, linux-mm, linux-arch, x86,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H.J. Lu,
	Vedvyas Shanbhogue, Ravi V. Shankar, Dave Hansen, Andy Lutomirski,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, Mike Kravetz
In-Reply-To: <bbfde1b3-5e1b-80e3-30e8-fd1e46a2ceb1@gmail.com>

On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> 
> On 08/06/18 00:37, Yu-cheng Yu wrote:
> > This series introduces CET - Shadow stack
> > 
> > At the high level, shadow stack is:
> > 
> > 	Allocated from a task's address space with vm_flags VM_SHSTK;
> > 	Its PTEs must be read-only and dirty;
> > 	Fixed sized, but the default size can be changed by sys admin.
> > 
> > For a forked child, the shadow stack is duplicated when the next
> > shadow stack access takes place.
> > 
> > For a pthread child, a new shadow stack is allocated.
> > 
> > The signal handler uses the same shadow stack as the main program.
> > 
> 
> Even with sigaltstack()?
> 
> 
> Balbir Singh.

Yes.

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RESEND v4 2/2] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: James Morse @ 2018-06-12 15:29 UTC (permalink / raw)
  To: gengdongjiu
  Cc: rkrcmar, corbet, christoffer.dall, marc.zyngier, linux,
	catalin.marinas, will.deacon, kvm, linux-doc, linux-arm-kernel,
	linux-kernel, linux-acpi
In-Reply-To: <f7f59176-16a1-3b24-5acc-87bf1b4c1490@huawei.com>

Hi gengdongjiu,

On 12/06/18 15:50, gengdongjiu wrote:
> On 2018/6/11 21:36, James Morse wrote:
>> On 08/06/18 20:48, Dongjiu Geng wrote:
>>> For the migrating VMs, user space may need to know the exception
>>> state. For example, in the machine A, KVM make an SError pending,
>>> when migrate to B, KVM also needs to pend an SError.
>>>
>>> This new IOCTL exports user-invisible states related to SError.
>>> Together with appropriate user space changes, user space can get/set
>>> the SError exception state to do migrate/snapshot/suspend.


>>> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
>>> index caae484..c3e6975 100644
>>> --- a/arch/arm/include/uapi/asm/kvm.h
>>> +++ b/arch/arm/include/uapi/asm/kvm.h
>>> @@ -124,6 +124,18 @@ struct kvm_sync_regs {
>>>  struct kvm_arch_memory_slot {
>>>  };
>>>  
>>> +/* for KVM_GET/SET_VCPU_EVENTS */
>>> +struct kvm_vcpu_events {
>>> +	struct {
>>> +		__u8 serror_pending;
>>> +		__u8 serror_has_esr;
>>> +		/* Align it to 8 bytes */
>>> +		__u8 pad[6];
>>> +		__u64 serror_esr;
>>> +	} exception;
>>> +	__u32 reserved[12];
>>> +};
>>> +
>>
>> You haven't defined __KVM_HAVE_VCPU_EVENTS for 32bit, so presumably this struct
>> will never be used. Why is it here?

>   if not add it for 32 bits. the 32 arm platform will build Fail, whether you have good
>    idea to avoid this Failure if not add this struct for the 32 bit?

How does this 32bit code build without this patch?
If do you provide the struct, how will that code build with older headers?

As far as I can see, this is what the __KVM_HAVE_VCPU_EVENTS define is for.

This should be both, or neither. Having just the struct is useless.


>>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>>> +			struct kvm_vcpu_events *events)
>>> +{
>>> +	bool serror_pending = events->exception.serror_pending;
>>> +	bool has_esr = events->exception.serror_has_esr;
>>> +
>>> +	if (serror_pending && has_esr) {
>>> +		if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
>>> +			return -EINVAL;
>>> +
>>> +		kvm_set_sei_esr(vcpu, events->exception.serror_esr);
>>
>> kvm_set_sei_esr() will silently discard the top 40 bits of serror_esr, (which is
>> correct, we shouldn't copy them into hardware without know what they do).
>>
>> Could we please force user-space to zero these bits, we can advertise extra CAPs
>> if new features turn up in that space, instead of user-space passing <something>
>> and relying on the kernel to remove it.
> 
>   yes, I can zero these bits in the  user-space and not depend on kernel to remove it.

But the kernel must check that user-space did zero those bits. Otherwise
user-space may start using them when a future version of the architecture gives
them a meaning, but an older kernel version doesn't know it has to do extra
work, but still lets the bits from user-space through into the hardware.

If new bits do turn up, we can advertise a CAP that says that KVM supports
whatever that feature is.


>> (Background: VSESR is a 64bit register that holds the value to go in a 32bit
>> register. I suspect the top-half could get re-used for control values or
>> something we don't want to give user-space)

>   do you mean when user-space get the VSESR value through KVM_GET_VCPU_EVENTS
> it only return the low-half 32 bits?

No, the kernel will only ever set a 24bit value here. If we force user-space to
only provide a 24bit value then we don't need to check it on read. We never read
the value back from hardware.

These high bits are RES0 at the moment, they may get used for something in the
future. As we are exposing this via a user-space ABI we need to make sure we
only expose the bits we understand today.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RESEND v4 2/2] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: gengdongjiu @ 2018-06-12 15:48 UTC (permalink / raw)
  To: James Morse
  Cc: rkrcmar, corbet, christoffer.dall, marc.zyngier, linux,
	catalin.marinas, will.deacon, kvm, linux-doc, linux-arm-kernel,
	linux-kernel, linux-acpi
In-Reply-To: <6887237f-d252-5b9e-02cb-5a44fef27080@arm.com>



On 2018/6/12 23:29, James Morse wrote:
> Hi gengdongjiu,
> 
> On 12/06/18 15:50, gengdongjiu wrote:
>> On 2018/6/11 21:36, James Morse wrote:
>>> On 08/06/18 20:48, Dongjiu Geng wrote:
>>>> For the migrating VMs, user space may need to know the exception
>>>> state. For example, in the machine A, KVM make an SError pending,
>>>> when migrate to B, KVM also needs to pend an SError.
>>>>
>>>> This new IOCTL exports user-invisible states related to SError.
>>>> Together with appropriate user space changes, user space can get/set
>>>> the SError exception state to do migrate/snapshot/suspend.
> 
> 
>>>> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
>>>> index caae484..c3e6975 100644
>>>> --- a/arch/arm/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm/include/uapi/asm/kvm.h
>>>> @@ -124,6 +124,18 @@ struct kvm_sync_regs {
>>>>  struct kvm_arch_memory_slot {
>>>>  };
>>>>  
>>>> +/* for KVM_GET/SET_VCPU_EVENTS */
>>>> +struct kvm_vcpu_events {
>>>> +	struct {
>>>> +		__u8 serror_pending;
>>>> +		__u8 serror_has_esr;
>>>> +		/* Align it to 8 bytes */
>>>> +		__u8 pad[6];
>>>> +		__u64 serror_esr;
>>>> +	} exception;
>>>> +	__u32 reserved[12];
>>>> +};
>>>> +
>>>
>>> You haven't defined __KVM_HAVE_VCPU_EVENTS for 32bit, so presumably this struct
>>> will never be used. Why is it here?
> 
>>   if not add it for 32 bits. the 32 arm platform will build Fail, whether you have good
>>    idea to avoid this Failure if not add this struct for the 32 bit?
> 
> How does this 32bit code build without this patch?
> If do you provide the struct, how will that code build with older headers?
> 
> As far as I can see, this is what the __KVM_HAVE_VCPU_EVENTS define is for.
> 
> This should be both, or neither. Having just the struct is useless.
It because the caller of kvm_arm_vcpu_get/set_events() is in "virt/kvm/arm/arm.c".
the virt/kvm/arm/arm.c will used by both arm64 and arm.
so It needs to add kvm_arm_vcpu_get/set_events() for the 32 bits, however, kvm_arm_vcpu_get/set_events() will directly return,

I attached the build erros below:
https://lkml.org/lkml/2018/6/4/918
https://lkml.org/lkml/2018/6/4/873


[..]
I will continue check below comments, thanks

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Andy Lutomirski @ 2018-06-12 16:00 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: bsingharora, LKML, linux-doc, Linux-MM, linux-arch, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H. J. Lu,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <1528815820.8271.16.camel@2b52.sc.intel.com>

On Tue, Jun 12, 2018 at 8:06 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> >
> > On 08/06/18 00:37, Yu-cheng Yu wrote:
> > > This series introduces CET - Shadow stack
> > >
> > > At the high level, shadow stack is:
> > >
> > >     Allocated from a task's address space with vm_flags VM_SHSTK;
> > >     Its PTEs must be read-only and dirty;
> > >     Fixed sized, but the default size can be changed by sys admin.
> > >
> > > For a forked child, the shadow stack is duplicated when the next
> > > shadow stack access takes place.
> > >
> > > For a pthread child, a new shadow stack is allocated.
> > >
> > > The signal handler uses the same shadow stack as the main program.
> > >
> >
> > Even with sigaltstack()?
> >
> >
> > Balbir Singh.
>
> Yes.
>

I think we're going to need some provision to add an alternate signal
stack to handle the case where the shadow stack overflows.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: Andy Lutomirski @ 2018-06-12 16:01 UTC (permalink / raw)
  To: H. J. Lu
  Cc: Thomas Gleixner, Andrew Lutomirski, Yu-cheng Yu, LKML, linux-doc,
	Linux-MM, linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CAMe9rOoCiXQ4iVD3j_AHGrvEXtoaVVZVs7H7fCuqNEuuR5j+2Q@mail.gmail.com>

On Tue, Jun 12, 2018 at 4:43 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 12, 2018 at 3:03 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Thu, 7 Jun 2018, H.J. Lu wrote:
> >> On Thu, Jun 7, 2018 at 2:01 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >> > Why is the lockout necessary?  If user code enables CET and tries to
> >> > run code that doesn't support CET, it will crash.  I don't see why we
> >> > need special code in the kernel to prevent a user program from calling
> >> > arch_prctl() and crashing itself.  There are already plenty of ways to
> >> > do that :)
> >>
> >> On CET enabled machine, not all programs nor shared libraries are
> >> CET enabled.  But since ld.so is CET enabled, all programs start
> >> as CET enabled.  ld.so will disable CET if a program or any of its shared
> >> libraries aren't CET enabled.  ld.so will lock up CET once it is done CET
> >> checking so that CET can't no longer be disabled afterwards.
> >
> > That works for stuff which loads all libraries at start time, but what
> > happens if the program uses dlopen() later on? If CET is force locked and
> > the library is not CET enabled, it will fail.
>
> That is to prevent disabling CET by dlopening a legacy shared library.
>
> > I don't see the point of trying to support CET by magic. It adds complexity
> > and you'll never be able to handle all corner cases correctly. dlopen() is
> > not even a corner case.
>
> That is a price we pay for security.  To enable CET, especially shadow
> shack, the program and all of shared libraries it uses should be CET
> enabled.  Most of programs can be enabled with CET by compiling them
> with -fcf-protection.

If you charge too high a price for security, people may turn it off.
I think we're going to need a mode where a program says "I want to use
the CET, but turn it off if I dlopen an unsupported library".  There
are programs that load binary-only plugins.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: H.J. Lu @ 2018-06-12 16:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Yu-cheng Yu, LKML, linux-doc, Linux-MM,
	linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CALCETrXO8R+RQPhJFk4oiA4PF77OgSS2Yro_POXQj1zvdLo61A@mail.gmail.com>

On Tue, Jun 12, 2018 at 9:01 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Tue, Jun 12, 2018 at 4:43 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Tue, Jun 12, 2018 at 3:03 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Thu, 7 Jun 2018, H.J. Lu wrote:
>> >> On Thu, Jun 7, 2018 at 2:01 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> >> > Why is the lockout necessary?  If user code enables CET and tries to
>> >> > run code that doesn't support CET, it will crash.  I don't see why we
>> >> > need special code in the kernel to prevent a user program from calling
>> >> > arch_prctl() and crashing itself.  There are already plenty of ways to
>> >> > do that :)
>> >>
>> >> On CET enabled machine, not all programs nor shared libraries are
>> >> CET enabled.  But since ld.so is CET enabled, all programs start
>> >> as CET enabled.  ld.so will disable CET if a program or any of its shared
>> >> libraries aren't CET enabled.  ld.so will lock up CET once it is done CET
>> >> checking so that CET can't no longer be disabled afterwards.
>> >
>> > That works for stuff which loads all libraries at start time, but what
>> > happens if the program uses dlopen() later on? If CET is force locked and
>> > the library is not CET enabled, it will fail.
>>
>> That is to prevent disabling CET by dlopening a legacy shared library.
>>
>> > I don't see the point of trying to support CET by magic. It adds complexity
>> > and you'll never be able to handle all corner cases correctly. dlopen() is
>> > not even a corner case.
>>
>> That is a price we pay for security.  To enable CET, especially shadow
>> shack, the program and all of shared libraries it uses should be CET
>> enabled.  Most of programs can be enabled with CET by compiling them
>> with -fcf-protection.
>
> If you charge too high a price for security, people may turn it off.
> I think we're going to need a mode where a program says "I want to use
> the CET, but turn it off if I dlopen an unsupported library".  There
> are programs that load binary-only plugins.

You can do

# export GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK

which turns off shadow stack.


-- 
H.J.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Yu-cheng Yu @ 2018-06-12 16:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: bsingharora, LKML, linux-doc, Linux-MM, linux-arch, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H. J. Lu,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CALCETrXK6hypCb5sXwxWRKr=J6_7XtS6s5GB1WPBiqi79q8-8g@mail.gmail.com>

On Tue, 2018-06-12 at 09:00 -0700, Andy Lutomirski wrote:
> On Tue, Jun 12, 2018 at 8:06 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> >
> > On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> > >
> > > On 08/06/18 00:37, Yu-cheng Yu wrote:
> > > > This series introduces CET - Shadow stack
> > > >
> > > > At the high level, shadow stack is:
> > > >
> > > >     Allocated from a task's address space with vm_flags VM_SHSTK;
> > > >     Its PTEs must be read-only and dirty;
> > > >     Fixed sized, but the default size can be changed by sys admin.
> > > >
> > > > For a forked child, the shadow stack is duplicated when the next
> > > > shadow stack access takes place.
> > > >
> > > > For a pthread child, a new shadow stack is allocated.
> > > >
> > > > The signal handler uses the same shadow stack as the main program.
> > > >
> > >
> > > Even with sigaltstack()?
> > >
> > >
> > > Balbir Singh.
> >
> > Yes.
> >
> 
> I think we're going to need some provision to add an alternate signal
> stack to handle the case where the shadow stack overflows.

The shadow stack stores only return addresses; its consumption will not
exceed a percentage of (program stack size + sigaltstack size) before
those overflow.  When that happens, there is usually very little we can
do.  So we set a default shadow stack size that supports certain nested
calls and allow sys admin to adjust it.


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Andy Lutomirski @ 2018-06-12 16:31 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Andrew Lutomirski, bsingharora, LKML, linux-doc, Linux-MM,
	linux-arch, X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	H. J. Lu, Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <1528820489.9324.14.camel@2b52.sc.intel.com>

On Tue, Jun 12, 2018 at 9:24 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> On Tue, 2018-06-12 at 09:00 -0700, Andy Lutomirski wrote:
> > On Tue, Jun 12, 2018 at 8:06 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > >
> > > On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> > > >
> > > > On 08/06/18 00:37, Yu-cheng Yu wrote:
> > > > > This series introduces CET - Shadow stack
> > > > >
> > > > > At the high level, shadow stack is:
> > > > >
> > > > >     Allocated from a task's address space with vm_flags VM_SHSTK;
> > > > >     Its PTEs must be read-only and dirty;
> > > > >     Fixed sized, but the default size can be changed by sys admin.
> > > > >
> > > > > For a forked child, the shadow stack is duplicated when the next
> > > > > shadow stack access takes place.
> > > > >
> > > > > For a pthread child, a new shadow stack is allocated.
> > > > >
> > > > > The signal handler uses the same shadow stack as the main program.
> > > > >
> > > >
> > > > Even with sigaltstack()?
> > > >
> > > >
> > > > Balbir Singh.
> > >
> > > Yes.
> > >
> >
> > I think we're going to need some provision to add an alternate signal
> > stack to handle the case where the shadow stack overflows.
>
> The shadow stack stores only return addresses; its consumption will not
> exceed a percentage of (program stack size + sigaltstack size) before
> those overflow.  When that happens, there is usually very little we can
> do.  So we set a default shadow stack size that supports certain nested
> calls and allow sys admin to adjust it.
>

Of course there's something you can do: add a sigaltstack-like stack
switching mechanism.  Have a reserve shadow stack and, when a signal
is delivered (possibly guarded by other conditions like "did the
shadow stack overflow"), switch to a new shadow stack and maybe write
a special token to the new shadow stack that says "signal delivery
jumped here and will restore to the previous shadow stack and
such-and-such address on return".

Also, I have a couple of other questions after reading the
documentation some more:

1. Why on Earth does INCSSP only take an 8-bit number of frames to
skip?  It seems to me that code that calls setjmp() and then calls
longjmp() while nested more than 256 function call levels will crash.

2. The mnemonic RSTORSSP makes no sense to me.  RSTORSSP is a stack
*switch* operation not a stack *restore* operation, unless I'm
seriously misunderstanding.

3. Is there anything resembling clear documentation of the format of
the shadow stack?  That is, what types of values might be found on the
shadow stack and what do they all mean?

4. Usually Intel doesn't submit upstream Linux patches for ISA
extensions until the ISA is documented for real.  CET does not appear
to be documented for real.  Could Intel kindly release something that
at least claims to be authoritative documentation?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: Andy Lutomirski @ 2018-06-12 16:34 UTC (permalink / raw)
  To: H. J. Lu
  Cc: Andrew Lutomirski, Thomas Gleixner, Yu-cheng Yu, LKML, linux-doc,
	Linux-MM, linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CAMe9rOpLxPussn7gKvn0GgbOB4f5W+DKOGipe_8NMam+Afd+RA@mail.gmail.com>

On Tue, Jun 12, 2018 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 12, 2018 at 9:01 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > On Tue, Jun 12, 2018 at 4:43 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Tue, Jun 12, 2018 at 3:03 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > On Thu, 7 Jun 2018, H.J. Lu wrote:
> >> >> On Thu, Jun 7, 2018 at 2:01 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >> >> > Why is the lockout necessary?  If user code enables CET and tries to
> >> >> > run code that doesn't support CET, it will crash.  I don't see why we
> >> >> > need special code in the kernel to prevent a user program from calling
> >> >> > arch_prctl() and crashing itself.  There are already plenty of ways to
> >> >> > do that :)
> >> >>
> >> >> On CET enabled machine, not all programs nor shared libraries are
> >> >> CET enabled.  But since ld.so is CET enabled, all programs start
> >> >> as CET enabled.  ld.so will disable CET if a program or any of its shared
> >> >> libraries aren't CET enabled.  ld.so will lock up CET once it is done CET
> >> >> checking so that CET can't no longer be disabled afterwards.
> >> >
> >> > That works for stuff which loads all libraries at start time, but what
> >> > happens if the program uses dlopen() later on? If CET is force locked and
> >> > the library is not CET enabled, it will fail.
> >>
> >> That is to prevent disabling CET by dlopening a legacy shared library.
> >>
> >> > I don't see the point of trying to support CET by magic. It adds complexity
> >> > and you'll never be able to handle all corner cases correctly. dlopen() is
> >> > not even a corner case.
> >>
> >> That is a price we pay for security.  To enable CET, especially shadow
> >> shack, the program and all of shared libraries it uses should be CET
> >> enabled.  Most of programs can be enabled with CET by compiling them
> >> with -fcf-protection.
> >
> > If you charge too high a price for security, people may turn it off.
> > I think we're going to need a mode where a program says "I want to use
> > the CET, but turn it off if I dlopen an unsupported library".  There
> > are programs that load binary-only plugins.
>
> You can do
>
> # export GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK
>
> which turns off shadow stack.
>

Which exactly illustrates my point.  By making your security story too
absolute, you'll force people to turn it off when they don't need to.
If I'm using a fully CET-ified distro and I'm using a CET-aware
program that loads binary plugins, and I may or may not have an old
(binary-only, perhaps) plugin that doesn't support CET, then the
behavior I want is for CET to be on until I dlopen() a program that
doesn't support it.  Unless there's some ABI reason why that can't be
done, but I don't think there is.

I'm concerned that the entire concept of locking CET is there to solve
a security problem that doesn't actually exist.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: ethernet: ti: cpsw: add MQPRIO Qdisc offload
From: Andrew Lunn @ 2018-06-12 16:36 UTC (permalink / raw)
  To: Ivan Khoronzhuk
  Cc: grygorii.strashko, davem, corbet, akpm, netdev, linux-doc,
	linux-kernel, linux-omap, vinicius.gomes, henrik,
	jesus.sanchez-palencia, ilias.apalodimas, p-varis, spatton,
	francois.ozog, yogeshs, nsekhar
In-Reply-To: <20180611133047.4818-4-ivan.khoronzhuk@linaro.org>

On Mon, Jun 11, 2018 at 04:30:44PM +0300, Ivan Khoronzhuk wrote:
> That's possible to offload vlan to tc priority mapping with
> assumption sk_prio == L2 prio.
> 
> Example:
> $ ethtool -L eth0 rx 1 tx 4
> 
> $ qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
> 
> $ tc -g class show dev eth0
> +---(100:ffe2) mqprio
> |    +---(100:3) mqprio
> |    +---(100:4) mqprio
> |    
> +---(100:ffe1) mqprio
> |    +---(100:2) mqprio
> |    
> +---(100:ffe0) mqprio
>      +---(100:1) mqprio
> 
> Here, 100:1 is txq0, 100:2 is txq1, 100:3 is txq2, 100:4 is txq3
> txq0 belongs to tc0, txq1 to tc1, txq2 and txq3 to tc2
> The offload part only maps L2 prio to classes of traffic, but not
> to transmit queues, so to direct traffic to traffic class vlan has
> to be created with appropriate egress map.
> 
> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> ---
>  drivers/net/ethernet/ti/cpsw.c | 82 ++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)
> 
> diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
> index 406537d74ec1..fd967d2bce5d 100644
> --- a/drivers/net/ethernet/ti/cpsw.c
> +++ b/drivers/net/ethernet/ti/cpsw.c
> @@ -39,6 +39,7 @@
>  #include <linux/sys_soc.h>
>  
>  #include <linux/pinctrl/consumer.h>
> +#include <net/pkt_cls.h>
>  
>  #include "cpsw.h"
>  #include "cpsw_ale.h"
> @@ -153,6 +154,8 @@ do {								\
>  #define IRQ_NUM			2
>  #define CPSW_MAX_QUEUES		8
>  #define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
> +#define CPSW_TC_NUM			4
> +#define CPSW_FIFO_SHAPERS_NUM		(CPSW_TC_NUM - 1)
>  
>  #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_SHIFT	29
>  #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_MSK		GENMASK(2, 0)
> @@ -453,6 +456,7 @@ struct cpsw_priv {
>  	u8				mac_addr[ETH_ALEN];
>  	bool				rx_pause;
>  	bool				tx_pause;
> +	bool				mqprio_hw;
>  	u32 emac_port;
>  	struct cpsw_common *cpsw;
>  };
> @@ -1577,6 +1581,14 @@ static void cpsw_slave_stop(struct cpsw_slave *slave, struct cpsw_common *cpsw)
>  	soft_reset_slave(slave);
>  }
>  
> +static int cpsw_tc_to_fifo(int tc, int num_tc)
> +{
> +	if (tc == num_tc - 1)
> +		return 0;
> +
> +	return CPSW_FIFO_SHAPERS_NUM - tc;
> +}
> +
>  static int cpsw_ndo_open(struct net_device *ndev)
>  {
>  	struct cpsw_priv *priv = netdev_priv(ndev);
> @@ -2190,6 +2202,75 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
>  	return ret;
>  }
>  
> +static int cpsw_set_tc(struct net_device *ndev, void *type_data)
> +{

Hi Ivan

Maybe this is not the best of names. What if you add support for
another TC qdisc? So you have another case in the switch statement
below?

Maybe call it cpsw_set_mqprio?

> +static int cpsw_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
> +			     void *type_data)
> +{
> +	switch (type) {
> +	case TC_SETUP_QDISC_MQPRIO:
> +		return cpsw_set_tc(ndev, type_data);
> +
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}

  Andrew
  
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: H.J. Lu @ 2018-06-12 16:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Yu-cheng Yu, LKML, linux-doc, Linux-MM,
	linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CALCETrWmGRkQvsUgRaj+j0CP4beKys+TT5aDR5+18nuphwr+Cw@mail.gmail.com>

On Tue, Jun 12, 2018 at 9:34 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Tue, Jun 12, 2018 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Tue, Jun 12, 2018 at 9:01 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> > On Tue, Jun 12, 2018 at 4:43 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> >>
>> >> On Tue, Jun 12, 2018 at 3:03 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> > On Thu, 7 Jun 2018, H.J. Lu wrote:
>> >> >> On Thu, Jun 7, 2018 at 2:01 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> >> >> > Why is the lockout necessary?  If user code enables CET and tries to
>> >> >> > run code that doesn't support CET, it will crash.  I don't see why we
>> >> >> > need special code in the kernel to prevent a user program from calling
>> >> >> > arch_prctl() and crashing itself.  There are already plenty of ways to
>> >> >> > do that :)
>> >> >>
>> >> >> On CET enabled machine, not all programs nor shared libraries are
>> >> >> CET enabled.  But since ld.so is CET enabled, all programs start
>> >> >> as CET enabled.  ld.so will disable CET if a program or any of its shared
>> >> >> libraries aren't CET enabled.  ld.so will lock up CET once it is done CET
>> >> >> checking so that CET can't no longer be disabled afterwards.
>> >> >
>> >> > That works for stuff which loads all libraries at start time, but what
>> >> > happens if the program uses dlopen() later on? If CET is force locked and
>> >> > the library is not CET enabled, it will fail.
>> >>
>> >> That is to prevent disabling CET by dlopening a legacy shared library.
>> >>
>> >> > I don't see the point of trying to support CET by magic. It adds complexity
>> >> > and you'll never be able to handle all corner cases correctly. dlopen() is
>> >> > not even a corner case.
>> >>
>> >> That is a price we pay for security.  To enable CET, especially shadow
>> >> shack, the program and all of shared libraries it uses should be CET
>> >> enabled.  Most of programs can be enabled with CET by compiling them
>> >> with -fcf-protection.
>> >
>> > If you charge too high a price for security, people may turn it off.
>> > I think we're going to need a mode where a program says "I want to use
>> > the CET, but turn it off if I dlopen an unsupported library".  There
>> > are programs that load binary-only plugins.
>>
>> You can do
>>
>> # export GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK
>>
>> which turns off shadow stack.
>>
>
> Which exactly illustrates my point.  By making your security story too
> absolute, you'll force people to turn it off when they don't need to.
> If I'm using a fully CET-ified distro and I'm using a CET-aware
> program that loads binary plugins, and I may or may not have an old
> (binary-only, perhaps) plugin that doesn't support CET, then the
> behavior I want is for CET to be on until I dlopen() a program that
> doesn't support it.  Unless there's some ABI reason why that can't be
> done, but I don't think there is.

We can make it opt-in via GLIBC_TUNABLES.  But by default, the legacy
shared object is disallowed when CET is enabled.

> I'm concerned that the entire concept of locking CET is there to solve
> a security problem that doesn't actually exist.

We don't know that.


-- 
H.J.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Yu-cheng Yu @ 2018-06-12 17:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: bsingharora, LKML, linux-doc, Linux-MM, linux-arch, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H. J. Lu,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CALCETrVOyZz72RuoRB=z_EjFTqqctSLfX30GM+MSEVtbcd=PeQ@mail.gmail.com>

On Tue, 2018-06-12 at 09:31 -0700, Andy Lutomirski wrote:
> On Tue, Jun 12, 2018 at 9:24 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> >
> > On Tue, 2018-06-12 at 09:00 -0700, Andy Lutomirski wrote:
> > > On Tue, Jun 12, 2018 at 8:06 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > >
> > > > On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> > > > >
> > > > > On 08/06/18 00:37, Yu-cheng Yu wrote:
> > > > > > This series introduces CET - Shadow stack
> > > > > >
> > > > > > At the high level, shadow stack is:
> > > > > >
> > > > > >     Allocated from a task's address space with vm_flags VM_SHSTK;
> > > > > >     Its PTEs must be read-only and dirty;
> > > > > >     Fixed sized, but the default size can be changed by sys admin.
> > > > > >
> > > > > > For a forked child, the shadow stack is duplicated when the next
> > > > > > shadow stack access takes place.
> > > > > >
> > > > > > For a pthread child, a new shadow stack is allocated.
> > > > > >
> > > > > > The signal handler uses the same shadow stack as the main program.
> > > > > >
> > > > >
> > > > > Even with sigaltstack()?
> > > > >
> > > > >
> > > > > Balbir Singh.
> > > >
> > > > Yes.
> > > >
> > >
> > > I think we're going to need some provision to add an alternate signal
> > > stack to handle the case where the shadow stack overflows.
> >
> > The shadow stack stores only return addresses; its consumption will not
> > exceed a percentage of (program stack size + sigaltstack size) before
> > those overflow.  When that happens, there is usually very little we can
> > do.  So we set a default shadow stack size that supports certain nested
> > calls and allow sys admin to adjust it.
> >
> 
> Of course there's something you can do: add a sigaltstack-like stack
> switching mechanism.  Have a reserve shadow stack and, when a signal
> is delivered (possibly guarded by other conditions like "did the
> shadow stack overflow"), switch to a new shadow stack and maybe write
> a special token to the new shadow stack that says "signal delivery
> jumped here and will restore to the previous shadow stack and
> such-and-such address on return".

If (shstk size == (stack size + sigaltstack size)), then shstk will not
overflow before program stack overflows and sigaltstack also overflows.

Let me think about this.

> Also, I have a couple of other questions after reading the
> documentation some more:
> 
> 1. Why on Earth does INCSSP only take an 8-bit number of frames to
> skip?  It seems to me that code that calls setjmp() and then calls
> longjmp() while nested more than 256 function call levels will crash.

GLIBC takes care of more than 256 functions calls.

> 2. The mnemonic RSTORSSP makes no sense to me.  RSTORSSP is a stack
> *switch* operation not a stack *restore* operation, unless I'm
> seriously misunderstanding.

The intention is to switch shadow stacks with tokens.  RSTORSSP restores
to a previous shadow stack address from a restore token.

> 3. Is there anything resembling clear documentation of the format of
> the shadow stack?  That is, what types of values might be found on the
> shadow stack and what do they all mean?

Only return addresses and restore tokens can be on a user-mode shadow
stack.  The restore token has the incoming shadow stack address plus one
bit indicating 64/32-bit mode.

I will put this into Documentation/x86/intel_cet.txt.

> 
> 4. Usually Intel doesn't submit upstream Linux patches for ISA
> extensions until the ISA is documented for real.  CET does not appear
> to be documented for real.  Could Intel kindly release something that
> at least claims to be authoritative documentation?
> 
> --Andy


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox